[Bug]: Index creation error: ambuildempty: not yet implemented #190

Open
Mickael-van-der-Beek opened this issue Jan 8, 2025 · 4 comments
Labels: bug (Something isn't working), community, pgvectorscale

Comments

Mickael-van-der-Beek commented Jan 8, 2025

What happened?

About 11 hours after I started creating the diskann index, the following error appeared in the PostgreSQL logs: ERROR: ambuildempty: not yet implemented.

The process ID logged with the error matches the index creation process, and the error appears to be raised by the following line of code in pgvectorscale:

panic!("ambuildempty: not yet implemented")

By querying the database, I verified that PostgreSQL marks the index as invalid:

SELECT
  psai.indexrelname,
  pgi.indisvalid
FROM
  pg_stat_all_indexes
    AS psai
INNER JOIN
  pg_index
    AS pgi
    -- join on the index OID so each index matches only its own pg_index row
    ON pgi.indexrelid = psai.indexrelid
WHERE
  psai.relname = 'mytable_embeddings'
;

A query on pg_stat_activity also confirms that the index creation statement is no longer running.
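For reference, the PostgreSQL documentation notes that a CREATE INDEX CONCURRENTLY that fails leaves an invalid index behind, which still adds overhead to writes and should be dropped before retrying. A minimal sketch, assuming the index name from the statement in this report:

```sql
-- Remove the invalid index left behind by the failed concurrent build.
-- IF EXISTS makes the statement a no-op if the index is already gone.
-- Note: DROP INDEX CONCURRENTLY cannot run inside a transaction block.
DROP INDEX CONCURRENTLY IF EXISTS mytable_embeddings_l2_diskann_idx;
```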

My table setup:

  • a mytable_embeddings UNLOGGED table containing about 21M rows
  • each row has an embedding column of type VECTOR(1024) (BERT-type embedding output)
  • both the table and index are created on the same tablespace ssd_02, which is a dedicated tablespace and physical SSD for this test with 3.5 TB free space.

My hardware setup is a bare metal server with the following config:

  • 96 cores, 192 threads
  • 512 GB of RAM
  • 4x SSDs of 3.5 TB

pgvectorscale extension affected

0.7.4

PostgreSQL version used

16.4

What operating system did you use?

Ubuntu 24.04 LTS on AMD x64

What installation method did you use?

Source

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

2025-01-07 18:35:46.573 GMT [postgres/psql/3288723] WARNING:  Inserted ItemPointer { block_number: 296474, offset: 10 } but it became an orphan
2025-01-07 22:28:20.572 GMT [postgres/psql/3288723] WARNING:  Inserted ItemPointer { block_number: 765846, offset: 11 } but it became an orphan
2025-01-08 05:38:00.124 GMT [postgres/psql/3288723] WARNING:  Indexed 21165914 tuples
2025-01-08 05:38:15.706 GMT [postgres/psql/3288723] ERROR:  ambuildempty: not yet implemented
2025-01-08 05:38:15.706 GMT [postgres/psql/3288723] STATEMENT:  CREATE INDEX CONCURRENTLY mytable_embeddings_l2_diskann_idx ON mytable_embeddings USING DISKANN (embedding VECTOR_L2_OPS) WITH (search_list_size = 100) TABLESPACE "ssd_02";

How can we reproduce the bug?

CREATE INDEX CONCURRENTLY
  mytable_embeddings_l2_diskann_idx
ON
  mytable_embeddings
USING
  DISKANN
  (
    embedding VECTOR_L2_OPS
  )
WITH
  (
    search_list_size = 100
  )
TABLESPACE
  "ssd_02"
;

Are you going to work on the bugfix?

🆘 No, could someone else please work on the bugfix?

Collaborator

cevian commented Jan 17, 2025

Is this an unlogged table? The PostgreSQL docs describe ambuildempty as follows:

Build an empty index, and write it to the initialization fork (INIT_FORKNUM) of the given relation. This method is called only for unlogged indexes; the empty index written to the initialization fork will be copied over the main relation fork on each server restart.

If the table is not unlogged, then the error may be caused by CONCURRENTLY, and we should take a look.
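One way to check the persistence of the table, as a minimal sketch: the relpersistence column of pg_class is 'u' for unlogged relations, 'p' for permanent ones, and 't' for temporary ones. The table name is the one from the report.

```sql
-- relpersistence: 'p' = permanent, 'u' = unlogged, 't' = temporary
SELECT
  c.relname,
  c.relpersistence
FROM
  pg_class
    AS c
WHERE
  c.relname = 'mytable_embeddings'
;
```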

Author

Mickael-van-der-Beek commented Jan 20, 2025

@cevian Hello, thanks for your response.

Yes, it is indeed an UNLOGGED table, because I wanted to create this test table without generating too many WAL files.

I'll try to recreate the index without the CONCURRENTLY option and get back to you on that.

Author

Mickael-van-der-Beek commented Jan 21, 2025

@cevian I created a fresh regular table (not temporary, not unlogged) and then ran a regular (non-concurrent) index creation for L2 vector operations.

The index took about 7 to 8 hours to create, and PostgreSQL's pg_index table reports it as valid.

There were two warnings, but the index still seems correct (imo):

2025-01-20 19:02:10.805 GMT [postgres/psql/1935137] WARNING:  Inserted ItemPointer { block_number: 936931, offset: 11 } but it became an orphan
2025-01-21 01:35:13.623 GMT [postgres/psql/1935137] WARNING:  Indexed 22918300 tuples

However, my queries don't seem to use the index:

Query:

EXPLAIN (
  ANALYZE,
  VERBOSE,
  COSTS,
  SETTINGS,
  BUFFERS,
  WAL,
  TIMING,
  FORMAT JSON
)
SELECT
  fpe.idx
FROM
  fact_pages_embedding
    AS fpe
ORDER BY
  (fpe.embedding <-> ARRAY[-0.059306838, ..., -0.013372703]::VECTOR(1024)) DESC
LIMIT
  100000
;

I ran VACUUM ANALYZE on the table multiple times before running the queries.

Do you have any idea why PostgreSQL wouldn't use the index?

Collaborator

cevian commented Jan 21, 2025

@Mickael-van-der-Beek I believe the problem is that you are using DESC in the ORDER BY clause. The index supports ASC, not DESC. Given that <-> is a distance, you probably want the smallest (closest) values, meaning ascending (ASC) order.
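As a sketch of that suggestion, here is the query from the previous comment with only the sort direction changed to ASC. The vector literal is elided exactly as in the original, and the table, column, and LIMIT are unchanged. Whether the planner then picks the index may still depend on cost estimates, which can be checked by prefixing the query with EXPLAIN.

```sql
-- Same query as in the report, ordered by ascending distance so that
-- the nearest vectors come first (the order the index can serve).
SELECT
  fpe.idx
FROM
  fact_pages_embedding
    AS fpe
ORDER BY
  (fpe.embedding <-> ARRAY[-0.059306838, ..., -0.013372703]::VECTOR(1024)) ASC
LIMIT
  100000
;
```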
