Bug: getting irrelevant closest items when using dtype='bf16' (default) instead of dtype='f16' #505

Open
2 of 3 tasks
alexyalunin opened this issue Oct 14, 2024 · 7 comments

alexyalunin commented Oct 14, 2024

Describe the bug

I just updated the library from 2.12.0 to 2.15.3, and now I'm getting irrelevant closest items.

I expect each anchor item to be the closest match to itself under any metric (I use cos), as in the Expected behavior screenshot. After the update, the anchor item is no longer the closest, and I get a lot of irrelevant items for it (see the screenshot).

It turns out the difference is dtype='bf16', which is the default in the new version. After reverting to dtype='f16', I get the expected behavior.
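For background on the two dtypes, here is a minimal sketch of how much precision each one keeps. It uses only the Python standard library and is not how USearch or SimSIMD convert values: the helper names are hypothetical, and bf16 is emulated by truncating a float32 to its upper 16 bits, whereas real conversions may round to nearest. A precision gap this small would not by itself be expected to push an item out of first place for its own query.

```python
import math
import struct

def to_bf16(x: float) -> float:
    """Emulate bf16 by keeping only the upper 16 bits of the float32 encoding (truncation)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def to_f16(x: float) -> float:
    """Round-trip a value through IEEE half precision (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def cosine_similarity(a, b):
    """Plain cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

a = [0.1, 0.2, 0.3]
b = [0.3, 0.2, 0.1]

exact = cosine_similarity(a, b)
as_bf16 = cosine_similarity([to_bf16(x) for x in a], [to_bf16(x) for x in b])
as_f16 = cosine_similarity([to_f16(x) for x in a], [to_f16(x) for x in b])

print("per-value error, bf16:", abs(to_bf16(0.1) - 0.1))
print("per-value error, f16: ", abs(to_f16(0.1) - 0.1))
print("cosine error, bf16:", abs(as_bf16 - exact))
print("cosine error, f16: ", abs(as_f16 - exact))
```

bf16 keeps 8 exponent bits and 7 explicit mantissa bits, while f16 keeps 5 exponent bits and 10 explicit mantissa bits, so f16 is the more precise of the two for values in its range.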

Screenshot 2024-10-14 at 17 03 45

Steps to reproduce

Here is the code I use

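from usearch.index import Index

# train_matrix is a float32 array of shape (item_count, dimension); train_ids holds the matching item IDs
item_count, dimension = train_matrix.shape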
index = Index(
    ndim=dimension, # Define the number of dimensions in input vectors
    metric='cos', # Choose 'l2sq', 'haversine' or other metric, default = 'ip'
    dtype='bf16', # Set explicitly here to reproduce the problem; 'f16' gives the expected results
    # connectivity=16, # Optional: Limit number of neighbors per graph node
    # expansion_add=128, # Optional: Control the recall of indexing
    # expansion_search=64, # Optional: Control the quality of the search
    # multi=False, # Optional: Allow multiple vectors per key, default = False
)

_ = index.add(list(range(item_count)), train_matrix)

k = 100
res = index.search(train_matrix, count=k+1)  # +1 because same product is closest
train_similars = train_ids[res.keys]
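
To put a number on "the anchor item is not closest" rather than relying on screenshots, a small check like the one below could be used. It is a sketch only: `self_match_rate` is a hypothetical helper, not part of USearch, and it assumes each row of the result keys is ordered from closest to farthest.

```python
def self_match_rate(result_keys, query_keys):
    """Fraction of queries whose top-1 result is the query's own key.

    result_keys: one row of neighbor keys per query, closest first.
    query_keys:  the key under which each query vector was added to the index.
    """
    hits = sum(1 for row, key in zip(result_keys, query_keys) if row[0] == key)
    return hits / len(query_keys)

# Toy example: the second query does not get itself back as the closest item.
toy_results = [[0, 5, 2], [7, 1, 3], [2, 0, 1]]
toy_queries = [0, 1, 2]
print(self_match_rate(toy_results, toy_queries))
```

With the code above, `result_keys` would be `res.keys` and `query_keys` would be `range(item_count)`. A rate well below 1.0 would match the behavior described in this report.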

Expected behavior

Using dtype='f16':
Screenshot 2024-10-14 at 17 05 53

USearch version

2.15.3

Operating System

Ubuntu 20.04.6 LTS

Hardware architecture

x86

Which interface are you using?

Python bindings

Contact Details

[email protected]

Are you open to being tagged as a contributor?

  • I am open to being mentioned in the project .git history as a contributor

Is there an existing issue for this?

  • I have searched the existing issues

Code of Conduct

  • I agree to follow this project's Code of Conduct
@alexyalunin alexyalunin added the bug Something isn't working label Oct 14, 2024
@ashvardanian
Contributor

Hi @alexyalunin! Did you only change the dtype in the constructor or somewhere else?

@ashvardanian
Contributor

Can you please log the type of the input matrix and the hardware_capabilities of the index?

@alexyalunin
Author

train_matrix.dtype
dtype('float32')

I don't know how to log hardware_capabilities, so here is the index itself:

usearch.Index

  • config
    -- data type: ScalarKind.BF16
    -- dimensions: 64
    -- metric: MetricKind.Cos
    -- multi: False
    -- connectivity: 16
    -- expansion on addition :128 candidates
    -- expansion on search: 64 candidates
  • binary
    -- uses OpenMP: 0
    -- uses SimSIMD: 1
    -- supports half-precision: 1
    -- uses hardware acceleration: haswell
  • state
    -- size: 945,655 vectors
    -- memory usage: 406,848,192 bytes
    -- max level: 4
    --- 0. 945,655 nodes
    --- 1. 58,816 nodes
    --- 2. 3,632 nodes
    --- 3. 264 nodes
    --- 4. 24 nodes

@alexyalunin
Author

I didn't change the dtype of the index myself.
In the example, I set dtype to bf16 explicitly to reproduce the problem.

@alexyalunin
Author

I have just rebuilt the index 5 times in a loop and looked at 5 anchor items each time (i.e. 25 examples).

It seems that the problem still exists even with f16; it just occurs less often than with bf16.
It also seems that the problem doesn't exist in the old version (2.12.0), so I guess I will just stay on the old version.
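
One way to compare versions and dtypes without inspecting results by eye is to score the index output against an exact brute-force baseline on a small sample. The sketch below is pure Python with hypothetical helper names. It assumes keys are row positions, as in the reproduction code, and that no vector is all zeros. It is only practical for a small subset, not for the full 945,655 vectors.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def exact_top_k(vectors, query, k):
    """Row positions of the k vectors closest to `query`, found by brute force."""
    order = sorted(range(len(vectors)), key=lambda i: cosine_distance(vectors[i], query))
    return order[:k]

def recall_at_k(approx_keys, exact_keys):
    """Share of the exact neighbors that also appear in the approximate result."""
    exact = set(exact_keys)
    return sum(1 for key in approx_keys if key in exact) / len(exact)

# Toy data: four 2-D vectors, queried with the first one.
sample = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
baseline = exact_top_k(sample, sample[0], k=2)
print(baseline)
print(recall_at_k([0, 2], baseline))
```

Averaging `recall_at_k` over the same anchor items for each rebuild would give one number per version and dtype.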

@ashvardanian
Contributor

Interesting 🤔
Thank you, @alexyalunin! I will look into it!

@ashvardanian ashvardanian self-assigned this Oct 14, 2024
@ashvardanian
Contributor

Found the issue in the underlying SimSIMD library. Investigating. Hope to merge within 24h.
