Initial WASM support. #242

Draft
wants to merge 1 commit into
base: main

Sero1000

I finally had some free time and started working on WASM SIMD support. Emscripten translates NEON intrinsics to WASM SIMD intrinsics. Not all of the operations are ported yet, but it's a good initial step, I guess.
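To make the translation path concrete, here is a minimal sketch of the kind of kernel it applies to: an f32 dot product written with NEON intrinsics, plus a scalar fallback. `demo_dot_f32` and `DEMO_HAS_NEON` are hypothetical names, not code from this PR. The sketch assumes the toolchain defines `__ARM_NEON` or `__ARM_NEON__` whenever `<arm_neon.h>` is usable (for Emscripten, a build along the lines of `emcc -msimd128 -mfpu=neon`).

```cpp
#include <cassert>
#include <cstddef>

// Assumption: one of these macros is defined when NEON intrinsics are usable,
// either on native Arm or when Emscripten maps them onto WASM SIMD.
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define DEMO_HAS_NEON 1
#else
#define DEMO_HAS_NEON 0
#endif

// Hypothetical helper (not from the PR): dot product of two f32 arrays.
float demo_dot_f32(const float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    std::size_t i = 0;
    float sum = 0.0f;
#if DEMO_HAS_NEON
    // Accumulate four lanes at a time using only basic load/mul/add/store intrinsics.
    float32x4_t acc = vdupq_n_f32(0.0f);
    for (; i + 4 <= n; i += 4)
        acc = vaddq_f32(acc, vmulq_f32(vld1q_f32(a + i), vld1q_f32(b + i)));
    // Horizontal sum through memory to avoid less widely ported intrinsics.
    float lanes[4];
    vst1q_f32(lanes, acc);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    // Scalar tail; also the whole computation when NEON is unavailable.
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}
```

The same source compiles natively on Arm, through Emscripten, or as plain scalar code elsewhere, which is what makes a side-by-side benchmark straightforward.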

@ashvardanian
Owner

Thank you, @Sero1000! Any chance you have performance benchmarks comparing WASM performance to native code? Is there a programmatic API to check if NEON is enabled at runtime?

@Sero1000
Author

I don't think there is a way to check at runtime whether NEON is enabled. At least, I haven't seen one in the documentation. As for the benchmarks, I am looking into bench.cxx. I just wanted to open a PR to get some feedback and a discussion started, since I have touched some parts of the interface.
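Without a runtime query, one option is to decide at compile time and report the result. The sketch below assumes Clang-style feature macros: `__wasm_simd128__` is defined when building with `-msimd128`, and `__ARM_NEON` / `__AVX2__` are the usual native ones. The function names are hypothetical and not part of the library's interface.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>

// Hypothetical helper: names the SIMD path this binary was compiled with.
// This is a compile-time decision; nothing is probed at runtime.
const char* demo_compiled_simd_backend() {
#if defined(__wasm_simd128__)
    return "wasm_simd128";
#elif defined(__ARM_NEON) || defined(__ARM_NEON__)
    return "neon";
#elif defined(__AVX2__)
    return "avx2";
#else
    return "serial";
#endif
}

// True when any SIMD path was compiled in.
bool demo_is_simd_build() {
    return std::strcmp(demo_compiled_simd_backend(), "serial") != 0;
}

// Example use: log the backend once at startup.
void demo_report_backend() {
    const char* name = demo_compiled_simd_backend();
    assert(name != nullptr);  // every branch above returns a string literal
    std::printf("SIMD backend: %s\n", name);
}
```

If a single deployment has to cover engines with and without SIMD, the usual workaround is to ship two builds and let the host page pick one, but that choice happens outside the module.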

@Sero1000
Author

Sero1000 commented Dec 29, 2024

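I ran some benchmarks. The SIMD version is faster than the serial one for every method except hamming_b8 and jaccard_b8, where it is about 11x slower in WASM (8672 ns vs. 805 ns, and 17235 ns vs. 1584 ns).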

-------------------------------------------------------------------------------------------------------------
Benchmark WASM                                            Time             CPU   Iterations UserCounters...
-------------------------------------------------------------------------------------------------------------
dot_f16_neon<1536d>/min_time:10.000/threads:1            3757 ns         3757 ns      3646765 abs_delta=6.36803n bytes=67.9446M/s pairs=266.174k/s relative_error=733.293n
dot_f32_neon<1536d>/min_time:10.000/threads:1             297 ns          297 ns     47534715 abs_delta=7.05499n bytes=303.964M/s pairs=3.37118M/s relative_error=1.49816u
dot_f16c_neon<1536d>/min_time:10.000/threads:1           8669 ns         8669 ns      1620263 abs_delta=6.87199n bytes=194.345M/s pairs=115.347k/s relative_error=913.666n
dot_f32c_neon<1536d>/min_time:10.000/threads:1            597 ns          597 ns     23325223 abs_delta=7.02965n bytes=144.441M/s pairs=1.67616M/s relative_error=1.15408u
cos_f16_neon<1536d>/min_time:10.000/threads:1            4100 ns         4100 ns      3423223 abs_delta=21.6182n bytes=274.464M/s pairs=243.887k/s relative_error=21.6787n
cos_f32_neon<1536d>/min_time:10.000/threads:1             322 ns          322 ns     43503778 abs_delta=7.43692n bytes=142.509M/s pairs=3.1022M/s relative_error=7.51849n
l2sq_f16_neon<1536d>/min_time:10.000/threads:1           3750 ns         3750 ns      3649013 abs_delta=382.132n bytes=69.0376M/s pairs=266.666k/s relative_error=193.164n
l2sq_f32_neon<1536d>/min_time:10.000/threads:1            296 ns          296 ns     47203587 abs_delta=213.066n bytes=15.5219M/s pairs=3.37502M/s relative_error=107.258n
hamming_b8_neon<1536d>/min_time:10.000/threads:1         8672 ns         8672 ns      1621155 abs_delta=0 bytes=48.7387M/s pairs=115.31k/s relative_error=0
jaccard_b8_neon<1536d>/min_time:10.000/threads:1        17235 ns        17235 ns       811441 abs_delta=0 bytes=178.242M/s pairs=58.0215k/s relative_error=0
kl_f32_neon<1536d>/min_time:10.000/threads:1             1800 ns         1800 ns      7801078 abs_delta=nan bytes=97.5784M/s pairs=555.484k/s relative_error=nan
js_f32_neon<1536d>/min_time:10.000/threads:1             2972 ns         2972 ns      4716064 abs_delta=nan bytes=150.995M/s pairs=336.465k/s relative_error=nan
dot_f16_serial<1536d>/min_time:10.000/threads:1          8687 ns         8687 ns      1609608 abs_delta=13.1164n bytes=92.9335M/s pairs=115.111k/s relative_error=1.91143u
dot_f32_serial<1536d>/min_time:10.000/threads:1          1101 ns         1101 ns     12749638 abs_delta=13.9628n bytes=145.953M/s pairs=908.294k/s relative_error=2.21015u
dot_f16c_serial<1536d>/min_time:10.000/threads:1        14876 ns        14876 ns       950814 abs_delta=9.16103n bytes=218.72M/s pairs=67.2219k/s relative_error=1045.37n
dot_f32c_serial<1536d>/min_time:10.000/threads:1         1517 ns         1517 ns      9269163 abs_delta=7.53501n bytes=11.785M/s pairs=659.312k/s relative_error=1034.78n
cos_f16_serial<1536d>/min_time:10.000/threads:1         11239 ns        11239 ns      1263726 abs_delta=28.8175n bytes=244.266M/s pairs=88.9747k/s relative_error=29.0959n
cos_f32_serial<1536d>/min_time:10.000/threads:1          1141 ns         1141 ns     12289692 abs_delta=24.6526n bytes=49.3568M/s pairs=876.712k/s relative_error=24.9056n
l2sq_f16_serial<1536d>/min_time:10.000/threads:1        10551 ns        10551 ns      1316720 abs_delta=1.25749u bytes=273.153M/s pairs=94.7746k/s relative_error=633.407n
l2sq_f32_serial<1536d>/min_time:10.000/threads:1         1120 ns         1120 ns     12474250 abs_delta=873.252n bytes=211.832M/s pairs=892.801k/s relative_error=439.914n
hamming_b8_serial<1536d>/min_time:10.000/threads:1        805 ns          805 ns     17305283 abs_delta=0 bytes=116.465M/s pairs=1.24241M/s relative_error=0
jaccard_b8_serial<1536d>/min_time:10.000/threads:1       1584 ns         1584 ns      8825746 abs_delta=0 bytes=96.074M/s pairs=631.419k/s relative_error=0
kl_f32_serial<1536d>/min_time:10.000/threads:1          21264 ns        21264 ns       657289 abs_delta=nan bytes=270.58M/s pairs=47.0277k/s relative_error=nan
js_f32_serial<1536d>/min_time:10.000/threads:1          33607 ns        33607 ns       417679 abs_delta=nan bytes=59.662M/s pairs=29.7557k/s relative_error=nan
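
To read the WASM table at a glance, the per-kernel speedup is the serial time divided by the SIMD time. The sketch below hard-codes the `Time` column from the table above. `DemoRow` and `demo_speedup` are illustrative names only.

```cpp
#include <cassert>
#include <cstdio>

// One row per kernel: times in ns, copied from the WASM table.
struct DemoRow {
    const char* name;
    double simd_ns;    // *_neon variant (NEON translated to WASM SIMD)
    double serial_ns;  // *_serial variant
};

static const DemoRow demo_wasm_rows[] = {
    {"dot_f16", 3757, 8687},     {"dot_f32", 297, 1101},
    {"dot_f16c", 8669, 14876},   {"dot_f32c", 597, 1517},
    {"cos_f16", 4100, 11239},    {"cos_f32", 322, 1141},
    {"l2sq_f16", 3750, 10551},   {"l2sq_f32", 296, 1120},
    {"hamming_b8", 8672, 805},   {"jaccard_b8", 17235, 1584},
    {"kl_f32", 1800, 21264},     {"js_f32", 2972, 33607},
};
static const int demo_wasm_row_count =
    static_cast<int>(sizeof(demo_wasm_rows) / sizeof(demo_wasm_rows[0]));

// Speedup above 1 means the SIMD path is faster than serial.
double demo_speedup(const DemoRow& row) {
    assert(row.simd_ns > 0.0);
    return row.serial_ns / row.simd_ns;
}

// Prints one line per kernel, e.g. the name followed by its ratio.
void demo_print_speedups() {
    for (int i = 0; i < demo_wasm_row_count; ++i)
        std::printf("%-12s %6.2fx\n", demo_wasm_rows[i].name,
                    demo_speedup(demo_wasm_rows[i]));
}
```

Only the two binary kernels come out below 1, so they are the natural place to look for an intrinsic that is being emulated rather than mapped directly.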

--------------------------------------------------------------------------------------------------------------
Benchmark Native(Haswell)                                  Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------
dot_f16_haswell<1536d>/min_time:10.000/threads:1           239 ns          239 ns     57788068 abs_delta=4.49447n bytes=25.7194G/s pairs=4.18609M/s relative_error=838.368n
dot_f32_haswell<1536d>/min_time:10.000/threads:1           229 ns          229 ns     61695066 abs_delta=4.05283n bytes=53.7205G/s pairs=4.37178M/s relative_error=618.113n
dot_f16c_haswell<1536d>/min_time:10.000/threads:1          485 ns          485 ns     29233506 abs_delta=5.26774n bytes=25.3399G/s pairs=2.06217M/s relative_error=1053.19n
dot_f32c_haswell<1536d>/min_time:10.000/threads:1          479 ns          479 ns     29155417 abs_delta=5.35522n bytes=51.3474G/s pairs=2.08933M/s relative_error=914.997n
cos_f16_haswell<1536d>/min_time:10.000/threads:1           248 ns          248 ns     56688499 abs_delta=21.2755n bytes=24.7348G/s pairs=4.02585M/s relative_error=21.336n
cos_f32_haswell<1536d>/min_time:10.000/threads:1           245 ns          245 ns     57090484 abs_delta=4.06642n bytes=50.2399G/s pairs=4.08853M/s relative_error=4.10406n
l2sq_f16_haswell<1536d>/min_time:10.000/threads:1          243 ns          243 ns     58014704 abs_delta=306.799n bytes=25.3349G/s pairs=4.12353M/s relative_error=154.647n
l2sq_f32_haswell<1536d>/min_time:10.000/threads:1          232 ns          232 ns     60353965 abs_delta=110.947n bytes=53.0122G/s pairs=4.31415M/s relative_error=56.0062n
hamming_b8_haswell<1536d>/min_time:10.000/threads:1        103 ns          103 ns    137056450 abs_delta=0 bytes=29.7774G/s pairs=9.69315M/s relative_error=0
jaccard_b8_haswell<1536d>/min_time:10.000/threads:1        141 ns          141 ns     99841876 abs_delta=0 bytes=21.8257G/s pairs=7.10474M/s relative_error=0
dot_f16_serial<1536d>/min_time:10.000/threads:1           6924 ns         6922 ns      2023513 abs_delta=12.4463n bytes=887.546M/s pairs=144.457k/s relative_error=1.70755u
dot_f32_serial<1536d>/min_time:10.000/threads:1           1063 ns         1063 ns     13210677 abs_delta=14.2338n bytes=11.5633G/s pairs=941.021k/s relative_error=2.37246u
dot_f16c_serial<1536d>/min_time:10.000/threads:1         14483 ns        14480 ns       972223 abs_delta=9.16103n bytes=848.606M/s pairs=69.0598k/s relative_error=1045.37n
dot_f32c_serial<1536d>/min_time:10.000/threads:1          2303 ns         2303 ns      6104887 abs_delta=6.96473n bytes=10.6735G/s pairs=434.304k/s relative_error=942.332n
cos_f16_serial<1536d>/min_time:10.000/threads:1           7174 ns         7174 ns      1940213 abs_delta=30.3359n bytes=856.47M/s pairs=139.399k/s relative_error=30.6709n
cos_f32_serial<1536d>/min_time:10.000/threads:1           1110 ns         1110 ns     12660383 abs_delta=25.3824n bytes=11.0713G/s pairs=900.988k/s relative_error=25.6467n
l2sq_f16_serial<1536d>/min_time:10.000/threads:1          7130 ns         7128 ns      1969190 abs_delta=1.23103u bytes=861.91M/s pairs=140.285k/s relative_error=620.233n
l2sq_f32_serial<1536d>/min_time:10.000/threads:1          1068 ns         1068 ns     13112453 abs_delta=876.925n bytes=11.5097G/s pairs=936.663k/s relative_error=441.684n
hamming_b8_serial<1536d>/min_time:10.000/threads:1         733 ns          733 ns     18818066 abs_delta=0 bytes=4.18988G/s pairs=1.36389M/s relative_error=0
jaccard_b8_serial<1536d>/min_time:10.000/threads:1        1175 ns         1175 ns     11951942 abs_delta=0 bytes=2.61374G/s pairs=850.828k/s relative_error=0
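
A note on comparing the `bytes` counters across the two tables. In the native table they match `pairs/s × 2 vectors × 1536 elements × element size`, while the WASM figures do not. The sketch below checks that relation and also tries one possible explanation: the byte total wrapping at 32 bits, as it would if it were computed in a 32-bit `size_t` on wasm32. This is an inference from the numbers, not something confirmed in this thread, and both function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Expected byte rate if every pair reads two vectors of `dims` elements.
double demo_expected_bytes_per_s(double pairs_per_s, std::uint64_t dims,
                                 std::uint64_t elem_bytes) {
    return pairs_per_s * 2.0 * static_cast<double>(dims * elem_bytes);
}

// Byte rate when the total (iterations * bytes_per_pair) is truncated to
// 32 bits before being divided by the elapsed time. Assumption: this mimics
// counter arithmetic done in a 32-bit integer type.
double demo_wrapped_bytes_per_s(std::uint64_t iterations,
                                std::uint64_t bytes_per_pair,
                                double ns_per_iter) {
    assert(ns_per_iter > 0.0);
    std::uint32_t wrapped =
        static_cast<std::uint32_t>(iterations * bytes_per_pair);  // mod 2^32
    double seconds = static_cast<double>(iterations) * ns_per_iter * 1e-9;
    return static_cast<double>(wrapped) / seconds;
}
```

For example, `demo_expected_bytes_per_s(4.18609e6, 1536, 2)` lands on the native `dot_f16_haswell` figure of about 25.7 GB/s, and `demo_wrapped_bytes_per_s(1621155, 2 * 1536, 8672.0)` lands close to the 48.7 MB/s reported for `hamming_b8_neon` in WASM. Until that is checked in bench.cxx, the `Time` and `pairs` columns are the safer basis for the WASM-to-native comparison.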
