Use Memory Segment API for aligned vector loads. #132
Comments
Thanks, Jatin! @tjake did you test MemorySegment vectors instead of float[], or am I thinking of something else?
I did. @jatin-bhateja, look at #90; you can run the JMH yourself.
Thanks for the link, @tjake. I will take a look and get back.
Hi @tjake, I ran the SimilarityBench JMH microbenchmark included with PR #90, with the following modifications, on an Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Ice Lake server).
The results follow, along with the relevant PMU counters. Benchmarking was done on the unmodified jvector-80-native-vectors branch after some minor build fixes. With array-based backing storage, around 78% of vector loads are split across cache lines. The split penalty drops significantly with memory segments: split loads become almost negligible compared to the total number of loads. Throughput improves by around 15%. I will spend more time analyzing NativeVectorizationProvider. Best Regards,
Hey @jatin-bhateja, thanks for taking a look! So it looks like the ValueLayout isn't aligned and allocateDirect is? Am I reading it right?
Yes, JDK 21 introduced a new API, Arena.allocate, to allocate aligned memory segments.
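A minimal sketch of the aligned allocation mentioned above, for illustration only. It assumes JDK 22 or later, where the java.lang.foreign API is final (on JDK 21 it is a preview API and needs --enable-preview). The class name AlignedSegmentSketch, the helper names, and the 64-byte alignment constant are hypothetical and not taken from jvector.

```java
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class AlignedSegmentSketch {
    // Assumption: 64-byte cache lines, matching a 512-bit vector on AVX-512.
    public static final long CACHE_LINE_BYTES = 64;

    // Allocates room for `floats` floats, with the base address aligned to a cache line.
    public static MemorySegment allocateAlignedFloats(Arena arena, int floats) {
        return arena.allocate((long) floats * Float.BYTES, CACHE_LINE_BYTES);
    }

    // True if the segment's base address is a multiple of `alignment`.
    public static boolean isAligned(MemorySegment segment, long alignment) {
        return segment.address() % alignment == 0;
    }

    public static void main(String[] args) {
        int floats = 1024;
        // The confined arena frees the native memory when the try block exits.
        try (Arena arena = Arena.ofConfined()) {
            MemorySegment segment = allocateAlignedFloats(arena, floats);
            for (int i = 0; i < floats; i++) {
                segment.setAtIndex(ValueLayout.JAVA_FLOAT, i, (float) i);
            }
            System.out.println("aligned: " + isAligned(segment, CACHE_LINE_BYTES));
            System.out.println("element 3: " + segment.getAtIndex(ValueLayout.JAVA_FLOAT, 3));
        }
    }
}
```

A segment allocated this way could then back the vector loads in place of a float[], so that the first load in the loop starts on a cache-line boundary.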
Hi @jatin-bhateja, I was able to reproduce the split-load drop with aligned memory, but I don't see a 15% bump. I only see a ~5% improvement over arrays. Any idea why? Also, since these are mostly vector embeddings, 1024 is pretty large. When I run with 128 floats I see a 2% loss over arrays. With 1536 (the OpenAI embedding size) I see an 11% improvement.
Hi @tjake, I will study your implementation in detail and am happy to contribute. Best Regards
Hi All,
Most of the vectorized code in SimdOps.java uses the fromArray API to load data into vectors.
With JDK 20+, the Vector API supports loading and storing vectors from and to MemorySegments.
Using the fromMemorySegment / intoMemorySegment APIs, one can ensure aligned vector loads and stores. Most of the code uses SPECIES_PREFERRED, so the vector size (64 bytes) matches the cache-line size on x86 AVX-512 targets.
Thus, if the first vector load in the vector loop starts at an address that is not a multiple of the cache-line / vector size, every successive vector load will span two cache lines, which may incur a significant performance penalty.
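The arithmetic behind this can be sketched with a small model. It does not measure real hardware; it only counts, for a given base address, how many back-to-back loads would touch two cache lines. The class name SplitLoadCount and the example base offset of 16 bytes are hypothetical, chosen for illustration.

```java
public class SplitLoadCount {
    /**
     * Counts the loads that straddle a cache-line boundary when `loads`
     * consecutive loads of `vectorBytes` bytes start at `baseAddress`.
     */
    public static long countSplitLoads(long baseAddress, int vectorBytes,
                                       int cacheLineBytes, int loads) {
        long splits = 0;
        for (int i = 0; i < loads; i++) {
            long firstByte = baseAddress + (long) i * vectorBytes;
            long lastByte = firstByte + vectorBytes - 1;
            // A load is split if its first and last bytes fall in different cache lines.
            if (firstByte / cacheLineBytes != lastByte / cacheLineBytes) {
                splits++;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        int vectorBytes = 64;    // 512-bit vector, as with SPECIES_PREFERRED on AVX-512
        int cacheLineBytes = 64; // assumed cache-line size
        int loads = 16;          // 16 loads * 64 bytes = 256 floats

        // Base address on a cache-line boundary: no load is split.
        System.out.println(countSplitLoads(0, vectorBytes, cacheLineBytes, loads));  // prints 0
        // Hypothetical base address 16 bytes past a boundary: every load is split.
        System.out.println(countSplitLoads(16, vectorBytes, cacheLineBytes, loads)); // prints 16
    }
}
```

When the vector size equals the cache-line size, the outcome is all or nothing: either the base address is aligned and no load splits, or it is not and every load in the loop splits.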
The following PMU events can be used to count the number of split loads against the total number of memory loads.
Best Regards,
Jatin