Notably, XXH3 assumes unaligned access is available, since that has become more or less standard on modern processors. However, it doesn't have to be as slow as it currently is on strict-alignment targets.
With RISC-V gaining traction and NOT having unaligned access, we have a modern 64-bit target that will be running scalar with strict alignment.
One thing that could be clearly optimized is the secret access.
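When we remove SIMD and look at XXH3 at an 8-byte granularity, we can see that the secret access is a FIFO queue: each 64-byte stripe reads eight 8-byte secret words, and the next stripe starts 8 bytes further into the secret, so it drops one word and picks up one new one. We could create a ring buffer of ready-to-use values.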
This could be the secret (heh) to a possibly massive performance increase on strict-alignment scalar targets without an O(n) buffer. At 8-byte granularity the overlapping reads are entirely avoidable, so we get 1/8 the unaligned accesses at the cost of a ring buffer, which could possibly be kept in registers as well (though that would be detrimental to dumber compilers).
This would only make sense on strict-alignment targets, or on big-endian targets with slow byte swaps.
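To make the idea concrete, here is a rough sketch of what I mean, not a patch against xxHash. It compares a baseline loop that re-reads all eight secret words per stripe with a ring-buffer loop that loads only one new word per stripe. The names `read_le64`, `mix_lane`, `accumulate_naive` and `accumulate_ring` are made up for illustration, the per-lane mix is only modeled on the scalar round, and the scramble step and per-block secret reset are left out. It assumes the secret holds at least `8 * (nb_stripes + 7)` bytes.

```c
#include <stdint.h>
#include <stddef.h>

#define LANES 8  /* one 64-byte stripe = 8 lanes of 8 bytes */

/* Hypothetical helper: little-endian 64-bit read that is safe at any
 * alignment. On a strict-alignment target this is the expensive part. */
static uint64_t read_le64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) v = (v << 8) | p[i];
    return v;
}

/* Simplified per-lane mix, modeled on the scalar accumulate round. */
static void mix_lane(uint64_t acc[LANES], size_t lane, uint64_t data, uint64_t key)
{
    uint64_t k = data ^ key;
    acc[lane ^ 1] += data;
    acc[lane] += (k & 0xFFFFFFFFu) * (k >> 32);
}

/* Baseline: stripe n reads secret words n..n+7, so 8 secret loads per stripe. */
static void accumulate_naive(uint64_t acc[LANES], const unsigned char *input,
                             const unsigned char *secret, size_t nb_stripes)
{
    for (size_t n = 0; n < nb_stripes; n++)
        for (size_t i = 0; i < LANES; i++)
            mix_lane(acc, i, read_le64(input + 64 * n + 8 * i),
                     read_le64(secret + 8 * (n + i)));
}

/* Ring buffer: preload 8 words once, then 1 new secret load per stripe. */
static void accumulate_ring(uint64_t acc[LANES], const unsigned char *input,
                            const unsigned char *secret, size_t nb_stripes)
{
    uint64_t ring[LANES];
    if (nb_stripes == 0) return;
    for (size_t i = 0; i < LANES; i++) ring[i] = read_le64(secret + 8 * i);

    for (size_t n = 0; n < nb_stripes; n++) {
        /* Secret word (n + i) lives in slot (n + i) % LANES. */
        for (size_t i = 0; i < LANES; i++)
            mix_lane(acc, i, read_le64(input + 64 * n + 8 * i),
                     ring[(n + i) % LANES]);
        /* Word n is dead now; reuse its slot for word n + 8,
         * but only if another stripe will actually consume it. */
        if (n + 1 < nb_stripes)
            ring[n % LANES] = read_le64(secret + 8 * (n + LANES));
    }
}
```

Both loops should produce identical accumulators for the same input and secret. The `% LANES` indexing is the naive form; unrolling by 8 would turn the slots into fixed names, which is where the "keep it in registers" variant comes from.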
I believe RISC-V implementations are allowed to support unaligned access, so be careful how such code is enabled, so that it doesn't slow down the cores that do support it.
Cyan4973 changed the title from "Optimize scalar secret access" to "Optimize scalar secret access for strict-align cpus" on Dec 27, 2024