Optimize scalar secret access for strict-align cpus #870

Open
easyaspi314 opened this issue Jul 17, 2023 · 1 comment

Comments

@easyaspi314
Contributor

easyaspi314 commented Jul 17, 2023

Notably, XXH3 assumes unaligned access is available, since that has become more or less standard on modern processors. However, performance on strict-alignment targets doesn't have to be as bad as it currently is.

With RISC-V gaining traction and NOT having unaligned access, we have a modern 64-bit target that will be running scalar with strict alignment.

One thing that could be clearly optimized is the secret access.

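When we remove SIMD and look at XXH3 at 8-byte granularity, the secret access turns out to be a FIFO queue: each stripe's secret window starts 8 bytes after the previous one, so consecutive stripes share all but one 64-bit word. (The diagram below is half scale.)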

| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
[   stripe 1    ]
    [    stripe 2   ]
        [    stripe 3   ]
            [    stripe 4   ]
                [    stripe 5   ]
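To put rough numbers on the overlap in the diagram, here is a minimal sketch. It assumes XXH3's usual layout (64-byte stripes, with the secret window advancing 8 bytes per stripe); the constant and function names are made up for illustration and are not xxHash identifiers.

```c
#include <stddef.h>

/* Assumed layout: a stripe consumes eight 64-bit secret words, and the
 * window slides forward by one word per stripe. */
enum { WORDS_PER_STRIPE = 8, STEP_WORDS = 1 };

/* Index of the first 64-bit secret word used by a given stripe (0-based). */
static size_t first_secret_word(size_t stripe)
{
    return stripe * STEP_WORDS;
}

/* Secret words that two consecutive stripes have in common. */
static size_t shared_words(void)
{
    return WORDS_PER_STRIPE - STEP_WORDS;
}

/* 64-bit secret reads if every stripe re-reads its whole window. */
static size_t naive_reads(size_t nb_stripes)
{
    return nb_stripes * WORDS_PER_STRIPE;
}

/* Distinct secret words actually touched across nb_stripes stripes. */
static size_t distinct_words(size_t nb_stripes)
{
    return nb_stripes ? nb_stripes * STEP_WORDS + shared_words() : 0;
}
```

For a 16-stripe block (what a 192-byte secret gives under these assumptions), `naive_reads(16)` is 128 while `distinct_words(16)` is 23, so most of the secret reads are repeats.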

We could create a ring buffer of ready-to-use values:

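XXH3_accumulate(..., xsecret)
{
    u64 secret_buffer[8];      // ring buffer of already-decoded secret words
    int secret_buffer_pos = 0; // slot holding the current stripe's first word
    // Preload the first 7 words; the 8th is read inside the loop.
    for (i = 0; i < 7; i++) {
        secret_buffer[i] = XXH_readLE64(xsecret + 8 * i);
    }
    for each stripe {
        // Only one new secret word per stripe; it overwrites the oldest slot.
        secret_buffer[(secret_buffer_pos + 7) % 8] = XXH_readLE64(xsecret + ...);
        XXH3_accumulate_512(...);
        secret_buffer_pos = (secret_buffer_pos + 1) % 8;
    }
}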
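A compilable sketch of the same idea, to check that the ring buffer hands each stripe the same eight secret words as re-reading the window. Everything here is a stand-in: `read_le64`, `consume_stripe`, `run_naive` and `run_ring` are hypothetical helpers, the mixing step is arbitrary and is not the real `XXH3_accumulate_512`, and the offset `8 * (n + 7)` assumes the window advances 8 bytes per stripe.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_WORDS 8 /* a 64-byte stripe uses eight 64-bit secret words */

/* Alignment-safe little-endian 64-bit read, one byte at a time. */
static uint64_t read_le64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--) v = (v << 8) | p[i];
    return v;
}

/* Placeholder for the per-stripe work (NOT the real accumulate step):
 * folds the eight secret words in lane order, starting at slot `pos`. */
static uint64_t consume_stripe(uint64_t acc, const uint64_t *ring, unsigned pos)
{
    for (unsigned lane = 0; lane < RING_WORDS; lane++) {
        uint64_t k = ring[(pos + lane) % RING_WORDS];
        acc = (acc ^ k) * 0x9E3779B185EBCA87ULL + lane; /* arbitrary mixing */
    }
    return acc;
}

/* Reference: every stripe re-reads its full 8-word window. */
static uint64_t run_naive(const uint8_t *secret, size_t nb_stripes)
{
    uint64_t acc = 0;
    for (size_t n = 0; n < nb_stripes; n++) {
        uint64_t window[RING_WORDS];
        for (unsigned lane = 0; lane < RING_WORDS; lane++)
            window[lane] = read_le64(secret + 8 * (n + lane));
        acc = consume_stripe(acc, window, 0);
    }
    return acc;
}

/* Ring buffer: preload 7 words, then read one new word per stripe.
 * `secret` must hold at least 8 * (nb_stripes + 7) bytes. */
static uint64_t run_ring(const uint8_t *secret, size_t nb_stripes, size_t *nb_reads)
{
    uint64_t ring[RING_WORDS];
    uint64_t acc = 0;
    unsigned pos = 0;
    size_t reads = 0;

    for (unsigned i = 0; i < RING_WORDS - 1; i++) {
        ring[i] = read_le64(secret + 8 * i);
        reads++;
    }
    for (size_t n = 0; n < nb_stripes; n++) {
        /* Word n + 7 replaces word n - 1, the oldest entry. */
        ring[(pos + RING_WORDS - 1) % RING_WORDS] = read_le64(secret + 8 * (n + 7));
        reads++;
        acc = consume_stripe(acc, ring, pos);
        pos = (pos + 1) % RING_WORDS;
    }
    *nb_reads = reads;
    return acc;
}
```

With a 192-byte secret and 16 stripes, `run_ring` should return the same value as `run_naive` while doing 23 secret reads instead of 128. The invariant is that secret word `w` always lives in slot `w % 8`, so lane `l` of stripe `n` finds its word at `(pos + l) % 8`.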

This could be the secret (heh) to a potentially massive performance increase for scalar code on strict-alignment targets, without needing an O(n) buffer. At 8-byte granularity the overlapping reads are entirely avoidable, so we do 1/8 as many unaligned accesses. The cost is a ring buffer, which could possibly be kept in registers as well (though that would be detrimental with dumber compilers).

This would only make sense on strict-alignment targets, or on big-endian targets with slow byte swaps.

@sh1boot

sh1boot commented Sep 30, 2023

I believe RISC-V implementations are allowed to support unaligned access, so be careful about how such code is enabled, so as not to slow down the cores that do support it.
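One way to be careful here is to gate the ring buffer at compile time and let users override it. The sketch below is only an assumption about how that could look: `XXH3_SECRET_RING` and `xxh3_uses_secret_ring` are hypothetical names, not existing xxHash options. `__ARM_FEATURE_UNALIGNED` is the ACLE macro for unaligned-access support; `__riscv_misaligned_fast` is, as far as I know, defined by recent GCC and Clang per the RISC-V C API, and older compilers won't define it, so they would land on the conservative branch.

```c
/* Hypothetical build switch: 1 = use the secret ring buffer, 0 = plain reads.
 * Defining it beforehand (e.g. -DXXH3_SECRET_RING=0) overrides the guess. */
#ifndef XXH3_SECRET_RING
#  if defined(__riscv_misaligned_fast) || defined(__ARM_FEATURE_UNALIGNED) \
   || defined(__x86_64__) || defined(__i386__) \
   || defined(_M_X64) || defined(_M_IX86)
     /* The target advertises cheap unaligned access: keep the current path. */
#    define XXH3_SECRET_RING 0
#  elif defined(__riscv)
     /* RISC-V without a "fast misaligned" hint: assume unaligned is slow. */
#    define XXH3_SECRET_RING 1
#  else
     /* Unknown target: stay on the existing code path. */
#    define XXH3_SECRET_RING 0
#  endif
#endif

/* Lets callers (or tests) see which path was selected at compile time. */
static int xxh3_uses_secret_ring(void)
{
    return XXH3_SECRET_RING;
}
```

This only reflects what the compiler was told about the target. A core that handles unaligned access quickly but is built with generic flags would still get the ring buffer, which is why the manual override matters.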

@Cyan4973 Cyan4973 changed the title Optimize scalar secret access Optimize scalar secret access for strict-align cpus Dec 27, 2024