Feature request: p256 avx2 affine point table selection #148

ctz · 2024-09-26T12:45:38Z

Hello,

I was hoping you might consider offering a function like this. This is mainly for discussion rather than a real contribution; notably, I have not written any proof for this or even added it to any makefiles (but I'm happy to do both those things, though would certainly need help with the proof!). It is effectively a specialisation of bignum_copy_row_from_table using AVX2 to improve throughput.

Background: I've been building upon s2n-bignum to make a new crypto library in Rust. For p256 base point multiplication I'm using the same approach as BoringSSL/AWS-LC which work through the exponent in 7-bit chunks, in wNAF form, and have a table of 37x64 multiples of the curve generator, at a cost of ~148KB of data. (Most of this predates the new scalarmul functions; I haven't measured these to see how they compare yet.)

I was using bignum_copy_row_from_table for this, but identified that it was relatively slow for such a large amount of data.

Compared to using bignum_copy_row_from_table the performance improvement looks like:

p256-keygen/graviola    time:   [8.2675 µs 8.2902 µs 8.3212 µs]
                        thrpt:  [120.18 Kelem/s 120.62 Kelem/s 120.96 Kelem/s]
                 change:
                        time:   [-39.295% -38.577% -37.704%] (p = 0.00 < 0.05)
                        thrpt:  [+60.524% +62.805% +64.730%]
                        Performance has improved.

For comparison:

p256-keygen/aws-lc-rs   time:   [8.3274 µs 8.3402 µs 8.3556 µs]
                        thrpt:  [119.68 Kelem/s 119.90 Kelem/s 120.09 Kelem/s]

(BoringSSL/AWS-LC has a similar, but more aggressive version; see ecp_nistz256_select_w7, this implementation is not derived from that one.)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

jargh · 2024-09-28T06:12:57Z

Thank you, this is indeed a very good idea. On ARM we already make use of the NEON instructions here, and this makes it clear we can substantially improve performance of table selection on x86 using AVX2. As well as the general bignum_copy_row_from_table function, there are now a lot of table selections embedded in the scalar multiplication functions, which we should also be able to speed up using this sort of code, and it would be even more important for the kind of large-table basepoint multiplication you are implementing.

The only thing to discourage us from adopting this right away is that we don't yet have semantics or instruction decoding set up for AVX2 in order to tackle the proof, so there will be a one-time effort of doing so. Still, I don't think it should be too bad, so we'll look into it very soon :-)

jimgrundy · 2024-11-17T17:22:15Z

Would it make sense to unlock this by inviting Joe to extend this by also contributing a model of the necessary AVX2 instructions? Our ambition going forward is to encourage more contributions from cryptographers. Perhaps Joe might comment on what documentation or examples would help him make such a contribution if he were interested?

jargh · 2024-11-19T19:32:17Z

That could make a lot of sense if Joe is motivated to try it. But there is some one-time enabling for AVX2 that we (June and I) should probably do first, since it isn't very easy for someone new to the code to follow all the decoding setup. We have that pretty much ready to go - we wanted AVX2 both for this type of constant-time selection and also for current and future work involving ML-KEM. Once that is done, I hope it could be tractable for others to build up the model without too much pain. To get a picture of what work is involved, this was the last change to the x86 model:
7cf2fa4
while this was a recent change to ARM, which is for some SIMD instructions and hence relevant:
d32508b

ctz · 2024-11-19T19:56:07Z

Yeah I'd be also interested in contributing these, which are in a similar vein to existing code:

That is probably an easier contribution to start with, as they're just variations on existing code/proofs.

(I would probably begin with avoiding the 1-based indexing in my calling code, whereby bignum_aff_point_select_p256 would fall away for the existing bignum_copy_row_from_table_8n_neon, and the others would be, i guess, bignum_copy_row_from_table_12n_neon and bignum_copy_row_from_table_18n_neon.)

jargh · 2024-11-24T19:27:03Z

Finally this PR should do the basic AVX2 enabling when approved/fixed up/merged: #164 I somewhat arbitrarily chose the first instruction in your code (VPXOR) as the single example to enable, but most of this is about setting things up in general, so adding new instructions should now be much easier. One remaining thing to do, at least before actually making use of AVX2 instructions, is to update the simulator (x86/proofs/simulator.* etc.) to deal with the AVX2 state, but that shouldn't be too hard.

p256: add avx2 affine point table selection

e07f73f

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature request: p256 avx2 affine point table selection #148

Feature request: p256 avx2 affine point table selection #148

ctz commented Sep 26, 2024

jargh commented Sep 28, 2024

jimgrundy commented Nov 17, 2024

jargh commented Nov 19, 2024

ctz commented Nov 19, 2024

jargh commented Nov 24, 2024

Feature request: p256 avx2 affine point table selection #148

Are you sure you want to change the base?

Feature request: p256 avx2 affine point table selection #148

Conversation

ctz commented Sep 26, 2024

jargh commented Sep 28, 2024

jimgrundy commented Nov 17, 2024

jargh commented Nov 19, 2024

ctz commented Nov 19, 2024

jargh commented Nov 24, 2024