x86 sse2/avx2: rewrite loongarch immediate operand shift instruction #1247

HecaiYuan · 2024-12-12T07:54:11Z

The x86 instructions like _mm_srli_epi8() can accept variable or immediate operand as the second parm, however, the corresponding loongarch instructions like __lsx_vsrli_b only accept immediate operand as the second parm, so we need to rewirite them to avoid compilation error.

jinboson · 2024-12-12T12:36:54Z

Looks good to me.

mr-c · 2024-12-12T12:55:37Z

@jinboson @HecaiYuan Thanks. Do you know why the CI didn't catch this already?

jinboson · 2024-12-13T03:09:48Z

@jinboson @HecaiYuan Thanks. Do you know why the CI didn't catch this already?

Hi, @mr-c . In the CI test , for example, when running test for simde_mm_slli_epi16(), the second shift parm passed is immediate value, like below:

test/x86/sse2.c:7866:test_simde_mm_slli_epi16(SIMDE_MUNIT_TEST_ARGS) {
test/x86/sse2.c:7908: simde__m128i r = simde_mm_slli_epi16(test_vec[i].a, 3);
test/x86/sse2.c:7911: r = simde_mm_slli_epi16(test_vec[i].a, 0);
test/x86/sse2.c:7914: r = simde_mm_slli_epi16(test_vec[i].a, 32);
test/x86/sse2.c:7917: r = simde_mm_slli_epi16(test_vec[i].a, 33);

however, we found that in some projects(like ,aom) that user would pass varibales to the second shift parm, like below:

aom_dsp/x86/highbd_loopfilter_sse2.c:35: *blt = _mm_slli_epi16(x, shift);
aom_dsp/x86/highbd_loopfilter_sse2.c:38: *lt = _mm_slli_epi16(x, shift);
aom_dsp/x86/highbd_loopfilter_sse2.c:41: *thr = _mm_slli_epi16(x, shift);
aom_dsp/x86/highbd_loopfilter_sse2.c:59: *blt_out = _mm_slli_epi16(x0, shift);
aom_dsp/x86/highbd_loopfilter_sse2.c:64: *lt_out = _mm_slli_epi16(x0, shift);
aom_dsp/x86/highbd_loopfilter_sse2.c:69: *thr_out = _mm_slli_epi16(x0, shift);
aom_dsp/x86/highbd_loopfilter_sse2.c:198: th = _mm_slli_epi16(th, bd - 8);
aom_dsp/x86/highbd_loopfilter_sse2.c:208: th = _mm_slli_epi16(th, bd - 8);

In this case loongarch implementation for simde_mm_slli_epi16() would emit error at compile time before the PR.

It seems other arch ran into the problem already, see here :　#905 (comment)

jinboson · 2024-12-13T03:27:26Z

simde/x86/sse2.h

@@ -6163,7 +6163,7 @@ simde_mm_sll_epi16 (simde__m128i a, simde__m128i count) {
    #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
      r_.neon_u16 = vshlq_u16(a_.neon_u16, vdupq_n_s16(HEDLEY_STATIC_CAST(int16_t, count_.u64[0])));
    #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
-      r_.lsx_i64 = __lsx_vslli_h(a_.lsx_i64, count_.u64[0]);
+      r_.lsx_i64 = __lsx_vsll_h(a_.lsx_i64, __lsx_vreplgr2vr_h(count_.u64[0]));


Move the implementation to the line 6160, Othewise the codepath will step into #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) branch.

jinboson · 2024-12-13T03:29:12Z

simde/x86/sse2.h

@@ -6199,7 +6199,7 @@ simde_mm_sll_epi32 (simde__m128i a, simde__m128i count) {
    #elif defined(SIMDE_ARM_NEON_A32V7_NATIVE)
      r_.neon_u32 = vshlq_u32(a_.neon_u32, vdupq_n_s32(HEDLEY_STATIC_CAST(int32_t, count_.u64[0])));
    #elif defined(SIMDE_LOONGARCH_LSX_NATIVE)
-      r_.lsx_i64 = __lsx_vslli_w(a_.lsx_i64, count_.u64[0]);
+      r_.lsx_i64 = __lsx_vsll_w(a_.lsx_i64, __lsx_vreplgr2vr_w(count_.u64[0]));


Move the implementation to the line 6196, Othewise the codepath will step into #if defined(SIMDE_VECTOR_SUBSCRIPT_SCALAR) branch.

rosbif · 2024-12-13T10:38:25Z

I haven't looked at SIMDe recently but I feel that I should chip in on this.
The initial premise is false: the Intel s[rl][al]i intrinsics should only accept an immediate constant for the shift count.
This is clear from the Intel Intrinsics Guide and the 'i' in the intrinsic stands for immediate.
It is even clearer from the assembler instruction which should be generated.
The problem is that unfortunately some compilers accept a variable.
This presents a dilemma for SIMDe: a variable shift count should be refused but in this case users will complain that their compiler accepts a variable but SIMDe doesn't.

jinboson · 2024-12-16T01:12:15Z

I haven't looked at SIMDe recently but I feel that I should chip in on this. The initial premise is false: the Intel s[rl][al]i intrinsics should only accept an immediate constant for the shift count. This is clear from the Intel Intrinsics Guide and the 'i' in the intrinsic stands for immediate. It is even clearer from the assembler instruction which should be generated. The problem is that unfortunately some compilers accept a variable. This presents a dilemma for SIMDe: a variable shift count should be refused but in this case users will complain that their compiler accepts a variable but SIMDe doesn't.

That's really a dilemma, thank you for setting me straight.

The x86 instructions like _mm_srli_epi8() can accept variable or immediate operand as the second parm, however, the corresponding loongarch instructions like __lsx_vsrli_b only accept immediate operand as the second parm, so we need to rewirite them to avoid compilation error.

rosbif · 2024-12-16T08:59:02Z

If one really must allow a variable shift count, I can see two possible workarounds.

If the target architecture has a shift instruction with the shift counts in a vector, you can splat the shift count across a vector and use this. Obviously this has a performance penalty if the shift count really is a constant.
Use one of the SIMDE_CONSTIFY macros. They are basically a switch over all reasonable shift counts. If the shift count really is a constant (or an expression evaluating to a constant), any decent compiler will optimize away the switch. Otherwise there will be a big impact on code size and performance.

But, for me, the real question is whether SIMDe should accept the invalid use of an intrinsic.
I think not, particularly in view of the potential performance penalty for those who use the intrinsic correctly.
Otherwise I suggest a macro such as SIMDE_ALLOW_VARIABLE_SHIFT_COUNTS to condition this.
Michael? Evan?

jinboson · 2024-12-19T10:38:14Z

Hi, @mr-c . This is ready for review, then is there anything else needed to move this forward ?

HecaiYuan · 2025-01-02T07:00:17Z

Hi, @mr-c . Regarding this issue, do you have any good suggestions? What should we do next?

HecaiYuan force-pushed the master branch 2 times, most recently from 3afebc4 to 1555b14 Compare December 12, 2024 08:48

jinboson reviewed Dec 13, 2024

View reviewed changes

HecaiYuan force-pushed the master branch from 1555b14 to 28eb4e6 Compare December 16, 2024 01:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

x86 sse2/avx2: rewrite loongarch immediate operand shift instruction #1247

x86 sse2/avx2: rewrite loongarch immediate operand shift instruction #1247

HecaiYuan commented Dec 12, 2024

jinboson commented Dec 12, 2024

mr-c commented Dec 12, 2024

jinboson commented Dec 13, 2024 •

edited

Loading

jinboson Dec 13, 2024

jinboson Dec 13, 2024

rosbif commented Dec 13, 2024 •

edited

Loading

jinboson commented Dec 16, 2024

rosbif commented Dec 16, 2024 •

edited

Loading

jinboson commented Dec 19, 2024

HecaiYuan commented Jan 2, 2025

x86 sse2/avx2: rewrite loongarch immediate operand shift instruction #1247

Are you sure you want to change the base?

x86 sse2/avx2: rewrite loongarch immediate operand shift instruction #1247

Conversation

HecaiYuan commented Dec 12, 2024

jinboson commented Dec 12, 2024

mr-c commented Dec 12, 2024

jinboson commented Dec 13, 2024 • edited Loading

jinboson Dec 13, 2024

Choose a reason for hiding this comment

jinboson Dec 13, 2024

Choose a reason for hiding this comment

rosbif commented Dec 13, 2024 • edited Loading

jinboson commented Dec 16, 2024

rosbif commented Dec 16, 2024 • edited Loading

jinboson commented Dec 19, 2024

HecaiYuan commented Jan 2, 2025

jinboson commented Dec 13, 2024 •

edited

Loading

rosbif commented Dec 13, 2024 •

edited

Loading

rosbif commented Dec 16, 2024 •

edited

Loading