Load/Store
Converting Loads
eve::load supports the conversion decorator so we can load and convert in one go.
SVE supports this with:
svld1ub_* and svld1sb_* for (un)signed char into larger integers
svld1uh_* and svld1sh_* for (un)signed short into larger integers
svld1uw_* and svld1sw_* for (un)signed int into larger integers
E.g. load(char*, as<int>{}) would call svld1sb_s32.
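As a sketch, that path boils down to the following ACLE call (the wrapper name is illustrative, not eve's actual backend entry point):

#include <arm_sve.h>

// Load one signed byte per active 32-bit lane and sign-extend it,
// i.e. load(char*, as<int>{}) in a single instruction.
svint32_t load_char_as_int32_sketch(const int8_t* ptr)
{
  // svptrue_b32(): all-true predicate with one flag per 32-bit lane.
  return svld1sb_s32(svptrue_b32(), ptr);
}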
Deinterleaving load and store
SVE has svld2, svld3, and svld4, which load 2, 3, or 4 vectors' worth of data at once from a single pointer and deinterleave on the fly. Similar intrinsics exist for storing such tuple-like registers. For example:

#include <arm_sve.h>
#include <eve/eve.hpp>
#include <iostream>

using eve::wide;

auto f(float* ptr)
{
  auto v = svld2(eve::detail::sve_true<float>(), ptr);
  return wide<kumi::tuple<float, float>>{ wide<float>(svget2(v, 0))
                                        , wide<float>(svget2(v, 1)) };
}

int main()
{
  float data[] = {1, 2, 3, 4, 10, 20, 30, 40};
  auto t = f(data);
  std::cout << t << "\n";
  std::cout << get<0>(t) << "\n";
  std::cout << get<1>(t) << "\n";
}

This outputs the whole tuple followed by its two deinterleaved halves: get<0>(t) holds the even-indexed elements and get<1>(t) the odd-indexed ones.
We don't currently have an abstraction for that, but I think we should add one to exploit such intrinsics, as loading homogeneous tuple-like types is very common.
Note: ARM NEON has similar intrinsics too, so this is not a one-off.
Arithmetic Operations
SVABD
svabd computes the absolute difference of its arguments, i.e. |a - b|. This should be used for eve::dist instead of the max-min implementation. ONLY FOR INTEGER.
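A minimal sketch of that mapping, assuming an all-true predicate (wrapper name illustrative):

#include <arm_sve.h>

// Integer eve::dist(a, b) == |a - b| maps directly onto a single SABD.
svint32_t dist_via_abd_sketch(svint32_t a, svint32_t b)
{
  // _x form: inactive lanes are "don't care"; the predicate is all-true anyway.
  return svabd_s32_x(svptrue_b32(), a, b);
}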
SVMULH
svmulh computes the high bits of a * b, probably useful for #1501.
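A sketch of the ACLE form, assuming signed 32-bit lanes (wrapper name illustrative):

#include <arm_sve.h>

// Per-lane high half of the widened 32x32 -> 64-bit product.
svint32_t mulh_sketch(svint32_t a, svint32_t b)
{
  return svmulh_s32_x(svptrue_b32(), a, b);
}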
SVNMAD, SVNMLA
Gives us the missing fnma and fnam implementations without resorting to contortions with the other FMA-like operations. ONLY FOR FLOAT.
SVNMSB, SVNMLS
Gives us the missing fnms and fnsm implementations without resorting to contortions with the other FMS-like operations. ONLY FOR FLOAT.
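For reference when wiring these up, a sketch of the ACLE sign conventions as I read them (worth double-checking against the spec; wrapper names are illustrative):

#include <arm_sve.h>

// svnmad(a, b, c) computes -(a * b) - c.
svfloat32_t nmad_sketch(svfloat32_t a, svfloat32_t b, svfloat32_t c)
{
  return svnmad_f32_x(svptrue_b32(), a, b, c);
}

// svnmsb(a, b, c) computes a * b - c.
svfloat32_t nmsb_sketch(svfloat32_t a, svfloat32_t b, svfloat32_t c)
{
  return svnmsb_f32_x(svptrue_b32(), a, b, c);
}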
SVMINNM, SVMAXNM
Doesn't take NaN into account, giving us optimized numeric(min) and numeric(max).
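A sketch with an all-true predicate (wrapper name illustrative):

#include <arm_sve.h>

// NaN-ignoring minimum: where one lane is NaN, the other operand is
// returned, which matches numeric(min) semantics.
svfloat32_t numeric_min_sketch(svfloat32_t a, svfloat32_t b)
{
  return svminnm_f32_x(svptrue_b32(), a, b);
}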
SVMINNMV, SVMAXNMV
Doesn't take NaN into account, giving us optimized numeric(minimum) and numeric(maximum).
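The same idea as a horizontal reduction (wrapper name illustrative):

#include <arm_sve.h>

// NaN-ignoring reduction over all lanes, matching numeric(minimum).
float numeric_minimum_sketch(svfloat32_t v)
{
  return svminnmv_f32(svptrue_b32(), v);
}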
SVSCALE
Gives us optimized ldexp.
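A sketch (wrapper name illustrative):

#include <arm_sve.h>

// ldexp(x, n) == x * 2^n, one FSCALE per vector.
svfloat32_t ldexp_sketch(svfloat32_t x, svint32_t n)
{
  return svscale_f32_x(svptrue_b32(), x, n);
}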
SVCMPUO
Gives us optimized is_unordered and is_ordered.
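A sketch (wrapper name illustrative):

#include <arm_sve.h>

// True wherever a or b is NaN; is_ordered is the logical complement.
svbool_t is_unordered_sketch(svfloat32_t a, svfloat32_t b)
{
  return svcmpuo_f32(svptrue_b32(), a, b);
}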
SVACLT, SVACLE, SVACGT, SVACGE
Used to optimize maxmag and minmag.
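A sketch of maxmag using the absolute compare (tie and NaN handling omitted; wrapper name illustrative):

#include <arm_sve.h>

// maxmag(a, b): select the operand with the larger magnitude using a
// single FACGT instead of two fabs plus an ordinary compare.
svfloat32_t maxmag_sketch(svfloat32_t a, svfloat32_t b)
{
  svbool_t pg = svptrue_b32();
  return svsel_f32(svacgt_f32(pg, a, b), a, b);
}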
SVTSMUL, SVTMAD
Those functions collaborate to compute sin and cos in quarter_circle with full hardware optimisation.
See https://developer.arm.com/documentation/ddi0602/2022-12/SVE-Instructions/FTMAD--Floating-point-trigonometric-multiply-add-coefficient- for details.
SVEXPA
Accelerates exp(x) series.
See https://developer.arm.com/documentation/ddi0602/2022-12/SVE-Instructions/FEXPA--Floating-point-exponential-accelerator- for details.
SVRECPX + SVMULX
Those two combined give us optimized mantissa and frexp.
Bitwise Operations
SVCLS
svcls(x) + 1 gives us optimized eve::countl_zero.
SVREVB, SVREVH, SVREVW
svrevb(0xAABBCCDD) gives 0xDDCCBBAA per lane. The two others do the same per half-word or per word.
Helps for reverse and swap_adjacent_group.
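A sketch of the byte-swap case (wrapper name illustrative):

#include <arm_sve.h>

// Reverse the bytes inside each 32-bit lane: 0xAABBCCDD -> 0xDDCCBBAA.
svuint32_t byteswap32_sketch(svuint32_t v)
{
  return svrevb_u32_x(svptrue_b32(), v);
}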
SVCNOT
svcnot turns integers into 0 if they are not 0, and into 1 if they are 0. Optimizes binarize_not for integers.
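A sketch (wrapper name illustrative):

#include <arm_sve.h>

// Per-lane logical negation: zero lanes become 1, nonzero lanes become 0.
svint32_t binarize_not_sketch(svint32_t v)
{
  return svcnot_s32_x(svptrue_b32(), v);
}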
SVINSR
svinsr(v, s) shifts the elements of v one to the left and inserts the scalar s in the freed slot. Do we have any pattern benefiting from this?
SVANDV, SVEORV, SVORRV
Bitwise operations have a vector-wide reduction form. We don't use those for all, any, or none as we use the optimized count_true implementation.
Do we have any pattern benefiting from this?