Load/Store
Converting Loads
eve::load supports the conversion decorator so we can load and convert in one go.
SVE supports this with:
svld1ub_* and svld1sb_* for (un)signed char into larger integers
svld1uh_* and svld1sh_* for (un)signed short into larger integers
svld1uw_* and svld1sw_* for (un)signed int into larger integers
E.g. load(char*, as<int>{}) would call svld1sb_s32.
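As a sketch, that path boils down to the following ACLE call (the wrapper name is illustrative, not eve's actual backend entry point):

#include <arm_sve.h>

// Load one signed byte per active 32-bit lane and sign-extend it,
// i.e. load(char*, as<int>{}) in a single instruction.
svint32_t load_char_as_int32_sketch(const int8_t* ptr)
{
  // svptrue_b32(): all-true predicate with one flag per 32-bit lane.
  return svld1sb_s32(svptrue_b32(), ptr);
}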
Deinterleaving load and store
SVE has svld2, svld3, and svld4, which load 2, 3, or 4 vectors' worth of data at once from a single pointer and deinterleave on the fly. Similar intrinsics exist for storing such tuple-like registers. For example:

#include <arm_sve.h>
#include <eve/eve.hpp>
#include <iostream>

using eve::wide;

auto f(float* ptr)
{
  auto v = svld2(eve::detail::sve_true<float>(), ptr);
  return wide<kumi::tuple<float, float>>{ wide<float>(svget2(v, 0))
                                        , wide<float>(svget2(v, 1)) };
}

int main()
{
  float data[] = {1, 2, 3, 4, 10, 20, 30, 40};
  auto t = f(data);
  std::cout << t << "\n";
  std::cout << get<0>(t) << "\n";
  std::cout << get<1>(t) << "\n";
}

This outputs the whole tuple followed by its two deinterleaved halves: get<0>(t) holds the even-indexed elements and get<1>(t) the odd-indexed ones.
We don't currently have an abstraction for that, but I think we should add one to exploit such intrinsics, as loading homogeneous tuple-like types is very common.
Note: ARM NEON has similar intrinsics too, so this is not a one-off.
Arithmetic Operations
SVABD
svabd computes the absolute difference of its arguments, i.e. |a - b|. This should be used for eve::dist instead of the max-min implementation. ONLY FOR INTEGER.
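A minimal sketch of that mapping, assuming an all-true predicate (wrapper name illustrative):

#include <arm_sve.h>

// Integer eve::dist(a, b) == |a - b| maps directly onto a single SABD.
svint32_t dist_via_abd_sketch(svint32_t a, svint32_t b)
{
  // _x form: inactive lanes are "don't care"; the predicate is all-true anyway.
  return svabd_s32_x(svptrue_b32(), a, b);
}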
SVMULH
svmulh computes the high bits of a * b, probably useful for #1501.
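A sketch of the ACLE form, assuming signed 32-bit lanes (wrapper name illustrative):

#include <arm_sve.h>

// Per-lane high half of the widened 32x32 -> 64-bit product.
svint32_t mulh_sketch(svint32_t a, svint32_t b)
{
  return svmulh_s32_x(svptrue_b32(), a, b);
}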
SVNMAD, SVNMLA
Gives us the missing fnma and fnam implementations without resorting to contortions with the other FMA-like operations. ONLY FOR FLOAT.
SVNMSB, SVNMLS
Gives us the missing fnms and fnsm implementations without resorting to contortions with the other FMS-like operations. ONLY FOR FLOAT.
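For reference when wiring these up, a sketch of the ACLE sign conventions as I read them (worth double-checking against the spec; wrapper names are illustrative):

#include <arm_sve.h>

// svnmad(a, b, c) computes -(a * b) - c.
svfloat32_t nmad_sketch(svfloat32_t a, svfloat32_t b, svfloat32_t c)
{
  return svnmad_f32_x(svptrue_b32(), a, b, c);
}

// svnmsb(a, b, c) computes a * b - c.
svfloat32_t nmsb_sketch(svfloat32_t a, svfloat32_t b, svfloat32_t c)
{
  return svnmsb_f32_x(svptrue_b32(), a, b, c);
}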
SVMINNM, SVMAXNM
Doesn't take NaN into account, giving us optimized numeric(min) and numeric(max).
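A sketch with an all-true predicate (wrapper name illustrative):

#include <arm_sve.h>

// NaN-ignoring minimum: where one lane is NaN, the other operand is
// returned, which matches numeric(min) semantics.
svfloat32_t numeric_min_sketch(svfloat32_t a, svfloat32_t b)
{
  return svminnm_f32_x(svptrue_b32(), a, b);
}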
SVMINNMV, SVMAXNMV
Doesn't take NaN into account, giving us optimized numeric(minimum) and numeric(maximum).
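The same idea as a horizontal reduction (wrapper name illustrative):

#include <arm_sve.h>

// NaN-ignoring reduction over all lanes, matching numeric(minimum).
float numeric_minimum_sketch(svfloat32_t v)
{
  return svminnmv_f32(svptrue_b32(), v);
}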
SVSCALE
Gives us optimized ldexp.
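A sketch (wrapper name illustrative):

#include <arm_sve.h>

// ldexp(x, n) == x * 2^n, one FSCALE per vector.
svfloat32_t ldexp_sketch(svfloat32_t x, svint32_t n)
{
  return svscale_f32_x(svptrue_b32(), x, n);
}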
SVCMPUO
Gives us optimized is_unordered and is_ordered.
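A sketch (wrapper name illustrative):

#include <arm_sve.h>

// True wherever a or b is NaN; is_ordered is the logical complement.
svbool_t is_unordered_sketch(svfloat32_t a, svfloat32_t b)
{
  return svcmpuo_f32(svptrue_b32(), a, b);
}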
SVACLT, SVACLE, SVACGT, SVACGE
Used to optimize maxmag and minmag.
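A sketch of maxmag using the absolute compare (tie and NaN handling omitted; wrapper name illustrative):

#include <arm_sve.h>

// maxmag(a, b): select the operand with the larger magnitude using a
// single FACGT instead of two fabs plus an ordinary compare.
svfloat32_t maxmag_sketch(svfloat32_t a, svfloat32_t b)
{
  svbool_t pg = svptrue_b32();
  return svsel_f32(svacgt_f32(pg, a, b), a, b);
}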
SVTSMUL, SVTMAD
Those functions collaborate to compute sin and cos in quarter_circle with full hardware optimisation.
See https://developer.arm.com/documentation/ddi0602/2022-12/SVE-Instructions/FTMAD--Floating-point-trigonometric-multiply-add-coefficient- for details.
SVEXPA
Accelerates exp(x) series.
See https://developer.arm.com/documentation/ddi0602/2022-12/SVE-Instructions/FEXPA--Floating-point-exponential-accelerator- for details.
SVRECPX + SVMULX
Those two combined give us optimized mantissa and frexp.
Bitwise Operations
SVCLS
svcls(x) + 1 gives us optimized eve::countl_zero.
SVREVB, SVREVH, SVREVW
svrevb(0xAABBCCDD) gives 0xDDCCBBAA per lane. The two others do the same per half-word or per word.
Helps for reverse and swap_adjacent_group.
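A sketch of the byte-swap case (wrapper name illustrative):

#include <arm_sve.h>

// Reverse the bytes inside each 32-bit lane: 0xAABBCCDD -> 0xDDCCBBAA.
svuint32_t byteswap32_sketch(svuint32_t v)
{
  return svrevb_u32_x(svptrue_b32(), v);
}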
SVCNOT
svcnot turns integers into 0 if they are not 0, and into 1 if they are 0. Optimizes binarize_not for integers.
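A sketch (wrapper name illustrative):

#include <arm_sve.h>

// Per-lane logical negation: zero lanes become 1, nonzero lanes become 0.
svint32_t binarize_not_sketch(svint32_t v)
{
  return svcnot_s32_x(svptrue_b32(), v);
}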
SVINSR
svinsr(v, s) shifts the elements of v one to the left and inserts the scalar s in the freed slot. Do we have any pattern benefiting from this?
SVANDV, SVEORV, SVORRV
Bitwise operations have a vector-wide reduction form. We don't use those for all, any, or none as we use the optimized count_true implementation.
Do we have any pattern benefiting from this?