scalar horner algo using simd computations #1173

jtlap · 2022-01-03T15:03:24Z

jtlap
Jan 3, 2022
Collaborator

I wrote this:

template<eve::floating_scalar_value T> inline
auto scalar_reverse_horner(std::vector<T> const & a, T x) noexcept
{
  using w_t =   eve::wide<T>;
  auto zz =  T(1);
  w_t xx([&zz, x](auto i,  auto){ auto z = zz; zz *= x; return z;}); //exclusive scan -> [1,x,..,x^(N-1)} 
  int N = eve::cardinal_v<w_t>;
  auto xc = zz;
  int nb = a.size()/N;
  w_t s(0);
  for(int i=0; i < nb*N; i+= N )
  {
    auto v =  eve::load(&a[i]);
    s = eve::fam(s, v, xx);
    xx *= xc;
  }
  auto rs = eve::reduce(s);

  auto xl = xx.get(0);
  for(int i=nb*N; i < a.size(); ++i)
  {
    rs = eve::fam(rs, a[i], xl);
    xl *= x;
  }
  return rs;
}

This is an internally simd version of the reverse horner algorithm which computes the value of a polynomial at a scalar floating point value x using a vector containing the polynomial coefficients in increasing order.
In the purely scalar version there is a loop is to compute from
s=0;
looping from 0 to siz-1 by 1 (siz being the size of a)

 s=s+x*a[i] ;

over the coefficients.

In this version the loop is here replaced by a shorter one from 0 to siz-1 by N (instead of 1)

s = s+ [1,x,..., x^(N-1)]*{a[i],...a[i+N-1]};
xx *= x^N;

where N is the number of lanes of the simd vector ( Of course [1,x,..., x^(N-1)] and x^N are computed once.)
This loop is then followed by a reduce to sum the N lanes and perhaps by a scalar ending. if N does not divides siz.

Can this be written using eve algorithms or with the extension of some algorithms?

At least it seems it will need a slight extension of reduce using 2 different op one for the simd accumulation and tone for he final wide reduce (will be eve::add here), but I tried that and was not fully successful. (Of course the algo::reduce first pass (the main loop) must be unaligned as spurious leading zeros in a are not wanted)

DenisYaroshevskiy · 2022-01-08T19:36:45Z

DenisYaroshevskiy
Jan 8, 2022
Collaborator

This

    xx *= xc;

Seems wrong

I suspect, you were meant to scan it.

Can you just post a scalar version of the algorithm and then it'd be easier to figure out.

0 replies

DenisYaroshevskiy · 2022-01-08T19:55:33Z

DenisYaroshevskiy
Jan 8, 2022
Collaborator

I found this online: https://en.wikipedia.org/wiki/Horner%27s_method

Is this the algorithm you wanted?

Didn't test anything, but I think the code should looks smth like:

double horners_method(std::span<const double> coefficients, double point)
{  
   eve::wide<double> xs (x);
   
   double x_to_the_nth = x;
   for (int i = eve::wide<double>::size(); i; --i) x_to_the_nth * x;
   
   eve::wide<double> xs_step (x_to_the_nth);

   xs.set(0, 1.0);
   xs = eve::scan(xs, eve::mul);
   
   eve::wide<double> sums(0.0);
   // remove aligning to avoid dealing with not matching begin.
   // we can get the aligning and it's a good idea, because we are read only but this is simpler for a sketch
   eve::algo::for_each[eve::algo::no_aligning](coeffients, [](auto it, auto ignore) {
     auto coeffs = eve::load[ignore.else_(0.0)](it);
     sums = eve::fam(coeffs, xs);
     xs *= xs_step;
   });
   
   return eve::reduce(sums);
}

0 replies

jtlap · 2022-01-08T22:24:05Z

jtlap
Jan 8, 2022
Collaborator Author

Ok it seems to be like that and I will try using scan and for_each.
But it seems to me that xs = eve::scan(xs, eve::mul); produces a wide full of 1 ? Am I wrong ?
mine get {1,x,xx,xxx} if the cardinal is 4 and at each iteration I multiply by xc that is just the scalar xxxx
multiplying by the right coeffs and
summing all that gives 4 partial sums of the polynomial that I finally give to reduce.
Here is my last version of this reverse_horner algorithm with simd guts :
(the else part is evaluating scalar polynomial on simd elements which existed yet)

EVE_FORCEINLINE constexpr auto reverse_horner_(EVE_SUPPORTS(cpu_)
, T x, R const & r) noexcept
{
constexpr size_t N = eve::cardinal_v<w_t>;
const size_t sz = r.size();
if (sz >= N && scalar_value && scalar_value)
{
using w_t = eve::wide;
size_t N = eve::cardinal_v<w_t>;
size_t const sz = r.size();
{
size_t const nb = sz/N;
size_t nbN = nb*N;
auto zz = T(1);
w_t xx([&zz, x](auto , auto){ auto z = zz; zz = x; return z;});
size_t N = eve::cardinal_v<w_t>;
auto xc = zz;
w_t s(0);
for(size_t i=0; i < nbN; i+= N )
{
auto v = eve::load(&r[i]);
s = eve::fam(s, v, xx);
xx= xc;
}
if(auto d = sz-nbN)
{
auto v = eve::loadignore_last(N-d);
s = eve::fam(s, v, xx);
}
auto rs = eve::reduce(s);
return rs;
}
else
{
return detail::reverse_horner_impl(regular_type(), x, r);
}
}

0 replies

jtlap · 2022-01-08T22:35:01Z

jtlap
Jan 8, 2022
Collaborator Author

the polynom coeffs are in increasing degree order (this is reverse)
it works fine in and rapidly (I bench it) if the the vector containing the coefficients of the polynomial is wider than the current expected wide of the archi.
I have no time just now, but I will soon make a branch with the implementation, you will be able to test.
As I wanted to write it with the eve:algo I wrote a reduce version with two op parameters. One for the main reduction the other to finish the process. I will put that also and the using this reduce2 version.
I noticed that the number of iterations was greater with that version than with the above one

0 replies

DenisYaroshevskiy · 2022-01-08T22:50:06Z

DenisYaroshevskiy
Jan 8, 2022
Collaborator

eve::algo is for generic stl like algorithms written in a certain style.
The code copypaste failed I guess.

I am not sure what's reduce 2. We generally follow the standard algorithms so far.

However, let's see a clean version and then we can figure out an apropriate for algo generalization.

0 replies

jfalcou · 2022-01-17T22:02:00Z

jfalcou
Jan 17, 2022
Maintainer

Let's move this to discussion

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scalar horner algo using simd computations #1173

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 6 comments

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

scalar horner algo using simd computations #1173

jtlap Jan 3, 2022 Collaborator

Replies: 6 comments

DenisYaroshevskiy Jan 8, 2022 Collaborator

DenisYaroshevskiy Jan 8, 2022 Collaborator

jtlap Jan 8, 2022 Collaborator Author

jtlap Jan 8, 2022 Collaborator Author

DenisYaroshevskiy Jan 8, 2022 Collaborator

jfalcou Jan 17, 2022 Maintainer

jtlap
Jan 3, 2022
Collaborator

DenisYaroshevskiy
Jan 8, 2022
Collaborator

DenisYaroshevskiy
Jan 8, 2022
Collaborator

jtlap
Jan 8, 2022
Collaborator Author

jtlap
Jan 8, 2022
Collaborator Author

DenisYaroshevskiy
Jan 8, 2022
Collaborator

jfalcou
Jan 17, 2022
Maintainer