inclusive_scan

My implementation is based on a StackOverflow answer by Z boson.
So far I only have an in-place version (each element is replaced with the running sum); writing to a different array is not supported.
I only tried aligned loads and stores.
The best unrolling seems to be `2` for shorts and `1` for chars/ints.
The main routine that does the work is `eve_extra::inclusive_scan_wide` that computes (as the name suggests)
`inclusive_scan` for one wide register.
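As a rough illustration of how a per-register scan can drive the in-place version, here is a minimal scalar sketch. The names `scan_one_register` and `inclusive_scan_inplace` are hypothetical, the per-register step is a plain loop standing in for `eve_extra::inclusive_scan_wide`, and the size is assumed to be a multiple of the register width (no tail handling, no unrolling).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a per-register scan such as
// `eve_extra::inclusive_scan_wide`: scans `width` elements in place.
template <typename T>
void scan_one_register(T* chunk, std::size_t width) {
  for (std::size_t i = 1; i < width; ++i) chunk[i] += chunk[i - 1];
}

// In-place inclusive scan, one register-sized chunk at a time.
// Assumption: data.size() is a multiple of `width`.
template <typename T>
void inclusive_scan_inplace(std::vector<T>& data, std::size_t width) {
  assert(width != 0 && data.size() % width == 0);
  T carry = T{};  // running sum of everything before the current chunk
  for (std::size_t start = 0; start < data.size(); start += width) {
    T* chunk = data.data() + start;
    scan_one_register(chunk, width);                             // local scan
    for (std::size_t i = 0; i < width; ++i) chunk[i] += carry;   // add the carry to every lane
    carry = chunk[width - 1];                                    // last lane holds the new total
  }
}
```

In a vectorized version the carry would presumably be kept broadcast in a register, so that adding it costs one extra addition per chunk.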

Our theoretical win comes from doing log2(N) additions for a register of N elements, instead of the N - 1 additions of the scalar version.
So for:
* 32 chars - we do 5 additions instead of 31 => could be a 6 times speed up.
* 16 shorts - we do 4 additions instead of 15 => could be a 4 times speed up.
* 8 ints - we do 3 additions as opposed to 7 for scalar => could be a 2 times speed up.
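The addition counts above can be checked with a portable sketch of the shift-and-add scan. This is not the library's code: the register is modelled as a `std::vector` whose size is a power of two, the function name `inclusive_scan_register` is made up for illustration, and each loop iteration stands for one "shift lanes up, then add" step that a SIMD version would do with one whole-register addition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scans `reg` in place using log2(n) shift-and-add steps.
// Returns the number of whole-register additions performed.
template <typename T>
int inclusive_scan_register(std::vector<T>& reg) {
  const std::size_t n = reg.size();
  assert(n != 0 && (n & (n - 1)) == 0);  // assumption: power-of-two lane count
  int additions = 0;
  for (std::size_t shift = 1; shift < n; shift *= 2) {
    // Model of "shift the register up by `shift` lanes, filling with zeros".
    std::vector<T> shifted(n, T{});
    for (std::size_t i = shift; i < n; ++i) shifted[i] = reg[i - shift];
    // One whole-register addition.
    for (std::size_t i = 0; i < n; ++i) reg[i] += shifted[i];
    ++additions;
  }
  return additions;
}
```

Calling it on 32, 16 and 8 lanes performs 5, 4 and 3 additions respectively, matching the list above.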

40 bytes summary

For 40 bytes we don't win for anything except chars.
Originally I implemented `store(ignore)` with `maskmoveu`, which was really bad,
but then I used memcpy for chars and shorts and it got better; see the whole story on StackOverflow.
I tried to solve this with different iteration patterns but was not successful: StackOverflow
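To make the memcpy idea concrete, here is one possible shape of a partial store. It is only a sketch under assumptions: `store_ignoring` is a hypothetical name, the register is assumed to have already been written to a temporary buffer `lanes`, and only lanes in `[first, last)` are copied to the destination.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies only lanes [first, last) of a register image
// (already spilled to `lanes`) into `dst`, leaving the other elements of
// `dst` untouched.
template <typename T>
void store_ignoring(const T* lanes, std::size_t width, T* dst,
                    std::size_t first, std::size_t last) {
  assert(first <= last && last <= width);
  std::memcpy(dst + first, lanes + first, (last - first) * sizeof(T));
}
```

Elements outside the requested range are never written, which is the property a masked store has to provide.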

40 bytes, data

1000 bytes summary

5 times win for chars, almost 3 times win for shorts and 2 times for ints.
So not quite 6, 4 and 2, but not too far off.

inclusive_scan

40 bytes summary

40 bytes, data

1000 bytes summary

1000 bytes, data

10'000 bytes summary

10'000 bytes, data

Total benchmark