reduce

`std::reduce` and `std::transform_reduce` are two algorithms that clang vectorizes consistently
(regardless of the shape of the call: it can be a hand-written loop or `std::accumulate`).
Because of this there is no scalar baseline.

There are two interesting variations on reduce:
a) We are reducing to the same type as the element type.
b) We are reducing to a bigger type to deal with a potential overflow.
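
As a rough illustration of the two variations, here is a minimal sketch using plain `std::reduce`. The function names `sum_same_type` and `sum_widened` are hypothetical and exist only for this example; the only thing that changes between them is the type of the initial value, which determines the accumulator type.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// (a) Accumulator has the element type. The caller has to make sure the
// sum fits into std::int8_t, otherwise the result is truncated.
inline std::int8_t sum_same_type(const std::vector<std::int8_t>& v) {
  return std::reduce(v.begin(), v.end(), std::int8_t{0});
}

// (b) Accumulator is a bigger type, so each element is widened before
// the addition and the sum of many chars does not overflow.
inline int sum_widened(const std::vector<std::int8_t>& v) {
  return std::reduce(v.begin(), v.end(), int{0});
}
```

For example, 1000 chars with the value 100 sum to 100000 in `sum_widened`, which would not fit in the element type.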

Reducing to the same type

This is relatively straightforward.
The best results are achieved when unrolling 4 times.
In this case we are consistently winning by a large margin - at most 5.6 times for 1000 bytes of chars.
I don't know why there is such a big difference in the behaviour of `std::reduce` - the assembly looks identical
and I'm controlling for code alignment.
The only idea I have is that the penalty for loop peeling is this big for chars:
I have seen quite poor performance of scalar code for chars multiple times.
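
To show what "unrolling 4 times" means in shape, here is a scalar sketch with four independent accumulators. This is not the `unsq_eve` implementation: in the vectorized version each accumulator would be a SIMD register rather than a single value, and `sum_unrolled4` is a hypothetical name used only here.

```cpp
#include <cstddef>

// Sums n elements with four independent accumulators, so the four
// additions in one iteration do not depend on each other.
// Assumption: for signed T wider than char the caller guarantees that
// the sum does not overflow.
template <typename T>
T sum_unrolled4(const T* f, std::size_t n) {
  T acc0{}, acc1{}, acc2{}, acc3{};
  std::size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    acc0 = static_cast<T>(acc0 + f[i]);
    acc1 = static_cast<T>(acc1 + f[i + 1]);
    acc2 = static_cast<T>(acc2 + f[i + 2]);
    acc3 = static_cast<T>(acc3 + f[i + 3]);
  }
  // Combine the partial sums, then handle the leftover elements.
  T res = static_cast<T>(static_cast<T>(acc0 + acc1) +
                         static_cast<T>(acc2 + acc3));
  for (; i < n; ++i) res = static_cast<T>(res + f[i]);
  return res;
}
```

The leftover elements are handled with a plain loop here only to keep the sketch short.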

With respect to code alignment, the `unsq_eve` version cares very little - at most 15% for 1000 bytes of chars.
`std::reduce` is not so lucky - at 10k it swings up to 44%, and this does not seem to be the fault of the scalar code.

reducing to the same type, data

reducing to the same type, code alignment, unsq_eve

reducing to the same type, code alignment, `std::reduce`

Reducing to a different type

When reducing to a different type, we need to somehow convert from the
array's element type to the type we want to do our operations in.
For this, `std::reduce` generates really nice assembly
(chars reducing to shorts).

```
vpmovsxbw ymm4, xmmword ptr [rdi + rdx]
vpaddw ymm0, ymm0, ymm4
vpmovsxbw ymm4, xmmword ptr [rdi + rdx + 16]
vpaddw ymm1, ymm1, ymm4
vpmovsxbw ymm4, xmmword ptr [rdi + rdx + 32]
```

This is essentially `_mm256_cvtepi8_epi16` called directly on the address.

I do the same trick in eve by loading a smaller `eve::wide` and
then calling `eve::convert` on it. We end up winning primarily because of not peeling the loops.
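
The "load narrow, widen, then add" idea can be sketched in portable C++ without any SIMD library. This is only an illustration of the data flow, not the eve code: `kLanes` and `sum_chars_to_short` are hypothetical names, the inner loop stands in for a single widening load plus a vector add, and the tail is handled with a plain scalar loop for brevity.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Assumption: 16 lanes, i.e. the number of shorts in one 256-bit register.
constexpr std::size_t kLanes = 16;

inline std::int16_t sum_chars_to_short(const std::int8_t* f, std::size_t n) {
  std::array<std::int16_t, kLanes> acc{};  // one short accumulator per lane
  std::size_t i = 0;
  for (; i + kLanes <= n; i += kLanes) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {
      // Widen the char to short first, then add in the wider type.
      const auto widened = static_cast<std::int16_t>(f[i + lane]);
      acc[lane] = static_cast<std::int16_t>(acc[lane] + widened);
    }
  }
  // Horizontal sum of the lanes, then the leftover elements.
  std::int16_t res = 0;
  for (std::int16_t x : acc) res = static_cast<std::int16_t>(res + x);
  for (; i < n; ++i) res = static_cast<std::int16_t>(res + f[i]);
  return res;
}
```

With 40 chars equal to 100 the result is 4000, which overflows a char but fits in a `short`.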

reducing 40 bytes

On 40 bytes, not peeling loops has the biggest effect.
Adding chars to `short` is 3.3 times faster and adding to `int` is ~30% faster.
Adding shorts to `int` is 1.7 times faster.

reducing chars 40 bytes data

reducing shorts 40 bytes data

reducing 1000 bytes

On 1000 bytes, not peeling loops still has an effect, though not as big.
Adding chars to `short` is 1.5 times faster and adding to `int` is roughly the same.
Adding shorts to `int` is 1.13 times faster.

reducing chars 1000 bytes data

reducing shorts 1000 bytes data

reducing 10000 bytes

On 10000 bytes, loop peeling stops being important.

reducing chars 10000 bytes data

Total benchmark