transform

So far there is only an in-place version.
My benchmark: double every element, `x = x + x`.
`std::transform` is consistently vectorized by clang (godbolt), so there is no purely scalar baseline.
Clang peels the loop, while I don't.
Clang also uses unaligned loads and stores, even for the in-place version.
I experimented with different, fancier iteration patterns but didn't have much success: StackOverflow.
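
For reference, a minimal sketch of the baseline kernel described above: an in-place `std::transform` that doubles every element. The name `double_inplace` and the pointer-pair signature are hypothetical, chosen for illustration, and are not taken from the benchmark sources.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not from the benchmark sources):
// doubles every element of [first, last) in place, x = x + x.
template <typename T>
void double_inplace(T* first, T* last) {
  assert(first <= last);  // precondition: a valid, possibly empty range
  // Using `first` as the output iterator makes std::transform write in place.
  // The cast brings the result back to T after integer promotion (char, short).
  std::transform(first, last, first,
                 [](T x) { return static_cast<T>(x + x); });
}
```

Calling `double_inplace(data, data + n)` on a buffer of `char`, `short` or `int` gives the three cases measured below.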

40 bytes summary

For 40 bytes of data I cannot beat loop peeling for int.
2 times win for char, nothing for short, and I lose about 30% for int.
Code alignment is a pain for both me and the standard, but for different types:
chars for me (1.6 times), shorts for the standard (2 times).
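
To make "loop peeling" concrete, here is a rough sketch of the general shape of such a loop: a main loop over whole blocks plus a separate scalar loop for the leftover elements. It is an illustration only, not clang's actual output and not the unsq_eve implementation. `kBlock` and `double_peeled` are hypothetical names, and the fixed-size inner loop merely stands in for one vector register.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical block width: stands in for the element count of one vector register.
constexpr std::size_t kBlock = 8;

// Hypothetical helper: doubles n elements in place with a block loop and a peeled tail.
template <typename T>
void double_peeled(T* data, std::size_t n) {
  assert(data != nullptr || n == 0);  // precondition: valid buffer unless empty
  std::size_t i = 0;
  // Main loop: whole blocks only; each block is a candidate for one vector operation.
  for (; i + kBlock <= n; i += kBlock) {
    for (std::size_t j = 0; j < kBlock; ++j) {
      data[i + j] = static_cast<T>(data[i + j] + data[i + j]);
    }
  }
  // Peeled tail: the remaining n % kBlock elements, one at a time.
  for (; i < n; ++i) {
    data[i] = static_cast<T>(data[i] + data[i]);
  }
}
```

With 10 elements and a block of 8, one block goes through the main loop and 2 elements through the tail.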

40 bytes, data

40 bytes, code alignment

1000 bytes summary

Not peeling is a good win on 1000 bytes: 4 times for chars, 2 times for shorts, 1.5 times for ints.
Code alignment swings for unsq_eve are about 20% for chars and shorts.
For std, ints misbehave, showing about 1.5 times swings.

1000 bytes, data

1000 bytes, code alignment

10'000 bytes summary

10'000 bytes behaves roughly identically.
std suffers from code alignment issues, with 1.7 times swings.
I suspect unaligned loads/stores (see StackOverflow again).

10'000 bytes, data

10'000 bytes, code alignment

Total benchmark