reverse

So far there is only an in-place version (no reverse_copy).
Unrolling by hand does not do anything useful.

40 bytes summary

I am not sure whether the slowest part is active in any of these cases, but
it seems that for 10 integers we can't beat the scalar performance.
I have no idea why the scalar version is this fast, though.
For chars it makes some sense: we get a 2x speedup.
With code alignment the results are not as good as for other simd algorithms:
there are a few closely spaced branches at the beginning (especially for integers).
I could turn one of them into a cmov, but I don't care enough at the moment to try.

40 bytes, data

40 bytes, code alignment

1000 bytes summary

Nice results: 17.8 times faster for chars, 9.1 for shorts, 7.6 for ints.
Code alignment can decrease perf by almost 1.75 times for integers;
again, blame those two branches. Not sure I care.

1000 bytes, data

1000 bytes, code alignment

10000 bytes summary

Nice results: 17.8 times faster for chars, 9 times for shorts, 6.5 for ints.
The impact of code alignment is almost negligible: about 10% for ints and nothing for the others.