My first SIMD
Implementing STL Algorithms Using SIMD Extensions
Denis Yaroshevskiy
Disclaimers
- Performance is tricky.
- No ARM measurements.
- Some of my statements are my opinions.
The Plan
- What's SIMD and how to get it?
- std::strlen
- Some other algorithms.
What's SIMD and how to get it?
Vector Processor Extensions
- x86
- 128 bits: SSE2, SSE3, SSSE3, SSE4, SSE4.1, SSE4.2
- 256 bits: AVX, AVX2, XOP
- 512 bits: AVX512 and its myriad of sub-genre
- ARM
- 128 bits: NEON, ASIMD
- SVE (VLS/VLA)
- PowerPC
- WASM
Tell your compiler
- -march=native
- Compile for specific architectures
- Runtime selection (hard)
Now get some SIMD code
- auto-vectorization
- openmp
- by hand with intrinsics or Assembly
Ideas
- allocations happen in pages
- aligned addresses are safe
Dynamic analysis
- unsafe(load)
- __attribute__((no_sanitize_address))
Measuring
- Google benchmark
- code alignment
- 40, 1000, 10'000 bytes
- char, short, int
Notes on find/find_unguarded
reduce (same type)
- replace(wide, ignore, wide)
- reduce(wide, op) -> wide
reduce (different type)
- wide<T, fixed<N>>
- convert
Notes on reduce
- sse2-sse4.2 char -> int reduction
- load[ignore.else_(x)]
inclusive_scan (inplace)
- store[ignore](wide, ptr)
- scan(wide, op, zero)
remove
- safe/unsafe compress_store[ignore]
General notes
- massive speed ups
- code alignment impact on SIMD
- data impact on SIMD
- aligned/unaligned memory access
What we didn't cover?
- multiple range algorithms
- set operations
- partition (in place) / reverse
- floating point / math
- cache effects
- gather
Thanks to
Joel Falcou, Jean-Thierry Lapresté
Stack Overflow: @aqrit, @Peter Cordes, @Z boson, @Stephen Canon