My first SIMD

Implementing STL Algorithms Using AVX2 Extensions

Denis Yaroshevskiy

So, what's this about?

https://tinyurl.com/y5hn6e3c

Disclaimers

  • I am NOT an expert in SIMD.
  • One compiler, one machine.
  • Everything is x86 specific.
  • Some of my statements are my opinions.

The Plan

  • What's SIMD and how to get it?
  • std::strlen
  • Some other algorithms.

What's SIMD and how to get it?

x86 vector extensions

  • SSE 1-3 (Pentium 4)
  • SSE 4-4.2
  • AVX (PS4)
  • AVX-2 (MacBook Air 2015)
  • AVX-512 (servers only)

Tell your compiler

  • -march=native
  • Compile for specific architectures
  • Runtime selection (hard)

Now get some SIMD code

  • auto-vectorization
  • openmp
  • by hand with intrinsics or Assembly

On using intrinsics

Wrap it!

Wrap it!

  • nsimd, tsimd, xsimd, vc
  • write my own
  • eve

strlen

Ideas

  • allocations happen in pages
  • aligned addresses are safe

strlen code

					
std::size_t strlen(const char* s) {
  using wide = eve::wide<char>;

  const wide zeroes{0};

  aligned_ptr aligned_s =
    previous_aligned_address<sizeof(wide)>(s);
  wide cur{aligned_s};

  ignore_first ignore(s - aligned_s.get());
  std::optional match = first_true(cur == zeroes, ignore);

  while (!match) {
    aligned_s += wide::static_size;
    cur = wide{aligned_s};
    match = first_true(cur == zeroes, ignore_none);
  }

  return static_cast<std::size_t>(aligned_s.get() + *match - s);
}
					
				

Dynamic analysis

  • load_unsafe(ptr, as_<wide>)
  • __attribute__((no_sanitize_address))

strlen code

					
std::size_t strlen(const char* s) {
  using wide = eve::wide<char>;

  const wide zeroes{0};

  aligned_ptr aligned_s =
    previous_aligned_address<sizeof(wide)>(s);
  wide cur = load_unsafe(aligned_s, as_<wide>{});

  ignore_first ignore(s - aligned_s.get());
  std::optional match = first_true(cur == zeroes, ignore);

  while (!match) {
    aligned_s += wide::static_size;
    cur = load_unsafe(aligned_s, as_<wide>{});
    match = first_true(cur == zeroes, ignore_none);
  }

  return static_cast<std::size_t>(aligned_s.get() + *match - s);
}
					
				

Measuring

  • Google benchmark
  • code alignment
  • 40, 1000, 10'000 bytes
  • char, short, int

Notes on find/find_unguarded

reduce

reduce (same type)

  • replace_ignored(wide, ignore, wide)
  • reduce_wide(wide, op) -> wide

reduce (different type)

  • wide<T, fixed<N>>
  • convert

inclusive_scan

inclusive_scan (in place)

  • store(ptr, wide, ignore)
  • inclusive_scan_wide(wide, zero)

Notes on inclusive_scan

remove

remove

  • compress_store_unsafe(wide, T*, logical, ignore) -> T*
  • compress_store_precise
  • if_not_

Notes on remove

General notes

  • massive speed ups
  • code alignment impact on SIMD
  • data impact on SIMD
  • aligned/unaligned memory access

What we didn't cover?

  • multiple range algorithms
  • partition (in place) / reverse
  • floating point / math
  • cache effects
  • gather

Don't know how to

  • merge/set_union
  • insertion sort
  • operating on partial structs

Thanks to

  • Joel Falcou
  • Peter Cordes

Links

write strlen

Questions?

https://denisyaroshevskiy.github.io/my_first_simd_1_presentation

Dealing with code alignment

  • __attribute__((always_inline))
  • -mllvm -align-all-functions=7

Dealing with code alignment


template <typename Slide, /*...*/>
NOINLINE void benmark_body(Slide slide, benchmark::State& state, /* ... */) {
  noop_slide(slide);  // asm volatile("nop");

  for (auto _ : state) {
    // measure this
  }
}
        

std::remove, alignment