Optimizing generic algorithms

Denis Yaroshevskiy

Disclaimers

  • No affiliation with the mentioned projects
  • What I know is pieced together from my experiments and the Internet
  • One compiler, one machine
  • Beware of bugs and issues
  • Some of my statements are opinions

Goal

What can you do if your performance bottleneck is a general purpose utilitly

Why listen to me about this?

  • Have a couple of accepted patches to libc++
    (and more not accepted)
  • A few interesting contributions to
    Chromium's base library
  • Microbenchmark algorithms (quite a lot)

Overview

  • microbenchmarking
  • partition_point(lower_bound)
  • merge
  • stable_sort

Microbenchmarking

Biggest problem of microbenchmarking

live_demo
Live demo didn't work Before
Live demo didn't work Before
Live demo didn't work Before

Bad news

Good news

  • Difference is huge
  • Very unstable

What to do about it

  • One benchmark per binary
  • Tweak your code (run for similar types)
  • On Linux, there are tools
  • Denis Bakhvalov:Code alignment issues

partition_point

(lower_bound)

What affects performance of a binary search?

  • Range size
  • Which element are we looking for?
  • Predicate
  • Distribution of data
  • Non-templated operations
  • Cache locality of data
  • Binary size

Start with an optimization

  • Pick 1 or 2 "fundamental" benchmarks
  • Add a benchmark for your optimization
  • Add benchmarks that could have been negatively affected

Suggested optimization

Suggested optimization (no CE)

Before

Suggested optimization (no CE)

After

What did we affect?

  • Non templated operations
  • Binary size (-2 instructions)

Basic binary search benchmark

  • Binary search a 1,000 integers
  • Parameterized on element to look up
  • Double-check on int64 and doubles

merge

What affects performance of merge?

  • Total size
  • How elements are mixed
  • Comparison
  • Move
  • Non-templated operations
  • Cache locality of data
  • Binary size

What did we affect?

  • Non-templated operations
  • Binary size (+ 11 instructions ~20%)

Basic merge benchmark

  • Merge 2,000 integers
  • Parameterized on 2 sizes (represent distribution)
  • Double-check on int64 and doubles

What did we affect?

  • Extra comparison/assignment
  • Non-templated operations
  • Binary size for ints (+ 10 instructions ~20%)

What to do about binary size?

We need to benchmark stable_sort

  • Denis Bakhvalov's post on branches
  • libc++ patch didn't land
  • Compiler specific issues
  • I have a similar set_union~ish algorithm

stable_sort

stable_sort is

  • Memory adaptive merge sort
  • Insertion sort (under 8 in libc++)
  • Merge sort
  • Merge sort "rotating middles"

What affects performance of sort?

  • Total size
  • How sorted is data already
  • How many duplicates do we have
  • Comparison
  • Move
  • Non templated operations
  • Cache locality of data
  • Binary size

Basic sort benchmark

  • Sort 1,000 integers
  • Parameterized on how sorted they are
  • I don't know how to address duplicates
  • Double-check against code alignment
  • Remove the baseline from benchmark

From sorted to reverse sorted?

  • Lexicographical permutations (nth_permutation)
  • Biased shuffle (bad) with reverse

Possible bigger types

  • fake_url: https://{number}.com
  • fake_url pair
  • noinline_int

Sort pointers

  • Rearrange underlying elements based on rearrangement of pointers
  • Elements of programming, chapter 10: Rearrangements

What did we affect?

  • Move
  • Non templated operations
  • Cache locality of data
  • Binary size

Could we/should we do it?

  • Noinline move could be an intrinsic
  • Move is bigger than copy bytes
  • I need better implementations
  • Pointer based rearrangement is powerful

Conclusions

How to optimize a generic component

  • Analyze what can affect the performance
  • Come up with an idea
  • Build benchmarks for the basic case
  • Build benchmarks for your optimization
  • Build benchmarks for what could have been negatively affected
  • Make the decision on the change

Some points

  • Have a rationale for your optimization
  • Benchmarks are very error-prone
  • Small changes matter
  • std::rotate could be improved

Where to learn about hardware/tools?

Where to learn about algorithms?

Thanks to:

  • Oleg Fatkhiev
  • Denis Bakhvalov

Questions?

Code: https://github.com/DenisYaroshevskiy/algorithm_dumpster

Slides: https://tinyurl.com/y45sp8t2