Optimizing generic algorithms
Denis Yaroshevskiy
Disclaimers
 No affiliation with the mentioned projects
 What I know is pieced together from my
experiments and the Internet
 One compiler, one machine
 Beware of bugs and issues
 Some of my statements are opinions
Goal
What can you do if your performance bottleneck is a
general purpose utilitly
Why listen to me about this?

Have a couple of accepted patches to libc++
(and more not accepted)

A few interesting contributions to
Chromium's base library
 Microbenchmark algorithms (quite a lot)
Overview
 microbenchmarking
 partition_point(lower_bound)
 merge
 stable_sort
Biggest problem of microbenchmarking
live_demo
Live demo didn't work
Live demo didn't work
Live demo didn't work
Good news
 Difference is huge
 Very unstable
What to do about it
 One benchmark per binary
 Tweak your code (run for similar types)
 On Linux, there are tools
 Denis Bakhvalov:Code
alignment issues
partition_point
(lower_bound)
What affects performance of a binary search?
 Range size
 Which element are we looking for?
 Predicate
 Distribution of data
 Nontemplated operations
 Cache locality of data
 Binary size
Start with an optimization
 Pick 1 or 2 "fundamental" benchmarks
 Add a benchmark for your optimization
 Add benchmarks that could have been negatively
affected
Suggested optimization (no CE)
Suggested optimization (no CE)
What did we affect?
 Non templated operations
 Binary size (2 instructions)
Basic binary search benchmark
 Binary search a 1,000 integers
 Parameterized on element to look up
 Doublecheck on int64 and doubles
What affects performance of merge?
 Total size
 How elements are mixed
 Comparison
 Move
 Nontemplated operations
 Cache locality of data
 Binary size
What did we affect?
 Nontemplated operations
 Binary size (+ 11 instructions ~20%)
Basic merge benchmark
 Merge 2,000 integers
 Parameterized on 2 sizes (represent distribution)
 Doublecheck on int64 and doubles
What did we affect?
 Extra comparison/assignment
 Nontemplated operations
 Binary size for ints (+ 10 instructions ~20%)
What to do about binary size?
We need to benchmark stable_sort
 Denis Bakhvalov's post
on branches
 libc++ patch didn't land
 Compiler specific issues
 I have a similar set_union~ish algorithm
stable_sort is
 Memory adaptive merge sort
 Insertion sort (under 8 in libc++)
 Merge sort
 Merge sort "rotating middles"
What affects performance of sort?
 Total size
 How sorted is data already
 How many duplicates do we have
 Comparison
 Move
 Non templated operations
 Cache locality of data
 Binary size
Basic sort benchmark
 Sort 1,000 integers
 Parameterized on how sorted they are
 I don't know how to address duplicates
 Doublecheck against code alignment
 Remove the baseline from benchmark
From sorted to reverse sorted?
 Lexicographical permutations (nth_permutation)
 Biased shuffle (bad) with reverse
Possible bigger types
 fake_url: https://{number}.com
 fake_url pair
 noinline_int
Sort pointers
 Rearrange underlying elements based on
rearrangement of pointers
 Elements of programming, chapter 10:
Rearrangements
What did we affect?
 Move
 Non templated operations
 Cache locality of data
 Binary size
Could we/should we do it?
 Noinline move could be an intrinsic
 Move is bigger than copy bytes
 I need better implementations
 Pointer based rearrangement is powerful
How to optimize a generic component
 Analyze what can affect the performance
 Come up with an idea
 Build benchmarks for the basic case
 Build benchmarks for your optimization
 Build benchmarks for what could have been
negatively affected
 Make the decision on the change
Some points
 Have a rationale for your optimization
 Benchmarks are very errorprone
 Small changes matter
 std::rotate could be improved
Where to learn about hardware/tools?
Where to learn about algorithms?
Thanks to:
 Oleg Fatkhiev
 Denis Bakhvalov
Questions?
Code: https://github.com/DenisYaroshevskiy/algorithm_dumpster
Slides: https://tinyurl.com/y45sp8t2