Optimizing generic algorithms
Denis Yaroshevskiy
Disclaimers
- No affiliation with the mentioned projects
- What I know is pieced together from my
experiments and the Internet
- One compiler, one machine
- Beware of bugs and issues
- Some of my statements are opinions
Goal
What can you do if your performance bottleneck is a
general purpose utilitly
Why listen to me about this?
-
Have a couple of accepted patches to libc++
(and more not accepted)
-
A few interesting contributions to
Chromium's base library
- Microbenchmark algorithms (quite a lot)
Overview
- microbenchmarking
- partition_point(lower_bound)
- merge
- stable_sort
Biggest problem of microbenchmarking
live_demo
Live demo didn't work
Live demo didn't work
Live demo didn't work
Good news
- Difference is huge
- Very unstable
What to do about it
- One benchmark per binary
- Tweak your code (run for similar types)
- On Linux, there are tools
- Denis Bakhvalov:Code
alignment issues
partition_point
(lower_bound)
What affects performance of a binary search?
- Range size
- Which element are we looking for?
- Predicate
- Distribution of data
- Non-templated operations
- Cache locality of data
- Binary size
Start with an optimization
- Pick 1 or 2 "fundamental" benchmarks
- Add a benchmark for your optimization
- Add benchmarks that could have been negatively
affected
Suggested optimization (no CE)
Suggested optimization (no CE)
What did we affect?
- Non templated operations
- Binary size (-2 instructions)
Basic binary search benchmark
- Binary search a 1,000 integers
- Parameterized on element to look up
- Double-check on int64 and doubles
What affects performance of merge?
- Total size
- How elements are mixed
- Comparison
- Move
- Non-templated operations
- Cache locality of data
- Binary size
What did we affect?
- Non-templated operations
- Binary size (+ 11 instructions ~20%)
Basic merge benchmark
- Merge 2,000 integers
- Parameterized on 2 sizes (represent distribution)
- Double-check on int64 and doubles
What did we affect?
- Extra comparison/assignment
- Non-templated operations
- Binary size for ints (+ 10 instructions ~20%)
What to do about binary size?
We need to benchmark stable_sort
- Denis Bakhvalov's post
on branches
- libc++ patch didn't land
- Compiler specific issues
- I have a similar set_union~ish algorithm
stable_sort is
- Memory adaptive merge sort
- Insertion sort (under 8 in libc++)
- Merge sort
- Merge sort "rotating middles"
What affects performance of sort?
- Total size
- How sorted is data already
- How many duplicates do we have
- Comparison
- Move
- Non templated operations
- Cache locality of data
- Binary size
Basic sort benchmark
- Sort 1,000 integers
- Parameterized on how sorted they are
- I don't know how to address duplicates
- Double-check against code alignment
- Remove the baseline from benchmark
From sorted to reverse sorted?
- Lexicographical permutations (nth_permutation)
- Biased shuffle (bad) with reverse
Possible bigger types
- fake_url: https://{number}.com
- fake_url pair
- noinline_int
Sort pointers
- Rearrange underlying elements based on
rearrangement of pointers
- Elements of programming, chapter 10:
Rearrangements
What did we affect?
- Move
- Non templated operations
- Cache locality of data
- Binary size
Could we/should we do it?
- Noinline move could be an intrinsic
- Move is bigger than copy bytes
- I need better implementations
- Pointer based rearrangement is powerful
How to optimize a generic component
- Analyze what can affect the performance
- Come up with an idea
- Build benchmarks for the basic case
- Build benchmarks for your optimization
- Build benchmarks for what could have been
negatively affected
- Make the decision on the change
Some points
- Have a rationale for your optimization
- Benchmarks are very error-prone
- Small changes matter
- std::rotate could be improved
Where to learn about hardware/tools?
Where to learn about algorithms?
Thanks to:
- Oleg Fatkhiev
- Denis Bakhvalov
Questions?
Code: https://github.com/DenisYaroshevskiy/algorithm_dumpster
Slides: https://tinyurl.com/y45sp8t2