Introduction
Being able to debug and profile your own programs is a crucial programming skill. This book looks at different ways to debug, profile and benchmark your programs in different languages.
License
Debugging and Profiling by Markus Vieth is marked with CC0 1.0
Debugging
Testing and test driven development
Profiling
Profiling is a form of dynamic program analysis that measures the behavior of your program while it runs (for example the execution time of parts of your program or the memory it uses). We will concentrate on using profiling to find the parts of a program where optimization could reduce the runtime. For this we will use a so-called profiler, which executes the program with debugging information to collect the information we need.
More information on how a profiler can collect such information can be found in the “Program analysis” lecture by Prof. Erdweg.
In this course you will see how to analyze your program and find the parts with a long runtime, which have the potential to be optimized and to noticeably affect the runtime of your program. Without profiling you will probably optimize a part of your program that was already rather fast and has very little potential to further reduce the runtime.
Benchmarking
Introduction
Debugging
pdb
faulthandler
trace
tracehandler
tracemalloc
Profiling
profile
cProfile
Benchmarking
timeit
Introduction
Debugging
gdb
valgrind
memcheck
Profiling
C++ is a compiled programming language, so you need to compile your program in such a way that your profiler can tell what is happening, typically by including debug information (for example -g with GCC or Clang).
Assignments
Try the following assignments with the different profiling tools you will learn about today. The code examples work on Linux and should work on macOS. If you are using Windows, try WSL, a VM or the Linux remote machine of the university.
Compare busybox sort and GNU sort
You will need busybox sort on your machine for this. Install it with your preferred package manager or download it from busybox itself. Alternatively, implement your own sorting function (or several).
shuf -i 0-999999 -n 1000 > thousand
shuf -i 0-999999 -n 1000000 > million
Compare the runtime (> /dev/null hides the output; the interesting part is the user time):
time busybox sort -n million > /dev/null
time sort -n million > /dev/null
Compare the profiles:
<profiler> busybox sort -n thousand > /dev/null
<profiler> sort -n thousand > /dev/null
Compare your own implementation of algorithm functions with std
For example, try to write a min_element function for an array and compare it to std::min_element.
You can create a vector with random numbers with the following code:
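#include <algorithm>
#include <cstddef>
#include <limits>
#include <random>
#include <type_traits>
#include <vector>

// Returns `size` random floating-point values.
// numeric_limits<T>::min() is the smallest positive normal value, so only
// positive numbers are generated. The engine is default-seeded, so every run
// of the program produces the same sequence.
template <typename T>
static std::vector<T> generate_float_data(std::size_t size) {
    static_assert(std::is_floating_point<T>::value, "T must be a floating-point type");
    static std::default_random_engine generator;
    static std::uniform_real_distribution<T> distribution(
        std::numeric_limits<T>::min(),
        std::numeric_limits<T>::max());
    std::vector<T> data(size);
    // Static locals can be used inside the lambda without capturing them.
    std::generate(data.begin(), data.end(), []() { return distribution(generator); });
    return data;
}

// Returns `size` random integers covering the full range of T.
// std::uniform_int_distribution is not specified for char-sized types
// (typically int8_t and uint8_t), so use short or wider.
template <typename T>
static std::vector<T> generate_int_data(std::size_t size) {
    static_assert(std::is_integral<T>::value, "T must be an integer type");
    static std::default_random_engine generator;
    static std::uniform_int_distribution<T> distribution(
        std::numeric_limits<T>::min(),
        std::numeric_limits<T>::max());
    std::vector<T> data(size);
    std::generate(data.begin(), data.end(), []() { return distribution(generator); });
    return data;
}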
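As a starting point for the min_element assignment, one possible hand-written baseline could look like the sketch below. The names my_min_element and check_against_std are made up for this example and are not part of the course material. The sketch only checks correctness against std::min_element; measuring the two versions is left to the profiler of your choice.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical baseline: a plain linear scan that returns the index of the
// smallest element, or data.size() if the vector is empty.
template <typename T>
std::size_t my_min_element(const std::vector<T>& data) {
    if (data.empty()) {
        return data.size();
    }
    std::size_t best = 0;
    for (std::size_t i = 1; i < data.size(); ++i) {
        // Strict '<' keeps the first of several equal minima,
        // which matches the behavior of std::min_element.
        if (data[i] < data[best]) {
            best = i;
        }
    }
    return best;
}

// Aborts (in builds with assertions enabled) if the hand-written version
// disagrees with std::min_element on the given input.
template <typename T>
void check_against_std(const std::vector<T>& data) {
    const auto expected = std::min_element(data.begin(), data.end());
    assert(my_min_element(data) ==
           static_cast<std::size_t>(expected - data.begin()));
}
```

Calling check_against_std on the output of generate_int_data or generate_float_data before profiling makes sure that you compare two functions that actually compute the same result.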
Things you could try to implement:
Compare runtime of different data types
This one is mostly for fun. Try some of the functions from the previous assignment, but use different data types, for example:
- uint32_t vs uint64_t
- int16_t vs int_fast16_t
- float vs int64_t vs double
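A minimal sketch of how such a comparison could be set up is shown below. It assumes a simple in-place loop as the workload and std::chrono::steady_clock for timing; the names Measurement and measure_halving are invented for this example. A single wall-clock measurement is noisy, so treat the numbers as a rough hint and use a profiler or a benchmarking library for anything serious.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Result of one measurement (hypothetical helper type).
template <typename T>
struct Measurement {
    double milliseconds;  // wall-clock time of the measured loop
    T checksum;           // first element after the work, returned so the
                          // compiler cannot simply drop the loop
};

// Runs the same element-wise workload on a vector of T and times it.
template <typename T>
Measurement<T> measure_halving(std::size_t size, int repetitions) {
    assert(size > 0 && repetitions >= 0);
    std::vector<T> data(size, static_cast<T>(100));

    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repetitions; ++r) {
        for (T& x : data) {
            // Values stay between 2 and 100, so no type used here overflows.
            x = static_cast<T>(x / 2 + 1);
        }
    }
    const auto stop = std::chrono::steady_clock::now();

    const std::chrono::duration<double, std::milli> elapsed = stop - start;
    return {elapsed.count(), data.front()};
}
```

For example, comparing measure_halving<std::uint32_t>(1000000, 100).milliseconds with measure_halving<std::uint64_t>(1000000, 100).milliseconds (both types come from <cstdint>) gives a first impression of how the element size affects this loop on your machine.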
gprof
gprof is the profiler you will probably find on every Linux machine that has at least a minimal C++ development environment.
Please consult https://www.thegeekstuff.com/2012/08/gprof-tutorial/ to learn how to use gprof yourself. After that, work on the assignments.
Intel VTune
Intel VTune is a free profiler which also works on AMD hardware but has additional features on Intel hardware. It is able to profile multiple languages (C++, Java, Python, …) and hardware (CPU, GPU, FPGA, …) and works on Linux, macOS and Windows.
Intel itself offers a rather good introduction: https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-vtune/top.html
Coz
Coz is a bit different from other profilers. For optimal use you should compile with the debug flag (-g in most cases). Because it is not widely used, there is not a lot of information about Coz. Try the assignments with the following command:
coz run --- {PATH_TO_EXECUTABLE}
More information about Coz can be found on GitHub, especially in the white paper. It may also be worth taking a look at the provided benchmarks and the web service for plotting the profile.
perf
perf is a tool you will find on most Linux machines. Please take a look at the following guides and work on the assignments.
- https://dev.to/etcwilde/perf---perfect-profiling-of-cc-on-linux-of
- https://www.brendangregg.com/perf.html (advanced)
Valgrind
Valgrind is a set of tools to debug and profile programs. We will take a look at
- cachegrind, which counts how often the memory a program accesses is not in the cache (a cache miss), which slows programs down.
- callgrind + KCachegrind to profile the execution time of functions.
cachegrind
Cachegrind counts the number of cache misses and records where they occur. If you see a lot of cache misses, it may be time to start thinking about your memory access patterns.
You can find a short introduction to Cachegrind here.
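To see what a memory access pattern means in practice, consider the following sketch. It assumes a matrix stored row by row in one flat vector; the function names are made up for this example. Both functions compute the same sum, but they visit memory in a different order, which is the kind of difference Cachegrind is meant to make visible.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// The matrix is stored row by row: element (r, c) lives at index r * cols + c.

// Visits the elements in the order they are stored, so consecutive
// accesses are adjacent in memory.
long long sum_row_major(const std::vector<int>& matrix,
                        std::size_t rows, std::size_t cols) {
    assert(matrix.size() == rows * cols);
    long long sum = 0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            sum += matrix[r * cols + c];
        }
    }
    return sum;
}

// Visits the elements column by column, so consecutive accesses are
// `cols` elements apart. For matrices much larger than the cache this
// tends to cause far more cache misses.
long long sum_column_major(const std::vector<int>& matrix,
                           std::size_t rows, std::size_t cols) {
    assert(matrix.size() == rows * cols);
    long long sum = 0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            sum += matrix[r * cols + c];
        }
    }
    return sum;
}

// Possible way to compare them, with one program per variant:
//   valgrind --tool=cachegrind ./your_program
//   cg_annotate cachegrind.out.<pid>
```

Whether the difference shows up clearly depends on the matrix size and on compiler optimizations, which may reorder simple loops like these, so it is worth trying several sizes and optimization levels.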
Callgrind
Callgrind is a profiling tool from the Valgrind tool set. In its most basic usage it shows which functions are called, how often they are called and how much of the execution each of them accounts for, so you can see where you could improve your execution time.
Stanford University has a nice introduction to Callgrind: https://web.stanford.edu/class/archive/cs/cs107/cs107.1222/resources/callgrind
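A small program where the call counts tell most of the story can make a first Callgrind session easier to read. The sketch below is such a toy target, not part of the course material; the function names are invented for this example. Both functions compute the same Fibonacci number, but the recursive one is called a very large number of times.

```cpp
#include <cassert>
#include <cstdint>

// Naive recursion: the same subproblems are recomputed again and again,
// so the number of calls grows exponentially with n.
std::uint64_t fib_recursive(unsigned n) {
    return n < 2 ? n : fib_recursive(n - 1) + fib_recursive(n - 2);
}

// Iterative version: one call and n loop iterations.
std::uint64_t fib_iterative(unsigned n) {
    assert(n <= 93);  // fib(93) is the largest value that fits into 64 bits
    std::uint64_t previous = 0;
    std::uint64_t current = 1;
    for (unsigned i = 0; i < n; ++i) {
        const std::uint64_t next = previous + current;
        previous = current;
        current = next;
    }
    return previous;
}

// Possible way to inspect a program that calls both functions:
//   valgrind --tool=callgrind ./your_program
//   callgrind_annotate callgrind.out.<pid>
```

In the Callgrind output you would expect fib_recursive to appear with a very high call count, while fib_iterative appears once per invocation from your own code.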
For the visualization with KCachegrind take a look at the KDE KCachegrind Handbook and the following article (the part after “The real way”).
gperftools
gperftools is a collection of profiling tools with nice graphical output. Please read the following website and work on the assignments.
https://developer.ridgerun.com/wiki/index.php?title=Profiling_with_GPerfTools
Benchmarking
gperftools
Google Benchmark
Celero
Introduction
Debugging
Profiling
Profiling on CUDA is simple and complicated at the same time. While it is not easy to profile over thousands of threads, NVIDIA provides a set of tools which profile, evaluate and plot the result. For this we will take a look at Nsight Compute and Nsight Systems.
For the workflow we will assume that the GPU-powered machine is remote and has no graphical user interface. Both tools can run on the command line and write an output file, which you can then open on a different machine with a graphical user interface. This is a typical workflow for working on MOGON.
Nsight Compute
While the screenshots are from an old version, the example (matrix multiplication) and the tutorial in this article are a very good starting point for learning Nsight Compute:
https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels/
For more material, see this three-part tutorial:
- https://developer.nvidia.com/blog/analysis-driven-optimization-preparing-for-analysis-with-nvidia-nsight-compute-part-1/
- https://developer.nvidia.com/blog/analysis-driven-optimization-analyzing-and-improving-performance-with-nvidia-nsight-compute-part-2
- https://developer.nvidia.com/blog/analysis-driven-optimization-finishing-the-analysis-with-nvidia-nsight-compute-part-3
If you are more of a visual learner, the following article contains some demo videos about Nsight Compute:
https://developer.nvidia.com/blog/sc20-demos-new-nsight-systems-and-nsight-compute-demos/
Nsight Systems
Nsight Systems visualizes what your GPU and CPU are doing and when. It helps you understand where you can overlap execution and memory transfers, where you have overhead, and more.
A good article about using Nsight Systems can be found here
https://developer.nvidia.com/blog/sc20-demos-new-nsight-systems-and-nsight-compute-demos/
If you are more of a visual learner, the following article contains some demo videos about Nsight Systems:
https://developer.nvidia.com/blog/sc20-demos-new-nsight-systems-and-nsight-compute-demos/
Benchmarking