Introduction

Being able to debug and profile your own programs is a crucial programming skill. This book looks at different ways to debug, profile, and benchmark your programs in different languages.

License

Debugging and Profiling by Markus Vieth is marked with CC0 1.0

Debugging

Testing and test driven development

Profiling

Profiling is a form of dynamic program analysis that measures the behavior of your program while it runs (for example the execution time of parts of your program or the memory used). We will concentrate on using profiling to find the parts of a program whose optimization could reduce the runtime. For this we will use a so-called profiler, which executes the program with debugging information to collect the data we need.

More information on how a profiler can collect such data is given in the "Program analysis" lecture by Prof. Erdweg.

In this course you will see how to analyze your program and find the parts with a long runtime, which have the potential to be optimized and to affect the runtime of the whole program. Without profiling you will probably optimize a part of your program that was already rather fast and has very little potential to further reduce the runtime.
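
The point about choosing what to optimize can be made concrete with a small calculation. The following sketch is only an illustration and not part of the course material: it applies Amdahl's law, and the function name overall_speedup as well as the example numbers are assumptions chosen for this example.

```cpp
#include <cassert>
#include <cmath>

// Overall speedup of a program when a part that accounts for `fraction`
// (between 0 and 1) of the original runtime is made `local_speedup` times
// faster (Amdahl's law). The name is hypothetical.
double overall_speedup(double fraction, double local_speedup) {
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup);
}
```

With these assumed numbers, making a part that accounts for 5% of the runtime ten times faster (overall_speedup(0.05, 10.0)) gives an overall speedup of only about 1.05, while making a part that accounts for 80% of the runtime twice as fast (overall_speedup(0.8, 2.0)) gives about 1.67. A profiler tells you which of the two situations you are in.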

Benchmarking

Introduction

Debugging

pdb

faulthandler

trace

traceback

tracemalloc

Profiling

profile

cProfile

Benchmarking

timeit

Introduction

Debugging

gdb

valgrind

memcheck

Profiling

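C++ is a compiled programming language, so you need to compile your program in a way that lets the profiler know what is happening. In practice this usually means including debugging information, for example with the -g flag of GCC and Clang.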

Assignments

You should try the following assignments with the different profiling tools you will learn today. The code examples work on Linux and should work on macOS. If you are using Windows, try WSL, a VM, or the university's Linux remote machine.

Compare busybox sort and GNU sort

You will need busybox sort on your machine for this. Install it with your preferred package manager or download it from the BusyBox project itself. Alternatively, implement your own sorting function (or several).

shuf -i 0-999999 -n 1000 > thousand
shuf -i 0-999999 -n 1000000 > million

Compare the runtime. The > /dev/null hides the output; the interesting part is the user time.

time busybox sort -n million > /dev/null
time sort -n million > /dev/null

Compare the profiles

<profiler> busybox sort -n thousand > /dev/null
<profiler> sort -n thousand > /dev/null

Compare your own implementation of algorithm functions with std

For example, try to write your own min_element function for an array and compare it to std::min_element.
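
As a starting point, a hand-written version could look like the following minimal sketch. The names my_min_element and agrees_with_std are hypothetical and only used for illustration; the sketch returns an index rather than an iterator to keep it short, which is a design choice and not a requirement of the assignment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical hand-written counterpart to std::min_element.
// Returns the index of the first smallest element, or data.size() if the
// vector is empty.
template <typename T>
std::size_t my_min_element(const std::vector<T>& data) {
    if (data.empty()) return data.size();
    std::size_t best = 0;
    for (std::size_t i = 1; i < data.size(); ++i) {
        if (data[i] < data[best]) best = i;  // strict '<' keeps the first minimum
    }
    return best;
}

// Checks that the hand-written version and std::min_element find the same
// position. Useful as a sanity check before comparing their runtimes.
template <typename T>
bool agrees_with_std(const std::vector<T>& data) {
    const auto it = std::min_element(data.begin(), data.end());
    return my_min_element(data) == static_cast<std::size_t>(it - data.begin());
}
```

Checking that both versions agree before profiling them avoids comparing the runtime of a fast but wrong implementation against a correct one.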

You can create a vector with random numbers with the following code:

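#include <algorithm>
#include <cstddef>
#include <limits>
#include <random>
#include <type_traits>
#include <vector>

// Fills a vector with uniformly distributed floating-point values.
// Note: numeric_limits<T>::min() is the smallest positive normalized value,
// so all generated numbers are positive.
template <typename T>
static std::vector<T> generate_float_data(std::size_t size) {
    static_assert(std::is_floating_point<T>::value, "T must be a floating-point type");
    static std::default_random_engine generator;  // default seed: same data on every run
    std::vector<T> data(size);

    static std::uniform_real_distribution<T> distribution(
        std::numeric_limits<T>::min(),
        std::numeric_limits<T>::max());
    std::generate(data.begin(), data.end(), []() { return distribution(generator); });

    return data;
}

// Fills a vector with uniformly distributed integers over the full range of T.
// Note: std::uniform_int_distribution only supports short, int, long, long long
// and their unsigned variants, so 8-bit types such as int8_t are not allowed.
template <typename T>
static std::vector<T> generate_int_data(std::size_t size) {
    static_assert(std::is_integral<T>::value, "T must be an integer type");
    static std::default_random_engine generator;  // default seed: same data on every run
    std::vector<T> data(size);

    static std::uniform_int_distribution<T> distribution(
        std::numeric_limits<T>::min(),
        std::numeric_limits<T>::max());
    std::generate(data.begin(), data.end(), []() { return distribution(generator); });

    return data;
}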

Things you could try to implement:

Compare runtime of different data types

This one is mostly for fun. Try some of the functions from the previous assignment, but with different data types, such as

  • uint32_t vs uint64_t
  • int16_t vs int_fast16_t
  • float vs int64_t vs double
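
One way to run the same code with several data types is to make the type a template parameter. The following is a rough sketch under stated assumptions: the helper names time_ms and timed_sum are hypothetical, the workload (summing a vector of ones) is only a placeholder for the functions from the previous assignment, and a single run with std::chrono is far less reliable than a profiler or a benchmarking library.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <vector>

// Runs `work` once and returns the elapsed wall-clock time in milliseconds.
template <typename F>
double time_ms(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Sums `count` ones stored as type T, writes the elapsed time of the
// summation to `elapsed_ms`, and returns the sum so the work has a visible
// result.
template <typename T>
T timed_sum(std::size_t count, double& elapsed_ms) {
    const std::vector<T> data(count, T{1});
    T total{};
    elapsed_ms = time_ms([&] {
        total = std::accumulate(data.begin(), data.end(), T{});
    });
    return total;
}
```

Calling timed_sum<std::uint32_t>(n, ms) and timed_sum<std::uint64_t>(n, ms) with the same n then gives two timings to compare. For meaningful numbers, use large inputs, repeat the measurement several times, and compile with the same optimization level for every type.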

gprof

gprof is the profiler you will probably find on every Linux machine that has at least a minimal C++ development environment.

Please consult https://www.thegeekstuff.com/2012/08/gprof-tutorial/ to learn how to use gprof yourself. After that, work on the assignments.

Intel VTune

Intel VTune is a free profiler which also works on AMD hardware but has additional features on Intel hardware. It can profile multiple languages (C++, Java, Python, …) and hardware (CPU, GPU, FPGA, …) and works on Linux, macOS and Windows.

Intel itself offers a rather good introduction: https://www.intel.com/content/www/us/en/develop/documentation/get-started-with-vtune/top.html

Coz

Coz is a causal profiler and works a bit differently from other profilers. For best results, compile with debug information (the -g flag for most compilers). Because Coz is not widely used, there is not much information about it. Try the assignments with the following command:

coz run --- {PATH_TO_EXECUTABLE}

More information about Coz can be found on GitHub, especially in the white paper. It may also be worth looking at the provided benchmarks and at the web service for plotting the profile.

perf

perf is a tool you will find on most Linux machines. Please take a look at the following guides and work on the assignments.

  • https://dev.to/etcwilde/perf---perfect-profiling-of-cc-on-linux-of
  • https://www.brendangregg.com/perf.html (advanced)

Valgrind

Valgrind is a set of tools to debug and profile programs. We will take a look at

  • Cachegrind, which counts how often the requested memory is not in the cache (a cache miss), something that slows programs down.
  • Callgrind + KCachegrind to profile the execution time of functions.

Cachegrind

Cachegrind counts the number of cache misses and records where they occur. If you see a lot of cache misses, it may be time to start thinking about your memory access patterns.

You can find a short introduction to Cachegrind here.
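
To illustrate the kind of memory access pattern that Cachegrind makes visible, the following sketch sums the same matrix in two different orders. It is an assumed example, not taken from the course: the Matrix type and both function names are hypothetical, and the actual number of cache misses depends on the matrix size and the hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A square matrix stored row by row in one contiguous vector:
// element (row, col) lives at values[row * n + col].
struct Matrix {
    std::size_t n;
    std::vector<int> values;
};

// Visits the elements in storage order, so consecutive accesses touch
// neighbouring addresses.
long long sum_row_major(const Matrix& m) {
    long long total = 0;
    for (std::size_t row = 0; row < m.n; ++row)
        for (std::size_t col = 0; col < m.n; ++col)
            total += m.values[row * m.n + col];
    return total;
}

// Visits the elements column by column, so consecutive accesses are
// n elements apart. For large n this tends to cause many more cache misses.
long long sum_column_major(const Matrix& m) {
    long long total = 0;
    for (std::size_t col = 0; col < m.n; ++col)
        for (std::size_t row = 0; row < m.n; ++row)
            total += m.values[row * m.n + col];
    return total;
}
```

Both functions return the same sum; only the order of the memory accesses differs. A program calling them on a large matrix can be run with valgrind --tool=cachegrind followed by the executable, and the resulting output file can be inspected with cg_annotate. Depending on the Valgrind version, the cache simulation may have to be enabled explicitly with --cache-sim=yes.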

Callgrind

Callgrind is a profiling tool from the Valgrind tool set. Its most basic use is to see when, how often, and for how long a function is called, which shows where you could improve your execution time.

Stanford University has a nice introduction to Callgrind: https://web.stanford.edu/class/archive/cs/cs107/cs107.1222/resources/callgrind

For the visualization with KCachegrind take a look at the KDE KCachegrind Handbook and the following article (the part after “The real way”).
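
A small program for trying out Callgrind could be built around functions whose call counts differ drastically. The sketch below is an assumption for illustration only: the global call_count merely mimics the call count that a profiler would report, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Stands in for the number of calls a profiler would report for naive_fib.
std::uint64_t call_count = 0;

// Naive recursion: recomputes the same subproblems over and over, so the
// number of calls grows exponentially with n.
std::uint64_t naive_fib(unsigned n) {
    ++call_count;
    if (n < 2) return n;
    return naive_fib(n - 1) + naive_fib(n - 2);
}

// Iterative version: a single call with a loop of n steps.
std::uint64_t iterative_fib(unsigned n) {
    std::uint64_t previous = 0;
    std::uint64_t current = 1;
    for (unsigned i = 0; i < n; ++i) {
        const std::uint64_t next = previous + current;
        previous = current;
        current = next;
    }
    return previous;
}
```

Starting from a counter of zero, naive_fib(20) is entered 21891 times to compute 6765, while iterative_fib(20) produces the same value with one call. Running such a program with valgrind --tool=callgrind and opening the resulting callgrind.out file with callgrind_annotate or KCachegrind should show this difference in the call counts.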

gperftools

gperftools is a collection of profiling tools with nice graphical output. Please read the following website and work on the assignments.

https://developer.ridgerun.com/wiki/index.php?title=Profiling_with_GPerfTools

Benchmarking

gperftools

Google Benchmark

Celero

Introduction

Debugging

Profiling

Profiling on CUDA is simple and complicated at the same time. While it is not easy to profile over thousands of threads, NVIDIA provides a set of tools which profile, evaluate and plot the result. For this we will take a look at Nsight Compute and Nsight Systems.

For the workflow we will assume that the GPU-powered machine is remote and has no graphical user interface. Both tools can be run on the command line, write their results to a file, and open this file on a different machine with a graphical user interface. This is a typical workflow when working on MOGON.

Nsight Compute

While the screenshots are from an old version, the example (matrix multiplication) and the tutorial in this article are a very good starting point for learning Nsight Compute:

https://developer.nvidia.com/blog/using-nsight-compute-to-inspect-your-kernels/

For more material, see this three-part tutorial:

  1. https://developer.nvidia.com/blog/analysis-driven-optimization-preparing-for-analysis-with-nvidia-nsight-compute-part-1/
  2. https://developer.nvidia.com/blog/analysis-driven-optimization-analyzing-and-improving-performance-with-nvidia-nsight-compute-part-2
  3. https://developer.nvidia.com/blog/analysis-driven-optimization-finishing-the-analysis-with-nvidia-nsight-compute-part-3

If you prefer visual material, the following article contains some demo videos about Nsight Compute:

https://developer.nvidia.com/blog/sc20-demos-new-nsight-systems-and-nsight-compute-demos/

Nsight Systems

Nsight Systems visualizes what your GPU and CPU are doing and when. It helps you understand where you can overlap execution and memory transfers, where you have overhead, and more.

A good article about using Nsight Systems can be found here

https://developer.nvidia.com/blog/sc20-demos-new-nsight-systems-and-nsight-compute-demos/

If you prefer visual material, the following article contains some demo videos about Nsight Systems:

https://developer.nvidia.com/blog/sc20-demos-new-nsight-systems-and-nsight-compute-demos/

Benchmarking
