
Testing C++ code with pytest, through cppyy/PyROOT

One of the things that has been extremely helpful in developing the bamboo package, and especially some of its C++ components, is a collection of unit and regression tests. The value of automatic or easy-to-run tests is of course well known in software engineering, but they are not used as widely in high-energy physics research code as they could be: underlying frameworks and critical algorithms are often covered quite well, but the code used for the final few steps towards the results in a publication, and some of the code supporting this final analysis, often much less so, or not at all. This post is therefore another advertisement for pytest, and especially for how it can also be used to test C++ code through the automatic python bindings generated by cppyy.

For pytest in general I can only recommend having a look at its documentation: it is very easy to use, and most of the time setting up a test is really as simple as adding something like

def test_calculation_1():
    assert do_calculation(some_inputs) == expected_result

to a test_<something>.py file and running pytest. It is also easy to select just some tests to run, with pytest -k 'expression' or by passing the name of a test file, to add helper methods in the tests, to reuse an object for several tests, or to conditionally skip some tests when e.g. a dependency may be absent, to name just a few of the features that are used in the bamboo tests.
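As an illustration of the last point, the sketch below skips tests when a dependency is missing; pytest.importorskip and pytest.mark.skipif are standard pytest features, while the RDataFrame check is only an example condition made up for this sketch:

import pytest

# Skip the whole module if ROOT cannot be imported
# (e.g. when running in an environment without it)
ROOT = pytest.importorskip("ROOT")

# Skip an individual test based on a condition,
# e.g. a feature missing from an older version
@pytest.mark.skipif(not hasattr(ROOT, "RDataFrame"),
                    reason="this ROOT build does not provide RDataFrame")
def test_rdataframe_count():
    df = ROOT.RDataFrame(10)  # an empty dataframe with ten entries
    assert df.Count().GetValue() == 10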

Getting to the point of this post: when using a mix of python and C++ code, and when (Py)ROOT is already used somehow (otherwise it is a rather large dependency to add just for this), it takes very little extra effort to include tests for the C++ code in pytest. The key point is that cppyy, the underlying technology used by PyROOT, automatically generates python bindings for C++ code that is loaded in the cling interpreter, as described in this part of the PyROOT manual. As a first example, the following code will JIT-compile a C++ function and use it in a few tests:

import pytest
from math import isclose

@pytest.fixture(scope="session")
def my_square():
    import ROOT as gbl
    gbl.gInterpreter.Declare(
        "double my_square(double a) {\n"
        "  return a*a;\n"
        "}")
    yield gbl.my_square

def test_square_0(my_square):
    assert isclose(my_square(0.), 0.)

def test_square_1(my_square):
    assert isclose(my_square(1.), 1.)

def test_square_m3(my_square):
    assert isclose(my_square(-3.), 9.)

This is a little bit too simplistic, because we only tested the code that was defined inside the test file, but the principle remains the same.

C++ headers can be loaded (JIT-compiled) into PyROOT and cling with gbl.gROOT.ProcessLine('#include "myheader.h"'), and shared libraries with gbl.gSystem.Load('libname'). It may be useful to append include and linker paths, which can be done with gbl.gInterpreter.AddIncludePath and gbl.gSystem.AddDynamicPath, respectively (bamboo includes some helper functions to make this more convenient, and a loadDependency helper method that combines all of them).
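To give an idea of how these calls fit together, here is a minimal sketch of such a helper; the name and arguments below are made up for illustration and do not correspond to bamboo's actual loadDependency interface:

import ROOT as gbl

def load_cpp_dependency(headers=(), libraries=(), includePath=None, dynamicPath=None):
    # Append the search paths first, so that the libraries and headers below can be found
    if includePath:
        gbl.gInterpreter.AddIncludePath(includePath)
    if dynamicPath:
        gbl.gSystem.AddDynamicPath(dynamicPath)
    # Load the shared libraries before JIT-compiling the headers that use their symbols
    for lib in libraries:
        gbl.gSystem.Load(lib)
    for hdr in headers:
        gbl.gROOT.ProcessLine('#include "{}"'.format(hdr))

# e.g. load_cpp_dependency(headers=["MyClass.h"], libraries=["libMyClass"],
#                          includePath="include", dynamicPath="lib")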

It is worth stressing that, since PyROOT is based on cling, the exact same function calls make externally defined C++ code available for use in JITted C++ strings elsewhere in ROOT, e.g. in RDataFrame. It should also be possible to use standalone cppyy instead of PyROOT (and cppyy.include and cppyy.load_library instead of the ROOT methods), but I have no experience with that.
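For completeness, and with the caveat that I have not tried this myself, the standalone-cppyy version of the same loading steps might look like the (untested) sketch below, with MyClass.h and libMyClass as placeholders:

import cppyy

cppyy.add_include_path("include")      # similar to gInterpreter.AddIncludePath
cppyy.load_library("libMyClass")       # similar to gSystem.Load
cppyy.include("MyClass.h")             # similar to the #include through ProcessLine
obj = cppyy.gbl.MyClass("hello", 42)   # bindings are generated on first use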

How to compile the shared library is beyond the scope of this post, but the main constraints are that the same C++ ABI must be used as by cling, and the symbols that are used must be visible (that is the default setting on GNU/Linux, but may need extra care for cross-platform packages).

A more typical example set of unit tests for a C++ class may then look like this:

import pytest
from math import isclose

@pytest.fixture(scope="session")
def loadMyClass():
    import ROOT as gbl
    gbl.gSystem.Load("lib/libMyClass.so")
    gbl.gInterpreter.ProcessLine('#include "include/MyClass.h"')
    yield

@pytest.fixture(scope="module")
def myclass_hello42(loadMyClass):
    import ROOT as gbl
    yield gbl.MyClass("hello", 42)

def test_myclass_hello42_get(myclass_hello42):
    assert myclass_hello42.getMagic() == 42

def test_myclass_hello42_msg(myclass_hello42):
    assert str(myclass_hello42.toString()) == "Hello, the magic number is 42"

Fixtures in pytest make it possible to implement more complicated tests efficiently. Some more examples can be found in the bamboo tests. An interesting case is the set of tests that make sure that the included C++ implementation of the recipes to calculate variations of the measured jet and MET momenta (reconstructed quantities in LHC collisions; the variations are used to estimate the effect of imprecise knowledge of some corrections on the final results) gives the same results as the reference implementation. This is done by including a file with all the inputs and outputs of the reference implementation, and then comparing, inside the tests, the calculated results with the reference results. All the code for this can be found in tests/test_jmesystcalc.py; these calculators, and their tests, are also available as a standalone package, pieterdavid/CMSJMECalculators.
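To sketch what such a regression test could look like, with a trivial JIT-compiled function standing in for the real calculator and the reference values written inline rather than read from a file, pytest's parametrize decorator turns the comparison with each reference point into a separate test:

import pytest
from math import isclose

@pytest.fixture(scope="module")
def calculator():
    # Stand-in for loading the real calculator library and constructing it
    import ROOT as gbl
    gbl.gInterpreter.Declare(
        "double toy_calc(double a) { return a*a; }")
    yield gbl.toy_calc

# In a real regression test these pairs would be read from a file
# produced by the reference implementation
REFERENCE = [(0.5, 0.25), (-2., 4.), (3., 9.)]

@pytest.mark.parametrize("x,expected", REFERENCE)
def test_against_reference(calculator, x, expected):
    assert isclose(calculator(x), expected, rel_tol=1.e-6)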

A few remarks about drawing histograms

For all the time we spend in (collider) high-energy physics on filling and drawing histograms, it is a bit surprising that the way histograms are drawn by default in ROOT, and some of its peculiarities, are rarely discussed. The main point I want to make is that, for variable-width binnings, the default is less obvious and natural than is often assumed. Since this comes up in discussions with colleagues from time to time, I decided to make a few plots and write up why I keep pointing it out.

Bin area or height?

For a definition of a histogram, we can use the Wikipedia page:

Purpose: To roughly assess the probability distribution of a given variable by depicting the frequencies of observations occurring in certain ranges of values

and, a bit below

However, bins need not be of equal width; in that case, the erected rectangle is defined to have its area proportional to the frequency of cases in the bin. The vertical axis is then not the frequency but frequency density—the number of cases per unit of the variable on the horizontal axis.

Except that this is not what ROOT does: it always makes the bin height proportional to the bin content, also for variable-width binnings. How that causes confusion is what the first half of this post is about.

A very simple illustration is the display of a uniform distribution with variable bin sizes:

In [1]:
import numpy as np
import ROOT as gbl
h_uni = gbl.TH1F("h1_uni", "A histogram with uniform data",
                 10, 0., 1.)
vBins = np.array([0., 0.1, 0.2, 0.3, 0.5, 0.7, 1.])
h_var = gbl.TH1F("h1_var", "Also a histogram with uniform data?",
                 vBins.shape[0]-1, vBins)
for x in np.random.uniform(size=10000):
    h_uni.Fill(x)
    h_var.Fill(x)
h_var.SetLineColor(gbl.kRed)
c1 = gbl.TCanvas("c1")
h_var.Draw()
h_var.GetYaxis().SetRangeUser(0., 4000.)
h_uni.Draw("SAME")
c1.Draw()

One might claim that, with variable-sized bins, one would not use the histogram draw option, because it does not show the size of the bins, but even with the more conventional points-with-error-bars option, the basic problem remains: the histogram rises, while the actual distribution is flat.

In [2]:
c2 = gbl.TCanvas("c2")
h_var.Draw("PE")
h_var.GetYaxis().SetRangeUser(0., 4000.)
h_uni.Draw("PE,SAME")
c2.Draw()

In principle, the solution is to make the bin area proportional to the count, so the height should be the count divided by the bin width.

In [3]:
h_var_w = h_var.Clone("h1_var_w")
h_uni_w = h_uni.Clone("h1_uni_w")
for h in (h_var_w, h_uni_w):
    h.Sumw2(True)
    for i in range(1, h.GetNbinsX()+1):
        h.SetBinContent(i, h.GetBinContent(i)/h.GetBinWidth(i))
        h.SetBinError(i, h.GetBinError(i)/h.GetBinWidth(i))
c3 = gbl.TCanvas("c3")
h_var_w.Draw("PE")
h_var_w.GetYaxis().SetRangeUser(0., 12500.)
h_uni_w.Draw("PE,SAME")
c3.Draw()

Not very convenient in ROOT, but possible. In my (not very polished) mplbplot package for drawing ROOT histograms with matplotlib, I made this an option to the drawing commands (volume; False, as in ROOT, is the default).

In [4]:
from matplotlib import pyplot as plt
import mplbplot.decorateAxes
from mplbplot.plothelpers import histHandlerMap
fig,ax = plt.subplots(1,2, figsize=(12,5))
ax[0].rhist(h_uni, histtype="step", color="b", label="Uniform")
ax[0].rhist(h_var, histtype="step", color="r", label="Variable")
ax[0].set_xlim(0., 1.)
ax[0].legend(handler_map=histHandlerMap, loc="upper left")
ax[1].rhist(h_uni, histtype="step", color="b", label="Uniform",
            volume=True)
ax[1].rhist(h_var, histtype="step", color="r", label="Variable",
            volume=True)
ax[1].set_xlim(0., 1.)
ax[1].legend(handler_map=histHandlerMap, loc="lower right")

Also note that, when dividing by the bin width, the integral of the histogram (taking into account the units on the x axis) matches the actual integral of the distribution. This is then true independently of the binning choice, which makes it easier to estimate how many counts there are in a part of the distribution (by multiplying the height with the x range, instead of with the number of bins).
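A quick numerical check with the histograms defined above (a sketch; the exact totals depend on the generated random numbers):

# Summing height times bin width for the width-divided histograms
# recovers the total number of entries, for both binning choices
for h in (h_uni_w, h_var_w):
    total = sum(h.GetBinContent(i) * h.GetBinWidth(i)
                for i in range(1, h.GetNbinsX() + 1))
    print(h.GetName(), total)  # close to 10000 in both cases
    # equivalently: h.Integral("width")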

Rebinning or transforming the x axis variable?

A consequence of the above is that from time to time, rebinning is used to get distributions into a specific shape, e.g. to obtain a flat distribution. In my opinion this only adds to the confusion above: the shape of a distribution should not depend on the binning choice. It is perfectly valid to use a binning with an equal count per bin, but that should not cause it to be drawn as a flat distribution.

The alternative is to use a transformation of the x axis variable, which changes the shape if the transformation is nonlinear, due to the Jacobian factor.

$$ \int_\text{bin} f(x) dx = \int_\text{bin} f(x(y)) \left| \frac{dx}{dy} \right| dy $$

(for the same bin, so a binning that is uniform in x will in general not be uniform in y if $\frac{dx}{dy}$ is not constant). A well-known example is the distribution of the distance of a cloud of points from a reference point in a plane: if the distribution is homogeneous in the plane, the distance distribution will rise linearly due to the Jacobian, whereas the distribution of the squared distance will be flat. If the distribution in the plane peaks around the reference point and falls with distance, the distance distribution will have a peak, due to the combination of the falling distribution and the rising Jacobian, as illustrated below.

In [5]:
h_uni_d = gbl.TH1F("h_uni_d", "Distance distribution", 20, 0., 3.)
h_uni_d2 = gbl.TH1F("h_uni_d2", "Squared distance distribution",
                    20, 0., 9.)
for x,y in zip(np.random.uniform(-3., 3., size=10000),
               np.random.uniform(-3., 3., size=10000)):
    d2 = x**2+y**2
    h_uni_d.Fill(np.sqrt(d2))
    h_uni_d2.Fill(d2)
h_norm_d = gbl.TH1F("h_norm_d", "Distance distribution", 20, 0., 3.)
h_norm_d2 = gbl.TH1F("h_norm_d2", "Squared distance distribution",
                     20, 0., 9.)
for x,y in zip(np.random.normal(0., 1., size=10000),
               np.random.normal(0., 1., size=10000)):
    d2 = x**2+y**2
    h_norm_d.Fill(np.sqrt(d2))
    h_norm_d2.Fill(d2)
In [6]:
fig,ax = plt.subplots(1,2, figsize=(12,5))
ax[0].rhist(h_uni_d, histtype="step", color="b",
            label="Uniform", volume=True)
ax[0].rhist(h_norm_d, histtype="step", color="r",
            label="Bivariate normal", volume=True)
ax[0].set_xlabel("Distance")
ax[0].set_xlim(0., 3.)
ax[0].legend(handler_map=histHandlerMap)
ax[1].rhist(h_uni_d2, histtype="step", color="b",
            label="Uniform", volume=True)
ax[1].rhist(h_norm_d2, histtype="step", color="r",
            label="Bivariate normal", volume=True)
ax[1].set_xlabel("Distance squared")
ax[1].set_xlim(0., 9.)
ax[1].legend(handler_map=histHandlerMap)

Since this is well-defined and does not cause any "problems" with non-uniform binnings in ROOT, I would advocate the use of a transformation to get a distribution in the desired shape, especially when it is not a measured quantity, but a summary variable or the output of a multivariate classifier (e.g. neural network, BDT).
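A common way to do this for a classifier output, sketched below with made-up names and a beta-distributed toy score (so not taken from the code above), is to transform the scores with the empirical CDF of a reference sample, which makes that sample flat in uniform bins by construction:

import numpy as np

# Toy classifier scores for a reference sample (e.g. simulated signal)
scores_ref = np.random.beta(5., 2., size=10000)
sorted_ref = np.sort(scores_ref)

def flatten(scores):
    # Empirical-CDF transform estimated from the reference sample:
    # the transformed reference scores are close to uniform on [0, 1]
    return np.searchsorted(sorted_ref, scores) / sorted_ref.shape[0]

# The same transformation is then applied to every other sample, and the
# transformed variable can be histogrammed with ordinary uniform bins
flat_ref = flatten(scores_ref)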

Searching and browsing CMSSW: use git

CMSSW, the software that the CMS experiment uses to process collision data, from the detector readout to a format suitable for analysis, is contained in one big git repository, cms-sw/cmssw.git. Since documentation is not always kept up to date, or missing some details, it occurs quite often that the only way to figure out how some numbers came about is to check the code itself—this is even more the case when doing development.

To save time during development, git's sparse-checkout feature is used heavily, such that only the minimally required (modified, and dependent) packages are checked out and built. The other targets (mostly shared libraries, executables, and python modules) are taken from the corresponding release area, which is distributed through the CernVM-FS (or cvmfs) filesystem. Since the git objects are also distributed in this way (to all CMS colleagues who have not already done so: set CMSSW_GIT_REFERENCE=/cvmfs/cms.cern.ch/cmssw.git.daily when working on a machine with cvmfs), the cost of setting up a working area is very small (it typically takes a few seconds to run cmsrel).

A lesser-known feature of such a setup is that it also allows very fast (and flexible) searching of the full repository, or parts of it, and of the complete git history, with git grep and git show. The rest of this post describes my favourite tricks and aliases.

git grep and ack

There are many online introductions to using git grep; see e.g. the "Search a git repo like a ninja" post by Travis Jeffery, which also introduces a git ack alias to format the results like ack (if you do not know ack yet: it is like grep, but with better defaults, switches to search only some file types, etc.; really handy, and there is a one-file install option). I would recommend the following customization (in ~/.gitconfig):

[alias]
  ack = grep --break --heading --line-number --color=auto
[grep]
  extendRegexp = true
  lineNumber = true
  color = auto

The short version: run git grep from the root of the repository ($CMSSW_BASE/src) with one of

git [grep or ack] pattern <tree> -- <path>

where both the <tree> part (to specify a revision, tag, or branch) and the -- <path> part (to restrict the search to a part of the repository) are optional.

As a final example, the following command searches for the definition of the jetId variable in a specific release:

git ack jetId CMSSW_10_2_9 -- PhysicsTools/NanoAOD

git show with vim as a pager, and syntax highlighting

Nothing too special by itself, but a useful helper command to know about: git show revision:path can be used as the equivalent of cat or less (or your favourite plain-text pager) for any revision of any file (also when it is not present in the working directory).

Of course cat and less are not the most convenient tools for browsing large source files. vim can be used as a pager (with the less.vim script). Syntax highlighting does not work out of the box, however, because ftplugin mostly relies on the filename extension to detect the filetype. But when calling git show we do have a filename, so we can use a trick: first ask ftplugin which filetype it would resolve, and then pass that when opening vim. This is what the filetypeless.sh script does (it should be installed in ~/.vim/macros/); the gist of it is

ftcmd=$(vim -es --cmd "filetype on" "${1}" -c ':exec ":set filetype?" | exec ":q!"')
vim -R --cmd 'let no_plugin_maps = 1' -c "set ${ftcmd}" -c 'runtime! macros/less.vim'

The following git alias (using the trick from this recipe) will run the script on the output of git show

[alias]
  view = "!v() { git show $1 | ~/.vim/macros/filetypeless.sh $1; }; v"

and that's all: git view HEAD:some_file.cpp now shows the file in vim, with (in this case C++) syntax highlighting.

Welcome to my blog

Hello,

I am a particle physicist, currently working on the CMS experiment at the CERN LHC (previously LHCb). Most of my work is about data analysis, and detector and reconstruction software. Physics results go into papers, but the more technical bits of the job (tools, techniques, computing tricks) often do not, so I made this blog to share some of those (and maybe some other things that come up) with colleagues near and far, and with anyone who finds them useful or interesting.

Happy reading!