A few remarks about drawing histograms

2020-12-18 14:13

For all the time we spend in (collider) high-energy physics on filling and drawing histograms, it is a bit surprising that the way histograms are drawn by default in ROOT, and some of its peculiarities, are rarely discussed. The main point I want to make is that in the case of variable-sized binnings that is less obvious and natural than is often assumed, and since this comes up in discussions with colleagues from time to time, I decided to make a few plots, and write up why I keep pointing this out.

Bin area or height?¶

For a definition of a histogram, we can use the Wikipedia page:

Purpose: To roughly assess the probability distribution of a given variable by depicting the frequencies of observations occurring in certain ranges of values

and, a bit below

However, bins need not be of equal width; in that case, the erected rectangle is defined to have its area proportional to the frequency of cases in the bin. The vertical axis is then not the frequency but frequency density—the number of cases per unit of the variable on the horizontal axis.

Except, this is not what is done in ROOT: it always makes the bin height proportional to its contents, also in variable-width binning. How that causes confusion is what half of this page is about.

A very simple illustration is the display of a uniform distribution with variable bin size

In [1]:

import numpy as np
import ROOT as gbl
h_uni = gbl.TH1F("h1_uni", "A histogram with uniform data",
                 10, 0., 1.)
vBins = np.array([0., 0.1, 0.2, 0.3, 0.5, 0.7, 1.])
h_var = gbl.TH1F("h1_var", "Also a histogram with uniform data?",
                 vBins.shape[0]-1, vBins)
for x in np.random.uniform(size=10000):
    h_uni.Fill(x)
    h_var.Fill(x)
h_var.SetLineColor(gbl.kRed)
c1 = gbl.TCanvas("c1")
h_var.Draw()
h_var.GetYaxis().SetRangeUser(0., 4000.)
h_uni.Draw("SAME")
c1.Draw()

Welcome to JupyROOT 6.22/06

No description has been provided for this image

One might claim that, with variable-sized bins, one would not use the histogram option, because it does not show the size of the bins, but even with the more conventional option, the basic problem remains that the histogram rises, while the actual distribution is flat.

In [2]:

c2 = gbl.TCanvas("c2")
h_var.Draw("PE")
h_var.GetYaxis().SetRangeUser(0., 4000.)
h_uni.Draw("PE,SAME")
c2.Draw()

In principle, the solution is to make the bin area proportional to the count, so the height should be the count divided by the bin width.

In [3]:

h_var_w = h_var.Clone("h1_var_w")
h_uni_w = h_uni.Clone("h1_uni_w")
for h in (h_var_w, h_uni_w):
    h.Sumw2(True)
    for i in range(1, h.GetNbinsX()+1):
        h.SetBinContent(i, h.GetBinContent(i)/h.GetBinWidth(i))
        h.SetBinError(i, h.GetBinError(i)/h.GetBinWidth(i))
c3 = gbl.TCanvas("c3")
h_var_w.Draw("PE")
h_var_w.GetYaxis().SetRangeUser(0., 12500.)
h_uni_w.Draw("PE,SAME")
c3.Draw()

Not very convenient in ROOT, but possible. In my (not very polished) mplbplot package for drawing ROOT histograms with matplotlib, I made this an option to the drawing commands (volume; False, as in ROOT, is the default).

In [4]:

from matplotlib import pyplot as plt
import mplbplot.decorateAxes
from mplbplot.plothelpers import histHandlerMap
fig,ax = plt.subplots(1,2, figsize=(12,5))
ax[0].rhist(h_uni, histtype="step", color="b", label="Uniform")
ax[0].rhist(h_var, histtype="step", color="r", label="Variable")
ax[0].set_xlim(0., 1.)
ax[0].legend(handler_map=histHandlerMap, loc="upper left")
ax[1].rhist(h_uni, histtype="step", color="b", label="Uniform",
            volume=True)
ax[1].rhist(h_var, histtype="step", color="r", label="Variable",
            volume=True)
ax[1].set_xlim(0., 1.)
ax[1].legend(handler_map=histHandlerMap, loc="lower right")

Out[4]:

<matplotlib.legend.Legend at 0x7f62ebb92310>

Also note that, when dividing by the bin width, the integral of the histogram (taking into account the units on the x axis) matches the actual integral of the histogram. This is true independently of the binning choice then, which makes it easier to estimate how many counts there are in a part of the distribution (by multiply the x range, instead of the number of bins, with the height).

Rebinning or transforming the x axis variable?¶

A consequence of the above is that from time to time, rebinning is used to get distributions into a specific shape, e.g. to obtain a flat distribution. In my opinion this only adds to the confusion above: the shape of a distribution should not depend on the binning choice. It is perfectly valid to use a binning with an equal count per bin, but that should not cause it to be drawn as a flat distribution.

The alternative is to use a transformation of the x axis variable, which changes the shape if the transformation is nonlinear, due to the Jacobian factor.

$$ \int_\text{bin} f(x) dx = \int_\text{bin} f(x(y)) \left| \frac{dx}{dy} \right| dy $$

(for the same bin, so a binning that is uniform in x will not be non-uniform in y if $\frac{dx}{dy}$ is not constant). A well-known example is the distribution of the distance of a cloud of points from a reference point in a plane: if the distribution is homogeneous in the plane, the distance distribution will rise linearly due to the Jacobian, whereas the distribution of the distance will be flat. If the distribution in the plane is peaks around the reference point and falls with distance, the distance distribution will have a peak due to the combination of the falling distribution and rising Jacobian, as illustrated below

In [5]:

h_uni_d = gbl.TH1F("h_uni_d", "Distance distribution", 20, 0., 3.)
h_uni_d2 = gbl.TH1F("h_uni_d2", "Squared distance distribution",
                    20, 0., 9.)
for x,y in zip(np.random.uniform(-3., 3., size=10000),
               np.random.uniform(-3., 3., size=10000)):
    d2 = x**2+y**2
    h_uni_d.Fill(np.sqrt(d2))
    h_uni_d2.Fill(d2)
h_norm_d = gbl.TH1F("h_norm_d", "Distance distribution", 20, 0., 3.)
h_norm_d2 = gbl.TH1F("h_norm_d2", "Squared distance distribution",
                     20, 0., 9.)
for x,y in zip(np.random.normal(0., 1., size=10000),
               np.random.normal(0., 1., size=10000)):
    d2 = x**2+y**2
    h_norm_d.Fill(np.sqrt(d2))
    h_norm_d2.Fill(d2)

In [6]:

fig,ax = plt.subplots(1,2, figsize=(12,5))
ax[0].rhist(h_uni_d, histtype="step", color="b",
            label="Uniform", volume=True)
ax[0].rhist(h_norm_d, histtype="step", color="r",
            label="Bivariate normal", volume=True)
ax[0].set_xlabel("Distance")
ax[0].set_xlim(0., 3.)
ax[0].legend(handler_map=histHandlerMap)
ax[1].rhist(h_uni_d2, histtype="step", color="b",
            label="Uniform", volume=True)
ax[1].rhist(h_norm_d2, histtype="step", color="r",
            label="Bivariate normal", volume=True)
ax[1].set_xlabel("Distance squared")
ax[1].set_xlim(0., 9.)
ax[1].legend(handler_map=histHandlerMap)

Out[6]:

<matplotlib.legend.Legend at 0x7f633c250730>

Since this is well-defined and does not cause any "problems" with non-uniform binnings in ROOT, I would advocate the use of a transformation to get a distribution in the desired shape, especially when it is not a measured quantity, but a summary variable or the output of a multivariate classifier (e.g. neural network, BDT).

Pieter David's blog

Bin area or height?¶

Rebinning or transforming the x axis variable?¶

Comments