DataSets and Training
Working with machine learning can be broadly divided into three phases: training, testing, and application. This page covers training with pybdt, starting with the DataSet objects that hold the training data.
DataSet objects
pybdt uses pybdt.ml.DataSet objects to access your data.
Generating a DataSet object is straightforward. Place the data into numpy
arrays with one element per event, then pass a Python dict of those arrays
to the DataSet constructor. For example:
# first, load variables a, b, and c, and event weights, from
# wherever they are stored
from pybdt import ml
sim_data = ml.DataSet (dict (a=a, b=b, c=c, weight=weight))
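The preparation step can be sketched as follows. This is only an illustration: the synthetic values stand in for real data, and the `check_columns` helper is hypothetical (it is not part of pybdt). The column names follow the example above. The pybdt call itself is left in a comment so that the sketch runs without pybdt installed.

```python
import numpy as np


def check_columns(columns):
    """Return the common event count, or raise if the columns differ in length.

    Hypothetical helper, not part of pybdt.
    """
    lengths = dict((name, len(values)) for name, values in columns.items())
    if len(set(lengths.values())) != 1:
        raise ValueError("columns have different lengths: %r" % lengths)
    return next(iter(lengths.values()))


# Synthetic stand-ins for variables loaded from wherever they are stored.
rng = np.random.default_rng(0)
n_events = 1000
columns = dict(
    a=rng.normal(size=n_events),
    b=rng.uniform(size=n_events),
    c=rng.exponential(size=n_events),
    weight=np.ones(n_events),
)

# One element per event: every array must have the same length.
n = check_columns(columns)

# With pybdt available, the dict is then passed on as in the example above:
# from pybdt import ml
# sim_data = ml.DataSet(columns)
```

Checking the lengths up front is a design choice of this sketch; it catches a mismatched column before the dict is handed to the DataSet constructor.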
Data columns can be accessed by name using the __getitem__ operator
(square brackets), i.e.:
a = sim_data['a']
weight = sim_data['weight']
It can be convenient to store a livetime with a DataSet. The livetime is
passed in as a single float under the livetime key, like so:
...
livetime = 3.15e7
data = ml.DataSet (dict (a=a, b=b, c=c, livetime=livetime))
The livetime may be read or changed using the pybdt.ml.DataSet.livetime
property.
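To make the access pattern concrete without requiring pybdt, here is a toy stand-in. `ToyDataSet` is hypothetical and is NOT pybdt.ml.DataSet; it only mimics the interface described above (construction from a dict, column lookup with square brackets, and a readable and writable livetime property). How pybdt stores the livetime internally is an assumption here.

```python
import numpy as np


class ToyDataSet(object):
    """Hypothetical stand-in that mimics the interface described above.

    This is NOT pybdt.ml.DataSet; it only illustrates the access pattern.
    """

    def __init__(self, columns):
        columns = dict(columns)
        # Assumption: the scalar 'livetime' entry is kept apart from the
        # per-event columns.
        self._livetime = float(columns.pop('livetime', 0.0))
        self._columns = dict(
            (name, np.asarray(values)) for name, values in columns.items())

    def __getitem__(self, name):
        # Column lookup by name, as in sim_data['a'].
        return self._columns[name]

    @property
    def livetime(self):
        return self._livetime

    @livetime.setter
    def livetime(self, value):
        self._livetime = float(value)


data = ToyDataSet(dict(a=[1.0, 2.0, 3.0],
                       weight=[0.5, 0.5, 1.0],
                       livetime=3.15e7))
a = data['a']                 # read a column through __getitem__
data.livetime = 2 * 3.15e7    # change the livetime through the property
```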
Training
The simplest way to train is to use the included
pybdt/resources/scripts/train.py script. Its help output is shown here:
    Usage: train.py {options} [comma-sep'd variables] \
            [training signal] [training background] [bdt filename]

    Options:
      -h, --help            show this help message and exit

      Decision Tree options:
        -d N, --depth=N     make trees N levels deep
        -L, --nonlinear-cuts
                            use nonlinear cut spacing
        -c N, --num-cuts=N  try N cuts per var per node
        -s N, --min-split=N
                            do not split if a node contains fewer than N events
        -p STRENGTH, --prune-strength=STRENGTH
                            use STRENGTH prune strength
        -v N, --num-random-variables=N
                            use N randomly selected variables at each node (0 to
                            use every var at every node)

      Forest options:
        -t N, --num-trees=N
                            use N trees
        -b BETA, --beta=BETA
                            use BETA boost parameter
        -e FRAC, --frac-random-events=FRAC
                            use FRAC randomly selected fraction of events in each
                            tree (1, the default, to use every event in every
                            tree)

      Data options:
        --sig-weight=COLNAME
                            the name of the variable in which the signal weights
                            are stored
        --bg-weight=COLNAME
                            the name of the variable in which the bg weights are
                            stored
An example of this script's usage can be found in pybdt/resources/examples/train_sample_bdt.sh. For a real-world example, see the ml_train() function from the IC79 northern \(\nu_\mu\) analysis.
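As a rough illustration of how the options in the help output combine with the positional arguments, the sketch below assembles a train.py command line in Python. The `build_train_command` helper, the file names, and the option values are all hypothetical and are not recommended settings; only the flag names and the argument order are taken from the help text above.

```python
def build_train_command(variables, sig_file, bg_file, bdt_file,
                        depth=None, num_trees=None, beta=None,
                        sig_weight=None, bg_weight=None,
                        script='pybdt/resources/scripts/train.py'):
    """Assemble an argument list for train.py (hypothetical helper).

    Options come first, followed by the four positional arguments in the
    order given by the usage line.
    """
    cmd = ['python', script]
    if depth is not None:
        cmd.append('--depth=%d' % depth)
    if num_trees is not None:
        cmd.append('--num-trees=%d' % num_trees)
    if beta is not None:
        cmd.append('--beta=%s' % beta)
    if sig_weight is not None:
        cmd.append('--sig-weight=%s' % sig_weight)
    if bg_weight is not None:
        cmd.append('--bg-weight=%s' % bg_weight)
    # Variables are passed as a single comma-separated argument.
    cmd += [','.join(variables), sig_file, bg_file, bdt_file]
    return cmd


# Placeholder file names and option values, chosen only for illustration.
cmd = build_train_command(
    ['a', 'b', 'c'], 'train_sig.ds', 'train_bg.ds', 'sample.bdt',
    depth=3, num_trees=300, beta=0.7,
    sig_weight='weight', bg_weight='weight')
print(' '.join(cmd))

# To actually run it (requires pybdt and the input files):
# import subprocess
# subprocess.check_call(cmd)
```

Building the command as a list rather than a single string avoids shell quoting problems if a file name contains spaces.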