Classifier Application
Once a classifier has been trained and validated to your liking, you are ready to use it to classify events of otherwise unknown class. This can be done for individual samples or for whole ensembles at a time. Here we go over the methods of classifier application provided by pybdt.
Scoring with IceTray
For IceCube data, it may be convenient to store a classifier score as an I3Double in each Physics frame. This can be done relatively simply with icecube.pybdtmodule.PyBDTModule. For the ABC example, you can calculate and store the scores like so:
from icecube.icetray import I3Tray
from icecube import dataio
from icecube.pybdtmodule import PyBDTModule
tray = I3Tray()
[...]
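# Extract the features the classifier was trained on from each frame.
def varsfunc (frame):
    a = frame['A'].value
    b = frame['B'].value
    c = frame['C'].value
    out = dict (a=a, b=b, c=c)
    return out

# Score each frame and store the result under the key 'Score'.
tray.AddModule (PyBDTModule, 'bdt',
    BDTFilename='path/to/sample.bdt',
    varsfunc=varsfunc,
    OutputName='Score'
)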
[...]
It is up to the user to provide varsfunc(), which extracts the relevant features from the frame for use by icecube.pybdtmodule.PyBDTModule.
An alternative implementation would inspect each frame directly for I3Doubles corresponding to each required feature, but that approach would require unnecessary pollution of the frames when features are nested deep in frame objects. Consider possible features zenith = SplineMPE.dir.zenith or plogl = SPEFitFitParams.logl / (SPEFitFitParams.ndof + 1.5). The use of varsfunc() simplifies scoring when training features are derived from, but do not exactly correspond to, variables produced by the rest of the processing chain.
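As a minimal sketch of such a varsfunc(), the two derived features mentioned above could be computed as follows. The frame keys and attribute names are taken from the text. The stand-in frame built from SimpleNamespace objects is purely an assumption so that the snippet runs without IceTray; a real I3Frame would hold the actual fit objects.

```python
from types import SimpleNamespace

def varsfunc (frame):
    """Derive training features from objects nested in the frame."""
    fit = frame['SplineMPE']
    params = frame['SPEFitFitParams']
    return dict (
        zenith=fit.dir.zenith,
        plogl=params.logl / (params.ndof + 1.5),
    )

# Stand-in for an I3Frame, for illustration only (values are made up).
frame = {
    'SplineMPE': SimpleNamespace (dir=SimpleNamespace (zenith=1.2)),
    'SPEFitFitParams': SimpleNamespace (logl=85.0, ndof=8),
}
features = varsfunc (frame)
```

No intermediate I3Double needs to be written to the frame: the derived quantities exist only in the returned dict.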
Scoring without IceTray
In other contexts, it is useful to calculate scores outside of the IceTray framework. This can be done for individual events or for whole ensembles at a time. An individual event can be scored like so:
score = bdt.score_event ({'a': 2.7, 'b': -0.31, 'c': 116})
The dict argument can of course be calculated by any method so long as the keys match the features used by the classifier and the values are real numbers.
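One way to guard against malformed input is to check the dict before scoring. The helper below is a hypothetical sketch and is not part of pybdt. It assumes the feature names of the ABC example, drops any extra keys, and additionally rejects non-finite values, which goes slightly beyond the requirement stated above.

```python
import math
from numbers import Real

def make_event (values, feature_names):
    """Return a dict holding exactly the required features as floats."""
    missing = set (feature_names) - set (values)
    if missing:
        raise KeyError ('missing features: %s' % sorted (missing))
    event = {}
    for name in feature_names:
        value = values[name]
        # Reject strings, None, NaN, infinities and the like.
        if not isinstance (value, Real) or not math.isfinite (value):
            raise ValueError ('feature %r is not a real number' % name)
        event[name] = float (value)
    return event

event = make_event ({'a': 2.7, 'b': -0.31, 'c': 116}, ['a', 'b', 'c'])
# With a trained classifier loaded as bdt, the event could then be scored:
# score = bdt.score_event (event)
```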
Suppose one has an ensemble of events with numpy arrays a, b and c holding one value per event. Then the per-event scores can be calculated like so:
scores = bdt.score_dict (dict (a=a, b=b, c=c))
If a pybdt.ml.DataSet object data is already available, the per-event scores can be calculated like so:
scores = bdt.score_DataSet (data)
In these last two examples, the resulting scores are a numpy array with dtype=float.
A convenience method pybdt.ml.BDTModel.score() is provided which automatically calls the correct one of the above methods, given the input. Thus the following calls all work as expected:
score = bdt.score ({'a': 2.7, 'b': -0.31, 'c': 116})
scores = bdt.score (dict (a=a, b=b, c=c))
scores = bdt.score (data)
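To illustrate the idea of dispatching on the input, here is a toy sketch. ToyModel is a hypothetical stand-in and does not reproduce how pybdt.ml.BDTModel.score() is actually implemented. Its scoring rule is a placeholder, it uses plain lists instead of numpy arrays, and it omits the DataSet case.

```python
from numbers import Real

class ToyModel (object):
    """Stand-in classifier for illustration; NOT pybdt's BDTModel."""

    def score_event (self, event):
        # Placeholder rule: +1 if a exceeds b, otherwise -1.
        return 1.0 if event['a'] > event['b'] else -1.0

    def score_dict (self, columns):
        # Each value is a sequence holding one entry per event.
        names = list (columns)
        n_events = len (columns[names[0]])
        return [
            self.score_event ({k: columns[k][i] for k in names})
            for i in range (n_events)
        ]

    def score (self, data):
        # All-scalar values mean a single event; otherwise an ensemble.
        if all (isinstance (v, Real) for v in data.values ()):
            return self.score_event (data)
        return self.score_dict (data)

model = ToyModel ()
single = model.score ({'a': 2.7, 'b': -0.31, 'c': 116})
many = model.score (dict (a=[0.0, 1.0], b=[1.0, 0.0], c=[5.0, 5.0]))
```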
Finally, note that the leaf purity \(p\) can be used to scale the contribution of each tree in the model (see Introduction to decision tree classifiers). This is especially common for forests in which the trees are differentiated only by randomization but not by boosting. To enable this type of scoring, change the above calls to:
score = bdt.score ({'a': 2.7, 'b': -0.31, 'c': 116}, use_purity=True)
scores = bdt.score (dict (a=a, b=b, c=c), use_purity=True)
scores = bdt.score (data, use_purity=True)
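The effect of purity scaling can be sketched with a toy forest of one-cut trees. Everything below is an assumption made for illustration: the tree representation, the weights, and in particular the mapping of purity \(p\) to a per-tree vote of \(2p - 1\), which is one common convention and may differ in detail from what pybdt does internally.

```python
def forest_score (event, trees, use_purity=False):
    """Weighted average of per-tree votes for a single event."""
    total = 0.0
    weight_sum = 0.0
    for feature, cut, p_below, p_above, alpha in trees:
        # Purity of the leaf this event falls into.
        p = p_below if event[feature] < cut else p_above
        if use_purity:
            vote = 2.0 * p - 1.0               # scaled by leaf purity
        else:
            vote = 1.0 if p > 0.5 else -1.0    # hard signal/background vote
        total += alpha * vote
        weight_sum += alpha
    return total / weight_sum

# Hypothetical trees: (feature, cut, purity below cut, purity above cut, weight)
trees = [
    ('a', 1.0, 0.2, 0.9, 1.0),
    ('b', 0.0, 0.7, 0.4, 1.0),
]
event = {'a': 2.7, 'b': -0.31, 'c': 116}
plain = forest_score (event, trees)
scaled = forest_score (event, trees, use_purity=True)
```

Here both trees vote for signal, so the plain score is at its maximum, while the purity-scaled score is smaller because neither leaf is perfectly pure.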