Introduction to decision tree randomization
Decision tree classifiers are powerful because they find the best cut for each region of a many-dimensional parameter space. Boosted decision tree (BDT) classifiers improve on single decision trees because they can provide good classification for events in the tails of the variable distributions without becoming overtrained on fluctuations in those distributions.
A more recent innovation is the so-called Random Forest. Like BDT classifiers, these make use of a forest of decision trees to provide a score for events; however, rather than using boosting to differentiate the individual trees, an element of randomness is introduced.
Typically, one uses either boosting with no randomization, or randomization with no boosting (a boost strength of 0). However, there is no technical reason why these techniques cannot be combined, and so the implementation in pybdt allows the user to apply both if desired (a sketch of the combined training loop appears at the end of this section).
pybdt allows the following two types of randomization:

cut variable randomization
    The user provides an integer num_random_variables, which must be less than the total number of variables being used. Then during training, at each node, only num_random_variables variables are randomly selected to be considered for choosing a cut (see the sketch after this list).

training event randomization
    The user provides a fraction frac_random_events between 0.0 and 1.0. Then during training, for each tree, only a frac_random_events fraction of the full training sample is used.
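The important detail in cut variable randomization is that a fresh subset of variables is drawn at each node, not once per tree. The following minimal sketch in plain Python/NumPy illustrates both knobs on a toy recursive tree builder. It is not the pybdt implementation: the function names (build_node, train_tree) and the majority-vote split criterion are hypothetical, and only the parameter names num_random_variables and frac_random_events come from the description above.

```python
import numpy as np

rng = np.random.default_rng(7)

def build_node(X, y, num_random_variables, depth, max_depth=3):
    """Grow one node: draw a *fresh* random subset of variables, pick the
    best cut among only those, then recurse on the two children."""
    if depth == max_depth or len(np.unique(y)) == 1:
        return {'leaf': float(np.mean(y))}      # node purity as the score
    d = X.shape[1]
    feats = rng.choice(d, size=num_random_variables, replace=False)
    best = None
    for j in feats:                             # only sampled variables compete
        for thr in np.unique(X[:, j])[:-1]:     # cuts between observed values
            left = X[:, j] <= thr
            # misclassified count if each side votes its majority label
            err = (min(np.sum(y[left] == 1), np.sum(y[left] == -1)) +
                   min(np.sum(y[~left] == 1), np.sum(y[~left] == -1)))
            if best is None or err < best[0]:
                best = (err, j, thr)
    if best is None:                            # no usable cut: make a leaf
        return {'leaf': float(np.mean(y))}
    _, j, thr = best
    left = X[:, j] <= thr
    return {'var': int(j), 'cut': float(thr),
            'left': build_node(X[left], y[left],
                               num_random_variables, depth + 1, max_depth),
            'right': build_node(X[~left], y[~left],
                                num_random_variables, depth + 1, max_depth)}

def train_tree(X, y, frac_random_events=0.5, num_random_variables=1):
    """Training event randomization: the tree only ever sees a random
    frac_random_events fraction of the full sample."""
    n = len(y)
    sub = rng.choice(n, size=max(1, int(frac_random_events * n)),
                     replace=False)
    return build_node(X[sub], y[sub], num_random_variables, depth=0)

# toy demo: labels depend on x2 only; each node still gets a random
# two-variable menu to cut on
X = rng.normal(size=(300, 4))
y = np.where(X[:, 2] > 0.0, 1.0, -1.0)
tree = train_tree(X, y, frac_random_events=0.5, num_random_variables=2)
```

Drawing a new variable subset at every node, rather than once per tree, is the usual Random Forest design choice: each tree stays reasonably strong, but the trees are decorrelated from one another.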
In the ABC example, training event randomization is used to reduce overtraining on the training sample. Because each tree is trained on different events, the forest avoids tuning itself to fluctuations in the training sample.
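To make the "boosting plus randomization" combination mentioned earlier concrete, here is a minimal, self-contained sketch of an AdaBoost-style loop with both randomization hooks. Again, this is not the pybdt implementation; best_stump, train_forest, score, and boost_strength are hypothetical names used for illustration. Each tree is a single-cut stump trained on a random event subset and a random variable subset; setting boost_strength = 0 switches off the event reweighting, recovering a pure random forest.

```python
import numpy as np

rng = np.random.default_rng(42)

def best_stump(X, y, w, features):
    """Find the weighted-error-minimizing single cut among the
    candidate features (a one-node 'tree')."""
    best_err, best_cut = np.inf, None
    for j in features:
        for thr in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                pred = np.where(sign * (X[:, j] - thr) > 0.0, 1.0, -1.0)
                err = w[pred != y].sum() / w.sum()
                if err < best_err:
                    best_err, best_cut = err, (j, thr, sign)
    return best_err, best_cut

def train_forest(X, y, n_trees=20, frac_random_events=0.5,
                 num_random_variables=1, boost_strength=1.0):
    n, d = X.shape
    w = np.full(n, 1.0 / n)                      # per-event boosting weights
    forest = []
    for _ in range(n_trees):
        # training event randomization: this tree sees a random subset
        sub = rng.choice(n, size=max(1, int(frac_random_events * n)),
                         replace=False)
        # cut variable randomization (per-node and per-tree coincide
        # for a one-node stump)
        feats = rng.choice(d, size=num_random_variables, replace=False)
        err, (j, thr, sign) = best_stump(X[sub], y[sub], w[sub], feats)
        err = float(np.clip(err, 1e-9, 1.0 - 1e-9))
        alpha = 0.5 * np.log((1.0 - err) / err)  # tree vote weight
        pred = np.where(sign * (X[:, j] - thr) > 0.0, 1.0, -1.0)
        # boosting: up-weight misclassified events between trees;
        # boost_strength = 0 disables this, leaving a pure random forest
        w *= np.exp(-boost_strength * alpha * y * pred)
        w /= w.sum()
        forest.append((alpha, j, thr, sign))
    return forest

def score(forest, X):
    """Weighted vote of all trees; larger means more signal-like."""
    return sum(a * np.where(s * (X[:, j] - t) > 0.0, 1.0, -1.0)
               for a, j, t, s in forest)

# toy demo: signal (+1) and background (-1) separated mostly along x0
X = rng.normal(size=(400, 3))
y = np.where(X[:, 0] + 0.3 * X[:, 1] > 0.0, 1.0, -1.0)
forest = train_forest(X, y)
print("training accuracy:", np.mean(np.sign(score(forest, X)) == y))
```

In this sketch the event weights w carry the boosting state between trees, while the two randomization draws are independent of it, which is why the two techniques compose without interfering with one another.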