Tree and Forest Inspection
One of the design goals of pybdt is to make it relatively straightforward for users to inspect the classifier. On this page we discuss some features that make this easy.
Direct Access
The most basic way to learn about a trained classifier is to check its properties in IPython. If a BDTModel is loaded as bdt, then the number of constituent DTModels is bdt.n_dtmodels (an int). The per-tree relative weights are given by bdt.alphas (a numpy array with dtype float). The DTModels themselves are stored in bdt.dtmodels (a numpy array with dtype object). The names of the cut variables used by the classifier are given by bdt.feature_names (a list of strings).
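For example, assuming a BDTModel has already been loaded as bdt (how the model is obtained is up to you; this sketch only illustrates the attribute access described above):

print(bdt.n_dtmodels)     # int: number of trees in the forest
print(bdt.alphas)         # numpy float array: per-tree relative weights
print(bdt.feature_names)  # list of str: cut variables used by the classifier
print(bdt.dtmodels[0])    # the first constituent DTModel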
Now, suppose we store the first tree as dt = bdt.dtmodels[0]. The root DTNode is node = dt.root; you can jump from a node to its children with DTNode.left and DTNode.right (until you reach a leaf node, for which both children are None). Split nodes apply a cut of the form node.feature_name < node.feature_val; passing events descend left and failing events descend right. Other details are also available; see the DTNode reference.
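As a minimal sketch of this traversal, the following recursive function prints every cut in a tree. It relies only on the attributes described above; in particular, it detects leaves by checking whether node.left is None:

def print_cuts (node, depth=0):
    indent = '    ' * depth
    if node.left is None:
        # leaf node: both children are None
        print(indent + 'leaf')
        return
    # split node: events passing feature_name < feature_val descend left
    print('{0}{1} < {2}'.format (indent, node.feature_name, node.feature_val))
    print_cuts (node.left, depth + 1)
    print_cuts (node.right, depth + 1)

dt = bdt.dtmodels[0]
print_cuts (dt.root)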
Visualization
pybdt includes a module, pybdt.viz, for easily visualizing trees.
For a text-only printout, we can say:
from pybdt import viz
print(viz.dtmodel_to_text (dt))
For a graphical visualization, we can instead say:
filename = 'dt0'  # hypothetical output basename; use any path you like
pic = viz.dtmodel_to_graphviz (dt)
pic.write_png (filename + '.png')
pic.write_pdf (filename + '.pdf')
and so forth; many write_* methods are provided, one per graphviz output format. This feature requires the pydot interface to graphviz to be installed. On Ubuntu, it is provided by the python-pydot package.
Variable Importance
The features described above make it reasonably easy to dig into the details of individual trees, and the plotting facilities (Validator Usage) offer a number of options for quantifying the overall classifier performance. However, one question remains to be answered: to what extent does each feature participate in the classifier? This can be addressed with the following calls:
print(viz.variable_importance_to_text (bdt.variable_importance (True, True)))   # weighted by sep_gain and alpha
print(viz.variable_importance_to_text (bdt.variable_importance (False, False))) # unweighted counts
print(viz.variable_importance_to_text (bdt.variable_importance (True, False)))  # weighted by sep_gain only
print(viz.variable_importance_to_text (bdt.variable_importance (False, True)))  # weighted by alpha only
pybdt.ml.BDTModel.variable_importance() produces a dict with string keys (the variable names) and floating point values (the relative importance, between 0 and 1). pybdt.viz.variable_importance_to_text() converts this dict to a string containing an easily-readable table, sorted by descending variable importance.
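The helper is convenient, but since the return value is an ordinary dict, you can also work with it directly. For instance, a hand-rolled sort by descending importance (a sketch, assuming a loaded bdt as before):

imp = bdt.variable_importance (True, True)
for name, value in sorted (imp.items (), key=lambda kv: kv[1], reverse=True):
    print('{0:>10s} : {1:.6f}'.format (name, value))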
variable_importance() measures the importance of variables by counting the split nodes that use each feature. The first argument tells whether the count should be weighted by the separation gain (node.sep_gain) achieved by each split. The second argument tells whether the count should be weighted by the overall weight (alpha) of the tree in which each split occurs.
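To make the two arguments concrete, here is a sketch of the counting scheme just described. This is not pybdt's actual implementation, only an illustration built from the documented DTNode attributes; it tallies split nodes per feature, optionally weighting by node.sep_gain and by the tree's alpha, then normalizes the tallies to sum to 1:

def sketch_variable_importance (bdt, use_sep_gain, use_alpha):
    counts = dict ((name, 0.0) for name in bdt.feature_names)
    def walk (node, alpha):
        if node.left is None:
            return  # leaf: nothing to count
        weight = 1.0
        if use_sep_gain:
            weight *= node.sep_gain   # weight by this split's separation gain
        if use_alpha:
            weight *= alpha           # weight by the tree's overall weight
        counts[node.feature_name] += weight
        walk (node.left, alpha)
        walk (node.right, alpha)
    for dt, alpha in zip (bdt.dtmodels, bdt.alphas):
        walk (dt.root, alpha)
    total = sum (counts.values ())
    return dict ((name, c / total) for (name, c) in counts.items ())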
Here is the variable importance for the classifier in the ABC example – first with both weightings enabled, and then with both disabled:
print(viz.variable_importance_to_text (bdt.variable_importance (True, True)))
1. b : 0.556785
2. c : 0.259359
3. a : 0.183856
print(viz.variable_importance_to_text (bdt.variable_importance (False, False)))
1. c : 0.387212
2. b : 0.349272
3. a : 0.263517
In general, enabling weighting causes splits in earlier trees and splits closer to the roots of trees to be counted more strongly, relative to when weighting is disabled. Thus the weighted variable importance can be interpreted as a measure that is biased in favor of those variables responsible for the classification of the bulk of events. The unweighted variable importance is, by comparison, fairer to variables used in later trees or closer to leaf nodes; it can be interpreted as a measure of which variables are responsible for classifying the trickiest events. It is less clear how the (True, False) and (False, True) variable importance weightings should be interpreted, but the former will have more of a bias towards the roots of trees and the latter will have more of a bias towards earlier trees in the forest.