pybdt.validate module¶
Validation suite for PyBDT forests.
- class pybdt.validate.Validator(bdt)¶
Bases: object
Test and validate BDTs.
- class Proxy(validator, d, getfunc)¶
Bases: object
A class to map data set keys directly to some object, which may require an arbitrary extra dereferencing step.
- add_data(key, data, label='', scores=True, pscores=False)¶
Add a data set to this Validator.
- add_weighting(arg, key, wkey='default', label='', add_to_mc=False, use_as_data=False, **style_kwargs)¶
Add a weighting to a dataset.
- Parameters:
arg (str, numpy.ndarray, or float) – The name of the weight column, the weights, or the livetime.
key (str) – The key for the desired data set.
wkey (str) – The key for this weighting of the data set.
label (str) – A nice label for this weighting.
add_to_mc (bool) – Whether to include this dataset weighting in “total monte carlo” calculations.
use_as_data (bool) – Whether to use this dataset weighting as the “data” sample in data/mc ratio calculations.
style_kwargs (dict) – Arguments to pass to the histlite.Style constructor.
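The three accepted forms of arg (a column name, an explicit weight array, or a livetime) can be illustrated with a small standalone sketch. It does not use PyBDT; resolve_weights and the column name weight_col are hypothetical, and the rule that a livetime becomes a uniform weight of 1 / livetime is an assumption, not something this page states.

```python
def resolve_weights(arg, columns, n_events):
    """Turn an add_weighting-style `arg` into one weight per event (hypothetical helper)."""
    if isinstance(arg, str):
        # A string names a weight column already stored with the data set.
        return list(columns[arg])
    if isinstance(arg, (int, float)):
        # Assumption: a livetime gives every event the weight 1 / livetime,
        # so that summed weights read as a rate.
        return [1.0 / arg] * n_events
    # Anything else is treated as an explicit sequence of weights.
    weights = list(arg)
    if len(weights) != n_events:
        raise ValueError("expected one weight per event")
    return weights


columns = {"weight_col": [0.5, 1.5, 2.0]}  # hypothetical stored column
by_name = resolve_weights("weight_col", columns, 3)
by_array = resolve_weights([1.0, 1.0, 4.0], columns, 3)
by_livetime = resolve_weights(10.0, columns, 3)
```

Here plain lists stand in for NumPy arrays to keep the sketch dependency-free.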
- property bdt¶
The BDTModel for this Validator.
- clear_weightings()¶
Erase any stored weightings.
- create_correlation_matrix_plot(set_spec, exprs=None, fignum=None, cut=None, eval_names={})¶
Create a correlation matrix plot.
- Parameters:
set_spec (str or tuple) – See Validator.get_key_wkey().
exprs (list) – List of expressions to put on the axes (default: self.bdt.feature_names).
fignum (int) – If given, create a figure with the given number using matplotlib.pyplot; otherwise use matplotlib.figure.Figure.
cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.
eval_names (dict) – Names to be passed to Validator.eval() for cut evaluation.
- Returns:
The new matplotlib.figure.Figure.
- create_correlation_ratio_matrix_plot(set_spec1, set_spec2, exprs=None, fignum=None, clog=False, cut=None, eval_names={})¶
Create a correlation ratio matrix plot comparing two data sets.
- Parameters:
set_spec1 (str or tuple) – See Validator.get_key_wkey().
set_spec2 (str or tuple) – See Validator.get_key_wkey().
exprs (list) – List of expressions to put on the axes (default: self.bdt.feature_names).
fignum (int) – If given, create a figure with the given number using matplotlib.pyplot; otherwise use matplotlib.figure.Figure.
clog (bool) – Whether to use a log color scale (absolute values of ratios will be shown).
cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.
eval_names (dict) – Names to be passed to Validator.eval() for cut evaluation.
- Returns:
The new matplotlib.figure.Figure.
- create_overtrain_check_plot(sig_train_set_spec, sig_test_set_spec, bg_train_set_spec, bg_test_set_spec, legend={'loc': 'best'}, legend_side='right', fignum=None, expr='scores', **kwargs)¶
Create an overtraining check plot.
- Parameters:
sig_train_set_spec (str or tuple) – The signal training set. (Each signal or background training or testing set is specified as with Validator.get_key_wkey().)
sig_test_set_spec (str or tuple) – The signal testing set.
bg_train_set_spec (str or tuple) – The background training set.
bg_test_set_spec (str or tuple) – The background testing set.
legend (bool or dict) – If True or non-empty dict, draw a legend. If dict, use as keyword arguments for matplotlib.axes.Axes.legend.
legend_side (str) – Either ‘left’ or ‘right’; on which axes to draw the legend.
fignum (int) – If given, create a figure with the given number using matplotlib.pyplot; otherwise use matplotlib.figure.Figure.
expr (str) – The expression to evaluate on each data set.
The following additional keyword arguments are allowed:
- Parameters:
title (str) – The title of the plot.
xlabel (str) – The x axis label.
ylabel (str) – The y axis label.
left_ylabel (str) – The main y axis label.
right_ylabel (str) – The secondary y axis label.
margin_left (float) – Fraction of width to reserve as left margin.
margin_right (float) – Fraction of width to reserve as right margin.
margin_top (float) – Fraction of height to reserve as top margin.
margin_bottom (float) – Fraction of height to reserve as bottom margin.
bins (int) – Number of bins to use in histograms.
- Returns:
A dict with string keys and values of the new matplotlib.figure.Figure and each set of axes used. Depending on the above arguments, some or all of the following keys will be available: [‘fig’, ‘first_main_ax’, ‘twin_first_main_ax’, ‘first_ratio_ax’, ‘second_main_ax’, ‘twin_second_main_ax’, ‘second_ratio_ax’]
- create_plot(expr, kind, left_set_specs, right_set_specs=[], fignum=None, **kwargs)¶
Create a BDT score distribution, rate plot, or efficiency plot.
- Parameters:
expr (str) – An expression for Validator.eval() which returns a numerical array.
kind (str) – One of ‘dist’, ‘rate’ or ‘eff’.
left_set_specs (list) – What to plot on the main y axis (see Validator.get_key_wkey()).
right_set_specs (list) – What to plot on the secondary y axis (see Validator.get_key_wkey()).
fignum (int) – If given, create a figure with the given number using matplotlib.pyplot; otherwise use matplotlib.figure.Figure.
A new figure is created using Validator.plot_variable(). The following keyword arguments can be used to create dual linear/log plots.
- Parameters:
dual (bool) – Make a dual figure with a linear y scale on the left and a log y scale on the right.
data_mc (bool) – Include data/mc ratio plot(s).
linear_kwargs (dict) – If given, this dict of keyword arguments supersedes individually passed keyword arguments for the linear plot.
log_kwargs (dict) – If given, this dict of keyword arguments supersedes individually passed keyword arguments for the log plot.
The following keyword arguments determine the plot appearance.
- Parameters:
title (str) – The title of the plot.
xlabel (str) – The x axis label (default: expr).
ylabel (str) – The y axis label.
left_ylabel (str) – The main y axis label.
right_ylabel (str) – The secondary y axis label.
data_mc_ylabel (str) – The data/mc ratio plot y axis label (default: “data/mc ratio”).
grid (bool) – Whether to include grids (default: True)
margin_left (float) – Fraction of width to reserve as left margin.
margin_right (float) – Fraction of width to reserve as right margin.
margin_top (float) – Fraction of height to reserve as top margin.
margin_bottom (float) – Fraction of height to reserve as bottom margin.
aspect (float) – Width / height ratio.
All other keyword arguments are passed through to Validator.plot_variable().
- Returns:
A dict with string keys and values of the new matplotlib.figure.Figure and each set of axes used. Depending on the above arguments, some or all of the following keys will be available: [‘fig’, ‘first_main_ax’, ‘twin_first_main_ax’, ‘first_dm_ax’, ‘second_main_ax’, ‘twin_second_main_ax’, ‘second_dm_ax’]
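The three kind values can be read as three views of the same weighted sample. The sketch below is one plausible interpretation, not confirmed by this page: ‘dist’ as a weighted histogram, ‘rate’ as the summed weight at or above each threshold, and ‘eff’ as that rate divided by the total weight. The function curve and its arguments are hypothetical and do not call PyBDT.

```python
def curve(values, weights, kind, edges):
    """Return one number per bin for a 'dist', 'rate' or 'eff' curve (hypothetical)."""
    n_bins = len(edges) - 1
    if kind == "dist":
        counts = [0.0] * n_bins
        for v, w in zip(values, weights):
            for i in range(n_bins):
                # The last bin is closed on the right so the top edge is counted.
                upper_ok = v < edges[i + 1] or (i == n_bins - 1 and v == edges[-1])
                if edges[i] <= v and upper_ok:
                    counts[i] += w
                    break
        return counts
    # Summed weight of all events at or above each lower bin edge.
    passing = [
        sum(w for v, w in zip(values, weights) if v >= edge) for edge in edges[:-1]
    ]
    if kind == "rate":
        return passing
    if kind == "eff":
        total = sum(weights)
        return [p / total for p in passing]
    raise ValueError("kind must be 'dist', 'rate' or 'eff'")


scores = [-0.5, -0.1, 0.2, 0.7]
weights = [4.0, 2.0, 1.0, 1.0]
edges = [-1.0, -0.5, 0.0, 0.5, 1.0]
dist = curve(scores, weights, "dist", edges)
rate = curve(scores, weights, "rate", edges)
eff = curve(scores, weights, "eff", edges)
```

Under this reading, an ‘eff’ curve starts at 1 and falls as the score threshold rises.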
- create_variable_pair_plot(set_spec, exprx, expry, bins=100, range=None, fignum=None, clog=False, cut=None, eval_names={})¶
Create a variable-variable 2D histogram.
- Parameters:
set_spec (str or tuple) – See Validator.get_key_wkey().
exprx (str) – Expression to put on the x axis.
expry (str) – Expression to put on the y axis.
bins (int) – The number of bins to create [default: 100].
range (tuple of tuples of floats) – If given, the x range and y range in the form ((xmin,xmax), (ymin,ymax))
fignum (int) – If given, create a figure with the given number using matplotlib.pyplot; otherwise use matplotlib.figure.Figure.
clog (bool) – Whether to use a log color scale (absolute values of ratios will be shown)
cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.
eval_names (dict) – Names to be passed to Validator.eval() for cut evaluation.
- property data¶
Mapping of keys to DataSets.
- eval(set_spec, expr, names={})¶
Evaluate an expression in terms of variables in a dataset.
- Parameters:
set_spec (str or tuple) – See Validator.get_key_wkey().
expr (str) – The expression to evaluate.
names (dict) – Names to be passed into eval.
- Returns:
The result of the expression evaluation.
When expr is evaluated, each variable stored in the dataset will be available. The variables ‘scores’, ‘pscores’ and ‘weights’ will also be available. If the dataset has a livetime set, ‘livetime’ will also be available.
Other allowed identifiers are np (NumPy) and scipy, in addition to anything specified in the names parameter.
This method is implemented in terms of DataSet.eval().
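The namespace rules described above can be sketched without PyBDT. In this minimal illustration, eval_in_dataset is a hypothetical helper, plain lists stand in for NumPy arrays, and the standard math module stands in for the np and scipy identifiers; the real lookup happens inside DataSet.eval(), which this page does not show.

```python
import math


def eval_in_dataset(variables, expr, names=None, livetime=None):
    """Evaluate `expr` with dataset variables as identifiers (hypothetical helper)."""
    namespace = {"math": math}  # stand-in for the np / scipy identifiers
    namespace.update(variables)  # feature columns plus scores, pscores, weights
    if livetime is not None:
        namespace["livetime"] = livetime  # only exposed when a livetime is set
    namespace.update(names or {})  # caller-supplied names are added last
    # A single dict is passed as globals so that names also resolve
    # inside comprehensions within the expression.
    return eval(expr, namespace)


dataset = {
    "energy": [10.0, 100.0, 1000.0],
    "scores": [-0.3, 0.1, 0.6],
    "weights": [2.0, 1.0, 0.5],
}
log_energy = eval_in_dataset(dataset, "[math.log10(e) for e in energy]")
passing = eval_in_dataset(
    dataset, "[s > threshold for s in scores]", names={"threshold": 0.0}
)
```

Because the expression is executed by Python's eval, expression strings should only come from trusted sources.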
- property full_label¶
Mapping of (key,wkey) to full labels.
- get_Hist(set_spec, expr, bins=100, range=None, normed=False, cut=None, eval_names={})¶
Get a histlite.Hist for a variable for a given data set and weighting.
- Parameters:
set_spec (str or tuple) – See Validator.get_key_wkey().
expr (str) – An expression for Validator.eval() which returns a numerical array.
bins (int) – The number of bins to create [default: 100].
range (2-tuple) – The range over which to make the histogram [default: (min value, max value) found in all included data sets].
normed (bool) – Whether to normalize the y axis histograms.
cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.
eval_names (dict) – Names to be passed to Validator.eval() for the main expression or the cut expression.
- Returns:
An instance of histlite.Hist.
- get_clone(bdt=None)¶
Construct a copy of this Validator.
- Parameters:
bdt (str or BDTModel) – The BDT model instance or StorableObject initializer.
- get_correlation(set_spec, expr1, expr2, cut=None, eval_names={})¶
Get the correlation between two variables for a data set.
- Parameters:
set_spec (str or tuple) – See Validator.get_key_wkey().
expr1 (str) – The first variable or expression.
expr2 (str) – The second variable or expression.
cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.
eval_names (dict) – Names to be passed to Validator.eval() for cut evaluation.
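As an illustration of what a correlation between two expressions under a weighting and a cut might look like, the following standalone sketch computes a weighted Pearson coefficient. That PyBDT uses exactly this estimator is an assumption; weighted_correlation and its keep argument (the boolean result of a cut expression) are hypothetical names.

```python
import math


def weighted_correlation(x, y, weights, keep=None):
    """Weighted Pearson correlation of x and y for events passing `keep` (hypothetical)."""
    if keep is None:
        keep = [True] * len(x)
    rows = [(a, b, w) for a, b, w, k in zip(x, y, weights, keep) if k]
    total = sum(w for _, _, w in rows)
    mean_x = sum(w * a for a, _, w in rows) / total
    mean_y = sum(w * b for _, b, w in rows) / total
    cov = sum(w * (a - mean_x) * (b - mean_y) for a, b, w in rows) / total
    var_x = sum(w * (a - mean_x) ** 2 for a, _, w in rows) / total
    var_y = sum(w * (b - mean_y) ** 2 for _, b, w in rows) / total
    return cov / math.sqrt(var_x * var_y)


energy = [1.0, 2.0, 3.0, 4.0]
weights = [1.0, 1.0, 2.0, 2.0]
# Perfectly correlated pair, all events kept.
r_all = weighted_correlation(energy, [2.0, 4.0, 6.0, 8.0], weights)
# The cut drops the last event, leaving a perfectly anti-correlated pair.
r_cut = weighted_correlation(
    energy, [8.0, 6.0, 4.0, 9.0], weights, keep=[True, True, True, False]
)
```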
- get_key_wkey(set_spec)¶
Get the key and weighting key for a given data set spec.
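A set spec is documented throughout this page as “str or tuple”. One way to read that, sketched below, is that a bare string names a data set with the ‘default’ weighting key (the default of wkey in add_weighting), while a tuple gives (key, wkey) explicitly. This interpretation is an assumption, and split_set_spec is a hypothetical stand-in, not the PyBDT implementation.

```python
def split_set_spec(set_spec, default_wkey="default"):
    """Return (key, wkey) for a str or (key, wkey) tuple (hypothetical helper)."""
    if isinstance(set_spec, str):
        # A bare key falls back to the default weighting.
        return set_spec, default_wkey
    key, wkey = set_spec
    return key, wkey


# "data", "sim" and "atmospheric" are made-up keys for illustration.
specs = ["data", ("sim", "atmospheric"), ("sim", "default")]
resolved = [split_set_spec(spec) for spec in specs]
```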
- get_kolmogorov_smirnov_probability(set_spec_1, set_spec_2, expr='scores', bins=1000)¶
Calculate the Kolmogorov-Smirnov p value for two distributions using kolmogorov_smirnov_probability().
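The bins argument suggests that the comparison is made on binned distributions. The sketch below shows one standard way to do this, as an assumption about the approach rather than a copy of kolmogorov_smirnov_probability(): build weighted cumulative histograms, take their maximum distance, and convert it to a p value with the asymptotic Kolmogorov series, using effective sample sizes for weighted events. All names are hypothetical.

```python
import math


def ks_probability(values1, weights1, values2, weights2, bins=1000):
    """Binned, weighted two-sample KS p value (hypothetical sketch)."""
    lo = min(min(values1), min(values2))
    hi = max(max(values1), max(values2))
    width = (hi - lo) / bins or 1.0  # guard against a zero-width range

    def binned_cdf(values, weights):
        hist = [0.0] * bins
        for v, w in zip(values, weights):
            hist[min(int((v - lo) / width), bins - 1)] += w
        total = sum(hist)
        running, cdf = 0.0, []
        for content in hist:
            running += content
            cdf.append(running / total)
        return cdf

    cdf1 = binned_cdf(values1, weights1)
    cdf2 = binned_cdf(values2, weights2)
    distance = max(abs(a - b) for a, b in zip(cdf1, cdf2))

    def effective_n(weights):
        # Effective number of events for a weighted sample.
        return sum(weights) ** 2 / sum(w * w for w in weights)

    n1, n2 = effective_n(weights1), effective_n(weights2)
    n_eff = n1 * n2 / (n1 + n2)
    lam = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * distance
    if lam < 0.2:
        return 1.0  # the series is numerically 1 here and converges slowly
    series = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam) for j in range(1, 101)
    )
    return min(1.0, max(0.0, 2.0 * series))


train = [i / 100 for i in range(100)]
shifted = [1.0 + i / 100 for i in range(100)]
ones = [1.0] * 100
p_same = ks_probability(train, ones, train, ones)
p_diff = ks_probability(train, ones, shifted, ones)
```

In an overtraining check, a small p value between training and testing score distributions would hint that the forest has learned features specific to the training sample.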
- get_range(set_specs, expr, cut=None, eval_names={})¶
Get the range of values of a variable (after transform) for the given datasets and weightings.
- Parameters:
set_specs (list) – The datasets and weightings (see Validator.get_key_wkey()).
expr (str) – An expression for Validator.eval() which returns a numerical array.
cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.
eval_names (dict) – Names to be passed to Validator.eval().
- Returns:
A (min_val, max_val) tuple.
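A shared range across several data sets is what lets their histograms use common bin edges. The following minimal sketch assumes the range is simply the overall minimum and maximum of the events that survive the cut; combined_range and keep_sets are hypothetical names, and lists stand in for NumPy arrays.

```python
def combined_range(value_sets, keep_sets=None):
    """Return (min_val, max_val) over several value lists (hypothetical helper)."""
    if keep_sets is None:
        keep_sets = [[True] * len(values) for values in value_sets]
    kept = [
        v
        for values, keep in zip(value_sets, keep_sets)
        for v, k in zip(values, keep)
        if k
    ]
    if not kept:
        raise ValueError("no events survive the cut")
    return min(kept), max(kept)


signal_scores = [0.2, 0.6, 0.9]
background_scores = [-0.8, -0.1, 0.3]
full = combined_range([signal_scores, background_scores])
after_cut = combined_range(
    [signal_scores, background_scores],
    keep_sets=[[True, True, False], [False, True, True]],
)
```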
- get_values_weights(set_spec, expr, cut=None, eval_names={})¶
Evaluate an expression, and get weights and scores.
- Parameters:
set_spec (str or tuple) – See Validator.get_key_wkey().
expr (str) – An expression for Validator.eval() which returns a numerical array.
cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.
eval_names (dict) – Names to be passed to Validator.eval() for the main expression or the cut expression.
- Returns:
A ([expression result], weights, scores) tuple
- property label¶
Mapping of keys to labels.
- load_all_data(dbg=False)¶
Load all data from disk into RAM.
- plot_variable(axes, expr, kind, left_set_specs, right_set_specs=[], twin_axes=None, data_mc=False, **kwargs)¶
Create a BDT score distribution, rate plot, or efficiency plot.
- Parameters:
axes (matplotlib.axes.Axes) – The Axes on which to draw the plot.
expr (str) – An expression for Validator.eval() which returns a numerical array.
kind (str) – One of ‘dist’, ‘rate’ or ‘eff’.
left_set_specs (list) – What to plot on the main y axis (see Validator.get_key_wkey()).
right_set_specs (list) – What to plot on the secondary y axis (see Validator.get_key_wkey()).
twin_axes (matplotlib.axes.Axes) – The secondary-y axes, if already created with axes.twinx().
data_mc (bool) – Plot ratio of given curves to total_mc.
If ‘total_mc’ is included in either left_set_specs or right_set_specs, then a total monte carlo line will be added.
The following additional kwargs are allowed.
- Parameters:
legend (bool or dict) – If True or non-empty dict, draw a legend. If dict, use as keyword arguments for matplotlib.axes.Axes.legend.
cut (str) – An expression for Validator.eval() which returns an array of bools, where True means “include this event”.
eval_names (dict) – Names to be passed to Validator.eval().
log (bool) – Whether to use a log-y scale.
normed (bool) – Whether to normalize the y axis histograms.
left_log (bool) – Whether to use a log-y scale on the main y axis.
left_normed (bool) – Whether to normalize the main y axis histograms.
right_log (bool) – Whether to use a log-y scale on the secondary y axis.
right_normed (bool) – Whether to normalize the secondary y axis histograms.
dbg (bool) – Whether to print debugging/logging information while plotting.
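The total monte carlo line and the data_mc ratio can be pictured as simple bin-wise operations. This sketch assumes that the total is the bin-by-bin sum of every weighting added with add_to_mc=True and that the ratio divides each curve by that total. The helpers, the histogram names, and the choice to report empty bins as None are all hypothetical.

```python
def total_mc(histograms, add_to_mc):
    """Sum the bin contents of every histogram flagged for the total (hypothetical)."""
    included = [bins for name, bins in histograms.items() if add_to_mc.get(name)]
    return [sum(column) for column in zip(*included)]


def ratio_to_total(hist, total):
    """Bin-wise ratio; empty total bins give None instead of dividing by zero."""
    return [h / t if t else None for h, t in zip(hist, total)]


# Made-up bin contents sharing the same three bins.
histograms = {
    "data": [10.0, 4.0, 0.0],
    "atmospheric": [8.0, 3.0, 0.0],
    "signal": [2.0, 1.0, 0.0],
}
add_to_mc = {"atmospheric": True, "signal": True}
total = total_mc(histograms, add_to_mc)
data_mc = ratio_to_total(histograms["data"], total)
```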
- property pscores¶
Mapping of keys to purity-based score arrays.
- property scores¶
Mapping of keys to score arrays.
- setup_total_mc(label='Total MC', **style_kwargs)¶
Set up total monte carlo plotting properties.
- Parameters:
label (str) – The label for total monte carlo lines.
Any additional arguments are passed to the histlite.Style constructor.
- property style¶
A mapping of set_specs to histlite.Style objects.
- property weight_label¶
Mapping of keys to mappings of weight keys to weight labels.
- property weights¶
Mapping of keys to mappings of weight keys to weight arrays.