Title: | Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE) |
---|---|
Description: | Accumulated Local Effects (ALE) were initially developed as a model-agnostic approach for global explanations of the results of black-box machine learning algorithms. ALE has a key advantage over other approaches like partial dependence plots (PDP) and SHapley Additive exPlanations (SHAP): its values represent a clean functional decomposition of the model. As such, ALE values are not affected by the presence or absence of interactions among variables in a model. Moreover, its computation is relatively rapid. This package reimplements the algorithms for calculating ALE data and develops highly interpretable visualizations for plotting these ALE values. It also extends the original ALE concept to add bootstrap-based confidence intervals and ALE-based statistics that can be used for statistical inference. For more details, see Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. <arXiv:2310.09877>. <doi:10.48550/arXiv.2310.09877>. |
Authors: | Chitu Okoli [aut, cre] |
Maintainer: | Chitu Okoli <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.0.20250217 |
Built: | 2025-02-19 10:31:55 UTC |
Source: | https://github.com/tripartio/ale |
Array or matrix addition in base R sets all sums in an element position to NA if any element in that position is NA in either of the arrays being added. In contrast, this function ignores NA values by default in its addition.
add_array_na.rm(ary1, ary2, na.rm = TRUE)
ary1, ary2 |
numeric arrays or matrices. The arrays to be added. They must be of the same dimensions. |
na.rm |
logical(1). TRUE (default) if missing values (NA) should be ignored in the addition. |
An array or matrix of the same dimensions as `ary1` and `ary2` whose values are the sums of `ary1` and `ary2` in each corresponding element.
Reduce(add_array_na.rm, list(x1, x2, x3))
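A minimal sketch of the difference in behaviour, assuming `add_array_na.rm()` behaves as described above:

```r
# Illustrative sketch, assuming add_array_na.rm() behaves as described above
m1 <- matrix(c(1, NA, 3, 4), nrow = 2)
m2 <- matrix(c(10, 20, NA, 40), nrow = 2)

m1 + m2                  # base R: any NA in a position yields NA in that position
add_array_na.rm(m1, m2)  # NA values are ignored in the addition by default
```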
An `ALE` S7 object contains ALE data and statistics. For details, see `vignette('ale-intro')` or the details and examples below.
ALE(
  model,
  x_cols = list(d1 = TRUE),
  data = NULL,
  y_col = NULL,
  ...,
  exclude_cols = NULL,
  parallel = future::availableCores(logical = FALSE, omit = 1),
  model_packages = NULL,
  output_stats = TRUE,
  output_conf = TRUE,
  output_boot_data = FALSE,
  pred_fun = function(object, newdata, type = pred_type) {
    stats::predict(object = object, newdata = newdata, type = type)
  },
  pred_type = "response",
  p_values = NULL,
  p_alpha = c(0.01, 0.05),
  max_num_bins = 10,
  boot_it = 0,
  boot_alpha = 0.05,
  boot_centre = "mean",
  seed = 0,
  y_type = NULL,
  median_band_pct = c(0.05, 0.5),
  sample_size = 500,
  .bins = NULL,
  silent = FALSE
)
model |
model object. Required. Model for which ALE should be calculated. May be any kind of R object that can make predictions from data. |
x_cols, exclude_cols |
character, list, or formula. Column names from `data` for which to calculate ALE (`x_cols`) or to exclude from the calculation (`exclude_cols`). |
data |
dataframe. Dataset from which to create predictions for the ALE. It should normally be the same dataset on which `model` was trained. |
y_col |
character(1). Name of the outcome target label (y) variable. If not provided, `ALE()` attempts to detect it automatically from the `model`. |
... |
not used. Inserted to require explicit naming of subsequent arguments. |
parallel |
non-negative integer(1). Number of parallel threads (workers or tasks) for parallel execution of the function. Set `parallel = 0` to disable parallel processing. See the "Parallel processing" section below for details. |
model_packages |
character. Character vector of names of packages that `model` depends on that might not be loaded automatically with parallel processing. See the "Parallel processing" section below for details. |
output_stats |
logical(1). If `TRUE` (default), return ALE statistics. |
output_conf |
logical(1). If `TRUE` (default), return confidence regions for the ALE statistics. |
output_boot_data |
logical(1). If `TRUE`, return the raw data from the bootstrapped ALE iterations (default `FALSE`). |
pred_fun, pred_type |
function, character(1). `pred_fun` is the function that generates predictions from `model`; `pred_type` is the type of prediction to generate, passed as the `type` argument of `pred_fun`. See the details below about custom prediction functions. |
p_values |
instructions for calculating p-values and to determine the median band. If `NULL` (default), no p-values are calculated and `median_band_pct` is used to determine the median band. To calculate p-values, provide an `ALEpDist` object. |
p_alpha |
numeric length 2 from 0 to 1. Alpha for "confidence interval" ranges for printing bands around the median for single-variable plots. These are the default values used if `p_values` is provided; if `p_values` is not provided, `median_band_pct` is used instead. |
max_num_bins |
positive integer(1). Maximum number of bins for numeric `x_cols` variables. |
boot_it |
non-negative integer(1). Number of bootstrap iterations for data-only bootstrapping on ALE data. This is appropriate for models that have been developed with cross-validation. For models that have not been validated, full-model bootstrapping should be used instead with the `ModelBoot` class. |
boot_alpha |
numeric(1) from 0 to 1. Alpha for percentile-based confidence interval range for the bootstrap intervals; the bootstrap confidence intervals will be the lowest and highest `(boot_alpha / 2)` percentiles of the bootstrapped values. For example, the default 0.05 gives 95% confidence intervals (from the 2.5th to the 97.5th percentile). |
boot_centre |
character(1) in c('mean', 'median'). When bootstrapping, the main estimate for the ALE y value is considered to be the `boot_centre` average of the bootstrapped values. |
seed |
integer(1). Random seed. Supply this between runs to assure that identical random ALE data is generated each time when bootstrapping. Without bootstrapping, ALE is a deterministic algorithm that should result in identical results each time regardless of the seed specified. |
y_type |
character(1) in c('binary', 'numeric', 'categorical', 'ordinal'). Datatype of the y (outcome) variable. Normally determined automatically; only provide if an error message for a complex non-standard model requires it. |
median_band_pct |
numeric length 2 from 0 to 1. Alpha for "confidence interval" ranges for printing bands around the median for single-variable plots. These are the default values used if `p_values` is not provided; if `p_values` is provided, `p_alpha` is used instead. |
sample_size |
non-negative integer(1). Size of the sample of `data` to be stored with the ALE object for use in plots. |
.bins |
Internal. List of ALE bin and n count vectors. If provided, these vectors will be used to set the intervals of the ALE x axis for each variable. By default (`NULL`), the bins are determined from `data`. |
silent |
logical(1), default `FALSE`. If `TRUE`, suppress progress bars and non-essential messages. |
An object of class `ALE` with properties `distinct` and `params`.
Stores the optional ALE data, ALE statistics, and bootstrap data for one or more categories.
The parameters used to calculate the ALE data. These include most of the arguments used to construct the `ALE` object. These are either the values provided by the user or the defaults if the user did not change them, but they also include several objects that are created within the constructor. These extra objects are described here, as well as those parameters that are stored differently from the form in the arguments:
* `max_d`: the highest dimension of ALE data present. If only 1D ALE is present, then `max_d == 1`. If even one 2D ALE element is present (even with no 1D), then `max_d == 2`.
* `requested_x_cols`, `ordered_x_cols`: `requested_x_cols` is the resolved list of `x_cols` as requested by the user (that is, `x_cols` minus `exclude_cols`). `ordered_x_cols` is the same set of `x_cols` but arranged in the internal storage order.
* `y_cats`: categories for categorical classification models. For non-categorical models, this is the same as `y_col`.
* `y_type`: high-level datatype of the y outcome variable.
* `y_summary`: summary statistics of y values used for the ALE calculation. These statistics are based on the actual values of `y_col` unless `y_type` is a probability or other value constrained to the `[0, 1]` range. In that case, `y_summary` is based on the values of `y_col` predicted by applying `model` to `data`. `y_summary` is a named numeric vector. Most of the elements are percentiles of the y values. E.g., the '5%' element is the 5th percentile of y values. The following elements have special meanings:
  * The first element is named either `p` or `q` and its value is always 0. The value is not used; only the name of the element is meaningful. `p` means that the following special `y_summary` elements are based on the provided `ALEpDist` object. `q` means that quantiles were calculated based on `median_band_pct` because `p_values` was not provided.
  * `min`, `mean`, `max`: the minimum, mean, and maximum y values, respectively. Note that the median is `50%`, the 50th percentile.
  * `med_lo_2`, `med_lo`, `med_hi`, `med_hi_2`: `med_lo` and `med_hi` are the inner lower and upper confidence intervals of y values with respect to the median (`50%`); `med_lo_2` and `med_hi_2` are the outer confidence intervals. See the documentation for the `p_alpha` and `median_band_pct` arguments to understand how these are determined.
* `model`: same as `ALE@params$model` (see documentation there).
* `data`: same as `ALE@params$data` (see documentation there).
The calculation of ALE requires modifying several values of the original `data`. Thus, `ALE()` needs direct access to the `predict` function for the `model`. By default, `ALE()` uses a generic default `predict` function of the form `predict(object, newdata, type)` with the default prediction type of `'response'`. If, however, the desired prediction values are not generated with that format, the user must specify what they want. Very often, the only modification needed is to change the prediction type to some other value by setting the `pred_type` argument (e.g., to `'prob'` to generate classification probabilities). But if the desired predictions need a different function signature, then the user must create a custom prediction function and pass it to `pred_fun`. The requirements for this custom function are:

It must take three required arguments and nothing else:

* `object`: a model
* `newdata`: a dataframe or compatible table type such as a tibble or data.table
* `type`: a string; it should usually be specified as `type = pred_type`

These argument names follow the R convention for the generic `stats::predict()` function.

It must return a vector or matrix of numeric values as the prediction.

You can see an example below of a custom prediction function.

Note: `survival` models probably do not need a custom prediction function, but `y_col` must be set to the name of the binary event column and `pred_type` must be set to the desired prediction type.

For details about the ALE-based statistics (ALED, ALER, NALED, and NALER), see `vignette('ale-statistics')`.
Parallel processing using the `{furrr}` framework is enabled by default. By default, it will use all the available physical CPU cores (minus the core being used for the current R session) with the setting `parallel = future::availableCores(logical = FALSE, omit = 1)`. Note that only physical cores are used (not logical cores or "hyperthreading") because machine learning can only take advantage of the floating-point processors on physical cores, which are absent from logical cores. Trying to use logical cores will not speed up processing and might actually slow it down with useless data transfer. If you will dedicate the entire computer to running this function (and you don't mind everything else becoming very slow while it runs), you may use all cores by setting `parallel = future::availableCores(logical = FALSE)`. To disable parallel processing, set `parallel = 0`.
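The three parallelization settings just described can be summarized as follows (`model` is a placeholder for any fitted model object supported by `ALE()`):

```r
# 'model' is a placeholder for any fitted model object supported by ALE()
ale_default <- ALE(model, parallel = future::availableCores(logical = FALSE, omit = 1))
ale_allcore <- ALE(model, parallel = future::availableCores(logical = FALSE))
ale_serial  <- ALE(model, parallel = 0)  # disable parallel processing
```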
The `{ale}` package should be able to automatically recognize and load most packages that are needed, but with parallel processing enabled (which is the default), some packages might not be properly loaded. This problem might be indicated by a strange error message that mentions "progress interrupted" or "future", especially if such errors appear after the progress bars begin displaying (assuming you did not disable progress bars with `silent = TRUE`). In that case, first try disabling parallel processing with `parallel = 0`. If that resolves the problem, then to get faster parallel processing to work, try adding the package names needed for the `model` to this argument, e.g., `model_packages = c('tidymodels', 'mgcv')`.
Progress bars are implemented with the `{progressr}` package, which lets the user fully control progress bars. To disable progress bars, set `silent = TRUE`. The first time a function that requires progress bars is called in the `{ale}` package, it checks whether the user has activated the necessary `{progressr}` settings. If not, the `{ale}` package automatically enables `{progressr}` progress bars with the `cli` handler and prints a message notifying the user.

If you like the default progress bars and want to make them permanent, you can add the following lines of code to your `.Rprofile` configuration file and they will become your defaults for every R session; you will not see the message again:
progressr::handlers(global = TRUE)
progressr::handlers('cli')
This applies not only to the `{ale}` package but to any package that uses `{progressr}` progress bars. For more details on formatting progress bars to your liking, see the introduction to the `{progressr}` package.
Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.
# Sample 1000 rows from the ggplot2::diamonds dataset (for a simple example)
set.seed(0)
diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ]

# Create a GAM model with flexible curves to predict diamond price
# Smooth all numeric variables and include all other variables
gam_diamonds <- mgcv::gam(
  price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
    cut + color + clarity,
  data = diamonds_sample
)
summary(gam_diamonds)

# Simple ALE without bootstrapping
ale_gam_diamonds <- ALE(gam_diamonds)

# Plot the ALE data
plot(ale_gam_diamonds)

# Bootstrapped ALE
# This can be slow, since bootstrapping runs the algorithm boot_it times
# Create ALE with 100 bootstrap samples
ale_gam_diamonds_boot <- ALE(
  gam_diamonds,
  boot_it = 100
)

# Bootstrapped ALEs print with confidence intervals
plot(ale_gam_diamonds_boot)

# If the predict function you want is non-standard, you may define a
# custom predict function. It must return a single numeric vector.
custom_predict <- function(object, newdata, type = pred_type) {
  predict(object, newdata, type = type, se.fit = TRUE)$fit
}

ale_gam_diamonds_custom <- ALE(
  gam_diamonds,
  pred_fun = custom_predict,
  pred_type = 'link'
)

# Plot the ALE data
plot(ale_gam_diamonds_custom)

# How to retrieve specific types of ALE data from an ALE object.
ale_diamonds_with__boot_data <- ALE(
  gam_diamonds,
  # For detailed options for x_cols, see examples at resolve_x_cols()
  x_cols = ~ carat + cut + clarity + color:depth + x:y,
  output_boot_data = TRUE,
  boot_it = 10  # just for demonstration
)

# See ?get.ALE for details on the various kinds of data that may be retrieved.
get(ale_diamonds_with__boot_data, ~ carat + color:depth)  # default ALE data
get(ale_diamonds_with__boot_data, what = 'boot_data')
get(ale_diamonds_with__boot_data, stats = 'estimate')
get(ale_diamonds_with__boot_data, stats = 'aled')
get(ale_diamonds_with__boot_data, stats = 'all')
get(ale_diamonds_with__boot_data, stats = 'conf_regions')
get(ale_diamonds_with__boot_data, stats = 'conf_sig')
Not exported. The following statistics are calculated based on a vector of ALE y values:
ale_stats(y, bin_n, y_vals = NULL, ale_y_norm_fun = NULL, x_type = "numeric")
y |
numeric. Vector of ALE y values. |
bin_n |
numeric. Vector of counts of rows in each ALE bin. Must be the same length as `y`. |
y_vals |
numeric. Entire vector of y values. Needed for normalization. If not provided, ale_y_norm_fun must be provided. |
ale_y_norm_fun |
function. A precomputed function to normalize ALE y values. If not provided, `y_vals` must be provided. |
x_type |
character(1). Datatype of the x variable on which the ALE y is based (default `'numeric'`). |
* ALE deviation (ALED)
* ALE range (ALER): range from the minimum value of any ALE y to the maximum value of any ALE y. This is a very simple indication of the dispersion in ALE y values.
* Normalized ALE deviation (NALED)
* Normalized ALE range (NALER)

Note that if any ALE y values are missing, they will be deleted from the calculation (with their corresponding `bin_n`).
Named numeric vector:

* `aled`: ALE deviation (ALED)
* `aler_min`: Minimum (lower value) of the ALE range (ALER)
* `aler_max`: Maximum (upper value) of the ALE range (ALER)
* `naled`: Normalized ALE deviation (NALED)
* `naler_min`: Normalized minimum (lower value) of the ALE range (NALER)
* `naler_max`: Normalized maximum (upper value) of the ALE range (NALER)
When calculating second-order (2D) ALE statistics, there is no difficulty if both variables are categorical: the regular formulas for ALE operate normally. However, if one or both variables are numeric, the calculation is complicated by the need to determine the ALE midpoints between the ALE bin ceilings of the numeric variables. This function calculates these ALE midpoints for the numeric variables and resets the ALE bins to these values. The ALE values for ordinal variables are not changed. As part of the adjustment, the lowest numeric bin is merged into the second: its ALE values are deleted entirely (since they do not represent a midpoint) and its counts are added to the first true bin.
ale_stats_2D(ale_data, x_cols, x_types, y_vals = NULL, ale_y_norm_fun = NULL)
ale_data |
dataframe. ALE data from which to calculate the statistics. |
x_cols |
character. Names of the x columns in `ale_data`. |
x_types |
character, same length as `x_cols`. Datatypes of the x columns. |
y_vals |
See documentation for `ale_stats()`. |
ale_y_norm_fun |
See documentation for `ale_stats()`. |
After these possible adjustments, the ALE y values and bin counts are passed to `ale_stats()`, which calculates their statistics as an ordinal variable since the numeric variables have thus been discretized.

Not exported.

Same as `ale_stats()`.
ALE statistics are accompanied by two indicators of the confidence of their values. First, bootstrapping creates confidence intervals for ALE measures and ALE statistics to give a range of likely values. Second, we calculate p-values, an indicator of the probability that a given ALE statistic is random. Calculating p-values is not trivial for ALE statistics because ALE is non-parametric and model-agnostic. Because ALE is non-parametric (that is, it does not assume any particular distribution of data), the `{ale}` package generates p-values by calculating ALE for many random variables; this makes the procedure somewhat slow. For this reason, they are not calculated by default; they must be explicitly requested. Because the `{ale}` package is model-agnostic (that is, it works with any kind of R model), the `ALE()` constructor cannot always automatically manipulate the model object to create the p-values. It can only do so for models that follow the standard R statistical modelling conventions, which include almost all built-in R algorithms (like `stats::lm()` and `stats::glm()`) and many widely used statistics packages (like `mgcv` and `survival`), but which exclude most machine learning algorithms (like `tidymodels` and `caret`). For non-standard algorithms, the user needs to do a little work to help the constructor correctly manipulate its model object. The full model call must be passed as a character string in the argument `random_model_call_string`, with two slight modifications:

* In the formula that specifies the model, add a variable named 'random_variable'. This corresponds to the random variables that the constructor will use to estimate p-values.
* The dataset on which the model is trained must be named 'rand_data'. This corresponds to the modified datasets that will be used to train the random variables.

See the example below for how this is implemented.
ALEpDist(
  model,
  data = NULL,
  p_speed = "approx fast",
  ...,
  parallel = future::availableCores(logical = FALSE, omit = 1),
  model_packages = NULL,
  random_model_call_string = NULL,
  random_model_call_string_vars = character(),
  y_col = NULL,
  binary_true_value = TRUE,
  pred_fun = function(object, newdata, type = pred_type) {
    stats::predict(object = object, newdata = newdata, type = type)
  },
  pred_type = "response",
  output_residuals = FALSE,
  rand_it = 1000,
  seed = 0,
  silent = FALSE,
  .testing_mode = FALSE
)
model |
See documentation for `ALE()`. |
data |
See documentation for `ALE()`. |
p_speed |
character(1). Either 'approx fast' (default) or 'precise slow'. See details. |
... |
not used. Inserted to require explicit naming of subsequent arguments. |
parallel |
See documentation for `ALE()`. |
model_packages |
See documentation for `ALE()`. |
random_model_call_string |
character(1). If the constructor cannot automatically manipulate `model` to retrain it with random variables (as with most machine learning algorithms), provide the full model call as a character string, modified as described in the details above. |
random_model_call_string_vars |
See documentation for `ModelBoot()`. |
y_col |
See documentation for `ALE()`. |
binary_true_value |
See documentation for `ModelBoot()`. |
pred_fun, pred_type |
See documentation for `ALE()`. |
output_residuals |
logical(1). If `TRUE`, return the residuals of the `model` predictions on `data` (default `FALSE`). See the description of the return value below. |
rand_it |
non-negative integer(1). Number of times that the model should be retrained with a new random variable. The default of 1000 should give reasonably stable p-values; these are considered "exact" p-values. It can be reduced as low as 100 for faster "approximate" p-values in test runs, but then the p-values are not as stable. |
seed |
See documentation for `ALE()`. |
silent |
See documentation for `ALE()`. |
.testing_mode |
logical(1). Internal use only. Disables some data validation checks to allow for debugging. |
An object of class `ALEpDist` with properties `rand_stats`, `residual_distribution`, `rand_it_ok`, and `residuals`.
* `rand_it_ok`: An integer with the number of `rand_it` iterations that successfully generated a random variable, that is, those that did not fail for whatever reason. The `rand_it - rand_it_ok` failed attempts are discarded.
* `rand_stats`: A named list of tibbles. There is normally one element whose name is the same as `y_col` except if `y_col` is a categorical variable; in that case, the elements are named for each category of `y_col`. Each element is a tibble whose rows are each of the `rand_it_ok` iterations of the random variable analysis and whose columns are the ALE statistics obtained for each random variable.
* `residual_distribution`: A `univariateML` object with the closest estimated distribution for the residuals as determined by `univariateML::model_select()`. This is the distribution used to generate all the random variables.
* `residuals`: If `output_residuals == TRUE`, returns a matrix of the actual `y_col` values from `data` minus the values predicted by the `model` (without random variables) on the `data`. If `output_residuals == FALSE` (default), does not return these residuals. The rows correspond to each row of `data`. The columns correspond to the named elements (`y_col` or categories) described above for `rand_stats`.
The `{ale}` package takes a literal frequentist approach to the calculation of p-values. That is, it literally retrains the model 1000 times, each time modifying it by adding a distinct random variable to the model. (The number of iterations is customizable with the `rand_it` argument.) The ALEs and ALE statistics are calculated for each random variable. The percentiles of the distribution of these random-variable ALEs are then used to determine p-values for non-random variables. Thus, p-values are interpreted as the frequency of random-variable ALE statistics that exceed the value of the ALE statistic of the actual variable in question. The specific steps are as follows:

* The residuals of the original model trained on the training data are calculated (residuals are the actual y target values minus the predicted values).
* The closest distribution of the residuals is detected with `univariateML::model_select()`.
* 1000 new models are trained by generating a random variable each time with `univariateML::rml()` and then training a new model with that random variable added.
* The ALEs and ALE statistics are calculated for each random variable.
* For each ALE statistic, the empirical cumulative distribution function (from `stats::ecdf()`) is used to create a function to determine p-values according to the distribution of the random variables' ALE statistics.

What we have just described is the precise approach to calculating p-values, requested with the argument `p_speed = 'precise slow'`. Because it is so slow, the `ALEpDist` constructor by default implements an approximate algorithm (`p_speed = 'approx fast'`) that trains only a few random variables, up to the number of physical parallel processing threads available, with a minimum of four. To increase speed, the random variables use only 10 ALE bins instead of the default 100. Although approximate p-values are much faster than precise ones, they are still somewhat slow: at the very quickest, they take at least the amount of time that it would take to train the original model two or three times. See the "Parallel processing" section below for more details on the speed of computation.
Parallel processing using the `{furrr}` framework is enabled by default. By default, it will use all the available physical CPU cores (minus the core being used for the current R session) with the setting `parallel = future::availableCores(logical = FALSE, omit = 1)`. Note that only physical cores are used (not logical cores or "hyperthreading") because machine learning can only take advantage of the floating-point processors on physical cores, which are absent from logical cores. Trying to use logical cores will not speed up processing and might actually slow it down with useless data transfer.

For exact p-values, by default 1000 random variables are trained. So, even with parallel processing, the procedure is very slow. However, an `ALEpDist` object trained with a specific model on a specific dataset can be reused as often as needed for the identical model-dataset pair.

For approximate p-values (the default), at least four random variables are trained to give some minimal variation. With parallel processing, more random variables can be trained to increase the accuracy of the p-value estimates, up to the maximum number of physical cores.
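The two `p_speed` modes and the reuse of a fitted distribution object can be sketched as follows (`model` and `dat` are placeholders for a fitted model and its training dataset):

```r
# 'model' and 'dat' are placeholders for a fitted model and its dataset
pd_fast    <- ALEpDist(model, dat)                            # 'approx fast' default
pd_precise <- ALEpDist(model, dat, p_speed = 'precise slow')  # ~1000 retrainings

# A fitted ALEpDist object can be reused for any number of ALE analyses
# on the identical model-dataset pair:
ale_with_p <- ALE(model, data = dat, p_values = pd_precise)
```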
Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.
# Sample 1000 rows from the ggplot2::diamonds dataset (for a simple example)
set.seed(0)
diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ]

# Create a GAM with flexible curves to predict diamond price
# Smooth all numeric variables and include all other variables
gam_diamonds <- mgcv::gam(
  price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
    cut + color + clarity,
  data = diamonds_sample
)
summary(gam_diamonds)

# Create p_value distribution
pd_diamonds <- ALEpDist(
  gam_diamonds,
  diamonds_sample,
  # only 100 iterations for a quick demo; but usually should remain at 1000
  rand_it = 100
)

# Examine the structure of the returned object
str(pd_diamonds)
# In RStudio: View(pd_diamonds)

# Calculate ALEs with p-values
ale_gam_diamonds <- ALE(
  gam_diamonds,
  p_values = pd_diamonds
)

# Plot the ALE data. The horizontal bands in the plots use the p-values.
plot(ale_gam_diamonds)

# For non-standard models that give errors with the default settings,
# you can use 'random_model_call_string' to specify a model for the estimation
# of p-values from random variables as in this example.
# See details above for an explanation.
pd_diamonds <- ALEpDist(
  gam_diamonds,
  diamonds_sample,
  random_model_call_string = 'mgcv::gam(
    price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
      cut + color + clarity + random_variable,
    data = rand_data
  )',
  # only 100 iterations for a quick demo; but usually should remain at 1000
  rand_it = 100
)
An `ALEPlots` S7 object contains the ALE plots from `ALE` or `ModelBoot` objects, stored as `ggplot` objects. The `ALEPlots` constructor creates all possible plots from the `ALE` or `ModelBoot` object passed to it: not only individual 1D and 2D ALE plots, but also special plots like the ALE effects plot. So, an `ALEPlots` object is a collection of plots, almost never a single plot. To retrieve specific plots, use the `get.ALEPlots()` method; see the examples there or with the `ALE` and `ModelBoot` objects.
ALEPlots(
  obj,
  ...,
  relative_y = "median",
  p_alpha = c(0.01, 0.05),
  median_band_pct = c(0.05, 0.5),
  rug_sample_size = obj@params$sample_size,
  min_rug_per_interval = 1,
  n_x1_bins = NULL,
  n_x2_bins = NULL,
  n_y_quant = 10,
  seed = 0,
  silent = FALSE
)
obj |
|
... |
not used. Inserted to require explicit naming of subsequent arguments. |
relative_y |
character(1) in c('median', 'mean', 'zero'). The ALE y values in the plots will be adjusted relative to this value. 'median' is the default. 'zero' will maintain the actual ALE values, which are relative to zero. |
p_alpha |
numeric(2) from 0 to 1. Alpha for "confidence interval" ranges for printing bands around the median for single-variable plots. These are the default values used if |
median_band_pct |
numeric length 2 from 0 to 1. Alpha for "confidence interval" ranges for printing bands around the median for single-variable plots. These are the default values used if |
rug_sample_size , min_rug_per_interval
|
non-negative integer(1). Rug plots are down-sampled to |
n_x1_bins , n_x2_bins
|
positive integer(1). Number of bins for the x1 or x2 axes respectively for the 2D interaction plot. These values are ignored if x1 or x2 are not numeric (i.e., if they are logical or factors). |
n_y_quant |
positive integer(1). Number of intervals over which the range of y values is divided for the colour bands of the interaction plot. See details. |
seed |
See documentation for |
silent |
See documentation for |
An object of class ALEPlots
with properties distinct
and params
.
Stores the ALE plots. Use get.ALEPlots()
to access them.
The parameters used to calculate the ALE plots. These include most of the arguments used to construct the ALEPlots
object. These are either the values provided by the user or the defaults if the user did not change them, but several objects created within the constructor are also included. These extra objects are described here, as well as those parameters that are stored differently from the form of the arguments:
* `max_d`: See documentation for [ALE()]
* `requested_x_cols`: See documentation for [ALE()]. Note, however, that `ALEPlots` does not store `ordered_x_cols`.
* `y_col`: See documentation for [ALE()]
For the 2D interaction plots, n_y_quant
is the number of quantiles into which to divide the predicted variable (y). The middle quantiles are grouped specially:
The middle quantile is the first confidence interval of median_band_pct
(median_band_pct[1]
) around the median.
This middle quantile is special because it generally represents no meaningful interaction.
The quantiles above and below the middle are extended from the borders of the middle quantile to the regular borders of the other quantiles.
There will always be an odd number of quantiles: the special middle quantile plus an equal number of quantiles on each side of it. If n_y_quant
is even, then a middle quantile will be added to it. If n_y_quant
is odd, then the number specified will be used, including the middle quantile.
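The odd-count adjustment can be sketched as follows (this is an assumption about the internal rule, stated directly from the description above, not the packaged implementation):

```r
# If n_y_quant is even, a middle quantile is added to make the count odd
n_y_quant <- 10
if (n_y_quant %% 2 == 0) {
  n_y_quant <- n_y_quant + 1
}
n_y_quant  # 11: the special middle quantile plus five quantiles on each side
```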
# See examples with ALE and ModelBoot objects
This function is not exported. It is a complete reimplementation of the ALE algorithm relative to the reference in ALEPlot::ALEPlot()
. In addition to adding bootstrapping and handling of categorical y variables, it reimplements categorical x interactions.
calc_ale(
  data,
  model,
  x_cols,
  y_col,
  y_cats,
  pred_fun,
  pred_type,
  max_num_bins,
  boot_it,
  seed,
  boot_alpha,
  boot_centre,
  boot_ale_y = FALSE,
  .bins = NULL,
  ale_y_norm_funs = NULL,
  p_dist = NULL
)
data |
See documentation for |
model |
See documentation for |
x_cols |
character(1 or 2). Names of columns in X for which ALE data is to be calculated. Length 1 for 1D ALE and length 2 for 2D ALE. |
y_col |
character(1). Name of the target y column. |
y_cats |
character. The categories of y. For most cases with non-categorical y, |
pred_fun |
See documentation for |
pred_type |
See documentation for |
max_num_bins |
See documentation for |
boot_it |
See documentation for |
seed |
See documentation for |
boot_alpha |
See documentation for |
boot_centre |
See documentation for |
boot_ale_y |
logical(1). If |
.bins |
See documentation for |
ale_y_norm_funs |
list of functions. Custom functions for normalizing ALE y for statistics. It is usually a list(1), but for categorical y, there is a distinct function for each y category. If provided, ale_y_norm_funs saves some time since it is usually the same for all variables throughout one call to |
p_dist |
See documentation for |
For details about arguments not documented here, see ALE()
.
Apley, Daniel W., and Jingyu Zhu. "Visualizing the effects of predictor variables in black box supervised learning models." Journal of the Royal Statistical Society Series B: Statistical Methodology 82.4 (2020): 1059-1086.
Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. doi:10.48550/arXiv.2310.09877.
Currently assumes that the result object will have only one class.
cast(x, new_cls)
x |
An R object |
new_cls |
character(1). A single class to which to convert |
x
converted to class new_cls
.
Census data that indicates, among other details, if the respondent's income exceeds $50,000 per year. Also known as "Adult" dataset.
census
census
A tibble with 32,561 rows and 15 columns:
TRUE if income > $50,000
continuous
Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
continuous. "A proxy for the demographic background of the people: 'People with similar demographic characteristics should have similar weights'" For more details, see https://www.openml.org/search?type=data&id=1590.
Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
continuous
Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
Female, Male
continuous
continuous
continuous
United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinidad&Tobago, Peru, Hong, Holland-Netherlands
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Becker,Barry and Kohavi,Ronny. (1996). Adult. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.
Adaptation of base::colSums()
that, when all values in a column are NA, sets the sum to NA rather than zero as base::colSums()
does. Calls base::colSums()
internally.
col_sums(mx, na.rm = FALSE, dims = 1)
mx |
numeric matrix |
na.rm |
logical(1). TRUE if missing values ( |
dims |
See documentation for |
numeric vector whose length is the number of columns of mx
, whose values are the sums of each column of mx
.
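The behaviour can be illustrated with a minimal base-R sketch (`col_sums_sketch()` here is a hypothetical illustration of the idea, not the packaged internal function):

```r
# Like base::colSums(), but all-NA columns sum to NA instead of 0
col_sums_sketch <- function(mx, na.rm = FALSE, dims = 1) {
  sums <- colSums(mx, na.rm = na.rm, dims = dims)
  # Identify columns with no non-missing data at all
  all_na <- colSums(!is.na(mx), dims = dims) == 0
  sums[all_na] <- NA
  sums
}

mx <- matrix(c(1, 2, NA, NA, 3, NA), nrow = 2)  # columns: (1,2), (NA,NA), (3,NA)
col_sums_sketch(mx, na.rm = TRUE)  # 3, NA, 3 -- base colSums() would give 3, 0, 3
```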
Extracts all diagonals from a matrix in the NWSE direction (upper left down to lower right).
extract_2D_diags(mx)
mx |
matrix |
A list whose elements each represent one diagonal of mx
. Each diagonal element is a list of two elements: coords
is a numeric vector pair of row-column coordinates; values
is the value of the diagonal at the coordinate given by coords
.
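The idea can be sketched in base R (`extract_2D_diags_sketch()` is a hypothetical illustration; the packaged internal function may differ in detail):

```r
# Extract all NWSE diagonals (upper left down to lower right) of a matrix
extract_2D_diags_sketch <- function(mx) {
  nr <- nrow(mx)
  nc <- ncol(mx)
  # Every NWSE diagonal starts on the top row or in the first column
  starts <- rbind(
    cbind(row = 1L, col = seq_len(nc)),
    if (nr > 1) cbind(row = 2:nr, col = 1L)
  )
  lapply(seq_len(nrow(starts)), function(i) {
    s <- starts[i, ]
    len <- min(nr - s[1], nc - s[2]) + 1
    coords <- cbind(s[1] + seq_len(len) - 1L, s[2] + seq_len(len) - 1L)
    list(coords = coords, values = mx[coords])
  })
}

mx <- matrix(1:6, nrow = 2)  # 2 x 3 matrix
diags <- extract_2D_diags_sketch(mx)
# Four diagonals; the main diagonal starts at (1,1) with values 1, 4
```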
Extracts all diagonals from a 3D array in the FNWBSE direction (front upper left down to back lower right).
extract_3D_diags(ray)
ray |
a 3-dimensional array |
A list whose elements each represent one diagonal of ray
. Each diagonal element is a list of two elements: origin
is the 3D coordinates (row, column, depth) of the first element of the diagonal; values
is a vector of the diagonal that starts from origin
.
Retrieve specific data elements from an object based on their X column names.
If obj
is not an object from the ale
package, then this generic passes on all arguments to the base::get()
function.
get(obj, ...)
obj |
object. |
... |
For ale package objects, instructions for which predictor (x) columns should be retrieved. For everything else, arguments to pass to |
Retrieve specific elements from an ALE
object.
obj |
ALE object from which to retrieve elements. |
x_cols , exclude_cols
|
character, list, or formula. Columns names and interaction terms from |
what |
character(1). What kind of output is requested. Must be one (and only one) of |
... |
not used. Inserted to require explicit naming of subsequent arguments. |
stats |
character(1). Retrieve statistics. If |
cats |
character. Optional category names to retrieve if the ALE is for a categorical y outcome model. |
simplify |
logical(1). If |
silent |
See documentation for |
Regardless of the requested data, all get.ALE()
results have a common structure:
If more than one category of the y outcome is returned, then the top level is a list named by each category. However, if the y outcome is not categorical or if only one category of multiple possibilities is specified using the cats
argument, then the top level never has categories, regardless of the value of simplify
.
The next level (or top level if there are zero or one category) is a list with one or two levels:
d1
: 1D ALE elements.
d2
: 2D ALE elements.
However, if elements of only one dimension (either 1D or 2D) are requested and simplify = TRUE
(default), the empty list is eliminated and the level is skipped to provide only the elements present. For example, if only 1D ALE data is requested, then there will be no d1
sublist but only a list of the ALE data as described for the next level. If simplify = FALSE
, both d1
and d2
sublists will always be returned; the empty sublist will be NULL
.
For the d1
sublist, there is no further hierarchy: the returned data is as described below. For the d2
sublist representing var1
by var2
interactions, the next level is a named list of all var1
elements. This level in turn consists of a named list of var2
elements with the actual data as described below. For example, for ALE data, result <- get(obj, list(d1 = "var1", d2 = list(c("var1", "var2"))))
would return result
as a list where the 1D ALE data is in result$d1$var1
and the 2D ALE data in result$d2$var1$var2
.
While all results follow the general structure just described, the specific type of data returned depends on the values of the what
and stats
arguments:
what = 'ale'
(default) and stats = NULL
(default)A list whose elements, named by each requested x variable, are each a tibble. The rows each represent one ALE bin. The tibble has the following columns:
* var.bin
or var.ceil
where var
is the name of a variable (column): For non-numeric x, var.bin
is the value of each of the ALE categories. For numeric x, var.ceil
is the value of the upper bound (ceiling) of each ALE bin. The first "bin" of numeric variables represents the minimum value. For 2D ALE with an var1
by var2
interaction, both var1.bin
and var2.bin
columns are returned (or var1.ceil
or var2.ceil
for numeric var1
or var2
).
* .n
: the number of rows of data in each bin represented by var.bin
or var.ceil
. For numeric x, the first bin contains all data elements that have exactly the minimum value of x. This is often 1, but might be more than 1 if more than one data element has exactly the minimum value.
* .y
: the ALE function value calculated for that bin. For bootstrapped ALE, this is the same as .y_mean
by default or .y_median
if boot_centre = 'median'
. Regardless, both .y_mean
and .y_median
are returned as columns here.
* .y_lo
, .y_hi
: the lower and upper confidence intervals, respectively, for the bootstrapped .y
value based on the boot_alpha
argument in the ALE()
constructor.
what = 'boot_data'
and stats = NULL
(default)A list whose elements, named by each requested x variable, are each a tibble. These are the data from which .y_mean
, .y_median
, .y_lo
, and .y_hi
are summarized when what = 'ale'
. The rows each represent one ALE bin for a specified bootstrap iteration. The tibble has the following columns:
* .it
: The bootstrap iteration. Iteration 0 represents the ALE calculations on the full dataset; the remaining values of .it
are from 1 to boot_it
(number of bootstrap iterations specified in the ALE()
constructor).
* var
where var
is the name of a variable (column): For non-numeric x, var
is the value of each of the ALE categories. For numeric x, var
is the value of the upper bound (ceiling) of each ALE bin. They are otherwise similar to their meanings described for what = 'ale'
above.
* .n
and .y
: Same as for what = 'ale'
.
what = 'ale'
(default) and stats = 'estimate'
A list with elements d1
and d2
with the value of each ALE statistic. Each row represents one variable or interaction. The tibble has the following columns:
* term
or term1
and term2
: The variable or column for the 1D (term
) or 2D (term1
by term2
) ALE statistic.
* aled
, aler_min
, aler_max
, naled
, naler_min
, naler_max
: the respective ALE statistic for the variable or interaction.
what = 'ale'
(default) and stats
is one value in c('aled', 'aler_min', 'aler_max', 'naled', 'naler_min', 'naler_max')
A list with elements d1
and d2
with the distribution value of the single requested ALE statistic. Each element d1
and d2
is a tibble. Each row represents one variable or interaction. The tibble has the following columns:
* term
or term1
and term2
: Same as for stats = 'estimate'
.
* estimate
, mean
, median
: The average of the bootstrapped value of the requested statistic. estimate
is equal to either mean
or median
depending on the boot_centre
argument in the ALE()
constructor. If ALE is not bootstrapped, then estimate
, mean
, and median
are equal.
* conf.low
, conf.high
: the lower and upper confidence intervals, respectively, for the bootstrapped statistic based on the boot_alpha
argument in the ALE()
constructor. If ALE is not bootstrapped, then estimate
, conf.low
, and conf.high
are equal.
what = 'ale'
(default) and stats = 'all'
A list with elements d1
and d2
with the distribution values of all available ALE statistics for the requested variables and interactions. Whereas the stats = 'aled'
(for example) format returns data for a single statistic, stats = 'all'
returns all statistics for the requested variables. Each element is a list with the requested d1
and d2
sub-elements as described in the general structure above. Each data element is a tibble. Each row represents one ALE statistic. The tibble has the following columns:
* term
or term1
and term2
: Same as for stats = 'estimate'
.
* estimate
, mean
, median
: Same as above for individual ALE statistics.
* conf.low
, conf.high
: Same as above for individual ALE statistics.
what = 'ale'
(default) and stats = 'conf_regions'
A list with elements d1
and d2
with the confidence regions for the requested variables and interactions. Each element is a list with the requested d1
and d2
sub-elements as described in the general structure above. Each data element is a tibble with confidence regions for a single variable or interaction. For an explanation of the columns, see vignette('ale-statistics')
.
what = 'ale'
(default) and stats = 'conf_sig'
A list with elements d1
and d2
filtered for the statistically significant confidence regions for the requested variables and interactions. Each element is a tibble. Each row is a single statistically significant confidence region. If there are no statistically significant confidence regions at all, an empty tibble is returned. For an explanation of the columns, see vignette('ale-statistics')
.
# See examples at ALE() for a demonstration of the get() method.
Retrieve specific plots from a ALEPlots
object.
See get.ALE()
for explanation of parameters not described here.
obj |
ALEPlots object from which to retrieve ALE elements. |
type |
character(1). What type of ALEPlots to retrieve: |
See get.ALE()
Retrieve specific ALE elements from a ModelBoot
object. This method is similar to get.ALE()
except that the user may specify what type
of ALE data to retrieve (see the argument definition for details).
See get.ALE()
for explanation of parameters not described here.
obj |
ModelBoot object from which to retrieve ALE elements. |
type |
character(1). The type of ModelBoot ALE elements to retrieve: |
See get.ALE()
Sorted categorical indices based on Kolmogorov-Smirnov distances for empirically ordering categorical categories.
idxs_kolmogorov_smirnov(X, x_col, n_bins, x_int_counts)
X |
X data |
x_col |
character |
n_bins |
integer |
x_int_counts |
bin sizes |
This intrapolation algorithm replaces internal missing values in a vector with the linear interpolation of the bounding non-missing values. If there are no bounding non-missing values, the unbounded missing values are retained as missing. In our terminology, 'intrapolation' is distinct from 'interpolation' because interpolation might include 'extrapolation', that is, projecting estimates of values beyond the bounds. This function, in contrast, only replaces bounded missing values.
intrapolate_1D(v)
v |
numeric vector. The vector whose internal missing values will be intrapolated. |
For example, the vector c(NA, NA, 1, NA, 5, NA, NA, 1, NA)
will be intrapolated to c(NA, NA, 1, 3, 5, 3.7, 2.3, 1, NA)
.
Note: because intrapolation requires at least three elements (left bound, missing value, right bound), an input vector of less than three will be returned unchanged.
numeric vector of the same length as the input v
with internal missing values linearly intrapolated.
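A minimal base-R sketch of the idea, reproducing the example above (`intrapolate_1D_sketch()` is a hypothetical illustration, not the packaged internal function):

```r
# Linearly interpolate only bounded internal NAs; leave unbounded NAs as NA
intrapolate_1D_sketch <- function(v) {
  if (length(v) < 3) return(v)  # need left bound, missing value, right bound
  known <- which(!is.na(v))
  if (length(known) < 2) return(v)  # nothing is bounded on both sides
  inner <- seq(min(known), max(known))  # positions bounded by known values
  v[inner] <- stats::approx(known, v[known], xout = inner)$y
  v
}

intrapolate_1D_sketch(c(NA, NA, 1, NA, 5, NA, NA, 1, NA))
# approximately c(NA, NA, 1, 3, 5, 3.7, 2.3, 1, NA), as in the example above
```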
This intrapolation algorithm replaces internal missing values in a matrix. It does so in the following steps:
Calculate separate intrapolations in four directions: rows, columns, NWSE diagonals (upper left down to lower right), and SWNE diagonals (lower left up to upper right). The intrapolation in each direction is based on the algorithm of intrapolate_1D()
. (See details there.)
The 2D intrapolation is the mean intrapolation from any of the four values. In taking the mean, missing intrapolations are removed.
When there is no intrapolation available from any of the four directions, the missing value remains missing.
intrapolate_2D(mx, consolidate = TRUE)
mx |
numeric matrix. The matrix whose internal missing values will be intrapolated. |
consolidate |
logical(1). See return value. |
If consolidate = TRUE
(default), returns a numeric matrix of the same dimensions as the input mx
with internal missing values linearly intrapolated. If consolidate = FALSE
, returns a list of intrapolations for missing values from each of the four directions (rows, columns, NWSE diagonal, and SWNE diagonal).
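As a hedged sketch of the averaging idea, the following base-R code combines only the row and column directions (the packaged function also uses the two diagonals; all names here are hypothetical illustrations):

```r
# 1D helper as in intrapolate_1D(): fill only bounded internal NAs
intrap_1d <- function(v) {
  known <- which(!is.na(v))
  if (length(known) < 2) return(v)
  inner <- seq(min(known), max(known))
  v[inner] <- stats::approx(known, v[known], xout = inner)$y
  v
}

# Average the row-wise and column-wise intrapolations where available
intrap_2d_sketch <- function(mx) {
  by_row <- t(apply(mx, 1, intrap_1d))
  by_col <- apply(mx, 2, intrap_1d)
  n_avail <- (!is.na(by_row)) + (!is.na(by_col))
  total <- ifelse(is.na(by_row), 0, by_row) + ifelse(is.na(by_col), 0, by_col)
  out <- total / n_avail  # NaN where no direction provides a value
  out[is.nan(out)] <- NA
  out
}

mx <- matrix(c(1, NA, 3,
               4,  5, 6), nrow = 2, byrow = TRUE)
intrap_2d_sketch(mx)  # the NA is replaced by 2 (from the row direction)
```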
This intrapolation algorithm replaces internal missing values in a three-dimensional array. For how it works, see the details of intrapolate_2D()
. Based on that, intrapolate_3D()
does the following:
Slice the 3D array into 2D matrices along the rows, columns, and depth dimensions.
Use intrapolate_2D()
to calculate 2D intrapolations based on the algorithm of intrapolate_1D()
. See details in their documentation.
In addition, calculate intrapolations along the four directions of 3D diagonals: front northwest to back southeast, that is, front upper left down to back lower right (FNWBSE), FSWBNE, FSEBNW, and FNEBSW.
The 3D intrapolation is the mean intrapolation from any of these 2D or 3D values. In taking the mean, missing intrapolations are removed.
When there is no intrapolation available from any of the directions, the missing value remains missing.
intrapolate_3D(ray, consolidate = TRUE)
ray |
numeric array of three dimensions. |
consolidate |
logical(1). See return value. |
If consolidate = TRUE
(default), returns a numeric array of the same dimensions as the input ray
with internal missing values linearly intrapolated. If consolidate = FALSE
, returns a list of intrapolations for missing values from each slice and diagonal direction.
A ModelBoot
object contains full-model bootstrapped statistics and ALE data for a trained model. Full-model bootstrapping (as distinct from data-only bootstrapping) retrains a model for each bootstrap iteration. Thus, it is very slow, though its results are more reliable. However, for obtaining bootstrapped ALE data, plots, and statistics, full-model bootstrapping as provided by ModelBoot
is only necessary for models that have not been developed by cross-validation. For cross-validated models, it is sufficient (and much faster) to create a regular ALE
object with bootstrapping by setting the boot_it
argument there. In fact, full-model bootstrapping with ModelBoot
is often infeasible for slow machine-learning models trained on large datasets, which should rather be cross-validated to assure their reliability. However, for models that have not been cross-validated, full-model bootstrapping with ModelBoot
is necessary for reliable results. Further details follow below; see also vignette('ale-statistics')
.
ModelBoot(
  model,
  data = NULL,
  ...,
  model_call_string = NULL,
  model_call_string_vars = character(),
  parallel = future::availableCores(logical = FALSE, omit = 1),
  model_packages = NULL,
  y_col = NULL,
  binary_true_value = TRUE,
  pred_fun = function(object, newdata, type = pred_type) {
    stats::predict(object = object, newdata = newdata, type = type)
  },
  pred_type = "response",
  boot_it = 100,
  boot_alpha = 0.05,
  boot_centre = "mean",
  seed = 0,
  output_model_stats = TRUE,
  output_model_coefs = TRUE,
  output_ale = TRUE,
  output_boot_data = FALSE,
  ale_options = list(),
  tidy_options = list(),
  glance_options = list(),
  silent = FALSE
)
model |
Required. See documentation for |
data |
dataframe. Dataset that will be bootstrapped. This must be the same data on which the |
... |
not used. Inserted to require explicit naming of subsequent arguments. |
model_call_string |
character(1). If NULL (default), the |
model_call_string_vars |
character. Names of variables included in |
parallel |
See documentation for |
model_packages |
See documentation for |
y_col , pred_fun , pred_type
|
See documentation for |
binary_true_value |
any single atomic value. If the model represented by |
boot_it |
non-negative integer(1). Number of bootstrap iterations for full-model bootstrapping. For bootstrapping of ALE values, see details to verify if |
boot_alpha |
numeric(1) from 0 to 1. Alpha for percentile-based confidence interval range for the bootstrap intervals; the bootstrap confidence intervals will be the lowest and highest |
boot_centre |
character(1) in c('mean', 'median'). When bootstrapping, the main estimate for the ALE y value is considered to be |
seed |
integer. Random seed. Supply this between runs to assure identical bootstrap samples are generated each time on the same data. |
output_model_stats |
logical(1). If |
output_model_coefs |
logical(1). If |
output_ale |
logical(1). If |
output_boot_data |
logical(1). If |
ale_options , tidy_options , glance_options
|
list of named arguments. Arguments to pass to the |
silent |
See documentation for |
An object of class ModelBoot with properties model_stats, model_coefs, ale, boot_data, and params.
tibble
of bootstrapped results from broom::glance()
. NULL
if model_stats
argument is FALSE
. In general, only broom::glance()
results that make sense when bootstrapped are included, such as df
and adj.r.squared
. Results that are incomparable across bootstrapped datasets (such as aic
) are excluded. In addition, certain model performance measures are included; these are bootstrap-validated with the .632 correction (NOT the .632+ correction):
For regression (numeric prediction) models:
mae
: mean absolute error (MAE)
sa_mae_mad
: standardized accuracy of the MAE referenced on the mean absolute deviation
rmse
: root mean squared error (RMSE)
sa_rmse_sd
: standardized accuracy of the RMSE referenced on the standard deviation
For classification (probability) models:
auc
: area under the ROC curve
A tibble
of bootstrapped results from broom::tidy()
.
NULL
if model_coefs
argument is FALSE
.
A list of bootstrapped ALE results using default ALE()
settings unless if overridden with ale_options
. NULL
if ale
argument is FALSE
. Elements are:
* `single`: an `ALE` object of ALE calculations on the full dataset without bootstrapping.
* `boot`: a list of bootstrapped ALE data and statistics. This element is not an `ALE` object; it is in a special internal format.
A tibble
of bootstrap results. Each row represents a bootstrap iteration. NULL
if boot_data
argument is FALSE
. The columns are:
* `it`: the specific bootstrap iteration from 0 to `boot_it` iterations. Iteration 0 is the results from the full dataset (not bootstrapped).
* `row_idxs`: the row indexes for the bootstrapped sample for that iteration. To save space, the row indexes are returned rather than the full datasets. So, for example, the bootstrap sample of iteration 1 can be reproduced by `data[ModelBoot_obj@boot_data$row_idxs[[2]], ]` (the second element, since iteration 0 occupies the first), where `data` is the dataset and `ModelBoot_obj` is the result of `ModelBoot()`.
* `model`: the model object trained on that iteration.
* `tidy`: the results of `broom::tidy(model)` on that iteration.
* `stats`: the results of `broom::glance(model)` on that iteration.
* `perf`: performance measures on the entire dataset. These are the measures specified above for regression and classification models.
Parameters used to calculate bootstrapped data. Most of these repeat the arguments passed to ModelBoot()
. These are either the values provided by the user or the defaults if the user did not change them, but the following additional objects created internally are also provided:
* `y_cats`: same as `ALE@params$y_cats` (see documentation there).
* `y_type`: same as `ALE@params$y_type` (see documentation there).
* `model`: same as `ALE@params$model` (see documentation there).
* `data`: same as `ALE@params$data` (see documentation there).
No modelling results, with or without ALE, should be considered reliable without appropriate validation. For ALE, both the trained model itself and the ALE that explains the trained model must be validated. ALE must be validated by bootstrapping. The trained model might be validated either by cross-validation or by bootstrapping. For ALE that explains trained models that have been developed by cross-validation, it is sufficient to bootstrap just the training data. That is what the ALE
object does with its boot_it
argument. However, unvalidated models must be validated by bootstrapping them along with the calculation of ALE; this is what the ModelBoot
object does with its boot_it
argument.
ModelBoot()
carries out full-model bootstrapping to validate models. Specifically, it:
Creates multiple bootstrap samples (default 100; the user can specify any number);
Creates a model on each bootstrap sample;
Calculates overall model statistics, variable coefficients, and ALE values for each model on each bootstrap sample;
Calculates the mean, median, and lower and upper confidence intervals for each of those values across all bootstrap samples.
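The loop can be sketched in base R with a simple lm() on the built-in mtcars data (a hypothetical illustration of the bootstrap logic only; ModelBoot() additionally computes ALE data, model statistics, and .632-corrected performance measures):

```r
# Full-model bootstrap of a simple linear model
set.seed(0)
boot_it <- 20  # ModelBoot() default is 100

# 1. Create bootstrap samples; 2. retrain the model on each;
# 3. collect a statistic (here, the wt coefficient) from each retrained model
coefs <- replicate(boot_it, {
  boot_data <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  coef(lm(mpg ~ wt, data = boot_data))[["wt"]]
})

# 4. Summarize with mean, median, and percentile confidence limits
summary_stats <- c(
  mean = mean(coefs),
  median = median(coefs),
  lo = unname(quantile(coefs, 0.025)),
  hi = unname(quantile(coefs, 0.975))
)
summary_stats
```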
The broom::tidy()
summary statistics will provide p-values. However, the procedure for obtaining p-values for ALE statistics is very slow: it involves retraining the model 1000 times. Thus, it is not efficient to calculate p-values whenever a ModelBoot
object is created. Although the ALE()
function provides an 'auto' option for creating p-values, that option is disabled when creating a ModelBoot
because it would be far too slow: it would involve retraining the model 1000 times multiplied by the number of bootstrap iterations. Rather, you must first create a p-values distribution object using the procedure described in help(ALEpDist)
. If the name of your p-values object is p_dist
, you can then request p-values each time you create a ModelBoot
by passing it the argument ale_options = list(p_values = p_dist)
.
Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.
# attitude dataset
attitude

## ALE for general additive models (GAM)
## GAM is tweaked to work on the small dataset.
gam_attitude <- mgcv::gam(
  rating ~ complaints + privileges + s(learning) +
    raises + s(critical) + advance,
  data = attitude
)
summary(gam_attitude)

# Full model bootstrapping
# Only 4 bootstrap iterations for a rapid example; default is 100
# Increase value of boot_it for more realistic results
mb_gam <- ModelBoot(
  gam_attitude,
  boot_it = 4
)

# If the model is not standard, supply model_call_string with 'data = boot_data'
# in the string instead of the actual dataset name (in addition to the actual
# dataset as the 'data' argument directly to the `ModelBoot` constructor).
mb_gam <- ModelBoot(
  gam_attitude,
  data = attitude,  # the actual dataset
  model_call_string = 'mgcv::gam(
    rating ~ complaints + privileges + s(learning) +
      raises + s(critical) + advance,
    data = boot_data  # required for model_call_string
  )',
  boot_it = 4
)

# Model statistics and coefficients
mb_gam@model_stats
mb_gam@model_coefs

# Plot ALE
plot(mb_gam)
Improvements:

* Validation: ensure that the object is atomic (not just a vector).
* For factors, get all classes and check if any class_x is a factor or ordered.
* Add arguments to return a unique mode with options to sort: occurrence order, lexicographical.

Reduce a dataframe to a sample (retains the structure of its columns).
params_data( data, y_vals, data_name = var_name(data), sample_size = 500, seed = 0 )
data |
input dataframe |
y_vals |
y values, y predictions, or a sample thereof |
data_name |
name of the data argument |
sample_size |
size of data to sample |
seed |
random seed |
a list
Plot ALE objects

This plot method simply calls the constructor for an ALEPlots
object.
x |
ALE object. |
... |
Arguments passed to |
Plot an ALEPlots
object.
x |
An object of class |
max_print |
integer(1). The maximum number of plots that may be printed at a time. 1D and 2D plots are printed separately, so this maximum applies separately to each dimension of ALE plots, not to all dimensions combined. |
... |
Arguments to pass to |
Invisibly returns x
.
Plot ModelBoot objects

This plot method simply calls the constructor for an ALEPlots
object.
x |
ModelBoot object. |
... |
Arguments passed to |
Computes data needed to calculate a variable's ALE values.
prep_var_for_ale(x_col, x_type, x_vals, bins, n, max_num_bins, X = NULL)
prep_var_for_ale(x_col, x_type, x_vals, bins, n, max_num_bins, X = NULL)
x_col |
character(1). Name of single column in X for which ALE data is to be calculated. |
x_type |
character(1). var_type() of x_col. |
x_vals |
vector. The values of x_col. |
bins , n
|
See documentation for |
max_num_bins |
See documentation for |
X |
See documentation for |
Print an ALE object.
x |
An object of class |
... |
Additional arguments (currently not used). |
Invisibly returns x
.
lm_cars <- stats::lm(mpg ~ ., mtcars)
ale_cars <- ALE(lm_cars, parallel = 0)
print(ale_cars)
Print an ALEPlots object by calling plot().
x |
An object of class |
max_print |
See documentation for |
... |
Additional arguments (currently not used). |
Invisibly returns x
.
Print a ModelBoot object.
x |
An object of class |
... |
Additional arguments (currently not used). |
Invisibly returns x
.
lm_cars <- stats::lm(mpg ~ ., mtcars)
mb <- ModelBoot(lm_cars, boot_it = 2, parallel = 0)
print(mb)
Resolve x_cols
and exclude_cols
to their standardized format of x_cols
to specify which 1D and 2D ALE elements are required. This specification is used throughout the ALE package. x_cols
specifies the desired columns or interactions whereas exclude_cols
optionally specifies any columns or interactions to remove from x_cols
. The result is x_cols
– exclude_cols
, giving considerable flexibility in specifying the precise columns desired.
resolve_x_cols(x_cols, col_names, y_col, exclude_cols = NULL, silent = FALSE)
x_cols |
character, list, or formula. Columns and interactions requested in one of the special |
col_names |
character. All the column names from a dataset. |
y_col |
character(1). The y outcome column. If found in any |
exclude_cols |
Same possible formats as |
silent |
logical(1). If |
x_cols
in canonical format, which is always a list with two elements:
d1
: a character vector with each requested column for 1D ALE. If empty (no 1D ALE at all), its value is list()
.
d2
: a list of length-2 character vectors with each requested 2D ALE interaction pair. If empty (no 2D ALE at all), its value is list()
.
See examples for details.
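For illustration, the canonical format can be constructed directly in base R. This is a hypothetical hand-built value showing the documented structure, not a call to the package itself:

```r
# Canonical x_cols format: a list with 'd1' (character vector of 1D columns)
# and 'd2' (list of length-2 character vectors for 2D interaction pairs).
x_cols_canonical <- list(
  d1 = c("a", "b"),                    # 1D ALE for columns a and b
  d2 = list(c("a", "b"), c("a", "c"))  # 2D ALE for a:b and a:c
)

# When a dimension is empty, its value is list():
x_cols_1d_only <- list(
  d1 = c("a", "b"),
  d2 = list()  # no 2D ALE requested
)
```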
The x_cols
argument determines which predictor variables and interactions
are included in the analysis. It supports multiple input formats:
Character vector: A simple vector of column names (e.g., c("a", "b")
)
selects individual columns for 1D ALE analysis.
Formula (~
): Allows specifying variables and interactions in
formula notation (e.g., ~ a + b + a:b
), which is automatically converted
into a structured format. The outcome term is optional and is ignored if present.
So, whatever ~ a + b + a:b
produces results identical to ~ a + b + a:b
.
List format: Users can explicitly separate 1D and 2D ALE interactions:
A character vector named d1
like list(d1 = c("a", "b"))
specifies 1D ALE variables.
A list of 2-character vectors named d2
like list(d2 = list(c("a", "b"), c("a", "c")))
specifies 2D interactions.
list(d1 = c("a", "b"), d2 = list(c("a", "b"), c("a", "c")))
combines both.
Boolean selection for an entire dimension:
list(d1 = TRUE)
selects all available variables for 1D ALE, excluding y_col
.
list(d2 = TRUE)
selects all possible 2D interactions among all columns in col_names
, excluding y_col
.
Selective 2D expansion: list(d2_all = "a")
includes all 2D interactions
where "a"
is one of the interacting variables.
NULL (or unspecified): If x_cols = NULL
, no variables are selected.
Regardless of format, y_col
is always removed automatically. The function
ensures all variables are valid and in col_names
, providing informative
messages unless silent = TRUE
. And regardless of the specification format,
the result will always be standardized in the format specified in the return value.
Run examples for details.
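The formula decomposition described above can be sketched with base R's `terms()`. This is an illustrative sketch of the idea, not the package's actual implementation (for example, it returns `NULL` rather than `list()` for an empty `d1`):

```r
# Split a formula's terms into 1D columns and 2D interaction pairs,
# mirroring the canonical list(d1 = ..., d2 = ...) format.
formula_to_x_cols <- function(fmla) {
  # terms() extracts e.g. c("a", "b", "a:b"); any outcome term is dropped.
  labs <- attr(stats::terms(fmla), "term.labels")
  parts <- strsplit(labs, ":", fixed = TRUE)
  list(
    d1 = unlist(parts[lengths(parts) == 1]),
    d2 = parts[lengths(parts) == 2]
  )
}

formula_to_x_cols(~ a + b + a:b)
# d1: "a" "b"; d2: list(c("a", "b"))
```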
## Sample data
set.seed(0)
df <- data.frame(
  y = runif(10),
  a = sample(letters[1:3], 10, replace = TRUE),
  b = rnorm(10),
  c = sample(1:5, 10, replace = TRUE)
)
col_names <- names(df)
y_col <- "y"  # Assume 'y' is the outcome variable

## Examples with just x_cols to show different formats for specifying x_cols
## (same format for exclude_cols)

# Character vector: Simple ALE with no interactions
resolve_x_cols(c("a", "b"), col_names, y_col)

# Character string: Select just one 1D element
resolve_x_cols("c", col_names, y_col)

# List of 1- and 2-length character vectors:
# specify precise 1D and 2D elements desired
resolve_x_cols(list(c("a", "b"), "c", c("c", "a"), "b"), col_names, y_col)

# Formula: Converts to a list of individual elements
resolve_x_cols(~ a + b, col_names, y_col)

# Formula with interactions (1D and 2D).
# This format is probably more convenient if you know precisely which terms you want.
# Note that the outcome on the left-hand side is always silently ignored.
resolve_x_cols(whatever ~ a + b + a:b + c:b, col_names, y_col)

# List specifying d1 (1D ALE)
resolve_x_cols(list(d1 = c("a", "b")), col_names, y_col)

# List specifying d2 (2D ALE)
resolve_x_cols(list(d2 = list(c("a", "b"))), col_names, y_col)

# List specifying both d1 and d2
resolve_x_cols(list(d1 = c("a", "b"), d2 = list(c("a", "b"))), col_names, y_col)

# d1 as TRUE (select all columns except y_col)
resolve_x_cols(list(d1 = TRUE), col_names, y_col)

# d2 as TRUE (select all possible 2D interactions)
resolve_x_cols(list(d2 = TRUE), col_names, y_col)

# d2_all: Request all 2D interactions involving a specific variable
resolve_x_cols(list(d2_all = "a"), col_names, y_col)

# NULL: No variables selected
resolve_x_cols(NULL, col_names, y_col)

## Examples of how exclude_cols are removed from x_cols
## to obtain various desired results

# Exclude one column from a simple character vector
resolve_x_cols(
  x_cols = c("a", "b", "c"),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = "b"
)

# Exclude multiple columns
resolve_x_cols(
  x_cols = c("a", "b", "c"),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = c("a", "c")
)

# Exclude an interaction term from a formula input
resolve_x_cols(
  x_cols = ~ a + b + a:b,
  col_names = col_names,
  y_col = y_col,
  exclude_cols = ~ a:b
)

# Exclude all columns from x_cols
resolve_x_cols(
  x_cols = c("a", "b", "c"),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = c("a", "b", "c")
)

# Exclude non-existent columns (should be ignored)
resolve_x_cols(
  x_cols = c("a", "b"),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = "z"
)

# Exclude one column from a list-based input
resolve_x_cols(
  x_cols = list(d1 = c("a", "b"), d2 = list(c("a", "b"), c("a", "c"))),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = list(d1 = "a")
)

# Exclude interactions only
resolve_x_cols(
  x_cols = list(d1 = c("a", "b", "c"), d2 = list(c("a", "b"), c("a", "c"))),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = list(d2 = list(c("a", "b")))
)

# Exclude everything, including interactions
resolve_x_cols(
  x_cols = list(d1 = c("a", "b", "c"), d2 = list(c("a", "b"), c("a", "c"))),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = list(d1 = c("a", "b", "c"), d2 = list(c("a", "b"), c("a", "c")))
)

# Exclude a column implicitly removed by y_col
resolve_x_cols(
  x_cols = c("y", "a", "b"),
  col_names = col_names,
  y_col = "y",
  exclude_cols = "y"
)

# Exclude entire 2D dimension from x_cols with d2 = TRUE
resolve_x_cols(
  x_cols = list(d1 = TRUE, d2 = list(c("a", "b"), c("a", "c"))),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = list(d1 = c("a"), d2 = TRUE)
)
Present concise summary information about an ALEPlots
object.
object |
An object of class |
... |
Not used |
Summary string.
This is a transformation of the mtcars
dataset from R to produce a small dataset with each of the fundamental datatypes: logical, factor, ordered, integer, double, and character. Most of the transformations are obvious, but a few are noteworthy:
The row names (the car model) are saved as a character vector.
For the unordered factors, the country and continent of the car manufacturer are obtained based on the row names (model).
For the ordered factor, gears 3, 4, and 5 are encoded as 'three', 'four', and 'five', respectively. The text labels make it explicit that the variable is ordinal, yet the number names make the order crystal clear.
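The gear encoding described above can be reproduced in base R. This is a sketch of the transformation, not the package's exact code:

```r
# Encode mtcars' gear values (3, 4, 5) as an ordered factor
# with number-name labels, so three < four < five.
gear <- factor(
  mtcars$gear,
  levels = c(3, 4, 5),
  labels = c("three", "four", "five"),
  ordered = TRUE
)

levels(gear)
# "three" "four" "five"
```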
Here is the adaptation of the original description of the mtcars
dataset:
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
var_cars
A tibble with 32 observations on 14 variables.
model (character): Car model
mpg (double): Miles/(US) gallon
cyl (integer): Number of cylinders
disp (double): Displacement (cu.in.)
hp (double): Gross horsepower
drat (double): Rear axle ratio
wt (double): Weight (1000 lbs)
qsec (double): 1/4 mile time
vs (logical): Engine (0 = V-shaped, 1 = straight)
am (logical): Transmission (0 = automatic, 1 = manual)
gear (ordered): Number of forward gears
carb (integer): Number of carburetors
country (factor): Country of car manufacturer
continent (factor): Continent of car manufacturer
Henderson and Velleman (1981) comment in a footnote to Table 1: 'Hocking (original transcriber)'s noncrucial coding of the Mazda's rotary engine as a straight six-cylinder engine and the Porsche's flat engine as a V engine, as well as the inclusion of the diesel Mercedes 240D, have been retained to enable direct comparisons to be made with previous analyses.'
Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.
Determine the datatype of a vector
var_type(var)
var |
vector whose datatype is to be determined. (Note: this function is not exported; see the return value for details of what it does.) |
Returns generic datatypes of R basic vectors according to the following mapping:
logical
returns 'binary'
numeric
values (e.g., integer
and double
) return 'numeric'
However, if a numeric vector's only values are 0 and 1, then it returns 'binary'
unordered factor
returns 'categorical'
ordered
factor
returns 'ordinal'
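The mapping can be sketched as a base R function. This is an illustration of the documented behavior, not the package's internal code (other vector types, such as character, are not covered by the mapping above, so the sketch returns NA for them):

```r
# Map a vector to the generic datatypes documented above.
var_type_sketch <- function(var) {
  if (is.logical(var)) return("binary")
  # Check is.ordered() before is.factor(): ordered factors are also factors.
  if (is.ordered(var)) return("ordinal")
  if (is.factor(var))  return("categorical")
  if (is.numeric(var)) {
    # Numeric vectors containing only 0 and 1 are treated as binary.
    if (all(var %in% c(0, 1))) return("binary")
    return("numeric")
  }
  NA_character_  # type not covered by the documented mapping
}

var_type_sketch(c(0, 1, 1))              # "binary"
var_type_sketch(c(1.5, 2))               # "numeric"
var_type_sketch(factor("a"))             # "categorical"
var_type_sketch(ordered(c("lo", "hi")))  # "ordinal"
```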