Package 'ale'

Title: Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE)
Description: Accumulated Local Effects (ALE) were initially developed as a model-agnostic approach for global explanations of the results of black-box machine learning algorithms. ALE has a key advantage over other approaches like partial dependency plots (PDP) and SHapley Additive exPlanations (SHAP): its values represent a clean functional decomposition of the model. As such, ALE values are not affected by the presence or absence of interactions among variables in a mode. Moreover, its computation is relatively rapid. This package reimplements the algorithms for calculating ALE data and develops highly interpretable visualizations for plotting these ALE values. It also extends the original ALE concept to add bootstrap-based confidence intervals and ALE-based statistics that can be used for statistical inference. For more details, see Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. <arXiv:2310.09877>. <doi:10.48550/arXiv.2310.09877>.
Authors: Chitu Okoli [aut, cre]
Maintainer: Chitu Okoli <[email protected]>
License: MIT + file LICENSE
Version: 0.3.0.20250217
Built: 2025-02-19 10:31:55 UTC
Source: https://github.com/tripartio/ale

Help Index


Add two arrays or matrices, ignoring NA values by default

Description

Array or matrix addition in base R sets all sums in an element position to NA if any element in that position is NA in either of the arrays being added. In contrast, this function ignores NA values by default in its addition.

Usage

add_array_na.rm(ary1, ary2, na.rm = TRUE)

Arguments

ary1, ary2

numeric arrays or matrices. The arrays to be added. They must be of the same dimension.

na.rm

logical(1). TRUE (default) if missing values (NA) should be ignored in the summation. If both elements in a given position are missing, then the result will be NA.

Value

An array or matrix of the same dimensions as ary1 and ary2 whose values are the sums of ary1 and ary2 in each corresponding element.

Reduce(add_array_na.rm, list(x1, x2, x3))


ALE data and statistics that describe a trained model

Description

An ALE S7 object contains ALE data and statistics. For details, see vignette('ale-intro') or the details and examples below.

Usage

ALE(
  model,
  x_cols = list(d1 = TRUE),
  data = NULL,
  y_col = NULL,
  ...,
  exclude_cols = NULL,
  parallel = future::availableCores(logical = FALSE, omit = 1),
  model_packages = NULL,
  output_stats = TRUE,
  output_conf = TRUE,
  output_boot_data = FALSE,
  pred_fun = function(object, newdata, type = pred_type) {
     stats::predict(object =
    object, newdata = newdata, type = type)
 },
  pred_type = "response",
  p_values = NULL,
  p_alpha = c(0.01, 0.05),
  max_num_bins = 10,
  boot_it = 0,
  boot_alpha = 0.05,
  boot_centre = "mean",
  seed = 0,
  y_type = NULL,
  median_band_pct = c(0.05, 0.5),
  sample_size = 500,
  .bins = NULL,
  silent = FALSE
)

Arguments

model

model object. Required. Model for which ALE should be calculated. May be any kind of R object that can make predictions from data.

x_cols, exclude_cols

character, list, or formula. Columns names from data requested in one of the special x_cols formats for which ALE data is to be calculated. Defaults to 1D ALE for all columns in data except y_col. See details in the documentation for resolve_x_cols().

data

dataframe. Dataset from which to create predictions for the ALE. It should normally be the same dataset on which model was trained. If not provided, ALE() will try to detect it automatically. For non-standard models, data should be provided.

y_col

character(1). Name of the outcome target label (y) variable. If not provided, ALE() will try to detect it automatically. For non-standard models, y_col should be provided. For survival models, set y_col to the name of the binary event column; in that case, pred_type should also be specified.

...

not used. Inserted to require explicit naming of subsequent arguments.

parallel

non-negative integer(1). Number of parallel threads (workers or tasks) for parallel execution of the function. Set parallel = 0 to disable parallel processing. See details.

model_packages

character. Character vector of names of packages that model depends on that might not be obvious with parallel processing. If you get weird error messages when parallel processing is enabled (which is the default) but they are resolved by setting parallel = 0, you might need to specify model_packages. See details.

output_stats

logical(1). If TRUE (default), return ALE statistics.

output_conf

logical(1). If TRUE (default), return ALE confidence regions. If output_stats is FALSE, output_conf is ignored since confidence regions cannot be produced without ALE statistics.

output_boot_data

logical(1). If TRUE, return the raw ALE data for each bootstrap iteration. Default is FALSE.

pred_fun, pred_type

function,character(1). pred_fun is a function that returns a vector of predicted values of type pred_type from model on data. See details.

p_values

instructions for calculating p-values and to determine the median band. If NULL (default), no p-values are calculated and median_band_pct is used to determine the median band. To calculate p-values, an ALEpDist() object must be provided here. If p_values is set to 'auto', this ALE() function will try to automatically create the p-values distribution; this only works with standard R model types. An error message will be given if p-values cannot be generated. Any other input provided to this argument will result in an error. For more details about creating p-values, see documentation for ALEpDist(). Note that p-values will not be generated if output_stats is FALSE.

p_alpha

numeric length 2 from 0 to 1. Alpha for "confidence interval" ranges for printing bands around the median for single-variable plots. These are the default values used if p_values are provided. If p_values are not provided, then median_band_pct is used instead. The inner band range will be the median value of y ± p_alpha[2] of the relevant ALE statistic (usually ALE range or normalized ALE range). For plots with a second outer band, its range will be the median ± p_alpha[1]. For example, in the ALE plots, for the default p_alpha = c(0.01, 0.05), the inner band will be the median ± ALE minimum or maximum at p = 0.05 and the outer band will be the median ± ALE minimum or maximum at p = 0.01.

max_num_bins

positive integer(1). Maximum number of bins for numeric x_cols variables. The number of bins is eventually the lower of the number of unique values of a numeric variable and max_num_bins.

boot_it

non-negative integer(1). Number of bootstrap iterations for data-only bootstrapping on ALE data. This is appropriate for models that have been developed with cross-validation. For models that have not been validated, full-model bootstrapping should be used instead with the ModelBoot class. See details there. The default boot_it = 0 turns off bootstrapping.

boot_alpha

numeric(1) from 0 to 1. Alpha for percentile-based confidence interval range for the bootstrap intervals; the bootstrap confidence intervals will be the lowest and highest (1 - 0.05) / 2 percentiles. For example, if boot_alpha = 0.05 (default), the intervals will be from the 2.5 and 97.5 percentiles.

boot_centre

character(1) in c('mean', 'median'). When bootstrapping, the main estimate for the ALE y value is considered to be boot_centre. Regardless of the value specified here, both the mean and median will be available.

seed

integer(1). Random seed. Supply this between runs to assure that identical random ALE data is generated each time when bootstrapping. Without bootstrapping, ALE is a deterministic algorithm that should result in identical results each time regardless of the seed specified.

y_type

character(1) in c('binary', 'numeric', 'categorical', 'ordinal'). Datatype of the y (outcome) variable. Normally determined automatically; only provide if an error message for a complex non-standard model requires it.

median_band_pct

numeric length 2 from 0 to 1. Alpha for "confidence interval" ranges for printing bands around the median for single-variable plots. These are the default values used if p_values are not provided. If p_values are provided, then median_band_pct is ignored. The inner band range will be the median value of y ± median_band_pct[1]/2. For plots with a second outer band, its range will be the median ± median_band_pct[2]/2. For example, for the default median_band_pct = c(0.05, 0.5), the inner band will be the median ± 2.5% and the outer band will be the median ± 25%.

sample_size

non-negative integer(1). Size of the sample of data to be returned with the ALE object. This is primarily used for rug plots. See the min_rug_per_interval argument.

.bins

Internal. List of ALE bin and n count vectors. If provided, these vectors will be used to set the intervals of the ALE x axis for each variable. By default (NULL), ALE() automatically calculates the bins. .bins is normally used in advanced analyses where the bins from a previous analysis are reused for subsequent analyses (for example, for full model bootstrapping with ModelBoot()).

silent

logical(1), default FALSE. If TRUE, do not display any non-essential messages during execution (such as progress bars). Regardless, any warnings and errors will always display. See details for how to enable progress bars.

Value

An object of class ALE with properties distinct and params.

Properties

distinct

Stores the optional ALE data, ALE statistics, and bootstrap data for one or more categories.

params

The parameters used to calculate the ALE data. These include most of the arguments used to construct the ALE object. These are either the values provided by the user or used by default if the user did not change them but also includes several objects that are created within the constructor. These extra objects are described here, as well as those parameters that are stored differently from the form in the arguments:

* `max_d`: the highest dimension of ALE data present. If only 1D ALE is present, then `max_d == 1`. If even one 2D ALE element is present (even with no 1D), then `max_d == 2`.
* `requested_x_cols`,`ordered_x_cols`: `requested_x_cols` is the resolved list of `x_cols` as requested by the user (that is, `x_cols` minus `exclude_cols`). `ordered_x_cols` is the same set of `x_cols` but arranged in the internal storage order.
* `y_cats`: categories for categorical classification models. For non-categorical models, this is the same as `y_col`.
* `y_type`: high-level datatype of the y outcome variable.
* `y_summary`: summary statistics of y values used for the ALE calculation. These statistics are based on the actual values of `y_col` unless if `y_type` is a probability or other value that is constrained in the `[0, 1]` range. In that case, `y_summary` is based on the predicted values of `y_col` by applying `model` to the `data`. `y_summary` is a named numeric vector. Most of the elements are the percentile of the y values. E.g., the '5%' element is the 5th percentile of y values. The following elements have special meanings:
* The first element is named either `p` or `q` and its value is always 0.
  The value is not used; only the name of the element is meaningful.
  `p` means that the following special `y_summary` elements are based on
  the provided `ALEpDist` object. `q` means that quantiles were calculated
  based on `median_band_pct` because `p_values` was not provided.
* `min`, `mean`, `max`: the minimum, mean, and maximum y values, respectively. Note that the median is `50%`, the 50th percentile.
* `med_lo_2`, `med_lo`, `med_hi`, `med_hi_2`: `med_lo` and `med_hi` are the inner lower and upper confidence intervals of y values with respect to the median (`50%`); `med_lo_2` and `med_hi_2` are the outer confidence intervals. See the documentation for the `p_alpha` and `median_band_pct` arguments to understand how these are determined.
* `model`: same as `ALE@params$model` (see documentation there).
* `data`: same as `ALE@params$model` (see documentation there).

Custom predict function

The calculation of ALE requires modifying several values of the original data. Thus, ALE() needs direct access to the predict function for the model. By default, ALE() uses a generic default predict function of the form predict(object, newdata, type) with the default prediction type of 'response'. If, however, the desired prediction values are not generated with that format, the user must specify what they want. Very often, the only modification needed is to change the prediction type to some other value by setting the pred_type argument (e.g., to 'prob' to generated classification probabilities). But if the desired predictions need a different function signature, then the user must create a custom prediction function and pass it to pred_fun. The requirements for this custom function are:

  • It must take three required arguments and nothing else:

    • object: a model

    • newdata: a dataframe or compatible table type such as a tibble or data.table

    • type: a string; it should usually be specified as type = pred_type These argument names are according to the R convention for the generic stats::predict() function.

  • It must return a vector or matrix of numeric values as the prediction.

You can see an example below of a custom prediction function.

Note: survival models probably do not need a custom prediction function but y_col must be set to the name of the binary event column and pred_type must be set to the desired prediction type.

ALE statistics

For details about the ALE-based statistics (ALED, ALER, NALED, and NALER), see vignette('ale-statistics').

Parallel processing

Parallel processing using the {furrr} framework is enabled by default. By default, it will use all the available physical CPU cores (minus the core being used for the current R session) with the setting parallel = future::availableCores(logical = FALSE, omit = 1). Note that only physical cores are used (not logical cores or "hyperthreading") because machine learning can only take advantage of the floating point processors on physical cores, which are absent from logical cores. Trying to use logical cores will not speed up processing and might actually slow it down with useless data transfer. If you will dedicate the entire computer to running this function (and you don't mind everything else becoming very slow while it runs), you may use all cores by setting parallel = future::availableCores(logical = FALSE). To disable parallel processing, set parallel = 0.

#' The {ale} package should be able to automatically recognize and load most packages that are needed, but with parallel processing enabled (which is the default), some packages might not be properly loaded. This problem might be indicated if you get a strange error message that mentions something somewhere about "progress interrupted" or "future", especially if you see such errors after the progress bars begin displaying (assuming you did not disable progress bars with silent = TRUE). In that case, first try disabling parallel processing with parallel = 0. If that resolves the problem, then to get faster parallel processing to work, try adding the package names needed for the model to this argument, e.g., model_packages = c('tidymodels', 'mgcv').

Progress bars

Progress bars are implemented with the {progressr} package, which lets the user fully control progress bars. To disable progress bars, set silent = TRUE. The first time a function is called in the {ale} package that requires progress bars, it checks if the user has activated the necessary {progressr} settings. If not, the {ale} package automatically enables {progressr} progress bars with the cli handler and prints a message notifying the user.

If you like the default progress bars and you want to make them permanent, you can add the following lines of code to your .Rprofile configuration file and they will become your defaults for every R session; you will not see the message again:

progressr::handlers(global = TRUE)
progressr::handlers('cli')

This would apply not only to the {ale} package but to any package that uses {progressr} progress bars. For more details on formatting progress bars to your liking, see the introduction to the {progressr} package.

References

Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.

Examples

# Sample 1000 rows from the ggplot2::diamonds dataset (for a simple example)
set.seed(0)
diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ]

# Create a GAM model with flexible curves to predict diamond price
# Smooth all numeric variables and include all other variables
gam_diamonds <- mgcv::gam(
  price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
    cut + color + clarity,
  data = diamonds_sample
)
summary(gam_diamonds)




# Simple ALE without bootstrapping
ale_gam_diamonds <- ALE(gam_diamonds)

# Plot the ALE data
plot(ale_gam_diamonds)

# Bootstrapped ALE
# This can be slow, since bootstrapping runs the algorithm boot_it times

# Create ALE with 100 bootstrap samples
ale_gam_diamonds_boot <- ALE(
  gam_diamonds,
  boot_it = 100
)

# Bootstrapped ALEs print with confidence intervals
plot(ale_gam_diamonds_boot)


# If the predict function you want is non-standard, you may define a
# custom predict function. It must return a single numeric vector.
custom_predict <- function(object, newdata, type = pred_type) {
  predict(object, newdata, type = type, se.fit = TRUE)$fit
}

ale_gam_diamonds_custom <- ALE(
  gam_diamonds,
  pred_fun = custom_predict, pred_type = 'link'
)

# Plot the ALE data
plot(ale_gam_diamonds_custom)


# How to retrieve specific types of ALE data from an ALE object.
ale_diamonds_with__boot_data <- ALE(
  gam_diamonds,
  # For detailed options for x_cols, see examples at resolve_x_cols()
  x_cols = ~ carat + cut + clarity + color:depth + x:y,
  output_boot_data = TRUE,
  boot_it = 10  # just for demonstration
)

# See ?get.ALE for details on the various kinds of data that may be retrieved.
get(ale_diamonds_with__boot_data, ~ carat + color:depth)  # default ALE data
get(ale_diamonds_with__boot_data, what = 'boot_data')
get(ale_diamonds_with__boot_data, stats = 'estimate')
get(ale_diamonds_with__boot_data, stats = 'aled')
get(ale_diamonds_with__boot_data, stats = 'all')
get(ale_diamonds_with__boot_data, stats = 'conf_regions')
get(ale_diamonds_with__boot_data, stats = 'conf_sig')

Calculate statistics from ALE y values.

Description

Not exported. The following statistics are calculated based on a vector of ALE y values:

Usage

ale_stats(y, bin_n, y_vals = NULL, ale_y_norm_fun = NULL, x_type = "numeric")

Arguments

y

numeric. Vector of ALE y values.

bin_n

numeric. Vector of counts of rows in each ALE bin. Must be the same length as y.

y_vals

numeric. Entire vector of y values. Needed for normalization. If not provided, ale_y_norm_fun must be provided.

ale_y_norm_fun

function. Result of create_ale_y_norm_function(). If not provided, y_vals must be provided. ale_stats() could be faster if ale_y_norm_fun is provided, especially in bootstrap workflows that call the same function many, many times.

x_type

character(1). Datatype of the x variable on which the ALE y is based. Values are the result of var_type(). Used to determine how to correctly calculate ALE, so if the value is not the default "numeric", then it must be set correctly.

Details

  • ALE deviation (ALED)

  • ALE range (ALER): range from minimum value of any ALE y to the maximum value of any y. This is a very simple indication of the dispersion in ALE y values.

  • Normalized ALE deviation (NALED)

  • Normalized ALE range (NALER)

Note that if any ALE y values are missing, they will be deleted from the calculation (with their corresponding bin_n).

Value

Named numeric vector:

  • aled: ALE deviation (ALED)

  • aler_min: Minimum (lower value) of the ALE range (ALER)

  • aler_max: Maximum (upper value) of the ALE range (ALER)

  • naled: Normalized ALE deviation (ALED)

  • naler_min: Normalized minimum (lower value) of the ALE range (ALER)

  • naler_max: Normalized maximum (upper value) of the ALE range (ALER)


Calculate statistics from 2D ALE y values.

Description

When calculating second-order (2D) ALE statistics, there is no difficulty if both variables are categorical. The regular formulas for ALE operate normally. However, if one or both variables is numeric, the calculation is complicated by the necessity to determine the ALE midpoints between the ALE bin ceilings of the numeric variables. This function calculates these ALE midpoints for the numeric variables and resets the ALE bins to these values. The ALE values for ordinal ordinal variables are not changed. As part of the adjustment, the lowest numeric bin is merged into the second: the ALE values are completely deleted (since they do not represent a midpoint) and their counts are added to the first true bin.

Usage

ale_stats_2D(ale_data, x_cols, x_types, y_vals = NULL, ale_y_norm_fun = NULL)

Arguments

ale_data

dataframe. ALE data

x_cols

character. Names of the x columns in ale_data.

x_types

character same length as x_cols. Variable types (output of var_type()) of corresponding x_cols.

y_vals

See documentation for ale_stats()

ale_y_norm_fun

See documentation for ale_stats()

Details

After these possible adjustments, the ALE y values and bin counts are passed to ale_stats(), which calculates their statistics as an ordinal variable since the numeric variables have thus been discretized.

Not exported.

Value

Same as ale_stats().


Object with the ALE statistics of a random variable for generating p-values

Description

ALE statistics are accompanied with two indicators of the confidence of their values. First, bootstrapping creates confidence intervals for ALE measures and ALE statistics to give a range of the possible likely values. Second, we calculate p-values, an indicator of the probability that a given ALE statistic is random. Calculating p-values is not trivial for ALE statistics because ALE is non-parametric and model-agnostic. Because ALE is non-parametric (that is, it does not assume any particular distribution of data), the {ale} package generates p-values by calculating ALE for many random variables; this makes the procedure somewhat slow. For this reason, they are not calculated by default; they must be explicitly requested. Because the {ale} package is model-agnostic (that is, it works with any kind of R model), the ALE() constructor cannot always automatically manipulate the model object to create the p-values. It can only do so for models that follow the standard R statistical modelling conventions, which includes almost all built-in R algorithms (like stats::lm() and stats::glm()) and many widely used statistics packages (like mgcv and survival), but which excludes most machine learning algorithms (like tidymodels and caret). For non-standard algorithms, the user needs to do a little work to help the ALE() constructor correctly manipulate its model object:

  • The full model call must be passed as a character string in the argument random_model_call_string, with two slight modifications as follows.

  • In the formula that specifies the model, you must add a variable named 'random_variable'. This corresponds to the random variables that the constructor will use to estimate p-values.

  • The dataset on which the model is trained must be named 'rand_data'. This corresponds to the modified datasets that will be used to train the random variables.

See the example below for how this is implemented.

Usage

ALEpDist(
  model,
  data = NULL,
  p_speed = "approx fast",
  ...,
  parallel = future::availableCores(logical = FALSE, omit = 1),
  model_packages = NULL,
  random_model_call_string = NULL,
  random_model_call_string_vars = character(),
  y_col = NULL,
  binary_true_value = TRUE,
  pred_fun = function(object, newdata, type = pred_type) {
     stats::predict(object =
    object, newdata = newdata, type = type)
 },
  pred_type = "response",
  output_residuals = FALSE,
  rand_it = 1000,
  seed = 0,
  silent = FALSE,
  .testing_mode = FALSE
)

Arguments

model

See documentation for ALE()

data

See documentation for ALE()

p_speed

character(1). Either 'approx fast' (default) or 'precise slow'. See details.

...

not used. Inserted to require explicit naming of subsequent arguments.

parallel

See documentation for ALE()

model_packages

See documentation for ALE()

random_model_call_string

character(1). If NULL, the ALEpDist() constructor tries to automatically detect and construct the call for p-values. If it cannot, the function will fail early. In that case, a character string of the full call for the model must be provided that includes the random variable. See details.

random_model_call_string_vars

See documentation for model_call_string_vars in ModelBoot(); their operation is very similar.

y_col

See documentation for ALE()

binary_true_value

See documentation for ModelBoot()

pred_fun, pred_type

See documentation for ALE().

output_residuals

logical(1). If TRUE, returns the residuals in addition to the raw data of the generated random statistics (which are always returned). If FALSE (default), does not return the residuals.

rand_it

non-negative integer(1). Number of times that the model should be retrained with a new random variable. The default of 1000 should give reasonably stable p-values; these are considered "exact" p-values. It can be reduced for "approximate" p-values as low as 100 for faster test runs but then the p-values are not as stable.

seed

See documentation for ALE()

silent

See documentation for ALE()

.testing_mode

logical(1). Internal use only. Disables some data validation checks to allow for debugging.

Value

An object of class ALEpDist with properties rand_stats, residual_distribution, rand_it_ok, and residuals.

  • rand_it_ok: An integer with the number of rand_it iterations that successfully generated a random variable, that is, those that did not fail for whatever reason. The rand_it - rand_it_ok failed attempts are discarded.

  • rand_stats: A named list of tibbles. There is normally one element whose name is the same as y_col except if y_col is a categorical variable; in that case, the elements are named for each category of y_col. Each element is a tibble whose rows are each of the rand_it_ok iterations of the random variable analysis and whose columns are the ALE statistics obtained for each random variable.

  • residual_distribution: A univariateML object with the closest estimated distribution for the residuals as determined by univariateML::model_select(). This is the distribution used to generate all the random variables.

  • residuals: If output_residuals == TRUE, returns a matrix of the actual y_col values from data minus the predicted values from the model (without random variables) on the data. If output_residuals == FALSE, (default), does not return these residuals. The rows correspond to each row of data. The columns correspond to the named elements (y_col or categories) described above for rand_stats.

Approach to calculating p-values

The {ale} package takes a literal frequentist approach to the calculation of p-values. That is, it literally retrains the model 1000 times, each time modifying it by adding a distinct random variable to the model. (The number of iterations is customizable with the rand_it argument.) The ALEs and ALE statistics are calculated for each random variable. The percentiles of the distribution of these random-variable ALEs are then used to determine p-values for non-random variables. Thus, p-values are interpreted as the frequency of random variable ALE statistics that exceed the value of ALE statistic of the actual variable in question. The specific steps are as follows:

  • The residuals of the original model trained on the training data are calculated (residuals are the actual y target value minus the predicted values).

  • The closest distribution of the residuals is detected with univariateML::model_select().

  • 1000 new models are trained by generating a random variable each time with univariateML::rml() and then training a new model with that random variable added.

  • The ALEs and ALE statistics are calculated for each random variable.

  • For each ALE statistic, the empirical cumulative distribution function (from stats::ecdf()) is used to create a function to determine p-values according to the distribution of the random variables' ALE statistics.

What we have just described is the precise approach to calculating p-values with the argument p_speed = 'precise slow'. Because it is so slow, by default, the ALEpDist constructor implements an approximate algorithm by default (p_speed = 'approx fast') which trains only a few random variables up to the number of physical parallel processing threads available, with a minimum of four. To increase speed, the random variable uses only 10 ALE bins instead of the default 100. Although approximate p-values are much faster than precise ones, they are still somewhat slow: at the very quickest, they take at least the amount of time that it would take to train the original model two or three times. See the "Parallel processing" section below for more details on the speed of computation.

Parallel processing

Parallel processing using the {furrr} framework is enabled by default. By default, it will use all the available physical CPU cores (minus the core being used for the current R session) with the setting parallel = future::availableCores(logical = FALSE, omit = 1). Note that only physical cores are used (not logical cores or "hyperthreading") because machine learning can only take advantage of the floating point processors on physical cores, which are absent from logical cores. Trying to use logical cores will not speed up processing and might actually slow it down with useless data transfer.

For exact p-values, by default 1000 random variables are trained. So, even with parallel processing, the procedure is very slow. However, an ALEpDist object trained with a specific model on a specific dataset can be reused as often as needed for the identical model-dataset pair.

For approximate p-values (the default), at least four random variables are trained to give some minimal variation. With parallel processing, more random variables can be trained to increase the accuracy of the p_value estimates up to the maximum number of physical cores.

References

Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.

Examples

# Sample 1000 rows from the ggplot2::diamonds dataset (for a simple example)
set.seed(0)
diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ]

# Create a GAM with flexible curves to predict diamond price
# Smooth all numeric variables and include all other variables
gam_diamonds <- mgcv::gam(
  price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
    cut + color + clarity,
  data = diamonds_sample
)
summary(gam_diamonds)

# Create p_value distribution
pd_diamonds <- ALEpDist(
  gam_diamonds,
  diamonds_sample,
  # only 100 iterations for a quick demo; but usually should remain at 1000
  rand_it = 100,
)

# Examine the structure of the returned object
str(pd_diamonds)
# In RStudio: View(pd_diamonds)

# Calculate ALEs with p-values
ale_gam_diamonds <- ALE(
  gam_diamonds,
  p_values = pd_diamonds
)

# Plot the ALE data. The horizontal bands in the plots use the p-values.
plot(ale_gam_diamonds)


# For non-standard models that give errors with the default settings,
# you can use 'random_model_call_string' to specify a model for the estimation
# of p-values from random variables as in this example.
# See details above for an explanation.
pd_diamonds <- ALEpDist(
  gam_diamonds,
  diamonds_sample,
  random_model_call_string = 'mgcv::gam(
    price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
        cut + color + clarity + random_variable,
    data = rand_data
  )',
  # only 100 iterations for a quick demo; but usually should remain at 1000
  rand_it = 100
)

ALE plots with print and plot methods

Description

An ALEPlots S7 object contains the ALE plots from ALE or ModelBoot objects stored as ggplot objects. The ALEPlots creates all possible plots from the ALE or ModelBoot passed to its constructor—not only individual 1D and 2D ALE plots, but also special plots like the ALE effects plot. So, an ALEPlots object is a collection of plots, almost never a single plot. To retrieve specific plots, use the get.ALEPlots() method. In addition to specific ALE See examples there or examples with the ALE and ModelBoot objects.

Usage

ALEPlots(
  obj,
  ...,
  relative_y = "median",
  p_alpha = c(0.01, 0.05),
  median_band_pct = c(0.05, 0.5),
  rug_sample_size = obj@params$sample_size,
  min_rug_per_interval = 1,
  n_x1_bins = NULL,
  n_x2_bins = NULL,
  n_y_quant = 10,
  seed = 0,
  silent = FALSE
)

Arguments

obj

ALE or ModelBoot object. The object containing ALE data to be plotted.

...

not used. Inserted to require explicit naming of subsequent arguments.

relative_y

character(1) in c('median', 'mean', 'zero'). The ALE y values in the plots will be adjusted relative to this value. 'median' is the default. 'zero' will maintain the actual ALE values, which are relative to zero.

p_alpha

numeric(2) from 0 to 1. Alpha for "confidence interval" ranges for printing bands around the median for single-variable plots. These are the default values used if p_values are provided. If p_values are not provided, then median_band_pct is used instead. The inner band range will be the median value of y ± p_alpha[2] of the relevant ALE statistic (usually ALE range or normalized ALE range). For plots with a second outer band, its range will be the median ± p_alpha[1]. For example, in the ALE plots, for the default p_alpha = c(0.01, 0.05), the inner band will be the median ± ALE minimum or maximum at p = 0.05 and the outer band will be the median ± ALE minimum or maximum at p = 0.01.

median_band_pct

numeric length 2 from 0 to 1. Alpha for "confidence interval" ranges for printing bands around the median for single-variable plots. These are the default values used if p_values are not provided. If p_values are provided, then median_band_pct is ignored. The inner band range will be the median value of y ± median_band_pct[1]/2. For plots with a second outer band, its range will be the median ± median_band_pct[2]/2. For example, for the default median_band_pct = c(0.05, 0.5), the inner band will be the median ± 2.5% and the outer band will be the median ± 25%.

rug_sample_size, min_rug_per_interval

non-negative integer(1). Rug plots are down-sampled to rug_sample_size rows, otherwise they can be very slow for large datasets. By default, their size is the value of obj@params$sample_size. They maintain representativeness of the data by guaranteeing that each of the ALE bins will retain at least min_rug_per_interval elements; usually set to just 1 (default) or 2. To prevent this down-sampling, set rug_sample_size to Inf (but then the ALEPlots object would store the entire dataset, so could become very large).

n_x1_bins, n_x2_bins

positive integer(1). Number of bins for the x1 or x2 axes respectively for 2D interaction plot. These values are ignored if x1 or x2 are not numeric (i.e, if they are logical or factors).

n_y_quant

positive integer(1). Number of intervals over which the range of y values is divided for the colour bands of the interaction plot. See details.

seed

See documentation for ALE()

silent

See documentation for ALE()

Value

An object of class ALEPlots with properties distinct and params.

Properties

distinct

Stores the ALE plots. Use get.ALEPlots() to access them.

params

The parameters used to calculate the ALE plots. These include most of the arguments used to construct the ALEPlots object. These are either the values provided by the user or used by default if the user did not change them but also includes several objects that are created within the constructor. These extra objects are described here, as well as those parameters that are stored differently from the form in the arguments:

* `max_d`: See documentation for [ALE()]
* `requested_x_cols`: See documentation for [ALE()]. Note, however, that `ALEPlots` does not store `ordered_x_cols`.
* `y_col`: See documentation for [ALE()]

2D interaction plots

#' For the 2D interaction plots, n_y_quant is the number of quantiles into which to divide the predicted variable (y). The middle quantiles are grouped specially:

  • The middle quantile is the first confidence interval of median_band_pct (median_band_pct[1]) around the median. This middle quantile is special because it generally represents no meaningful interaction.

  • The quantiles above and below the middle are extended from the borders of the middle quantile to the regular borders of the other quantiles.

There will always be an odd number of quantiles: the special middle quantile plus an equal number of quantiles on each side of it. If n_y_quant is even, then a middle quantile will be added to it. If n_y_quant is odd, then the number specified will be used, including the middle quantile.

Examples

# See examples with ALE and ModelBoot objects

Calculate ALE data

Description

This function is not exported. It is a complete reimplementation of the ALE algorithm relative to the reference in ALEPlot::ALEPlot(). In addition to adding bootstrapping and handling of categorical y variables, it reimplements categorical x interactions.

Usage

calc_ale(
  data,
  model,
  x_cols,
  y_col,
  y_cats,
  pred_fun,
  pred_type,
  max_num_bins,
  boot_it,
  seed,
  boot_alpha,
  boot_centre,
  boot_ale_y = FALSE,
  .bins = NULL,
  ale_y_norm_funs = NULL,
  p_dist = NULL
)

Arguments

data

See documentation for ALE()

model

See documentation for ALE()

x_cols

character(1 or 2). Names of columns in X for which ALE data is to be calculated. Length 1 for 1D ALE and length 2 for 2D ALE.

y_col

character(1). Name of the target y column.

y_cats

character. The categories of y. For most cases with non-categorical y, y_cats == y_col.

pred_fun

See documentation for ALE()

pred_type

See documentation for ALE()

max_num_bins

See documentation for ALE()

boot_it

See documentation for ALE()

seed

See documentation for ALE()

boot_alpha

See documentation for ALE()

boot_centre

See documentation for ALE()

boot_ale_y

logical(1). If TRUE, return the bootstrap matrix of ALE y values. If FALSE (default) return NULL for the boot_ale_y element of the return value.

.bins

See documentation for ALE()

ale_y_norm_funs

list of functions. Custom functions for normalizing ALE y for statistics. It is usually a list(1), but for categorical y, there is a distinct function for each y category. If provided, ale_y_norm_funs saves some time since it is usually the same for all all variables throughout one call to ALE(). For now, used as a flag to determine whether statistics will be calculated or not; if NULL, statistics will not be calculated.

p_dist

See documentation for p_values in ALE()

Details

For details about arguments not documented here, see ALE().

References

Apley, Daniel W., and Jingyu Zhu. "Visualizing the effects of predictor variables in black box supervised learning models." Journal of the Royal Statistical Society Series B: Statistical Methodology 82.4 (2020): 1059-1086.

Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. doi:10.48550/arXiv.2310.09877.


Cast (convert) the class of an object

Description

Currently assumes that the result object will have only one class.

Usage

cast(x, new_cls)

Arguments

x

An R object

new_cls

character(1). A single class to which to convert x.

Value

x converted to class new_cls.


Census Income

Description

Census data that indicates, among other details, if the respondent's income exceeds $50,000 per year. Also known as "Adult" dataset.

Usage

census

Format

A tibble with 32,561 rows and 15 columns:

higher_income

TRUE if income > $50,000

age

continuous

workclass

Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked

fnlwgt

continuous. "A proxy for the demographic background of the people: 'People with similar demographic characteristics should have similar weights'" For more details, see https://www.openml.org/search?type=data&id=1590.

education

Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool

education_num

continuous

marital_status

Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse

occupation

Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces

relationship

Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried

race

White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black

sex

Female, Male

capital_gain

continuous

capital_loss

continuous

hours_per_week

continuous

native_country

United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinidad&Tobago, Peru, Hong, Holland-Netherlands

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Source

Becker,Barry and Kohavi,Ronny. (1996). Adult. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.


Sum up a matrix across columns

Description

Adaptation of base::colSums() that, when all values in a column are NA, sets the sum to NA rather than zero as base::colSums() does. Calls base::colSums() internally.

Usage

col_sums(mx, na.rm = FALSE, dims = 1)

Arguments

mx

numeric matrix

na.rm

logical(1). TRUE if missing values (NA) should be ignored in the summation. If FALSE (default), even one missing value will result in NA for the entire column.

dims

See documentation for base::colSums()

Value

numeric vector whose length is number of columns of mx, whose values are the sums of each column of mx.


Extract all NWSE diagonals from a matrix

Description

Extracts all diagonals from a matrix in the NWSE direction (upper left down to lower right).

Usage

extract_2D_diags(mx)

Arguments

mx

matrix

Value

A list whose elements each represent one diagonal of mx. Each diagonal element is a list of two elements: coords is a numeric vector pair of row-column coordinates; values is the value of the diagonal at the coordinate give by coords.


Extract all FNWBSE diagonals from a 3D array

Description

Extracts all diagonals from a 3D array in the FNWBSE direction (front upper left down to back lower right).

Usage

extract_3D_diags(ray)

Arguments

ray

a 3-dimensional array

Value

A list whose elements each represent one diagonal of ray. Each diagonal element is a list of two elements: origin is the 3D coordinates (row, column, depth) of the first element of the diagonal; values is a vector of the diagonal that starts from origin.


S7 generic get method for objects in the ale package

Description

Retrieve specific data elements from an object based on their X column names.

If obj is not an object from the ale package, then this generic passes on all arguments to the base::get() function.

Usage

get(obj, ...)

Arguments

obj

object.

...

For ale package objects, instructions for which predictor (x) columns should be retrieved. For everything else, arguments to pass to base::get().


get method for ALE objects

Description

Retrieve specific elements from an ALE object.

Arguments

obj

ALE object from which to retrieve elements.

x_cols, exclude_cols

character, list, or formula. Columns names and interaction terms from obj requested in one of the special x_cols formats. The default value of NULL for x_cols retrieves all available data of the output requested in what. See details in the documentation for resolve_x_cols().

what

character(1). What kind of output is requested. Must be one (and only one) of c('ale', 'boot_data'). Default is 'ale'. If stats is specified and what = 'ale', then ALE statistics are retrieved. Otherwise, get() errors if stats is specified and what has some other value.

...

not used. Inserted to require explicit naming of subsequent arguments.

stats

character(1). Retrieve statistics. If stats is specified, then what must be 'ale'. See the return value details below for valid values for stats.

cats

character. Optional category names to retrieve if the ALE is for a categorical y outcome model.

simplify

logical(1). If TRUE (default), the results will be simplified to the simplest structure possible to give the requested results.

silent

See documentation for resolve_x_cols()

Value

Regardless of the requested data, all get.ALE() have a common structure:

  • If more than one category of the y outcome is returned, then the top level is a list named by each category. However, the y outcome is not categorical or only one category of multiple possibilities is specified using the cats argument, then the top level never has categories, regardless of the value of simplify.

  • The next level (or top level if there are zero or one category) is a list with one or two levels:

    • d1: 1D ALE elements.

    • d2: 2D ALE elements. However, if elements of only one dimension (either 1D or 2D) are requested and simplify = TRUE (default), the empty list is eliminated and the level is skipped to provide only the elements present. For example, if only 1D ALE data is requested, then there will be no d1 sublist but only a list of the ALE data as described for the next level. If simplify = FALSE, both d1 and d2 sublists will always be returned; the empty sublist will be NULL.

  • For the d1 sublist, there is no further hierarchy: the returned data is as described below. For the d2 sublist representing var1 by var2 interactions, the next level is a named list of all var1 elements. This level in turn consists of a named list of var2 elements with the actual data as described below. For example, for ALE data, result <- get(obj, list(d1 = "var1", d2 = list(c("var1", "var2")))) would return result as a list where the 1D ALE data is in result$d1$var1 and the 2D ALE data in result$d2$var1$var2.

While all results follow the general structure just described, the specific type of data returned depends on the values of the what and stats arguments:

what = 'ale' (default) and stats = NULL (default)

A list whose elements, named by each requested x variable, are each a tibble. The rows each represent one ALE bin. The tibble has the following columns: * var.bin or var.ceil where var is the name of a variable (column): For non-numeric x, var.bin is the value of each of the ALE categories. For numeric x, var.ceil is the value of the upper bound (ceiling) of each ALE bin. The first "bin" of numeric variables represents the minimum value. For 2D ALE with an var1 by var2 interaction, both var1.bin and var2.bin columns are returned (or var1.ceil or var2.ceilfor numeric var1 or var2). * .n: the number of rows of data in each bin represented by var.bin or var.ceil. For numeric x, the first bin contains all data elements that have exactly the minimum value of x. This is often 1, but might be more than 1 if more than one data element has exactly the minimum value. * .y: the ALE function value calculated for that bin. For bootstrapped ALE, this is the same as .y_mean by default or .y_median if boot_centre = 'median'. Regardless, both .y_mean and .y_median are returned as columns here. * .y_lo, .y_hi: the lower and upper confidence intervals, respectively, for the bootstrapped .y value based on the boot_alpha argument in the ALE() constructor.

what = 'boot_data' and stats = NULL (default)

A list whose elements, named by each requested x variable, are each a tibble. These are the data from which .y_mean, .y_median, .y_lo, and .y_hi are summarized when what = 'ale'. The rows each represent one ALE bin for a specified bootstrap iteration. The tibble has the following columns: * .it: The bootstrap iteration. Iteration 0 represents the ALE calculations on the full dataset; the remaining values of .it are from 1 to boot_it (number of bootstrap iterations specified in the ALE() constructor. * var where var is the name of a variable (column): For non-numeric x, var is the value of each of the ALE categories. For numeric x, var is the value of the upper bound (ceiling) of each ALE bin. They are otherwise similar to their meanings described for what = 'ale' above. * .n and .y: Same as for what = 'ale'.

what = 'ale' (default) and stats = 'estimate'

A list with elements d1 and d2 with the value of each ALE statistic. Each row represents one variable or interaction. The tibble has the following columns: * term or term1 and term2: The variable or column for the 1D (term) or 2D (term1 by term2) ALE statistic. * aled, aler_min, aler_max, naled, naler_min, naler_max: the respective ALE statistic for the variable or interaction.

what = 'ale' (default) and stats is one value in c('aled', 'aler_min', 'aler_max', 'naled', 'naler_min', 'naler_max')

A list with elements d1 and d2 with the distribution value of the single requested ALE statistic. Each element d1 and d2 is a tibble. Each row represents one variable or interaction. The tibble has the following columns: * term or term1 and term2: Same as for stats = 'estimate'. * estimate, mean, median: The average of the bootstrapped value of the requested statistic. estimate is equal to either mean or median depending on the boot_centre argument in the ALE() constructor. If ALE is not bootstrapped, then estimate, mean, and median are equal. * conf.low, conf.high: the lower and upper confidence intervals, respectively, for the bootstrapped statistic based on the boot_alpha argument in the ALE() constructor. If ALE is not bootstrapped, then estimate, conf.low, and conf.high are equal.

what = 'ale' (default) and stats = 'all'

A list with elements d1 and d2 with the distribution values of all available ALE statistics for the requested variables and interactions. Whereas the stats = 'aled' (for example) format returns data for a single statistic, stats = 'all' returns all statistics for the requested variables. Each element is a list with the requested d1 and d2 sub-elements as described in the general structure above. Each data element is a tibble. Each row represents one ALE statistic. The tibble has the following columns: * term or term1 and term2: Same as for stats = 'estimate'. * estimate, mean, median: Same as above for individual ALE statistics. * conf.low, conf.high: Same as above for individual ALE statistics.

what = 'ale' (default) and stats = 'conf_regions'

A list with elements d1 and d2 with the confidence regions for the requested variables and interactions. Each element is a list with the requested d1 and d2 sub-elements as described in the general structure above. Each data element is a tibble with confidence regions for a single variable or interaction. For an explanation of the columns, see vignette('ale-statistics').

what = 'ale' (default) and stats = 'conf_sig'

A list with elements d1 and d2 filtered for the statistically significant confidence regions for the requested variables and interactions. Each element is a tibble. Each row is a single statistically significant confidence region. If there are no statistically significant confidence regions at all, an empty tibble is returned. For an explanation of the columns, see vignette('ale-statistics').

Examples

# See examples at ALE() for a demonstration us the get() method.

get method for ALEPlots objects

Description

Retrieve specific plots from a ALEPlots object.

See get.ALE() for explanation of parameters not described here.

Arguments

obj

ALEPlots object from which to retrieve ALE elements.

type

character(1). What type of ALEPlots to retrieve: 'ale' for standard ALE plots or 'eff' for ALE effects plots.

Value

See get.ALE()


get method for ModelBoot objects

Description

Retrieve specific ALE elements from a ModelBoot object. This method is similar to get.ALE() except that the user may specify what type of ALE data to retrieve (see the argument definition for details).

See get.ALE() for explanation of parameters not described here.

Arguments

obj

ModelBoot object from which to retrieve ALE elements.

type

character(1). The type of ModelBoot ALE elements to retrieve: 'single' for the ALE calculated on the full data set or 'boot' for the bootstrapped ALE data (based on full-model bootstrapping). The default 'auto' will retrieve 'boot' if it is available and 'single' otherwise.

Value

See get.ALE()


Sorted categorical indices based on Kolmogorov-Smirnov distances for empirically ordering categorical categories.

Description

Sorted categorical indices based on Kolmogorov-Smirnov distances for empirically ordering categorical categories.

Usage

idxs_kolmogorov_smirnov(X, x_col, n_bins, x_int_counts)

Arguments

X

X data

x_col

character

n_bins

integer

x_int_counts

bin sizes


Intrapolate missing values of vector

Description

This intrapolation algorithm replaces internal missing values in a vector with the linear interpolation of the bounding non-missing values. If there are no-bounding non-missing values, the unbounded missing values are retained as missing. In our terminology, 'intrapolation' is distinct from 'interpolation' because interpolation might include 'extrapolation', that is, projecting estimates of values beyond the bounds. This function, in contrast, only replaces bounded missing values.

Usage

intrapolate_1D(v)

Arguments

v

numeric vector. A numeric vector.

Details

For example, the vector c(NA, NA, 1, NA, 5, NA, NA, 1, NA) will be intrapolated to c(NA, NA, 1, 3, 5, 3.7, 2.3, 1, NA).

Note: because intrapolation requires at least three elements (left bound, missing value, right bound), an input vector of less than three will be returned unchanged.

Value

numeric vector of the same length as the input v with internal missing values linearly intrapolated.


Intrapolate missing values of matrix

Description

This intrapolation algorithm replaces internal missing values in a matrix. It does so in the following steps:

  • Calculate separate intrapolations in four directions: rows, columns, NWSE diagonals (upper left down to lower right), and SWNE diagonals (lower left up to upper right). The intrapolations in each direction is based on the algorithm of intrapolate_1D(). (See details there.)

  • The 2D intrapolation is the mean intrapolation from any of the four values. In taking the mean, missing intrapolations are removed.

  • When there is no intrapolation available from any of the four directions, the missing value remains missing.

Usage

intrapolate_2D(mx, consolidate = TRUE)

Arguments

mx

numeric matrix. A numeric matrix.

consolidate

logical(1). See return value.

Value

If consolidate = TRUE (default), returns a numeric matrix of the same dimensions as the input mx with internal missing values linearly intrapolated. If consolidate = FALSE, returns a list of intrapolations for missing values from each of the four directions (rows, columns, NWSE diagonal, and SWNE diagonal).


Intrapolate missing values of a 3D array

Description

This intrapolation algorithm replaces internal missing values in a three-dimensional array. For how it works, see the details of intrapolate_2D(). Based on that, intrapolate_3D() does the following:

  • Slice the 3D array into 2D matrices along the rows, columns, and depth dimensions.

  • Use intrapolate_2D() to calculate 2D intrapolations based on the algorithm of intrapolate_1D(). See details in their documentation.

  • In addition, calculate intrapolations along the four directions of 3D diagonals: front northwest to back southeast, that is, front upper left down to back lower right (FNWBSE), FSWBNE, FSEBNW, and FNEBSW.

  • The 3D intrapolation is the mean intrapolation from any of these 2D or 3D values. In taking the mean, missing intrapolations are removed.

  • When there is no intrapolation available from any of the directions, the missing value remains missing.

Usage

intrapolate_3D(ray, consolidate = TRUE)

Arguments

ray

numeric array of three dimensions.

consolidate

logical(1). See return value.

Value

If consolidate = TRUE (default), returns a numeric array of the same dimensions as the input ray with internal missing values linearly intrapolated. If consolidate = FALSE, returns a list of intrapolations for missing values from each slice and diagonal direction.


Statistics and ALE data for a bootstrapped model

Description

A ModelBoot object contains full-model bootstrapped statistics and ALE data for a trained model. Full-model bootstrapping (as distinct from data-only bootstrapping) retrains a model for each bootstrap iterations. Thus, it is very slow, though more reliable. However, for obtaining bootstrapped ALE data, plots, and statistics, full-model bootstrapping as provided by ModelBoot is only necessary for models that have not been developed by cross-validation. For cross-validated models, it is sufficient (and much faster) to create a regular ALE object with bootstrapping by setting the boot_it argument there. In fact, full-model bootstrapping with ModelBoot is often infeasible for slow machine-learning models trained on large datasets, which should rather be cross-validated to assure their reliability. However, for models that have not been cross-validated, full-model bootstrapping with ModelBoot is necessary for reliable results. Further details follow below; see also vignette('ale-statistics').

Usage

ModelBoot(
  model,
  data = NULL,
  ...,
  model_call_string = NULL,
  model_call_string_vars = character(),
  parallel = future::availableCores(logical = FALSE, omit = 1),
  model_packages = NULL,
  y_col = NULL,
  binary_true_value = TRUE,
  pred_fun = function(object, newdata, type = pred_type) {
     stats::predict(object =
    object, newdata = newdata, type = type)
 },
  pred_type = "response",
  boot_it = 100,
  boot_alpha = 0.05,
  boot_centre = "mean",
  seed = 0,
  output_model_stats = TRUE,
  output_model_coefs = TRUE,
  output_ale = TRUE,
  output_boot_data = FALSE,
  ale_options = list(),
  tidy_options = list(),
  glance_options = list(),
  silent = FALSE
)

Arguments

model

Required. See documentation for ALE()

data

dataframe. Dataset that will be bootstrapped. This must be the same data on which the model was trained. If not provided, ModelBoot() will try to detect it automatically. For non-standard models, data should be provided.

...

not used. Inserted to require explicit naming of subsequent arguments.

model_call_string

character(1). If NULL (default), the ModelBoot tries to automatically detect and construct the call for bootstrapped datasets. If it cannot, the function will fail early. In that case, a character string of the full call for the model must be provided that includes boot_data as the data argument for the call. See examples.

model_call_string_vars

character. Names of variables included in model_call_string that are not columns in data. If any such variables exist, they must be specified here or else parallel processing will produce an error. If parallelization is disabled with parallel = 0, then this is not a concern. See documentation for the model_packages argument in ALE().

parallel

See documentation for ALE()

model_packages

See documentation for ALE()

y_col, pred_fun, pred_type

See documentation for ALE(). Used to calculate bootstrapped performance measures. If NULL (default), then the relevant performance measures are calculated only if these arguments can be automatically detected.

binary_true_value

any single atomic value. If the model represented by model or model_call_string is a binary classification model, binary_true_value specifies the value of y_col (the target outcome) that is considered TRUE; any other value of y_col is considered FALSE. This argument is ignored if the model is not a binary classification model. For example, if 2 means TRUE and 1 means FALSE, then set binary_true_value = 2.

boot_it

non-negative integer(1). Number of bootstrap iterations for full-model bootstrapping. For bootstrapping of ALE values, see details to verify if ALE() with bootstrapping is not more appropriate than ModelBoot(). If boot_it = 0, then the model is run as normal once on the full data with no bootstrapping.

boot_alpha

numeric(1) from 0 to 1. Alpha for percentile-based confidence interval range for the bootstrap intervals; the bootstrap confidence intervals will be the lowest and highest (1 - 0.05) / 2 percentiles. For example, if boot_alpha = 0.05 (default), the intervals will be from the 2.5 and 97.5 percentiles.

boot_centre

character(1) in c('mean', 'median'). When bootstrapping, the main estimate for the ALE y value is considered to be boot_centre. Regardless of the value specified here, both the mean and median will be available.

seed

integer. Random seed. Supply this between runs to assure identical bootstrap samples are generated each time on the same data.

output_model_stats

logical(1). If TRUE (default), return overall model statistics.

output_model_coefs

logical(1). If TRUE (default), return model coefficients.

output_ale

logical(1). If TRUE (default), return ALE data and statistics.

output_boot_data

logical(1). If TRUE, return the full raw data for each bootstrap iteration, specifically, the bootstrapped models and the model row indices. Default is FALSE.

ale_options, tidy_options, glance_options

list of named arguments. Arguments to pass to the ALE() when ale = TRUE, broom::tidy() when model_coefs = TRUE, or broom::glance() when model_stats = TRUE, respectively, beyond (or overriding) their defaults. In particular, to obtain p-values for ALE statistics, see the details.

silent

See documentation for ALE()

Value

An object of class ALE with properties model_stats, model_coefs, ale, model_stats, boot_data, and params.

Properties

model_stats

tibble of bootstrapped results from broom::glance(). NULL if model_stats argument is FALSE. In general, only broom::glance() results that make sense when bootstrapped are included, such as df and adj.r.squared. Results that are incomparable across bootstrapped datasets (such as aic) are excluded. In addition, certain model performance measures are included; these are bootstrap-validated with the .632 correction (NOT the .632+ correction):

  • For regression (numeric prediction) models:

    • mae: mean absolute error (MAE)

    • sa_mae_mad: standardized accuracy of the MAE referenced on the mean absolute deviation

    • rmse: root mean squared error (RMSE)

    • sa_rmse_sd: standardized accuracy of the RMSE referenced on the standard deviation

  • For classification (probability) models:

    • auc: area under the ROC curve

model_coefs

A tibble of bootstrapped results from broom::tidy(). NULL if model_coefs argument is FALSE.

ale

A list of bootstrapped ALE results using default ALE() settings unless if overridden with ale_options. NULL if ale argument is FALSE. Elements are:

  * `single`: an `ALE` object of ALE calculations on the full dataset without bootstrapping.
  * `boot`: a list of bootstrapped ALE data and statistics. This element is not an `ALE` object; it is in a special internal format.
boot_data

A tibble of bootstrap results. Each row represents a bootstrap iteration. NULL if boot_data argument is FALSE. The columns are:

  * `it`: the specific bootstrap iteration from 0 to `boot_it` iterations. Iteration 0 is the results from the full dataset (not bootstrapped).
  * `row_idxs`: the row indexes for the bootstrapped sample for that iteration. To save space, the row indexes are returned rather than the full datasets. So, for example, iteration i's bootstrap sample can be reproduced by `data[ModelBoot_obj@boot_data$row_idxs[[2]], ]` where `data` is the dataset and `ModelBoot_obj` is the result of `ModelBoot()`.
  * `model`: the model object trained on that iteration.
  * `tidy`: the results of `broom::tidy(model)` on that iteration.
  * `stats`: the results of `broom::glance(model)` on that iteration.
  * `perf`: performance measures on the entire dataset. These are the measures specified above for regression and classification models.
params

Parameters used to calculate bootstrapped data. Most of these repeat the arguments passed to ModelBoot(). These are either the values provided by the user or used by default if the user did not change them but the following additional objects created internally are also provided:

* `y_cats`: same as `ALE@params$y_cats` (see documentation there).
* `y_type`: same as `ALE@params$y_type` (see documentation there).
* `model`: same as `ALE@params$model` (see documentation there).
* `data`: same as `ALE@params$data` (see documentation there).

Full-model bootstrapping

No modelling results, with or without ALE, should be considered reliable without appropriate validation. For ALE, both the trained model itself and the ALE that explains the trained model must be validated. ALE must be validated by bootstrapping. The trained model might be validated either by cross-validation or by bootstrapping. For ALE that explains trained models that have been developed by cross-validation, it is sufficient to bootstrap just the training data. That is what the ALE object does with its boot_it argument. However, unvalidated models must be validated by bootstrapping them along with the calculation of ALE; this is what the ModelBoot object does with its boot_it argument.

ModelBoot() carries out full-model bootstrapping to validate models. Specifically, it:

  • Creates multiple bootstrap samples (default 100; the user can specify any number);

  • Creates a model on each bootstrap sample;

  • Calculates overall model statistics, variable coefficients, and ALE values for each model on each bootstrap sample;

  • Calculates the mean, median, and lower and upper confidence intervals for each of those values across all bootstrap samples.

p-values

The broom::tidy() summary statistics will provide p-values. However, the procedure for obtaining p-values for ALE statistics is very slow: it involves retraining the model 1000 times. Thus, it is not efficient to calculate p-values whenever a ModelBoot object is created. Although the ALE() function provides an 'auto' option for creating p-values, that option is disabled when creating a ModelBoot because it would be far too slow: it would involve retraining the model 1000 times the number of bootstrap iterations. Rather, you must first create a p-values distribution object using the procedure described in help(-ALEpDist). If the name of your p-values object is p_dist, you can then request p-values each time you create a ModelBoot by passing it the argument ale_options = list(p_values = p_dist).

References

Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.

Examples

# attitude dataset
attitude

## ALE for general additive models (GAM)
## GAM is tweaked to work on the small dataset.
gam_attitude <- mgcv::gam(rating ~ complaints + privileges + s(learning) +
                            raises + s(critical) + advance,
                          data = attitude)
summary(gam_attitude)


# Full model bootstrapping
# Only 4 bootstrap iterations for a rapid example; default is 100
# Increase value of boot_it for more realistic results
mb_gam <- ModelBoot(
  gam_attitude,
  boot_it = 4
)

# If the model is not standard, supply model_call_string with 'data = boot_data'
# in the string instead of the actual dataset name (in addition to the actual dataset
# as the 'data' argument directly to the `ModelBoot` constructor).
mb_gam <- ModelBoot(
  gam_attitude,
  data = attitude,  # the actual dataset
  model_call_string = 'mgcv::gam(
    rating ~ complaints + privileges + s(learning) +
      raises + s(critical) + advance,
    data = boot_data  # required for model_call_string
  )',
  boot_it = 4
)

# Model statistics and coefficients
mb_gam@model_stats
mb_gam@model_coefs

# Plot ALE
plot(mb_gam)

Improvements:
  • Validation: ensure that the object is atomic (not just a vector)

  • For factors, get all classes and check if any class_x is a factor or ordered

  • Add arguments to return a unique mode with options to sort: occurrence order, lexicographical Reduce a dataframe to a sample (retains the structure of its columns)

Description

Improvements:

  • Validation: ensure that the object is atomic (not just a vector)

  • For factors, get all classes and check if any class_x is a factor or ordered

  • Add arguments to return a unique mode with options to sort: occurrence order, lexicographical Reduce a dataframe to a sample (retains the structure of its columns)

Usage

params_data(
  data,
  y_vals,
  data_name = var_name(data),
  sample_size = 500,
  seed = 0
)

Arguments

data

input dataframe

y_vals

y values, y predictions, or a sample thereof

data_name

name of the data argument

sample_size

size of data to sample

seed

random seed

Value

a list


plot method for ALE objects

Description

This plot method simply calls the constructor for an ALEPlots object.

Arguments

x

ALE object.

...

Arguments passed to ALEPlots()


Plot method for ALEPlots object

Description

Plot an ALEPlots object.

Arguments

x

An object of class ALEPlots.

max_print

integer(1). The maximum number of plots that may be printed at a time. 1D plots and 2D are printed separately, so this maximum applies separately to each dimension of ALE plots, not to all dimensions combined.

...

Arguments to pass to patchwork::wrap_plots()

Value

Invisibly returns x.


plot method for ModelBoot objects

Description

This plot method simply calls the constructor for an ALEPlots object.

Arguments

x

ModelBoot object.

...

Arguments passed to ALEPlots()


Compute preparatory data for ALE calculation

Description

Computes data needed to calculate a variable's ALE values.

Usage

prep_var_for_ale(x_col, x_type, x_vals, bins, n, max_num_bins, X = NULL)

Arguments

x_col

character(1). Name of single column in X for which ALE data is to be calculated.

x_type

character(1). var_type() of x_col.

x_vals

vector. The values of x_col.

bins, n

See documentation for calc_ale()

max_num_bins

See documentation for ALE()

X

See documentation for calc_ale(). Used only for categorical x_col.


print Method for ALE object

Description

Print an ALE object.

Arguments

x

An object of class ALE.

...

Additional arguments (currently not used).

Value

Invisibly returns x.

Examples

lm_cars <- stats::lm(mpg ~ ., mtcars)
ale_cars <- ALE(lm_cars, parallel = 0)
print(ale_cars)

Print method for ALEPlots object

Description

Print an ALEPlots object by calling plot().

Arguments

x

An object of class ALEPlots.

max_print

See documentation for plot.ALEPlots()

...

Additional arguments (currently not used).

Value

Invisibly returns x.


print method for ModelBoot object

Description

Print a ModelBoot object.

Arguments

x

An object of class ModelBoot.

...

Additional arguments (currently not used).

Value

Invisibly returns x.

Examples

lm_cars <- stats::lm(mpg ~ ., mtcars)
mb <- ModelBoot(lm_cars, boot_it = 2, parallel = 0)
print(mb)

Resolve x_cols and exclude_cols to their standardized format

Description

Resolve x_cols and exclude_cols to their standardized format of x_cols to specify which 1D and 2D ALE elements are required. This specification is used throughout the ALE package. x_cols specifies the desired columns or interactions whereas exclude_cols optionally specifies any columns or interactions to remove from x_cols. The result is x_colsexclude_cols, giving considerable flexibility in specifying the precise columns desired.

Usage

resolve_x_cols(x_cols, col_names, y_col, exclude_cols = NULL, silent = FALSE)

Arguments

x_cols

character, list, or formula. Columns and interactions requested in one of the special x_cols formats. x_cols values not found in col_names will error. See examples.

col_names

character. All the column names from a dataset.

y_col

character(1). The y outcome column. If found in any x_cols value, it will be silently removed.

exclude_cols

Same possible formats as x_cols. Columns and interactions to exclude from those requested in x_cols. exclude_cols values not found in col_names will be ignored with a message (which can be silenced with silent).

silent

logical(1). If TRUE, no message will be given; in particular, x_cols not found in col_names will be silently ignored. Default is FALSE. Regardless, warnings and errors are never silenced (e.g, invalid x_cols formats will still report errors).

Value

x_cols in canonical format, which is always a list with two elements:

  • d1: a character vector with each requested column for 1D ALE. If empty (no 1D ALE at all), its value is list().

  • d2: a list of length-2 character vectors with each requested 2D ALE interaction pair. If empty (no 2D ALE at all), its value is list().

See examples for details.

x_cols format options

The x_cols argument determines which predictor variables and interactions are included in the analysis. It supports multiple input formats:

  • Character vector: A simple vector of column names (e.g., c("a", "b")) selects individual columns for 1D ALE analysis.

  • Formula (~): Allows specifying variables and interactions in formula notation (e.g., ~ a + b + a:b), which is automatically converted into a structured format. The outcome term is optional and will be ignored regardless. So, ~ a + b + a:b produces results identical to whatever ~ a + b + a:b.

  • List format: Users can explicitly separate 1D and 2D ALE interactions:

    • A character vector named d1 like list(d1 = c("a", "b")) specifies 1D ALE variables.

    • A list of 2-character vectors named d2 like list(d2 = list(c("a", "b"), c("a", "c"))) specifies 2D interactions.

    • list(d1 = c("a", "b"), d2 = list(c("a", "b"), c("a", "c"))) combines both.

  • Boolean selection for an entire dimension:

    • list(d1 = TRUE) selects all available variables for 1D ALE, excluding y_col.

    • list(d2 = TRUE) selects all possible 2D interactions among all columns in col_names, excluding y_col.

  • Selective 2D expansion: list(d2_all = "a") includes all 2D interactions where "a" is one of the interacting variables.

  • NULL (or unspecified): If x_cols = NULL, no variables are selected.

Regardless of format, y_col is always removed automatically. The function ensures all variables are valid and in col_names, providing informative messages unless silent = TRUE. And regardless of the specification format, the result will always be standardized in the format specified in the return value.

Run examples for details.

Examples

## Sample data
set.seed(0)
df <- data.frame(
  y = runif(10),
  a = sample(letters[1:3], 10, replace = TRUE),
  b = rnorm(10),
  c = sample(1:5, 10, replace = TRUE)
)
col_names <- names(df)
y_col <- "y"  # Assume 'y' is the outcome variable


## Examples with just x_cols to show different formats for specifying x_cols
## (same format for exclude_cols)

# Character vector: Simple ALE with no interactions
resolve_x_cols(c("a", "b"), col_names, y_col)

# Character string: Select just one 1D element
resolve_x_cols("c", col_names, y_col)

# Character string: Select just one 1D element
resolve_x_cols("c", col_names, y_col)

# list of 1- and 2-length character vectors: specify precise 1D and 2D elements desired
resolve_x_cols(list(c("a", "b"), "c", c("c", "a"), "b"), col_names, y_col)

# Formula: Converts to a list of individual elements
resolve_x_cols(~ a + b, col_names, y_col)

# Formula with interactions (1D and 2D).
# This format is probably more convenient if you know precisely which terms you want.
# Note that the outcome on the left-hand-side is always silently ignored.
resolve_x_cols(whatever ~ a + b + a:b + c:b, col_names, y_col)

# List specifying d1 (1D ALE)
resolve_x_cols(list(d1 = c("a", "b")), col_names, y_col)

# List specifying d2 (2D ALE)
resolve_x_cols(list(d2 = list(c("a", "b"))), col_names, y_col)

# List specifying both d1 and d2
resolve_x_cols(list(d1 = c("a", "b"), d2 = list(c("a", "b"))), col_names, y_col)

# d1 as TRUE (select all columns except y_col)
resolve_x_cols(list(d1 = TRUE), col_names, y_col)

# d2 as TRUE (select all possible 2D interactions)
resolve_x_cols(list(d2 = TRUE), col_names, y_col)

# d2_all: Request all 2D interactions involving a specific variable
resolve_x_cols(list(d2_all = "a"), col_names, y_col)

# NULL: No variables selected
resolve_x_cols(NULL, col_names, y_col)



## Examples of how exclude_cols are removed from x_cols to obtain various desired results

# Exclude one column from a simple character vector
resolve_x_cols(
  x_cols = c("a", "b", "c"),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = "b"
)

# Exclude multiple columns
resolve_x_cols(
  x_cols = c("a", "b", "c"),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = c("a", "c")
)

# Exclude an interaction term from a formula input
resolve_x_cols(
  x_cols = ~ a + b + a:b,
  col_names = col_names,
  y_col = y_col,
  exclude_cols = ~ a:b
)

# Exclude all columns from x_cols
resolve_x_cols(
  x_cols = c("a", "b", "c"),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = c("a", "b", "c")
)

# Exclude non-existent columns (should be ignored)
resolve_x_cols(
  x_cols = c("a", "b"),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = "z"
)

# Exclude one column from a list-based input
resolve_x_cols(
  x_cols = list(d1 = c("a", "b"), d2 = list(c("a", "b"), c("a", "c"))),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = list(d1 = "a")
)

# Exclude interactions only
resolve_x_cols(
  x_cols = list(d1 = c("a", "b", "c"), d2 = list(c("a", "b"), c("a", "c"))),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = list(d2 = list(c("a", "b")))
)

# Exclude everything, including interactions
resolve_x_cols(
  x_cols = list(d1 = c("a", "b", "c"), d2 = list(c("a", "b"), c("a", "c"))),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = list(d1 = c("a", "b", "c"), d2 = list(c("a", "b"), c("a", "c")))
)

# Exclude a column implicitly removed by y_col
resolve_x_cols(
  x_cols = c("y", "a", "b"),
  col_names = col_names,
  y_col = "y",
  exclude_cols = "y"
)

# Exclude entire 2D dimension from x_cols with d2 = TRUE
resolve_x_cols(
  x_cols = list(d1 = TRUE, d2 = list(c("a", "b"), c("a", "c"))),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = list(d1 = c("a"), d2 = TRUE)
)

summary method for ALEPlots object

Description

Present concise summary information about an ALEPlots object.

Arguments

object

An object of class ALEPlots.

...

Not used

Value

Summary string.


Multi-variable transformation of the mtcars dataset.

Description

This is a transformation of the mtcars dataset from R to produce a small dataset with each of the fundamental datatypes: logical, factor, ordered, integer, double, and character. Most of the transformations are obvious, but a few are noteworthy:

  • The row names (the car model) are saved as a character vector.

  • For the unordered factors, the country and continent of the car manufacturer are obtained based on the row names (model).

  • For the ordered factor, gears 3, 4, and 5 are encoded as 'three', 'four', and 'five', respectively. The text labels make it explicit that the variable is ordinal, yet the number names make the order crystal clear.

Here is the adaptation of the original description of the mtcars dataset:

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

Usage

var_cars

Format

A tibble with 32 observations on 14 variables.

model

character: Car model

mpg

double: Miles/(US) gallon

cyl

integer: Number of cylinders

disp

double: Displacement (cu.in.)

hp

double: Gross horsepower

drat

double: Rear axle ratio

wt

double: Weight (1000 lbs)

qsec

double: 1/4 mile time

vs

logical: Engine (0 = V-shaped, 1 = straight)

am

logical: Transmission (0 = automatic, 1 = manual)

gear

ordered: Number of forward gears

carb

integer: Number of carburetors

country

factor: Country of car manufacturer

continent

factor: Continent of car manufacturer

Note

Henderson and Velleman (1981) comment in a footnote to Table 1: 'Hocking (original transcriber)'s noncrucial coding of the Mazda's rotary engine as a straight six-cylinder engine and the Porsche's flat engine as a V engine, as well as the inclusion of the diesel Mercedes 240D, have been retained to enable direct comparisons to be made with previous analyses.'

References

Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.


Determine the datatype of a vector

Description

Determine the datatype of a vector

Usage

var_type(var)

Arguments

var

vector whose datatype is to be determined

Not exported. See @returns for details of what it does.

Value

Returns generic datatypes of R basic vectors according to the following mapping:

  • logical returns 'binary'

  • numeric values (e.g., integer and double) return 'numeric'

  • However, if the only values of numeric are 0 and 1, then it returns 'binary'

  • unordered factor returns 'categorical'

  • ordered factor returns 'ordinal'