Title: | Interpretable Machine Learning and Statistical Inference with Accumulated Local Effects (ALE) |
---|---|
Description: | Accumulated Local Effects (ALE) were initially developed as a model-agnostic approach for global explanations of the results of black-box machine learning algorithms. ALE has a key advantage over other approaches like partial dependence plots (PDP) and SHapley Additive exPlanations (SHAP): its values represent a clean functional decomposition of the model. As such, ALE values are not affected by the presence or absence of interactions among variables in a model. Moreover, its computation is relatively rapid. This package reimplements the algorithms for calculating ALE data and develops highly interpretable visualizations for plotting these ALE values. It also extends the original ALE concept to add bootstrap-based confidence intervals and ALE-based statistics that can be used for statistical inference. For more details, see Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. <arXiv:2310.09877>. <doi:10.48550/arXiv.2310.09877>. |
Authors: | Chitu Okoli [aut, cre] |
Maintainer: | Chitu Okoli <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.3.0.20250217 |
Built: | 2025-02-19 10:31:55 UTC |
Source: | https://github.com/tripartio/ale |
Array or matrix addition in base R sets all sums in an element position to NA if any element in that position is NA in either of the arrays being added. In contrast, this function ignores NA values by default in its addition.
add_array_na.rm(ary1, ary2, na.rm = TRUE)
ary1, ary2 |
numeric arrays or matrices. The arrays to be added. They must be of the same dimensions. |
na.rm |
logical(1). TRUE (default) if missing values (NA) should be ignored in the addition. |
An array or matrix of the same dimensions as `ary1` and `ary2` whose values are the sums of `ary1` and `ary2` in each corresponding element.
Reduce(add_array_na.rm, list(x1, x2, x3))
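A minimal sketch of the difference in behaviour, assuming `add_array_na.rm()` behaves as described above:

```r
# Illustrative sketch, assuming add_array_na.rm() behaves as described above
m1 <- matrix(c(1, NA, 3, 4), nrow = 2)
m2 <- matrix(c(10, 20, NA, 40), nrow = 2)

m1 + m2                  # base R: any NA in a position yields NA in that position
add_array_na.rm(m1, m2)  # NA values are ignored in the addition by default
```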
An `ALE` S7 object contains ALE data and statistics. For details, see `vignette('ale-intro')` or the details and examples below.
ALE(
  model,
  x_cols = list(d1 = TRUE),
  data = NULL,
  y_col = NULL,
  ...,
  exclude_cols = NULL,
  parallel = future::availableCores(logical = FALSE, omit = 1),
  model_packages = NULL,
  output_stats = TRUE,
  output_conf = TRUE,
  output_boot_data = FALSE,
  pred_fun = function(object, newdata, type = pred_type) {
    stats::predict(object = object, newdata = newdata, type = type)
  },
  pred_type = "response",
  p_values = NULL,
  p_alpha = c(0.01, 0.05),
  max_num_bins = 10,
  boot_it = 0,
  boot_alpha = 0.05,
  boot_centre = "mean",
  seed = 0,
  y_type = NULL,
  median_band_pct = c(0.05, 0.5),
  sample_size = 500,
  .bins = NULL,
  silent = FALSE
)
model |
model object. Required. Model for which ALE should be calculated. May be any kind of R object that can make predictions from data. |
x_cols, exclude_cols |
character, list, or formula. Column names from `data` for which to calculate ALE (`x_cols`) or to exclude from the calculation (`exclude_cols`). |
data |
dataframe. Dataset from which to create predictions for the ALE. It should normally be the same dataset on which `model` was trained. |
y_col |
character(1). Name of the outcome target label (y) variable. If not provided, `ALE()` attempts to detect it automatically from the `model`. |
... |
not used. Inserted to require explicit naming of subsequent arguments. |
parallel |
non-negative integer(1). Number of parallel threads (workers or tasks) for parallel execution of the function. Set `parallel = 0` to disable parallel processing. See the "Parallel processing" section below for details. |
model_packages |
character. Character vector of names of packages that `model` depends on that might not be loaded automatically with parallel processing. See the "Parallel processing" section below for details. |
output_stats |
logical(1). If `TRUE` (default), return ALE statistics. |
output_conf |
logical(1). If `TRUE` (default), return confidence regions for the ALE statistics. |
output_boot_data |
logical(1). If `TRUE`, return the raw data from the bootstrapped ALE iterations (default `FALSE`). |
pred_fun, pred_type |
function, character(1). `pred_fun` is the function that generates predictions from `model`; `pred_type` is the type of prediction to generate, passed as the `type` argument of `pred_fun`. See the details below about custom prediction functions. |
p_values |
instructions for calculating p-values and to determine the median band. If `NULL` (default), no p-values are calculated and `median_band_pct` is used to determine the median band. To calculate p-values, provide an `ALEpDist` object. |
p_alpha |
numeric length 2 from 0 to 1. Alpha for "confidence interval" ranges for printing bands around the median for single-variable plots. These are the default values used if `p_values` is provided; if `p_values` is not provided, `median_band_pct` is used instead. |
max_num_bins |
positive integer(1). Maximum number of bins for numeric `x_cols` variables. |
boot_it |
non-negative integer(1). Number of bootstrap iterations for data-only bootstrapping on ALE data. This is appropriate for models that have been developed with cross-validation. For models that have not been validated, full-model bootstrapping should be used instead with the `ModelBoot` class. |
boot_alpha |
numeric(1) from 0 to 1. Alpha for percentile-based confidence interval range for the bootstrap intervals; the bootstrap confidence intervals will be the lowest and highest `(boot_alpha / 2)` percentiles of the bootstrapped values. For example, the default 0.05 gives 95% confidence intervals (from the 2.5th to the 97.5th percentile). |
boot_centre |
character(1) in c('mean', 'median'). When bootstrapping, the main estimate for the ALE y value is considered to be the `boot_centre` average of the bootstrapped values. |
seed |
integer(1). Random seed. Supply this between runs to assure that identical random ALE data is generated each time when bootstrapping. Without bootstrapping, ALE is a deterministic algorithm that should result in identical results each time regardless of the seed specified. |
y_type |
character(1) in c('binary', 'numeric', 'categorical', 'ordinal'). Datatype of the y (outcome) variable. Normally determined automatically; only provide if an error message for a complex non-standard model requires it. |
median_band_pct |
numeric length 2 from 0 to 1. Alpha for "confidence interval" ranges for printing bands around the median for single-variable plots. These are the default values used if `p_values` is not provided; if `p_values` is provided, `p_alpha` is used instead. |
sample_size |
non-negative integer(1). Size of the sample of `data` to be stored with the ALE object for use in plots. |
.bins |
Internal. List of ALE bin and n count vectors. If provided, these vectors will be used to set the intervals of the ALE x axis for each variable. By default (`NULL`), the bins are determined from `data`. |
silent |
logical(1), default `FALSE`. If `TRUE`, suppress progress bars and non-essential messages. |
An object of class `ALE` with properties `distinct` and `params`.
Stores the optional ALE data, ALE statistics, and bootstrap data for one or more categories.
The parameters used to calculate the ALE data. These include most of the arguments used to construct the `ALE` object. These are either the values provided by the user or the defaults if the user did not change them, but they also include several objects that are created within the constructor. These extra objects are described here, as well as those parameters that are stored differently from the form in the arguments:
* `max_d`: the highest dimension of ALE data present. If only 1D ALE is present, then `max_d == 1`. If even one 2D ALE element is present (even with no 1D), then `max_d == 2`.
* `requested_x_cols`, `ordered_x_cols`: `requested_x_cols` is the resolved list of `x_cols` as requested by the user (that is, `x_cols` minus `exclude_cols`). `ordered_x_cols` is the same set of `x_cols` but arranged in the internal storage order.
* `y_cats`: categories for categorical classification models. For non-categorical models, this is the same as `y_col`.
* `y_type`: high-level datatype of the y outcome variable.
* `y_summary`: summary statistics of y values used for the ALE calculation. These statistics are based on the actual values of `y_col` unless `y_type` is a probability or other value constrained to the `[0, 1]` range. In that case, `y_summary` is based on the values of `y_col` predicted by applying `model` to `data`. `y_summary` is a named numeric vector. Most of the elements are percentiles of the y values. E.g., the '5%' element is the 5th percentile of y values. The following elements have special meanings:
  * The first element is named either `p` or `q` and its value is always 0. The value is not used; only the name of the element is meaningful. `p` means that the following special `y_summary` elements are based on the provided `ALEpDist` object. `q` means that quantiles were calculated based on `median_band_pct` because `p_values` was not provided.
  * `min`, `mean`, `max`: the minimum, mean, and maximum y values, respectively. Note that the median is `50%`, the 50th percentile.
  * `med_lo_2`, `med_lo`, `med_hi`, `med_hi_2`: `med_lo` and `med_hi` are the inner lower and upper confidence intervals of y values with respect to the median (`50%`); `med_lo_2` and `med_hi_2` are the outer confidence intervals. See the documentation for the `p_alpha` and `median_band_pct` arguments to understand how these are determined.
* `model`: same as `ALE@params$model` (see documentation there).
* `data`: same as `ALE@params$data` (see documentation there).
The calculation of ALE requires modifying several values of the original `data`. Thus, `ALE()` needs direct access to the `predict` function for the `model`. By default, `ALE()` uses a generic default `predict` function of the form `predict(object, newdata, type)` with the default prediction type of `'response'`. If, however, the desired prediction values are not generated with that format, the user must specify what they want. Very often, the only modification needed is to change the prediction type to some other value by setting the `pred_type` argument (e.g., to `'prob'` to generate classification probabilities). But if the desired predictions need a different function signature, then the user must create a custom prediction function and pass it to `pred_fun`. The requirements for this custom function are:

It must take three required arguments and nothing else:

* `object`: a model
* `newdata`: a dataframe or compatible table type such as a tibble or data.table
* `type`: a string; it should usually be specified as `type = pred_type`

These argument names follow the R convention for the generic `stats::predict()` function.

It must return a vector or matrix of numeric values as the prediction.

You can see an example below of a custom prediction function.

Note: `survival` models probably do not need a custom prediction function, but `y_col` must be set to the name of the binary event column and `pred_type` must be set to the desired prediction type.

For details about the ALE-based statistics (ALED, ALER, NALED, and NALER), see `vignette('ale-statistics')`.
Parallel processing using the `{furrr}` framework is enabled by default. By default, it will use all the available physical CPU cores (minus the core being used for the current R session) with the setting `parallel = future::availableCores(logical = FALSE, omit = 1)`. Note that only physical cores are used (not logical cores or "hyperthreading") because machine learning can only take advantage of the floating-point processors on physical cores, which are absent from logical cores. Trying to use logical cores will not speed up processing and might actually slow it down with useless data transfer. If you will dedicate the entire computer to running this function (and you don't mind everything else becoming very slow while it runs), you may use all cores by setting `parallel = future::availableCores(logical = FALSE)`. To disable parallel processing, set `parallel = 0`.
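The three parallelization settings just described can be summarized as follows (`model` is a placeholder for any fitted model object supported by `ALE()`):

```r
# 'model' is a placeholder for any fitted model object supported by ALE()
ale_default <- ALE(model, parallel = future::availableCores(logical = FALSE, omit = 1))
ale_allcore <- ALE(model, parallel = future::availableCores(logical = FALSE))
ale_serial  <- ALE(model, parallel = 0)  # disable parallel processing
```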
The `{ale}` package should be able to automatically recognize and load most packages that are needed, but with parallel processing enabled (which is the default), some packages might not be properly loaded. This problem might be indicated by a strange error message that mentions "progress interrupted" or "future", especially if such errors appear after the progress bars begin displaying (assuming you did not disable progress bars with `silent = TRUE`). In that case, first try disabling parallel processing with `parallel = 0`. If that resolves the problem, then to get faster parallel processing to work, try adding the package names needed for the `model` to this argument, e.g., `model_packages = c('tidymodels', 'mgcv')`.
Progress bars are implemented with the `{progressr}` package, which lets the user fully control progress bars. To disable progress bars, set `silent = TRUE`. The first time a function that requires progress bars is called in the `{ale}` package, it checks whether the user has activated the necessary `{progressr}` settings. If not, the `{ale}` package automatically enables `{progressr}` progress bars with the `cli` handler and prints a message notifying the user.

If you like the default progress bars and want to make them permanent, you can add the following lines of code to your `.Rprofile` configuration file and they will become your defaults for every R session; you will not see the message again:
progressr::handlers(global = TRUE)
progressr::handlers('cli')
This applies not only to the `{ale}` package but to any package that uses `{progressr}` progress bars. For more details on formatting progress bars to your liking, see the introduction to the `{progressr}` package.
Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.
# Sample 1000 rows from the ggplot2::diamonds dataset (for a simple example)
set.seed(0)
diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ]

# Create a GAM model with flexible curves to predict diamond price
# Smooth all numeric variables and include all other variables
gam_diamonds <- mgcv::gam(
  price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
    cut + color + clarity,
  data = diamonds_sample
)
summary(gam_diamonds)

# Simple ALE without bootstrapping
ale_gam_diamonds <- ALE(gam_diamonds)

# Plot the ALE data
plot(ale_gam_diamonds)

# Bootstrapped ALE
# This can be slow, since bootstrapping runs the algorithm boot_it times
# Create ALE with 100 bootstrap samples
ale_gam_diamonds_boot <- ALE(
  gam_diamonds,
  boot_it = 100
)

# Bootstrapped ALEs print with confidence intervals
plot(ale_gam_diamonds_boot)

# If the predict function you want is non-standard, you may define a
# custom predict function. It must return a single numeric vector.
custom_predict <- function(object, newdata, type = pred_type) {
  predict(object, newdata, type = type, se.fit = TRUE)$fit
}

ale_gam_diamonds_custom <- ALE(
  gam_diamonds,
  pred_fun = custom_predict,
  pred_type = 'link'
)

# Plot the ALE data
plot(ale_gam_diamonds_custom)

# How to retrieve specific types of ALE data from an ALE object.
ale_diamonds_with__boot_data <- ALE(
  gam_diamonds,
  # For detailed options for x_cols, see examples at resolve_x_cols()
  x_cols = ~ carat + cut + clarity + color:depth + x:y,
  output_boot_data = TRUE,
  boot_it = 10  # just for demonstration
)

# See ?get.ALE for details on the various kinds of data that may be retrieved.
get(ale_diamonds_with__boot_data, ~ carat + color:depth)  # default ALE data
get(ale_diamonds_with__boot_data, what = 'boot_data')
get(ale_diamonds_with__boot_data, stats = 'estimate')
get(ale_diamonds_with__boot_data, stats = 'aled')
get(ale_diamonds_with__boot_data, stats = 'all')
get(ale_diamonds_with__boot_data, stats = 'conf_regions')
get(ale_diamonds_with__boot_data, stats = 'conf_sig')
Not exported. The following statistics are calculated based on a vector of ALE y values:
ale_stats(y, bin_n, y_vals = NULL, ale_y_norm_fun = NULL, x_type = "numeric")
y |
numeric. Vector of ALE y values. |
bin_n |
numeric. Vector of counts of rows in each ALE bin. Must be the same length as `y`. |
y_vals |
numeric. Entire vector of y values. Needed for normalization. If not provided, ale_y_norm_fun must be provided. |
ale_y_norm_fun |
function. A precomputed function to normalize ALE y values. If not provided, `y_vals` must be provided. |
x_type |
character(1). Datatype of the x variable on which the ALE y is based (default `'numeric'`). |
* ALE deviation (ALED)
* ALE range (ALER): range from the minimum value of any ALE y to the maximum value of any ALE y. This is a very simple indication of the dispersion in ALE y values.
* Normalized ALE deviation (NALED)
* Normalized ALE range (NALER)

Note that if any ALE y values are missing, they will be deleted from the calculation (with their corresponding `bin_n`).
Named numeric vector:

* `aled`: ALE deviation (ALED)
* `aler_min`: Minimum (lower value) of the ALE range (ALER)
* `aler_max`: Maximum (upper value) of the ALE range (ALER)
* `naled`: Normalized ALE deviation (NALED)
* `naler_min`: Normalized minimum (lower value) of the ALE range (NALER)
* `naler_max`: Normalized maximum (upper value) of the ALE range (NALER)
When calculating second-order (2D) ALE statistics, there is no difficulty if both variables are categorical: the regular formulas for ALE operate normally. However, if one or both variables are numeric, the calculation is complicated by the need to determine the ALE midpoints between the ALE bin ceilings of the numeric variables. This function calculates these ALE midpoints for the numeric variables and resets the ALE bins to these values. The ALE values for ordinal variables are not changed. As part of the adjustment, the lowest numeric bin is merged into the second: its ALE values are deleted entirely (since they do not represent a midpoint) and its counts are added to the first true bin.
ale_stats_2D(ale_data, x_cols, x_types, y_vals = NULL, ale_y_norm_fun = NULL)
ale_data |
dataframe. ALE data from which to calculate the statistics. |
x_cols |
character. Names of the x columns in `ale_data`. |
x_types |
character, same length as `x_cols`. Datatypes of the x columns. |
y_vals |
See documentation for `ale_stats()`. |
ale_y_norm_fun |
See documentation for `ale_stats()`. |
After these possible adjustments, the ALE y values and bin counts are passed to `ale_stats()`, which calculates their statistics as an ordinal variable since the numeric variables have thus been discretized.

Not exported.

Same as `ale_stats()`.
ALE statistics are accompanied by two indicators of the confidence of their values. First, bootstrapping creates confidence intervals for ALE measures and ALE statistics to give a range of likely values. Second, we calculate p-values, an indicator of the probability that a given ALE statistic is random. Calculating p-values is not trivial for ALE statistics because ALE is non-parametric and model-agnostic. Because ALE is non-parametric (that is, it does not assume any particular distribution of data), the `{ale}` package generates p-values by calculating ALE for many random variables; this makes the procedure somewhat slow. For this reason, they are not calculated by default; they must be explicitly requested. Because the `{ale}` package is model-agnostic (that is, it works with any kind of R model), the `ALE()` constructor cannot always automatically manipulate the model object to create the p-values. It can only do so for models that follow the standard R statistical modelling conventions, which include almost all built-in R algorithms (like `stats::lm()` and `stats::glm()`) and many widely used statistics packages (like `mgcv` and `survival`), but which exclude most machine learning algorithms (like `tidymodels` and `caret`). For non-standard algorithms, the user needs to do a little work to help the constructor correctly manipulate its model object. The full model call must be passed as a character string in the argument `random_model_call_string`, with two slight modifications:

* In the formula that specifies the model, add a variable named 'random_variable'. This corresponds to the random variables that the constructor will use to estimate p-values.
* The dataset on which the model is trained must be named 'rand_data'. This corresponds to the modified datasets that will be used to train the random variables.

See the example below for how this is implemented.
ALEpDist(
  model,
  data = NULL,
  p_speed = "approx fast",
  ...,
  parallel = future::availableCores(logical = FALSE, omit = 1),
  model_packages = NULL,
  random_model_call_string = NULL,
  random_model_call_string_vars = character(),
  y_col = NULL,
  binary_true_value = TRUE,
  pred_fun = function(object, newdata, type = pred_type) {
    stats::predict(object = object, newdata = newdata, type = type)
  },
  pred_type = "response",
  output_residuals = FALSE,
  rand_it = 1000,
  seed = 0,
  silent = FALSE,
  .testing_mode = FALSE
)
model |
See documentation for `ALE()`. |
data |
See documentation for `ALE()`. |
p_speed |
character(1). Either 'approx fast' (default) or 'precise slow'. See details. |
... |
not used. Inserted to require explicit naming of subsequent arguments. |
parallel |
See documentation for `ALE()`. |
model_packages |
See documentation for `ALE()`. |
random_model_call_string |
character(1). If the constructor cannot automatically manipulate `model` to retrain it with random variables (as with most machine learning algorithms), provide the full model call as a character string, modified as described in the details above. |
random_model_call_string_vars |
See documentation for `ModelBoot()`. |
y_col |
See documentation for `ALE()`. |
binary_true_value |
See documentation for `ModelBoot()`. |
pred_fun, pred_type |
See documentation for `ALE()`. |
output_residuals |
logical(1). If `TRUE`, return the residuals of the `model` predictions on `data` (default `FALSE`). See the description of the return value below. |
rand_it |
non-negative integer(1). Number of times that the model should be retrained with a new random variable. The default of 1000 should give reasonably stable p-values; these are considered "exact" p-values. It can be reduced as low as 100 for faster "approximate" p-values in test runs, but then the p-values are not as stable. |
seed |
See documentation for `ALE()`. |
silent |
See documentation for `ALE()`. |
.testing_mode |
logical(1). Internal use only. Disables some data validation checks to allow for debugging. |
An object of class `ALEpDist` with properties `rand_stats`, `residual_distribution`, `rand_it_ok`, and `residuals`.
* `rand_it_ok`: An integer with the number of `rand_it` iterations that successfully generated a random variable, that is, those that did not fail for whatever reason. The `rand_it - rand_it_ok` failed attempts are discarded.
* `rand_stats`: A named list of tibbles. There is normally one element whose name is the same as `y_col` except if `y_col` is a categorical variable; in that case, the elements are named for each category of `y_col`. Each element is a tibble whose rows are each of the `rand_it_ok` iterations of the random variable analysis and whose columns are the ALE statistics obtained for each random variable.
* `residual_distribution`: A `univariateML` object with the closest estimated distribution for the residuals as determined by `univariateML::model_select()`. This is the distribution used to generate all the random variables.
* `residuals`: If `output_residuals == TRUE`, returns a matrix of the actual `y_col` values from `data` minus the values predicted by the `model` (without random variables) on the `data`. If `output_residuals == FALSE` (default), does not return these residuals. The rows correspond to each row of `data`. The columns correspond to the named elements (`y_col` or categories) described above for `rand_stats`.
The `{ale}` package takes a literal frequentist approach to the calculation of p-values. That is, it literally retrains the model 1000 times, each time modifying it by adding a distinct random variable to the model. (The number of iterations is customizable with the `rand_it` argument.) The ALEs and ALE statistics are calculated for each random variable. The percentiles of the distribution of these random-variable ALEs are then used to determine p-values for non-random variables. Thus, p-values are interpreted as the frequency of random-variable ALE statistics that exceed the value of the ALE statistic of the actual variable in question. The specific steps are as follows:

* The residuals of the original model trained on the training data are calculated (residuals are the actual y target values minus the predicted values).
* The closest distribution of the residuals is detected with `univariateML::model_select()`.
* 1000 new models are trained by generating a random variable each time with `univariateML::rml()` and then training a new model with that random variable added.
* The ALEs and ALE statistics are calculated for each random variable.
* For each ALE statistic, the empirical cumulative distribution function (from `stats::ecdf()`) is used to create a function to determine p-values according to the distribution of the random variables' ALE statistics.

What we have just described is the precise approach to calculating p-values, requested with the argument `p_speed = 'precise slow'`. Because it is so slow, the `ALEpDist` constructor by default implements an approximate algorithm (`p_speed = 'approx fast'`) that trains only a few random variables, up to the number of physical parallel processing threads available, with a minimum of four. To increase speed, the random variables use only 10 ALE bins instead of the default 100. Although approximate p-values are much faster than precise ones, they are still somewhat slow: at the very quickest, they take at least the amount of time that it would take to train the original model two or three times. See the "Parallel processing" section below for more details on the speed of computation.
Parallel processing using the `{furrr}` framework is enabled by default. By default, it will use all the available physical CPU cores (minus the core being used for the current R session) with the setting `parallel = future::availableCores(logical = FALSE, omit = 1)`. Note that only physical cores are used (not logical cores or "hyperthreading") because machine learning can only take advantage of the floating-point processors on physical cores, which are absent from logical cores. Trying to use logical cores will not speed up processing and might actually slow it down with useless data transfer.

For exact p-values, by default 1000 random variables are trained. So, even with parallel processing, the procedure is very slow. However, an `ALEpDist` object trained with a specific model on a specific dataset can be reused as often as needed for the identical model-dataset pair.

For approximate p-values (the default), at least four random variables are trained to give some minimal variation. With parallel processing, more random variables can be trained to increase the accuracy of the p-value estimates, up to the maximum number of physical cores.
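The two `p_speed` modes and the reuse of a fitted distribution object can be sketched as follows (`model` and `dat` are placeholders for a fitted model and its training dataset):

```r
# 'model' and 'dat' are placeholders for a fitted model and its dataset
pd_fast    <- ALEpDist(model, dat)                            # 'approx fast' default
pd_precise <- ALEpDist(model, dat, p_speed = 'precise slow')  # ~1000 retrainings

# A fitted ALEpDist object can be reused for any number of ALE analyses
# on the identical model-dataset pair:
ale_with_p <- ALE(model, data = dat, p_values = pd_precise)
```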
Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.
# Sample 1000 rows from the ggplot2::diamonds dataset (for a simple example)
set.seed(0)
diamonds_sample <- ggplot2::diamonds[sample(nrow(ggplot2::diamonds), 1000), ]

# Create a GAM with flexible curves to predict diamond price
# Smooth all numeric variables and include all other variables
gam_diamonds <- mgcv::gam(
  price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
    cut + color + clarity,
  data = diamonds_sample
)
summary(gam_diamonds)

# Create p_value distribution
pd_diamonds <- ALEpDist(
  gam_diamonds,
  diamonds_sample,
  # only 100 iterations for a quick demo; but usually should remain at 1000
  rand_it = 100
)

# Examine the structure of the returned object
str(pd_diamonds)
# In RStudio: View(pd_diamonds)

# Calculate ALEs with p-values
ale_gam_diamonds <- ALE(
  gam_diamonds,
  p_values = pd_diamonds
)

# Plot the ALE data. The horizontal bands in the plots use the p-values.
plot(ale_gam_diamonds)

# For non-standard models that give errors with the default settings,
# you can use 'random_model_call_string' to specify a model for the estimation
# of p-values from random variables as in this example.
# See details above for an explanation.
pd_diamonds <- ALEpDist(
  gam_diamonds,
  diamonds_sample,
  random_model_call_string = 'mgcv::gam(
    price ~ s(carat) + s(depth) + s(table) + s(x) + s(y) + s(z) +
      cut + color + clarity + random_variable,
    data = rand_data
  )',
  # only 100 iterations for a quick demo; but usually should remain at 1000
  rand_it = 100
)
An `ALEPlots` S7 object contains the ALE plots from `ALE` or `ModelBoot` objects, stored as `ggplot` objects. The `ALEPlots` constructor creates all possible plots from the `ALE` or `ModelBoot` object passed to it: not only individual 1D and 2D ALE plots, but also special plots like the ALE effects plot. So, an `ALEPlots` object is a collection of plots, almost never a single plot. To retrieve specific plots, use the `get.ALEPlots()` method; see the examples there or with the `ALE` and `ModelBoot` objects.
ALEPlots(
  obj,
  ...,
  relative_y = "median",
  p_alpha = c(0.01, 0.05),
  median_band_pct = c(0.05, 0.5),
  rug_sample_size = obj@params$sample_size,
  min_rug_per_interval = 1,
  n_x1_bins = NULL,
  n_x2_bins = NULL,
  n_y_quant = 10,
  seed = 0,
  silent = FALSE
)
obj |
|
... |
not used. Inserted to require explicit naming of subsequent arguments. |
relative_y |
character(1) in c('median', 'mean', 'zero'). The ALE y values in the plots will be adjusted relative to this value. 'median' is the default. 'zero' will maintain the actual ALE values, which are relative to zero. |
p_alpha |
numeric(2) from 0 to 1. Alpha for "confidence interval" ranges for printing bands around the median for single-variable plots. These are the default values used if |
median_band_pct |
numeric length 2 from 0 to 1. Alpha for "confidence interval" ranges for printing bands around the median for single-variable plots. These are the default values used if |
rug_sample_size , min_rug_per_interval
|
non-negative integer(1). Rug plots are down-sampled to |
n_x1_bins , n_x2_bins
|
positive integer(1). Number of bins for the x1 or x2 axes respectively for the 2D interaction plot. These values are ignored if x1 or x2 are not numeric (i.e., if they are logical or factors). |
n_y_quant |
positive integer(1). Number of intervals over which the range of y values is divided for the colour bands of the interaction plot. See details. |
seed |
See documentation for |
silent |
See documentation for |
An object of class ALEPlots
with properties distinct
and params
.
Stores the ALE plots. Use get.ALEPlots()
to access them.
The parameters used to calculate the ALE plots. These include most of the arguments used to construct the ALEPlots
object. These are either the values provided by the user or the defaults if the user did not change them, but several objects created within the constructor are also included. These extra objects are described here, as well as those parameters that are stored differently from the form of the arguments:
* `max_d`: See documentation for [ALE()]
* `requested_x_cols`: See documentation for [ALE()]. Note, however, that `ALEPlots` does not store `ordered_x_cols`.
* `y_col`: See documentation for [ALE()]
For the 2D interaction plots, n_y_quant
is the number of quantiles into which to divide the predicted variable (y). The middle quantiles are grouped specially:
The middle quantile is the first confidence interval of median_band_pct
(median_band_pct[1]
) around the median.
This middle quantile is special because it generally represents no meaningful interaction.
The quantiles above and below the middle are extended from the borders of the middle quantile to the regular borders of the other quantiles.
There will always be an odd number of quantiles: the special middle quantile plus an equal number of quantiles on each side of it. If n_y_quant
is even, then a middle quantile will be added to it. If n_y_quant
is odd, then the number specified will be used, including the middle quantile.
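The odd-count adjustment can be sketched as follows (this is an assumption about the internal rule, stated directly from the description above, not the packaged implementation):

```r
# If n_y_quant is even, a middle quantile is added to make the count odd
n_y_quant <- 10
if (n_y_quant %% 2 == 0) {
  n_y_quant <- n_y_quant + 1
}
n_y_quant  # 11: the special middle quantile plus five quantiles on each side
```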
# See examples with ALE and ModelBoot objects
This function is not exported. It is a complete reimplementation of the ALE algorithm relative to the reference in ALEPlot::ALEPlot()
. In addition to adding bootstrapping and handling of categorical y variables, it reimplements categorical x interactions.
calc_ale(
  data,
  model,
  x_cols,
  y_col,
  y_cats,
  pred_fun,
  pred_type,
  max_num_bins,
  boot_it,
  seed,
  boot_alpha,
  boot_centre,
  boot_ale_y = FALSE,
  .bins = NULL,
  ale_y_norm_funs = NULL,
  p_dist = NULL
)
data |
See documentation for |
model |
See documentation for |
x_cols |
character(1 or 2). Names of columns in X for which ALE data is to be calculated. Length 1 for 1D ALE and length 2 for 2D ALE. |
y_col |
character(1). Name of the target y column. |
y_cats |
character. The categories of y. For most cases with non-categorical y, |
pred_fun |
See documentation for |
pred_type |
See documentation for |
max_num_bins |
See documentation for |
boot_it |
See documentation for |
seed |
See documentation for |
boot_alpha |
See documentation for |
boot_centre |
See documentation for |
boot_ale_y |
logical(1). If |
.bins |
See documentation for |
ale_y_norm_funs |
list of functions. Custom functions for normalizing ALE y for statistics. It is usually a list(1), but for categorical y, there is a distinct function for each y category. If provided, ale_y_norm_funs saves some time since it is usually the same for all variables throughout one call to |
p_dist |
See documentation for |
For details about arguments not documented here, see ALE()
.
Apley, Daniel W., and Jingyu Zhu. "Visualizing the effects of predictor variables in black box supervised learning models." Journal of the Royal Statistical Society Series B: Statistical Methodology 82.4 (2020): 1059-1086.
Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. doi:10.48550/arXiv.2310.09877.
Currently assumes that the result object will have only one class.
cast(x, new_cls)
x |
An R object |
new_cls |
character(1). A single class to which to convert |
x
converted to class new_cls
.
Census data that indicates, among other details, if the respondent's income exceeds $50,000 per year. Also known as "Adult" dataset.
census
census
A tibble with 32,561 rows and 15 columns:
TRUE if income > $50,000
continuous
Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked
continuous. "A proxy for the demographic background of the people: 'People with similar demographic characteristics should have similar weights'" For more details, see https://www.openml.org/search?type=data&id=1590.
Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool
continuous
Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse
Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces
Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried
White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black
Female, Male
continuous
continuous
continuous
United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinidad&Tobago, Peru, Hong, Holland-Netherlands
This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.
Becker,Barry and Kohavi,Ronny. (1996). Adult. UCI Machine Learning Repository. https://doi.org/10.24432/C5XW20.
Adaptation of base::colSums()
that, when all values in a column are NA, sets the sum to NA rather than zero as base::colSums()
does. Calls base::colSums()
internally.
col_sums(mx, na.rm = FALSE, dims = 1)
mx |
numeric matrix |
na.rm |
logical(1). TRUE if missing values ( |
dims |
See documentation for |
numeric vector whose length is the number of columns of mx
, whose values are the sums of each column of mx
.
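The behaviour can be illustrated with a minimal base-R sketch (`col_sums_sketch()` here is a hypothetical illustration of the idea, not the packaged internal function):

```r
# Like base::colSums(), but all-NA columns sum to NA instead of 0
col_sums_sketch <- function(mx, na.rm = FALSE, dims = 1) {
  sums <- colSums(mx, na.rm = na.rm, dims = dims)
  # Identify columns with no non-missing data at all
  all_na <- colSums(!is.na(mx), dims = dims) == 0
  sums[all_na] <- NA
  sums
}

mx <- matrix(c(1, 2, NA, NA, 3, NA), nrow = 2)  # columns: (1,2), (NA,NA), (3,NA)
col_sums_sketch(mx, na.rm = TRUE)  # 3, NA, 3 -- base colSums() would give 3, 0, 3
```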
Extracts all diagonals from a matrix in the NWSE direction (upper left down to lower right).
extract_2D_diags(mx)
mx |
matrix |
A list whose elements each represent one diagonal of mx
. Each diagonal element is a list of two elements: coords
is a numeric vector pair of row-column coordinates; values
is the value of the diagonal at the coordinate given by coords
.
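The idea can be sketched in base R (`extract_2D_diags_sketch()` is a hypothetical illustration; the packaged internal function may differ in detail):

```r
# Extract all NWSE diagonals (upper left down to lower right) of a matrix
extract_2D_diags_sketch <- function(mx) {
  nr <- nrow(mx)
  nc <- ncol(mx)
  # Every NWSE diagonal starts on the top row or in the first column
  starts <- rbind(
    cbind(row = 1L, col = seq_len(nc)),
    if (nr > 1) cbind(row = 2:nr, col = 1L)
  )
  lapply(seq_len(nrow(starts)), function(i) {
    s <- starts[i, ]
    len <- min(nr - s[1], nc - s[2]) + 1
    coords <- cbind(s[1] + seq_len(len) - 1L, s[2] + seq_len(len) - 1L)
    list(coords = coords, values = mx[coords])
  })
}

mx <- matrix(1:6, nrow = 2)  # 2 x 3 matrix
diags <- extract_2D_diags_sketch(mx)
# Four diagonals; the main diagonal starts at (1,1) with values 1, 4
```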
Extracts all diagonals from a 3D array in the FNWBSE direction (front upper left down to back lower right).
extract_3D_diags(ray)
ray |
a 3-dimensional array |
A list whose elements each represent one diagonal of ray
. Each diagonal element is a list of two elements: origin
is the 3D coordinates (row, column, depth) of the first element of the diagonal; values
is a vector of the diagonal that starts from origin
.
Retrieve specific data elements from an object based on their X column names.
If obj
is not an object from the ale
package, then this generic passes on all arguments to the base::get()
function.
get(obj, ...)
obj |
object. |
... |
For ale package objects, instructions for which predictor (x) columns should be retrieved. For everything else, arguments to pass to |
Retrieve specific elements from an ALE
object.
obj |
ALE object from which to retrieve elements. |
x_cols , exclude_cols
|
character, list, or formula. Columns names and interaction terms from |
what |
character(1). What kind of output is requested. Must be one (and only one) of |
... |
not used. Inserted to require explicit naming of subsequent arguments. |
stats |
character(1). Retrieve statistics. If |
cats |
character. Optional category names to retrieve if the ALE is for a categorical y outcome model. |
simplify |
logical(1). If |
silent |
See documentation for |
Regardless of the requested data, all get.ALE()
results have a common structure:
If more than one category of the y outcome is returned, then the top level is a list named by each category. However, if the y outcome is not categorical or if only one category of multiple possibilities is specified using the cats
argument, then the top level never has categories, regardless of the value of simplify
.
The next level (or top level if there are zero or one category) is a list with one or two levels:
d1
: 1D ALE elements.
d2
: 2D ALE elements.
However, if elements of only one dimension (either 1D or 2D) are requested and simplify = TRUE
(default), the empty list is eliminated and the level is skipped to provide only the elements present. For example, if only 1D ALE data is requested, then there will be no d1
sublist but only a list of the ALE data as described for the next level. If simplify = FALSE
, both d1
and d2
sublists will always be returned; the empty sublist will be NULL
.
For the d1
sublist, there is no further hierarchy: the returned data is as described below. For the d2
sublist representing var1
by var2
interactions, the next level is a named list of all var1
elements. This level in turn consists of a named list of var2
elements with the actual data as described below. For example, for ALE data, result <- get(obj, list(d1 = "var1", d2 = list(c("var1", "var2"))))
would return result
as a list where the 1D ALE data is in result$d1$var1
and the 2D ALE data in result$d2$var1$var2
.
While all results follow the general structure just described, the specific type of data returned depends on the values of the what
and stats
arguments:
what = 'ale'
(default) and stats = NULL
(default)A list whose elements, named by each requested x variable, are each a tibble. The rows each represent one ALE bin. The tibble has the following columns:
* var.bin
or var.ceil
where var
is the name of a variable (column): For non-numeric x, var.bin
is the value of each of the ALE categories. For numeric x, var.ceil
is the value of the upper bound (ceiling) of each ALE bin. The first "bin" of numeric variables represents the minimum value. For 2D ALE with an var1
by var2
interaction, both var1.bin
and var2.bin
columns are returned (or var1.ceil
or var2.ceil
for numeric var1
or var2
).
* .n
: the number of rows of data in each bin represented by var.bin
or var.ceil
. For numeric x, the first bin contains all data elements that have exactly the minimum value of x. This is often 1, but might be more than 1 if more than one data element has exactly the minimum value.
* .y
: the ALE function value calculated for that bin. For bootstrapped ALE, this is the same as .y_mean
by default or .y_median
if boot_centre = 'median'
. Regardless, both .y_mean
and .y_median
are returned as columns here.
* .y_lo
, .y_hi
: the lower and upper confidence intervals, respectively, for the bootstrapped .y
value based on the boot_alpha
argument in the ALE()
constructor.
what = 'boot_data'
and stats = NULL
(default)A list whose elements, named by each requested x variable, are each a tibble. These are the data from which .y_mean
, .y_median
, .y_lo
, and .y_hi
are summarized when what = 'ale'
. The rows each represent one ALE bin for a specified bootstrap iteration. The tibble has the following columns:
* .it
: The bootstrap iteration. Iteration 0 represents the ALE calculations on the full dataset; the remaining values of .it
are from 1 to boot_it
(number of bootstrap iterations specified in the ALE()
constructor).
* var
where var
is the name of a variable (column): For non-numeric x, var
is the value of each of the ALE categories. For numeric x, var
is the value of the upper bound (ceiling) of each ALE bin. They are otherwise similar to their meanings described for what = 'ale'
above.
* .n
and .y
: Same as for what = 'ale'
.
what = 'ale'
(default) and stats = 'estimate'
A list with elements d1
and d2
with the value of each ALE statistic. Each row represents one variable or interaction. The tibble has the following columns:
* term
or term1
and term2
: The variable or column for the 1D (term
) or 2D (term1
by term2
) ALE statistic.
* aled
, aler_min
, aler_max
, naled
, naler_min
, naler_max
: the respective ALE statistic for the variable or interaction.
what = 'ale'
(default) and stats
is one value in c('aled', 'aler_min', 'aler_max', 'naled', 'naler_min', 'naler_max')
A list with elements d1
and d2
with the distribution value of the single requested ALE statistic. Each element d1
and d2
is a tibble. Each row represents one variable or interaction. The tibble has the following columns:
* term
or term1
and term2
: Same as for stats = 'estimate'
.
* estimate
, mean
, median
: The average of the bootstrapped value of the requested statistic. estimate
is equal to either mean
or median
depending on the boot_centre
argument in the ALE()
constructor. If ALE is not bootstrapped, then estimate
, mean
, and median
are equal.
* conf.low
, conf.high
: the lower and upper confidence intervals, respectively, for the bootstrapped statistic based on the boot_alpha
argument in the ALE()
constructor. If ALE is not bootstrapped, then estimate
, conf.low
, and conf.high
are equal.
what = 'ale'
(default) and stats = 'all'
A list with elements d1
and d2
with the distribution values of all available ALE statistics for the requested variables and interactions. Whereas the stats = 'aled'
(for example) format returns data for a single statistic, stats = 'all'
returns all statistics for the requested variables. Each element is a list with the requested d1
and d2
sub-elements as described in the general structure above. Each data element is a tibble. Each row represents one ALE statistic. The tibble has the following columns:
* term
or term1
and term2
: Same as for stats = 'estimate'
.
* estimate
, mean
, median
: Same as above for individual ALE statistics.
* conf.low
, conf.high
: Same as above for individual ALE statistics.
what = 'ale'
(default) and stats = 'conf_regions'
A list with elements d1
and d2
with the confidence regions for the requested variables and interactions. Each element is a list with the requested d1
and d2
sub-elements as described in the general structure above. Each data element is a tibble with confidence regions for a single variable or interaction. For an explanation of the columns, see vignette('ale-statistics')
.
what = 'ale'
(default) and stats = 'conf_sig'
A list with elements d1
and d2
filtered for the statistically significant confidence regions for the requested variables and interactions. Each element is a tibble. Each row is a single statistically significant confidence region. If there are no statistically significant confidence regions at all, an empty tibble is returned. For an explanation of the columns, see vignette('ale-statistics')
.
# See examples at ALE() for a demonstration of the get() method.
Retrieve specific plots from a ALEPlots
object.
See get.ALE()
for explanation of parameters not described here.
obj |
ALEPlots object from which to retrieve ALE elements. |
type |
character(1). What type of ALEPlots to retrieve: |
See get.ALE()
Retrieve specific ALE elements from a ModelBoot
object. This method is similar to get.ALE()
except that the user may specify what type
of ALE data to retrieve (see the argument definition for details).
See get.ALE()
for explanation of parameters not described here.
obj |
ModelBoot object from which to retrieve ALE elements. |
type |
character(1). The type of ModelBoot ALE elements to retrieve: |
See get.ALE()
Sorted categorical indices based on Kolmogorov-Smirnov distances for empirically ordering categorical categories.
idxs_kolmogorov_smirnov(X, x_col, n_bins, x_int_counts)
X |
X data |
x_col |
character |
n_bins |
integer |
x_int_counts |
bin sizes |
This intrapolation algorithm replaces internal missing values in a vector with the linear interpolation of the bounding non-missing values. If there are no bounding non-missing values, the unbounded missing values are retained as missing. In our terminology, 'intrapolation' is distinct from 'interpolation' because interpolation might include 'extrapolation', that is, projecting estimates of values beyond the bounds. This function, in contrast, only replaces bounded missing values.
intrapolate_1D(v)
v |
numeric vector. The vector whose internal missing values will be intrapolated. |
For example, the vector c(NA, NA, 1, NA, 5, NA, NA, 1, NA)
will be intrapolated to c(NA, NA, 1, 3, 5, 3.7, 2.3, 1, NA)
.
Note: because intrapolation requires at least three elements (left bound, missing value, right bound), an input vector of less than three will be returned unchanged.
numeric vector of the same length as the input v
with internal missing values linearly intrapolated.
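A minimal base-R sketch of the idea, reproducing the example above (`intrapolate_1D_sketch()` is a hypothetical illustration, not the packaged internal function):

```r
# Linearly interpolate only bounded internal NAs; leave unbounded NAs as NA
intrapolate_1D_sketch <- function(v) {
  if (length(v) < 3) return(v)  # need left bound, missing value, right bound
  known <- which(!is.na(v))
  if (length(known) < 2) return(v)  # nothing is bounded on both sides
  inner <- seq(min(known), max(known))  # positions bounded by known values
  v[inner] <- stats::approx(known, v[known], xout = inner)$y
  v
}

intrapolate_1D_sketch(c(NA, NA, 1, NA, 5, NA, NA, 1, NA))
# approximately c(NA, NA, 1, 3, 5, 3.7, 2.3, 1, NA), as in the example above
```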
This intrapolation algorithm replaces internal missing values in a matrix. It does so in the following steps:
Calculate separate intrapolations in four directions: rows, columns, NWSE diagonals (upper left down to lower right), and SWNE diagonals (lower left up to upper right). The intrapolation in each direction is based on the algorithm of intrapolate_1D()
. (See details there.)
The 2D intrapolation is the mean intrapolation from any of the four values. In taking the mean, missing intrapolations are removed.
When there is no intrapolation available from any of the four directions, the missing value remains missing.
intrapolate_2D(mx, consolidate = TRUE)
mx |
numeric matrix. The matrix whose internal missing values will be intrapolated. |
consolidate |
logical(1). See return value. |
If consolidate = TRUE
(default), returns a numeric matrix of the same dimensions as the input mx
with internal missing values linearly intrapolated. If consolidate = FALSE
, returns a list of intrapolations for missing values from each of the four directions (rows, columns, NWSE diagonal, and SWNE diagonal).
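As a hedged sketch of the averaging idea, the following base-R code combines only the row and column directions (the packaged function also uses the two diagonals; all names here are hypothetical illustrations):

```r
# 1D helper as in intrapolate_1D(): fill only bounded internal NAs
intrap_1d <- function(v) {
  known <- which(!is.na(v))
  if (length(known) < 2) return(v)
  inner <- seq(min(known), max(known))
  v[inner] <- stats::approx(known, v[known], xout = inner)$y
  v
}

# Average the row-wise and column-wise intrapolations where available
intrap_2d_sketch <- function(mx) {
  by_row <- t(apply(mx, 1, intrap_1d))
  by_col <- apply(mx, 2, intrap_1d)
  n_avail <- (!is.na(by_row)) + (!is.na(by_col))
  total <- ifelse(is.na(by_row), 0, by_row) + ifelse(is.na(by_col), 0, by_col)
  out <- total / n_avail  # NaN where no direction provides a value
  out[is.nan(out)] <- NA
  out
}

mx <- matrix(c(1, NA, 3,
               4,  5, 6), nrow = 2, byrow = TRUE)
intrap_2d_sketch(mx)  # the NA is replaced by 2 (from the row direction)
```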
This intrapolation algorithm replaces internal missing values in a three-dimensional array. For how it works, see the details of intrapolate_2D()
. Based on that, intrapolate_3D()
does the following:
Slice the 3D array into 2D matrices along the rows, columns, and depth dimensions.
Use intrapolate_2D()
to calculate 2D intrapolations based on the algorithm of intrapolate_1D()
. See details in their documentation.
In addition, calculate intrapolations along the four directions of 3D diagonals: front northwest to back southeast, that is, front upper left down to back lower right (FNWBSE), FSWBNE, FSEBNW, and FNEBSW.
The 3D intrapolation is the mean intrapolation from any of these 2D or 3D values. In taking the mean, missing intrapolations are removed.
When there is no intrapolation available from any of the directions, the missing value remains missing.
intrapolate_3D(ray, consolidate = TRUE)
ray |
numeric array of three dimensions. |
consolidate |
logical(1). See return value. |
If consolidate = TRUE
(default), returns a numeric array of the same dimensions as the input ray
with internal missing values linearly intrapolated. If consolidate = FALSE
, returns a list of intrapolations for missing values from each slice and diagonal direction.
A ModelBoot
object contains full-model bootstrapped statistics and ALE data for a trained model. Full-model bootstrapping (as distinct from data-only bootstrapping) retrains a model for each bootstrap iteration. Thus, it is very slow, though its results are more reliable. However, for obtaining bootstrapped ALE data, plots, and statistics, full-model bootstrapping as provided by ModelBoot
is only necessary for models that have not been developed by cross-validation. For cross-validated models, it is sufficient (and much faster) to create a regular ALE
object with bootstrapping by setting the boot_it
argument there. In fact, full-model bootstrapping with ModelBoot
is often infeasible for slow machine-learning models trained on large datasets, which should rather be cross-validated to assure their reliability. However, for models that have not been cross-validated, full-model bootstrapping with ModelBoot
is necessary for reliable results. Further details follow below; see also vignette('ale-statistics')
.
ModelBoot(
  model,
  data = NULL,
  ...,
  model_call_string = NULL,
  model_call_string_vars = character(),
  parallel = future::availableCores(logical = FALSE, omit = 1),
  model_packages = NULL,
  y_col = NULL,
  binary_true_value = TRUE,
  pred_fun = function(object, newdata, type = pred_type) {
    stats::predict(object = object, newdata = newdata, type = type)
  },
  pred_type = "response",
  boot_it = 100,
  boot_alpha = 0.05,
  boot_centre = "mean",
  seed = 0,
  output_model_stats = TRUE,
  output_model_coefs = TRUE,
  output_ale = TRUE,
  output_boot_data = FALSE,
  ale_options = list(),
  tidy_options = list(),
  glance_options = list(),
  silent = FALSE
)
model |
Required. See documentation for |
data |
dataframe. Dataset that will be bootstrapped. This must be the same data on which the |
... |
not used. Inserted to require explicit naming of subsequent arguments. |
model_call_string |
character(1). If NULL (default), the |
model_call_string_vars |
character. Names of variables included in |
parallel |
See documentation for |
model_packages |
See documentation for |
y_col , pred_fun , pred_type
|
See documentation for |
binary_true_value |
any single atomic value. If the model represented by |
boot_it |
non-negative integer(1). Number of bootstrap iterations for full-model bootstrapping. For bootstrapping of ALE values, see details to verify if |
boot_alpha |
numeric(1) from 0 to 1. Alpha for percentile-based confidence interval range for the bootstrap intervals; the bootstrap confidence intervals will be the lowest and highest |
boot_centre |
character(1) in c('mean', 'median'). When bootstrapping, the main estimate for the ALE y value is considered to be |
seed |
integer. Random seed. Supply this between runs to assure identical bootstrap samples are generated each time on the same data. |
output_model_stats |
logical(1). If |
output_model_coefs |
logical(1). If |
output_ale |
logical(1). If |
output_boot_data |
logical(1). If |
ale_options , tidy_options , glance_options
|
list of named arguments. Arguments to pass to the |
silent |
See documentation for |
An object of class ModelBoot with properties model_stats, model_coefs, ale, boot_data, and params.
tibble
of bootstrapped results from broom::glance()
. NULL
if model_stats
argument is FALSE
. In general, only broom::glance()
results that make sense when bootstrapped are included, such as df
and adj.r.squared
. Results that are incomparable across bootstrapped datasets (such as aic
) are excluded. In addition, certain model performance measures are included; these are bootstrap-validated with the .632 correction (NOT the .632+ correction):
For regression (numeric prediction) models:
mae
: mean absolute error (MAE)
sa_mae_mad
: standardized accuracy of the MAE referenced on the mean absolute deviation
rmse
: root mean squared error (RMSE)
sa_rmse_sd
: standardized accuracy of the RMSE referenced on the standard deviation
For classification (probability) models:
auc
: area under the ROC curve
A tibble
of bootstrapped results from broom::tidy()
.
NULL
if model_coefs
argument is FALSE
.
A list of bootstrapped ALE results using default ALE()
settings unless if overridden with ale_options
. NULL
if ale
argument is FALSE
. Elements are:
* `single`: an `ALE` object of ALE calculations on the full dataset without bootstrapping.
* `boot`: a list of bootstrapped ALE data and statistics. This element is not an `ALE` object; it is in a special internal format.
A tibble
of bootstrap results. Each row represents a bootstrap iteration. NULL
if boot_data
argument is FALSE
. The columns are:
* `it`: the specific bootstrap iteration from 0 to `boot_it` iterations. Iteration 0 is the results from the full dataset (not bootstrapped).
* `row_idxs`: the row indexes for the bootstrapped sample for that iteration. To save space, the row indexes are returned rather than the full datasets. So, for example, the bootstrap sample of iteration 1 can be reproduced by `data[ModelBoot_obj@boot_data$row_idxs[[2]], ]` (the second element, since iteration 0 occupies the first), where `data` is the dataset and `ModelBoot_obj` is the result of `ModelBoot()`.
* `model`: the model object trained on that iteration.
* `tidy`: the results of `broom::tidy(model)` on that iteration.
* `stats`: the results of `broom::glance(model)` on that iteration.
* `perf`: performance measures on the entire dataset. These are the measures specified above for regression and classification models.
Parameters used to calculate bootstrapped data. Most of these repeat the arguments passed to ModelBoot()
. These are either the values provided by the user or the defaults if the user did not change them, but the following additional objects created internally are also provided:
* `y_cats`: same as `ALE@params$y_cats` (see documentation there).
* `y_type`: same as `ALE@params$y_type` (see documentation there).
* `model`: same as `ALE@params$model` (see documentation there).
* `data`: same as `ALE@params$data` (see documentation there).
No modelling results, with or without ALE, should be considered reliable without appropriate validation. For ALE, both the trained model itself and the ALE that explains the trained model must be validated. ALE must be validated by bootstrapping. The trained model might be validated either by cross-validation or by bootstrapping. For ALE that explains trained models that have been developed by cross-validation, it is sufficient to bootstrap just the training data. That is what the ALE
object does with its boot_it
argument. However, unvalidated models must be validated by bootstrapping them along with the calculation of ALE; this is what the ModelBoot
object does with its boot_it
argument.
ModelBoot()
carries out full-model bootstrapping to validate models. Specifically, it:
Creates multiple bootstrap samples (default 100; the user can specify any number);
Creates a model on each bootstrap sample;
Calculates overall model statistics, variable coefficients, and ALE values for each model on each bootstrap sample;
Calculates the mean, median, and lower and upper confidence intervals for each of those values across all bootstrap samples.
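The loop can be sketched in base R with a simple lm() on the built-in mtcars data (a hypothetical illustration of the bootstrap logic only; ModelBoot() additionally computes ALE data, model statistics, and .632-corrected performance measures):

```r
# Full-model bootstrap of a simple linear model
set.seed(0)
boot_it <- 20  # ModelBoot() default is 100

# 1. Create bootstrap samples; 2. retrain the model on each;
# 3. collect a statistic (here, the wt coefficient) from each retrained model
coefs <- replicate(boot_it, {
  boot_data <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  coef(lm(mpg ~ wt, data = boot_data))[["wt"]]
})

# 4. Summarize with mean, median, and percentile confidence limits
summary_stats <- c(
  mean = mean(coefs),
  median = median(coefs),
  lo = unname(quantile(coefs, 0.025)),
  hi = unname(quantile(coefs, 0.975))
)
summary_stats
```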
The broom::tidy()
summary statistics will provide p-values. However, the procedure for obtaining p-values for ALE statistics is very slow: it involves retraining the model 1000 times. Thus, it is not efficient to calculate p-values whenever a ModelBoot
object is created. Although the ALE()
function provides an 'auto' option for creating p-values, that option is disabled when creating a ModelBoot
because it would be far too slow: it would involve retraining the model 1000 times multiplied by the number of bootstrap iterations. Rather, you must first create a p-values distribution object using the procedure described in help(ALEpDist)
. If the name of your p-values object is p_dist
, you can then request p-values each time you create a ModelBoot
by passing it the argument ale_options = list(p_values = p_dist)
.
Okoli, Chitu. 2023. “Statistical Inference Using Machine Learning and Classical Techniques Based on Accumulated Local Effects (ALE).” arXiv. https://arxiv.org/abs/2310.09877.
# attitude dataset
attitude

## ALE for general additive models (GAM)
## GAM is tweaked to work on the small dataset.
gam_attitude <- mgcv::gam(
  rating ~ complaints + privileges + s(learning) +
    raises + s(critical) + advance,
  data = attitude
)
summary(gam_attitude)

# Full model bootstrapping
# Only 4 bootstrap iterations for a rapid example; default is 100
# Increase value of boot_it for more realistic results
mb_gam <- ModelBoot(
  gam_attitude,
  boot_it = 4
)

# If the model is not standard, supply model_call_string with 'data = boot_data'
# in the string instead of the actual dataset name (in addition to the actual
# dataset as the 'data' argument directly to the `ModelBoot` constructor).
mb_gam <- ModelBoot(
  gam_attitude,
  data = attitude,  # the actual dataset
  model_call_string = 'mgcv::gam(
    rating ~ complaints + privileges + s(learning) +
      raises + s(critical) + advance,
    data = boot_data  # required for model_call_string
  )',
  boot_it = 4
)

# Model statistics and coefficients
mb_gam@model_stats
mb_gam@model_coefs

# Plot ALE
plot(mb_gam)
Improvements:

* Validation: ensure that the object is atomic (not just a vector).
* For factors, get all classes and check if any class_x is a factor or ordered.
* Add arguments to return a unique mode with options to sort: occurrence order, lexicographical.

Reduce a dataframe to a sample (retains the structure of its columns).
params_data( data, y_vals, data_name = var_name(data), sample_size = 500, seed = 0 )
data |
input dataframe |
y_vals |
y values, y predictions, or a sample thereof |
data_name |
name of the data argument |
sample_size |
size of data to sample |
seed |
random seed |
a list
Plot ALE objects

This plot method simply calls the constructor for an ALEPlots
object.
x |
ALE object. |
... |
Arguments passed to |
Plot an ALEPlots
object.
x |
An object of class |
max_print |
integer(1). The maximum number of plots that may be printed at a time. 1D and 2D plots are printed separately, so this maximum applies separately to each dimension of ALE plots, not to all dimensions combined. |
... |
Arguments to pass to |
Invisibly returns x
.
Plot ModelBoot objects

This plot method simply calls the constructor for an ALEPlots
object.
x |
ModelBoot object. |
... |
Arguments passed to |
Computes data needed to calculate a variable's ALE values.
prep_var_for_ale(x_col, x_type, x_vals, bins, n, max_num_bins, X = NULL)
prep_var_for_ale(x_col, x_type, x_vals, bins, n, max_num_bins, X = NULL)
x_col |
character(1). Name of single column in X for which ALE data is to be calculated. |
x_type |
character(1). var_type() of x_col. |
x_vals |
vector. The values of x_col. |
bins , n
|
See documentation for |
max_num_bins |
See documentation for |
X |
See documentation for |
Print an ALE object.
x |
An object of class |
... |
Additional arguments (currently not used). |
Invisibly returns x
.
lm_cars <- stats::lm(mpg ~ ., mtcars)
ale_cars <- ALE(lm_cars, parallel = 0)
print(ale_cars)
Print an ALEPlots object by calling plot().
x |
An object of class |
max_print |
See documentation for |
... |
Additional arguments (currently not used). |
Invisibly returns x
.
Print a ModelBoot object.
x |
An object of class |
... |
Additional arguments (currently not used). |
Invisibly returns x
.
lm_cars <- stats::lm(mpg ~ ., mtcars)
mb <- ModelBoot(lm_cars, boot_it = 2, parallel = 0)
print(mb)
Resolve x_cols
and exclude_cols
to their standardized format of x_cols
to specify which 1D and 2D ALE elements are required. This specification is used throughout the ALE package. x_cols
specifies the desired columns or interactions whereas exclude_cols
optionally specifies any columns or interactions to remove from x_cols
. The result is x_cols
– exclude_cols
, giving considerable flexibility in specifying the precise columns desired.
resolve_x_cols(x_cols, col_names, y_col, exclude_cols = NULL, silent = FALSE)
x_cols |
character, list, or formula. Columns and interactions requested in one of the special |
col_names |
character. All the column names from a dataset. |
y_col |
character(1). The y outcome column. If found in any |
exclude_cols |
Same possible formats as |
silent |
logical(1). If |
x_cols
in canonical format, which is always a list with two elements:
d1
: a character vector with each requested column for 1D ALE. If empty (no 1D ALE at all), its value is list()
.
d2
: a list of length-2 character vectors with each requested 2D ALE interaction pair. If empty (no 2D ALE at all), its value is list()
.
See examples for details.
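For illustration, the canonical format can be constructed directly in base R. This is a hypothetical hand-built value showing the documented structure, not a call to the package itself:

```r
# Canonical x_cols format: a list with 'd1' (character vector of 1D columns)
# and 'd2' (list of length-2 character vectors for 2D interaction pairs).
x_cols_canonical <- list(
  d1 = c("a", "b"),                    # 1D ALE for columns a and b
  d2 = list(c("a", "b"), c("a", "c"))  # 2D ALE for a:b and a:c
)

# When a dimension is empty, its value is list():
x_cols_1d_only <- list(
  d1 = c("a", "b"),
  d2 = list()  # no 2D ALE requested
)
```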
The x_cols
argument determines which predictor variables and interactions
are included in the analysis. It supports multiple input formats:
Character vector: A simple vector of column names (e.g., c("a", "b")
)
selects individual columns for 1D ALE analysis.
Formula (~
): Allows specifying variables and interactions in
formula notation (e.g., ~ a + b + a:b
), which is automatically converted
into a structured format. The outcome term is optional and is ignored if present.
So, whatever ~ a + b + a:b
produces results identical to ~ a + b + a:b
.
List format: Users can explicitly separate 1D and 2D ALE interactions:
A character vector named d1
like list(d1 = c("a", "b"))
specifies 1D ALE variables.
A list of 2-character vectors named d2
like list(d2 = list(c("a", "b"), c("a", "c")))
specifies 2D interactions.
list(d1 = c("a", "b"), d2 = list(c("a", "b"), c("a", "c")))
combines both.
Boolean selection for an entire dimension:
list(d1 = TRUE)
selects all available variables for 1D ALE, excluding y_col
.
list(d2 = TRUE)
selects all possible 2D interactions among all columns in col_names
, excluding y_col
.
Selective 2D expansion: list(d2_all = "a")
includes all 2D interactions
where "a"
is one of the interacting variables.
NULL (or unspecified): If x_cols = NULL
, no variables are selected.
Regardless of format, y_col
is always removed automatically. The function
ensures all variables are valid and in col_names
, providing informative
messages unless silent = TRUE
. And regardless of the specification format,
the result will always be standardized in the format specified in the return value.
Run examples for details.
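The formula decomposition described above can be sketched with base R's `terms()`. This is an illustrative sketch of the idea, not the package's actual implementation (for example, it returns `NULL` rather than `list()` for an empty `d1`):

```r
# Split a formula's terms into 1D columns and 2D interaction pairs,
# mirroring the canonical list(d1 = ..., d2 = ...) format.
formula_to_x_cols <- function(fmla) {
  # terms() extracts e.g. c("a", "b", "a:b"); any outcome term is dropped.
  labs <- attr(stats::terms(fmla), "term.labels")
  parts <- strsplit(labs, ":", fixed = TRUE)
  list(
    d1 = unlist(parts[lengths(parts) == 1]),
    d2 = parts[lengths(parts) == 2]
  )
}

formula_to_x_cols(~ a + b + a:b)
# d1: "a" "b"; d2: list(c("a", "b"))
```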
## Sample data
set.seed(0)
df <- data.frame(
  y = runif(10),
  a = sample(letters[1:3], 10, replace = TRUE),
  b = rnorm(10),
  c = sample(1:5, 10, replace = TRUE)
)
col_names <- names(df)
y_col <- "y"  # Assume 'y' is the outcome variable

## Examples with just x_cols to show different formats for specifying x_cols
## (same format for exclude_cols)

# Character vector: Simple ALE with no interactions
resolve_x_cols(c("a", "b"), col_names, y_col)

# Character string: Select just one 1D element
resolve_x_cols("c", col_names, y_col)

# List of 1- and 2-length character vectors:
# specify precise 1D and 2D elements desired
resolve_x_cols(list(c("a", "b"), "c", c("c", "a"), "b"), col_names, y_col)

# Formula: Converts to a list of individual elements
resolve_x_cols(~ a + b, col_names, y_col)

# Formula with interactions (1D and 2D).
# This format is probably more convenient if you know precisely which terms you want.
# Note that the outcome on the left-hand side is always silently ignored.
resolve_x_cols(whatever ~ a + b + a:b + c:b, col_names, y_col)

# List specifying d1 (1D ALE)
resolve_x_cols(list(d1 = c("a", "b")), col_names, y_col)

# List specifying d2 (2D ALE)
resolve_x_cols(list(d2 = list(c("a", "b"))), col_names, y_col)

# List specifying both d1 and d2
resolve_x_cols(list(d1 = c("a", "b"), d2 = list(c("a", "b"))), col_names, y_col)

# d1 as TRUE (select all columns except y_col)
resolve_x_cols(list(d1 = TRUE), col_names, y_col)

# d2 as TRUE (select all possible 2D interactions)
resolve_x_cols(list(d2 = TRUE), col_names, y_col)

# d2_all: Request all 2D interactions involving a specific variable
resolve_x_cols(list(d2_all = "a"), col_names, y_col)

# NULL: No variables selected
resolve_x_cols(NULL, col_names, y_col)

## Examples of how exclude_cols are removed from x_cols
## to obtain various desired results

# Exclude one column from a simple character vector
resolve_x_cols(
  x_cols = c("a", "b", "c"),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = "b"
)

# Exclude multiple columns
resolve_x_cols(
  x_cols = c("a", "b", "c"),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = c("a", "c")
)

# Exclude an interaction term from a formula input
resolve_x_cols(
  x_cols = ~ a + b + a:b,
  col_names = col_names,
  y_col = y_col,
  exclude_cols = ~ a:b
)

# Exclude all columns from x_cols
resolve_x_cols(
  x_cols = c("a", "b", "c"),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = c("a", "b", "c")
)

# Exclude non-existent columns (should be ignored)
resolve_x_cols(
  x_cols = c("a", "b"),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = "z"
)

# Exclude one column from a list-based input
resolve_x_cols(
  x_cols = list(d1 = c("a", "b"), d2 = list(c("a", "b"), c("a", "c"))),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = list(d1 = "a")
)

# Exclude interactions only
resolve_x_cols(
  x_cols = list(d1 = c("a", "b", "c"), d2 = list(c("a", "b"), c("a", "c"))),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = list(d2 = list(c("a", "b")))
)

# Exclude everything, including interactions
resolve_x_cols(
  x_cols = list(d1 = c("a", "b", "c"), d2 = list(c("a", "b"), c("a", "c"))),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = list(d1 = c("a", "b", "c"), d2 = list(c("a", "b"), c("a", "c")))
)

# Exclude a column implicitly removed by y_col
resolve_x_cols(
  x_cols = c("y", "a", "b"),
  col_names = col_names,
  y_col = "y",
  exclude_cols = "y"
)

# Exclude entire 2D dimension from x_cols with d2 = TRUE
resolve_x_cols(
  x_cols = list(d1 = TRUE, d2 = list(c("a", "b"), c("a", "c"))),
  col_names = col_names,
  y_col = y_col,
  exclude_cols = list(d1 = c("a"), d2 = TRUE)
)
Present concise summary information about an ALEPlots
object.
object |
An object of class |
... |
Not used |
Summary string.
This is a transformation of the mtcars
dataset from R to produce a small dataset with each of the fundamental datatypes: logical, factor, ordered, integer, double, and character. Most of the transformations are obvious, but a few are noteworthy:
The row names (the car model) are saved as a character vector.
For the unordered factors, the country and continent of the car manufacturer are obtained based on the row names (model).
For the ordered factor, gears 3, 4, and 5 are encoded as 'three', 'four', and 'five', respectively. The text labels make it explicit that the variable is ordinal, yet the number names make the order crystal clear.
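The gear encoding described above can be reproduced in base R. This is a sketch of the transformation, not the package's exact code:

```r
# Encode mtcars' gear values (3, 4, 5) as an ordered factor
# with number-name labels, so three < four < five.
gear <- factor(
  mtcars$gear,
  levels = c(3, 4, 5),
  labels = c("three", "four", "five"),
  ordered = TRUE
)

levels(gear)
# "three" "four" "five"
```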
Here is the adaptation of the original description of the mtcars
dataset:
The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).
var_cars
A tibble with 32 observations on 14 variables.
model (character): Car model
mpg (double): Miles/(US) gallon
cyl (integer): Number of cylinders
disp (double): Displacement (cu.in.)
hp (double): Gross horsepower
drat (double): Rear axle ratio
wt (double): Weight (1000 lbs)
qsec (double): 1/4 mile time
vs (logical): Engine (0 = V-shaped, 1 = straight)
am (logical): Transmission (0 = automatic, 1 = manual)
gear (ordered): Number of forward gears
carb (integer): Number of carburetors
country (factor): Country of car manufacturer
continent (factor): Continent of car manufacturer
Henderson and Velleman (1981) comment in a footnote to Table 1: 'Hocking (original transcriber)'s noncrucial coding of the Mazda's rotary engine as a straight six-cylinder engine and the Porsche's flat engine as a V engine, as well as the inclusion of the diesel Mercedes 240D, have been retained to enable direct comparisons to be made with previous analyses.'
Henderson and Velleman (1981), Building multiple regression models interactively. Biometrics, 37, 391–411.
Determine the datatype of a vector
var_type(var)
var |
vector whose datatype is to be determined. (Note: this function is not exported; see the return value for details of what it does.) |
Returns generic datatypes of R basic vectors according to the following mapping:
logical
returns 'binary'
numeric
values (e.g., integer
and double
) return 'numeric'
However, if a numeric vector's only values are 0 and 1, then it returns 'binary'
unordered factor
returns 'categorical'
ordered
factor
returns 'ordinal'
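The mapping can be sketched as a base R function. This is an illustration of the documented behavior, not the package's internal code (other vector types, such as character, are not covered by the mapping above, so the sketch returns NA for them):

```r
# Map a vector to the generic datatypes documented above.
var_type_sketch <- function(var) {
  if (is.logical(var)) return("binary")
  # Check is.ordered() before is.factor(): ordered factors are also factors.
  if (is.ordered(var)) return("ordinal")
  if (is.factor(var))  return("categorical")
  if (is.numeric(var)) {
    # Numeric vectors containing only 0 and 1 are treated as binary.
    if (all(var %in% c(0, 1))) return("binary")
    return("numeric")
  }
  NA_character_  # type not covered by the documented mapping
}

var_type_sketch(c(0, 1, 1))              # "binary"
var_type_sketch(c(1.5, 2))               # "numeric"
var_type_sketch(factor("a"))             # "categorical"
var_type_sketch(ordered(c("lo", "hi")))  # "ordinal"
```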