Title: Standardized Accuracy and Other Model Performance Metrics
Description: Standardized accuracy (staccuracy) is a framework for expressing accuracy scores such that 50% represents a reference level of performance and 100% is a perfect prediction. The 'staccuracy' package provides tools for creating staccuracy functions as well as some recommended staccuracy measures. It also provides functions for some classic performance metrics such as mean absolute error (MAE), root mean squared error (RMSE), and area under the receiver operating characteristic curve (AUCROC), as well as their winsorized versions when applicable.
Authors: Chitu Okoli [aut, cre]
Maintainer: Chitu Okoli <[email protected]>
License: MIT + file LICENSE
Version: 0.2.2
Built: 2025-02-23 16:22:02 UTC
Source: https://github.com/tripartio/staccuracy
Returns the area under the ROC curve based on comparing the predicted scores to the actual binary values. Tied predictions are handled by calculating the optimistic AUC (positive cases sorted first, resulting in a higher AUC) and the pessimistic AUC (positive cases sorted last, resulting in a lower AUC) and then returning the average of the two. For the ROC, a "tie" means at least one pair of pred values that are identical yet whose corresponding values of actual differ. (If the values of actual are the same for identical predictions, these are unproblematic and are not considered "ties".)
aucroc(
  actual,
  pred,
  na.rm = FALSE,
  positive = NULL,
  sample_size = 10000,
  seed = 0
)
actual: any atomic vector. Actual label values from a dataset. They must be binary; that is, there must be exactly two distinct values (other than missing values, which are allowed). The "true" or "positive" class is determined by coercing actual to logical; use the positive argument to specify the positive value explicitly.
pred: numeric vector. Predictions corresponding to each respective element in actual.
na.rm: logical(1). Whether missing values should be removed (TRUE) or retained (FALSE, the default).
positive: any single atomic value. The value of actual that should be considered the "true" or "positive" class; any other value is treated as negative.
sample_size: single positive integer. To keep the computation relatively rapid, when actual and pred are longer than sample_size, the AUC is calculated on a random sample of sample_size elements. The default is 10,000.
seed: numeric(1). Random seed used only if the data is sampled (see sample_size). The default is 0.
List with the following elements:
roc_opt: tibble with optimistic ROC data. "Optimistic" means that when predictions are tied, the TRUE/positive actual values are ordered before the FALSE/negative ones.
roc_pess: tibble with pessimistic ROC data. "Pessimistic" means that when predictions are tied, the FALSE/negative actual values are ordered before the TRUE/positive ones. Note that this difference is not merely in the sort order: when there are ties, the way that true positives, true negatives, etc. are counted is different for the optimistic and pessimistic approaches. If there are no tied predictions, then roc_opt and roc_pess are identical.
auc_opt: area under the ROC curve for the optimistic ROC.
auc_pess: area under the ROC curve for the pessimistic ROC.
auc: mean of auc_opt and auc_pess. If there are no tied predictions, then auc_opt, auc_pess, and auc are identical.
ties: TRUE if there are two or more tied predictions; FALSE if there are no ties.
set.seed(0)

# Generate some simulated "actual" data
a <- sample(c(TRUE, FALSE), 50, replace = TRUE)

# Generate some simulated predictions
p <- runif(50) |> round(2)
p[c(7, 8, 22, 35, 40, 41)] <- 0.5

# Calculate AUCROC with its components
ar <- aucroc(a, p)
ar$auc
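Since this example deliberately creates tied predictions, the other elements of the returned list are also worth inspecting (this follow-up assumes the ar object from the example above):

ar$ties      # whether any predictions are tied across differing actual values
ar$auc_opt   # AUC with ties resolved in favor of the positive class
ar$auc_pess  # AUC with ties resolved against the positive class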
These are standard error and deviation measures for numeric data. "Deviation" means the natural variation of the values of a numeric vector around its central tendency (usually the mean or median). "Error" means the average discrepancy between the actual values of a numeric vector and its predicted values.
mae(actual, pred, na.rm = FALSE)
rmse(actual, pred, na.rm = FALSE)
mad(x, na.rm = FALSE, version = "mean", ...)
actual: numeric vector. Actual (true) values of target outcome data.
pred: numeric vector. Predictions corresponding to each respective element in actual.
na.rm: logical(1). Whether missing values should be removed (TRUE) or retained (FALSE, the default).
x: numeric vector. Values for which to calculate the MAD.
version: character(1). By default ("mean"), mad() returns the mean absolute deviation of the values relative to their mean; otherwise, it returns the median absolute deviation relative to their median as computed by stats::mad(). See details.
...: Arguments to pass to stats::mad() when that version is requested.
Mean absolute deviation (MAD)
mad() returns the mean absolute deviation (MAD) of values relative to their mean. This is useful as a default benchmark for the mean absolute error (MAE), just as the standard deviation (SD) is a default benchmark for the root mean squared error (RMSE).

NOTE: This function name overrides stats::mad() (the median absolute deviation relative to the median). To maintain the functionality of stats::mad(), specify the version argument.
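A brief illustration of the two versions (this sketch assumes that version = "median" is the value that delegates to stats::mad(); the exact accepted values are documented in the version argument):

x_demo <- c(3, 5, 2, 7, 9)
mad(x_demo)                     # mean absolute deviation around the mean (package default)
mad(x_demo, version = "median") # assumed to reproduce the stats::mad() behavior
stats::mad(x_demo)              # base R version, for comparison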
In all cases, if any value in actual or pred is NA and na.rm = FALSE, then the function returns NA.
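A quick illustration of this NA behavior (the na.rm = TRUE result assumes that incomplete actual/pred pairs are dropped before computing the error):

mae(c(1, NA, 3), c(1, 2, 3))                # NA, because na.rm defaults to FALSE
mae(c(1, NA, 3), c(1, 2, 3), na.rm = TRUE)  # computed on the remaining complete pairs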
mae() returns the mean absolute error (MAE) of predicted values pred compared to the actual values.

rmse() returns the root mean squared error (RMSE) of predicted values pred compared to the actual values.

mad() returns either the mean absolute deviation (MAD) of values relative to their mean (default) or the median absolute deviation relative to their median. See details.
a <- c(3, 5, 2, 7, 9, 4, 6, 8, 1, 10)
p <- c(2.5, 5.5, 2, 6.5, 9.5, 3.5, 6, 7.5, 1.5, 9.5)

mae(a, p)
rmse(a, p)
mad(a)
Area under the ROC curve (AUCROC) is a classification measure. By dichotomizing the range of actual values, reg_aucroc() turns regression evaluation into classification evaluation for any regression model. Note that the model that generates the predictions is assumed to be a regression model; however, any numeric inputs are allowed for the pred argument, so there is no check for the nature of the source model.
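The core idea can be sketched with a single cut point (a hypothetical example; reg_aucroc() itself sweeps many quantile-based cuts, as described below):

# Dichotomize a numeric target at one threshold and score the (simulated) regression
# predictions as if they were classification scores
y     <- na.omit(airquality)$Temp
y_hat <- y + rnorm(length(y), sd = 3)  # hypothetical noisy predictions
y_bin <- y >= median(y)                # binary version of the actual values
aucroc(y_bin, y_hat)$auc               # AUC for this single cut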
reg_aucroc(
  actual,
  pred,
  num_quants = 100,
  ...,
  cuts = NULL,
  imbalance = 0.05,
  na.rm = FALSE,
  sample_size = 10000,
  seed = 0
)
actual: numeric vector. Actual label values from a dataset. They must be numeric.
pred: numeric vector. Predictions corresponding to each respective element in actual.
num_quants: scalar positive integer. If cuts is not provided, the number of quantiles into which actual will be dichotomized. The default is 100.
...: Not used. Forces explicit naming of the arguments that follow.
cuts: numeric vector. If provided, the specific values at which actual will be dichotomized, instead of the num_quants quantiles.
imbalance: numeric(1) in (0, 0.5]. The result element mean_auc averages AUC separately for the most imbalanced dichotomizations (those with at most this share of one class) and the rest; see the return value. The default is 0.05.
na.rm: See documentation for aucroc().
sample_size: See documentation for aucroc().
seed: See documentation for aucroc().
The ROC data and AUCROC values are calculated with aucroc().
List with the following elements:
rocs: list of aucroc() results for each dichotomized segment of actual.
auc: named numeric vector of AUC values extracted from each element of rocs, named by the percentile that each AUC represents.
mean_auc: named numeric(3). The average AUC over the low, middle, and high quantiles of dichotomization:
  lo: average AUC with imbalance% (e.g., 5%) or less of the actual target values;
  mid: average AUC in between lo and hi;
  hi: average AUC with (1 - imbalance)% (e.g., 95%) or more of the actual target values.
# Remove rows with missing values from airquality dataset
airq <- airquality |> na.omit()

# Create binary version where the target variable 'Ozone' is dichotomized based on its median
airq_bin <- airq
airq_bin$Ozone <- airq_bin$Ozone >= median(airq_bin$Ozone)

# Create a generic regression model; use autogam
req_aq <- autogam::autogam(airq, 'Ozone', family = gaussian())
req_aq$perf$sa_wmae_mad  # Standardized accuracy for regression

# Create a generic classification model; use autogam
class_aq <- autogam::autogam(airq_bin, 'Ozone', family = binomial())
class_aq$perf$auc  # AUC (standardized accuracy for classification)

# Compute AUC for regression predictions
reg_auc_aq <- reg_aucroc(
  airq$Ozone,
  predict(req_aq)
)

# Average AUC over the lo, mid, and hi quantiles of dichotomization:
reg_auc_aq$mean_auc
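For a sketch without the autogam dependency, an ordinary lm() fit on the same data works just as well (a hypothetical model choice for illustration):

# Fit a plain linear regression and evaluate its predictions with reg_aucroc()
airq  <- na.omit(airquality)
lm_aq <- lm(Ozone ~ ., data = airq)
reg_auc_lm <- reg_aucroc(airq$Ozone, predict(lm_aq))
reg_auc_lm$mean_auc  # average AUC over the lo, mid, and hi quantile regions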
Because the distribution of staccuracies is uncertain (and indeed, different staccuracies likely have different distributions), bootstrapping is used to empirically estimate the distributions and calculate the p-values. See the return value description for details on what the function provides.
sa_diff(
  actual,
  preds,
  ...,
  na.rm = FALSE,
  sa = NULL,
  pct = c(0.01, 0.02, 0.03, 0.04, 0.05),
  boot_alpha = 0.05,
  boot_it = 1000,
  seed = 0
)
actual: numeric vector. The actual (true) labels.
preds: named list of at least two numeric vectors. Each element is a vector of the same length as actual with predictions for each row corresponding to each element of actual. The names of the list elements should be the names of the models that produced each respective prediction; these names will be used to distinguish the results.
...: not used. Forces explicit naming of subsequent arguments.
na.rm: See documentation for the staccuracy functions (e.g., staccuracy()).
sa: list of functions. Each element is the unquoted name of a valid staccuracy function (see staccuracy()). If NULL (the default), a default set of staccuracy measures is used.
pct: numeric with values from (0, 1). The percentage values at which the difference in staccuracies will be tested.
boot_alpha: numeric(1) from 0 to 1. Alpha for the percentile-based confidence interval range for the bootstrapped means; the bootstrap confidence intervals will be the lowest and highest (boot_alpha / 2) percentiles. For example, the default 0.05 yields the 2.5% and 97.5% percentiles, a 95% confidence interval.
boot_it: positive integer(1). The number of bootstrap iterations.
seed: integer(1). Random seed for the bootstrap sampling. Supply this between runs to assure identical results.
tibble with staccuracy difference results:
staccuracy: name of the staccuracy measure.
pred: each named element (model name) in the input preds. The row values give the staccuracy for that prediction. When pred is NA, the row represents the difference between prediction staccuracies (diff) instead of the staccuracies themselves.
diff: when diff takes the form 'model1-model2', the row values give the difference in staccuracies between two named elements (model names) in the input preds. When diff is NA, the row instead represents the staccuracy of a specific model prediction (pred).
lo, mean, hi: the lower bound, mean, and upper bound of the bootstrapped staccuracy. The lower and upper bounds are the confidence intervals specified by the input boot_alpha.
p__: p-values that the difference in staccuracies is at least the specified percentage amount or greater. E.g., for the default input pct = c(0.01, 0.02, 0.03, 0.04, 0.05), these columns would be p01, p02, p03, p04, and p05. As they apply only to differences between staccuracies, they are provided only for diff rows and are NA for pred rows. As an example of their meaning, if the mean difference for 'model1-model2' is 0.0832 with p01 of 0.012 and p02 of 0.035, then 1.2% of bootstrapped staccuracies had a model1 - model2 difference of less than 0.01 and 3.5% were less than 0.02. (That is, 98.8% of differences were greater than 0.01 and 96.5% were greater than 0.02.)
lm_attitude_all <- lm(rating ~ ., data = attitude)
lm_attitude__a <- lm(rating ~ . - advance, data = attitude)
lm_attitude__c <- lm(rating ~ . - complaints, data = attitude)

sdf <- sa_diff(
  attitude$rating,
  list(
    all = predict(lm_attitude_all),
    madv = predict(lm_attitude__a),
    mcmp = predict(lm_attitude__c)
  ),
  boot_it = 10
)
sdf
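A brief follow-up for inspecting only the pairwise comparisons; the column names are taken from the return value description above (p01 and p05 exist only with the default pct):

# Keep the rows that compare two models, along with selected p-value columns
sdf[!is.na(sdf$diff), c('staccuracy', 'diff', 'lo', 'mean', 'hi', 'p01', 'p05')]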
Standardized accuracy (staccuracy) represents error or accuracy measures on a scale where 1 or 100% means perfect prediction and 0.5 or 50% is a reference comparison of some specified standard performance. Higher than 0.5 is better than the reference and below 0.5 is worse. 0 might or might not have a special meaning; sometimes negative scores are possible, but these often indicate modelling errors.
The core function is staccuracy(), which receives as input a generic error function and a reference function against which to compare the error function's performance. In addition, the following recommended staccuracy functions are provided:

sa_mae_mad: standardized accuracy of the mean absolute error (MAE) based on the mean absolute deviation (MAD)
sa_rmse_sd: standardized accuracy of the root mean squared error (RMSE) based on the standard deviation (SD)
sa_wmae_mad: standardized accuracy of the winsorized mean absolute error (MAE) based on the mean absolute deviation (MAD)
sa_wrmse_sd: standardized accuracy of the winsorized root mean squared error (RMSE) based on the standard deviation (SD)
staccuracy(error_fun, ref_fun)

sa_mae_mad(actual, pred, na.rm = FALSE)
sa_wmae_mad(actual, pred, na.rm = FALSE)
sa_rmse_sd(actual, pred, na.rm = FALSE)
sa_wrmse_sd(actual, pred, na.rm = FALSE)
error_fun: function. The unquoted name of the function that calculates the error (or accuracy) measure. This function must be of the signature function(actual, pred, na.rm = na.rm).
ref_fun: function. The unquoted name of the function that calculates the reference error, accuracy, or deviation measure. This function must be of the same signature, function(actual, pred, na.rm = na.rm).
actual: numeric. The true (actual) labels.
pred: numeric. The predicted estimates. Must be the same length as actual.
na.rm: logical(1). Whether NA values should be removed (TRUE) or retained (FALSE, the default).
The core function staccuracy() receives as input a generic error function and a reference function against which to compare the error function's performance. These input functions must have the following signatures (see the argument specifications for details of the arguments):

error_fun: function(actual, pred, na.rm = na.rm); the output must be a scalar numeric (that is, a single number).
ref_fun: function(actual, pred, na.rm = na.rm); the output must be a scalar numeric (that is, a single number).
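As a sketch of that contract, here is a hypothetical custom staccuracy built from a user-defined median absolute error and a standard-deviation reference (medae and sd_ref are illustrative names, not package functions):

# Both helpers follow the required signature function(actual, pred, na.rm = na.rm)
medae <- function(actual, pred, na.rm = FALSE) median(abs(actual - pred), na.rm = na.rm)
sd_ref <- function(actual, pred, na.rm = FALSE) sd(actual, na.rm = na.rm)  # pred is ignored

sa_medae_sd <- staccuracy(medae, sd_ref)
sa_medae_sd(c(2.3, 4.5, 1.8, 7.6, 3.2), c(2.5, 4.2, 1.9, 7.4, 3.0))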
staccuracy() returns a function with signature function(actual, pred, na.rm = FALSE) that receives an actual and a pred vector as inputs and returns the staccuracy of the originally input error function based on the input reference function.

The convenience sa_*() functions return the staccuracy measures specified above.
# Here's some data
actual_1 <- c(2.3, 4.5, 1.8, 7.6, 3.2)
# Here are some predictions of that data
predicted_1 <- c(2.5, 4.2, 1.9, 7.4, 3.0)

# MAE measures the average error in the predictions
mae(actual_1, predicted_1)
# But how good is that?

# MAD gives the natural variation in the actual data; this is a point of comparison.
mad(actual_1)
# So, our predictions are better (lower) than the MAD, but how good, really?

# Create a standardized accuracy function to give us an easily interpretable metric:
my_mae_vs_mad_sa <- staccuracy(mae, mad)
# Now use it
my_mae_vs_mad_sa(actual_1, predicted_1)
# That's 94.2% standardized accuracy compared to the MAD. Pretty good!
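The scaling implied by this example (an error equal to the reference gives 50%; zero error gives 100%) can be checked directly. The closed form below is inferred from the numbers in the example rather than quoted from the package internals:

# staccuracy = 1 - error / (2 * reference)
1 - mae(actual_1, predicted_1) / (2 * mad(actual_1))  # ~0.942, matching the result above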
Winsorization means truncating the extremes of a numeric range by replacing extreme values with a predetermined minimum and maximum. winsorize() returns the input vector with any value below the provided minimum or above the provided maximum replaced by that minimum or maximum, respectively.
win_mae() and win_rmse() return the MAE and RMSE, respectively, calculated on winsorized predictions. The fundamental idea underlying the winsorization of predictions is that if the actual data has well-defined bounds, then models should not be penalized for being overzealous in predicting beyond the extremes of the data. Models that are overzealous at the boundaries might sometimes be superior within normal ranges; the extremes can easily be corrected by winsorization.
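The clipping idea can be sketched in a line of base R (a hypothetical re-implementation for illustration, not the package's own code):

# Values below win_range[1] are raised to it; values above win_range[2] are lowered to it
winsorize_sketch <- function(x, win_range) pmin(pmax(x, win_range[1]), win_range[2])
winsorize_sketch(c(0.5, 3, 11.5), c(2, 8))  # 2 3 8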
winsorize(x, win_range)
win_mae(actual, pred, win_range = range(actual), na.rm = FALSE)
win_rmse(actual, pred, win_range = range(actual), na.rm = FALSE)
x: numeric vector.
win_range: numeric(2). The minimum and maximum allowable values; any value outside this range is replaced by the nearer bound. For win_mae() and win_rmse(), the default is the range of actual.
actual: numeric vector. Actual (true) values of target outcome data.
pred: numeric vector. Predictions corresponding to each respective element in actual.
na.rm: logical(1). Whether missing values should be removed (TRUE) or retained (FALSE, the default).
winsorize() returns a winsorized vector.

win_mae() returns the mean absolute error (MAE) of winsorized predicted values pred compared to the actual values. See mae() for details.

win_rmse() returns the root mean squared error (RMSE) of winsorized predicted values pred compared to the actual values. See rmse() for details.
a <- c(3, 5, 2, 7, 9, 4, 6, 8, 2, 10)
p <- c(2.5, 5.5, 1.5, 6.5, 10.5, 3.5, 6, 7.5, 0.5, 11.5)

a  # the original data
winsorize(a, c(2, 8))  # a winsorized on defined boundaries

# range of the original data a
range(a)
# some overzealous predictions p
range(p)

# MAE penalizes overzealous predictions
mae(a, p)
# Winsorized MAE forgives overzealous predictions
win_mae(a, p)

# RMSE penalizes overzealous predictions
rmse(a, p)
# Winsorized RMSE forgives overzealous predictions
win_rmse(a, p)
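These winsorized errors feed the winsorized staccuracy measures described earlier; a brief follow-up using the same vectors:

sa_wmae_mad(a, p)  # staccuracy of the winsorized MAE against the MAD
sa_mae_mad(a, p)   # compare with the unwinsorized version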