In this tutorial, we show how you can speed up pre-processing, model training, and feature importance steps for individual runs, as well as how to train multiple models in parallel within R and visualize the results. However, we highly recommend using a workflow manager such as Snakemake rather than parallelizing within a single R session. Jump to the section Parallelizing with Snakemake below if you’re interested in skipping right to our best recommendation.
library(mikropml)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(ggplot2)
Speed up single runs
By default, preprocess_data()
, run_ml()
,
and compare_models()
use only one process in series. If
you’d like to parallelize various steps of the pipeline to make them run
faster, install foreach
, future
,
future.apply
, and doFuture
. Then, register a
future plan prior to calling these functions:
doFuture::registerDoFuture()
future::plan(future::multicore, workers = 2)
Above, we used the multicore
plan to split the work
across 2 cores. See the future
documentation for more about picking the best plan for your use
case. Notably, multicore
does not work inside RStudio or on
Windows; you will need to use multisession
instead in those
cases.
After registering a future plan, you can call
preprocess_data()
and run_ml()
as usual, and
they will run certain tasks in parallel.
otu_data_preproc <- preprocess_data(otu_mini_bin, "dx")$dat_transformed
result1 <- run_ml(otu_data_preproc, "glmnet", seed = 2019)
There’s a also a parallel version of the rf
engine
called parRF
which trains the trees in the forest in
parallel. See the caret
docs for more information.
Bootstrap performance
If you only intend to call run_ml()
once to generate one
train/test split (e.g. such as for a temporal split of the dataset), you
can evaluate the model performance by bootstrapping the test set.
Here we show how to generate 100
bootstraps and
calculate a confidence interval for the model performance. We only use
100
here for computation speed, but it is recommended to
generate 10000
bootstraps for a more precise estimation of
the confidence interval.
boot_perf <- bootstrap_performance(result1,
outcome_colname = "dx",
bootstrap_times = 100, alpha = 0.05
)
boot_perf
#> # A tibble: 15 × 6
#> term .lower .estimate .upper .alpha .method
#> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 AUC 0.437 0.641 0.830 0.05 percentile
#> 2 Accuracy 0.410 0.584 0.769 0.05 percentile
#> 3 Balanced_Accuracy 0.412 0.584 0.760 0.05 percentile
#> 4 Detection_Rate 0.128 0.3 0.475 0.05 percentile
#> 5 F1 0.340 0.582 0.8 0.05 percentile
#> 6 Kappa -0.176 0.166 0.520 0.05 percentile
#> 7 Neg_Pred_Value 0.394 0.584 0.8 0.05 percentile
#> 8 Pos_Pred_Value 0.333 0.585 0.8 0.05 percentile
#> 9 Precision 0.333 0.585 0.8 0.05 percentile
#> 10 Recall 0.315 0.591 0.85 0.05 percentile
#> 11 Sensitivity 0.315 0.591 0.85 0.05 percentile
#> 12 Specificity 0.372 0.578 0.765 0.05 percentile
#> 13 cv_metric_AUC 0.622 0.622 0.622 0.05 percentile
#> 14 logLoss 0.660 0.685 0.715 0.05 percentile
#> 15 prAUC 0.442 0.585 0.739 0.05 percentile
Call run_ml()
multiple times in parallel in R
You can use functions from the future.apply
package to
call run_ml()
multiple times in parallel with different
parameters. You will first need to run future::plan()
as
above if you haven’t already. Then, call run_ml()
with
multiple seeds using future_lapply()
:
# NOTE: use more seeds for real-world data
results_multi <- future.apply::future_lapply(seq(100, 102), function(seed) {
run_ml(otu_data_preproc, "glmnet", seed = seed)
}, future.seed = TRUE)
#> Using 'dx' as the outcome column.
#> Training the model...
#> Loading required package: lattice
#>
#> Attaching package: 'caret'
#> The following object is masked from 'package:mikropml':
#>
#> compare_models
#> Training complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Loading required package: lattice
#>
#> Attaching package: 'caret'
#> The following object is masked from 'package:mikropml':
#>
#> compare_models
#> Training complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
Each call to run_ml()
with a different seed uses a
different random split of the data into training and testing sets. Since
we are using seeds, we must set future.seed
to
TRUE
(see the future.apply
documentation and this
blog post for details on parallel-safe random seeds). This example
uses only a few seeds for speed and simplicity, but for real data we
recommend using many more seeds to get a better estimate of model
performance.
In these examples, we used functions from the
future.apply
package to run_ml()
in parallel,
but you can accomplish the same thing with parallel versions of the
purrr::map()
functions using the furrr
package
(e.g. furrr::future_map_dfr()
).
Extract the performance results and combine into one dataframe for all seeds:
perf_df <- future.apply::future_lapply(results_multi,
function(result) {
result[["performance"]] %>%
select(cv_metric_AUC, AUC, method)
},
future.seed = TRUE
) %>%
dplyr::bind_rows()
perf_df
#> # A tibble: 3 × 3
#> cv_metric_AUC AUC method
#> <dbl> <dbl> <chr>
#> 1 0.630 0.634 glmnet
#> 2 0.591 0.608 glmnet
#> 3 0.671 0.471 glmnet
Multiple ML methods
You may also wish to compare performance for different ML methods.
mapply()
can iterate over multiple lists or vectors, and
future_mapply()
works the same way:
# NOTE: use more seeds for real-world data
param_grid <- expand.grid(
seeds = seq(100, 103),
methods = c("glmnet", "rf")
)
results_mtx <- future.apply::future_mapply(
function(seed, method) {
run_ml(otu_data_preproc,
method,
seed = seed,
find_feature_importance = TRUE
)
},
param_grid$seeds,
param_grid$methods %>% as.character(),
future.seed = TRUE
)
#> Using 'dx' as the outcome column.
#> Training the model...
#> Loading required package: lattice
#>
#> Attaching package: 'caret'
#> The following object is masked from 'package:mikropml':
#>
#> compare_models
#> Training complete.
#> Finding feature importance...
#> Feature importance complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
#> Finding feature importance...
#> Feature importance complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
#> Finding feature importance...
#> Feature importance complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
#> Finding feature importance...
#> Feature importance complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Loading required package: lattice
#>
#> Attaching package: 'caret'
#> The following object is masked from 'package:mikropml':
#>
#> compare_models
#> Training complete.
#> Finding feature importance...
#> Feature importance complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
#> Finding feature importance...
#> Feature importance complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
#> Finding feature importance...
#> Feature importance complete.
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
#> Finding feature importance...
#> Feature importance complete.
Visualize the results
ggplot2
is required to use our plotting functions below.
You can also create your own plots however you like using the results
data.
Performance
Mean AUC
perf_df <- lapply(
results_mtx["performance", ],
function(x) {
x %>% select(cv_metric_AUC, AUC, method)
}
) %>%
dplyr::bind_rows()
perf_boxplot <- plot_model_performance(perf_df)
perf_boxplot
plot_model_performance()
returns a ggplot2 object. You
can add layers to customize the plot:
perf_boxplot +
theme_classic() +
scale_color_brewer(palette = "Dark2") +
coord_flip()
ROC and PRC curves
First calculate the sensitivity, specificity, and precision for all models.
get_sensspec_seed <- function(colnum) {
result <- results_mtx[, colnum]
trained_model <- result$trained_model
test_data <- result$test_data
seed <- result$performance$seed
method <- result$trained_model$method
sensspec <- calc_model_sensspec(
trained_model,
test_data,
"dx"
) %>%
mutate(seed = seed, method = method)
return(sensspec)
}
sensspec_dat <- purrr::map_dfr(
seq(1, dim(results_mtx)[2]),
get_sensspec_seed
)
#> Using 'dx' as the outcome column.
#> Using 'dx' as the outcome column.
#> Using 'dx' as the outcome column.
#> Using 'dx' as the outcome column.
#> Using 'dx' as the outcome column.
#> Using 'dx' as the outcome column.
#> Using 'dx' as the outcome column.
#> Using 'dx' as the outcome column.
Plot curves for a single model
sensspec_1 <- sensspec_dat %>% filter(seed == 100, method == "glmnet")
sensspec_1 %>%
ggplot(aes(x = specificity, y = sensitivity, )) +
geom_line() +
geom_abline(
intercept = 1, slope = 1,
linetype = "dashed", color = "grey50"
) +
coord_equal() +
scale_x_reverse(expand = c(0, 0), limits = c(1.01, -0.01)) +
scale_y_continuous(expand = c(0, 0), limits = c(-0.01, 1.01)) +
labs(x = "Specificity", y = "Sensitivity") +
theme_bw() +
theme(legend.title = element_blank())
baseline_precision_otu <- calc_baseline_precision(
otu_data_preproc,
"dx", "cancer"
)
#> Using 'dx' as the outcome column.
sensspec_1 %>%
rename(recall = sensitivity) %>%
ggplot(aes(x = recall, y = precision, )) +
geom_line() +
geom_hline(
yintercept = baseline_precision_otu,
linetype = "dashed", color = "grey50"
) +
coord_equal() +
scale_x_continuous(expand = c(0, 0), limits = c(-0.01, 1.01)) +
scale_y_continuous(expand = c(0, 0), limits = c(-0.01, 1.01)) +
labs(x = "Recall", y = "Precision") +
theme_bw() +
theme(legend.title = element_blank())
Plot mean ROC and PRC for all models
sensspec_dat %>%
calc_mean_roc() %>%
plot_mean_roc()
sensspec_dat %>%
calc_mean_prc() %>%
plot_mean_prc(baseline_precision = baseline_precision_otu)
Feature importance
The perf_metric_diff
from the feature importance data
frame contains the differences between the performance on the actual
test data and the performance on the permuted test data
(i.e. test minus permuted). If a
feature is important for model performance, we expect
perf_metric_diff
to be positive. In other words, the
features that resulted in the largest decrease in
performance when permuted are the most important features.
Feature importance for multiple models
You can select the top n most important features for your models and plot them like so:
feat_df <- results_mtx["feature_importance", ] %>%
dplyr::bind_rows()
top_n <- 5
top_feats <- feat_df %>%
group_by(method, feat) %>%
summarize(mean_diff = median(perf_metric_diff)) %>%
filter(mean_diff > 0) %>%
slice_max(order_by = mean_diff, n = top_n)
#> `summarise()` has grouped output by 'method'. You can override using the
#> `.groups` argument.
feat_df %>%
right_join(top_feats, by = c("method", "feat")) %>%
mutate(features = forcats::fct_reorder(factor(feat), mean_diff)) %>%
ggplot(aes(x = perf_metric_diff, y = features, color = method)) +
geom_boxplot() +
geom_vline(xintercept = 0, linetype = "dashed") +
labs(
x = "Decrease in performance (actual minus permutation)",
y = "Features",
caption = "Features which have a lower performance when permuted have a
difference in performance above zero. The features with the greatest
decrease are the most important for model performance." %>%
stringr::str_wrap(width = 100)
) +
theme_bw() +
theme(plot.caption = element_text(hjust = 0))
See the docs for get_feature_importance()
for more
details on how these values are computed.
Feature importance for a single model
You can also plot feature importance for a single model. Here we report the actual performance, the permutation performance, and the empirical 95% confidence interval for the permutation performance.
feat_imp_1 <- results_mtx[, 1][["feature_importance"]]
perf_metric_name <- results_mtx[, 1][["trained_model"]]$metric
perf_actual <- results_mtx[, 1][["performance"]] %>% pull(perf_metric_name)
feat_imp_1 %>%
filter(perf_metric_diff > 0) %>%
mutate(feat = if_else(pvalue < 0.05, paste0("*", feat), as.character(feat)) %>%
as.factor() %>%
forcats::fct_reorder(perf_metric_diff)) %>%
ggplot(aes(x = perf_metric, xmin = lower, xmax = upper, y = feat)) +
geom_pointrange() +
geom_vline(xintercept = perf_actual, linetype = "dashed") +
labs(
x = "Permutation performance",
y = "Features",
caption = "The dashed line represents the actual performance on the
test set. Features which have a lower performance when permuted are
important for model performance. Significant features (pvalue < 0.05)
are marked with an asterisk (*). Error bars represent the 95%
confidence interval." %>% stringr::str_wrap(width = 110)
) +
theme_bw() +
theme(plot.caption = element_text(hjust = 0))
Live progress updates
preprocess_data()
and
get_feature_importance()
support reporting live progress
updates using the progressr
package. The format is up to
you, but we recommend using a progress bar like this:
# optionally, specify the progress bar format with the `progress` package.
progressr::handlers(progressr::handler_progress(
format = ":message :bar :percent | elapsed: :elapsed | eta: :eta",
clear = FALSE,
show_after = 0
))
# tell progressr to always report progress in any functions that use it.
# set this to FALSE to turn it back off again.
progressr::handlers(global = TRUE)
# run your code and watch the live progress updates.
dat <- preprocess_data(otu_mini_bin, "dx")$dat_transformed
#> Using 'dx' as the outcome column.
#> preprocessing ========================>------- 78% | elapsed: 1s | eta: 0s
results <- run_ml(dat, "glmnet",
kfold = 2, cv_times = 2,
find_feature_importance = TRUE
)
#> Using 'dx' as the outcome column.
#> Training the model...
#> Training complete.
#> Feature importance =========================== 100% | elapsed: 37s | eta: 0s
Note that some future backends support “near-live” progress updates,
meaning the progress may not be reported immediately when parallel
processing with futures. Read more on that in
the progressr
vignette. For more on
progressr
and how to customize the format of progress
updates, see the progressr
docs.
Parallelizing with Snakemake
When parallelizing multiple calls to run_ml()
in R as in
the examples above, all of the results objects are held in memory. This
isn’t a big deal for a small dataset run with only a few seeds. However,
for large datasets run in parallel with, say, 100 seeds (recommended),
you may run into problems trying to store all of those objects in memory
at once.
Using a workflow manager such as Snakemake or Nextflow is highly recommend to maximize the scalability and reproducibility of computational analyses. We created a template Snakemake workflow here which you can use as a starting point for your ML project.