Function to preprocess your data for input into run_ml().
preprocess_data(
dataset,
outcome_colname,
method = c("center", "scale"),
remove_var = "nzv",
collapse_corr_feats = TRUE,
to_numeric = TRUE,
group_neg_corr = TRUE,
prefilter_threshold = 1
)
dataset: Dataframe with an outcome variable and other columns as features.
outcome_colname: Name of the outcome column, as a string (default: NULL; the first column will be chosen automatically).
method: Methods to preprocess the data, described in caret::preProcess() (default: c("center","scale"); use NULL for no normalization).
remove_var: Whether to remove variables with near-zero variance ('nzv'; default), zero variance ('zv'), or none (NULL).
collapse_corr_feats: Whether to keep only one of each set of perfectly correlated features.
to_numeric: Whether to change features to numeric where possible.
group_neg_corr: Whether to group negatively correlated features together (e.g. c(0,1) and c(1,0)).
prefilter_threshold: Remove features that have non-zero and non-NA values in N rows or fewer (default: 1). Set this to -1 to keep all columns at this step. This step will also be skipped if to_numeric is set to FALSE.
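To make a few of the terms in these descriptions concrete, here is a minimal base R sketch. It does not call preprocess_data() and is not the package's internal implementation; the data frames and column names (toy, sparse, feat_a, and so on) are hypothetical and exist only for illustration.

```r
# Toy data; all names here are hypothetical and not part of mikropml.
toy <- data.frame(
  feat_a = c(1, 2, 3, 4),
  feat_b = c(2, 4, 6, 8),  # perfectly positively correlated with feat_a
  feat_c = c(4, 3, 2, 1)   # perfectly negatively correlated with feat_a
)

# "center" subtracts each column's mean; "scale" divides by its standard deviation.
scaled <- scale(toy, center = TRUE, scale = TRUE)
col_means <- colMeans(scaled)    # approximately 0 for every column
col_sds <- apply(scaled, 2, sd)  # approximately 1 for every column

# Pairwise Pearson correlations: feat_a vs feat_b is 1, feat_a vs feat_c is -1.
corr <- cor(toy)

# Count, per feature, the rows holding a non-zero and non-NA value.
sparse <- data.frame(
  feat_d = c(0, 0, 5, NA),  # one such row
  feat_e = c(1, 0, 2, 3)    # three such rows
)
n_informative <- colSums(sparse != 0 & !is.na(sparse))

# With a threshold of 1, features with one such row or fewer are flagged.
threshold <- 1
flagged <- names(n_informative)[n_informative <= threshold]
```

In this sketch feat_b is the kind of feature collapse_corr_feats is concerned with, feat_c is the kind group_neg_corr is concerned with, and feat_d is the kind a prefilter_threshold of 1 would be expected to drop.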
Named list including:
dat_transformed: Preprocessed data.
grp_feats: If features were grouped together, a named list of the features corresponding to each group.
removed_feats: Any features that were removed during preprocessing (e.g. because there was zero variance or near-zero variance for those features).
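The elements of the returned list can be pulled out by name. The following is a short sketch, assuming the mikropml package is installed and using its bundled otu_small dataset with "dx" as the outcome column; the variable names preproc and dat are illustrative only.

```r
library(mikropml)

# Preprocess the example dataset; "dx" is the outcome column.
preproc <- preprocess_data(otu_small, "dx")

# The preprocessed data, with the outcome column kept alongside the features.
dat <- preproc$dat_transformed
dim(dat)

# Feature groupings, or NULL when no features were grouped together.
preproc$grp_feats

# Names of any features dropped during preprocessing.
preproc$removed_feats
```

Of the three elements, dat_transformed is the one intended to be passed on to run_ml(); the other two record what preprocessing changed.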
If the progressr package is installed, a progress bar with time elapsed and estimated time to completion can be displayed.
See the preprocessing vignette for more details.
Note that if any values in outcome_colname contain spaces, they will be converted to underscores for compatibility with caret.
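As a rough illustration of the conversion described in the note above, the base R sketch below replaces spaces with underscores in a hypothetical vector of outcome values. It only shows the effect; it is an assumption about what the result looks like, not the package's own code.

```r
# Hypothetical outcome values, one of which contains a space.
outcome <- c("colorectal cancer", "normal", "colorectal cancer")

# Replace every space with an underscore.
converted <- gsub(" ", "_", outcome)
converted
```

If downstream code compares outcome labels against the original strings, it may need to use the underscore form instead.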
preprocess_data(mikropml::otu_small, "dx")
#> Using 'dx' as the outcome column.
#> $dat_transformed
#> # A tibble: 200 × 61
#> dx Otu00001 Otu00002 Otu00003 Otu00004 Otu00005 Otu00006 Otu00007 Otu00008
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 norm… -0.420 -0.219 -0.174 -0.591 -0.0488 -0.167 -0.569 -0.0624
#> 2 norm… -0.105 1.75 -0.718 0.0381 1.54 -0.573 -0.643 -0.132
#> 3 norm… -0.708 0.696 1.43 0.604 -0.265 -0.0364 -0.612 -0.207
#> 4 norm… -0.494 -0.665 2.02 -0.593 -0.676 -0.586 -0.552 -0.470
#> 5 norm… 1.11 -0.395 -0.754 -0.586 -0.754 2.73 0.191 -0.676
#> 6 norm… -0.685 0.614 -0.174 -0.584 0.376 0.804 -0.337 -0.00608
#> 7 canc… -0.770 -0.496 -0.318 0.159 -0.658 2.20 -0.717 0.0636
#> 8 norm… -0.424 -0.478 -0.397 -0.556 -0.391 -0.0620 0.376 -0.0222
#> 9 norm… -0.556 1.14 1.62 -0.352 -0.275 -0.465 -0.804 0.294
#> 10 canc… 1.46 -0.451 -0.694 -0.0567 -0.706 0.689 -0.370 1.59
#> # … with 190 more rows, and 52 more variables: Otu00009 <dbl>, Otu00010 <dbl>,
#> # Otu00011 <dbl>, Otu00012 <dbl>, Otu00013 <dbl>, Otu00014 <dbl>,
#> # Otu00015 <dbl>, Otu00016 <dbl>, Otu00017 <dbl>, Otu00018 <dbl>,
#> # Otu00019 <dbl>, Otu00020 <dbl>, Otu00021 <dbl>, Otu00022 <dbl>,
#> # Otu00023 <dbl>, Otu00024 <dbl>, Otu00025 <dbl>, Otu00026 <dbl>,
#> # Otu00027 <dbl>, Otu00028 <dbl>, Otu00029 <dbl>, Otu00030 <dbl>,
#> # Otu00031 <dbl>, Otu00032 <dbl>, Otu00033 <dbl>, Otu00034 <dbl>, …
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> character(0)
#>
# the function can show a progress bar if you have the progressr package installed
## optionally, specify the progress bar format
progressr::handlers(progressr::handler_progress(
format = ":message :bar :percent | elapsed: :elapsed | eta: :eta",
clear = FALSE,
show_after = 0
))
## tell progressr to always report progress
if (FALSE) {
progressr::handlers(global = TRUE)
## run the function and watch the live progress updates
dat_preproc <- preprocess_data(mikropml::otu_small, "dx")
}