Function to preprocess your data for input into run_ml().
Usage
preprocess_data(
dataset,
outcome_colname,
method = c("center", "scale"),
remove_var = "nzv",
collapse_corr_feats = TRUE,
to_numeric = TRUE,
group_neg_corr = TRUE,
prefilter_threshold = 1
)
Arguments
- dataset
Data frame with an outcome variable and other columns as features.
- outcome_colname
Column name as a string of the outcome variable (default NULL; the first column will be chosen automatically).
- method
Methods to preprocess the data, described in caret::preProcess() (default: c("center", "scale"); use NULL for no normalization).
- remove_var
Whether to remove variables with near-zero variance ('nzv'; default), zero variance ('zv'), or none (NULL).
- collapse_corr_feats
Whether to keep only one of perfectly correlated features.
- to_numeric
Whether to change features to numeric where possible.
- group_neg_corr
Whether to group negatively correlated features together (e.g. c(0,1) and c(1,0)).
- prefilter_threshold
Remove features that have non-zero & non-NA values in N rows or fewer (default: 1). Set this to -1 to keep all columns at this step. This step will also be skipped if to_numeric is set to FALSE.
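The following base R sketch illustrates the ideas behind three of these arguments on a toy data frame. It is not the package's implementation: the data frame toy, its columns f1 to f4, and the intermediate variable names are all hypothetical, and the sketch assumes only what the argument descriptions above state.

```r
# Toy data: an outcome column plus four hypothetical features.
toy <- data.frame(
  outcome = c("a", "b", "a", "b"),
  f1 = c(1, 2, 3, 4),
  f2 = c(0, 1, 0, 1),
  f3 = c(1, 0, 1, 0), # mirror image of f2
  f4 = c(0, 0, 0, 5)  # non-zero in a single row
)
feats <- toy[, setdiff(names(toy), "outcome")]

# method = c("center", "scale"): centering subtracts the column mean,
# scaling divides by the column standard deviation.
f1_scaled <- (feats$f1 - mean(feats$f1)) / sd(feats$f1)

# group_neg_corr: f2 and f3 are perfectly negatively correlated,
# like the c(0,1) and c(1,0) example above.
cor_f2_f3 <- cor(feats$f2, feats$f3)

# prefilter_threshold: count the rows with non-zero, non-NA values per
# feature; features at or below the threshold are candidates for removal.
n_nonzero <- colSums(feats != 0 & !is.na(feats))
below_threshold <- names(n_nonzero)[n_nonzero <= 1]
```

With the default threshold of 1, only f4 falls at or below the cutoff in this toy example.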
Value
Named list including:
- dat_transformed: Preprocessed data.
- grp_feats: If features were grouped together, a named list of the features corresponding to each group.
- removed_feats: Any features that were removed during preprocessing (e.g. because there was zero variance or near-zero variance for those features).
If the progressr package is installed, a progress bar with time elapsed and estimated time to completion can be displayed.
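As a rough sketch of how the returned list might be inspected, the helper below summarises its three documented elements. The function summarize_preproc() is hypothetical and not part of mikropml, and the mock list only imitates the documented structure so that the example runs without the package installed.

```r
# Hypothetical helper: summarise the three documented elements of the result.
summarize_preproc <- function(result) {
  expected <- c("dat_transformed", "grp_feats", "removed_feats")
  stopifnot(all(expected %in% names(result)))
  list(
    n_rows = nrow(result$dat_transformed),
    n_cols = ncol(result$dat_transformed),
    n_groups = length(result$grp_feats),      # 0 when grp_feats is NULL
    n_removed = length(result$removed_feats)  # 0 when nothing was removed
  )
}

# Mock result with the same element names as the documented return value.
mock <- list(
  dat_transformed = data.frame(
    dx = c("normal", "cancer"),
    Otu00001 = c(-0.7, 0.7)
  ),
  grp_feats = NULL,
  removed_feats = character(0)
)
preproc_summary <- summarize_preproc(mock)
```

In real use, the list returned by preprocess_data() would be passed in place of mock.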
More details
See the preprocessing vignette for more details.
Note that if any values in outcome_colname contain spaces, they will be converted to underscores for compatibility with caret.
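The effect of that conversion can be illustrated in base R. This is a minimal sketch of the described behaviour, not necessarily how the package performs it internally; the vector outcome and its values are made up for illustration.

```r
# Hypothetical outcome values, one of which contains a space.
outcome <- c("not cancer", "cancer", "not cancer")

# Replace every literal space with an underscore.
outcome_clean <- gsub(" ", "_", outcome, fixed = TRUE)
```

Downstream code that refers to outcome levels by name may therefore need to use the underscore form.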
Examples
preprocess_data(mikropml::otu_small, "dx")
#> Using 'dx' as the outcome column.
#> $dat_transformed
#> # A tibble: 200 × 61
#> dx Otu00001 Otu00002 Otu00003 Otu00004 Otu00005 Otu00006 Otu00007 Otu00008
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 norm… -0.420 -0.219 -0.174 -0.591 -0.0488 -0.167 -0.569 -0.0624
#> 2 norm… -0.105 1.75 -0.718 0.0381 1.54 -0.573 -0.643 -0.132
#> 3 norm… -0.708 0.696 1.43 0.604 -0.265 -0.0364 -0.612 -0.207
#> 4 norm… -0.494 -0.665 2.02 -0.593 -0.676 -0.586 -0.552 -0.470
#> 5 norm… 1.11 -0.395 -0.754 -0.586 -0.754 2.73 0.191 -0.676
#> 6 norm… -0.685 0.614 -0.174 -0.584 0.376 0.804 -0.337 -0.00608
#> 7 canc… -0.770 -0.496 -0.318 0.159 -0.658 2.20 -0.717 0.0636
#> 8 norm… -0.424 -0.478 -0.397 -0.556 -0.391 -0.0620 0.376 -0.0222
#> 9 norm… -0.556 1.14 1.62 -0.352 -0.275 -0.465 -0.804 0.294
#> 10 canc… 1.46 -0.451 -0.694 -0.0567 -0.706 0.689 -0.370 1.59
#> # ℹ 190 more rows
#> # ℹ 52 more variables: Otu00009 <dbl>, Otu00010 <dbl>, Otu00011 <dbl>,
#> # Otu00012 <dbl>, Otu00013 <dbl>, Otu00014 <dbl>, Otu00015 <dbl>,
#> # Otu00016 <dbl>, Otu00017 <dbl>, Otu00018 <dbl>, Otu00019 <dbl>,
#> # Otu00020 <dbl>, Otu00021 <dbl>, Otu00022 <dbl>, Otu00023 <dbl>,
#> # Otu00024 <dbl>, Otu00025 <dbl>, Otu00026 <dbl>, Otu00027 <dbl>,
#> # Otu00028 <dbl>, Otu00029 <dbl>, Otu00030 <dbl>, Otu00031 <dbl>, …
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> character(0)
#>
# the function can show a progress bar if you have the progressr package installed
## optionally, specify the progress bar format
progressr::handlers(progressr::handler_progress(
format = ":message :bar :percent | elapsed: :elapsed | eta: :eta",
clear = FALSE,
show_after = 0
))
## tell progressr to always report progress
if (FALSE) {
progressr::handlers(global = TRUE)
## run the function and watch the live progress updates
dat_preproc <- preprocess_data(mikropml::otu_small, "dx")
}