Preprocess data prior to running machine learning — preprocess

Function to preprocess your data for input into run_ml().

Usage

preprocess_data(dataset, ...)

# S4 method for class 'TreeSummarizedExperiment'
preprocess_data(
  dataset,
  outcome_colname,
  assay.type = "counts",
  col.var = NULL,
  altexp = NULL,
  name = "preprocessed",
  ...
)

# S4 method for class 'ANY'
preprocess_data(
  dataset,
  outcome_colname,
  method = c("center", "scale"),
  remove_var = "nzv",
  collapse_corr_feats = TRUE,
  corr_method = "spearman",
  corr_thresh = 1,
  to_numeric = TRUE,
  group_neg_corr = TRUE,
  prefilter_threshold = 1,
  ...
)

Arguments

dataset: Data frame with an outcome variable and other columns as features. Alternatively, the input can be in TreeSummarizedExperiment format.
...: All additional arguments are passed on to caret::train(), such as case weights via the weights argument or ntree for rf models. See the caret::train() docs for more details.
outcome_colname: Name of the outcome variable column, as a string (default NULL; the first column will be chosen automatically).
assay.type: The name of the assay in dataset to use as input when dataset is a TreeSummarizedExperiment.
col.var: The names of sample metadata variables in the colData slot of dataset when dataset is a TreeSummarizedExperiment. These variables are used as predictors.
altexp: The name of the alternative experiment (altExp) in dataset when dataset is a TreeSummarizedExperiment. Use this to select which experiment provides the input.
name: Name under which the results are stored when the input is a TreeSummarizedExperiment. The same name is used for both the assay and the altExp.
method: Methods to preprocess the data, described in caret::preProcess() (default: c("center","scale"), use NULL for no normalization).
remove_var: Whether to remove variables with near-zero variance ('nzv'; default), zero variance ('zv'), or none (NULL).
collapse_corr_feats: Whether to keep only one feature from each group of correlated features (see corr_method and corr_thresh).
corr_method: Correlation method. Options are the same as those supported by stats::cor(): spearman, pearson, or kendall (default: spearman).
corr_thresh: Group features whose correlation is greater than or equal to corr_thresh (range 0 to 1; default: 1).
to_numeric: Whether to change features to numeric where possible.
group_neg_corr: Whether to group negatively correlated features together (e.g. c(0,1) and c(1,0)).
prefilter_threshold: Remove features which only have non-zero & non-NA values in N rows or fewer (default: 1). Set this to -1 to keep all columns at this step. This step will also be skipped if to_numeric is set to FALSE.
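
The filtering and grouping arguments above can be illustrated with a minimal base R sketch. It is not the package's implementation: the helpers prefilter() and correlated_pairs(), the toy table feats, and the tol argument (added only to absorb floating-point error) are hypothetical names introduced for this illustration. The sketch assumes that grouping negatively correlated features amounts to comparing the absolute correlation against corr_thresh.

```r
# Toy feature table: rows are samples, columns are features (made-up names).
feats <- data.frame(
  f1 = c(0, 1, 0, 1, 0, 1),
  f2 = c(1, 0, 1, 0, 1, 0), # perfectly negatively correlated with f1
  f3 = c(0, 0, 0, 0, 0, 7), # non-zero in only one row
  f4 = c(2, 9, 4, 1, 8, 3)
)

# Hypothetical helper mirroring the prefilter_threshold description:
# drop features with non-zero, non-NA values in `threshold` rows or fewer.
prefilter <- function(x, threshold = 1) {
  n_present <- colSums(!is.na(x) & x != 0)
  x[, n_present > threshold, drop = FALSE]
}

# Hypothetical helper mirroring corr_method, corr_thresh and group_neg_corr:
# list feature pairs whose correlation reaches the threshold.
correlated_pairs <- function(x, corr_thresh = 1, corr_method = "spearman",
                             group_neg_corr = TRUE, tol = 1e-8) {
  cors <- stats::cor(x, method = corr_method)
  # Assumption: negative correlations are grouped by taking absolute values.
  if (group_neg_corr) cors <- abs(cors)
  # Only look at the upper triangle so each pair is reported once.
  hits <- which(cors >= corr_thresh - tol & upper.tri(cors), arr.ind = TRUE)
  data.frame(
    feat1 = rownames(cors)[hits[, "row"]],
    feat2 = colnames(cors)[hits[, "col"]]
  )
}

kept <- prefilter(feats, threshold = 1)  # f3 is dropped
pairs_abs <- correlated_pairs(kept)      # f1 and f2 form a pair
pairs_signed <- correlated_pairs(kept, group_neg_corr = FALSE) # no pairs
```

In this toy table, f1 and f2 are the c(0,1) and c(1,0) case from the group_neg_corr description: they are paired only when negative correlations are grouped.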

Value

Named list including:

dat_transformed: Preprocessed data.
grp_feats: If features were grouped together, a named list of the features corresponding to each group.
removed_feats: Any features that were removed during preprocessing (e.g. because there was zero variance or near-zero variance for those features).

If the input is a TreeSummarizedExperiment, the output is added to the input object as additional data. If the set of features in the output matches the set in the input, the results are stored directly in the assay slot. If they do not match, the output is stored in the altExp slot of the object.
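
As a rough illustration of the storage rule for TreeSummarizedExperiment input, the sketch below picks a slot from two feature name vectors. The helper result_slot() is hypothetical, and treating "match" as set equality (ignoring order) is an assumption; the package may apply a stricter or different check.

```r
# Hypothetical helper: choose the slot that would hold the preprocessed data.
# Assumption: features "match" when both name sets are equal, ignoring order.
result_slot <- function(input_feats, output_feats) {
  if (setequal(input_feats, output_feats)) "assay" else "altExp"
}

# No features removed or grouped: same feature set, so the assay slot is used.
slot_same <- result_slot(c("Otu00001", "Otu00002"), c("Otu00001", "Otu00002"))

# One feature removed during preprocessing: feature sets differ, so altExp.
slot_diff <- result_slot(c("Otu00001", "Otu00002", "Otu00003"),
                         c("Otu00001", "Otu00002"))
```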

If the progressr package is installed, a progress bar with time elapsed and estimated time to completion can be displayed.

More details

See the preprocessing vignette for more details.

Note that if any values in outcome_colname contain spaces, they will be converted to underscores for compatibility with caret.
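
The effect of this conversion can be previewed with base R. This is only a sketch of the documented behavior using made-up outcome values, not the code the package runs internally.

```r
# Made-up outcome values; one of them contains a space.
outcomes <- c("normal", "colorectal cancer", "adenoma")

# Replace every space with an underscore, as described in the note above.
cleaned <- gsub(" ", "_", outcomes, fixed = TRUE)
```

If downstream code matches on the original labels, it may help to apply the same replacement there so the outcome values still line up after preprocessing.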

Author

Zena Lapp, zenalapp@umich.edu

Kelly Sovacool, sovacool@umich.edu

Examples

preprocess_data(mikropml::otu_small, "dx")
#> Using 'dx' as the outcome column.
#> $dat_transformed
#> # A tibble: 200 × 61
#>    dx    Otu00001 Otu00002 Otu00003 Otu00004 Otu00005 Otu00006 Otu00007 Otu00008
#>    <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#>  1 norm…   -0.420   -0.219   -0.174  -0.591   -0.0488  -0.167    -0.569 -0.0624 
#>  2 norm…   -0.105    1.75    -0.718   0.0381   1.54    -0.573    -0.643 -0.132  
#>  3 norm…   -0.708    0.696    1.43    0.604   -0.265   -0.0364   -0.612 -0.207  
#>  4 norm…   -0.494   -0.665    2.02   -0.593   -0.676   -0.586    -0.552 -0.470  
#>  5 norm…    1.11    -0.395   -0.754  -0.586   -0.754    2.73      0.191 -0.676  
#>  6 norm…   -0.685    0.614   -0.174  -0.584    0.376    0.804    -0.337 -0.00608
#>  7 canc…   -0.770   -0.496   -0.318   0.159   -0.658    2.20     -0.717  0.0636 
#>  8 norm…   -0.424   -0.478   -0.397  -0.556   -0.391   -0.0620    0.376 -0.0222 
#>  9 norm…   -0.556    1.14     1.62   -0.352   -0.275   -0.465    -0.804  0.294  
#> 10 canc…    1.46    -0.451   -0.694  -0.0567  -0.706    0.689    -0.370  1.59   
#> # ℹ 190 more rows
#> # ℹ 52 more variables: Otu00009 <dbl>, Otu00010 <dbl>, Otu00011 <dbl>,
#> #   Otu00012 <dbl>, Otu00013 <dbl>, Otu00014 <dbl>, Otu00015 <dbl>,
#> #   Otu00016 <dbl>, Otu00017 <dbl>, Otu00018 <dbl>, Otu00019 <dbl>,
#> #   Otu00020 <dbl>, Otu00021 <dbl>, Otu00022 <dbl>, Otu00023 <dbl>,
#> #   Otu00024 <dbl>, Otu00025 <dbl>, Otu00026 <dbl>, Otu00027 <dbl>,
#> #   Otu00028 <dbl>, Otu00029 <dbl>, Otu00030 <dbl>, Otu00031 <dbl>, …
#> 
#> $grp_feats
#> NULL
#> 
#> $removed_feats
#> character(0)
#> 

# the function can show a progress bar if you have the progressr package installed
## optionally, specify the progress bar format
progressr::handlers(progressr::handler_progress(
  format = ":message :bar :percent | elapsed: :elapsed | eta: :eta",
  clear = FALSE,
  show_after = 0
))
## tell progressr to always report progress
if (FALSE) { # \dontrun{
progressr::handlers(global = TRUE)
## run the function and watch the live progress updates
dat_preproc <- preprocess_data(mikropml::otu_small, "dx")

# Create TreeSE object
library(TreeSummarizedExperiment)
df <- mikropml::otu_small
assay <- df[, !colnames(df) %in% c("dx"), drop = FALSE] |> t() |> as.matrix()
tse <- TreeSummarizedExperiment(assays = SimpleList(counts = assay))
colData(tse)[["dx"]] <- df[["dx"]]

# Preprocess
tse <- preprocess_data(
  dataset = tse,
  assay.type = "counts",
  outcome_colname = "dx"
)
# The result is stored in the assay slot
tse
} # }