Before training a model, it’s often necessary to preprocess your input
data. We provide a function (`preprocess_data()`) to do this. The
defaults we chose are based on best practices used in FIDDLE
(Tang et al. 2020). Feel free to check out
FIDDLE for more information about data preprocessing!

`preprocess_data()` takes an input dataset where the rows
are the samples and the columns are the outcome variable and features.
We preprocess the data as follows:

- Remove samples with missing outcome values.
- Convert any spaces in outcome names to underscores (`_`).
- Leave binary features as-is (except that categorical variables are converted to 0 and 1, and binary variables with missing values are split into two columns - see below for more details).
- Normalize continuous features using `caret::preProcess()` based on the method provided.
- Convert categorical features with more than 2 categories to 0 and 1 in multiple columns (one for each category, so each category has its own column).
- Replace missing categorical data with 0.
- Impute missing continuous values with the median of the feature.
- By default, remove all features with near-zero variance (with the option to remove only features with zero variance instead).
- By default, collapse perfectly correlated features.

## It’s running so slow!

Since I assume a lot of you won’t read this entire vignette, I’m
going to say this at the beginning. If `preprocess_data()` is running
super slow, you should consider parallelizing it so it goes faster!
`preprocess_data()` can also report live progress updates.
See `vignette("parallel")` for details.

## Examples

We’re going to start off simple and get more complicated, but if you want the whole shebang at once, just scroll to the bottom.

First, we have to load `mikropml` with `library(mikropml)`.

### Binary data

Let’s start with only binary variables:

```
# raw binary dataset
bin_df <- data.frame(
outcome = c("normal", "normal", "cancer"),
var1 = c("no", "yes", "no"),
var2 = c(0, 1, 1),
var3 = factor(c("a", "a", "b"))
)
bin_df
#> outcome var1 var2 var3
#> 1 normal no 0 a
#> 2 normal yes 1 a
#> 3 cancer no 1 b
```

In addition to the dataframe itself, you have to provide the name of
the outcome column to `preprocess_data()`. Here’s what the
preprocessed data looks like:

```
# preprocess raw binary data
preprocess_data(dataset = bin_df, outcome_colname = "outcome")
#> Using 'outcome' as the outcome column.
#> $dat_transformed
#> # A tibble: 3 × 4
#> outcome var1_yes var2_1 var3_b
#> <chr> <dbl> <dbl> <dbl>
#> 1 normal 0 0 0
#> 2 normal 1 1 0
#> 3 cancer 0 1 1
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> character(0)
```

The output is a list: `dat_transformed`, which has the
transformed data; `grp_feats`, which is a list of grouped
features; and `removed_feats`, which is a list of features
that were removed. Here, `grp_feats` is `NULL` because there are no
perfectly correlated features (e.g. `c(0,1,0)` and `c(0,1,0)`, or
`c(0,1,0)` and `c(1,0,1)` - see below for more details).

The first feature column (`var1`) is a character and is changed to
`var1_yes` in `dat_transformed`, with zeros (no) and ones (yes). The
values in the second column (`var2`) stay the same because it’s already
binary, but the name changes to `var2_1`. The third column (`var3`) is
a factor and is also changed to binary where b is 1 and a is 0, as
denoted by the new column name `var3_b`.
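If it helps to see the recoding spelled out, here is a minimal base R sketch of the same idea. It is only an illustration (the vectors `var1_yes` and `var3_b` are built by hand here), not the code that `preprocess_data()` runs internally:

```r
# toy copies of the two non-numeric binary features from above
var1 <- c("no", "yes", "no")
var3 <- factor(c("a", "a", "b"))

# 1 where the value matches the level named in the new column, 0 otherwise
var1_yes <- as.integer(var1 == "yes")
var3_b <- as.integer(var3 == "b")

data.frame(var1_yes, var3_b)
```

The resulting columns match `var1_yes` and `var3_b` in the preprocessed output above.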

### Categorical data

On to non-binary categorical data:

```
# raw categorical dataset
cat_df <- data.frame(
outcome = c("normal", "normal", "cancer"),
var1 = c("a", "b", "c")
)
cat_df
#> outcome var1
#> 1 normal a
#> 2 normal b
#> 3 cancer c
```

```
# preprocess raw categorical data
preprocess_data(dataset = cat_df, outcome_colname = "outcome")
#> Using 'outcome' as the outcome column.
#> $dat_transformed
#> # A tibble: 3 × 4
#> outcome var1_a var1_b var1_c
#> <chr> <dbl> <dbl> <dbl>
#> 1 normal 1 0 0
#> 2 normal 0 1 0
#> 3 cancer 0 0 1
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> character(0)
```

As you can see, this variable was split into 3 different columns -
one for each type (a, b, and c). And again, `grp_feats` is `NULL`.
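This kind of split is often called one-hot encoding. A rough base R sketch of the idea, assuming we name the new columns ourselves to mimic the output above (this is not the package’s internal implementation):

```r
# toy copy of the categorical feature from above
var1 <- c("a", "b", "c")
categories <- sort(unique(var1))

# one 0/1 column per category: 1 where the sample belongs to that category
one_hot <- sapply(categories, function(category) as.integer(var1 == category))
colnames(one_hot) <- paste0("var1_", categories)

one_hot
```

Each row has exactly one 1, in the column for that sample’s category.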

### Continuous data

Now, looking at continuous variables:

```
# raw continuous dataset
cont_df <- data.frame(
outcome = c("normal", "normal", "cancer"),
var1 = c(1, 2, 3)
)
cont_df
#> outcome var1
#> 1 normal 1
#> 2 normal 2
#> 3 cancer 3
```

```
# preprocess raw continuous data
preprocess_data(dataset = cont_df, outcome_colname = "outcome")
#> Using 'outcome' as the outcome column.
#> $dat_transformed
#> # A tibble: 3 × 2
#> outcome var1
#> <chr> <dbl>
#> 1 normal -1
#> 2 normal 0
#> 3 cancer 1
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> character(0)
```

Wow! Why did the numbers change? This is because the default is to
normalize the data using `"center"` and `"scale"`.
While this is often best practice, you may not want to normalize the
data, or you may want to normalize the data in a different way. If you
don’t want to normalize the data, you can use `method=NULL`:

```
# preprocess raw continuous data, no normalization
preprocess_data(dataset = cont_df, outcome_colname = "outcome", method = NULL)
```

You can also normalize the data in different ways. You can choose any
method supported by the `method` argument of `caret::preProcess()`
(see the `caret::preProcess()` docs for details). Note that these
methods are only applied to continuous variables.
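For intuition about the default, `"center"` and `"scale"` amount to subtracting the mean and dividing by the standard deviation. A minimal base R sketch (done by hand rather than through `caret`) reproduces the values from the default output for `cont_df`:

```r
# toy copy of the continuous feature from above
var1 <- c(1, 2, 3)

# center: subtract the mean; scale: divide by the standard deviation
centered <- var1 - mean(var1)
scaled <- centered / sd(var1)

scaled # -1, 0, 1
```

Base R’s `scale()` performs the same two steps in one call.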

Another feature of `preprocess_data()` is that if you
provide continuous variables as characters, they will be converted to
numeric:

```
# raw continuous dataset as characters
cont_char_df <- data.frame(
outcome = c("normal", "normal", "cancer"),
var1 = c("1", "2", "3")
)
cont_char_df
#> outcome var1
#> 1 normal 1
#> 2 normal 2
#> 3 cancer 3
```

```
# preprocess raw continuous character data as numeric
preprocess_data(dataset = cont_char_df, outcome_colname = "outcome")
```
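One way to think about which character columns can be converted is to attempt the conversion and check whether any value fails. The helper below (`is_numeric_like()`, a hypothetical name) is a base R sketch of that check. It is an assumption about the general idea, not a description of how `preprocess_data()` decides:

```r
# hypothetical helper: TRUE if every non-missing value converts to a number
is_numeric_like <- function(x) {
  converted <- suppressWarnings(as.numeric(x))
  # a failed conversion shows up as an NA that was not NA before
  !any(is.na(converted) & !is.na(x))
}

is_numeric_like(c("1", "2", "3")) # TRUE
is_numeric_like(c("no", "yes", "no")) # FALSE
```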

If you don’t want this to happen, and you want character data to
remain character data even if it can be converted to numeric, you can
use `to_numeric=FALSE` and it will be kept as categorical:

```
# preprocess raw continuous character data as characters
preprocess_data(dataset = cont_char_df, outcome_colname = "outcome", to_numeric = FALSE)
#> Using 'outcome' as the outcome column.
#> $dat_transformed
#> # A tibble: 3 × 4
#> outcome var1_1 var1_2 var1_3
#> <chr> <dbl> <dbl> <dbl>
#> 1 normal 1 0 0
#> 2 normal 0 1 0
#> 3 cancer 0 0 1
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> character(0)
```

As you can see from this output, in this case the features are treated as groups rather than numbers (e.g. they are not normalized).

### Collapse perfectly correlated features

By default, `preprocess_data()` collapses features that
are perfectly positively or negatively correlated. This is because
having multiple copies of those features does not add information to
machine learning, and it makes `run_ml()` faster.

```
# raw correlated dataset
corr_df <- data.frame(
outcome = c("normal", "normal", "cancer"),
var1 = c("no", "yes", "no"),
var2 = c(0, 1, 0),
var3 = c(1, 0, 1)
)
corr_df
#> outcome var1 var2 var3
#> 1 normal no 0 1
#> 2 normal yes 1 0
#> 3 cancer no 0 1
```

```
# preprocess raw correlated dataset
preprocess_data(dataset = corr_df, outcome_colname = "outcome")
#> Using 'outcome' as the outcome column.
#> $dat_transformed
#> # A tibble: 3 × 2
#> outcome grp1
#> <chr> <dbl>
#> 1 normal 0
#> 2 normal 1
#> 3 cancer 0
#>
#> $grp_feats
#> $grp_feats$grp1
#> [1] "var1_yes" "var3_1"
#>
#>
#> $removed_feats
#> [1] "var2"
```

As you can see, we end up with only one variable (`grp1`) in place of
the original three. Also, the second element in the list is no longer
`NULL`. Instead, it tells you which features `grp1` contains (here
`var1_yes` and `var3_1`; `var2` is reported in `removed_feats`).
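To check for yourself that the raw features are perfectly correlated, you can compute the correlations directly. This is just a base R sanity check on the toy data, with `var1` recoded to 0/1 by hand:

```r
# toy copies of the features in corr_df
var1_yes <- as.integer(c("no", "yes", "no") == "yes")
var2 <- c(0, 1, 0)
var3 <- c(1, 0, 1)

pos_corr <- cor(var1_yes, var2) # perfectly positively correlated: 1
neg_corr <- cor(var2, var3) # perfectly negatively correlated: -1

c(pos_corr = pos_corr, neg_corr = neg_corr)
```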

If you want to group positively correlated features, but not
negatively correlated features (e.g. for interpretability, or another
downstream application), you can do that by using
`group_neg_corr=FALSE`:

```
# preprocess raw correlated dataset; don't group negatively correlated features
preprocess_data(dataset = corr_df, outcome_colname = "outcome", group_neg_corr = FALSE)
#> Using 'outcome' as the outcome column.
#> $dat_transformed
#> # A tibble: 3 × 3
#> outcome var1_yes var3_1
#> <chr> <dbl> <dbl>
#> 1 normal 0 1
#> 2 normal 1 0
#> 3 cancer 0 1
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> [1] "var2"
```

Here, `var3` is kept on its own because it’s negatively
correlated with `var1` and `var2`. You can also
choose to keep all features separate, even if they are perfectly
correlated, by using `collapse_corr_feats=FALSE`:

```
# preprocess raw correlated dataset; don't collapse correlated features
preprocess_data(dataset = corr_df, outcome_colname = "outcome", collapse_corr_feats = FALSE)
#> Using 'outcome' as the outcome column.
#> $dat_transformed
#> # A tibble: 3 × 3
#> outcome var1_yes var3_1
#> <chr> <dbl> <dbl>
#> 1 normal 0 1
#> 2 normal 1 0
#> 3 cancer 0 1
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> [1] "var2"
```

In this case, `grp_feats` will always be `NULL`.

### Data with near-zero variance

What if we have variables that are all zero, or all “no”? Those won’t contribute any information, so we remove them:

```
# raw dataset with non-variable features
nonvar_df <- data.frame(
outcome = c("normal", "normal", "cancer"),
var1 = c("no", "yes", "no"),
var2 = c(0, 1, 1),
var3 = c("no", "no", "no"),
var4 = c(0, 0, 0),
var5 = c(12, 12, 12)
)
nonvar_df
#> outcome var1 var2 var3 var4 var5
#> 1 normal no 0 no 0 12
#> 2 normal yes 1 no 0 12
#> 3 cancer no 1 no 0 12
```

Here, `var3`, `var4`, and `var5` all
have no variability, so these variables are removed during
preprocessing:

```
# remove features with near-zero variance
preprocess_data(dataset = nonvar_df, outcome_colname = "outcome")
#> Using 'outcome' as the outcome column.
#> $dat_transformed
#> # A tibble: 3 × 3
#> outcome var1_yes var2_1
#> <chr> <dbl> <dbl>
#> 1 normal 0 0
#> 2 normal 1 1
#> 3 cancer 0 1
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> [1] "var4" "var3" "var5"
```

You can read the `caret::preProcess()` documentation for
more information. By default, we remove features with “near-zero
variance” (`remove_var='nzv'`). This uses the default
arguments from `caret::nearZeroVar()`. However, particularly
with smaller datasets, you might not want to remove features with
near-zero variance. If you want to remove only features with zero
variance, you can use `remove_var='zv'`:

```
# remove features with zero variance
preprocess_data(dataset = nonvar_df, outcome_colname = "outcome", remove_var = "zv")
#> Using 'outcome' as the outcome column.
#> $dat_transformed
#> # A tibble: 3 × 3
#> outcome var1_yes var2_1
#> <chr> <dbl> <dbl>
#> 1 normal 0 0
#> 2 normal 1 1
#> 3 cancer 0 1
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> [1] "var4" "var3" "var5"
```

If you want to include all features, you can use the argument
`remove_var=NULL`. For this to work, you cannot collapse
correlated features (otherwise it errors out because of the underlying
`caret` function we use).

```
# don't remove features with near-zero or zero variance
preprocess_data(dataset = nonvar_df, outcome_colname = "outcome", remove_var = NULL, collapse_corr_feats = FALSE)
#> Using 'outcome' as the outcome column.
#> $dat_transformed
#> # A tibble: 3 × 5
#> outcome var1_yes var2_1 var3 var5
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 normal 0 0 0 12
#> 2 normal 1 1 0 12
#> 3 cancer 0 1 0 12
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> [1] "var4"
```

If you want to be more nuanced in how you remove near-zero variance
features (e.g. change the default 10% cutoff for the percentage of
distinct values out of the total number of samples), you can use the
`caret::preProcess()` function after running `preprocess_data()`
with `remove_var=NULL` (see the `caret::nearZeroVar()`
function for more information).
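As we understand the `caret::nearZeroVar()` documentation, it looks at two statistics per feature: the ratio of the most common value’s frequency to the second most common value’s frequency, and the percentage of distinct values out of the number of samples. The helper below (`nzv_stats()`, a hypothetical name) is a base R sketch of those two statistics so you can see what the cutoffs act on. It is not `caret`’s implementation:

```r
# hypothetical helper: the two statistics a near-zero variance filter looks at
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # ratio of most common to second most common value (undefined with one value)
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else NA_real_
  # percentage of distinct values out of the number of samples
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

# a feature that is almost always zero: high frequency ratio, few distinct values
mostly_zero <- c(rep(0, 99), 1)
stats <- nzv_stats(mostly_zero)
stats
```

With only three samples, as in the toy datasets here, the percentage of distinct values cannot drop below 33%, which would be consistent with `'nzv'` and `'zv'` removing the same features above.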

### Missing data

`preprocess_data()` also deals with missing data. It:

- Removes samples with missing outcome values.
- Maintains zero variability in a feature if it already has no variability (i.e. the feature is removed if removing features with near-zero variance).
- Replaces missing binary and categorical variables with zero (after splitting into multiple columns).
- Replaces missing continuous data with the median value of that feature.

If you’d like to deal with missing data in a different way, please do
that prior to inputting the data to `preprocess_data()`.
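For example, if you preferred mean imputation for a continuous feature, you could fill in the missing values yourself first. This is a minimal base R sketch on a made-up data frame (`raw_df` is hypothetical and not one of the datasets in this vignette):

```r
# hypothetical dataset with one missing continuous value
raw_df <- data.frame(
  outcome = c("normal", "normal", "cancer", "normal"),
  var1 = c(1, 2, 6, NA)
)

# replace missing values with the mean of the observed values
raw_df$var1[is.na(raw_df$var1)] <- mean(raw_df$var1, na.rm = TRUE)

raw_df
```

Because `var1` no longer contains missing values, there is nothing left in that column for the median imputation step to fill in.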

#### Remove missing outcome variables

```
# raw dataset with missing outcome value
miss_oc_df <- data.frame(
outcome = c("normal", "normal", "cancer", NA),
var1 = c("no", "yes", "no", "no"),
var2 = c(0, 1, 1, 1)
)
miss_oc_df
#> outcome var1 var2
#> 1 normal no 0
#> 2 normal yes 1
#> 3 cancer no 1
#> 4 <NA> no 1
```

```
# preprocess raw dataset with missing outcome value
preprocess_data(dataset = miss_oc_df, outcome_colname = "outcome")
#> Using 'outcome' as the outcome column.
#> Removed 1/4 (25%) of samples because of missing outcome value (NA).
#> $dat_transformed
#> # A tibble: 3 × 3
#> outcome var1_yes var2_1
#> <chr> <dbl> <dbl>
#> 1 normal 0 0
#> 2 normal 1 1
#> 3 cancer 0 1
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> character(0)
```
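The row filtering itself is simple. A base R sketch of the equivalent step, shown only for illustration, looks like this:

```r
# toy copy of the dataset with a missing outcome value
miss_oc_df <- data.frame(
  outcome = c("normal", "normal", "cancer", NA),
  var1 = c("no", "yes", "no", "no"),
  var2 = c(0, 1, 1, 1)
)

# keep only the samples whose outcome is not missing
kept_df <- miss_oc_df[!is.na(miss_oc_df$outcome), ]

nrow(kept_df) # 3 of the 4 samples remain
```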

#### Maintain zero variability in a feature if it already has no variability

```
# raw dataset with missing value in non-variable feature
miss_nonvar_df <- data.frame(
outcome = c("normal", "normal", "cancer"),
var1 = c("no", "yes", "no"),
var2 = c(NA, 1, 1)
)
miss_nonvar_df
#> outcome var1 var2
#> 1 normal no NA
#> 2 normal yes 1
#> 3 cancer no 1
```

```
# preprocess raw dataset with missing value in non-variable feature
preprocess_data(dataset = miss_nonvar_df, outcome_colname = "outcome")
#> Using 'outcome' as the outcome column.
#> There are 1 missing value(s) in features with no variation. Missing values were replaced with the non-varying value.
#> $dat_transformed
#> # A tibble: 3 × 2
#> outcome var1_yes
#> <chr> <dbl>
#> 1 normal 0
#> 2 normal 1
#> 3 cancer 0
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> [1] "var2"
```

Here, the non-variable feature with missing data is removed because we removed features with near-zero variance. If we maintained that feature, it’d be all ones:

```
# preprocess raw dataset with missing value in non-variable feature
preprocess_data(dataset = miss_nonvar_df, outcome_colname = "outcome", remove_var = NULL, collapse_corr_feats = FALSE)
#> Using 'outcome' as the outcome column.
#> There are 1 missing value(s) in features with no variation. Missing values were replaced with the non-varying value.
#> $dat_transformed
#> # A tibble: 3 × 3
#> outcome var1_yes var2
#> <chr> <dbl> <dbl>
#> 1 normal 0 1
#> 2 normal 1 1
#> 3 cancer 0 1
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> character(0)
```

#### Replace missing binary and categorical variables with zero

```
# raw dataset with missing value in categorical feature
miss_cat_df <- data.frame(
outcome = c("normal", "normal", "cancer"),
var1 = c("no", "yes", NA),
var2 = c(NA, 1, 0)
)
miss_cat_df
#> outcome var1 var2
#> 1 normal no NA
#> 2 normal yes 1
#> 3 cancer <NA> 0
```

```
# preprocess raw dataset with missing value in categorical feature
preprocess_data(dataset = miss_cat_df, outcome_colname = "outcome")
#> Using 'outcome' as the outcome column.
#> 2 categorical missing value(s) (NA) were replaced with 0. Note that the matrix is not full rank so missing values may be duplicated in separate columns.
#> $dat_transformed
#> # A tibble: 3 × 3
#> outcome var1_no var1_yes
#> <chr> <dbl> <dbl>
#> 1 normal 1 0
#> 2 normal 0 1
#> 3 cancer 0 0
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> [1] "var2"
```

Here each binary variable is split into two, and the missing value is considered zero for both of them.
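A base R sketch of that behavior for `var1`, assuming we build the two indicator columns by hand (again, this is an illustration rather than the package’s internal code):

```r
# toy copy of the binary feature with a missing value
var1 <- c("no", "yes", NA)

# one indicator per observed value; a missing value is 0 in both columns
var1_no <- as.integer(!is.na(var1) & var1 == "no")
var1_yes <- as.integer(!is.na(var1) & var1 == "yes")

cbind(var1_no, var1_yes)
```

The third row is 0 in both columns, matching the preprocessed output above.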

#### Replace missing continuous data with the median value of that feature

```
# raw dataset with missing value in continuous feature
miss_cont_df <- data.frame(
outcome = c("normal", "normal", "cancer", "normal"),
var1 = c(1, 2, 2, NA),
var2 = c(1, 2, 3, NA)
)
miss_cont_df
#> outcome var1 var2
#> 1 normal 1 1
#> 2 normal 2 2
#> 3 cancer 2 3
#> 4 normal NA NA
```

Here we’re not normalizing continuous features so it’s easier to see what’s going on (i.e. the median value is used):

```
# preprocess raw dataset with missing value in continuous feature
preprocess_data(dataset = miss_cont_df, outcome_colname = "outcome", method = NULL)
#> Using 'outcome' as the outcome column.
#> 2 missing continuous value(s) were imputed using the median value of the feature.
#> $dat_transformed
#> # A tibble: 4 × 3
#> outcome var1 var2
#> <chr> <dbl> <dbl>
#> 1 normal 1 1
#> 2 normal 2 2
#> 3 cancer 2 3
#> 4 normal 2 2
#>
#> $grp_feats
#> NULL
#>
#> $removed_feats
#> character(0)
```
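The imputed values in row 4 can be reproduced by hand. Here is a minimal base R sketch, where `impute_median()` is a hypothetical helper and not a `mikropml` function:

```r
# hypothetical helper: replace missing values with the median of observed values
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

var1_imputed <- impute_median(c(1, 2, 2, NA)) # median of 1, 2, 2 is 2
var2_imputed <- impute_median(c(1, 2, 3, NA)) # median of 1, 2, 3 is 2

data.frame(var1 = var1_imputed, var2 = var2_imputed)
```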

### Putting it all together

Here’s some more complicated example raw data that puts everything we discussed together:

```
test_df <- data.frame(
outcome = c("normal", "normal", "cancer", NA),
var1 = 1:4,
var2 = c("a", "b", "c", "d"),
var3 = c("no", "yes", "no", "no"),
var4 = c(0, 1, 0, 0),
var5 = c(0, 0, 0, 0),
var6 = c("no", "no", "no", "no"),
var7 = c(1, 1, 0, 0),
var8 = c(5, 6, NA, 7),
var9 = c(NA, "x", "y", "z"),
var10 = c(1, 0, NA, NA),
var11 = c(1, 1, NA, NA),
var12 = c("1", "2", "3", "4")
)
test_df
#> outcome var1 var2 var3 var4 var5 var6 var7 var8 var9 var10 var11 var12
#> 1 normal 1 a no 0 0 no 1 5 <NA> 1 1 1
#> 2 normal 2 b yes 1 0 no 1 6 x 0 1 2
#> 3 cancer 3 c no 0 0 no 0 NA y NA NA 3
#> 4 <NA> 4 d no 0 0 no 0 7 z NA NA 4
```

Let’s throw this into the preprocessing function with the default values:

```
preprocess_data(dataset = test_df, outcome_colname = "outcome")
#> Using 'outcome' as the outcome column.
#> Removed 1/4 (25%) of samples because of missing outcome value (NA).
#> There are 1 missing value(s) in features with no variation. Missing values were replaced with the non-varying value.
#> 2 categorical missing value(s) (NA) were replaced with 0. Note that the matrix is not full rank so missing values may be duplicated in separate columns.
#> 1 missing continuous value(s) were imputed using the median value of the feature.
#> $dat_transformed
#> # A tibble: 3 × 6
#> outcome grp1 var2_a grp2 grp3 var8
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 normal -1 1 0 0 -0.707
#> 2 normal 0 0 1 0 0.707
#> 3 cancer 1 0 0 1 0
#>
#> $grp_feats
#> $grp_feats$grp1
#> [1] "var1" "var12"
#>
#> $grp_feats$var2_a
#> [1] "var2_a"
#>
#> $grp_feats$grp2
#> [1] "var2_b" "var3_yes" "var9_x"
#>
#> $grp_feats$grp3
#> [1] "var2_c" "var7_1" "var9_y"
#>
#> $grp_feats$var8
#> [1] "var8"
#>
#>
#> $removed_feats
#> [1] "var4" "var5" "var10" "var6" "var11"
```

As you can see, we got several messages:

- One of the samples (row 4) was removed because the outcome value was missing.
- One of the values in a feature with no variation was missing and was replaced with the non-varying value (`var11`).
- Two categorical missing values were replaced with zero (`var9`). There are 2 missing rather than just 1 (like in the raw data) because we split the categorical variable into separate columns first.
- One missing continuous value was imputed using the median value of that feature (`var8`).

Additionally, you can see that the continuous variables were
normalized, the categorical variables were all changed to binary, and
several features were grouped together. The variables in each group can
be found in `grp_feats`.
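Since `grp_feats` is a named list, you can look up which group a feature ended up in. The sketch below copies the `grp_feats` element from the output above into a plain list and uses a hypothetical helper, `find_group()`, to do the lookup:

```r
# grp_feats copied by hand from the output above
grp_feats <- list(
  grp1 = c("var1", "var12"),
  var2_a = "var2_a",
  grp2 = c("var2_b", "var3_yes", "var9_x"),
  grp3 = c("var2_c", "var7_1", "var9_y"),
  var8 = "var8"
)

# hypothetical helper: name(s) of the group(s) that contain a feature
find_group <- function(feature, groups) {
  is_member <- vapply(groups, function(members) feature %in% members, logical(1))
  names(groups)[is_member]
}

find_group("var9_x", grp_feats) # "grp2"
```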

### Next step: train and evaluate your model!

After you preprocess your data (either using `preprocess_data()`
or by preprocessing the data on your own), you’re ready to train and
evaluate machine learning models! Please see `run_ml()` for
information about training models.

Tang et al. 2020. *J Am Med Inform Assoc*, October. https://doi.org/10.1093/jamia/ocaa139.