This is the abstract base class for resampling objects like ResamplingCV and ResamplingBootstrap.

The objects of this class define how a task is partitioned for resampling (e.g., in resample() or benchmark()), using a set of hyperparameters such as the number of folds in cross-validation.

Resampling objects can be instantiated on a Task. Instantiation applies the strategy to the task and results in a fixed partition of the Task's row_ids.

Predefined resamplings are stored in the mlr3misc::Dictionary mlr_resamplings, e.g. cv or bootstrap.
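
As a brief illustration of the dictionary lookup, the following sketch retrieves a predefined resampling by its key. It assumes the mlr3 package is installed and attached; rsmp() is the shorthand that also appears in the Examples section.

```r
library(mlr3)

# List the keys of all registered resampling strategies
keys = mlr_resamplings$keys()
print(keys)

# Retrieve a strategy by key; both calls yield an uninstantiated object
r1 = mlr_resamplings$get("cv")
r2 = rsmp("cv")

class(r1)[1]
r1$id
```

Either way, the returned object still has to be instantiated on a Task before train_set() and test_set() can be queried.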

Format

R6::R6Class object.

Construction

Note: This object is typically constructed via a derived class, e.g. ResamplingCV or ResamplingHoldout.

r = Resampling$new(id, param_set, duplicated_ids = FALSE, man = NA_character_)
  • id :: character(1)
    Identifier for the resampling strategy.

  • param_set :: paradox::ParamSet
    Set of hyperparameters.

  • duplicated_ids :: logical(1)
    Set to TRUE if this resampling strategy may have duplicated row ids in a single training set or test set.

  • man :: character(1)
    String in the format [pkg]::[topic] pointing to a manual page for this object.

Fields

All variables passed to the constructor, and additionally:

  • iters :: integer(1)
    Returns the number of resampling iterations, depending on the values stored in the param_set.

  • instance :: any
    During instantiate(), the instance is stored in this slot. The instance can be in any arbitrary format.

  • is_instantiated :: logical(1)
    Is TRUE if the resampling has been instantiated.

  • task_hash :: character(1)
    The hash of the Task which was passed to r$instantiate().

  • task_nrow :: integer(1)
    The number of observations of the Task which was passed to r$instantiate().

  • hash :: character(1)
    Hash (unique identifier) for this object.

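  • duplicated_ids :: logical(1)
    Is TRUE if this resampling strategy may have duplicated row ids in a single training set or test set. E.g., this is TRUE for bootstrap and FALSE for cross-validation. Only used internally.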
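
The fields can be inspected before and after instantiation. The following is a minimal sketch, assuming mlr3 is attached and that the "cv" resampling exposes a folds hyperparameter (its 10-fold default is mentioned in the Examples section).

```r
library(mlr3)

r = rsmp("cv")
r$param_set$values = list(folds = 5)

iters = r$iters                 # number of iterations implied by the hyperparameters
before = r$is_instantiated      # no task has been supplied yet

task = tsk("iris")
r$instantiate(task)

after = r$is_instantiated       # set once instantiate() has run
nrow_seen = r$task_nrow         # number of observations of the task
hash_seen = r$task_hash         # hash of the task used for instantiation

print(list(iters = iters, before = before, after = after, task_nrow = nrow_seen))
```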

Methods

  • instantiate(task)
    Task -> self
    Materializes fixed training and test splits for a given task and stores them in r$instance.

  • train_set(i)
    integer(1) -> (integer() | character())
    Returns the row ids of the i-th training set.

  • test_set(i)
    integer(1) -> (integer() | character())
    Returns the row ids of the i-th test set.

  • help()
    () -> NULL
    Opens the corresponding help page referenced by $man.
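
The methods above can be combined as in the following sketch, which assumes mlr3 is attached. It instantiates a 3-fold cross-validation on the iris task and checks two properties one would expect from cross-validation: the test sets partition the row ids, and the training and test set of one iteration do not overlap.

```r
library(mlr3)

task = tsk("iris")
r = rsmp("cv")
r$param_set$values = list(folds = 3)
r$instantiate(task)  # returns the object itself, so calls can be chained

# Collect the test sets of all iterations
test_sets = lapply(seq_len(r$iters), function(i) r$test_set(i))
all_test_ids = unlist(test_sets)

# Each row id should appear in exactly one test set
no_duplicates = anyDuplicated(all_test_ids) == 0
covers_task = setequal(all_test_ids, task$row_ids)

# Training and test set of the same iteration should be disjoint
overlap = intersect(r$train_set(1), r$test_set(1))

print(c(no_duplicates = no_duplicates, covers_task = covers_task))
print(overlap)
```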

Stratification

All derived classes support stratified sampling. The stratification variables are assumed to be discrete and must be stored in the Task with column role "stratum". In case of multiple stratification variables, each combination of the values of the stratification variables forms a stratum.

First, the observations are divided into subpopulations based on one or multiple stratification variables (assumed to be discrete), cf. task$strata.

Second, the sampling is performed in each of the k subpopulations separately. Each subgroup is divided into iters training sets and iters test sets by the derived Resampling. These sets are merged based on their iteration number: all training sets from all subpopulations with iteration 1 are combined, then all training sets with iteration 2, and so on. The same is done for all test sets. The merged sets can be accessed via $train_set(i) and $test_set(i), respectively.
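
The two steps above can be sketched in base R. This is only an illustration of the merge-by-iteration idea, not the package's internal implementation: the toy data and the helper split_stratum() are hypothetical, and a simple holdout-style split stands in for the derived Resampling.

```r
# Hypothetical toy data: 10 row ids in two strata ("a" has 6 rows, "b" has 4)
row_ids = 1:10
strata = split(row_ids, c(rep("a", 6), rep("b", 4)))

# Hypothetical stand-in for a derived Resampling: a holdout-style split,
# repeated `iters` times, returning one train/test pair per iteration
split_stratum = function(ids, iters, ratio = 0.5) {
  lapply(seq_len(iters), function(i) {
    train = ids[sample.int(length(ids), floor(ratio * length(ids)))]
    list(train = train, test = setdiff(ids, train))
  })
}

set.seed(1)
iters = 3

# Step 1 and 2: sample within each stratum separately
per_stratum = lapply(strata, split_stratum, iters = iters)

# Merge by iteration number across all strata
train_sets = lapply(seq_len(iters), function(i) {
  unlist(lapply(per_stratum, function(s) s[[i]]$train), use.names = FALSE)
})
test_sets = lapply(seq_len(iters), function(i) {
  unlist(lapply(per_stratum, function(s) s[[i]]$test), use.names = FALSE)
})

print(train_sets[[1]])
print(test_sets[[1]])
```

Because each stratum is split with the same ratio, every merged training set keeps the stratum proportions of the full data (here 3 rows from "a" and 2 rows from "b").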

Grouping / Blocking

All derived classes support grouping of observations. The grouping variable is assumed to be discrete and must be stored in the Task with column role "group".

Observations in the same group are treated like a "block" of observations which must be kept together. These observations either all go together into the training set or together into the test set.

The sampling is performed by the derived Resampling on the grouping variable. Next, the grouping information is replaced with the respective row ids to generate training and test sets. The sets can be accessed via $train_set(i) and $test_set(i), respectively.
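
A minimal sketch of grouped resampling follows, assuming mlr3 is attached. The toy data, the task id "toy", and the column name grp are hypothetical. The group role is assigned through task$col_roles, analogous to how the stratum role is set in the Examples section, and the column is removed from the features by hand so that the sketch does not rely on any convenience setter.

```r
library(mlr3)

# Hypothetical toy data: 20 rows in 5 groups of 4 rows each
set.seed(1)
data = data.frame(
  x = rnorm(20),
  y = factor(rep(c("a", "b"), 10)),
  grp = rep(1:5, each = 4)
)
task = TaskClassif$new("toy", backend = data, target = "y")

# Assign the "group" column role and drop the column from the features
task$col_roles$group = "grp"
task$col_roles$feature = setdiff(task$col_roles$feature, "grp")

r = rsmp("cv")
r$param_set$values = list(folds = 5)
r$instantiate(task)

# Groups present in the first training set and the first test set
train_groups = unique(data$grp[r$train_set(1)])
test_groups = unique(data$grp[r$test_set(1)])

# No group should appear on both sides of the split
print(intersect(train_groups, test_groups))
```

The number of folds is kept at or below the number of groups, since the sampling operates on the groups rather than on individual rows.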

See also

Dictionary of Resamplings: mlr_resamplings

as.data.table(mlr_resamplings) for a complete table of all (also dynamically created) Resampling implementations.

Other Resampling: mlr_resamplings

Examples

r = rsmp("subsampling")

# Default parametrization
r$param_set$values
#> $repeats
#> [1] 30
#>
#> $ratio
#> [1] 0.6666667
#>

# Do only 3 repeats on 10% of the data
r$param_set$values = list(ratio = 0.1, repeats = 3)
r$param_set$values
#> $ratio
#> [1] 0.1
#>
#> $repeats
#> [1] 3
#>

# Instantiate on iris task
task = tsk("iris")
r$instantiate(task)

# Extract train/test sets
train_set = r$train_set(1)
print(train_set)
#> [1] 126 97 100 142 81 125 25 88 143 122 40 133 89 118 112
intersect(train_set, r$test_set(1))
#> integer(0)

# Another example: 10-fold CV
r = rsmp("cv")$instantiate(task)
r$train_set(1)
#> [1] 3 34 38 48 49 80 83 97 110 112 116 122 128 133 141 24 26 28
#> [19] 62 68 71 75 76 78 85 94 101 126 127 132 12 15 16 18 42 63
#> [37] 66 79 82 95 105 119 123 138 143 5 30 33 55 73 74 92 102 104
#> [55] 113 125 129 131 134 137 2 4 14 23 53 69 70 72 87 88 107 111
#> [73] 135 145 148 9 10 21 32 35 36 40 52 58 64 65 90 106 121 136
#> [91] 17 25 45 46 50 59 84 86 93 99 109 114 118 130 146 1 7 20
#> [109] 22 27 39 47 60 61 91 96 103 139 142 150 13 19 31 37 43 44
#> [127] 51 54 57 77 81 140 144 147 149

# Stratification
task = tsk("pima")
prop.table(table(task$truth())) # moderately unbalanced
#>
#>       pos       neg
#> 0.3489583 0.6510417

task$col_roles$stratum = task$target_names
r = rsmp("subsampling")
r$instantiate(task)
prop.table(table(task$truth(r$train_set(1)))) # roughly same proportion
#>
#>       pos       neg
#> 0.3496094 0.6503906