Resampling Class

This is the abstract base class for resampling objects like ResamplingCV and ResamplingBootstrap.

The objects of this class define how a task is partitioned for resampling (e.g., in resample() or benchmark()), using a set of hyperparameters such as the number of folds in cross-validation.

Resampling objects can be instantiated on a Task. This applies the strategy to the task and results in a fixed partition of the Task's row_ids.

Predefined resamplings are stored in the dictionary mlr_resamplings, e.g. cv or bootstrap.

Stochasticity & Reproducibility

The Resampling class only defines an abstract resampling strategy. Concrete data splits are obtained by calling $instantiate() on a Task. To ensure reproducibility of results, call set.seed() before doing so. Note that benchmark_grid() instantiates resamplings internally, so you need to set the seed before calling it.
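As a minimal sketch of the point above, the following compares holdout splits instantiated with and without resetting the seed. The seed value and the choice of the penguins task are arbitrary and only serve as an illustration.

```r
library(mlr3)

task = tsk("penguins")

# Same seed before each instantiation: the splits are identical
set.seed(123)
r1 = rsmp("holdout")$instantiate(task)
set.seed(123)
r2 = rsmp("holdout")$instantiate(task)
same_split = identical(r1$train_set(1), r2$train_set(1))
print(same_split)

# Seed not reset before this instantiation: the split will generally differ
r3 = rsmp("holdout")$instantiate(task)
print(identical(r1$train_set(1), r3$train_set(1)))
```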

Stratification

All derived classes support stratified sampling. The stratification variables are assumed to be discrete and must be stored in the Task with column role "stratum". In case of multiple stratification variables, each combination of the values of the stratification variables forms a stratum.

First, the observations are divided into subpopulations based on one or more stratification variables (assumed to be discrete), cf. task$strata.

Second, the sampling is performed in each of the k subpopulations separately. Each subgroup is divided into iter training sets and iter test sets by the derived Resampling. These sets are merged based on their iteration number: all training sets from all subpopulations with iteration 1 are combined, then all training sets with iteration 2, and so on. The same is done for all test sets. The merged sets can be accessed via $train_set(i) and $test_set(i), respectively. Note that this procedure can lead to set sizes that are slightly different from those without stratification.
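The two steps can be illustrated in plain base R. This is only a sketch of the idea for a single holdout-style iteration, not the package's internal implementation; the toy row ids, the two strata, the ratio and the rounding rule are all assumptions made for the example.

```r
# Toy data (assumed): 10 row ids in two strata of unequal size
row_ids = 1:10
stratum = c(rep("a", 7), rep("b", 3))
ratio = 2 / 3

set.seed(1)

# Step 1: divide the row ids into subpopulations
subpops = split(row_ids, stratum)

# Step 2: split each subpopulation separately
per_stratum = lapply(subpops, function(ids) {
  n_train = round(length(ids) * ratio)
  train = ids[sample.int(length(ids), n_train)]
  list(train = train, test = setdiff(ids, train))
})

# Merge the per-stratum sets that belong to the same iteration
train_set = unlist(lapply(per_stratum, `[[`, "train"), use.names = FALSE)
test_set = unlist(lapply(per_stratum, `[[`, "test"), use.names = FALSE)

# Each stratum contributes to the training set in proportion to its size
table(stratum[train_set])
```

Because the number of training rows is rounded within each stratum, the merged set size can deviate slightly from what a single unstratified split would produce.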

Grouping / Blocking

All derived classes support grouping of observations. The grouping variable is assumed to be discrete and must be stored in the Task with column role "group".

Observations in the same group are treated like a "block" of observations which must be kept together. These observations either all go together into the training set or together into the test set.

The sampling is performed by the derived Resampling on the grouping variable. Next, the grouping information is replaced with the respective row ids to generate training and test sets. The sets can be accessed via $train_set(i) and $test_set(i), respectively.
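The following sketch checks that no group is split across the training and test set. The toy data, the column name grp and the 2-fold CV are assumptions chosen for illustration; the group role is assigned here with task$set_col_roles().

```r
library(mlr3)

# Toy data (assumed): 12 rows in 4 groups of 3 observations each
data = data.frame(
  y = factor(rep(c("a", "b"), times = 6)),
  x = seq_len(12),
  grp = rep(1:4, each = 3)
)
task = as_task_classif(data, target = "y")

# Use "grp" exclusively as the grouping variable (removes it from the features)
task$set_col_roles("grp", roles = "group")

set.seed(1)
r = rsmp("cv", folds = 2)$instantiate(task)

# Map the row ids of the first iteration back to their groups
groups = task$groups
train_groups = unique(groups$group[groups$row_id %in% r$train_set(1)])
test_groups = unique(groups$group[groups$row_id %in% r$test_set(1)])

# No group appears on both sides of the split
intersect(train_groups, test_groups)
```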

Inheriting

It is possible to overwrite either private$.get_instance() to have full control, or only private$.sample() to reuse the predefined mechanism for stratification and grouping.

Public fields

label

(character(1))
Label for this object. Can be used in tables, plots and text output instead of the ID.

param_set

(paradox::ParamSet)
Set of hyperparameters.

instance

(any)
During instantiate(), the instance is stored in this slot in an arbitrary format. Note that if a grouping variable is present in the Task, a Resampling may operate on the group ids internally instead of the row ids (which may lead to confusion).

It is advised to not work directly with the instance, but instead only use the getters $train_set() and $test_set().

task_hash

(character(1))
The hash of the Task which was passed to r$instantiate().

task_row_hash

(character(1))
The hash of the row ids of the Task which was passed to r$instantiate().

task_nrow

(integer(1))
The number of observations of the Task which was passed to r$instantiate().

duplicated_ids

(logical(1))
If TRUE, duplicated rows can occur within a single training set or within a single test set. E.g., this is TRUE for Bootstrap, and FALSE for cross-validation. Only used internally.

man

(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. Defaults to NA, but can be set by child classes.

Active bindings

id: (character(1))
Identifier of the object. Used in tables, plots and text output.
is_instantiated: (logical(1))
Is TRUE if the resampling has been instantiated.
hash: (character(1))
Hash (unique identifier) for this object. If the object has not been instantiated yet, NA_character_ is returned. The hash is calculated based on the class name, the id, the parameter set, and the instance.
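A short sketch of how these bindings and the task-related fields change once a resampling is instantiated. The 3-fold CV on the penguins task is an arbitrary choice for illustration.

```r
library(mlr3)

task = tsk("penguins")
r = rsmp("cv", folds = 3)

# Not instantiated yet: only the abstract strategy is defined
before = r$is_instantiated
print(before)

r$instantiate(task)

# After instantiation, the hash and the task metadata are available
after = r$is_instantiated
print(after)
print(r$hash)
print(r$task_nrow)
```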

Methods

Method `new()`

Creates a new instance of this R6 class.

Usage

Resampling$new(
  id,
  param_set = ps(),
  duplicated_ids = FALSE,
  label = NA_character_,
  man = NA_character_
)

Arguments

id

(character(1))
Identifier for the new instance.

param_set

(paradox::ParamSet)
Set of hyperparameters.

duplicated_ids

(logical(1))
Set to TRUE if this resampling strategy may have duplicated row ids in a single training set or test set.

Note that this object is typically constructed via a derived class, e.g. ResamplingCV or ResamplingHoldout.

label

(character(1))
Label for the new instance.

man

(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. The referenced help page can be opened via method $help().

Method `format()`

Helper for print outputs.

Usage

Resampling$format(...)

Arguments

...: (ignored).

Method `print()`

Printer.

Usage

Resampling$print(...)

Arguments

...: (ignored).

Method `help()`

Opens the corresponding help page referenced by field $man.

Usage

Resampling$help()

Method `instantiate()`

Materializes fixed training and test splits for a given task and stores them in r$instance in an arbitrary format.

Usage

Resampling$instantiate(task)

Arguments

task: (Task)
Task used for instantiation.

Returns

Returns the object itself, but modified by reference. You need to explicitly $clone() the object beforehand if you want to keep the object in its previous state.

Examples

task = tsk("penguins")
resampling = rsmp("holdout")
resampling$instantiate(task)
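Because $instantiate() modifies the object by reference, re-instantiating overwrites the stored split. The sketch below, with arbitrary seeds, keeps the earlier split by cloning before the second call.

```r
library(mlr3)

task = tsk("penguins")

set.seed(1)
r = rsmp("holdout")$instantiate(task)
train_before = r$train_set(1)

# Preserve the current state in a clone, then re-instantiate the original
r_old = r$clone(deep = TRUE)
set.seed(2)
r$instantiate(task)

# The clone still holds the first split
kept = identical(r_old$train_set(1), train_before)
print(kept)
```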

Method `train_set()`

Returns the row ids of the i-th training set.

Usage

Resampling$train_set(i)

Arguments

i: (integer(1))
Iteration.

Returns

(integer()) of row ids.

Examples

task = tsk("penguins")
resampling = rsmp("holdout")$instantiate(task)
resampling$train_set(1)

Method `test_set()`

Returns the row ids of the i-th test set.

Usage

Resampling$test_set(i)

Arguments

i: (integer(1))
Iteration.

Returns

(integer()) of row ids.

Examples

task = tsk("penguins")
resampling = rsmp("holdout")$instantiate(task)
resampling$test_set(1)
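As a further sketch, the getters can be used to verify basic properties of a split. For cross-validation (assumed here with 5 folds on the penguins task), the test sets of all iterations should be disjoint and together cover every row id, and the training and test set of one iteration should not overlap.

```r
library(mlr3)

task = tsk("penguins")
folds = 5L

set.seed(42)
r = rsmp("cv", folds = folds)$instantiate(task)

# Collect the test sets of all iterations
test_ids = unlist(lapply(seq_len(folds), function(i) r$test_set(i)))

# Every row id is used for testing exactly once
print(setequal(test_ids, task$row_ids))
print(anyDuplicated(test_ids) == 0)

# Within one iteration, training and test set do not overlap
overlap = vapply(seq_len(folds), function(i) {
  length(intersect(r$train_set(i), r$test_set(i)))
}, integer(1))
print(overlap)
```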

Method `clone()`

The objects of this class are cloneable with this method.

Usage

Resampling$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

Examples

r = rsmp("subsampling")

# Default parametrization
r$param_set$values
#> $ratio
#> [1] 0.6666667
#> 
#> $repeats
#> [1] 30
#> 

# Do only 3 repeats on 10% of the data
r$param_set$set_values(ratio = 0.1, repeats = 3)
r$param_set$values
#> $ratio
#> [1] 0.1
#> 
#> $repeats
#> [1] 3
#> 

# Instantiate on penguins task
task = tsk("penguins")
r$instantiate(task)

# Extract train/test sets
train_set = r$train_set(1)
print(train_set)
#>  [1] 228 166 340  60  17 192 238  59 154  81 241  32 218 327 325 113 157 277  48
#> [20] 334 186 215  11  87 150  92 273  34  97 291   4 246 118  41
intersect(train_set, r$test_set(1))
#> integer(0)

# Another example: 10-fold CV
r = rsmp("cv")$instantiate(task)
r$train_set(1)
#>   [1]   4   7   9  36  37  40  64  79  80  89  95 117 124 137 140 153 192 207
#>  [19] 209 210 232 235 243 246 260 266 268 280 306 309 317 325 335 339 340  22
#>  [37]  25  32  48  50  84  86  98 119 128 130 134 136 156 175 180 193 199 205
#>  [55] 217 224 241 247 256 277 283 287 288 293 297 304 305 334 343 344  11  18
#>  [73]  27  35  41  54  66  73  97  99 105 123 127 132 139 158 168 173 176 194
#>  [91] 201 220 237 240 261 269 271 285 289 295 296 300 308 312 333   3  20  23
#> [109]  28  42  46  51  62  71  94 116 147 152 154 159 161 162 170 174 178 191
#> [127] 206 212 214 216 244 251 267 272 276 294 314 316 319   2   8  14  24  31
#> [145]  45  49  52  53  59  60  74  90  92 106 133 138 142 144 164 181 182 189
#> [163] 200 223 228 238 242 245 275 310 311 327 337   6  30  55  69  81  93 107
#> [181] 112 118 131 143 151 171 179 195 202 204 215 219 221 225 226 227 234 249
#> [199] 257 282 286 292 301 323 329 332 341  12  17  19  43  44  56  65  68  75
#> [217]  78  82  83 109 122 135 146 148 157 163 197 233 236 239 250 263 284 299
#> [235] 302 321 322 330 331 336 338   1  10  13  39  58  70  72  77  88  96 104
#> [253] 129 141 149 165 166 169 183 184 187 196 229 230 231 252 254 255 262 278
#> [271] 281 290 315 318 320  15  29  33  34  47  57  67  85  91 101 102 103 108
#> [289] 110 121 126 167 172 177 186 190 198 208 213 218 248 253 258 273 291 298
#> [307] 307 313 326

# Stratification
task = tsk("pima")
prop.table(table(task$truth())) # moderately unbalanced
#> 
#>       pos       neg 
#> 0.3489583 0.6510417 
task$col_roles$stratum = task$target_names

r = rsmp("subsampling")
r$instantiate(task)
prop.table(table(task$truth(r$train_set(1)))) # roughly same proportion
#> 
#>       pos       neg 
#> 0.3496094 0.6503906 

## ------------------------------------------------
## Method `Resampling$train_set`
## ------------------------------------------------

task = tsk("penguins")
resampling = rsmp("holdout")$instantiate(task)
resampling$train_set(1)
#>   [1]   2   5   6   7   8   9  10  11  12  13  14  16  17  18  19  20  21  23
#>  [19]  24  25  26  27  30  33  34  35  37  38  39  40  41  42  43  45  46  47
#>  [37]  49  50  52  53  54  55  56  57  58  59  63  64  65  66  67  68  69  70
#>  [55]  71  75  76  78  79  80  82  83  84  85  86  87  88  90  91  92  93  94
#>  [73]  95  96  97  98  99 100 101 102 103 109 111 112 115 116 118 120 123 124
#>  [91] 126 128 129 130 131 132 136 138 140 141 143 144 145 146 147 148 150 152
#> [109] 153 156 157 158 159 160 161 163 164 165 166 167 170 175 176 178 179 180
#> [127] 182 183 184 185 187 188 189 191 192 193 195 196 197 198 200 201 202 203
#> [145] 206 207 208 209 210 211 212 214 216 218 219 221 223 224 227 228 229 230
#> [163] 232 233 237 240 241 244 245 246 247 248 252 255 256 257 260 261 262 264
#> [181] 265 267 268 269 270 271 274 277 278 279 281 282 284 285 286 290 292 293
#> [199] 294 295 296 299 301 302 305 308 309 311 313 314 316 318 319 320 321 325
#> [217] 327 329 330 332 333 335 336 337 338 339 340 341 344

## ------------------------------------------------
## Method `Resampling$test_set`
## ------------------------------------------------

task = tsk("penguins")
resampling = rsmp("holdout")$instantiate(task)
resampling$test_set(1)
#>   [1]   2   3   7   8  10  12  13  16  18  22  25  26  31  40  46  47  49  50
#>  [19]  57  58  60  66  67  70  71  72  74  76  79  83  85  86  88  90  93  94
#>  [37]  95  99 104 106 109 110 111 112 116 121 122 133 146 149 151 155 157 158
#>  [55] 159 160 162 167 170 171 172 174 177 178 179 181 183 186 187 189 202 207
#>  [73] 210 213 214 217 218 221 224 225 226 231 237 246 251 252 255 258 259 263
#>  [91] 265 269 274 285 291 292 293 297 299 304 305 308 311 316 318 319 321 323
#> [109] 329 333 334 337 339 340 344
