Resampling Class

This is the abstract base class for resampling objects like ResamplingCV and ResamplingBootstrap.

The objects of this class define how a task is partitioned for resampling (e.g., in resample() or benchmark()), using a set of hyperparameters such as the number of folds in cross-validation.

Resampling objects can be instantiated on a Task, which applies the strategy on the task and manifests in a fixed partition of row_ids of the Task.

Predefined resamplings are stored in the dictionary mlr_resamplings, e.g. cv or bootstrap.
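As a brief illustration of the dictionary lookup, the following sketch assumes the mlr3 package is attached. It retrieves a predefined resampling by its key, once through mlr_resamplings$get() and once through the rsmp() shorthand used in the examples on this page.

```r
library(mlr3)

# List the keys of all predefined resamplings
mlr_resamplings$keys()

# Retrieve cross-validation by its key
r1 = mlr_resamplings$get("cv")

# rsmp() is a shorthand that also accepts hyperparameter values
r2 = rsmp("cv", folds = 3)
r2$param_set$values$folds
```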

Stratification

All derived classes support stratified sampling. The stratification variables are assumed to be discrete and must be stored in the Task with column role "stratum". In case of multiple stratification variables, each combination of the values of the stratification variables forms a stratum.

First, the observations are divided into subpopulations based on one or multiple stratification variables (assumed to be discrete), cf. task$strata.

Second, the sampling is performed in each of the k subpopulations separately. Each subgroup is divided into iter training sets and iter test sets by the derived Resampling. These sets are merged based on their iteration number: all training sets from all subpopulations with iteration 1 are combined, then all training sets with iteration 2, and so on. The same is done for all test sets. The merged sets can be accessed via $train_set(i) and $test_set(i), respectively. Note that this procedure can lead to set sizes that are slightly different from those without stratification.

Grouping / Blocking

All derived classes support grouping of observations. The grouping variable is assumed to be discrete and must be stored in the Task with column role "group".

Observations in the same group are treated like a "block" of observations which must be kept together. These observations either all go together into the training set or together into the test set.

The sampling is performed by the derived Resampling on the grouping variable. Next, the grouping information is replaced with the respective row ids to generate training and test sets. The sets can be accessed via $train_set(i) and $test_set(i), respectively.
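The blocking behavior can be checked with a small sketch. Using the penguins column "year" as the grouping variable is an assumption made purely for illustration; any discrete column could take this role.

```r
library(mlr3)

task = tsk("penguins")
# Assumption for illustration: treat all observations of one year as a block
task$set_col_roles("year", roles = "group")

# 3 folds, matching the 3 distinct years in the data
r = rsmp("cv", folds = 3)
r$instantiate(task)

# task$groups maps each row id to its group (columns row_id and group)
groups = task$groups
train_years = unique(groups$group[groups$row_id %in% r$train_set(1)])
test_years = unique(groups$group[groups$row_id %in% r$test_set(1)])

# No year should appear on both sides of the split
intersect(train_years, test_years)
```

Although the sampling happens on the group ids, $train_set() and $test_set() still return row ids, so downstream code does not need to know about the grouping.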

Inheriting

It is possible to overwrite either private$.get_instance() to have full control, or only private$.sample() when one wants to use the pre-defined mechanism for stratification and grouping.
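The following is a minimal sketch of a derived class that only overwrites private$.sample(). The class name ResamplingHalves is hypothetical. The additional members iters, private$.get_train(), private$.get_test() and private$.combine() are not described on this page; they are modeled on how the built-in holdout resampling appears to be implemented, so treat them as assumptions and check them against the installed mlr3 version.

```r
library(mlr3)
library(R6)

# Hypothetical class for illustration: one random 50/50 split
ResamplingHalves = R6Class("ResamplingHalves",
  inherit = Resampling,
  public = list(
    initialize = function() {
      super$initialize(id = "halves", label = "Random halves")
    }
  ),
  active = list(
    # Number of resampling iterations (assumed to be required by the getters)
    iters = function(rhs) 1L
  ),
  private = list(
    # ids are row ids, or group ids if the task has a grouping variable
    .sample = function(ids, ...) {
      in_train = sample.int(length(ids), floor(length(ids) / 2))
      list(train = ids[in_train], test = ids[-in_train])
    },
    # Assumed hooks used by $train_set() and $test_set()
    .get_train = function(i) self$instance$train,
    .get_test = function(i) self$instance$test,
    # Assumed hook that merges per-stratum instances under stratification
    .combine = function(instances) {
      list(
        train = unlist(lapply(instances, `[[`, "train")),
        test = unlist(lapply(instances, `[[`, "test"))
      )
    }
  )
)

r = ResamplingHalves$new()
r$instantiate(tsk("penguins"))
length(r$train_set(1))
length(intersect(r$train_set(1), r$test_set(1)))
```

Because only private$.sample() is overwritten, the base class remains responsible for applying stratification and grouping around it.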

Public fields

label

(character(1))
Label for this object. Can be used in tables, plots and text output instead of the ID.

param_set

(paradox::ParamSet)
Set of hyperparameters.

instance

(any)
During instantiate(), the instance is stored in this slot in an arbitrary format. Note that if a grouping variable is present in the Task, a Resampling may operate on the group ids internally instead of the row ids (which may lead to confusion).

It is advised to not work directly with the instance, but instead only use the getters $train_set() and $test_set().

task_hash

(character(1))
The hash of the Task which was passed to r$instantiate().

task_row_hash

(character(1))
The hash of the row ids of the Task which was passed to r$instantiate().

task_nrow

(integer(1))
The number of observations of the Task which was passed to r$instantiate().

duplicated_ids

(logical(1))
If TRUE, duplicated rows can occur within a single training set or within a single test set. E.g., this is TRUE for Bootstrap, and FALSE for cross-validation. Only used internally.

man

(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. Defaults to NA, but can be set by child classes.

Active bindings

id: (character(1))
Identifier of the object. Used in tables, plots and text output.
is_instantiated: (logical(1))
Is TRUE if the resampling has been instantiated.
hash: (character(1))
Hash (unique identifier) for this object. If the object has not been instantiated yet, NA_character_ is returned. The hash is calculated based on the class name, the id, the parameter set, and the instance.
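A short sketch of how these bindings change once a resampling is instantiated, assuming the mlr3 package is attached and using the penguins task from the examples:

```r
library(mlr3)

r = rsmp("holdout")
instantiated_before = r$is_instantiated # FALSE before instantiate()
hash_before = r$hash                    # NA_character_ without an instance

r$instantiate(tsk("penguins"))
r$is_instantiated # TRUE
r$hash            # now a character(1) hash
r$task_nrow       # number of rows of the task used for instantiation
```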

Methods

Method `new()`

Creates a new instance of this R6 class.

Usage

Resampling$new(
  id,
  param_set = ps(),
  duplicated_ids = FALSE,
  label = NA_character_,
  man = NA_character_
)

Arguments

id

(character(1))
Identifier for the new instance.

param_set

(paradox::ParamSet)
Set of hyperparameters.

duplicated_ids

(logical(1))
Set to TRUE if this resampling strategy may have duplicated row ids in a single training set or test set.

Note that this object is typically constructed via a derived class, e.g. ResamplingCV or ResamplingHoldout.

label

(character(1))
Label for the new instance.

man

(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. The referenced help page can be opened via method $help().

Method `format()`

Helper for print outputs.

Usage

Resampling$format(...)

Arguments

...: (ignored).

Method `print()`

Printer.

Usage

Resampling$print(...)

Arguments

...: (ignored).

Method `help()`

Opens the corresponding help page referenced by field $man.

Usage

Resampling$help()

Method `instantiate()`

Materializes fixed training and test splits for a given task and stores them in r$instance in an arbitrary format.

Usage

Resampling$instantiate(task)

Arguments

task: (Task)
Task used for instantiation.

Returns

Returns the object itself, but modified by reference. You need to explicitly $clone() the object beforehand if you want to keep the object in its previous state.
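A small sketch of the cloning pattern described here; the variable name template is chosen for illustration only.

```r
library(mlr3)

task = tsk("penguins")
template = rsmp("cv", folds = 3)

# instantiate() modifies the object in place, so work on a clone
r = template$clone(deep = TRUE)
r$instantiate(task)

template$is_instantiated # the original stays uninstantiated
r$is_instantiated        # the clone holds the fixed splits
```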

Method `train_set()`

Returns the row ids of the i-th training set.

Usage

Resampling$train_set(i)

Arguments

i: (integer(1))
Iteration.

Returns

(integer()) of row ids.

Method `test_set()`

Returns the row ids of the i-th test set.

Usage

Resampling$test_set(i)

Arguments

i: (integer(1))
Iteration.

Returns

(integer()) of row ids.
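The two getters can be combined to inspect all splits of an instantiated resampling. The sketch below assumes that the derived class exposes the number of iterations as r$iters, which is not documented on this page, and checks that the test sets of a cross-validation partition the rows of the task.

```r
library(mlr3)

task = tsk("penguins")
r = rsmp("cv", folds = 5)
r$instantiate(task)

# Collect the test sets of all iterations (r$iters is an assumption, see above)
test_ids = unlist(lapply(seq_len(r$iters), function(i) r$test_set(i)))

# In cross-validation, every row is used for testing exactly once
setequal(test_ids, task$row_ids)
anyDuplicated(test_ids)
```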

Method `clone()`

The objects of this class are cloneable with this method.

Usage

Resampling$clone(deep = FALSE)

Arguments

deep: Whether to make a deep clone.

Examples

r = rsmp("subsampling")

# Default parametrization
r$param_set$values
#> $ratio
#> [1] 0.6666667
#> 
#> $repeats
#> [1] 30
#> 

# Do only 3 repeats on 10% of the data
r$param_set$set_values(ratio = 0.1, repeats = 3)
r$param_set$values
#> $ratio
#> [1] 0.1
#> 
#> $repeats
#> [1] 3
#> 

# Instantiate on penguins task
task = tsk("penguins")
r$instantiate(task)

# Extract train/test sets
train_set = r$train_set(1)
print(train_set)
#>  [1]   8 165 268 340 261  57 306 232 179 163 133 108  37 183 154 316 184  31 308
#> [20] 265 181 273 319  96 238 155 284 121 223 270 141 303 315 231
intersect(train_set, r$test_set(1))
#> integer(0)

# Another example: 10-fold CV
r = rsmp("cv")$instantiate(task)
r$train_set(1)
#>   [1]   7  20  34  58  66  71  80  81 104 107 126 138 159 163 171 173 174 194
#>  [19] 195 200 203 207 209 220 231 257 259 264 269 274 275 301 314 320 331   5
#>  [37]  16  27  39  44  45  57  62  69  73  88  93 101 111 118 129 133 134 144
#>  [55] 151 164 172 181 208 223 224 230 234 260 262 267 277 294 317 334   4   6
#>  [73]   8  22  26  47  48  52  72  84  89 108 113 130 149 153 156 176 179 185
#>  [91] 189 198 211 216 217 218 219 228 245 265 281 313 319 323 342  23  40  41
#> [109]  55  61  75  85  90  92  97 100 103 106 112 132 152 169 190 197 199 204
#> [127] 226 248 255 256 258 272 273 278 286 325 329 330 337  11  14  31  36  43
#> [145]  46  65  76  77 109 121 124 125 139 140 141 158 166 191 206 213 214 236
#> [163] 240 276 290 296 308 326 328 338 340 341 344  12  19  28  35  38  42  53
#> [181]  56  59  70  74  78  91  98  99 102 114 120 142 165 184 205 221 222 227
#> [199] 232 241 249 263 266 285 321 339 343  13  18  30  50  51 128 135 143 146
#> [217] 154 155 157 161 168 188 201 210 212 215 225 229 233 235 247 250 271 293
#> [235] 297 303 306 312 315 332 336   1   2   3  15  24  33  37  60  95 105 116
#> [253] 127 136 147 150 178 180 183 186 242 244 246 253 261 270 279 291 298 302
#> [271] 305 311 322 324 327  17  29  32  54  67  82  83  87  94 115 117 119 123
#> [289] 145 175 177 187 193 202 238 239 243 268 280 283 284 289 299 300 304 309
#> [307] 310 318 333

# Stratification
task = tsk("pima")
prop.table(table(task$truth())) # moderately unbalanced
#> 
#>       pos       neg 
#> 0.3489583 0.6510417 
task$col_roles$stratum = task$target_names

r = rsmp("subsampling")
r$instantiate(task)
prop.table(table(task$truth(r$train_set(1)))) # roughly same proportion
#> 
#>       pos       neg 
#> 0.3496094 0.6503906
