
This is the abstract base class for resampling objects like ResamplingCV and ResamplingBootstrap.

The objects of this class define how a task is partitioned for resampling (e.g., in resample() or benchmark()), using a set of hyperparameters such as the number of folds in cross-validation.

Resampling objects can be instantiated on a Task, which applies the strategy on the task and manifests in a fixed partition of row_ids of the Task.

Predefined resamplings are stored in the dictionary mlr_resamplings under keys such as cv or bootstrap.
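The workflow described above can be sketched as follows. This is a minimal illustration that assumes the mlr3 package is attached; the `folds` hyperparameter and the dictionary method `$keys()` come from mlr3 itself and are not documented on this page.

```r
library(mlr3)

# List the keys of all predefined resampling strategies
keys = mlr_resamplings$keys()
print(keys)

# Retrieve a strategy from the dictionary and set a hyperparameter
r = rsmp("cv", folds = 3)
r$is_instantiated  # FALSE: no task has been seen yet

# Instantiating on a task fixes the partition of row ids
task = tsk("penguins")
r$instantiate(task)
r$is_instantiated  # TRUE
```

Until `$instantiate()` has been called, the object only describes a strategy and holds no train or test sets.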

Stratification

All derived classes support stratified sampling. The stratification variables are assumed to be discrete and must be stored in the Task with column role "stratum". In case of multiple stratification variables, each combination of the values of the stratification variables forms a stratum.

First, the observations are divided into subpopulations based on one or multiple stratification variables (assumed to be discrete), cf. task$strata.

Second, the sampling is performed in each of the k subpopulations separately. Each subgroup is divided into iter training sets and iter test sets by the derived Resampling. These sets are merged based on their iteration number: all training sets from all subpopulations with iteration 1 are combined, then all training sets with iteration 2, and so on. The same is done for all test sets. The merged sets can be accessed via $train_set(i) and $test_set(i), respectively. Note that this procedure can lead to set sizes that are slightly different from those without stratification.
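The two steps can be illustrated with a small base R sketch. It is not the mlr3 implementation: the toy data and the helper `holdout()` are hypothetical, and a single holdout iteration stands in for the derived Resampling.

```r
set.seed(1)

# Toy data: 20 row ids with a discrete stratification variable
row_ids = 1:20
stratum = rep(c("a", "b"), times = c(14, 6))

# Hypothetical helper: holdout split of the ids of one subpopulation
holdout = function(ids, ratio = 0.5) {
  in_train = sample.int(length(ids), round(length(ids) * ratio))
  list(train = ids[in_train], test = ids[-in_train])
}

# Step 1: divide the observations into subpopulations
subpops = split(row_ids, stratum)

# Step 2: sample within each subpopulation, then merge the sets
splits = lapply(subpops, holdout)
train_set = unlist(lapply(splits, `[[`, "train"), use.names = FALSE)
test_set = unlist(lapply(splits, `[[`, "test"), use.names = FALSE)

# Each stratum contributes in proportion to its size (a: 7, b: 3)
table(stratum[train_set])
```

Because the set size is rounded within each subpopulation, the merged sets can differ slightly in size from an unstratified split of the same ratio.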

Grouping / Blocking

All derived classes support grouping of observations. The grouping variable is assumed to be discrete and must be stored in the Task with column role "group".

Observations in the same group are treated like a "block" of observations which must be kept together. These observations either all go together into the training set or together into the test set.

The sampling is performed by the derived Resampling on the grouping variable. Next, the grouping information is replaced with the respective row ids to generate training and test sets. The sets can be accessed via $train_set(i) and $test_set(i), respectively.
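A short sketch of grouped resampling, assuming mlr3 is attached. Using the penguins column `year` as the grouping variable is an arbitrary choice for illustration; `$set_col_roles()` and `$groups` are Task members from mlr3 that are not described on this page.

```r
library(mlr3)

task = tsk("penguins")

# Use the column "year" as grouping variable (this removes it from the features)
task$set_col_roles("year", roles = "group")

r = rsmp("cv", folds = 3)
r$instantiate(task)

# Look up the group of each row id in the first training and test set
groups = task$groups
train_groups = unique(groups$group[groups$row_id %in% r$train_set(1)])
test_groups = unique(groups$group[groups$row_id %in% r$test_set(1)])

# No group is split across the two sets
intersect(train_groups, test_groups)
```

The getters return row ids as usual, even though the folds were drawn on the level of the groups.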

See also

Other Resampling: mlr_resamplings_bootstrap, mlr_resamplings_custom_cv, mlr_resamplings_custom, mlr_resamplings_cv, mlr_resamplings_holdout, mlr_resamplings_insample, mlr_resamplings_loo, mlr_resamplings_repeated_cv, mlr_resamplings_subsampling, mlr_resamplings

Public fields

id

(character(1))
Identifier of the object. Used in tables, plots, and text output.

label

(character(1))
Label for this object. Can be used in tables, plots, and text output instead of the ID.

param_set

(paradox::ParamSet)
Set of hyperparameters.

instance

(any)
During instantiate(), the instance is stored in this slot in an arbitrary format. Note that if a grouping variable is present in the Task, a Resampling may operate on the group ids internally instead of the row ids (which may lead to confusion).

It is advisable not to work with the instance directly; use only the getters $train_set() and $test_set() instead.

task_hash

(character(1))
The hash of the Task which was passed to r$instantiate().

task_nrow

(integer(1))
The number of observations of the Task which was passed to r$instantiate().

duplicated_ids

(logical(1))
If TRUE, duplicated rows can occur within a single training set or within a single test set. For example, this is TRUE for bootstrap and FALSE for cross-validation. Only used internally.

man

(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. Defaults to NA, but can be set by child classes.
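The meaning of duplicated_ids can be checked directly on two predefined strategies. This is a minimal sketch assuming mlr3 is attached; the hyperparameters `repeats` and `folds` belong to the derived classes and are not listed on this page.

```r
library(mlr3)

task = tsk("penguins")

boot = rsmp("bootstrap", repeats = 2)
cv = rsmp("cv", folds = 3)

boot$duplicated_ids  # TRUE
cv$duplicated_ids    # FALSE

boot$instantiate(task)
cv$instantiate(task)

# Bootstrap draws with replacement, so row ids can repeat within a training set
anyDuplicated(boot$train_set(1)) > 0

# Cross-validation training sets contain each row id at most once
anyDuplicated(cv$train_set(1)) == 0
```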

Active bindings

is_instantiated

(logical(1))
Is TRUE if the resampling has been instantiated.

hash

(character(1))
Hash (unique identifier) for this object.

Methods


Method new()

Creates a new instance of this R6 class.

Usage

Resampling$new(
  id,
  param_set = ps(),
  duplicated_ids = FALSE,
  label = NA_character_,
  man = NA_character_
)

Arguments

id

(character(1))
Identifier for the new instance.

param_set

(paradox::ParamSet)
Set of hyperparameters.

duplicated_ids

(logical(1))
Set to TRUE if this resampling strategy may have duplicated row ids in a single training set or test set.

Note that this object is typically constructed via a derived class, e.g. ResamplingCV or ResamplingHoldout.

label

(character(1))
Label for the new instance.

man

(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. The referenced help page can be opened via method $help().


Method format()

Helper for print outputs.

Usage

Resampling$format(...)

Arguments

...

(ignored).


Method print()

Printer.

Usage

Resampling$print(...)

Arguments

...

(ignored).


Method help()

Opens the corresponding help page referenced by field $man.

Usage

Resampling$help()


Method instantiate()

Materializes fixed training and test splits for a given task and stores them in r$instance in an arbitrary format.

Usage

Resampling$instantiate(task)

Arguments

task

(Task)
Task used for instantiation.

Returns

Returns the object itself, but modified by reference. You need to explicitly $clone() the object beforehand if you want to keep the object in its previous state.
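The reference semantics can be seen in a small sketch, assuming mlr3 is attached. The comparison with `task$hash` and `task$nrow` relies on Task fields that are not described on this page.

```r
library(mlr3)

task = tsk("penguins")
r = rsmp("holdout")

# Clone first, then instantiate the copy; the original stays untouched
r2 = r$clone(deep = TRUE)
r2$instantiate(task)

r$is_instantiated   # FALSE
r2$is_instantiated  # TRUE

# The instantiated copy records which task it was instantiated on
r2$task_hash == task$hash
r2$task_nrow == task$nrow
```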


Method train_set()

Returns the row ids of the i-th training set.

Usage

Resampling$train_set(i)

Arguments

i

(integer(1))
Iteration.

Returns

(integer()) of row ids.


Method test_set()

Returns the row ids of the i-th test set.

Usage

Resampling$test_set(i)

Arguments

i

(integer(1))
Iteration.

Returns

(integer()) of row ids.
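Both getters are typically called once per iteration. The sketch below assumes mlr3 is attached and uses `$iters`, the number of iterations, which is provided by the derived classes and is not listed on this page.

```r
library(mlr3)

task = tsk("penguins")
r = rsmp("cv", folds = 5)
r$instantiate(task)

# Collect the test set of every iteration
test_sets = lapply(seq_len(r$iters), function(i) r$test_set(i))

# In cross-validation, every row id appears in exactly one test set
all_test_ids = unlist(test_sets)
setequal(all_test_ids, task$row_ids) && length(all_test_ids) == task$nrow

# Within one iteration, training and test set together cover the task
i = 1
length(r$train_set(i)) + length(r$test_set(i)) == task$nrow
```

These properties are specific to cross-validation. Strategies such as bootstrap or subsampling do not partition the row ids across iterations in this way.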


Method clone()

The objects of this class are cloneable with this method.

Usage

Resampling$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.

Examples

r = rsmp("subsampling")

# Default parametrization
r$param_set$values
#> $repeats
#> [1] 30
#> 
#> $ratio
#> [1] 0.6666667
#> 

# Do only 3 repeats on 10% of the data
r$param_set$values = list(ratio = 0.1, repeats = 3)
r$param_set$values
#> $ratio
#> [1] 0.1
#> 
#> $repeats
#> [1] 3
#> 

# Instantiate on penguins task
task = tsk("penguins")
r$instantiate(task)

# Extract train/test sets
train_set = r$train_set(1)
print(train_set)
#>  [1] 180  89 171  67  17 137 141 201 169  80 332 225 325 125 139 326 299 239 274
#> [20] 306  98 330  18 314 244  24  82 211 108 307  64  31 235 234
intersect(train_set, r$test_set(1))
#> integer(0)

# Another example: 10-fold CV
r = rsmp("cv")$instantiate(task)
r$train_set(1)
#>   [1]  10  12  14  18  19  27  29  30  53  62  94  99 101 113 142 162 173 175
#>  [19] 179 187 188 189 219 233 242 254 261 267 268 271 272 284 303 308 321  16
#>  [37]  25  37  39  42  67  72  80  86  98 114 120 126 130 143 177 178 181 190
#>  [55] 193 198 208 218 221 225 236 237 244 248 275 279 312 327 338 340  22  24
#>  [73]  28  36  43  47  51  69  76  83  92 104 105 115 157 163 166 184 203 222
#>  [91] 231 253 256 264 270 274 278 292 293 307 309 322 326 331 344   6   8  13
#> [109]  34  40  45  49  74  82  84  88 102 122 138 140 147 155 172 196 197 211
#> [127] 213 243 252 262 288 296 301 304 325 329 335 337 342   2  11  32  41  52
#> [145]  70  90 117 124 133 135 146 152 153 159 161 164 176 185 186 192 217 220
#> [163] 235 239 257 258 280 289 291 298 299 313 341   5  17  31  35  38  48  56
#> [181]  61  65  68  78  95 108 111 112 116 123 131 167 195 210 216 224 250 263
#> [199] 276 282 285 290 314 315 320 333 336   9  20  33  46  54  66  85 103 109
#> [217] 118 128 134 148 151 154 156 165 171 206 215 223 234 238 247 259 286 294
#> [235] 297 306 310 311 324 330 343   1   4  21  55  58  59  63  64  71  75  77
#> [253]  81  91 121 127 132 137 144 149 174 191 194 199 214 226 228 232 240 265
#> [271] 281 283 305 319 334  15  23  44  57  60  73  89  97 106 136 139 141 145
#> [289] 150 158 160 200 201 202 204 205 209 229 246 249 251 255 266 287 316 318
#> [307] 323 332 339

# Stratification
task = tsk("pima")
prop.table(table(task$truth())) # moderately unbalanced
#> 
#>       pos       neg 
#> 0.3489583 0.6510417 
task$col_roles$stratum = task$target_names

r = rsmp("subsampling")
r$instantiate(task)
prop.table(table(task$truth(r$train_set(1)))) # roughly same proportion
#> 
#>       pos       neg 
#> 0.3496094 0.6503906