This is the abstract base class for resampling objects like ResamplingCV and ResamplingBootstrap.

The objects of this class define how a task is partitioned for resampling (e.g., in resample() or benchmark()), using a set of hyperparameters such as the number of folds in cross-validation.

Resampling objects can be instantiated on a Task. Instantiation applies the strategy to the task and results in a fixed partition of the row_ids of the Task.

Predefined resamplings are stored in the dictionary mlr_resamplings, e.g. cv or bootstrap.
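A short sketch of retrieving a predefined resampling from the dictionary and instantiating it, assuming the mlr3 package is installed and attached. The choice of 3 folds and of the penguins task is only for illustration.

```r
library(mlr3)

# List the keys of all predefined resampling strategies
keys = mlr_resamplings$keys()
print(keys)

# Retrieve a strategy by its key and set a hyperparameter on construction
r = rsmp("cv", folds = 3)
r$is_instantiated # FALSE, no task has been seen yet

# Instantiating on a task fixes the partition of row ids
task = tsk("penguins")
r$instantiate(task)
r$is_instantiated # TRUE
r$task_nrow
```

Until instantiate() is called, the object only describes the strategy and holds no row ids.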

Stratification

All derived classes support stratified sampling. The stratification variables are assumed to be discrete and must be stored in the Task with column role "stratum". In case of multiple stratification variables, each combination of the values of the stratification variables forms a stratum.

First, the observations are divided into subpopulations based on one or multiple stratification variables (assumed to be discrete), cf. task$strata.

Second, the sampling is performed in each of the k subpopulations separately. Each subgroup is divided into iter training sets and iter test sets by the derived Resampling. These sets are merged based on their iteration number: all training sets from all subpopulations with iteration 1 are combined, then all training sets with iteration 2, and so on. The same is done for all test sets. The merged sets can be accessed via $train_set(i) and $test_set(i), respectively.
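The two steps can be sketched in base R, independent of mlr3. This illustrates the idea under simplifying assumptions and is not the package's implementation: the helper holdout_split() and the toy strata vector are hypothetical, and a plain holdout split stands in for the derived Resampling.

```r
set.seed(1)

# Toy data (hypothetical): 12 row ids with one discrete stratification variable
row_ids = 1:12
strata = rep(c("a", "b"), times = c(9, 3))

# Hypothetical stand-in for a derived Resampling: `iters` holdout splits of `ids`
holdout_split = function(ids, iters, ratio = 2 / 3) {
  lapply(seq_len(iters), function(i) {
    n_train = round(length(ids) * ratio)
    train = ids[sample.int(length(ids), n_train)]
    list(train = train, test = setdiff(ids, train))
  })
}

iters = 2

# Step 1: divide the observations into subpopulations
subpops = split(row_ids, strata)

# Step 2: resample each subpopulation separately ...
per_stratum = lapply(subpops, holdout_split, iters = iters)

# ... and merge the resulting sets by iteration number
train_set = function(i) {
  unlist(lapply(per_stratum, function(s) s[[i]]$train), use.names = FALSE)
}
test_set = function(i) {
  unlist(lapply(per_stratum, function(s) s[[i]]$test), use.names = FALSE)
}

# The merged training set keeps the 3:1 ratio of the two strata
table(strata[train_set(1)])
```

Because every stratum is split with the same ratio, the class proportions of the full data carry over to each merged set, up to rounding.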

Grouping / Blocking

All derived classes support grouping of observations. The grouping variable is assumed to be discrete and must be stored in the Task with column role "group".

Observations in the same group are treated like a "block" of observations which must be kept together. These observations either all go together into the training set or together into the test set.

The sampling is performed by the derived Resampling on the grouping variable. Next, the grouping information is replaced with the respective row ids to generate training and test sets. The sets can be accessed via $train_set(i) and $test_set(i), respectively.
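A minimal sketch of grouped resampling, assuming mlr3 is attached. Using the island column of the penguins task as the grouping variable is an assumption made purely for illustration; 3 folds are chosen because the task has 3 islands.

```r
library(mlr3)

task = tsk("penguins")

# Assumption for illustration: treat the island as the grouping variable
task$set_col_roles("island", roles = "group")

r = rsmp("cv", folds = 3)
r$instantiate(task)

# task$groups maps row ids to group ids
groups = as.data.frame(task$groups)
group_of = function(ids) {
  unique(as.character(groups$group[groups$row_id %in% ids]))
}

train_groups = group_of(r$train_set(1))
test_groups = group_of(r$test_set(1))

# No island is split between the training set and the test set
intersect(train_groups, test_groups)
```

The getters still return row ids, even though the sampling itself operated on the group ids.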

Public fields

id

(character(1))
Identifier of the object. Used in tables, plots, and text output.

param_set

(paradox::ParamSet)
Set of hyperparameters.

instance

(any)
During instantiate(), the instance is stored in this slot in an arbitrary format. Note that if a grouping variable is present in the Task, a Resampling may operate on the group ids internally instead of the row ids, which can be confusing when inspecting the instance. It is advised not to work directly with the instance, but to use only the getters $train_set() and $test_set().

task_hash

(character(1))
The hash of the Task which was passed to r$instantiate().

task_nrow

(integer(1))
The number of observations of the Task which was passed to r$instantiate().

duplicated_ids

(logical(1))
If TRUE, duplicated rows can occur within a single training set or within a single test set. E.g., this is TRUE for Bootstrap, and FALSE for cross-validation. Only used internally.

man

(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. Defaults to NA, but can be set by child classes.
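As an illustration of the duplicated_ids field described above, the following sketch compares bootstrap and cross-validation on the penguins task. It assumes mlr3 is attached; the numbers of repeats and folds are arbitrary.

```r
library(mlr3)

task = tsk("penguins")

boot = rsmp("bootstrap", repeats = 2)$instantiate(task)
cv = rsmp("cv", folds = 3)$instantiate(task)

# The flag as reported by the two strategies
boot$duplicated_ids
cv$duplicated_ids

# Bootstrap draws with replacement, so row ids repeat within a training set
anyDuplicated(boot$train_set(1)) > 0

# Cross-validation assigns every row to exactly one fold
anyDuplicated(cv$train_set(1)) > 0
```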

Active bindings

is_instantiated

(logical(1))
Is TRUE if the resampling has been instantiated.

hash

(character(1))
Hash (unique identifier) for this object.

Methods

Public methods


Method new()

Creates a new instance of this R6 class.

Usage

Resampling$new(
  id,
  param_set = ParamSet$new(),
  duplicated_ids = FALSE,
  man = NA_character_
)

Arguments

id

(character(1))
Identifier for the new instance.

param_set

(paradox::ParamSet)
Set of hyperparameters.

duplicated_ids

(logical(1))
Set to TRUE if this resampling strategy may have duplicated row ids in a single training set or test set. Note that this object is typically constructed via a derived class, e.g. ResamplingCV or ResamplingHoldout.

man

(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. The referenced help page can be opened via method $help().


Method format()

Helper for print outputs.

Usage

Resampling$format()


Method print()

Printer.

Usage

Resampling$print(...)

Arguments

...

(ignored).


Method help()

Opens the corresponding help page referenced by field $man.

Usage

Resampling$help()


Method instantiate()

Materializes fixed training and test splits for a given task and stores them in r$instance in an arbitrary format.

Usage

Resampling$instantiate(task)

Arguments

task

(Task)
Task used for instantiation.

Returns

Returns the object itself, but modified by reference. You need to explicitly $clone() the object beforehand if you want to keep the object in its previous state.
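A brief sketch of the reference semantics, assuming mlr3 is attached. The holdout strategy and the penguins task are arbitrary choices for illustration.

```r
library(mlr3)

task = tsk("penguins")
r = rsmp("holdout")

# Keep a copy of the uninstantiated object
r_before = r$clone()

# instantiate() modifies r in place and returns it
r_same = r$instantiate(task)

identical(r, r_same)     # TRUE, both names refer to the same object
r$is_instantiated        # TRUE
r_before$is_instantiated # FALSE, the clone is unaffected
```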


Method train_set()

Returns the row ids of the i-th training set.

Usage

Resampling$train_set(i)

Arguments

i

(integer(1))
Iteration.

Returns

(integer()) of row ids.


Method test_set()

Returns the row ids of the i-th test set.

Usage

Resampling$test_set(i)

Arguments

i

(integer(1))
Iteration.

Returns

(integer()) of row ids.
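The two getters are typically called once per iteration. The sketch below assumes mlr3 is attached and uses the iters binding of the instantiated object to obtain the number of iterations; 5-fold cross-validation is an arbitrary choice.

```r
library(mlr3)

task = tsk("penguins")
r = rsmp("cv", folds = 5)$instantiate(task)

# Sizes of the training and test set in each iteration
sizes = vapply(seq_len(r$iters), function(i) {
  c(train = length(r$train_set(i)), test = length(r$test_set(i)))
}, integer(2))
print(sizes)

# In cross-validation, the test sets together cover every row exactly once
all_test = unlist(lapply(seq_len(r$iters), function(i) r$test_set(i)))
anyDuplicated(all_test) == 0
setequal(all_test, task$row_ids)
```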


Method clone()

The objects of this class are cloneable with this method.

Usage

Resampling$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.

Examples

r = rsmp("subsampling")

# Default parametrization
r$param_set$values
#> $repeats
#> [1] 30
#>
#> $ratio
#> [1] 0.6666667
#>

# Do only 3 repeats on 10% of the data
r$param_set$values = list(ratio = 0.1, repeats = 3)
r$param_set$values
#> $ratio
#> [1] 0.1
#>
#> $repeats
#> [1] 3
#>

# Instantiate on penguins task
task = tsk("penguins")
r$instantiate(task)

# Extract train/test sets
train_set = r$train_set(1)
print(train_set)
#> [1] 147 67 44 161 258 302 263 209 109 344 97 261 2 314 295 337 237 308 189
#> [20] 123 319 64 290 68 72 169 200 330 40 53 243 88 185 139
intersect(train_set, r$test_set(1))
#> integer(0)

# Another example: 10-fold CV
r = rsmp("cv")$instantiate(task)
r$train_set(1)
#> [1] 15 17 25 30 43 47 58 72 73 88 92 101 108 121 132 145 152 154
#> [19] 175 178 192 205 210 213 229 241 261 275 279 283 291 312 314 322 334 18
#> [37] 32 49 51 59 79 81 98 99 104 109 111 116 120 123 136 170 171 191
#> [55] 196 198 207 214 215 222 249 260 266 278 281 282 285 289 296 320 7 16
#> [73] 24 29 33 40 41 44 66 75 85 100 103 107 126 139 150 157 160 163
#> [91] 184 186 190 242 251 252 294 297 299 309 315 318 321 328 333 13 19 21
#> [109] 26 35 38 46 60 61 87 89 113 134 140 155 173 177 179 185 200 234
#> [127] 238 250 253 256 262 271 286 290 305 325 330 335 344 14 36 39 42 52
#> [145] 53 67 69 74 82 102 127 128 129 138 143 144 148 149 194 195 223 227
#> [163] 248 255 264 273 280 287 298 307 337 338 340 1 4 31 68 71 90 97
#> [181] 105 115 125 130 135 147 151 161 164 165 167 202 206 208 225 232 240 244
#> [199] 259 277 292 310 311 329 331 332 342 5 12 27 48 78 80 84 112 114
#> [217] 119 124 162 174 181 183 204 216 226 230 236 246 254 263 267 268 269 274
#> [235] 276 303 304 308 319 327 341 6 10 45 50 76 77 91 95 110 117 118
#> [253] 131 133 142 146 159 197 199 201 203 211 220 224 239 243 245 247 257 284
#> [271] 295 300 306 316 317 8 11 20 23 28 37 54 57 62 63 65 70 94
#> [289] 96 137 153 169 176 182 188 193 209 212 217 218 219 221 231 235 258 270
#> [307] 288 301 339

# Stratification
task = tsk("pima")
prop.table(table(task$truth())) # moderately unbalanced
#>
#>       pos       neg
#> 0.3489583 0.6510417

task$col_roles$stratum = task$target_names
r = rsmp("subsampling")
r$instantiate(task)
prop.table(table(task$truth(r$train_set(1)))) # roughly same proportion
#>
#>       pos       neg
#> 0.3496094 0.6503906