This is the abstract base class for resampling objects like ResamplingCV and ResamplingBootstrap.
The objects of this class define how a task is partitioned for resampling (e.g., in resample() or benchmark()), using a set of hyperparameters such as the number of folds in cross-validation.
Resampling objects can be instantiated on a Task, which applies the strategy to the task and results in a fixed partition of the row_ids of the Task.
Predefined resamplings are stored in the dictionary mlr_resamplings, e.g. cv or bootstrap.
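As a minimal sketch of the dictionary lookup described above (assuming the mlr3 package is installed and attached), a predefined resampling can be retrieved by its key and configured at construction. The choice of 3 folds is only an illustrative value.

```r
library(mlr3)

# Two equivalent ways to retrieve a predefined resampling by its key
r1 = mlr_resamplings$get("cv")
r2 = rsmp("cv")

# Hyperparameters can be set at construction; 3 folds is an arbitrary choice
r3 = rsmp("cv", folds = 3)
r3$param_set$values$folds

# Keys of all resamplings available in the current session
mlr_resamplings$keys()
```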
Stratification
All derived classes support stratified sampling.
The stratification variables are assumed to be discrete and must be stored in the Task with column role "stratum".
In case of multiple stratification variables, each combination of the values of the stratification variables forms a stratum.
First, the observations are divided into subpopulations based on one or more stratification variables (assumed to be discrete), cf. task$strata.
Second, the sampling is performed in each of the k subpopulations separately.
Each subgroup is divided into iter training sets and iter test sets by the derived Resampling.
These sets are merged based on their iteration number: all training sets from all subpopulations with iteration 1 are combined, then all training sets with iteration 2, and so on.
The same is done for all test sets.
The merged sets can be accessed via $train_set(i) and $test_set(i), respectively.
Note that this procedure can lead to set sizes that are slightly different from those without stratification.
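The following sketch illustrates stratification on more than one variable. It assumes the penguins task shipped with mlr3 and uses the target (species) together with the island feature as stratification variables; this pairing is chosen only for illustration.

```r
library(mlr3)

task = tsk("penguins")

# Each observed combination of species and island forms one stratum
task$col_roles$stratum = c("species", "island")

# One row per stratum: N is the stratum size, row_id holds its row ids
strata = task$strata
print(strata)

# Sampling happens within each stratum; the per-stratum sets are then merged
r = rsmp("cv", folds = 3)$instantiate(task)
train = r$train_set(1)
test = r$test_set(1)
c(train = length(train), test = length(test))
```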
Grouping / Blocking
All derived classes support grouping of observations.
The grouping variable is assumed to be discrete and must be stored in the Task with column role "group".
Observations in the same group are treated like a "block" of observations which must be kept together. These observations either all go together into the training set or together into the test set.
The sampling is performed by the derived Resampling on the grouping variable.
Next, the grouping information is replaced with the respective row ids to generate training and test sets.
The sets can be accessed via $train_set(i) and $test_set(i), respectively.
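A small sketch of grouping, under the assumption of a toy data set invented for this illustration (the columns y, x and block are hypothetical): five blocks of four observations each are kept together by 5-fold cross-validation.

```r
library(mlr3)

set.seed(1)
# Hypothetical toy data: 20 observations in 5 blocks of 4
dat = data.frame(
  y = factor(rep(c("a", "b"), times = 10)),
  x = rnorm(20),
  block = rep(1:5, each = 4)
)
task = as_task_classif(dat, target = "y", id = "toy")

# Assign the column role "group" to the blocking variable
task$set_col_roles("block", roles = "group")

r = rsmp("cv", folds = 5)$instantiate(task)

# Map the row ids of the first split back to their groups
groups = as.data.frame(task$groups)
train_groups = unique(groups$group[groups$row_id %in% r$train_set(1)])
test_groups = unique(groups$group[groups$row_id %in% r$test_set(1)])

# No group is shared between the training and the test set
intersect(train_groups, test_groups)
```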
See also
Chapter in the mlr3book: https://mlr3book.mlr-org.com/chapters/chapter3/evaluation_and_benchmarking.html#sec-resampling
Package mlr3spatiotempcv for spatio-temporal resamplings.
Dictionary of Resamplings: mlr_resamplings
as.data.table(mlr_resamplings) for a table of available Resamplings in the running session (depending on the loaded packages).
mlr3spatiotempcv for additional Resamplings for spatio-temporal tasks.
Other Resampling: mlr_resamplings, mlr_resamplings_bootstrap, mlr_resamplings_custom, mlr_resamplings_custom_cv, mlr_resamplings_cv, mlr_resamplings_holdout, mlr_resamplings_insample, mlr_resamplings_loo, mlr_resamplings_repeated_cv, mlr_resamplings_subsampling
Public fields
label
(character(1))
Label for this object. Can be used in tables, plot and text output instead of the ID.
param_set
(paradox::ParamSet)
Set of hyperparameters.
instance
(any)
During instantiate(), the instance is stored in this slot in an arbitrary format. Note that if a grouping variable is present in the Task, a Resampling may operate on the group ids internally instead of the row ids (which may lead to confusion).
It is advised to not work directly with the instance, but instead only use the getters $train_set() and $test_set().
task_hash
(character(1))
The hash of the Task which was passed to r$instantiate().
task_row_hash
(character(1))
The hash of the row ids of the Task which was passed to r$instantiate().
task_nrow
(integer(1))
The number of observations of the Task which was passed to r$instantiate().
duplicated_ids
(logical(1))
If TRUE, duplicated rows can occur within a single training set or within a single test set. E.g., this is TRUE for Bootstrap, and FALSE for cross-validation. Only used internally.
man
(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. Defaults to NA, but can be set by child classes.
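To make the fields above concrete, here is a brief sketch that inspects them on an instantiated bootstrap resampling. It assumes the penguins task from mlr3; the value of 5 repeats is arbitrary.

```r
library(mlr3)

task = tsk("penguins")
r = rsmp("bootstrap", repeats = 5)$instantiate(task)

r$label
r$task_nrow                        # number of rows of the task used for instantiation
identical(r$task_hash, task$hash)  # the task hash is recorded during instantiate()

# Bootstrap samples with replacement, so row ids may repeat within a training set
r$duplicated_ids
anyDuplicated(r$train_set(1)) > 0

# Cross-validation does not produce duplicated row ids within a set
rsmp("cv")$duplicated_ids
```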
Active bindings
id
(character(1))
Identifier of the object. Used in tables, plot and text output.
is_instantiated
(logical(1))
Is TRUE if the resampling has been instantiated.
hash
(character(1))
Hash (unique identifier) for this object. If the object has not been instantiated yet, NA_character_ is returned. The hash is calculated based on the class name, the id, the parameter set, and the instance.
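A short sketch of how the active bindings change around instantiation, assuming the holdout resampling and the penguins task from mlr3:

```r
library(mlr3)

r = rsmp("holdout")
r$id

# Not yet instantiated: no split has been materialized
before = r$is_instantiated

r$instantiate(tsk("penguins"))

# After instantiation the binding flips and a hash is available
after = r$is_instantiated
r$hash
```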
Methods
Method new()
Creates a new instance of this R6 class.
Usage
Resampling$new(
id,
param_set = ps(),
duplicated_ids = FALSE,
label = NA_character_,
man = NA_character_
)
Arguments
id
(character(1))
Identifier for the new instance.
param_set
(paradox::ParamSet)
Set of hyperparameters.
duplicated_ids
(logical(1))
Set to TRUE if this resampling strategy may have duplicated row ids in a single training set or test set.
Note that this object is typically constructed via a derived class, e.g. ResamplingCV or ResamplingHoldout.
label
(character(1))
Label for the new instance.
man
(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. The referenced help page can be opened via method $help().
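Because the base class is abstract, construction normally goes through a derived class. The sketch below assumes the exported generator ResamplingCV from mlr3 and only reads the values that its constructor passes on to Resampling$new().

```r
library(mlr3)

# Construct via a derived class instead of calling Resampling$new() directly
r = ResamplingCV$new()

class(r)           # the derived class inherits from "Resampling"
r$id
r$param_set$ids()  # hyperparameters defined by the derived class
r$man              # manual page reference in the format [pkg]::[topic]

# r$help() would open the referenced manual page in an interactive session
```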
Method instantiate()
Materializes fixed training and test splits for a given task and stores them in r$instance in an arbitrary format.
Arguments
task
(Task)
Task used for instantiation.
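Once instantiated, the splits are fixed and can be read repeatedly through the getters. The sketch below iterates over all splits of a 3-fold cross-validation on the penguins task. It relies on the $iters binding provided by derived classes, which is not documented in this section and is therefore an assumption here.

```r
library(mlr3)

task = tsk("penguins")
r = rsmp("cv", folds = 3)
r$instantiate(task)

# Collect the test set of every iteration via the getter
test_sets = lapply(seq_len(r$iters), function(i) r$test_set(i))
lengths(test_sets)

# In cross-validation, every row id appears in exactly one test set
all_test = sort(unlist(test_sets))
identical(as.integer(all_test), as.integer(sort(task$row_ids)))
```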
Examples
r = rsmp("subsampling")
# Default parametrization
r$param_set$values
#> $ratio
#> [1] 0.6666667
#>
#> $repeats
#> [1] 30
#>
# Do only 3 repeats on 10% of the data
r$param_set$set_values(ratio = 0.1, repeats = 3)
r$param_set$values
#> $ratio
#> [1] 0.1
#>
#> $repeats
#> [1] 3
#>
# Instantiate on penguins task
task = tsk("penguins")
r$instantiate(task)
# Extract train/test sets
train_set = r$train_set(1)
print(train_set)
#> [1] 165 192 216 317 315 320 224 269 73 256 273 253 8 344 268 261 57 306 232
#> [20] 179 163 133 108 37 183 154 316 184 31 308 265 181 334 96
intersect(train_set, r$test_set(1))
#> integer(0)
# Another example: 10-fold CV
r = rsmp("cv")$instantiate(task)
r$train_set(1)
#> [1] 6 17 30 43 67 74 79 88 90 110 122 125 130 149 159 168 172 181
#> [19] 182 183 200 203 232 242 266 276 282 283 286 287 295 298 304 313 331 3
#> [37] 15 26 37 48 53 54 66 71 77 81 98 103 107 115 123 144 145 155
#> [55] 173 178 180 211 217 226 230 234 255 260 271 296 317 322 324 327 7 11
#> [73] 14 16 18 32 36 56 57 61 80 96 120 124 137 141 163 166 198 207
#> [91] 222 223 229 231 236 253 254 265 279 314 325 328 333 336 342 33 45 49
#> [109] 50 64 70 89 94 100 105 109 111 140 143 162 199 213 218 219 221 247
#> [127] 263 264 267 272 285 293 307 308 312 315 318 339 341 5 10 24 41 52
#> [145] 55 73 84 85 104 108 114 117 129 133 135 136 150 151 152 167 175 185
#> [163] 190 224 243 244 248 250 256 257 288 326 343 2 8 22 29 38 44 47
#> [181] 51 62 65 68 78 82 86 99 112 113 116 131 132 134 153 157 170 188
#> [199] 206 233 245 249 258 273 277 290 323 23 28 40 59 60 121 139 146 154
#> [217] 165 171 177 179 193 194 197 208 210 212 214 216 237 251 278 281 291 292
#> [235] 301 302 309 311 321 329 338 1 12 13 25 34 46 69 92 97 119 127
#> [253] 147 160 174 189 192 195 205 220 225 227 241 246 261 268 280 289 294 299
#> [271] 306 316 320 335 340 9 21 27 39 42 63 75 91 95 102 106 126 128
#> [289] 138 161 164 176 184 186 187 202 204 235 238 239 259 297 300 303 310 319
#> [307] 330 332 344
# Stratification
task = tsk("pima")
prop.table(table(task$truth())) # moderately unbalanced
#>
#> pos neg
#> 0.3489583 0.6510417
task$col_roles$stratum = task$target_names
r = rsmp("subsampling")
r$instantiate(task)
prop.table(table(task$truth(r$train_set(1)))) # roughly same proportion
#>
#> pos neg
#> 0.3496094 0.6503906