This is the abstract base class for resampling objects like ResamplingCV and ResamplingBootstrap.
The objects of this class define how a task is partitioned for resampling (e.g., in resample() or benchmark()), using a set of hyperparameters such as the number of folds in cross-validation.
Resampling objects can be instantiated on a Task, which applies the strategy to the task and results in a fixed partition of the row_ids of the Task.
Predefined resamplings are stored in the dictionary mlr_resamplings, e.g. cv or bootstrap.
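As a minimal sketch of how the dictionary is typically used (assuming the mlr3 package is installed; the choice of 3 folds is purely illustrative), a predefined resampling can be looked up by its key and configured in one step:

```r
library(mlr3)

# List the keys of all resamplings currently registered in the dictionary
keys = mlr_resamplings$keys()
print(keys)

# Retrieve the cross-validation strategy and set a hyperparameter on construction
r = rsmp("cv", folds = 3)
print(r$param_set$values)

# The object is only a description of the strategy until it is instantiated on a Task
print(r$is_instantiated)
```

The resampling stays independent of any data until it is instantiated, so the same configured object can be cloned and reused across tasks.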
Stratification
All derived classes support stratified sampling.
The stratification variables are assumed to be discrete and must be stored in the Task with column role "stratum".
In case of multiple stratification variables, each combination of the values of the stratification variables forms a stratum.
First, the observations are divided into subpopulations based on one or more stratification variables (assumed to be discrete), cf. task$strata.
Second, the sampling is performed in each of the k subpopulations separately.
Each subgroup is divided into iter training sets and iter test sets by the derived Resampling.
These sets are merged based on their iteration number:
all training sets from all subpopulations with iteration 1 are combined, then all training sets with iteration 2, and so on.
The same is done for all test sets.
The merged sets can be accessed via $train_set(i) and $test_set(i), respectively.
Note that this procedure can lead to set sizes that are slightly different from those
without stratification.
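The two-step procedure can be illustrated with a small sketch. The data frame, its column names (y, site, x), and the number of folds are hypothetical and chosen only so that every stratum divides evenly; this is an illustration under those assumptions, not a prescribed workflow.

```r
library(mlr3)
set.seed(1)

# Hypothetical toy data: 20 "pos" and 40 "neg" observations, two sites alternating
dat = data.frame(
  y = factor(rep(c("pos", "neg"), times = c(20, 40))),
  site = factor(rep(c("s1", "s2"), times = 30)),
  x = rnorm(60)
)
task = as_task_classif(dat, target = "y")

# Two stratification variables: each combination of y and site forms one stratum
task$col_roles$stratum = c("y", "site")
strata = task$strata
print(strata)

# Sampling happens within each stratum; the per-stratum sets are merged per iteration
r = rsmp("cv", folds = 5)
r$instantiate(task)
test_1 = r$test_set(1)
print(table(dat$y[test_1]))
```

Here the four strata hold 10, 10, 20, and 20 observations, so each of the 5 folds draws 2, 2, 4, and 4 rows from them. When stratum sizes are not multiples of the number of folds, the merged sets can differ slightly in size from an unstratified split.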
Grouping / Blocking
All derived classes support grouping of observations.
The grouping variable is assumed to be discrete and must be stored in the Task with column role "group".
Observations in the same group are treated like a "block" of observations which must be kept together. These observations either all go together into the training set or together into the test set.
The sampling is performed by the derived Resampling on the grouping variable.
Next, the grouping information is replaced with the respective row ids to generate training and test sets.
The sets can be accessed via $train_set(i) and $test_set(i), respectively.
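A short sketch of the blocking behaviour described above, using a hypothetical toy data frame with a column named block that holds the group ids (the data and names are assumptions for illustration only):

```r
library(mlr3)
set.seed(1)

# Hypothetical toy data: 12 observations in 4 blocks of 3
dat = data.frame(
  y = rnorm(12),
  x = rnorm(12),
  block = rep(c("a", "b", "c", "d"), each = 3)
)
task = as_task_regr(dat, target = "y")

# Assign the column role "group" so that blocks are kept together
task$set_col_roles("block", roles = "group")

r = rsmp("cv", folds = 2)
r$instantiate(task)

# Map the row ids of the first split back to their groups
groups = task$groups
train_groups = unique(groups$group[groups$row_id %in% r$train_set(1)])
test_groups = unique(groups$group[groups$row_id %in% r$test_set(1)])

# No group should appear on both sides of the split
print(intersect(train_groups, test_groups))
```

Because the folds are drawn over groups rather than rows, the row counts of the resulting sets depend on the group sizes.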
See also
Chapter in the mlr3book: https://mlr3book.mlr-org.com/chapters/chapter3/evaluation_and_benchmarking.html#sec-resampling
Package mlr3spatiotempcv for spatio-temporal resamplings.
Dictionary of Resamplings: mlr_resamplings
as.data.table(mlr_resamplings) for a table of available Resamplings in the running session (depending on the loaded packages).
mlr3spatiotempcv for additional Resamplings for spatio-temporal tasks.
Other Resampling: mlr_resamplings, mlr_resamplings_bootstrap, mlr_resamplings_custom, mlr_resamplings_custom_cv, mlr_resamplings_cv, mlr_resamplings_holdout, mlr_resamplings_insample, mlr_resamplings_loo, mlr_resamplings_repeated_cv, mlr_resamplings_subsampling
Public fields
label
(character(1))
Label for this object. Can be used in tables, plot and text output instead of the ID.
param_set
(paradox::ParamSet)
Set of hyperparameters.
instance
(any)
During instantiate(), the instance is stored in this slot in an arbitrary format. Note that if a grouping variable is present in the Task, a Resampling may operate on the group ids internally instead of the row ids (which may lead to confusion).
It is advised to not work directly with the instance, but instead only use the getters $train_set() and $test_set().
task_hash
(character(1))
The hash of the Task which was passed to r$instantiate().
task_nrow
(integer(1))
The number of observations of the Task which was passed to r$instantiate().
duplicated_ids
(logical(1))
If TRUE, duplicated rows can occur within a single training set or within a single test set. E.g., this is TRUE for Bootstrap, and FALSE for cross-validation. Only used internally.
man
(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. Defaults to NA, but can be set by child classes.
Active bindings
id
(character(1))
Identifier of the object. Used in tables, plot and text output.
is_instantiated
(logical(1))
Is TRUE if the resampling has been instantiated.
hash
(character(1))
Hash (unique identifier) for this object. If the object has not been instantiated yet, NA_character_ is returned. The hash is calculated based on the class name, the id, the parameter set, and the instance.
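The fields and bindings above can be inspected before and after instantiation. The following is a minimal sketch, assuming the penguins task that ships with mlr3; the variable names are illustrative.

```r
library(mlr3)

r = rsmp("holdout")

# Before instantiation no split exists yet
before = r$is_instantiated
print(before)

task = tsk("penguins")
r$instantiate(task)

# After instantiation the object records which task it was instantiated on
after = r$is_instantiated
print(after)
print(r$task_nrow)
same_task = identical(r$task_hash, task$hash)
print(same_task)

# Access the split through the getters rather than through r$instance
n_train = length(r$train_set(1))
n_test = length(r$test_set(1))
print(c(train = n_train, test = n_test))
```

Comparing task_hash against the hash of a task is one way to check that an instantiated resampling matches the task it is about to be used with.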
Methods
Method new()
Creates a new instance of this R6 class.
Usage
Resampling$new(
id,
param_set = ps(),
duplicated_ids = FALSE,
label = NA_character_,
man = NA_character_
)
Arguments
id
(character(1))
Identifier for the new instance.
param_set
(paradox::ParamSet)
Set of hyperparameters.
duplicated_ids
(logical(1))
Set to TRUE if this resampling strategy may have duplicated row ids in a single training set or test set.
Note that this object is typically constructed via a derived class, e.g. ResamplingCV or ResamplingHoldout.
label
(character(1))
Label for the new instance.
man
(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. The referenced help page can be opened via method $help().
Method instantiate()
Materializes fixed training and test splits for a given task and stores them in r$instance in an arbitrary format.
Arguments
task
(Task)
Task used for instantiation.
Examples
r = rsmp("subsampling")
# Default parametrization
r$param_set$values
#> $ratio
#> [1] 0.6666667
#>
#> $repeats
#> [1] 30
#>
# Do only 3 repeats on 10% of the data
r$param_set$values = list(ratio = 0.1, repeats = 3)
r$param_set$values
#> $ratio
#> [1] 0.1
#>
#> $repeats
#> [1] 3
#>
# Instantiate on penguins task
task = tsk("penguins")
r$instantiate(task)
# Extract train/test sets
train_set = r$train_set(1)
print(train_set)
#> [1] 42 75 33 106 32 249 85 300 190 96 40 79 224 199 270 304 89 259 279
#> [20] 343 317 123 318 195 255 188 49 284 205 337 233 307 31 230
intersect(train_set, r$test_set(1))
#> integer(0)
# Another example: 10-fold CV
r = rsmp("cv")$instantiate(task)
r$train_set(1)
#> [1] 10 12 15 22 79 97 104 126 133 135 136 137 139 141 159 166 172 183
#> [19] 184 186 188 225 228 237 250 253 264 271 275 279 298 312 326 329 340 5
#> [37] 6 20 28 42 47 58 72 80 81 82 107 117 119 123 130 164 168 169
#> [55] 185 191 197 209 215 227 234 244 249 266 270 276 280 300 311 316 4 7
#> [73] 16 33 48 51 52 54 59 67 70 88 103 121 138 155 173 176 179 196
#> [91] 201 204 239 261 267 284 290 299 304 305 308 328 330 337 339 19 34 35
#> [109] 44 77 84 85 90 92 93 109 112 116 124 145 158 162 200 203 206 207
#> [127] 212 213 221 226 241 245 274 277 287 289 301 310 322 2 23 24 32 41
#> [145] 61 71 74 83 101 105 106 128 143 144 149 153 154 193 195 218 220 230
#> [163] 247 248 257 268 272 286 303 313 314 324 332 18 25 31 40 50 62 65
#> [181] 69 96 114 122 125 131 140 156 165 174 175 194 217 251 258 263 269 281
#> [199] 285 288 294 297 327 331 333 334 343 17 27 29 56 57 89 94 98 100
#> [217] 102 108 120 146 147 170 181 182 205 211 216 219 222 223 231 256 265 282
#> [235] 291 295 309 319 341 342 344 14 21 30 38 39 45 53 60 66 73 75
#> [253] 95 111 132 163 171 199 202 232 235 236 242 252 254 262 283 293 296 302
#> [271] 307 318 321 323 325 3 8 9 36 46 49 63 64 76 78 110 113 129
#> [289] 134 142 148 157 160 161 180 187 189 192 224 229 240 243 255 278 306 317
#> [307] 320 336 338
# Stratification
task = tsk("pima")
prop.table(table(task$truth())) # moderately unbalanced
#>
#> pos neg
#> 0.3489583 0.6510417
task$col_roles$stratum = task$target_names
r = rsmp("subsampling")
r$instantiate(task)
prop.table(table(task$truth(r$train_set(1)))) # roughly same proportion
#>
#> pos neg
#> 0.3496094 0.6503906