This introduction is about resampling and benchmarking.

## Objects

Again, we consider the iris task and a simple classification tree here.

library(mlr3)
task = mlr_tasks$get("iris")
learner = mlr_learners$get("classif.rpart")

Additionally, we need to define how we want to resample. mlr3 ships with the following predefined resampling strategies:

mlr_resamplings$keys()
#> [1] "bootstrap"   "custom"      "cv"          "holdout"     "repeated_cv"
#> [6] "subsampling"

Additional resampling methods for special use cases will be available via extension packages, such as mlr3spatiotemporal for spatial data (still in development). The experiment conducted in the introduction on train/predict/score is equivalent to a simple “holdout”, so let’s consider this one first.

resampling = mlr_resamplings$get("holdout")
print(resampling)
#> <ResamplingHoldout> with 1 iterations
#> Instantiated: FALSE
#> Parameters: ratio=0.6667
#>
#> Public: clone, duplicated_ids, format, hash, id, instance,
#>   instantiate, is_instantiated, iters, param_set, stratify,
print(resampling$param_set$values)
#> $ratio
#> [1] 0.6666667

To change the ratio to 0.8, we simply overwrite the slot:

resampling$param_set$values = list(ratio = 0.8)

## Resampling

Now, we can pass all created objects to the resample() function to get an object of class ResampleResult:

rr = resample(task, learner, resampling)
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris' (iteration 1/1)
print(rr)
#> <ResampleResult> of learner 'classif.rpart' on task 'iris' with 1 iterations
#>       Measure    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. Sd
#>  classif.mmce 0.03333 0.03333 0.03333 0.03333 0.03333 0.03333 NA

Before we go into more detail, let’s change the resampling to a 3-fold cross-validation to better illustrate what operations are possible with a resampling result.

resampling = mlr_resamplings$get("cv", param_vals = list(folds = 3L))
rr = resample(task, learner, resampling)
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris' (iteration 1/3)
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris' (iteration 2/3)
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris' (iteration 3/3)
print(rr)
#> <ResampleResult> of learner 'classif.rpart' on task 'iris' with 3 iterations
#>       Measure Min. 1st Qu. Median    Mean 3rd Qu. Max.      Sd
#>  classif.mmce 0.06    0.07   0.08 0.07333    0.08 0.08 0.01155
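The mean shown in this summary can also be retrieved programmatically: the $aggregated field (used again in the benchmarking section below) returns the performance aggregated over all iterations:

rr$aggregated
# named numeric vector, one entry per measure; here the mean classif.mmce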

We can do different things with resampling results, e.g.:

• Extract the performance for the individual resampling iterations:
rr$performance("classif.mmce")
#> [1] 0.06 0.08 0.08
• Extract and inspect the instantiated resampling:
rr$resampling
#> <ResamplingCV> with 3 iterations
#> Instantiated: TRUE
#> Parameters: folds=3
#>
#> Public: clone, duplicated_ids, format, hash, id, instance,
#>   instantiate, is_instantiated, iters, param_set, stratify,
rr$resampling$iters
#> [1] 3
rr$resampling$test_set(1)
#>  [1]   4   5   6   7   8  10  11  12  13  15  21  24  26  28  30  32  36
#> [18]  39  44  45  46  47  52  64  65  70  76  79  89  91  94  95  96  97
#> [35]  98 111 112 120 124 127 129 131 132 133 136 139 142 143 144 147
rr$resampling$test_set(2)
#>  [1]   1   3  14  16  17  22  34  38  40  50  53  54  58  61  62  66  67
#> [18]  68  69  71  72  73  74  77  83  85  86  87  88  90  92  99 100 101
#> [35] 104 105 106 107 109 115 116 118 119 121 123 130 134 141 148 150
rr$resampling$test_set(3)
#>  [1]   2   9  18  19  20  23  25  27  29  31  33  35  37  41  42  43  48
#> [18]  49  51  55  56  57  59  60  63  75  78  80  81  82  84  93 102 103
#> [35] 108 110 113 114 117 122 125 126 128 135 137 138 140 145 146 149
• Retrieve the experiment of a specific iteration and inspect it:
e = rr$experiment(iter = 1)
e$model
#> n= 100
#>
#> node), split, n, loss, yval, (yprob)
#>       * denotes terminal node
#>
#>  1) root 100 63 versicolor (0.28000000 0.37000000 0.35000000)
#>    2) Petal.Length< 2.45 28  0 setosa (1.00000000 0.00000000 0.00000000) *
#>    3) Petal.Length>=2.45 72 35 versicolor (0.00000000 0.51388889 0.48611111)
#>      6) Petal.Length< 4.85 34  1 versicolor (0.00000000 0.97058824 0.02941176) *
#>      7) Petal.Length>=4.85 38  4 virginica (0.00000000 0.10526316 0.89473684)
#>       14) Petal.Width< 1.75 7  3 versicolor (0.00000000 0.57142857 0.42857143) *
#>       15) Petal.Width>=1.75 31  0 virginica (0.00000000 0.00000000 1.00000000) *
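
Besides the fitted model, an experiment also exposes the row indices used in the corresponding iteration, e.g. via $train_set (the same accessor appears again below when we compare learners):

head(e$train_set)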

## Manual instantiation

If you want to compare multiple learners, you should use the same resampling per task to reduce the variance of the performance estimates. Until now, we have only passed a resampling strategy to resample(), without specifying the actual splits into training and test sets. Here, we manually instantiate the resampling:

resampling = mlr_resamplings$get("cv", param_vals = list(folds = 3L)) resampling$instantiate(task)
resampling$iters #> [1] 3 resampling$train_set(1)
#>   [1]   2   3   8  11  14  16  17  21  22  25  29  30  35  39  43  53  62
#>  [18]  63  66  67  68  70  71  74  75  76  79  80  81  82  83  88  92  93
#>  [35]  94 100 101 107 110 114 121 123 124 125 127 134 138 141 142 145   1
#>  [52]   4   5   7  10  18  20  24  32  36  40  42  44  47  48  49  50  52
#>  [69]  56  57  61  65  69  73  85  86  90  95  96  98 102 103 104 109 111
#>  [86] 116 117 120 128 130 132 136 139 140 143 144 147 148 149 150
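
As a quick check, the is_instantiated flag (listed among the public members printed above) reports whether the splits have been materialized:

resampling$is_instantiated
# TRUE after the call to instantiate()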

If we now pass this instantiated object to resample(), the pre-calculated training and test splits will be used for both learners:

learner1 = mlr_learners$get("classif.rpart")        # simple classification tree
learner2 = mlr_learners$get("classif.featureless")  # featureless learner, predicting the majority class
rr1 = resample(task, learner1, resampling)
rr2 = resample(task, learner2, resampling)
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris' (iteration 1/3)
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris' (iteration 2/3)
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris' (iteration 3/3)
#> INFO [mlr3] Running learner 'classif.featureless' on task 'iris' (iteration 1/3)
#> INFO [mlr3] Running learner 'classif.featureless' on task 'iris' (iteration 2/3)
#> INFO [mlr3] Running learner 'classif.featureless' on task 'iris' (iteration 3/3)

setequal(rr1$experiment(1)$train_set, rr2$experiment(1)$train_set)
#> [1] TRUE
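
Since both learners operated on identical splits, their per-iteration performances are directly comparable:

rr1$performance("classif.mmce")
rr2$performance("classif.mmce")
# one score per fold, computed on the same test sets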

We can also combine the created result objects into a BenchmarkResult (see below for an introduction to simple benchmarking):

bmr = rr1$combine(rr2)
bmr$aggregated(objects = FALSE)
#>                hash resampling_id task_id          learner_id classif.mmce
#>              <char>        <char>  <char>              <char>        <num>
#> 1: 5ff5e1a526d2f716            cv    iris       classif.rpart   0.05333333
#> 2: 9b120d237dd7f6ee            cv    iris classif.featureless   0.71333333

## Custom resampling

Sometimes it is necessary to perform resampling with custom splits, e.g. to reproduce a study. For this purpose, splits can be manually set for ResamplingCustom:

resampling = mlr_resamplings$get("custom") resampling$instantiate(task,
list(c(1:10, 51:60, 101:110)),
list(c(11:20, 61:70, 111:120))
)
resampling$iters
#> [1] 1
resampling$train_set(1)
#>  [1]   1   2   3   4   5   6   7   8   9  10  51  52  53  54  55  56  57
#> [18]  58  59  60 101 102 103 104 105 106 107 108 109 110
resampling$test_set(1)
#>  [1]  11  12  13  14  15  16  17  18  19  20  61  62  63  64  65  66  67
#> [18]  68  69  70 111 112 113 114 115 116 117 118 119 120

## Benchmarking

Comparing the performance of different learners on multiple tasks is a recurring task. mlr3 offers the benchmark() function for convenience. It accepts a design of tasks, learners, and resampling strategies as a data.frame. Here, we call benchmark() to perform a single holdout split on a single task with two learners:

library(data.table)
design = data.table(
  task = mlr_tasks$mget("iris"),
  learner = mlr_learners$mget(c("classif.rpart", "classif.featureless")),
  resampling = mlr_resamplings$mget("holdout")
)
print(design)
#>             task                     learner          resampling
#>           <list>                      <list>              <list>
#> 1: <TaskClassif>       <LearnerClassifRpart> <ResamplingHoldout>
#> 2: <TaskClassif> <LearnerClassifFeatureless> <ResamplingHoldout>
bmr = benchmark(design)
#> INFO [mlr3] Benchmarking 2 experiments
#> INFO [mlr3] Running learner 'classif.rpart' on task 'iris' (iteration 1/1)
#> INFO [mlr3] Running learner 'classif.featureless' on task 'iris' (iteration 1/1)
#> INFO [mlr3] Finished benchmark

Note that the holdout splits have been automatically instantiated for each row of the design. As a result, the rpart learner used a different training set than the featureless learner. However, for a fair comparison you usually want the learners to see exactly the same splits into train and test sets. To achieve this, the resampling strategy needs to be manually instantiated before creating the design.
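
A minimal sketch of this approach, reusing only the accessors introduced above (data.table recycles the length-1 list columns over both learners, and benchmark() is assumed to respect the already instantiated resampling, as described above):

task = mlr_tasks$get("iris")
resampling = mlr_resamplings$get("cv", param_vals = list(folds = 3L))
resampling$instantiate(task)  # fix the splits before building the design

design = data.table(
  task = list(task),
  learner = mlr_learners$mget(c("classif.rpart", "classif.featureless")),
  resampling = list(resampling)  # both rows now share the same folds
)
bmr = benchmark(design)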

While the interface of benchmark() allows full flexibility, the creation of such design tables can be tedious. Therefore, mlr3 provides a helper function to quickly generate design tables and instantiate resampling strategies in an exhaustive grid fashion: mlr3::expand_grid().

# get some example tasks
tasks = mlr_tasks$mget(c("pima", "sonar", "spam"))

# set measures for all tasks: accuracy (acc) and area under the curve (auc)
measures = mlr_measures$mget(c("classif.acc", "classif.auc"))
tasks = lapply(tasks, function(task) { task$measures = measures; task })

# get a featureless learner and a classification tree
learners = mlr_learners$mget(c("classif.featureless", "classif.rpart"))

# let the learners predict probabilities instead of class labels (required for AUC measure)
learners$classif.featureless$predict_type = "prob"
learners$classif.rpart$predict_type = "prob"

# compare via 10-fold cross validation
resamplings = mlr_resamplings$mget("cv")

# create a BenchmarkResult object
design = expand_grid(tasks, learners, resamplings)
print(design)
#>             task                     learner     resampling
#>           <list>                      <list>         <list>
#> 1: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV>
#> 2: <TaskClassif>       <LearnerClassifRpart> <ResamplingCV>
#> 3: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV>
#> 4: <TaskClassif>       <LearnerClassifRpart> <ResamplingCV>
#> 5: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV>
#> 6: <TaskClassif>       <LearnerClassifRpart> <ResamplingCV>

bmr = benchmark(design)
#> INFO [mlr3] Benchmarking 60 experiments
#> INFO [mlr3] Running learner 'classif.featureless' on task 'pima' (iteration 1/10)
#> INFO [mlr3] Running learner 'classif.featureless' on task 'pima' (iteration 2/10)
#> ... (analogous progress messages for the remaining experiments omitted)
#> INFO [mlr3] Running learner 'classif.rpart' on task 'spam' (iteration 10/10)
#> INFO [mlr3] Finished benchmark

The aggregated resampling results can be accessed with:

bmr$aggregated(objects = FALSE)
#>                hash resampling_id task_id          learner_id classif.acc
#>              <char>        <char>  <char>              <char>       <num>
#> 1: ba2947f69f5352f8            cv    pima classif.featureless   0.6511449
#> 2: 67b52f854f359b40            cv    pima       classif.rpart   0.7604238
#> 3: 904e6745fca9f36e            cv   sonar classif.featureless   0.5340476
#> 4: 268df21d6bc39ace            cv   sonar       classif.rpart   0.6819048
#> 5: 250fcab166d1028b            cv    spam classif.featureless   0.6059582
#> 6: 804918033b403e5f            cv    spam       classif.rpart   0.8898005
#>    classif.auc
#>          <num>
#> 1:   0.5000000
#> 2:   0.8022679
#> 3:   0.5000000
#> 4:   0.7536219
#> 5:   0.5000000
#> 6:   0.8946978

We can aggregate the results further, e.g. if we are interested in which learner performed best across all tasks:

bmr$aggregated(objects = FALSE)[, list(acc = mean(classif.acc), auc = mean(classif.auc)), by = "learner_id"]
#>             learner_id       acc       auc
#>                 <char>     <num>     <num>
#> 1: classif.featureless 0.5970502 0.5000000
#> 2:       classif.rpart 0.7773764 0.8168625

Unsurprisingly, the classification tree outperformed the featureless learner.

### Converting specific benchmark objects to resample objects

As a BenchmarkResult object is essentially a collection of multiple ResampleResult objects, we can extract specific ResampleResult objects using the stored hashes:

tab = bmr$aggregated(objects = FALSE)[task_id == "spam" & learner_id == "classif.rpart"]
print(tab)
#>                hash resampling_id task_id    learner_id classif.acc
#>              <char>        <char>  <char>        <char>       <num>
#> 1: 804918033b403e5f            cv    spam classif.rpart   0.8898005
#>    classif.auc
#>          <num>
#> 1:   0.8946978

rr = bmr$resample_result(tab$hash)
print(rr)
#> <ResampleResult> of learner 'classif.rpart' on task 'spam' with 10 iterations
#>      Measure   Min. 1st Qu. Median   Mean 3rd Qu.   Max.      Sd
#>  classif.acc 0.8457  0.8842 0.8924 0.8898  0.9000 0.9217 0.02386
#>  classif.auc 0.8354  0.8858 0.8953 0.8947  0.9092 0.9251 0.02595

We can now investigate this resample result and even single experiments using the previously introduced API:

rr$aggregated
#> classif.acc classif.auc
#>   0.8898005   0.8946978

# get the iteration with the worst AUC
worst = as.data.table(rr)[which.min(classif.auc), c("iteration", "classif.auc")]
print(worst)
#>    iteration classif.auc
#>        <int>       <num>
#> 1:        10   0.8354365

# get the corresponding experiment
e = rr$experiment(worst$iteration)
print(e)
#> <Experiment> [scored]:
#>   validation_set
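
From here, the usual accessors introduced earlier apply, e.g. inspecting the model fitted in this worst iteration:

e$model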