This introduction is about resampling and benchmarking.

Objects

Again, we consider the iris task and a simple classification tree here.
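
As a minimal sketch, both objects can be retrieved from the dictionaries via $get(), analogous to the $mget() calls used later in this section:

library(mlr3)

# retrieve the iris task and a classification tree learner from the dictionaries
task = mlr_tasks$get("iris")
learner = mlr_learners$get("classif.rpart")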

Additionally, we need to define how we want to resample. mlr3 comes with the following resampling strategies implemented:
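
The registered strategies can be listed by printing the mlr_resamplings dictionary (a sketch; the exact set of entries depends on the installed mlr3 version):

# overview of all resampling strategies shipped with mlr3
print(mlr_resamplings)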

Additional resampling methods for special use cases will be available via extension packages, such as mlr3spatiotemporal for spatial data (still in development).

The experiment conducted in the train/predict/score part of this introduction is equivalent to a simple “holdout” resampling, so let’s consider this strategy first.
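
A minimal sketch, assuming the strategy is retrieved from the mlr_resamplings dictionary:

# retrieve the holdout strategy; by default, 2/3 of the observations are used for training
resampling = mlr_resamplings$get("holdout")
print(resampling)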

To change the ratio to 0.8, we simply overwrite the slot:
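
A sketch of overwriting the ratio; the exact slot (param_set$values here) is an assumption and has changed between mlr3 versions:

# use 80% of the observations for training, the remaining 20% for testing
resampling$param_set$values = list(ratio = 0.8)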

Resampling

Now, we can pass all created objects to the resample() function to get an object of class ResampleResult:
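
A sketch, using the task, learner, and resampling objects created above:

# fit the learner on each training set and predict on the corresponding test set
rr = resample(task, learner, resampling)
print(rr)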

Before we go into more detail, let’s change the resampling to a 3-fold cross-validation to better illustrate what operations are possible with a resampling result.
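
A sketch; again, the slot used to set the number of folds is an assumption:

# switch to cross-validation with 3 folds and rerun the resampling
resampling = mlr_resamplings$get("cv")
resampling$param_set$values = list(folds = 3)
rr = resample(task, learner, resampling)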

We can do several things with a resampling result, e.g. (see the combined sketch after the list):

  • Extract the performance for the individual resampling iterations:
  • Extract and inspect the now created resampling:
  • Retrieve the experiment of a specific iteration and inspect it:
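
A combined sketch of these three operations; the accessor names (performance(), resampling, experiment()) are assumptions and may differ between mlr3 versions:

# performance values of the individual resampling iterations (accessor name is an assumption)
rr$performance("classif.ce")

# the resampling object that was instantiated during resample(), holding the splits
rr$resampling

# retrieve and inspect the experiment of the first iteration (accessor name is an assumption)
e = rr$experiment(1)
print(e)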

Manual instantiation

If you want to compare multiple learners, you should use the same resampling per task to reduce the variance of the performance estimation. Until now, we have just passed a resampling strategy to resample(), without specifying the actual splits into training and test. Here, we manually instantiate the resampling:
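
A sketch, assuming instantiate() fixes the splits of the 3-fold cross-validation defined above on the task:

# compute and store the train/test splits inside the resampling object
resampling$instantiate(task)
resampling$is_instantiated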

If we now pass this instantiated object to resample(), the pre-calculated training and test splits will be used for both learners:
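
A sketch comparing a featureless baseline with the classification tree on identical splits:

learner_featureless = mlr_learners$get("classif.featureless")
learner_rpart = mlr_learners$get("classif.rpart")

# both calls reuse the splits stored in the instantiated resampling
rr_featureless = resample(task, learner_featureless, resampling)
rr_rpart = resample(task, learner_rpart, resampling)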

We can also combine the created result objects into a BenchmarkResult (see below for an introduction to simple benchmarking):
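
A sketch; the combine() method used here is an assumption, and newer mlr3 versions combine results via c() or as_benchmark_result() instead:

# merge the two resample results into a single BenchmarkResult (method name is an assumption)
bmr = rr_featureless$combine(rr_rpart)
print(bmr)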

Custom resampling

Sometimes it is necessary to perform resampling with custom splits, e.g. to reproduce a study. For this purpose, splits can be manually set for ResamplingCustom:
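
A sketch, assuming ResamplingCustom's instantiate() accepts lists of row indices via train_sets and test_sets:

resampling = mlr_resamplings$get("custom")
resampling$instantiate(task,
  train_sets = list(1:100),   # rows used for training in iteration 1
  test_sets = list(101:150)   # rows used for testing in iteration 1
)
resampling$iters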

Benchmarking

Comparing the performance of different learners across multiple tasks is a common operation, and mlr3 offers the benchmark() function for this purpose. Its interface accepts a design of tasks, learners, and resampling strategies as a data.frame.

Here, we call benchmark() to perform a single holdout split on a single task and two learners:
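
A sketch, using the iris task as an example and assuming the design is assembled as a data.table with one row per task/learner/resampling combination:

library(data.table)

design = data.table(
  task = mlr_tasks$mget("iris"),
  learner = mlr_learners$mget(c("classif.featureless", "classif.rpart")),
  resampling = mlr_resamplings$mget("holdout")
)
bmr_holdout = benchmark(design)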

Note that the holdout splits have been automatically instantiated for each row of the design. As a result, the rpart learner used a different training set than the featureless learner. However, for comparison of learners you usually want the learners to see the same splits into train and test sets. To overcome this issue, the resampling strategy needs to be manually instantiated before creating the design.

While the interface of benchmark() allows full flexibility, the creation of such design tables can be tedious. Therefore, mlr3 provides a helper function to quickly generate design tables and instantiate resampling strategies in an exhaustive grid fashion: mlr3::expand_grid().

# get some example tasks
tasks = mlr_tasks$mget(c("pima", "sonar", "spam"))

# set measures for all tasks: accuracy (acc) and area under the curve (auc)
measures = mlr_measures$mget(c("classif.acc", "classif.auc"))
tasks = lapply(tasks, function(task) { task$measures = measures; task })

# get a featureless learner and a classification tree
learners = mlr_learners$mget(c("classif.featureless", "classif.rpart"))

# let the learners predict probabilities instead of class labels (required for AUC measure)
learners$classif.featureless$predict_type = "prob"
learners$classif.rpart$predict_type = "prob"

# compare via 10-fold cross validation
resamplings = mlr_resamplings$mget("cv")

# create a BenchmarkResult object
design = expand_grid(tasks, learners, resamplings)
print(design)
#>             task                     learner     resampling
#>           <list>                      <list>         <list>
#> 1: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV>
#> 2: <TaskClassif>       <LearnerClassifRpart> <ResamplingCV>
#> 3: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV>
#> 4: <TaskClassif>       <LearnerClassifRpart> <ResamplingCV>
#> 5: <TaskClassif> <LearnerClassifFeatureless> <ResamplingCV>
#> 6: <TaskClassif>       <LearnerClassifRpart> <ResamplingCV>
bmr = benchmark(design)
#> INFO [mlr3] Benchmarking 60 experiments
#> INFO [mlr3] Running learner 'classif.featureless' on task 'pima' (iteration 1/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'pima' (iteration 2/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'pima' (iteration 3/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'pima' (iteration 4/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'pima' (iteration 5/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'pima' (iteration 6/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'pima' (iteration 7/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'pima' (iteration 8/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'pima' (iteration 9/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'pima' (iteration 10/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'pima' (iteration 1/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'pima' (iteration 2/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'pima' (iteration 3/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'pima' (iteration 4/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'pima' (iteration 5/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'pima' (iteration 6/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'pima' (iteration 7/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'pima' (iteration 8/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'pima' (iteration 9/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'pima' (iteration 10/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'sonar' (iteration 1/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'sonar' (iteration 2/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'sonar' (iteration 3/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'sonar' (iteration 4/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'sonar' (iteration 5/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'sonar' (iteration 6/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'sonar' (iteration 7/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'sonar' (iteration 8/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'sonar' (iteration 9/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'sonar' (iteration 10/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'sonar' (iteration 1/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'sonar' (iteration 2/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'sonar' (iteration 3/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'sonar' (iteration 4/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'sonar' (iteration 5/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'sonar' (iteration 6/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'sonar' (iteration 7/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'sonar' (iteration 8/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'sonar' (iteration 9/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'sonar' (iteration 10/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'spam' (iteration 1/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'spam' (iteration 2/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'spam' (iteration 3/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'spam' (iteration 4/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'spam' (iteration 5/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'spam' (iteration 6/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'spam' (iteration 7/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'spam' (iteration 8/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'spam' (iteration 9/10)'
#> INFO [mlr3] Running learner 'classif.featureless' on task 'spam' (iteration 10/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'spam' (iteration 1/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'spam' (iteration 2/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'spam' (iteration 3/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'spam' (iteration 4/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'spam' (iteration 5/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'spam' (iteration 6/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'spam' (iteration 7/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'spam' (iteration 8/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'spam' (iteration 9/10)'
#> INFO [mlr3] Running learner 'classif.rpart' on task 'spam' (iteration 10/10)'
#> INFO [mlr3] Finished benchmark

The aggregated resampling results can be accessed with:
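
A sketch; the aggregate() accessor and its use of the measures defined above are assumptions:

# one aggregated score per task/learner/resampling combination
bmr$aggregate(measures)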

We can aggregate it further, e.g. if we are interested in which learner performed best across all tasks:
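
A sketch using data.table grouping on the aggregated scores; the column names (learner_id, classif.acc, classif.auc) are assumptions:

agg = bmr$aggregate(measures)

# average the per-task scores for each learner
agg[, list(mean_acc = mean(classif.acc), mean_auc = mean(classif.auc)), by = "learner_id"]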

Unsurprisingly, the classification tree outperformed the featureless learner.

Converting specific benchmark objects to resample objects

As a BenchmarkResult object is basically a collection of multiple ResampleResult objects, we can extract specific ResampleResult objects using the stored hashes:
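
A sketch; the resample_results table and the resample_result() extractor are assumptions that differ between mlr3 versions:

# overview of the stored resample results, including their hashes
rrs = bmr$resample_results
print(rrs)

# extract the ResampleResult belonging to the first hash
rr = bmr$resample_result(rrs$hash[1])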

We can now investigate this resample result and even single experiments using the previously introduced API:
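
For example (accessor names are assumptions, matching the sketches above):

# aggregated performance of the extracted result and the model of a single experiment
rr$aggregated
rr$experiment(1)$model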