This introduction shows how to parallelize mlr3 with the packages future and future.apply.

Installation

Make sure you have installed future and future.apply:

if (!requireNamespace("future"))
  install.packages("future")
if (!requireNamespace("future.apply"))
  install.packages("future.apply")

Parallel resampling

The most outer loop in resampling runs independent repetitions of applying a learner on a subset of a task, predict on a different subset and score the performance by comparing true and predicted labels. This loop is what is called embarrassingly parallel.

In the following, we will consider the spam task and a simple classification tree ("classif.rpart") to illustrate the parallelization.

library("mlr3")

task = mlr_tasks$get("spam")
learner = mlr_learners$get("classif.rpart")
resampling = mlr_resamplings$get("subsampling")

system.time(
  resample(task, learner, resampling)
)[3L]

We now use the future package to parallelize the resampling by selecting a backend via the function plan and then repeat the resampling. We use the “multiprocess” backend here which uses threads on linux/mac and a socket cluster on windows:

future::plan("multiprocess")
system.time(
  resample(task, learner, resampling)
)[3L]

On most systems you should see a decrease in the reported real CPU time. On some systems (e.g. windows), the overhead for parallelization is quite large though. Therefore, you should only enable parallelization for experiments which run more than 10s each.

Benchmarking is also parallelized. The following code sends 64 jobs (4 tasks * 16 resampling repeats) to the future backend:

tasks = mlr_tasks$mget(c("iris", "spam", "pima"))
learners = mlr_learners$mget("classif.rpart")
resamplings = mlr_resamplings$mget("subsampling")
resamplings$subsampling$param_vals = list(ratio = 0.8, repeats = 16)

future::plan("multiprocess")
system.time(
  benchmark(expand_grid(tasks, learners, resamplings))
)[3L]