Make sure you have installed
The most outer loop in resampling runs independent repetitions of applying a learner on a subset of a task, predict on a different subset and score the performance by comparing true and predicted labels. This loop is what is called embarrassingly parallel.
In the following, we will consider the spam task and a simple classification tree (
"classif.rpart") to illustrate the parallelization.
We now use the
future package to parallelize the resampling by selecting a backend via the function
plan and then repeat the resampling. We use the “multiprocess” backend here which uses threads on linux/mac and a socket cluster on windows:
On most systems you should see a decrease in the reported real CPU time. On some systems (e.g. windows), the overhead for parallelization is quite large though. Therefore, you should only enable parallelization for experiments which run more than 10s each.
Benchmarking is also parallelized. The following code sends 64 jobs (4 tasks * 16 resampling repeats) to the future backend:
tasks = mlr_tasks$mget(c("iris", "spam", "pima")) learners = mlr_learners$mget("classif.rpart") resamplings = mlr_resamplings$mget("subsampling") resamplings$subsampling$param_vals = list(ratio = 0.8, repeats = 16) future::plan("multiprocess") system.time( benchmark(expand_grid(tasks, learners, resamplings)) )[3L]