This is the result container object returned by benchmark().
A BenchmarkResult consists of the data of multiple ResampleResults.
The contents of a BenchmarkResult and a ResampleResult are almost identical, and the stored ResampleResults can be extracted via the $resample_result(i) method, where i is the index of the performed resample experiment.
This allows us to investigate the extracted ResampleResult and individual resampling iterations, as well as the predictions and models from each fold.
BenchmarkResults can be visualized via mlr3viz's autoplot() function.
For statistical analysis of benchmark results and more advanced plots, see mlr3benchmark.
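For instance, a box plot of performance scores per task/learner combination can be drawn directly from a BenchmarkResult. A minimal sketch, assuming the mlr3viz package is installed:
library(mlr3viz)
design = benchmark_grid(
  tsk("sonar"),
  lrns(c("classif.rpart", "classif.featureless")),
  rsmp("cv", folds = 3)
)
bmr = benchmark(design)
# one box plot of misclassification errors per task/learner combination
autoplot(bmr, measure = msr("classif.ce"))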
Note
All stored objects are accessed by reference. Do not modify any extracted object without cloning it first.
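For example, to modify a stored Task safely, clone it first (a minimal sketch; bmr denotes a BenchmarkResult as created in the Examples below):
# stored objects are returned by reference; clone before mutating
task = bmr$tasks$task[[1]]$clone(deep = TRUE)
task$id = "sonar_copy" # the Task inside bmr is unaffected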
S3 Methods
as.data.table(rr, ..., reassemble_learners = TRUE, convert_predictions = TRUE, predict_sets = "test", task_characteristics = FALSE)
BenchmarkResult -> data.table::data.table()
Returns a tabular view of the internal data.
c(...)
(BenchmarkResult, ...) -> BenchmarkResult
Combines multiple objects convertible to BenchmarkResult into a new BenchmarkResult.
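A minimal sketch of both methods, assuming bmr and bmr2 are BenchmarkResults of the same task type:
tab = as.data.table(bmr, measures = "classif.ce") # one row per resampling iteration
bmr_all = c(bmr, bmr2) # new BenchmarkResult holding the results of both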
See also
Chapter in the mlr3book: https://mlr3book.mlr-org.com/chapters/chapter3/evaluation_and_benchmarking.html#sec-benchmarking
Package mlr3viz for some generic visualizations.
mlr3benchmark for post-hoc analysis of benchmark results.
Other benchmark:
benchmark(),
benchmark_grid()
Active bindings
task_type
(character(1))
Task type of objects in the BenchmarkResult. All stored objects (Task, Learner, Prediction) in a single BenchmarkResult are required to have the same task type, e.g., "classif" or "regr". This is NA for empty BenchmarkResults.
tasks
(data.table::data.table())
Table of included Tasks with three columns: "task_hash" (character(1)), "task_id" (character(1)), and "task" (Task).
learners
(data.table::data.table())
Table of included Learners with three columns: "learner_hash" (character(1)), "learner_id" (character(1)), and "learner" (Learner).
Note that it is not feasible to access learned models via this field, as the training task would be ambiguous. For this reason, the returned learners are reset before they are returned. Instead, select a row from the table returned by $score().
resamplings
(data.table::data.table())
Table of included Resamplings with three columns: "resampling_hash" (character(1)), "resampling_id" (character(1)), and "resampling" (Resampling).
resample_results
(data.table::data.table())
Returns a table with two columns: "uhash" (character()) and "resample_result" (ResampleResult).
n_resample_results
(integer(1))
Returns the total number of stored ResampleResults.
uhashes
(character())
Set of (unique) hashes of all included ResampleResults.
uhash_table
(data.table::data.table())
Table with columns uhash, learner_id, task_id and resampling_id.
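A quick sketch of inspecting these bindings (bmr as created in the Examples below):
bmr$task_type # e.g. "classif"
bmr$n_resample_results # number of stored ResampleResults
bmr$uhash_table # maps each uhash to learner_id, task_id and resampling_id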
Methods
Method new()
Creates a new instance of this R6 class.
Usage
BenchmarkResult$new(data = NULL)
Arguments
data
(ResultData)
An object of type ResultData, either extracted from another ResampleResult, another BenchmarkResult, or manually constructed with as_result_data().
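Manual construction is rarely needed, as benchmark() returns a fully populated BenchmarkResult. A minimal sketch of the constructor:
bmr = BenchmarkResult$new() # empty result (data = NULL)
bmr$n_resample_results # 0
bmr$task_type # NA for empty BenchmarkResults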
Method format()
Helper for print outputs.
Method combine()
Fuses a second BenchmarkResult into itself, mutating the BenchmarkResult in-place.
If the second BenchmarkResult bmr is NULL, simply returns self.
Note that you can alternatively use the combine function c(), which calls this method internally.
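A minimal sketch, assuming bmr1 and bmr2 share the same task type:
bmr1$combine(bmr2) # bmr1 now also holds the resample results of bmr2
bmr_all = c(bmr1, bmr2) # non-mutating alternative: builds a new BenchmarkResult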
Method marshal()
Marshals all stored models.
Arguments
...
(any)
Additional arguments passed to marshal_model().
Method unmarshal()
Unmarshals all stored models.
Arguments
...
(any)
Additional arguments passed to unmarshal_model().
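Marshaling is mainly relevant for serialization, e.g. when saving a BenchmarkResult to disk. A minimal sketch:
path = tempfile(fileext = ".rds")
bmr$marshal() # encode stored models in a serializable form
saveRDS(bmr, path)
bmr2 = readRDS(path)
bmr2$unmarshal() # restore the original model objects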
Method score()
Returns a table with one row for each resampling iteration, including all involved objects: Task, Learner, Resampling, iteration number (integer(1)), and Prediction. If ids is set to TRUE, extracted ids are added to the table as character columns for convenient filtering: "task_id", "learner_id", and "resampling_id".
Additionally calculates the provided performance measures and binds the performance scores as extra columns. These columns are named using the id of the respective Measure.
Arguments
measures
(Measure | list of Measure)
Measure(s) to calculate.
ids
(logical(1))
Adds object ids ("task_id", "learner_id", "resampling_id") as extra character columns to the returned table.
conditions
(logical(1))
Adds condition messages ("warnings", "errors") as extra list columns of character vectors to the returned table.
predictions
(logical(1))
Additionally returns prediction objects, one column for each predict_set of all learners combined. Columns are named "prediction_train", "prediction_test" and "prediction_internal_valid", if present.
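A minimal sketch of scoring with an explicit measure:
scores = bmr$score(msr("classif.acc"), conditions = TRUE)
scores[, .(task_id, learner_id, iteration, classif.acc)] # one row per iteration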
Method obs_loss()
Calculates the observation-wise loss via the loss function set in the Measure's field obs_loss.
Returns a data.table() with the columns row_ids, truth, response and one additional numeric column for each measure, named with the respective measure id.
If there is no observation-wise loss function for the measure, the column is filled with NA values.
Note that some measures, such as RMSE, do have an $obs_loss, but require an additional transformation after aggregation, in this example taking the square root.
Arguments
measures
(Measure | list of Measure)
Measure(s) to calculate.
predict_sets
(character())
The predict sets.
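A minimal sketch, assuming the measure defines an observation-wise loss (as classif.ce does):
losses = bmr$obs_loss(msr("classif.ce"))
head(losses) # row_ids, truth, response and one column per measure id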
Method aggregate()
Returns a result table where resampling iterations are combined into ResampleResults. A column with the aggregated performance score is added for each Measure, named with the id of the respective measure.
The method for aggregation is controlled by the Measure, e.g. micro aggregation, macro aggregation or custom aggregation. Most measures default to macro aggregation.
Note that the aggregated performances just give a quick impression of which approaches work well and which are probably underperforming. However, the aggregates do not account for variance and cannot replace a statistical test. See mlr3viz to get a better impression via boxplots or mlr3benchmark for critical difference plots and significance tests.
For convenience, different flags can be set to extract more information from the returned ResampleResult.
Usage
BenchmarkResult$aggregate(
measures = NULL,
ids = TRUE,
uhashes = FALSE,
params = FALSE,
conditions = FALSE
)
Arguments
measures
(Measure | list of Measure)
Measure(s) to calculate.
ids
(logical(1))
Adds object ids ("task_id", "learner_id", "resampling_id") as extra character columns for convenient subsetting.
uhashes
(logical(1))
Adds the uhash values of the ResampleResults as extra character column "uhash".
params
(logical(1))
Adds the hyperparameter values as extra list column "params". You can unnest them with mlr3misc::unnest().
conditions
(logical(1))
Adds the number of resampling iterations with at least one warning as extra integer column "warnings", and the number of resampling iterations with errors as extra integer column "errors".
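A minimal sketch of aggregating with explicit measures:
bmr$aggregate(msrs(c("classif.ce", "classif.acc")), uhashes = TRUE)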
Method filter()
Subsets the benchmark result.
You can either directly provide the row IDs or the uhashes of the resample results to keep, or use the learner_ids, task_ids and resampling_ids arguments to filter for learner, task and resampling IDs.
The three options are mutually exclusive.
Usage
BenchmarkResult$filter(
i = NULL,
uhashes = NULL,
learner_ids = NULL,
task_ids = NULL,
resampling_ids = NULL
)
Arguments
i
(integer() | NULL)
The iteration values to filter for.
uhashes
(character() | NULL)
The uhashes of the resample results to filter for.
learner_ids
(character() | NULL)
The learner IDs to filter for.
task_ids
(character() | NULL)
The task IDs to filter for.
resampling_ids
(character() | NULL)
The resampling IDs to filter for.
Method resample_result()
Retrieve the i-th ResampleResult by position, by unique hash uhash, or by learner, task and resampling IDs.
All three options are mutually exclusive.
Usage
BenchmarkResult$resample_result(
i = NULL,
uhash = NULL,
task_id = NULL,
learner_id = NULL,
resampling_id = NULL
)
Arguments
i
(integer(1) | NULL)
The iteration value to filter for.
uhash
(character(1) | NULL)
The unique identifier to filter for.
task_id
(character(1) | NULL)
The task ID to filter for.
learner_id
(character(1) | NULL)
The learner ID to filter for.
resampling_id
(character(1) | NULL)
The resampling ID to filter for.
Method discard()
Shrinks the BenchmarkResult by discarding parts of the internally stored data. Note that certain operations might stop working, e.g. extracting importance values from learners or calculating measures requiring the task's data.
Arguments
backends
(logical(1))
If TRUE, the DataBackend is removed from all stored Tasks.
models
(logical(1))
If TRUE, the stored model is removed from all Learners.
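A minimal sketch of shrinking a result before saving it to disk:
bmr$discard(backends = TRUE, models = TRUE) # drop data backends and fitted models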
Method set_threshold()
Sets the threshold for the response prediction of classification learners, provided they have output probability predictions for a binary classification task.
The resample results for which to change the threshold can either be specified directly via uhashes, by selecting the specific iterations (i), or by filtering according to learner, task and resampling IDs.
If none of the three options is specified, the threshold is set for all resample results.
Usage
BenchmarkResult$set_threshold(
threshold,
i = NULL,
uhashes = NULL,
learner_ids = NULL,
task_ids = NULL,
resampling_ids = NULL,
ties_method = "random"
)
Arguments
threshold
(numeric(1))
Threshold value.
i
(integer() | NULL)
The iteration values to filter for.
uhashes
(character() | NULL)
The unique identifiers of the ResampleResults for which the threshold should be set.
learner_ids
(character() | NULL)
The learner IDs for which the threshold should be set.
task_ids
(character() | NULL)
The task IDs for which the threshold should be set.
resampling_ids
(character() | NULL)
The resampling IDs for which the threshold should be set.
ties_method
(character(1))
Method to handle ties in probabilities when selecting a class label. Must be one of "random", "first" or "last" (corresponding to the same options in max.col()).
"random": Randomly select one of the tied class labels (default).
"first": Select the first class label among tied values.
"last": Select the last class label among tied values.
Examples
design = benchmark_grid(
tsk("sonar"),
lrns(c("classif.debug", "classif.featureless"), predict_type = "prob"),
rsmp("holdout")
)
bmr = benchmark(design)
bmr$set_threshold(0.8, learner_ids = "classif.featureless")
bmr$set_threshold(0.3, i = 2)
bmr$set_threshold(0.7, uhashes = uhashes(bmr, learner_ids = "classif.featureless"))
Examples
set.seed(123)
learners = list(
lrn("classif.featureless", predict_type = "prob"),
lrn("classif.rpart", predict_type = "prob")
)
design = benchmark_grid(
tasks = list(tsk("sonar"), tsk("penguins")),
learners = learners,
resamplings = rsmp("cv", folds = 3)
)
print(design)
#> task learner resampling
#> <char> <char> <char>
#> 1: sonar classif.featureless cv
#> 2: sonar classif.rpart cv
#> 3: penguins classif.featureless cv
#> 4: penguins classif.rpart cv
bmr = benchmark(design)
print(bmr)
#> <BenchmarkResult> of 12 rows with 4 resampling runs
#> nr task_id learner_id resampling_id iters warnings errors
#> 1 sonar classif.featureless cv 3 0 0
#> 2 sonar classif.rpart cv 3 0 0
#> 3 penguins classif.featureless cv 3 0 0
#> 4 penguins classif.rpart cv 3 0 0
bmr$tasks
#> Key: <task_hash>
#> task_hash task_id task
#> <char> <char> <list>
#> 1: 389105c365bf9405 sonar <TaskClassif:sonar>
#> 2: a91e40052f01cdb1 penguins <TaskClassif:penguins>
bmr$learners
#> Key: <learner_hash>
#> learner_hash learner_id
#> <char> <char>
#> 1: 24129222692c2943 classif.rpart
#> 2: 38adf5e650d6602c classif.featureless
#> learner
#> <list>
#> 1: <LearnerClassifRpart:classif.rpart>
#> 2: <LearnerClassifFeatureless:classif.featureless>
# first 5 resampling iterations
head(as.data.table(bmr, measures = c("classif.acc", "classif.auc")), 5)
#> uhash task
#> <char> <list>
#> 1: 42323adc-d295-4c5e-a773-3301b1ada8e2 <TaskClassif:sonar>
#> 2: 42323adc-d295-4c5e-a773-3301b1ada8e2 <TaskClassif:sonar>
#> 3: 42323adc-d295-4c5e-a773-3301b1ada8e2 <TaskClassif:sonar>
#> 4: 03892d45-d2de-4534-a949-823874d570b6 <TaskClassif:sonar>
#> 5: 03892d45-d2de-4534-a949-823874d570b6 <TaskClassif:sonar>
#> learner resampling iteration
#> <list> <list> <int>
#> 1: <LearnerClassifFeatureless:classif.featureless> <ResamplingCV> 1
#> 2: <LearnerClassifFeatureless:classif.featureless> <ResamplingCV> 2
#> 3: <LearnerClassifFeatureless:classif.featureless> <ResamplingCV> 3
#> 4: <LearnerClassifRpart:classif.rpart> <ResamplingCV> 1
#> 5: <LearnerClassifRpart:classif.rpart> <ResamplingCV> 2
#> prediction task_id learner_id resampling_id
#> <list> <char> <char> <char>
#> 1: <PredictionClassif> sonar classif.featureless cv
#> 2: <PredictionClassif> sonar classif.featureless cv
#> 3: <PredictionClassif> sonar classif.featureless cv
#> 4: <PredictionClassif> sonar classif.rpart cv
#> 5: <PredictionClassif> sonar classif.rpart cv
# aggregate results
bmr$aggregate()
#> nr task_id learner_id resampling_id iters classif.ce
#> <int> <char> <char> <char> <int> <num>
#> 1: 1 sonar classif.featureless cv 3 0.46604555
#> 2: 2 sonar classif.rpart cv 3 0.27391304
#> 3: 3 penguins classif.featureless cv 3 0.55814900
#> 4: 4 penguins classif.rpart cv 3 0.05812357
#> Hidden columns: resample_result
# aggregate results with hyperparameters as separate columns
mlr3misc::unnest(bmr$aggregate(params = TRUE), "params")
#> nr task_id learner_id resampling_id iters classif.ce method
#> <int> <char> <char> <char> <int> <num> <char>
#> 1: 1 sonar classif.featureless cv 3 0.46604555 mode
#> 2: 2 sonar classif.rpart cv 3 0.27391304 <NA>
#> 3: 3 penguins classif.featureless cv 3 0.55814900 mode
#> 4: 4 penguins classif.rpart cv 3 0.05812357 <NA>
#> xval
#> <int>
#> 1: NA
#> 2: 0
#> 3: NA
#> 4: 0
#> Hidden columns: resample_result
# extract resample result for classif.rpart
rr = bmr$aggregate()[learner_id == "classif.rpart", resample_result][[1]]
print(rr)
#> <ResampleResult> with 3 resampling iterations
#> task_id learner_id resampling_id iteration prediction_test warnings
#> sonar classif.rpart cv 1 <PredictionClassif> 0
#> sonar classif.rpart cv 2 <PredictionClassif> 0
#> sonar classif.rpart cv 3 <PredictionClassif> 0
#> errors
#> 0
#> 0
#> 0
# access the confusion matrix of the first resampling iteration
rr$predictions()[[1]]$confusion
#> truth
#> response M R
#> M 30 18
#> R 3 19
# reduce to subset with task id "sonar"
bmr$filter(task_ids = "sonar")
print(bmr)
#> <BenchmarkResult> of 6 rows with 2 resampling runs
#> nr task_id learner_id resampling_id iters warnings errors
#> 1 sonar classif.featureless cv 3 0 0
#> 2 sonar classif.rpart cv 3 0 0
## ------------------------------------------------
## Method `BenchmarkResult$filter`
## ------------------------------------------------
design = benchmark_grid(
tsks(c("iris", "sonar")),
lrns(c("classif.debug", "classif.featureless")),
rsmp("holdout")
)
bmr = benchmark(design)
bmr
#> <BenchmarkResult> of 4 rows with 4 resampling runs
#> nr task_id learner_id resampling_id iters warnings errors
#> 1 iris classif.debug holdout 1 0 0
#> 2 iris classif.featureless holdout 1 0 0
#> 3 sonar classif.debug holdout 1 0 0
#> 4 sonar classif.featureless holdout 1 0 0
bmr2 = bmr$clone(deep = TRUE)
bmr2$filter(learner_ids = "classif.featureless")
bmr2
#> <BenchmarkResult> of 2 rows with 2 resampling runs
#> nr task_id learner_id resampling_id iters warnings errors
#> 1 iris classif.featureless holdout 1 0 0
#> 2 sonar classif.featureless holdout 1 0 0
## ------------------------------------------------
## Method `BenchmarkResult$resample_result`
## ------------------------------------------------
design = benchmark_grid(
tsk("iris"),
lrns(c("classif.debug", "classif.featureless")),
rsmp("holdout")
)
bmr = benchmark(design)
bmr$resample_result(learner_id = "classif.featureless")
#> <ResampleResult> with 1 resampling iterations
#> task_id learner_id resampling_id iteration prediction_test
#> iris classif.featureless holdout 1 <PredictionClassif>
#> warnings errors
#> 0 0
bmr$resample_result(i = 1)
#> <ResampleResult> with 1 resampling iterations
#> task_id learner_id resampling_id iteration prediction_test warnings
#> iris classif.debug holdout 1 <PredictionClassif> 0
#> errors
#> 0
bmr$resample_result(uhash = uhashes(bmr, learner_ids = "classif.debug"))
#> <ResampleResult> with 1 resampling iterations
#> task_id learner_id resampling_id iteration prediction_test warnings
#> iris classif.debug holdout 1 <PredictionClassif> 0
#> errors
#> 0
## ------------------------------------------------
## Method `BenchmarkResult$set_threshold`
## ------------------------------------------------
design = benchmark_grid(
tsk("sonar"),
lrns(c("classif.debug", "classif.featureless"), predict_type = "prob"),
rsmp("holdout")
)
bmr = benchmark(design)
bmr$set_threshold(0.8, learner_ids = "classif.featureless")
#> Key: <uhash, iteration>
#> uhash iteration learner_state prediction
#> <char> <int> <list> <list>
#> 1: 00d4fe13-c0af-47a7-8587-0daf7b0add61 1 <learner_state[9]> <list[1]>
#> 2: 126b0fc4-dc3b-4d52-8a77-120a53c6e294 1 <learner_state[8]> <list[1]>
#> learner_hash task_hash learner_phash resampling_hash
#> <char> <char> <char> <char>
#> 1: 729485e635936fe0 389105c365bf9405 c1ddf900c095e7ef 35db3d2bb507d357
#> 2: 38adf5e650d6602c 389105c365bf9405 3f40f5f172de95d9 35db3d2bb507d357
bmr$set_threshold(0.3, i = 2)
#> Key: <uhash, iteration>
#> uhash iteration learner_state prediction
#> <char> <int> <list> <list>
#> 1: 00d4fe13-c0af-47a7-8587-0daf7b0add61 1 <learner_state[9]> <list[1]>
#> 2: 126b0fc4-dc3b-4d52-8a77-120a53c6e294 1 <learner_state[8]> <list[1]>
#> learner_hash task_hash learner_phash resampling_hash
#> <char> <char> <char> <char>
#> 1: 729485e635936fe0 389105c365bf9405 c1ddf900c095e7ef 35db3d2bb507d357
#> 2: 38adf5e650d6602c 389105c365bf9405 3f40f5f172de95d9 35db3d2bb507d357
bmr$set_threshold(0.7, uhashes = uhashes(bmr, learner_ids = "classif.featureless"))
#> Key: <uhash, iteration>
#> uhash iteration learner_state prediction
#> <char> <int> <list> <list>
#> 1: 00d4fe13-c0af-47a7-8587-0daf7b0add61 1 <learner_state[9]> <list[1]>
#> 2: 126b0fc4-dc3b-4d52-8a77-120a53c6e294 1 <learner_state[8]> <list[1]>
#> learner_hash task_hash learner_phash resampling_hash
#> <char> <char> <char> <char>
#> 1: 729485e635936fe0 389105c365bf9405 c1ddf900c095e7ef 35db3d2bb507d357
#> 2: 38adf5e650d6602c 389105c365bf9405 3f40f5f172de95d9 35db3d2bb507d357