Prerequisites

library(mlr3)

Learning tasks encapsulate the data set and additional meta information about a machine learning problem, for example the name of the target variable for supervised problems.

Task Creation

To manually create a task from a data.frame() or data.table(), you must first determine the task type to select the respective constructor:

  • Classification Task: Target column contains labels (stored as character/factor) with only a few distinct values.
    \(\Rightarrow\) mlr3::TaskClassif.
  • Regression Task: Target column is numeric (stored as integer/double).
    \(\Rightarrow\) mlr3::TaskRegr.
  • Cluster Task: You don’t have a target but want to identify similarities in the feature space.
    \(\Rightarrow\) Not yet implemented.
  • Survival Task: Target is the (right-censored) time to event.
    \(\Rightarrow\) TaskSurvival in add-on package mlr3survival.
  • Ordinal Regression Task: Target is ordinal.
    \(\Rightarrow\) TaskOrdinal in add-on package mlr3ordinal.

Let’s assume we want to create a simple regression task using the mtcars data set from the package datasets to predict the column "mpg" (miles per gallon). We only take the first two features here to keep the output in the following examples short.
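A minimal sketch of this preprocessing step (the object name data is our choice):

data("mtcars", package = "datasets")
data = mtcars[, 1:3] # keep the target "mpg" plus the first two features "cyl" and "disp"
str(data)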

Next, we create the task by providing the following arguments (a short constructor sketch follows the list):

  1. id: identifier for the task, used in plots and summaries
  2. backend: here, we simply provide the data.frame() which is internally converted to an mlr3::DataBackendDataTable. For more fine-grained control over how the data is stored internally, we could also construct an mlr3::DataBackend manually.
  3. target: name of the target column for the regression problem.
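A sketch of the constructor call, assuming the prepared data.frame from above is stored in data; the object name task_mtcars is our choice:

task_mtcars = TaskRegr$new(id = "cars", backend = data, target = "mpg")
print(task_mtcars)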

The print() method gives a short summary of the task: it has 32 observations and 3 columns, 2 of which are features.

Predefined Tasks

mlr3 ships with some predefined machine learning tasks. These are stored in a mlr3::Dictionary, which is a simple key-value store named mlr3::mlr_tasks. We can get an overview of all stored tasks by converting the dictionary to a data.table().
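With mlr3 attached, the conversion is a one-liner:

as.data.table(mlr_tasks)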

For illustration purposes, we now retrieve the popular iris data set from mlr_tasks as a classification task:
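A minimal sketch of the lookup; the object name task_iris is our choice and is reused in the examples below.

task_iris = mlr_tasks$get("iris")
print(task_iris)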

Task API

The task properties and characteristics can be queried using the task's public fields and methods (see ?mlr3::Task). Most of them should be self-explanatory.
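For example, using the task_iris object retrieved above:

task_iris$nrow
task_iris$ncol
task_iris$task_type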

Retrieve Data

In mlr3, each row (observation) has a unique identifier which can be either integer() or character(). These can be used to select specific rows.
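A short sketch, again using task_iris; the rows argument of $data() selects observations by row id:

head(task_iris$row_ids)

# retrieve the observations with row ids 1, 51 and 101
task_iris$data(rows = c(1, 51, 101))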

Note that the method $data() is only an accessor and does not modify the underlying data/task.

Analogously, each column has an identifier, often simply called the column name. These are stored in the public fields feature_names and target_names:
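For task_iris, a sketch combining both kinds of identifiers via the rows and cols arguments of $data():

task_iris$feature_names
task_iris$target_names

# slice the data by row ids and column names
task_iris$data(rows = c(1, 51, 101), cols = "Species")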

To retrieve the complete data set, e.g. for a closer inspection, convert the task to a data.table():
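Again using the task_iris object from above:

as.data.table(task_iris)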

Roles

It is possible to assign special roles to (subsets of) rows and columns.

For example, the previously constructed mtcars task has the following column roles:
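A sketch, assuming the task_mtcars object created earlier:

task_mtcars$col_roles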

Now, we want the original rownames() of mtcars to be a regular feature column. Thus, we first pre-process the data.frame and then re-create the task.
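A minimal sketch of this step, reusing the object names data and task_mtcars from above:

data = mtcars[, 1:3]
data$rn = rownames(data) # store the row names in a regular column "rn"
task_mtcars = TaskRegr$new(id = "cars", backend = data, target = "mpg")
task_mtcars$feature_names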

The column “rn” is now a regular feature. As this is a unique string column, most machine learning algorithms will have problems processing this feature without some kind of preprocessing. However, we might still want to carry rn around for different reasons, e.g., to label points in plots or to associate outliers with their row names. Therefore, we change the role of the row names column rn and thereby remove it from the set of active features.
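A sketch using set_col_role(); the role name "label" is an assumption here and may differ between mlr3 versions, but any role other than "feature" removes the column from the active features:

task_mtcars$set_col_role("rn", new_roles = "label") # "label" is a hypothetical role name
task_mtcars$feature_names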

Note that this operation does not copy the underlying data. By changing roles, only the view on the data is changed, not the data itself.

Just like columns, it is also possible to assign different roles to rows. Rows can have two different roles:

  1. Role "use": Rows that are generally available for model fitting (although they may also be used as test set in resampling). This is the default role.
  2. Role "validation": Rows that are held back (see below). Rows which have missing values in the target column upon task creation are automatically moved to the validation set.

There are several reasons to hold some observations back or treat them differently:

  1. It is often good practice to validate the final model on an external validation set to uncover possible overfitting.
  2. Some observations may be unlabeled, e.g. in data mining cups or Kaggle competitions. These observations cannot be used for training a model, but you can still predict labels for them.

Task Mutators

The methods set_col_role() and set_row_role() change the view on the data and can be used to subset the task. For convenience, the method filter() subsets the task based on row ids, and select() subsets the task based on feature names. All these operations only change the view on the data without copying it, but they do modify the task in-place.
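A short sketch, applied to the task_iris object from above (note that these calls modify task_iris itself):

task_iris$select(c("Sepal.Length", "Sepal.Width")) # keep only two features
task_iris$filter(1:10) # keep only the first 10 rows
task_iris$head()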

Additionally, the methods rbind() and cbind() allow adding extra rows and columns to a task, respectively. The method replace_features() is a convenience wrapper around select() and cbind(). Again, the original data stored in the underlying mlr3::DataBackend is not altered in any way.
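As a sketch, we can attach an additional column to the (already filtered) task_iris; the column name foo is made up for illustration:

task_iris$cbind(data.frame(foo = rnorm(task_iris$nrow))) # adds a new feature column
task_iris$head()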