Retrieving tasks

For illustration purposes, we now retrieve the popular iris data set from mlr_tasks:
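A minimal sketch of this step, assuming a recent mlr3 release in which the mlr_tasks dictionary provides a $get() method and the task is registered under the key "iris" (printed output may differ between versions):

```r
library(mlr3)

# Look up the predefined iris classification task by its key.
task = mlr_tasks$get("iris")

# Printing the task shows a short summary of its target and features.
print(task)
```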

The task object is an mlr3::Task, which holds various information about the respective task. The task properties and characteristics can be queried using the task’s public member values and methods (see ?mlr3::Task). Most of them are self-explanatory.
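A few of these fields can be queried as sketched below. The field names are taken from the Task help page of a recent mlr3 release and are an assumption with respect to the version this text was written for:

```r
library(mlr3)
task = mlr_tasks$get("iris")

task$task_type      # type of the task, here a classification task
task$nrow           # number of observations
task$ncol           # number of columns (features plus target)
task$feature_names  # names of the feature columns
task$target_names   # name of the target column
task$class_names    # class labels of the target (classification tasks only)
```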

In mlr3 tasks, each row has a unique identifier (row name) which can be either integer or character. These can be used to select specific rows.
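Selecting rows by their identifiers could look as follows. This is a sketch that assumes the rows and cols arguments of $data() as documented in recent mlr3 releases; the chosen row ids are arbitrary:

```r
library(mlr3)
task = mlr_tasks$get("iris")

# Row identifiers of the task (integers for the iris task).
head(task$row_ids)

# Request three specific rows and two columns as a data.table.
subset = task$data(rows = c(1, 51, 101), cols = c("Species", "Sepal.Length"))
print(subset)

# The task still reports all of its rows afterwards.
task$nrow
```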

Note that the method $data() is only an accessor and does not modify the underlying data/task. To modify the underlying data/task, you can use the $filter() and $select() methods, which are mutators.

Each task comes with at least one associated performance measure, stored as a list inside the task:

To change a measure for a task, simply overwrite this slot.

Manual task creation

To manually create a task from a data.frame, you must determine the task type to select the respective constructor:

  • Classification Task: Target column contains labels (stored as character/factor) with only a few distinct values.
    \(\Rightarrow\) TaskClassif.
  • Regression Task: Target column is numeric (stored as integer/double).
    \(\Rightarrow\) TaskRegr.
  • Cluster Task: You don’t have a target but want to identify similarities in the feature space.
    \(\Rightarrow\) Not yet implemented.

Let’s assume we want to create a simple regression task using the mtcars data set from the package datasets to predict the column "mpg" (miles per gallon). We only take the first two features here to keep the output in the following examples short.

Before we can create a regression task, we must create an mlr3::DataBackend, an abstraction for data storage systems. Here, we will stick to the simplest form of data storage: an in-memory table format using data.table::data.table(). We construct the backend first, and then pass it to the regression task constructor:
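The two steps might be sketched as follows. The helper as_data_backend() and the argument names of TaskRegr$new() are assumptions based on recent mlr3 releases, and the task id "cars" is chosen for illustration. How row names are treated by the backend may also depend on the installed version:

```r
library(mlr3)

# Keep the target "mpg" and the first two features, "cyl" and "disp".
data = mtcars[, 1:3]

# Wrap the data.frame in an in-memory backend.
backend = as_data_backend(data)

# Create the regression task on top of the backend.
task = TaskRegr$new(id = "cars", backend = backend, target = "mpg")
print(task)
```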

Note that the cars data.frame has character row names, which will automatically be used as row_ids. Analogous to the filtering of row ids by integers, we can also filter the row ids by the respective characters, e.g.:

Column roles

Now, we want the original rownames() of mtcars to be a regular column. Thus, we first pre-process the data.frame and then re-create the task.

In mlr3, columns (and rows) can be assigned roles. We have seen three different roles for columns so far:

  1. The target column (here "mpg"), also called dependent variable.
  2. Features, also called independent variables.
  3. The row_id. This column is there for technical reasons, and is typically useless for learning.

The different roles are stored as a list of column name vectors:
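A sketch of how this list can be inspected, assuming a recent mlr3 release. The column name "rn" is the default that data.table assigns when row names are kept, and passing the table directly as backend relies on the constructor converting it internally:

```r
library(mlr3)

# Turn the row names of mtcars into a regular column named "rn".
data = data.table::as.data.table(mtcars[, 1:3], keep.rownames = TRUE)
task = TaskRegr$new(id = "cars", backend = data, target = "mpg")

# Named list that maps each role to the columns holding that role.
task$col_roles$target
task$col_roles$feature
```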

As the output shows, the column "mpg" is the target column and there are three features: "rn" (row names), "cyl", and "disp". More roles are documented in the help for tasks.

In the following, we want to learn on neither the primary key (which mlr3 takes care of) nor the new column rn with the row names. However, we still might want to carry rn around for different reasons. E.g., we can use the row names in plots or to associate outliers with the row names. Therefore, we need to change the role of the row names column rn and remove it from the set of features.
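One way to change the role is to assign to the col_roles list directly, as sketched below. The role "name" used here for row labels is an assumption taken from recent mlr3 releases; other versions may use a different role name or a dedicated setter method:

```r
library(mlr3)

data = data.table::as.data.table(mtcars[, 1:3], keep.rownames = TRUE)
task = TaskRegr$new(id = "cars", backend = data, target = "mpg")

# Remove "rn" from the features ...
task$col_roles$feature = setdiff(task$col_roles$feature, "rn")
# ... and keep it around under the role "name" (assumed role name).
task$col_roles$name = "rn"

# "rn" is no longer used for learning.
task$feature_names
```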

Row roles

Just like columns, it is also possible to assign different roles to rows. Rows can have two different roles:

  1. Role "use": Rows that are generally available for model fitting (although they may also be used as test set in resampling). This is the default role.
  2. Role "validation": Rows that are held back (see below). Rows which have missing values in the target column upon task creation are automatically moved to the validation set.

There are several reasons to hold some observations back or treat them differently:

  1. It is often good practice to validate the final model on an external validation set to uncover possible overfitting.
  2. Some observations may be unlabeled, e.g. in data mining cups or Kaggle competitions. These observations cannot be used for training a model, but you can still predict labels.

Instead of creating a task with only a subset of observations and then manually applying the fitted model to a held-back data.frame, you can just call the function validate() later on. Marking observations as validation works analogously to changing column roles:

All pre- and post-processing you have used on the training data is also applied to the validation data in exactly the same way.

Task mutation

A task can be mutated using the methods $filter(), $select(), $rbind(), $cbind() and $overwrite().

The iris task is used again to showcase the mutators.

$filter()

Subsetting based on rows is done with $filter(). Afterwards we can check the modified task by either querying the data slot or by checking the number of rows.
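A short sketch of row subsetting, assuming that $filter() accepts a vector of row ids and modifies the task in place, as in recent mlr3 releases:

```r
library(mlr3)
task = mlr_tasks$get("iris")

# Keep only the rows with ids 1, 2 and 3.
task$filter(1:3)

task$nrow    # number of remaining rows
task$data()  # the remaining observations
```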

$select()

The equivalent method for subsetting columns is $select().

You might wonder why two columns are left even though we selected only one. The subsetting only applies to the columns listed as “feature” (see task$col_roles). The “target” column is not touched.
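The behaviour described above can be reproduced with the following sketch, assuming that $select() takes a character vector of feature names as in recent mlr3 releases:

```r
library(mlr3)
task = mlr_tasks$get("iris")

# Keep a single feature; the target column is retained automatically.
task$select("Sepal.Width")

task$feature_names  # the selected feature
task$target_names   # the untouched target
names(task$data())  # both columns are still part of the data
```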

$rbind(), $cbind()

These methods add rows or columns to the data set, respectively. In the following example we duplicate the rows of the iris task (please do not do this in practice):
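A sketch of the duplication, assuming that $rbind() accepts a table with the same columns as the task and generates new row ids for the appended rows, as in recent mlr3 releases:

```r
library(mlr3)
task = mlr_tasks$get("iris")

# Take a copy of the current data and append it to the task.
copy = task$data()
task$rbind(copy)

# The task now holds every observation twice.
task$nrow
```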

The same logic applies to $cbind().
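For columns, a sketch could look like this. It assumes that $cbind() matches the new rows to the task in the order of its row ids when no id column is supplied, which holds for recent mlr3 releases; older versions may require an explicit row id column. The column name row_number is hypothetical:

```r
library(mlr3)
task = mlr_tasks$get("iris")

# One new value per observation, in the order of the task's row ids.
extra = data.frame(row_number = seq_len(task$nrow))
task$cbind(extra)

# The new column is listed among the features.
task$feature_names
```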