## Prerequisites

library(mlr3)

Learning tasks encapsulate the data set and additional meta information about a machine learning problem, for example the name of the target variable for supervised problems.

To manually create a task from a data.frame() or data.table(), you must first determine the task type to select the respective constructor:

• Classification Task: Target column contains labels (stored as character/factor) with only a few distinct values.
$$\Rightarrow$$ mlr3::TaskClassif.
• Regression Task: Target column is numeric (stored as integer/double).
$$\Rightarrow$$ mlr3::TaskRegr.
• Cluster Task: You don’t have a target but want to identify similarities in the feature space.
$$\Rightarrow$$ Not yet implemented.
• Survival Task: Target is the (right-censored) time to event.
$$\Rightarrow$$ TaskSurvival in add-on package mlr3survival.
• Ordinal Regression Task: Target is ordinal.
$$\Rightarrow$$ TaskOrdinal in add-on package mlr3ordinal.
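As a rough illustration of the decision above, the following base R sketch guesses the task type from the class of the target column. The function `guess_task_type()` is a hypothetical helper written for this example and is not part of mlr3. It only distinguishes the classification and regression cases, and it treats ordered factors as classification targets.

```r
# Hypothetical helper (not part of mlr3): guess the task type from the
# storage type of the target column.
guess_task_type = function(data, target) {
  y = data[[target]]
  if (is.factor(y) || is.character(y)) {
    "classif"   # labels -> TaskClassif
  } else if (is.numeric(y)) {
    "regr"      # integer/double -> TaskRegr
  } else {
    stop("Unsupported target type: ", class(y)[1])
  }
}

guess_task_type(mtcars, "mpg")    # numeric target
guess_task_type(iris, "Species")  # factor target
```

A check like this cannot detect survival or ordinal problems, so the final choice of constructor still depends on what the target means.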

Let’s assume we want to create a simple regression task using the mtcars data set from the package datasets to predict the column "mpg" (miles per gallon). We keep only the target and the first two features (the first three columns) to keep the output in the following examples short.

data("mtcars", package = "datasets")
data = mtcars[, 1:3]
str(data)
#> 'data.frame':    32 obs. of  3 variables:
#>  $ mpg : num  21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
#>  $ cyl : num  6 6 4 6 8 6 8 4 4 6 ...
#>  $ disp: num  160 160 108 258 360 ...

Next, we create the task by providing

1. id: identifier for the task, used in plots and summaries.
2. backend: here, we simply provide the data.frame(), which is internally converted to a mlr3::DataBackendDataTable. For more fine-grained control over how the data is stored internally, we could also construct a mlr3::DataBackend manually.
3. target: column name of the target column for the regression problem.

task_mtcars = TaskRegr$new(id = "cars", backend = data, target = "mpg")
print(task_mtcars)
#> <TaskRegr:cars> (32 x 3)
#> Target: mpg
#> Features (2):
#> * dbl (2): cyl, disp
#>
#> Public: backend, cbind(), clone(), col_info, col_roles,
#>   data_formats, data(), droplevels(), feature_names,
#>   feature_types, filter(), formula(), groups, hash, head(), id,
#>   levels(), measures, missings(), ncol, nrow, properties, rbind(),
#>   replace_features(), row_ids, row_roles, select(),
#>   truth(), weights

The print() method gives a short summary of the task: it has 32 observations and 3 columns, of which 2 are features.
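A classification task can be created in the same way with the mlr3::TaskClassif constructor named in the list above. The following is a minimal sketch that assumes the constructor takes the same id, backend, and target arguments as TaskRegr. The id "flowers" is an arbitrary name chosen for this example.

```r
library(mlr3)

# The target "Species" is a factor, so this is a classification problem.
data("iris", package = "datasets")
task_flowers = TaskClassif$new(id = "flowers", backend = iris, target = "Species")

task_flowers$target_names  # name of the target column
task_flowers$class_names   # the distinct class labels
```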

mlr3 ships with some predefined machine learning tasks. These are stored in a mlr3::Dictionary, which is a simple key-value store named mlr3::mlr_tasks. We can obtain a summarizing overview of all stored tasks by converting the dictionary to a data.table():

as.data.table(mlr_tasks)
#>       key task_type     measures  nrow  ncol   lgl   int   dbl   chr   fct
#>    <char>    <char>       <list> <int> <int> <int> <int> <int> <int> <int>
#> 1:     bh      regr     regr.mse   506    19     0     3    13     0     2
#> 2:   iris   classif classif.mmce   150     5     0     0     4     0     0
#> 3: mtcars      regr     regr.mse    32    11     0     0    10     0     0
#> 4:   pima   classif classif.mmce   768     9     0     0     8     0     0
#> 5:  sonar   classif classif.mmce   208    61     0     0    60     0     0
#> 6:   spam   classif classif.mmce  4601    58     0     0    57     0     0
#> 7:    zoo   classif classif.mmce   101    17    15     1     0     0     0
#>      ord
#>    <int>
#> 1:     0
#> 2:     0
#> 3:     0
#> 4:     0
#> 5:     0
#> 6:     0
#> 7:     0
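If only the names of the stored tasks are needed, the dictionary can also be asked for its keys. The sketch below assumes the dictionary provides a keys() method. The set of keys depends on the installed mlr3 version and loaded add-on packages, so no output is shown.

```r
library(mlr3)

# List the keys of all predefined tasks in the dictionary.
task_keys = mlr_tasks$keys()
print(task_keys)

# Check for a specific key before retrieving the task.
if ("iris" %in% task_keys) {
  task = mlr_tasks$get("iris")
  print(task$nrow)
}
```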

For illustration purposes, we now retrieve the popular iris data set from mlr_tasks as a classification task:

task_iris = mlr_tasks$get("iris")
print(task_iris)
#> <TaskClassif:iris> (150 x 5)
#> Target: Species
#> Features (4):
#> * dbl (4): Petal.Length, Petal.Width, Sepal.Length, Sepal.Width
#>
#> Public: backend, cbind(), class_n, class_names, clone(),
#>   col_info, col_roles, data_formats, data(), droplevels(),
#>   feature_names, feature_types, filter(), formula(), groups, hash,
#>   head(), id, levels(), measures, missings(), ncol, negative,
#>   nrow, positive, properties, rbind(), replace_features(),
#>   row_ids, row_roles, select(), set_col_role(), set_row_role(),
#>   target_names, task_type, truth(), weights

## Task API

The task properties and characteristics can be queried using the task’s public member values and methods (see ?mlr3::Task). Most of them should be self-explanatory, e.g.,

task_iris = mlr_tasks$get("iris")

# public member values
task_iris$nrow
#> [1] 150
task_iris$ncol
#> [1] 5

# public member methods
task_iris$head(n = 3)
#>    Species Petal.Length Petal.Width Sepal.Length Sepal.Width
#>     <fctr>        <num>       <num>        <num>       <num>
#> 1:  setosa          1.4         0.2          5.1         3.5
#> 2:  setosa          1.4         0.2          4.9         3.0
#> 3:  setosa          1.3         0.2          4.7         3.2

### Retrieve Data

In mlr3, each row (observation) has a unique identifier which can be either integer() or character(). These can be used to select specific rows.

# iris uses integer row_ids
head(task_iris$row_ids)
#> [1] 1 2 3 4 5 6

# retrieve data for rows with ids 1, 51, and 101
task_iris$data(rows = c(1, 51, 101))
#>       Species Petal.Length Petal.Width Sepal.Length Sepal.Width
#>        <fctr>        <num>       <num>        <num>       <num>
#> 1:     setosa          1.4         0.2          5.1         3.5
#> 2: versicolor          4.7         1.4          7.0         3.2
#> 3:  virginica          6.0         2.5          6.3         3.3

# mtcars uses the rownames of the original data set
head(task_mtcars$row_ids)
#> [1] "AMC Javelin"        "Cadillac Fleetwood" "Camaro Z28"
#> [4] "Chrysler Imperial"  "Datsun 710"         "Dodge Challenger"

# retrieve data for rows with id "Datsun 710"
task_mtcars$data(rows = "Datsun 710")
#>      mpg   cyl  disp
#>    <num> <num> <num>
#> 1:  22.8     4   108

Note that the method $data() is only an accessor and does not modify the underlying data/task.

Analogously, each column has an identifier, which is often just called column name. These are stored in the public fields feature_names and target_names:

task_iris$feature_names
#> [1] "Petal.Length" "Petal.Width"  "Sepal.Length" "Sepal.Width"
task_iris$target_names
#> [1] "Species"

# retrieve data for rows 1, 51, and 101 and only select column "Species"
task_iris$data(rows = c(1, 51, 101), cols = "Species")
#>       Species
#>        <fctr>
#> 1:     setosa
#> 2: versicolor
#> 3:  virginica

To retrieve the complete data set, e.g. for a closer inspection, convert to a data.table():

summary(as.data.table(task_iris))
#>        Species    Petal.Length    Petal.Width     Sepal.Length
#>  setosa    :50   Min.   :1.000   Min.   :0.100   Min.   :4.300
#>  versicolor:50   1st Qu.:1.600   1st Qu.:0.300   1st Qu.:5.100
#>  virginica :50   Median :4.350   Median :1.300   Median :5.800
#>                  Mean   :3.758   Mean   :1.199   Mean   :5.843
#>                  3rd Qu.:5.100   3rd Qu.:1.800   3rd Qu.:6.400
#>                  Max.   :6.900   Max.   :2.500   Max.   :7.900
#>   Sepal.Width
#>  Min.   :2.000
#>  1st Qu.:2.800
#>  Median :3.000
#>  Mean   :3.057
#>  3rd Qu.:3.300
#>  Max.   :4.400

### Roles

It is possible to assign special roles to (subsets of) rows and columns. For example, the previously constructed mtcars task has the following column roles:

task_mtcars$col_roles
#> $feature
#> [1] "cyl"  "disp"
#>
#> $target
#> [1] "mpg"
#>
#> $label
#> character(0)
#>
#> $order
#> character(0)
#>
#> $groups
#> character(0)
#>
#> $weights
#> character(0)

Now, we want the original rownames() of mtcars to be a regular feature column. Thus, we first pre-process the data.frame and then re-create the task.

library("data.table")
# with keep.rownames, data.table stores the row names in an extra column "rn"
data = as.data.table(mtcars[, 1:3], keep.rownames = TRUE)
task = TaskRegr$new(id = "cars", backend = data, target = "mpg")

# we now have integer row_ids
task$row_ids
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
#> [24] 24 25 26 27 28 29 30 31 32

# there is a new "feature" called "rn"
task$feature_names
#> [1] "cyl"  "disp" "rn"

The column “rn” is now a regular feature. As this is a unique string column, most machine learning algorithms will have problems processing this feature without some kind of preprocessing. However, we still might want to carry rn around for different reasons. For example, we can use the row names in plots or to associate outliers with the row names. We therefore change the role of the row names column rn and remove it from the set of active features.

task$feature_names
#> [1] "cyl"  "disp" "rn"
task$set_col_role("rn", new_roles = "label")

# "rn" not listed as feature any more
task$feature_names
#> [1] "cyl"  "disp"

# also vanished from "data" and "head"
task$data(rows = 1:2)
#>      mpg   cyl  disp
#>    <num> <num> <num>
#> 1:    21     6   160
#> 2:    21     6   160
task$head(2)
#>      mpg   cyl  disp
#>    <num> <num> <num>
#> 1:    21     6   160
#> 2:    21     6   160

Note that this operation does not copy the underlying data. By changing roles, only the view on the data is changed, not the data itself.
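One way to see that only the view changes is to compare the task's feature names with the column names still stored in its backend. This sketch rests on two assumptions: that the col_roles list can be assigned to directly, which is used here in place of set_col_role(), and that the backend exposes a colnames field. Both may behave differently depending on the mlr3 version.

```r
library(mlr3)
library(data.table)

# Re-create the task with the row names stored in the extra column "rn".
data = as.data.table(mtcars[, 1:3], keep.rownames = TRUE)
task = TaskRegr$new(id = "cars", backend = data, target = "mpg")

# Assumption: col_roles accepts direct assignment. Drop "rn" from the
# feature role without touching the stored data.
task$col_roles$feature = setdiff(task$col_roles$feature, "rn")

task$feature_names               # "rn" is no longer an active feature
"rn" %in% task$backend$colnames  # the column is still present in the backend
```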

Just like columns, it is also possible to assign different roles to rows. Rows can have two different roles:

1. Role "use": Rows that are generally available for model fitting (although they may also be used as test set in resampling). This is the default role.
2. Role "validation": Rows that are held back (see below). Rows which have missing values in the target column upon task creation are automatically moved to the validation set.

There are several reasons to hold some observations back or treat them differently:

1. It is often good practice to validate the final model on an external validation set to uncover possible overfitting.
2. Some observations may be unlabeled, e.g. in data mining cups or Kaggle competitions. These observations cannot be used for training a model, but you can still predict labels.

The methods set_col_role() and set_row_role() change the view on the data and can be used to subset the task. For convenience, the method filter() subsets the task based on row ids, and select() subsets the task based on feature names. None of these operations copies the data; they only change the view on it. They do, however, modify the task in-place.

task = mlr_tasks$get("iris")
task$select(c("Sepal.Width", "Sepal.Length")) # keep only these features
task$filter(1:3) # keep only these rows
task$head()
#>    Species Sepal.Length Sepal.Width
#>     <fctr>        <num>       <num>
#> 1:  setosa          5.1         3.5
#> 2:  setosa          4.9         3.0
#> 3:  setosa          4.7         3.2
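Because filter() and select() modify the task in-place, it can be useful to work on a copy when the full task is still needed later. The sketch below uses the clone() method listed among the task's public members. Passing deep = TRUE is an assumption made here so that the copy does not share mutable state with the original.

```r
library(mlr3)

task_full = mlr_tasks$get("iris")

# Work on a copy so that the original task keeps all rows and features.
task_small = task_full$clone(deep = TRUE)
task_small$select(c("Sepal.Width", "Sepal.Length"))
task_small$filter(1:3)

task_small$nrow  # 3 rows remain in the copy
task_full$nrow   # the original task still has all 150 rows
```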

Additionally, the methods rbind() and cbind() add extra rows and columns to a task, respectively. The method replace_features() is a convenience wrapper around select() and cbind(). Again, the original data set stored in the original mlr3::DataBackend is not altered in any way.

task$cbind(data.table(foo = letters[1:3])) # add column foo
task$head()
#>    Species Sepal.Length Sepal.Width    foo
#>     <fctr>        <num>       <num> <char>
#> 1:  setosa          5.1         3.5      a
#> 2:  setosa          4.9         3.0      b
#> 3:  setosa          4.7         3.2      c