This is the abstract base class for TaskSupervised and TaskUnsupervised. TaskClassif and TaskRegr inherit from TaskSupervised. More supervised tasks are implemented in mlr3proba, unsupervised cluster tasks in package mlr3cluster.

Tasks serve two purposes:

  1. Tasks wrap a DataBackend, an object to transparently interface different data storage types.

  2. Tasks store meta-information, such as the role of the individual columns in the DataBackend. For example, for a classification task a single column must be marked as target column, and others as features.

Predefined (toy) tasks are stored in the dictionary mlr_tasks, e.g. penguins or california_housing. More toy tasks can be found in the dictionary after loading mlr3data.
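
As a minimal sketch of retrieving a predefined task, the following assumes that mlr3 is installed and uses the sugar function tsk(), which looks up a key in mlr_tasks:

```r
library(mlr3)

# Keys of the predefined tasks currently registered in the dictionary
head(mlr_tasks$keys())

# Retrieve a task by key; tsk("penguins") is shorthand for mlr_tasks$get("penguins")
task = tsk("penguins")
task$id
task$task_type
```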

S3 methods

Task mutators

The following methods change the task in-place:

  • Any modification of the lists $col_roles or $row_roles. This provides a different "view" on the data without altering the data itself. Altering $col_roles may affect, e.g., $data, $ncol, $n_features, and $feature_names. Altering $row_roles may affect, e.g., $data, $nrow, and $row_ids.

  • Modification of column or row roles via $set_col_roles() or $set_row_roles(), respectively. They are an alternative to directly accessing $col_roles or $row_roles, with the same side effects.

  • $select() and $filter() subset the set of active features or rows in $col_roles or $row_roles, respectively.

  • $cbind() and $rbind() change the task in-place by binding new columns or rows to the data.

  • $rename() changes column names.

  • $set_levels() and $droplevels() update the field $col_info so that factor levels are automatically repaired when querying data with $data().
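
Because these methods modify the task by reference, a copy has to be taken beforehand if the original state is still needed. A small sketch of this pattern, assuming the predefined penguins task (344 rows, 7 features):

```r
library(mlr3)

task = tsk("penguins")
backup = task$clone()   # independent copy, taken before mutating

# Both calls change `task` itself; nothing needs to be re-assigned
task$select(c("bill_length_mm", "body_mass_g"))
task$filter(1:100)

c(task$nrow, task$n_features)      # the mutated view
c(backup$nrow, backup$n_features)  # the clone keeps the previous state
```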

See also

Other Task: TaskClassif, TaskRegr, TaskSupervised, TaskUnsupervised, california_housing, mlr_tasks, mlr_tasks_breast_cancer, mlr_tasks_german_credit, mlr_tasks_iris, mlr_tasks_mtcars, mlr_tasks_penguins, mlr_tasks_pima, mlr_tasks_sonar, mlr_tasks_spam, mlr_tasks_wine, mlr_tasks_zoo

Public fields

label

(character(1))
Label for this object. Can be used in tables, plot and text output instead of the ID.

task_type

(character(1))
Task type, e.g. "classif" or "regr".

For a complete list of possible task types (depending on the loaded packages), see mlr_reflections$task_types$type.

backend

(DataBackend)
Abstract interface to the data of the task.

col_info

(data.table::data.table())
Table with 5 columns, mainly for internal purposes:

  • "id" (character()) stores the name of the column.

  • "type" (character()) holds the storage type of the variable, e.g. integer, numeric or character. See mlr_reflections$task_feature_types for a complete list of allowed types.

  • "levels" (list()) stores a vector of distinct values (levels) for ordered and unordered factor variables.

  • "label" (character()) stores a vector of prettier, formatted column names.

  • "fix_factor_levels" (logical()) stores flags which determine if the levels of the respective variable need to be reordered after querying the data from the DataBackend.

Note that all columns of the DataBackend are listed in this table, including columns which are not selected or do not have any role.

man

(character(1))
String in the format [pkg]::[topic] pointing to a manual page for this object. Defaults to NA, but can be set by child classes.

extra_args

(named list())
Additional arguments set during construction. Required for convert_task().

mlr3_version

(package_version)
Package version of mlr3 used to create the task.

characteristics

(list())
List of characteristics of the task, e.g. list(n = 5, p = 7).

Active bindings

id

(character(1))
Identifier of the object. Used in tables, plot and text output.

internal_valid_task

(Task or integer() or NULL)
Optional validation task that can, e.g., be used for early stopping with learners such as XGBoost. See also the $validate field of Learner. If integers are assigned they are removed from the primary task and an internal validation task with those ids is created from the primary task using only those ids. When assigning a new task, it is always cloned.
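
The integer assignment described above can be sketched as follows. The choice of the first 50 row ids is arbitrary and only serves as an illustration:

```r
library(mlr3)

task = tsk("penguins")

# Assign row ids: they are moved out of the primary task
# into a newly created internal validation task
task$internal_valid_task = 1:50

task$nrow                      # remaining rows of the primary task
task$internal_valid_task$nrow  # rows of the validation task
```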

hash

(character(1))
Hash (unique identifier) for this object. The hash is calculated based on the complete task object and $row_ids. If an internal validation task is set, the hash is recalculated.

row_ids

(positive integer())
Returns the row ids of the DataBackend for observations with role "use".

row_names

(data.table::data.table())
Returns a table with two columns:

  • "row_id" (integer()), and

  • "row_name" (character()).

feature_names

(character())
Returns all column names with role == "feature".

Note that this vector determines the default order of columns for task$data(cols = NULL, ...). However, it is recommended to not rely on the order of columns, but instead always address columns by their name. The default order is not well defined after some operations, e.g. after task$cbind() or after processing via mlr3pipelines.

target_names

(character())
Returns all column names with role "target".

properties

(character())
Set of task properties. Possible properties are stored in mlr_reflections$task_properties. The following properties are currently standardized and understood by tasks in mlr3:

  • "strata": The task is resampled using one or more stratification variables (role "stratum").

  • "groups": The task comes with grouping/blocking information (role "group").

  • "weights": The task comes with observation weights (role "weight").

Note that the properties listed above are calculated from $col_roles and cannot be set explicitly.

row_roles

(named list())
Each row (observation) can have an arbitrary number of roles in the learning task:

  • "use": Use in train / predict / resampling.

row_roles is a named list whose elements are named by row role and each element is an integer() vector of row ids. To alter the roles, just modify the list, e.g. with R's set functions (intersect(), setdiff(), union(), ...).
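
A short sketch of modifying the list directly with a set function, here excluding the first ten row ids of the penguins task from role "use":

```r
library(mlr3)

task = tsk("penguins")

# Drop row ids 1..10 from the active rows; the backend itself is untouched
task$row_roles$use = setdiff(task$row_roles$use, 1:10)

task$nrow
head(task$row_ids)
```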

col_roles

(named list())
Each column can be in one or more of the following groups to fulfill different roles:

  • "feature": Regular feature used in the model fitting process.

  • "target": Target variable. Most tasks only accept a single target column.

  • "name": Row names / observation labels. To be used in plots. Can be queried with $row_names. Not more than a single column can be associated with this role.

  • "order": Data returned by $data() is ordered by this column (or these columns). Columns must be sortable with order().

  • "group": During resampling, observations with the same value of the variable with role "group" are marked as "belonging together". For each resampling iteration, observations of the same group will be exclusively assigned to be either in the training set or in the test set. Not more than a single column can be associated with this role.

  • "stratum": Stratification variables. Multiple discrete columns may have this role.

  • "weight": Observation weights. Not more than one numeric column may have this role.

col_roles is a named list whose elements are named by column role and each element is a character() vector of column names. To alter the roles, just modify the list, e.g. with R's set functions (intersect(), setdiff(), union(), ...). The method $set_col_roles provides a convenient alternative to assign columns to roles.
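
For example, a feature can be deactivated by removing it from the "feature" element of the list. This is a sketch using the penguins task; the column is still stored in the DataBackend and merely loses its role:

```r
library(mlr3)

task = tsk("penguins")

# Remove "year" from the set of active features
task$col_roles$feature = setdiff(task$col_roles$feature, "year")

task$feature_names
c(task$n_features, task$ncol)
```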

nrow

(integer(1))
Returns the total number of rows with role "use".

ncol

(integer(1))
Returns the total number of columns with role "target" or "feature".

n_features

(integer(1))
Returns the total number of columns with role "feature" (i.e. the number of "active" features in the task).

feature_types

(data.table::data.table())
Returns a table with columns id and type where id are the column names of "active" features of the task and type is the storage type.

data_formats

(character())
Supported data format. Always "data.table". This is deprecated and will be removed in the future.

strata

(data.table::data.table())
If the task has columns designated with role "stratum", returns a table with one subpopulation per row and two columns:

  • N (integer()) with the number of observations in the subpopulation, and

  • row_id (list of integer()) as list column with the row ids in the respective subpopulation.

Returns NULL if there is no stratification variable. See Resampling for more information on stratification.
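
To illustrate the returned table, the following sketch stratifies the penguins task by its target column. Assigning the target as stratification variable is just one possible choice:

```r
library(mlr3)

task = tsk("penguins")
is.null(task$strata)   # no stratification variable yet

# Use the target column as stratification variable
task$col_roles$stratum = "species"

# One row per subpopulation, with columns N and row_id
task$strata
```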

groups

(data.table::data.table())
If the task has a column with designated role "group", returns a table with two columns:

  • row_id (integer()), and

  • group (vector()).

Returns NULL if there is no grouping column. See Resampling for more information on grouping.

order

(data.table::data.table())
If the task has at least one column with designated role "order", returns a table with two columns:

  • row_id (integer()), and

  • order (integer()).

Returns NULL if there is no order column.

weights

(data.table::data.table())
If the task has a column with designated role "weight", returns a table with two columns:

  • row_id (integer()), and

  • weight (numeric()).

Returns NULL if there is no weight column.

labels

(named character())
Retrieve labels (prettier formatted names) from columns. Internally queries the column label of the table in field col_info. Column ids are the names of the vector, the labels are the actual string values.

Assigning to this field updates the task by reference. You have to provide a character vector of labels, named with column ids. To remove a label, set it to NA. Alternatively, you can provide a data.frame() with the two columns "id" and "label".

col_hashes

(named character)
Hash (unique identifier) for all columns except the primary_key: A character vector, named by the columns that each element refers to.
Columns of different Tasks or DataBackends that have agreeing col_hashes always represent the same data, given that the same rows are selected. The reverse is not necessarily true: There can be columns with the same content that have different col_hashes.

Methods


Method new()

Creates a new instance of this R6 class.

Note that this object is typically constructed via a derived class, e.g. TaskClassif or TaskRegr.

Usage

Task$new(id, task_type, backend, label = NA_character_, extra_args = list())

Arguments

id

(character(1))
Identifier for the new instance.

task_type

(character(1))
Type of task, e.g. "regr" or "classif". Must be an element of mlr_reflections$task_types$type.

backend

(DataBackend)
Either a DataBackend, or any object which is convertible to a DataBackend with as_data_backend(). E.g., a data.frame() will be converted to a DataBackendDataTable.

label

(character(1))
Label for the new instance.

extra_args

(named list())
Named list of constructor arguments, required for converting task types via convert_task().


Method divide()

Deprecated.

Usage

Task$divide(ratio = NULL, ids = NULL, remove = TRUE)

Arguments

ratio

(numeric(1))
The proportion of datapoints to use as validation data.

ids

(integer())
The row ids to use as validation data.

remove

(logical(1))
If TRUE (default), the row_ids are removed from the primary task's active "use" rows, ensuring a disjoint split between the train and validation data.

Returns

Modified Self.


Method help()

Opens the corresponding help page referenced by field $man.

Usage

Task$help()


Method format()

Helper for print outputs.

Usage

Task$format(...)

Arguments

...

(ignored).


Method print()

Printer.

Usage

Task$print(...)

Arguments

...

(ignored).


Method data()

Returns a slice of the data from the DataBackend as a data.table. Rows default to observations with role "use", and columns default to features with roles "target" or "feature". If rows or cols are specified which do not exist in the DataBackend, an exception is raised.

Rows and columns are returned in the order specified via the arguments rows and cols. If rows is NULL, rows are returned in the order of task$row_ids. If cols is NULL, the column order defaults to c(task$target_names, task$feature_names). Note that it is recommended to not rely on the order of columns, and instead always address columns with their respective column name.

Usage

Task$data(rows = NULL, cols = NULL, data_format, ordered = FALSE)

Arguments

rows

(positive integer())
Vector of row indices. Always refers to the complete data set, even after filtering.

cols

(character())
Vector of column names.

data_format

(character(1))
Deprecated. Ignored, and will be removed in the future.

ordered

(logical(1))
If TRUE, data is ordered according to the columns with column role "order".

Returns

Depending on the DataBackend, but usually a data.table::data.table().
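
A brief sketch of both calling conventions on the penguins task. The selected row ids and columns are arbitrary:

```r
library(mlr3)

task = tsk("penguins")

# Explicit slice: rows and columns come back in the requested order
d = task$data(rows = c(1L, 50L, 300L), cols = c("species", "body_mass_g"))
d

# Defaults: all rows with role "use", target column(s) first, then features
names(task$data())
```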


Method formula()

Constructs a formula(), e.g. [target] ~ [feature_1] + [feature_2] + ... + [feature_k], using the features provided in argument rhs (defaults to all columns with role "feature", symbolized by ".").

Note that it is currently not possible to change the formula. However, mlr3pipelines provides a pipe operator interfacing stats::model.matrix() for this purpose: "modelmatrix".

Usage

Task$formula(rhs = ".")

Arguments

rhs

(character(1))
Right hand side of the formula. Defaults to "." (all features of the task).

Returns

formula().


Method head()

Get the first n observations with role "use" of all columns with role "target" or "feature".

Usage

Task$head(n = 6L)

Arguments

n

(integer(1)).

Returns

data.table::data.table() with n rows.


Method levels()

Returns the distinct values for columns referenced in cols with storage type "factor" or "ordered". Argument cols defaults to all such columns with role "target" or "feature".

Note that this function ignores the row roles, it returns all levels available in the DataBackend. To update the stored level information, e.g. after subsetting a task with $filter(), call $droplevels().

Usage

Task$levels(cols = NULL)

Arguments

cols

(character())
Vector of column names.

Returns

named list().


Method missings()

Returns the number of missing observations for columns referenced in cols. Considers only active rows with row role "use". Argument cols defaults to all columns with role "target" or "feature".

Usage

Task$missings(cols = NULL)

Arguments

cols

(character())
Vector of column names.

Returns

Named integer().
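
The following sketch shows that only active rows are counted. It builds a small regression task from a made-up data.frame with as_task_regr(); the data and column names are purely illustrative:

```r
library(mlr3)

# Hypothetical toy data: feature x has two missing values
df = data.frame(y = c(1, 2, 3, 4), x = c(NA, 0.5, NA, 1))
task = as_task_regr(df, target = "y")

m_all = task$missings()       # counts for target and feature
m_all

task$filter(2:4)              # row 1 (with a missing x) is no longer active
m_sub = task$missings("x")
m_sub
```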


Method filter()

Subsets the task, keeping only the rows specified via row ids rows.

This operation mutates the task in-place. See the section on task mutators for more information.

Usage

Task$filter(rows)

Arguments

rows

(positive integer())
Vector of row indices. Always refers to the complete data set, even after filtering.

Returns

Returns the object itself, but modified by reference. You need to explicitly $clone() the object beforehand if you want to keep the object in its previous state.
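
Since rows always refers to row ids of the complete data set and not to positions in the current view, repeated filtering keeps using the original ids. A small sketch on the penguins task:

```r
library(mlr3)

task = tsk("penguins")

task$filter(101:110)   # keep row ids 101..110
task$row_ids           # ids are preserved, not renumbered to 1..10

# Ids 105 and 106 remain valid although the task now has only 10 rows
task$filter(c(105L, 106L))
task$row_ids
```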


Method select()

Subsets the task, keeping only the features specified via column names cols. Note that you cannot deselect the target column, for obvious reasons.

This operation mutates the task in-place. See the section on task mutators for more information.

Usage

Task$select(cols)

Arguments

cols

(character())
Vector of column names.

Returns

Returns the object itself, but modified by reference. You need to explicitly $clone() the object beforehand if you want to keep the object in its previous state.


Method rbind()

Adds additional rows to the DataBackend stored in $backend. New row ids are automatically created, unless data has a column whose name matches the primary key of the DataBackend (task$backend$primary_key). In case of name clashes of row ids, rows in data have higher precedence and virtually overwrite the rows in the DataBackend.

All columns with the roles "target", "feature", "weight", "group", "stratum", and "order" must be present in data. Columns only present in data but not in the DataBackend of task will be discarded.

This operation mutates the task in-place. See the section on task mutators for more information.

Usage

Task$rbind(data)

Arguments

data

(data.frame()).

Returns

Returns the object itself, but modified by reference. You need to explicitly $clone() the object beforehand if you want to keep the object in its previous state.
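
A minimal sketch of appending one observation, using a made-up data.frame converted with as_task_regr(). The new row carries no primary key column, so a row id is created automatically:

```r
library(mlr3)

# Hypothetical toy data
df = data.frame(y = c(1, 2, 3), x = c(0.1, 0.2, 0.3))
task = as_task_regr(df, target = "y")

# The new data must contain the target and all feature columns
task$rbind(data.frame(y = 4, x = 0.4))

task$nrow
task$data()
```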


Method cbind()

Adds additional columns to the DataBackend stored in $backend.

The row ids must be provided as column in data (with column name matching the primary key name of the DataBackend). If this column is missing, it is assumed that the rows are exactly in the order of $row_ids. In case of name clashes of column names in data and DataBackend, columns in data have higher precedence and virtually overwrite the columns in the DataBackend.

This operation mutates the task in-place. See the section on task mutators for more information.

Usage

Task$cbind(data)

Arguments

data

(data.frame()).


Method rename()

Renames columns by mapping column names in old to new column names in new (element-wise).

This operation mutates the task in-place. See the section on task mutators for more information.

Usage

Task$rename(old, new)

Arguments

old

(character())
Old names.

new

(character())
New names.

Returns

Returns the object itself, but modified by reference. You need to explicitly $clone() the object beforehand if you want to keep the object in its previous state.
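
As an illustration, the following renames one feature of the predefined mtcars task. The new name is an arbitrary choice:

```r
library(mlr3)

task = tsk("mtcars")

# Element-wise mapping: old[1] is renamed to new[1]
task$rename(old = "wt", new = "weight_1000lbs")

task$feature_names
```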


Method set_row_roles()

Modifies the roles in $row_roles in-place.

Usage

Task$set_row_roles(rows, roles = NULL, add_to = NULL, remove_from = NULL)

Arguments

rows

(integer())
Row ids for which to change the roles.

roles

(character())
Exclusively set rows to the specified roles (remove from other roles).

add_to

(character())
Add rows with row ids rows to roles specified in add_to. Rows keep their previous roles.

remove_from

(character())
Remove rows with row ids rows from roles specified in remove_from. Other row roles are preserved.

Details

Roles are first set exclusively (argument roles), then added (argument add_to) and finally removed (argument remove_from) from different roles. Duplicated row ids are explicitly allowed, so you can replicate an observation by repeating its row_id.

Returns

Returns the object itself, but modified by reference. You need to explicitly $clone() the object beforehand if you want to keep the object in its previous state.


Method set_col_roles()

Modifies the roles in $col_roles in-place. See $col_roles for a list of possible roles.

Usage

Task$set_col_roles(cols, roles = NULL, add_to = NULL, remove_from = NULL)

Arguments

cols

(character())
Column names for which to change the roles.

roles

(character())
Exclusively set columns to the specified roles (remove from other roles).

add_to

(character())
Add columns with column names cols to roles specified in add_to. Columns keep their previous roles.

remove_from

(character())
Remove columns with column names cols from roles specified in remove_from. Other column roles are preserved.

Details

Roles are first set exclusively (argument roles), then added (argument add_to) and finally removed (argument remove_from) from different roles. Duplicated columns are removed from the same role. For tasks that only allow one target, the target column cannot be set with $set_col_roles(). Use the $col_roles field to swap the target column.

Returns

Returns the object itself, but modified by reference. You need to explicitly $clone() the object beforehand if you want to keep the object in its previous state.
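
The difference between roles and remove_from can be sketched as follows on the penguins task. Using "year" as grouping variable is only an illustrative choice:

```r
library(mlr3)

task = tsk("penguins")

# Exclusive assignment: "year" becomes the grouping column
# and is removed from all other roles, including "feature"
task$set_col_roles("year", roles = "group")

# Targeted removal: "island" only loses its "feature" role
task$set_col_roles("island", remove_from = "feature")

task$col_roles$group
task$feature_names
task$properties
```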


Method set_levels()

Set levels for columns of type factor and ordered in field col_info. You can add, remove or reorder the levels, affecting the data returned by $data() and $levels(). If you just want to remove unused levels, use $droplevels() instead.

Note that factor levels which are present in the data but not listed in the task as valid levels are converted to missing values.

Usage

Task$set_levels(levels)

Arguments

levels

(named list() of character())
List of character vectors of new levels, named by column names.

Returns

Modified self.
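
The conversion of unlisted levels to missing values can be seen in this sketch, which again uses a made-up data.frame and as_task_regr():

```r
library(mlr3)

# Hypothetical toy data with a three-level factor
df = data.frame(y = c(1, 2, 3), g = factor(c("a", "b", "c")))
task = as_task_regr(df, target = "y")

# Declare only "a" and "b" as valid levels
task$set_levels(list(g = c("a", "b")))

g = task$data()$g
g            # the observation with level "c" is now NA
levels(g)
```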


Method droplevels()

Updates the cache of stored factor levels, removing all levels not present in the current set of active rows. cols defaults to all columns with storage type "factor" or "ordered".

Usage

Task$droplevels(cols = NULL)

Arguments

cols

(character())
Vector of column names.

Returns

Modified self.
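
A sketch of the interplay between $filter(), $levels() and $droplevels(), based on a made-up data.frame. A regression task is used here so that the target is not a factor:

```r
library(mlr3)

# Hypothetical toy data
df = data.frame(y = c(1, 2, 3, 4), g = factor(c("a", "a", "b", "c")))
task = as_task_regr(df, target = "y")

task$filter(1:2)             # only observations with g == "a" remain
before = task$levels("g")    # stored levels still include "b" and "c"

task$droplevels()
after = task$levels("g")     # stored levels reduced to those in active rows

before
after
```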


Method add_strata()

Cuts numeric variables into new factor columns which are added to the task with role "stratum". This ensures that all training and test splits contain observations from all bins. The columns are named "..stratum_[col_name]".

Usage

Task$add_strata(cols, bins = 3L)

Arguments

cols

(character())
Names of columns to operate on.

bins

(integer())
Number of bins to cut into (passed to cut() as breaks). Replicated to have the same length as cols.

Returns

self (invisibly).
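
For instance, a numeric regression target can be binned for stratified resampling. This sketch uses the predefined mtcars task; the number of bins is an arbitrary choice:

```r
library(mlr3)

task = tsk("mtcars")

# Cut the numeric target "mpg" into 4 bins and use them as strata
task$add_strata("mpg", bins = 4L)

task$col_roles$stratum   # name of the generated column
task$strata              # one row per bin
```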


Method clone()

The objects of this class are cloneable with this method.

Usage

Task$clone(deep = FALSE)

Arguments

deep

Whether to make a deep clone.

Examples

# We use the inherited class TaskClassif here,
# because the base class `Task` is not intended for direct use
task = TaskClassif$new("penguins", palmerpenguins::penguins, target = "species")

task$nrow
#> [1] 344
task$ncol
#> [1] 8
task$feature_names
#> [1] "bill_depth_mm"     "bill_length_mm"    "body_mass_g"      
#> [4] "flipper_length_mm" "island"            "sex"              
#> [7] "year"             
task$formula()
#> species ~ .
#> NULL

# de-select "year"
task$select(setdiff(task$feature_names, "year"))

task$feature_names
#> [1] "bill_depth_mm"     "bill_length_mm"    "body_mass_g"      
#> [4] "flipper_length_mm" "island"            "sex"              

# Add new column "foo"
task$cbind(data.frame(foo = 1:344))
head(task)
#>    species bill_depth_mm bill_length_mm body_mass_g flipper_length_mm    island
#>     <fctr>         <num>          <num>       <int>             <int>    <fctr>
#> 1:  Adelie          18.7           39.1        3750               181 Torgersen
#> 2:  Adelie          17.4           39.5        3800               186 Torgersen
#> 3:  Adelie          18.0           40.3        3250               195 Torgersen
#> 4:  Adelie            NA             NA          NA                NA Torgersen
#> 5:  Adelie          19.3           36.7        3450               193 Torgersen
#> 6:  Adelie          20.6           39.3        3650               190 Torgersen
#>       sex   foo
#>    <fctr> <int>
#> 1:   male     1
#> 2: female     2
#> 3: female     3
#> 4:   <NA>     4
#> 5: female     5
#> 6:   male     6