This is the abstract base class for task objects like TaskClassif and TaskRegr.

Tasks serve two purposes:

  1. Tasks wrap a DataBackend, an object that provides a transparent interface to different data storage types.

  2. Tasks store meta-information, such as the role of the individual columns in the DataBackend. For example, for a classification task a single column must be marked as target column, and others as features.

Predefined (toy) tasks are stored in the Dictionary mlr_tasks, e.g. iris or boston_housing.

Format

R6::R6Class object.

Construction

Note: This object is typically constructed via a derived class, e.g. TaskClassif or TaskRegr.

t = Task$new(id, task_type, backend)

Fields

  • backend :: DataBackend.

  • col_info :: data.table::data.table()
    Table with 3 columns:

    • "id" stores the name of the column.

    • "type" holds the storage type of the variable, e.g. integer, numeric or character.

    • "levels" stores a vector of distinct values (levels) for factor and character variables.

  • col_roles :: named list()
    Each column (feature) can have an arbitrary number of roles in the learning task:

    • "feature": Regular feature used in the model fitting process.

    • "target": Target variable.

    • "label": Observation labels. May be used in plots.

    • "order": Data returned by data() is ordered by this column (or these columns).

    • "groups": During resampling, observations with the same value of the variable with role "groups" are marked as "belonging together". They will be exclusively assigned to be either in the training set or the test set for each resampling iteration. Only a single column may be marked as grouping column.

    • "weights": Observation weights. Only a single column may be marked as weights.

    col_roles keeps track of the roles with a named list of vectors of feature names. To alter the roles, use t$set_col_role().

  • row_roles :: named list()
    Each row (observation) can have an arbitrary number of roles in the learning task:

    • "use": Use in train / predict / resampling.

    • "validation": Hold the observations back unless explicitly requested. Validation sets are not yet completely integrated into the package.

    row_roles keeps track of the roles with a named list of vectors of row ids. To alter the roles, use t$set_row_role().

  • feature_names :: character()
    Returns all column names with role "feature".

  • feature_types :: data.table::data.table()
    Returns a table with columns id and type, where id holds the column names of the "active" features of the task and type holds their storage type.

  • hash :: character(1)
    Hash (unique identifier) for this object.

  • id :: character(1)
    Identifier of the Task.

  • measures :: list() of Measure
    Stores the default measures to use for this task.

  • ncol :: integer(1)
    Returns the total number of columns with role "target" or "feature".

  • nrow :: integer(1)
    Returns the total number of rows with role "use".

  • row_ids :: (integer() | character())
    Returns the row ids of the DataBackend for observations with role "use".

  • target_names :: character()
    Returns all column names with role "target".

  • task_type :: character(1)
    Stores the type of the Task.

  • properties :: character()
    Set of task properties. Possible properties are stored in mlr_reflections$task_properties.

  • groups :: data.table::data.table()
    If the task has a designated column role "groups", table with two columns: row_id (integer() | character()) and the grouping variable group (vector()). Returns NULL if there is no grouping column.

  • weights :: data.table::data.table()
    If the task has a designated column role "weights", table with two columns: row_id (integer() | character()) and the observation weights weight (numeric()). Returns NULL if there is no weight column.
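
The role bookkeeping described for col_roles can be sketched in plain base R. This is only an illustration of the documented semantics, not the package's implementation: the helper set_role() and the hand-written col_roles list are hypothetical stand-ins for the field and for set_col_role().

```r
# Hypothetical helper that mimics the documented behaviour of set_col_role().
# It operates on a plain named list, not on a real Task.
set_role = function(roles, cols, new_roles, exclusive = TRUE) {
  if (exclusive) {
    # Remove the referenced columns from every existing role first.
    roles = lapply(roles, function(x) setdiff(x, cols))
  }
  for (role in new_roles) {
    # Add the columns to each requested role, avoiding duplicates.
    roles[[role]] = union(roles[[role]], cols)
  }
  roles
}

# A named list of character vectors, one entry per role.
col_roles = list(
  feature = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
  target  = "Species",
  label   = character(0),
  order   = character(0),
  groups  = character(0),
  weights = character(0)
)

# Assigning an empty set of roles exclusively drops the column from all roles.
col_roles = set_role(col_roles, "Petal.Length", character(0))
col_roles$feature
```

The data itself is never touched in this sketch; only the list that says which column plays which role changes.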

Methods

  • data(rows = NULL, cols = NULL, data_format = NULL)
    (integer() | character(), character(), character(1)) -> any
    Returns a slice of the data from the DataBackend in the data format specified by data_format (depending on the DataBackend, but usually a data.table::data.table()).

    Rows are additionally subset to contain only observations with role "use", and columns are filtered to contain only those with roles "target" and "feature". If invalid rows or cols are specified, an exception is raised.

  • formula(rhs = NULL)
    character() -> formula
    Constructs a stats::formula, e.g. [target] ~ [feature_1] + [feature_2] + ... + [feature_k], using the features provided in argument rhs (defaults to all columns with role "feature").

  • levels(cols = NULL)
    character() -> named list()
    Returns the distinct values for columns referenced in cols with storage type "character", "factor" or "ordered". Argument cols defaults to all such columns with role "target" or "feature".

    Note that this function ignores the row roles: it returns all levels available in the DataBackend. To update the stored level information, e.g. after filtering a task, call $droplevels().

  • missings(cols = NULL)
    character() -> named integer()
    Returns the number of missing observations for columns referenced in cols. Argument cols defaults to all columns with role "target" or "feature".

  • head(n = 6)
    integer() -> data.table::data.table()
    Get the first n observations with role "use".

  • set_col_role(cols, new_roles, exclusive = TRUE)
    (character(), character(), logical(1)) -> self
    Adds the roles new_roles to columns referred to by cols. If exclusive is TRUE, the referenced columns will be removed from all other roles.

  • set_row_role(rows, new_roles, exclusive = TRUE)
    (integer() | character(), character(), logical(1)) -> self
    Adds the roles new_roles to rows referred to by rows. If exclusive is TRUE, the referenced rows will be removed from all other roles.

  • filter(rows)
    (integer() | character()) -> self
    Subsets the task, reducing it to only keep the rows specified. See the section on task mutators for more information.

  • select(cols)
    character() -> self
    Subsets the task, reducing it to only keep the columns specified. See the section on task mutators for more information.

  • cbind(data)
    data.frame() -> self
    Extends the DataBackend with additional columns. The row ids must be provided as column in data (with column name matching the primary key name of the DataBackend). If this column is missing, it is assumed that the rows are exactly in the order of t$row_ids. See the section on task mutators for more information.

  • rbind(data)
    data.frame() -> self
    Extends the DataBackend with additional rows. The new row ids must be provided as column in data. If this column is missing, new row ids are constructed automatically. See the section on task mutators for more information.

  • replace_features(data)
    data.frame() -> self
    Replaces some features of the task with features in data. This operation is similar to calling select() and cbind(). See the section on task mutators for more information.

  • droplevels(cols = NULL)
    character() -> self
    Updates the cache of stored factor levels, removing all levels not present in the current set of active rows. cols defaults to all columns with storage type "character", "factor", or "ordered".
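
The difference between the stored levels and the levels present in the active rows can be illustrated with base R factors. This sketch does not use a Task; the vector active_rows is a hypothetical stand-in for the rows with role "use", and base::droplevels() plays the part of the task's droplevels() method.

```r
species = iris$Species

# Hypothetical set of active rows: the first 50 observations, all "setosa".
active_rows = 1:50

# Subsetting a factor keeps every level, including those no longer observed.
stored = levels(species[active_rows])

# Dropping unused levels keeps only what occurs in the active rows.
updated = levels(droplevels(species[active_rows]))

stored
updated
```

Here stored still lists all three species, while updated contains only "setosa".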

S3 methods

Task mutators

The following methods change the task in-place:

  • set_row_role() and set_col_role() alter the row or column information in row_roles or col_roles, respectively. This provides a different "view" on the data without altering the data itself.

  • filter() and select() subset the set of active rows or columns in row_roles or col_roles, respectively. This provides a different "view" on the data without altering the data itself.

  • rbind() and cbind() change the task in-place by binding rows or columns to the data, but without modifying the original DataBackend. Instead, the methods first create a new DataBackendDataTable from the provided new data, and then merge both backends into an abstract DataBackend which combines the results on-demand.

  • replace_features() is a convenience wrapper around select() and cbind(). Again, the original DataBackend remains unchanged.
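
The "view" idea behind filter() and select() can be sketched in base R. The code below is an assumption-laden illustration, not the package's implementation: backend is a plain data.frame standing in for the DataBackend, and view, filter_view(), select_view() and get_data() are hypothetical names.

```r
# Stands in for the DataBackend; it is never modified below.
backend = iris

# The view records which rows and columns are currently active.
view = list(
  use     = seq_len(nrow(backend)),
  feature = setdiff(names(backend), "Species"),
  target  = "Species"
)

# Hypothetical counterparts of filter() and select(): they only shrink the view.
filter_view = function(view, rows) {
  view$use = intersect(view$use, rows)
  view
}
select_view = function(view, cols) {
  view$feature = intersect(view$feature, cols)
  view
}

# Data is materialised on demand from the untouched backend.
get_data = function(backend, view) {
  backend[view$use, c(view$target, view$feature), drop = FALSE]
}

view = filter_view(view, 1:10)
view = select_view(view, "Sepal.Length")

dim(get_data(backend, view))
dim(backend)
```

After both calls the view exposes 10 rows and 2 columns, while backend still has all 150 rows and 5 columns.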

See also

Examples

task = Task$new("iris", task_type = "classif", backend = iris)
task$nrow
#> [1] 150
task$ncol
#> [1] 5
task$head()
#>    Petal.Length Petal.Width Sepal.Length Sepal.Width Species
#> 1:          1.4         0.2          5.1         3.5  setosa
#> 2:          1.4         0.2          4.9         3.0  setosa
#> 3:          1.3         0.2          4.7         3.2  setosa
#> 4:          1.5         0.2          4.6         3.1  setosa
#> 5:          1.4         0.2          5.0         3.6  setosa
#> 6:          1.7         0.4          5.4         3.9  setosa
task$feature_names
#> [1] "Petal.Length" "Petal.Width" "Sepal.Length" "Sepal.Width" "Species"
task$formula()
#> ~.
#> NULL
# Remove "Petal.Length"
task$set_col_role("Petal.Length", character(0L))
# Remove "Petal.Width", alternative way
task$select(setdiff(task$feature_names, "Petal.Width"))
task$feature_names
#> [1] "Sepal.Length" "Sepal.Width" "Species"
# Add new column "foo"
task$cbind(data.frame(foo = 1:150))