From 367148a4bf669a61ceb18b2cbf98884516e013b9 Mon Sep 17 00:00:00 2001 From: Nicholas Reich Date: Mon, 24 Apr 2023 10:45:06 -0400 Subject: [PATCH] adding changes to document modeling tasks more clearly on its own page --- docs/source/format/intro-data-formats.md | 43 ----------------- docs/source/format/model-output.md | 2 +- docs/source/format/tasks.md | 59 ++++++++++++++++++++++++ docs/source/index.md | 1 + 4 files changed, 61 insertions(+), 44 deletions(-) create mode 100644 docs/source/format/tasks.md diff --git a/docs/source/format/intro-data-formats.md b/docs/source/format/intro-data-formats.md index 5470482d..6f933dae 100644 --- a/docs/source/format/intro-data-formats.md +++ b/docs/source/format/intro-data-formats.md @@ -73,49 +73,6 @@ Projections are requested for each combination of the following variables. * `horizon`: 1 ``` -(task_id_vars)= -## Task ID variables - -### Overview of task ID variables -Hubs typically specify that modeling outputs (e.g., forecasts or projections) should be generated for each combination of values across a set of task ID variables. For modeling exercises where the model outputs correspond to estimates or predictions of a quantity that could in principle be calculated from observable data, these task ID variables should be sufficient to uniquely identify an observed value for the modeling target that could be compared to model outputs to evaluate model accuracy. This is discussed more in the section on [target (a.k.a. truth) data](target-data). 
- -Because they are central to Hubs, these task ID variables serve several purposes: -* They are used in the Hub metadata to define modeling tasks of the hub -* They are used in model outputs to identify the modeling task to which forecasts correspond -* They are used in the specification of [target data](target-data) and methods to calculate "ground truth" target data values, to allow for alignment of model outputs with true target values -The relationships between these items are illustrated at a high level in the following diagram; sections to follow provide more detail. - -```{figure} img/hub-data-relations.jpeg ---- -figclass: margin-caption -alt: A figure showing where data from hubs is created. -name: hub-data-relations ---- -The figure shows that Hub metadata and target data are specified by the hub itself, along with any necessary functions to calculate scores or "observed values" from target data. Teams provide model output data that must conform with standards identified in the Hub metadata. -``` - -### Usage of task ID variables - -Task ID variables can be thought of as columns of a tabular representation in a model output file, where a combination of values of task ID variables would uniquely define a row of data. - -In our running Example 1 above, the task ID variables are `target`, `location`, `origin_date`, and `horizon`. We note that some task ID variables are special in that they conceptually define a modeling "target" (these are referred to in the [tasks metadata](tasks-metadata) as a `target_key`). In this example, `target` is the target key. In other examples, (such as Running Example 3) more than one variable can serve as target keys together. 
- -In general, there are no restrictions on what task ID variables may be named, however when appropriate, we suggest that Hubs adopt the following standard column names and definitions: - -* `origin_date`: the starting point that can be used for calculating a target_date via the formula target_date = origin_date + horizon * time_units_per_horizon (e.g., with weekly data, target_date is calculated as origin_date + horizon * 7 days). -* `scenario_id`: a unique identifier for a scenario -* `location`: a unique identifier for a location -* `target`: a unique identifier for the target. It is recommended, although not required, that hubs set up a single variable to define the target (i.e., as a target key), with additional detail specified in the `target_metadata` section of the [tasks metadata](tasks-metadata). -* `target_date`: for short-term forecasts, the target_date specifies the date of occurrence of the outcome of interest. For instance, if models are requested to forecast the number of hospitalizations that will occur on 2022-07-15, the target_date is 2022-07-15. -* `horizon`: The difference between the target_date and the origin_date in time units specified by the hub (e.g., may be days, weeks, or months) -* `age_group`: a unique identifier for an age group - -```{note} -We encourage Hubs to avoid redundancy in the model task columns. For example, Hubs should not include all three of `target_date`, `origin_date`, and `horizon` as task ID columns because if any two are specified, the third can be calculated directly. Similarly, if a variable is constant, it should not be included. For example, if a Hub does not include multiple targets, the `target` column could be omitted from the task ID columns. -``` - -As Hubs define new modeling tasks, they may need to introduce new task ID variables that have not been used before. 
In those cases, the new variables should be added to this list to ensure that the concepts are documented in a central place and can be reused in future efforts. - (submission-rounds)= ## Submission rounds Many Hubs will accept model output submissions over multiple rounds. In the case of the forecast hubs there has typically been one submission round per week, while the scenario hubs have had submission rounds less frequently, typically about once per month. As part of the [Hub metadata](hub-metadata), Hubs should specify a set of `round_id` values that uniquely identify the submission round. For instance, for weekly submissions the round id might be the date that submissions are due to the Hub or a specification of an epidemic week. In instances where the rounds do not follow a predetermined schedule, more generic identifiers such as “round1” may be preferred. The round id will be used as the file names of model output submissions and round-specific model abstract submissions, as well as in the Hub metadata to specify model tasks that may vary across rounds. diff --git a/docs/source/format/model-output.md b/docs/source/format/model-output.md index bdd10b3e..ea3366c2 100644 --- a/docs/source/format/model-output.md +++ b/docs/source/format/model-output.md @@ -31,7 +31,7 @@ Note on `pmf` model output type: Values are required to sum to 1 across all `typ ```{margin} Note on `sample` model output type: Depending on the Hub specification, samples with the same sample index (specified by the `type_id`) may be assumed to correspond to a joint distribution across multiple levels of the task id variables. This is discussed more below. 
``` - +(output_type_table)= | `type` | `type_id` | `value` | | ------ | ------ | ------ | | `mean` | NA (not used for mean predictions) | Numeric: the mean of the predictive distribution | diff --git a/docs/source/format/tasks.md b/docs/source/format/tasks.md new file mode 100644 index 00000000..bdd1fdf3 --- /dev/null +++ b/docs/source/format/tasks.md @@ -0,0 +1,59 @@ +(tasks)= +# Defining modeling tasks + +Every Hub is organized around "modeling tasks" that are defined to meet the needs of a project. Modeling tasks are defined in the hub's [tasks.json configuration file](tasks_metadata). Modeling tasks are defined for either a single round, or for multiple rounds that are distinguished by different values of a specific `task_id` variable. The three components of modeling tasks are [task ID variables](task_id_vars), [output types](output_types), and [target metadata](target_metadata). Broadly speaking, these three components function as follows: + + - The [task_ids](task_id_vars) object defines both labels for columns in submission files and the set of valid values for each column. Each unique combination of these values defines a single modeling task, or target. + - The [output_type](output_types) object defines accepted representations for each task. More on the different output types can be found in [this table](output_type_table). + - The [target_metadata](target_metadata) array provides additional information about each target. + +(task_id_vars)= +## Task ID variables +Hubs typically specify that modeling outputs (e.g., forecasts or projections) should be generated for each combination of values across a set of task ID variables. 
For modeling exercises where the model outputs correspond to estimates or predictions of a quantity that could in principle be calculated from observable data, these task ID variables should be sufficient to uniquely identify an observed value for the modeling target that could be compared to model outputs to evaluate model accuracy. This is discussed more in the section on [target (a.k.a. truth) data](target-data). + +Because they are central to Hubs, these task ID variables serve several purposes: +* They are used in the Hub metadata to define modeling tasks of the hub +* They are used in model outputs to identify the modeling task to which forecasts correspond +* They are used in the specification of [target data](target-data) and methods to calculate "ground truth" target data values, to allow for alignment of model outputs with true target values +The relationships between these items are illustrated at a high level in the following diagram; sections to follow provide more detail. + +```{figure} img/hub-data-relations.jpeg +--- +figclass: margin-caption +alt: A figure showing where data from hubs is created. +name: hub-data-relations +--- +The figure shows that Hub metadata and target data are specified by the hub itself, along with any necessary functions to calculate scores or "observed values" from target data. Teams provide model output data that must conform with standards identified in the Hub metadata. +``` + +### Usage of task ID variables + +Task ID variables can be thought of as columns of a tabular representation in a model output file, where a combination of values of task ID variables would uniquely define a row of data. + +In our [Running Example 1](running-examples), the task ID variables are `target`, `location`, `origin_date`, and `horizon`. We note that some task ID variables are special in that they conceptually define a modeling "target" (these are referred to in the [tasks metadata](tasks-metadata) as a `target_key`). 
In this example, `target` is the target key. In other examples (such as [Running Example 3](running-examples)), more than one variable can serve together as target keys. + +In general, there are no restrictions on what task ID variables may be named; however, when appropriate, we suggest that Hubs adopt the following standard column names and definitions: + +* `origin_date`: the starting point that can be used for calculating a `target_date` via the formula `target_date = origin_date + horizon * time_units_per_horizon` (e.g., with weekly data, `target_date` is calculated as `origin_date + horizon * 7 days`). +* `scenario_id`: a unique identifier for a scenario +* `location`: a unique identifier for a location +* `target`: a unique identifier for the target. It is recommended, although not required, that hubs set up a single variable to define the target (i.e., as a target key), with additional detail specified in the `target_metadata` section of the [tasks metadata](tasks-metadata). +* `target_date`: for short-term forecasts, the `target_date` specifies the date of occurrence of the outcome of interest. For instance, if models are requested to forecast the number of hospitalizations that will occur on 2022-07-15, the `target_date` is 2022-07-15. +* `horizon`: the difference between the `target_date` and the `origin_date`, in time units specified by the hub (e.g., days, weeks, or months) +* `age_group`: a unique identifier for an age group + +```{note} +We encourage Hubs to avoid redundancy in the model task columns. For example, Hubs should not include all three of `target_date`, `origin_date`, and `horizon` as task ID columns because if any two are specified, the third can be calculated directly. Similarly, if a variable is constant, it should not be included. For example, if a Hub does not include multiple targets, the `target` column could be omitted from the task ID columns. 
+``` + +As Hubs define new modeling tasks, they may need to introduce new task ID variables that have not been used before. In those cases, the new variables should be added to this list to ensure that the concepts are documented in a central place and can be reused in future efforts. + +(output_types)= +## Output types + +The [output_type](output_types) object defines accepted representations for each task. More on the different output types can be found in [this table](output_type_table). + +(target_metadata)= +## Target metadata + +Document here the properties of a target, as listed in the schema. diff --git a/docs/source/index.md b/docs/source/index.md index 66fe9709..813ad70e 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -56,6 +56,7 @@ overview/definitions.md :hidden: format/intro-data-formats.md format/hub-structure.md +format/tasks.md format/hub-metadata.md format/model-metadata.md format/model-output.md
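
Reviewer note (not part of the patch): the new `tasks.md` page states that any two of `origin_date`, `horizon`, and `target_date` determine the third. A minimal Python sketch of that relationship follows. The function names and the `days_per_horizon` parameter are hypothetical, chosen here for illustration only, and the sketch assumes the hub's time unit can be expressed as a fixed number of days (so it does not cover monthly horizons).

```python
from datetime import date, timedelta


def compute_target_date(origin_date: date, horizon: int, days_per_horizon: int = 7) -> date:
    """Return origin_date + horizon * days_per_horizon (weekly data by default)."""
    return origin_date + timedelta(days=horizon * days_per_horizon)


def compute_horizon(origin_date: date, target_date: date, days_per_horizon: int = 7) -> int:
    """Recover the horizon from the two dates, the inverse of compute_target_date."""
    delta_days = (target_date - origin_date).days
    if delta_days % days_per_horizon != 0:
        # The two dates are not a whole number of time units apart.
        raise ValueError("dates are not separated by a whole number of horizons")
    return delta_days // days_per_horizon


# Weekly example: a two-week-ahead forecast made with origin_date 2022-07-01.
origin = date(2022, 7, 1)
target = compute_target_date(origin, horizon=2)
print(target.isoformat())
print(compute_horizon(origin, target))
```

Because each of the three values can be derived from the other two in this way, a hub would normally store only two of them as task ID columns, which is the redundancy the note in `tasks.md` warns against.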