diff --git a/README.md b/README.md index 98ab7861..1f7d8022 100644 --- a/README.md +++ b/README.md @@ -94,9 +94,20 @@ To use Live Server: # Versioning -Documentation is [versioned by using releases](https://docs.readthedocs.io/en/stable/versions.html). Releases should track releases of Hub schema versions in [`schemas` repository](https://github.com/Infectious-Disease-Modeling-Hubs/schemas). While changes to documentation text can be commited without creating a new release and will appear in th `latest` version of the documentation, **changes to documentation related to a new schema release must be accompanied by a new release in this repository**. New releases on `hubDocs` should use the same version number as the `schemas` release but without the `v` (e.g. a `v0.0.1` `schemas` version number would be released as `0.0.1` on `hubDocs`). +Documentation is [versioned by using releases](https://docs.readthedocs.io/en/stable/versions.html). Releases should track releases of Hub schema versions in the [`schemas` repository](https://github.com/Infectious-Disease-Modeling-Hubs/schemas). While changes to documentation text can be committed without creating a new release and will appear in the `latest` version of the documentation, **changes to documentation related to a new schema release must be accompanied by a new release in this repository**. New releases on `hubDocs` should use the same version number as the `schemas` release but without the `v` (e.g. a `v0.0.1` `schemas` version number would be released as `0.0.1` on `hubDocs`). + +When creating a new release version: + +1. Check out the main branch and ensure you pull all changes from the remote repository. +2. Create a new branch off the main branch and name it using the convention `br-` +3. Open the `docs/source/conf.py` file and update the value of the `schema_version` variable with the version of the schema in the `schemas` repository you want the release to accompany (e.g. `v0.0.1`). 
This propagates the appropriate version to various substitution text elements within the docs, including the URLs pointing docson widgets to raw config schema files. +4. If the version of the schema you are preparing the release for has not been released to the `main` branch in the `schemas` repository, you can set the value of the `schema_branch` variable to the name of the branch in the `schemas` repository in which the version is being prepared (e.g. `br-v1.0.0`). This allows you to see what development versions of the schema will look like in the docson widgets while developing locally and in a pull request. If the schema has been released to `main` in the `schemas` repo, set `schema_branch` to `"main"`. The value of this variable is overridden automatically when READTHEDOCS builds the documentation on the `main` branch (or any other branch for that matter, in contrast to a pull request build) after a merge or a new release. +5. Make any changes to the documentation needed. +6. Commit and push changes (including changes to `conf.py`). +7. Create a pull request and merge after review. +8. [Create a release on GitHub](https://docs.github.com/en/repositories/releasing-projects-on-github/managing-releases-in-a-repository?tool=webui#creating-a-release) labelling it with the same version number as the `schemas` release this release is associated with but without the `v` (e.g. a `v0.0.1` `schemas` version number would be released as `0.0.1` on `hubDocs`). + -Before making a new release, **ensure that URLs to schema files which power the interactive docson widget visualisation of schemas in the [`docs/source/format/hub-metadata.md` page](https://github.com/Infectious-Disease-Modeling-Hubs/hubDocs/blob/main/docs/source/format/hub-metadata.md?plain=1) are updated** to display the new versions of `admin.json` and `tasks.json` schema files. ## Contribution guidelines In general, contributions should be made via pull requests to the `main` branch. 
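The version convention in the release steps above can be sketched in Python. This is an illustrative helper only: the names `hubdocs_release_tag`, `raw_schema_url`, and `RAW_SCHEMA_URL` are hypothetical and not part of the repository. The URL pattern mirrors the raw schema links that this change builds from `schema_branch` and `schema_version`.

```python
# Hypothetical helpers illustrating the conventions described in the README.
RAW_SCHEMA_URL = (
    "https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/"
    "schemas/{branch}/{version}/{filename}"
)


def hubdocs_release_tag(schema_version):
    """Return the hubDocs release label for a schemas version such as 'v0.0.1'."""
    # Assumption: schemas versions carry at most one leading 'v'.
    if schema_version.startswith("v"):
        return schema_version[1:]
    return schema_version


def raw_schema_url(schema_branch, schema_version, filename):
    """Build the raw URL of a schema file on a given schemas branch."""
    return RAW_SCHEMA_URL.format(
        branch=schema_branch, version=schema_version, filename=filename
    )
```

For example, `hubdocs_release_tag("v0.0.1")` gives `"0.0.1"`, matching the example in step 8.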
diff --git a/docs/source/conf.py b/docs/source/conf.py index abad9d8b..7bf18f50 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -61,7 +61,24 @@ # -- Options for EPUB output epub_show_urls = 'footnote' +schema_version = "v1.0.0" +# Use the schema_branch variable to specify a branch in the schemas repository from which config schemas will be sourced, especially for docson widgets. +# Useful if the schema being documented hasn't been released to the `main` branch in the schemas repo yet. If the version has been released already, set this to "main". +schema_branch = "br-"+schema_version + +# The following statements override any custom branch assigned to schema_branch if the build is being run on READTHEDOCS and is either a build for a new tag or on a branch +# (in contrast to being run in a pull request or locally). This ensures that any production versions of the docs published on the hubDocs `main` branch always +# point to schemas on the `main` branch of the `schemas` repository. It also allows for previewing docson widgets and links to schemas in branches other than the +# main branch in the schemas repo when developing locally or in pull requests. +import os +build_type = os.environ.get("READTHEDOCS_VERSION_TYPE") +if build_type is None: + build_type = "unknown" + +if build_type in ("tag", "branch"): + schema_branch = "main" myst_substitutions = { - 'schema_version': "v0.0.1" + 'schema_version': schema_version, + 'schema_branch': schema_branch } diff --git a/docs/source/format/hub-config.md b/docs/source/format/hub-config.md new file mode 100644 index 00000000..d9069b82 --- /dev/null +++ b/docs/source/format/hub-config.md @@ -0,0 +1,59 @@ +(hub-config)= +# Hub configuration files + +## Directory Structure +The `hub-config` directory in a modeling hub is required to contain three files: + 1. `admin.json` - JSON file containing generic information about the hub as well as static configuration settings for downstream tools such as validations, visualizations, etc. 
+ 2. `tasks.json` - JSON file specifying modeling tasks and model output formats, which may be round-specific. + 3. `model-metadata-schema.json` - JSON or YAML file defining the format of model metadata files + +```{caution} +Note: Due to technical issues, we do not currently support json references or yaml metadata files. +``` + +## Purpose +The files within the `hub-config` directory specify general configurations for a hub as well as (possibly round-specific) details of what model outputs are requested or required. Hub configuration files are used for: +* Validating model output submissions + * `tasks.json` file specifies the file format and task id, output type, value combinations (both required and optional) that submitted model output data must adhere to. + * `tasks.json` file also specifies the window of submission for each round (with the time zone information in the `admin.json` file). +* Scoring model outputs + * the hub configuration files specify the scores that are used + * the task id variables specified in the `tasks.json` can be used to join model output data with truth data for the purpose of scoring forecasts. +* Configuring model output visualizations + * Visualization tools may benefit from the ability to programmatically identify task id variables so that a separate visualization of model outputs can be generated for each combination of those variables (e.g. via faceting or menu selections). For example, it may be beneficial to produce separate visualizations for different locations or scenario ids. + * Visualization tools may give special treatment to the hub’s ensemble and baseline models, which are identified in the hub configuration files. + * The `tasks.json` file contains metadata regarding the targets, including human-readable descriptions and units, which can be used for visualization. +* Report generation + * `admin.json` allows configuration of ensemble and baseline models to be treated specially in reports. 
+ + +## Hub administrative configuration (`admin.json` file) + +The administrative hub configuration file contains global administrative settings that are expected to remain fixed throughout a hub’s existence and apply to all rounds in a hub. + +### Hub administrative configuration (`admin.json`) Interactive Schema + +#### Schema Version: {{schema_version}} +{{'[See raw schema](https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/schemas/BRANCH/SCHEMA_VERSION/admin-schema.json)'.replace('SCHEMA_VERSION', schema_version).replace('BRANCH', schema_branch)}} + +{{''.replace('SCHEMA_VERSION', schema_version).replace('BRANCH', schema_branch)}} + +```{note} + Other things we may want to consider adding here: +* Something about truth data? +* Something about scoring? +* Something about report generation? +``` + +(tasks_metadata)= +## Hub model task configuration (`tasks.json` file) +The hub model task configuration file specifies the model tasks (task ids and targets) as well as model output types. The `tasks.json` file is flexible enough to accommodate different styles of hubs. Hubs can vary from a simple forecast hub (see [US Forecast Hub example](/format/intro-data-formats.md)) to a more complex round-related scenario hub (see [US Scenario Modeling Hub example](/format/intro-data-formats.md)). 
+ + +### Model Tasks (`tasks.json`) Interactive Schema + +#### Schema Version: {{schema_version}} +{{'[See raw schema](https://raw.githubusercontent.com/Infectious-Disease-Modeling-Hubs/schemas/BRANCH/SCHEMA_VERSION/tasks-schema.json)'.replace('SCHEMA_VERSION', schema_version).replace('BRANCH', schema_branch)}} + +{{''.replace('SCHEMA_VERSION', schema_version).replace('BRANCH', schema_branch)}} + diff --git a/docs/source/format/hub-metadata.md b/docs/source/format/hub-metadata.md deleted file mode 100644 index fd65b78e..00000000 --- a/docs/source/format/hub-metadata.md +++ /dev/null @@ -1,54 +0,0 @@ -(hub-metadata)= -# Hub configuration files - -## Directory Structure -The `hub-config` directory in a modeling hub is required to contain three files: - 1. `admin.json` - JSON file defining Hub modeling targets - 2. `tasks.json` - JSON file defining information about model submission validation - 3. `model-metadata-schema.json` - Json or yaml file defining format of model metadata files - -```{caution} -Note: Due to technical issues, we do not currently support json references or yaml metadata files. -``` - - -## Purpose -Hub metadata specifies general configurations for a hub as well as (possibly round-specific) details of what model outputs are requested or required. Hub metadata are used for: -* Validating model output submissions - * submissions must adhere to the file formats and value combinations specified in the hub metadata. -* Scoring model outputs - * the hub metadata specifies the scores that are used - * the task id variables specified in the hub metadata can be used to join model output data with truth data for the purpose of scoring forecasts. -* Configuring model output visualizations - * Visualization tools may benefit from the ability to programmatically identify task id variables so that a separate visualization of model outputs can be generated for each combination of those variables (e.g. via facetting or menu selections). 
For example, it may be beneficial to produce separate visualizations for different locations or scenario ids. - * Visualization tools may give special treatment to the hub’s ensemble and baseline models, which are identified in the hub metadata. -* Report generation - * The hub’s ensemble and baseline models may be treated specially in reports - -## Recommended Standards -We divide the hub metadata into two files: -1. `admin.json` Generic information about the hub as well as static configuration settings for downstream tools such as validations, visualizations, etc. -2. `tasks.json`: Specifications of the modeling tasks and model output formats, which may be round-specific. - -These are described separately in the following subsections. - -## Hub administrative metadata (`admin.json` file) - -The administrative hub metadata file contains settings that are expected to remain fixed throughout a hub’s existence, or for which it is not required to retain past values in order to work with hub data. - -### Hub administrative metadata (`admin.json`) Interactive Schema - - - - Other things we may want to consider adding here: -* Something about truth data? -* Something about scoring? -* Something about report generation? - -(tasks_metadata)= -## Hub model task metadata (`tasks.json` file) -The hub model task metadata file specifies the model tasks and model output formats for the hub. - -### Model Tasks (`tasks.json`) Interactive Schema - - diff --git a/docs/source/format/hub-structure.md b/docs/source/format/hub-structure.md index 3491482c..fe65b524 100644 --- a/docs/source/format/hub-structure.md +++ b/docs/source/format/hub-structure.md @@ -1,16 +1,16 @@ (hub-structure)= # Structure of Hub repositories -A Hub should be structured according to the following recommendations. +A Hub should be structured according to the following recommendations. -Generally, the Hub repository is intended primarily as a storage space for primary data. 
 All other code and outputs related to model output validation, visualizations, reports, ensemble construction, etc., should be placed in repositories other than the primary Hub repository. +Generally, the Hub file structure is intended primarily as a storage space for primary data. All other code and outputs related to model output validation, visualizations, reports, ensemble construction, etc., should be placed in repositories other than the primary Hub location. The directory and file structure of a modeling hub should contain only the following directories and files: * Documentation files * Hubs should provide a documentation file (e.g., `README.md`) at the top level that describes the overall structure of the hub, as well as a documentation file within each folder that provides more detail. -* `hub-metadata` directory (see {doc}`/format/hub-metadata`) +* `hub-config` directory (see {doc}`/format/hub-config`) * `model-output` directory (see {doc}`/format/model-output`) @@ -22,5 +22,7 @@ The directory and file structure of a modeling hub should contain only the follo * `auxiliary-data` (optional, see {doc}`/format/target-data`) -* Optionally, any files necessary to define continuous integration workflows, for example for the purpose of validating submissions or updating target data. To the extent possible, only the workflow definition files should be stored within the Hub repository, with any additional scripts or functionality residing in an external repository. +* Optionally, any files necessary to define continuous integration workflows, for example for the purpose of validating submissions or updating target data. To the extent possible, only the workflow definition files should be stored within the Hub file space, with any additional scripts or functionality residing in an external location. + +Although most hubs to date have been housed in GitHub repositories, the proposed structure is more general and can be adapted for use on any shared filesystem. 
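A minimal sketch of checking the contents of a `hub-config` directory, assuming the three required file names listed in the hub configuration page added by this change. The names `REQUIRED_HUB_CONFIG_FILES` and `missing_hub_config_files` are hypothetical and do not come from any hub tooling.

```python
# Required files as listed in docs/source/format/hub-config.md in this change.
REQUIRED_HUB_CONFIG_FILES = (
    "admin.json",
    "tasks.json",
    "model-metadata-schema.json",
)


def missing_hub_config_files(present_names):
    """Return the required hub-config file names absent from `present_names`.

    `present_names` is any iterable of file names found in the directory.
    """
    present = set(present_names)
    return [name for name in REQUIRED_HUB_CONFIG_FILES if name not in present]
```

Because the function takes plain file names, it can be used on any shared filesystem, for instance as `missing_hub_config_files(p.name for p in Path("hub-config").iterdir())` with `pathlib.Path`.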
diff --git a/docs/source/format/intro-data-formats.md b/docs/source/format/intro-data-formats.md index 132aba2a..57a728d1 100644 --- a/docs/source/format/intro-data-formats.md +++ b/docs/source/format/intro-data-formats.md @@ -7,7 +7,7 @@ On this page we provide an [outline on the contents of this data formats section This section of the documentation provides standards for: * [Structure of hub repositories](hub-structure): standards for file and directory structures for Hubs -* [Hub configuration files](hub-metadata): the files needed to set up and run a modeling Hub +* [Hub configuration files](hub-config): the files needed to set up and run a modeling Hub * [Model metadata](model-metadata): metadata describing models * [Model output](model-output): standard formats for model output such as forecasts and projections that are saved in Hubs * [Target data](target-data): standard formats for target data, the eventually observable quantities of interest to a hub @@ -33,7 +33,7 @@ This Hub allows for submissions on a pre-specified set of dates specified by the * `target` (the sole **target key** variable): can only take the value "wk inc flu hosp" * `location`: “US”, “01”, “02”, …, “78” ([FIPS codes](https://en.wikipedia.org/wiki/Federal_Information_Processing_Standards) for US states and territories) * `origin_date` (this variable is specified as the one from which rounds are given IDs): weekly on Mondays -* `horizon`: 1, 2, 3, 4 +* `horizon`: 1, 2, 3, 4 (in units of weeks, which is specified in the target-metadata) ``` @@ -73,49 +73,6 @@ Projections are requested for each combination of the following variables. * `horizon`: 1 ``` -(task_id_vars)= -## Task ID variables - -### Overview of task ID variables -Hubs typically specify that modeling outputs (e.g., forecasts or projections) should be generated for each combination of values across a set of task ID variables. 
For modeling exercises where the model outputs correspond to estimates or predictions of a quantity that could in principle be calculated from observable data, these task ID variables should be sufficient to uniquely identify an observed value for the modeling target that could be compared to model outputs to evaluate model accuracy. This is discussed more in the section on [target (a.k.a. truth) data](target-data). - -Because they are central to Hubs, these task ID variables serve several purposes: -* They are used in the Hub metadata to define modeling tasks of the hub -* They are used in model outputs to identify the modeling task to which forecasts correspond -* They are used in the specification of [target data](target-data) and methods to calculate "ground truth" target data values, to allow for alignment of model outputs with true target values -The relationships between these items are illustrated at a high level in the following diagram; sections to follow provide more detail. - -```{figure} img/hub-data-relations.jpeg ---- -figclass: margin-caption -alt: A figure showing where data from hubs is created. -name: hub-data-relations ---- -The figure shows that Hub metadata and target data are specified by the hub itself, along with any necessary functions to calculate scores or "observed values" from target data. Teams provide model output data that must conform with standards identified in the Hub metadata. -``` - -### Usage of task ID variables - -Task ID variables can be thought of as columns of a tabular representation in a model output file, where a combination of values of task ID variables would uniquely define a row of data. - -In our running Example 1 above, the task ID variables are `target`, `location`, `origin_date`, and `horizon`. We note that some task ID variables are special in that they conceptually define a modeling "target" (these are referred to in the [tasks metadata](tasks-metadata) as a `target_key`). 
In this example, `target` is the target key. In other examples, (such as Running Example 3) more than one variable can serve as target keys together. - -In general, there are no restrictions on what task ID variables may be named, however when appropriate, we suggest that Hubs adopt the following standard column names and definitions: - -* `origin_date`: the starting point that can be used for calculating a target_date via the formula target_date = origin_date + horizon * time_units_per_horizon (e.g., with weekly data, target_date is calculated as origin_date + horizon * 7 days). -* `scenario_id`: a unique identifier for a scenario -* `location`: a unique identifier for a location -* `target`: a unique identifier for the target. It is recommended, although not required, that hubs set up a single variable to define the target (i.e., as a target key), with additional detail specified in the `target_metadata` section of the [tasks metadata](tasks-metadata). -* `target_date`: for short-term forecasts, the target_date specifies the date of occurrence of the outcome of interest. For instance, if models are requested to forecast the number of hospitalizations that will occur on 2022-07-15, the target_date is 2022-07-15. -* `horizon`: The difference between the target_date and the origin_date in time units specified by the hub (e.g., may be days, weeks, or months) -* `age_group`: a unique identifier for an age group - -```{note} -We encourage Hubs to avoid redundancy in the model task columns. For example, Hubs should not include all three of `target_date`, `origin_date`, and `horizon` as task ID columns because if any two are specified, the third can be calculated directly. Similarly, if a variable is constant, it should not be included. For example, if a Hub does not include multiple targets, the `target` column could be omitted from the task ID columns. 
-``` - -As Hubs define new modeling tasks, they may need to introduce new task ID variables that have not been used before. In those cases, the new variables should be added to this list to ensure that the concepts are documented in a central place and can be reused in future efforts. - (submission-rounds)= ## Submission rounds -Many Hubs will accept model output submissions over multiple rounds. In the case of the forecast hubs there has typically been one submission round per week, while the scenario hubs have had submission rounds less frequently, typically about once per month. As part of the [Hub metadata](hub-metadata), Hubs should specify a set of `round_id` values that uniquely identify the submission round. For instance, for weekly submissions the round id might be the date that submissions are due to the Hub or a specification of an epidemic week. In instances where the rounds do not follow a predetermined schedule, more generic identifiers such as “round1” may be preferred. The round id will be used as the file names of model output submissions and round-specific model abstract submissions, as well as in the Hub metadata to specify model tasks that may vary across rounds. +Many Hubs will accept model output submissions over multiple rounds. In the case of the forecast hubs there has typically been one submission round per week, while the scenario hubs have had submission rounds less frequently, typically about once per month. As part of the [Hub configuration files](hub-config), Hubs should specify a set of `round_id` values that uniquely identify the submission round. For instance, for weekly submissions the round id might be the date that submissions are due to the Hub or a specification of an epidemic week. In instances where the rounds do not follow a predetermined schedule, more generic identifiers such as “round1” may be preferred. 
The round id will be used as the file names of model output submissions and round-specific model abstract submissions, as well as in the Hub metadata to specify model tasks that may vary across rounds. diff --git a/docs/source/format/model-output.md b/docs/source/format/model-output.md index e567a865..ea3366c2 100644 --- a/docs/source/format/model-output.md +++ b/docs/source/format/model-output.md @@ -14,7 +14,7 @@ The `model-output` directory in a modeling hub is required to have the following ## Formats of model output -Model outputs are contributed by teams, and are represented in a “tidy” rectangular format, where each row corresponds to a unique model output and columns define: (1) the model task, (2) specification of the representation of the model output, and (3) the model output value. More detail about each of these is given in the following points: +Model outputs are contributed by teams, and are represented in a rectangular format, where each row corresponds to a unique model output and columns define: (1) the model task, (2) specification of the representation of the model output, and (3) the model output value. More detail about each of these is given in the following points: * Task ids: A set of columns specifying the model task, as described [here](task_id_vars). The columns used as task ids will vary across different Hubs. @@ -25,24 +25,24 @@ Model outputs are contributed by teams, and are represented in a “tidy” rect These are described more in the following table: ```{margin} -Note on `category` model output type: Values are required to sum to 1 across all `type_id` values within each combination of values of task id variables This representation should only be used if the outcome variable is truly categorical; if the categories would represent a binned discretization of an underlying continuous variable a CDF representation is preferred. 
+Note on `pmf` model output type: Values are required to sum to 1 across all `type_id` values within each combination of values of task id variables. This representation should only be used if the outcome variable is truly discrete; if the categories would represent a binned discretization of an underlying continuous variable a CDF representation is preferred. ``` ```{margin} Note on `sample` model output type: Depending on the Hub specification, samples with the same sample index (specified by the `type_id`) may be assumed to correspond to a joint distribution across multiple levels of the task id variables. This is discussed more below. ``` - +(output_type_table)= | `type` | `type_id` | `value` | | ------ | ------ | ------ | | `mean` | NA (not used for mean predictions) | Numeric: the mean of the predictive distribution | | `median` | NA (not used for median predictions) | Numeric: the median of the predictive distribution | | `quantile` | Numeric between 0.0 and 1.0: a probability level | Numeric: the quantile of the predictive distribution at the probability level specified by the type_id | | `cdf` | Numeric within the support of the outcome variable: a possible value of the target variable | Numeric between 0.0 and 1.0: the value of the cumulative distribution function of the predictive distribution at the value of the outcome variable specified by the type_id | -| `category` | String naming a possible category of the outcome variable | Numeric between 0.0 and 1.0: the value of the probability mass function of the predictive distribution when evaluated at a specified level of a categorical outcome variable. | +| `pmf` | String naming a possible category of a discrete outcome variable | Numeric between 0.0 and 1.0: the value of the probability mass function of the predictive distribution when evaluated at a specified level of a categorical outcome variable. | | `sample` | Positive integer sample index | Numeric: a sample from the predictive distribution. 
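The requirement that `pmf` values sum to 1 within each combination of task id values can be sketched as a simple check. This is an illustration under stated assumptions, not the hub's validation code: rows are assumed to be dictionaries with the `type`, `type_id`, and `value` columns from the table above, and the function name and the `tol` parameter are hypothetical.

```python
from collections import defaultdict
from math import isclose


def pmf_groups_not_summing_to_one(rows, task_id_cols, tol=1e-9):
    """Return task id combinations whose `pmf` values do not sum to 1."""
    totals = defaultdict(float)
    for row in rows:
        if row["type"] == "pmf":
            # Group by the values of the task id columns.
            key = tuple(row[col] for col in task_id_cols)
            totals[key] += row["value"]
    return [
        key for key, total in totals.items()
        if not isclose(total, 1.0, abs_tol=tol)
    ]
```

Rows of other output types are ignored, so the check can be run on a full model output table.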
-We emphasize that the `mean`, `median`, `quantile`, `cdf`, and `category` representations all summarize the marginal predictive distribution for a single combination of model task id variables. On the other hand, the `sample` representation may capture dependence across combinations of multiple model task id variables by recording samples from a joint predictive distribution. For example, suppose that the model task id variables are “forecast date”, “location” and “horizon”. A predictive mean will summarize the predictive distribution for a single combination of forecast date, location and horizon. On the other hand, there are several options for the distribution from which a sample might be drawn, capturing dependence across different levels of the task id variables, including: +We emphasize that the `mean`, `median`, `quantile`, `cdf`, and `pmf` representations all summarize the marginal predictive distribution for a single combination of model task id variables. On the other hand, the `sample` representation may capture dependence across combinations of multiple model task id variables by recording samples from a joint predictive distribution. For example, suppose that the model task id variables are “forecast date”, “location” and “horizon”. A predictive mean will summarize the predictive distribution for a single combination of forecast date, location and horizon. On the other hand, there are several options for the distribution from which a sample might be drawn, capturing dependence across different levels of the task id variables, including: 1. the joint predictive distribution across all locations and horizons within each forecast date 2. the joint predictive distribution across all horizons within each forecast date and location 3. 
the joint predictive distribution across all locations within each forecast date and horizon @@ -104,4 +104,4 @@ Validation of forecast values occurs in two steps: * Size: In combination, splitting files up and using parquet would get around GitHub limits on file sizes * Loads only data that are needed * Disadvantages: - * Harder to work with; teams and people who want to work with files need to install additional libraries \ No newline at end of file + * Harder to work with; teams and people who want to work with files need to install additional libraries diff --git a/docs/source/format/target-data.md b/docs/source/format/target-data.md index 96466289..000f94f1 100644 --- a/docs/source/format/target-data.md +++ b/docs/source/format/target-data.md @@ -41,7 +41,7 @@ To allow for reproducible analyses in the event of revisions to previously repor ## Calculating modeling targets -For any modeling Hubs with targets that can be calculated from the truth data, functions should be specified that map time series truth data in the tidy format discussed above to a value of the modeling target for each unique combination of values in the [“task id” columns](task-id-vars). This function should produce data in a tidy format with columns for all task id variables and a value column. These outputs can be consumed by later tools in our pipeline, such as evaluation tools. +For any modeling Hubs with targets that can be calculated from the truth data, functions should be specified that map time series truth data in the tabular format discussed above to a value of the modeling target for each unique combination of values in the [“task id” columns](task-id-vars). This function should produce data in a tabular format with columns for all task id variables and a value column. These outputs can be consumed by later tools in our pipeline, such as evaluation tools. 
 We illustrate with our second running example: a hypothetical forecasting exercise for influenza hospitalization rates per 100,000 population by age group at the state level in the US, with short-term incidence and “seasonal” targets. Forecasts are requested for each combination of the following variables: diff --git a/docs/source/format/tasks.md b/docs/source/format/tasks.md new file mode 100644 index 00000000..bdd1fdf3 --- /dev/null +++ b/docs/source/format/tasks.md @@ -0,0 +1,59 @@ +(tasks)= +# Defining modeling tasks + +Every Hub is organized around "modeling tasks" that are defined to meet the needs of a project. Modeling tasks are defined for a hub in the [tasks.json configuration file](tasks_metadata). Modeling tasks are defined for either a single round, or for multiple rounds that are distinguished by different values of a specific `task_id` variable. The three components of modeling tasks are [task ID variables](task_id_vars), [output types](output_types), and [target metadata](target_metadata). Broadly speaking, these three components function as follows: + + - The [task_ids](task_id_vars) object defines both labels for columns in submission files and the set of valid values for each column. Any unique combination of the values defines a single modeling task, or target. + - The [output_type](output_types) object defines accepted representations for each task. More on the different output types can be found in [this table](output_type_table). + - The [target_metadata](target_metadata) array provides additional information about each target. + +(task_id_vars)= +## Task ID variables +Hubs typically specify that modeling outputs (e.g., forecasts or projections) should be generated for each combination of values across a set of task ID variables. 
For modeling exercises where the model outputs correspond to estimates or predictions of a quantity that could in principle be calculated from observable data, these task ID variables should be sufficient to uniquely identify an observed value for the modeling target that could be compared to model outputs to evaluate model accuracy. This is discussed more in the section on [target (a.k.a. truth) data](target-data). + +Because they are central to Hubs, these task ID variables serve several purposes: +* They are used in the Hub metadata to define modeling tasks of the hub +* They are used in model outputs to identify the modeling task to which forecasts correspond +* They are used in the specification of [target data](target-data) and methods to calculate "ground truth" target data values, to allow for alignment of model outputs with true target values +The relationships between these items are illustrated at a high level in the following diagram; sections to follow provide more detail. + +```{figure} img/hub-data-relations.jpeg +--- +figclass: margin-caption +alt: A figure showing where data from hubs is created. +name: hub-data-relations +--- +The figure shows that Hub metadata and target data are specified by the hub itself, along with any necessary functions to calculate scores or "observed values" from target data. Teams provide model output data that must conform with standards identified in the Hub metadata. +``` + +### Usage of task ID variables + +Task ID variables can be thought of as columns of a tabular representation in a model output file, where a combination of values of task ID variables would uniquely define a row of data. + +In our [Running Example 1](running-examples), the task ID variables are `target`, `location`, `origin_date`, and `horizon`. We note that some task ID variables are special in that they conceptually define a modeling "target" (these are referred to in the [tasks metadata](tasks-metadata) as a `target_key`). 
In this example, `target` is the target key. In other examples (such as [Running Example 3](running-examples)), more than one variable can serve as target keys together. + +In general, there are no restrictions on what task ID variables may be named; however, when appropriate, we suggest that Hubs adopt the following standard column names and definitions: + +* `origin_date`: the starting point that can be used for calculating a target_date via the formula target_date = origin_date + horizon * time_units_per_horizon (e.g., with weekly data, target_date is calculated as origin_date + horizon * 7 days). +* `scenario_id`: a unique identifier for a scenario +* `location`: a unique identifier for a location +* `target`: a unique identifier for the target. It is recommended, although not required, that hubs set up a single variable to define the target (i.e., as a target key), with additional detail specified in the `target_metadata` section of the [tasks metadata](tasks-metadata). +* `target_date`: for short-term forecasts, the target_date specifies the date of occurrence of the outcome of interest. For instance, if models are requested to forecast the number of hospitalizations that will occur on 2022-07-15, the target_date is 2022-07-15. +* `horizon`: the difference between the target_date and the origin_date in time units specified by the hub (e.g., may be days, weeks, or months) +* `age_group`: a unique identifier for an age group + +```{note} +We encourage Hubs to avoid redundancy in the model task columns. For example, Hubs should not include all three of `target_date`, `origin_date`, and `horizon` as task ID columns because if any two are specified, the third can be calculated directly. Similarly, if a variable is constant, it should not be included. For example, if a Hub does not include multiple targets, the `target` column could be omitted from the task ID columns.
+``` + +As Hubs define new modeling tasks, they may need to introduce new task ID variables that have not been used before. In those cases, the new variables should be added to this list to ensure that the concepts are documented in a central place and can be reused in future efforts. + +(output_types)= +## Output types + +The [output_type](output_types) object defines accepted representations for each task. More on the different output types can be found in [this table](output_type_table). + +(target_metadata)= +## Target metadata + +Document here the properties of a target, as listed in the schema. diff --git a/docs/source/index.md b/docs/source/index.md index 81507f40..b6076943 100644 --- a/docs/source/index.md +++ b/docs/source/index.md @@ -4,36 +4,9 @@ This project is under active development. ``` -The Consortium of Infectious Disease Modeling Hubs is a collaboration of research teams that have built and maintained predictive modeling hubs for infectious disease applications. Working together, we have developed software to for groups that are running collaborative modeling hub efforts. This website documents the requirements for using the infrastructure that our collaborative group has set up. The following sections of this page provide an outline of the different resources created by this project. - -## Tools for building and hosting modeling hubs - -The following subsections provide pointers to resources developed by the Consortium to make designing, launching, and maintaining hubs easier. - -### Template hubs - -The [template hub repositories](https://github.com/Infectious-Disease-Modeling-Hubs?q=&type=template&language=&sort=) provided by the consortium may be cloned directly to start a new hub. Unlike the example hubs below, these repositories do not have any data in them, they just provide a skeletal structure of a hub. Currently, we only host a single [template hub](https://github.com/Infectious-Disease-Modeling-Hubs/hubTemplate). 
- -### Example hubs - -We have created some [example Hub repositories](https://github.com/Infectious-Disease-Modeling-Hubs?q=example&type=all&language=&sort=) that provide minimal working examples of hubs. These repositories could be used for ideas of how to set up configuration files for new projects. They are also used as use-cases for testing the software described below. - -- The [Simple Forecast Hub Example](https://github.com/Infectious-Disease-Modeling-Hubs/example-simple-forecast-hub) is designed to be similar to the [US CDC FluSight Hospitalization Forecasting exercise](https://github.com/cdcepi/Flusight-forecast-data) from 2022-2023. -- The [Complex Forecast Hub Example](https://github.com/Infectious-Disease-Modeling-Hubs/example-complex-forecast-hub) is designed to be similar to the [US COVID-19 Forecast Hub](https://github.com/reichlab/covid19-forecast-hub) and the [European COVID-19 Forecast Hub](https://github.com/covid19-forecast-hub-europe/covid19-forecast-hub-europe). -- The [Complex Scenario Hub Example](https://github.com/Infectious-Disease-Modeling-Hubs/example-complex-scenario-hub) is designed to be similar to the [US COVID-19 Scenario Modeling Hub](https://github.com/midas-network/covid19-scenario-modeling-hub) - -### Schema files for hub configuration - -To take advantage of the infrastructure designed by the Consortium, a repository must contain JSON configuration files in a [specific location and format](hub-metadata). The schemas that define the structure and formats of the configuration files live in their own [schemas repository](https://github.com/Infectious-Disease-Modeling-Hubs/schemas). The schemas are versioned, and every hub must point to a specific version of the schemas that they are using. 
- -## Software for modeling hubs - -The main benefit of setting up a hub using the structure outlined in this documentation is that it enables you to use a wide array of tools designed to support common modeling hub tasks, like loading model output data, plotting the model output data, building ensembles using the data, and in some cases evaluating the predictions made by different models. - -- [hubUtils](https://infectious-disease-modeling-hubs.github.io/hubUtils/) is an R package with utility functions for working with data from modelings hubs. -- [hubEnsembles](https://github.com/Infectious-Disease-Modeling-Hubs/hubEnsembles) is an R package with functionality to build simple ensembles of data from modeling hubs. - +**The Consortium of Infectious Disease Modeling Hubs** is a collaboration of research teams that have built and maintained predictive modeling hubs for infectious disease applications. Working together, we have developed software for groups that are running collaborative modeling hub efforts. This website documents the requirements for using the infrastructure that the Consortium has set up. +The [overview](overview/who-we-are.md) section provides an introduction to the project, and the [getting started](overview/getting-started.md) section outlines how to set up a working hub, as well as the different resources created by this project. 
@@ -46,6 +19,8 @@ The main benefit of setting up a hub using the structure outlined in this docume overview/who-we-are.md overview/scope.md overview/definitions.md +overview/getting-started.md +overview/software.md ``` ```{toctree} @@ -54,7 +29,8 @@ overview/definitions.md :hidden: format/intro-data-formats.md format/hub-structure.md -format/hub-metadata.md +format/task-id-vars.md +format/hub-config.md format/model-metadata.md format/model-output.md format/target-data.md diff --git a/docs/source/overview/getting-started.md b/docs/source/overview/getting-started.md new file mode 100644 index 00000000..92b343fd --- /dev/null +++ b/docs/source/overview/getting-started.md @@ -0,0 +1,32 @@ +# Getting started + +The following subsections provide pointers to resources developed by the Consortium to make designing, launching, and maintaining hubs easier. + +The simplest way to set up a modeling hub is to directly clone one from the [template hub repositories](https://github.com/Infectious-Disease-Modeling-Hubs?q=&type=template&language=&sort=) or to use one of the [example hub repositories](https://github.com/Infectious-Disease-Modeling-Hubs?q=example&type=all&language=&sort=), which are based on prior use cases. The template hubs provide a skeletal structure of a hub without any data in them, whereas the example hubs provide minimal working examples of hubs and could be used for ideas of how to set up configuration files for new projects. + +## [Template hubs](https://github.com/Infectious-Disease-Modeling-Hubs?q=&type=template&language=&sort=) + +The [`hubTemplate`](https://github.com/Infectious-Disease-Modeling-Hubs/hubTemplate) repository (under development) provides a skeleton structure for groups wishing to build and maintain a new modeling hub. This repository may be cloned to start a new repository for a modeling hub. 
+ +## [Example hubs](https://github.com/Infectious-Disease-Modeling-Hubs?q=example&type=all&language=&sort=) + +The [example Hub repositories](https://github.com/Infectious-Disease-Modeling-Hubs?q=example&type=all&language=&sort=) provide minimal working examples of hubs that can be used for ideas of how to set up configuration files for new projects. They are also used as use cases for testing the [software for modeling hubs](software.md). + +### 1. [Simple Forecast Hub Example](https://github.com/Infectious-Disease-Modeling-Hubs/example-simple-forecast-hub) +The [Simple Forecast Hub Example](https://github.com/Infectious-Disease-Modeling-Hubs/example-simple-forecast-hub) is designed to be similar to the [US CDC FluSight Hospitalization Forecasting exercise](https://github.com/cdcepi/Flusight-forecast-data) from 2022-2023. + +### 2. [Complex Forecast Hub Example](https://github.com/Infectious-Disease-Modeling-Hubs/example-complex-forecast-hub) +The [Complex Forecast Hub Example](https://github.com/Infectious-Disease-Modeling-Hubs/example-complex-forecast-hub) is designed to be similar to the [US COVID-19 Forecast Hub](https://github.com/reichlab/covid19-forecast-hub) and the [European COVID-19 Forecast Hub](https://github.com/covid19-forecast-hub-europe/covid19-forecast-hub-europe). + +### 3. [Complex Scenario Hub Example](https://github.com/Infectious-Disease-Modeling-Hubs/example-complex-scenario-hub) +The [Complex Scenario Hub Example](https://github.com/Infectious-Disease-Modeling-Hubs/example-complex-scenario-hub) is designed to be similar to the [US COVID-19 Scenario Modeling Hub](https://github.com/midas-network/covid19-scenario-modeling-hub). + + +## [Schema files](https://github.com/Infectious-Disease-Modeling-Hubs/schemas) + +To take advantage of the infrastructure designed by the Consortium, a hub must contain JSON configuration files in a [specific location and format](hub-config).
The schemas that define the structure and formats of the configuration files live in their own [schemas repository](https://github.com/Infectious-Disease-Modeling-Hubs/schemas). The schemas are versioned, and every hub must point to a specific version of the schemas that they are using. + +## [Software for modeling hubs](software.md) + +The main benefit of setting up a hub using the structure outlined in this documentation is that it enables you to use a wide array of tools designed to support common modeling hub tasks, like loading model output data, plotting the model output data, building ensembles using the data, and in some cases evaluating the predictions made by different models. + diff --git a/docs/source/overview/scope.md b/docs/source/overview/scope.md index d90cafe1..e213f585 100644 --- a/docs/source/overview/scope.md +++ b/docs/source/overview/scope.md @@ -2,9 +2,9 @@ Several of the groups that have worked on supporting and developing the modeling hubs came together in 2022 to initiate a community-driven effort to generalize hub-related tools that were developed “on the go” during the first two years of the COVID-19 pandemic. -The goal of the Consortium of Infectious Disease Modeling Hubs is to develop a central open-source suite of tools for creating, hosting, maintaining, and running a modeling hub. As mentioned above, while the focus of the motivating applications were related to predictive time-series-style modeling of outbreaks, the tools developed as part of this effort are designed to be more general and could be used for other purposes, e.g., for aggregating estimates of parameters of interest. +The goal of the Consortium of Infectious Disease Modeling Hubs is to develop a central open-source suite of tools for creating, hosting, maintaining, and running a modeling hub. 
As mentioned [previously](https://hubdocs.readthedocs.io/en/latest/overview/who-we-are.html), while the focus of the motivating applications was related to predictive time-series-style modeling of outbreaks, the tools developed as part of this effort are designed to be more general and could be used for other purposes, e.g., for aggregating estimates of parameters of interest. -The ultimate goal of this effort is to provide a suite of portable and open-source resources that could be relatively easily adapted by new modeling hubs without the need to duplicate effort. +The ultimate goal of this project is to provide a suite of portable and open-source resources that could be relatively easily adapted by new modeling hubs without the need to duplicate efforts. Initial work will focus on diff --git a/docs/source/overview/software.md b/docs/source/overview/software.md new file mode 100644 index 00000000..5d96eaf5 --- /dev/null +++ b/docs/source/overview/software.md @@ -0,0 +1,18 @@ +# Software + +To assist users in building a hub, we have developed a software suite with specific functions and uses outlined below. These tools are designed to support common modeling hub tasks, like loading model output data, plotting the model output data, building ensembles using the data, and in some cases evaluating the predictions made by different models. + +## [`hubUtils`](https://infectious-disease-modeling-hubs.github.io/hubUtils/) + +The goal of `hubUtils` is to provide a set of utility functions for downloading, plotting, and scoring forecast and truth data from Infectious Disease Modeling Hubs. You can find instructions to download and use the package [here](https://infectious-disease-modeling-hubs.github.io/hubUtils/). + +## [`hubEnsembles`](https://github.com/Infectious-Disease-Modeling-Hubs/hubEnsembles) + +`hubEnsembles` is an R package with functionality to build simple ensembles of data from modeling hubs.
Different ensembles can be built using, for instance, the mean, median, and mode. You can find the complete package and instructions for use [here](https://github.com/Infectious-Disease-Modeling-Hubs/hubEnsembles). + +## [`hubValidations`](https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations) + +The `hubValidations` repository facilitates the implementation of general validation rules that are enforced on submissions in the form of pull requests to hub repositories. You can find the complete package and instructions for use [here](https://github.com/Infectious-Disease-Modeling-Hubs/hubValidations). + + + diff --git a/docs/source/overview/who-we-are.md b/docs/source/overview/who-we-are.md index 98f2da42..e76059d6 100644 --- a/docs/source/overview/who-we-are.md +++ b/docs/source/overview/who-we-are.md @@ -4,10 +4,10 @@ The Consortium of Infectious Disease Modeling Hubs brings together groups that h The initial modeling hubs were developed with a focus on providing nowcasts, forecasts, or scenario projections of outbreaks. While the infrastructure we developed to support these efforts are readily generalizable to other applications, we note that these motivating examples were for applications that focused on (a) predictive modeling and (b) outbreak settings. -Building off of systems designed for influenza forecasting challenges led by the US CDC, the [Reich Lab at UMass-Amherst](https://reichlab.io/), in collaboration with the US CDC, developed the [US COVID-19 Forecast Hub](https://covid19forecasthub.org/) in early 2020 to support COVID-19 forecasting efforts.
This infrastructure was later adapted for use by several other modeling hubs: +Building off of systems designed for influenza forecasting challenges led by the US CDC, the [Reich Lab at UMass-Amherst](https://reichlab.io/), in collaboration with the [US CDC](https://www.cdc.gov/), developed the [US COVID-19 Forecast Hub](https://covid19forecasthub.org/) in early 2020 to support COVID-19 forecasting efforts. This infrastructure was later adapted for use by several other modeling hubs: - - German/Poland COVID-19 Forecast Hub - - US COVID-19 Scenario Modeling Hub - - European COVID-19 Forecast Hub - - German Hospitalization Nowcast Hub - - US Influenza FluSight 2022 Challenge + - [German/Poland COVID-19 Forecast Hub](https://github.com/KITmetricslab/covid19-forecast-hub-de) + - [US COVID-19 Scenario Modeling Hub](https://github.com/midas-network/covid19-scenario-modeling-hub) + - [European COVID-19 Forecast Hub](https://github.com/covid19-forecast-hub-europe/covid19-forecast-hub-europe) + - [German Hospitalization Nowcast Hub](https://github.com/KITmetricslab/hospitalization-nowcast-hub) + - [US Influenza FluSight 2022 Challenge](https://github.com/cdcepi/Flusight-forecast-data)