🆕 Schema v4 release on 2024-11-25 #32

zkamvar (Maintainer) · 2024-11-22T16:57:49Z

We will be releasing schema version v4.0.0 on Monday, 25 November 2024. This contains changes that are incompatible with earlier schema versions. We have taken care to ensure backward compatibility across the hubverse packages so you should not notice any change to your existing workflows.

In this announcement, we discuss the important changes in v4 and provide a checklist to update to v4.

Hub- and Round-level `derived_task_ids`

In v4, you can add a derived_task_ids property at the hub or round level to define task IDs that are derived from other task IDs. Validation can then ignore these task IDs where appropriate, which makes it more efficient.

If you are not familiar with them, derived task IDs are task IDs whose values are fully determined by other task IDs. A common example is target_end_date, which is most often derived from the origin_date and horizon task IDs.
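As a rough illustration of why such a task ID carries no independent information, the Python sketch below recomputes target_end_date from the other two task IDs. The function name is hypothetical, and the assumption that horizon counts weeks is ours; a hub may define horizon in other units.

```python
from datetime import date, timedelta


def derive_target_end_date(origin_date: date, horizon_weeks: int) -> date:
    """Recompute target_end_date from origin_date and horizon.

    Assumes (for illustration only) that horizon is measured in weeks.
    """
    return origin_date + timedelta(weeks=horizon_weeks)


# Every (origin_date, horizon) pair fixes exactly one target_end_date,
# so checking the derived column adds no new valid combinations.
example = derive_target_end_date(date(2024, 11, 23), 2)
print(example.isoformat())
```

Because the derived value follows from the other two, a validator that already checks origin_date and horizon has nothing extra to enumerate for target_end_date.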

All `output_type` elements gain the `is_required` element

In v4, all output_types gain the is_required element to indicate whether that particular output type is required (is_required: true) or optional (is_required: false). A special note for the sample output type: the is_required property moves from being a property of output_type_id_params to being a property of sample.

This also means that the optional property is no longer allowed in output_type_id objects (it is still allowed and necessary in task_id objects).

Take, for example, this optional set of quantiles for a v3 hub:

"quantile": {
    "output_type_id": {
        "required": null,
        "optional": [0, 0.5, 1]
    },
    ...
}

In English, this object is stating that a hub can optionally accept a quantile output type with any, all, or none of the output_type_ids in the set [0, 0.5, 1].

This is the identical object in a v4 hub:

"quantile": {
    "output_type_id": {
        "required": [0, 0.5, 1]
    },
    ...
    "is_required": false
}

The interpretation differs slightly, however. In English, this is stating that a hub can optionally accept a quantile output type. If a quantile output type is submitted, it must have all output_type_ids in the set [0, 0.5, 1].
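The v4 rule described above can be sketched in Python as follows. This is only an illustration of the logic, not how hubValidations implements it; the function name `submission_ok` and the convention that `None` means "output type not submitted" are assumptions made for this example.

```python
# The v4 quantile object from above, written as a Python dict.
quantile_v4 = {
    "output_type_id": {"required": [0, 0.5, 1]},
    "is_required": False,
}


def submission_ok(config: dict, submitted_ids) -> bool:
    """Check one output type of a submission against its v4 config.

    submitted_ids is None when the output type is absent from the
    submission (a convention assumed for this sketch).
    """
    if submitted_ids is None:
        # Skipping the output type is fine only when it is optional.
        return not config["is_required"]
    # Once submitted, every listed output_type_id must be present.
    required_ids = config["output_type_id"]["required"]
    return set(required_ids).issubset(submitted_ids)


print(submission_ok(quantile_v4, None))         # output type omitted
print(submission_ok(quantile_v4, [0, 0.5, 1]))  # complete set
print(submission_ok(quantile_v4, [0.5]))        # partial set
```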

Discussion: no more mixing optional and required output type IDs

The impetus for this change was the fact that it was possible to include both optional and required output type IDs in a model submission. While this made things flexible on the side of the modelers, downstream analyses became more difficult because of the heterogeneity of the outputs.

Part of the reason is that output_type_ids are ordered. It becomes impossible to know how to combine these output_type_ids if they are split between "required" and "optional". Take, for instance, a CDF output that requires forecasts for every other epiweek but lets modelers optionally submit every week:

"cdf": {
    "output_type_id": {
        "required": ["EW2", "EW4", "EW6", "EW8", "EW10"],
        "optional": ["EW1", "EW3", "EW5", "EW7", "EW9"]
    },
    ...
}

Situations like this mean that it becomes more difficult to validate a model submission because we cannot programmatically confirm that the output type IDs are in the correct order.
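To make the ordering problem concrete, here is a small Python illustration using the epiweek labels from the example above. It is a sketch of the difficulty, not of any hubverse code: neither concatenating the two lists nor sorting them as strings recovers the intended order, and the ordering that does work relies on knowing how the labels are formatted (an assumption a generic validator cannot make).

```python
required = ["EW2", "EW4", "EW6", "EW8", "EW10"]
optional = ["EW1", "EW3", "EW5", "EW7", "EW9"]

# Naive merge: required IDs first, then optional ones.
concatenated = required + optional

# Plain string sort: "EW10" lands between "EW1" and "EW2".
lexical = sorted(concatenated)

# Correct order needs label-specific knowledge: strip the "EW"
# prefix and compare the week numbers.
by_week = sorted(concatenated, key=lambda label: int(label[2:]))

print(concatenated)
print(lexical)
print(by_week)
```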

Setting the output_type to be either required or not allows for straightforward validation of these elements.

point estimate `output_type_id`s: Use `null` instead of `NA`

Since the beginning of the hubverse, output_type_ids for point estimates (e.g. mean and median) have been not applicable; that is, they are encoded as missing values.

In v3, if you wanted to specify output_type_id for a required mean output_type, you would write ["NA"] to indicate a presence of an absence:

"mean": {
    "output_type_id": {
        "required": ["NA"],
        "optional": null
    },
    ...
}

This led many modelers to incorrectly assume that the output_type_id column of their submissions should be the character "NA". While we updated our documentation to reflect this, adding the is_required property to output_types allows us to make the expectation clear. Now, to specify a required mean output_type, you would write:

"mean": {
    "output_type_id": {
        "required": null
    },
    ...
    "is_required": true
}

Documentation

The full documentation for v4 on https://hubverse.io/ is still in progress and will be updated shortly (we will release a news item for that later).

Updating to v4

Change the version number in the schema ID property of your tasks.json and admin.json.
Move all output_type_id values to the required property and delete any optional properties.
For each output_type, add an is_required property and use it to indicate whether the output type is required. If you have already been collecting samples, make sure to move this property out of the output_type_id_params object.
Ensure any point estimate output types have null instead of ["NA"] in their output_type_id.required property.
Specify any derived task IDs in the derived_task_ids property at the top level of the config.
Run hubAdmin::validate_hub_config() to ensure your hub config is valid.
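The output-type steps in the checklist above (moving values to required, adding is_required, and replacing ["NA"] with null) could be sketched in Python roughly as follows. This is not a hubverse tool: the function name `migrate_output_type` is hypothetical, and it assumes that a v3 output type counts as required whenever it listed any required IDs. Merged IDs are simply concatenated, so an ordered type such as cdf may still need reordering by hand.

```python
import copy

# Point estimate output types, whose output_type_id is a missing value.
POINT_TYPES = {"mean", "median"}


def migrate_output_type(name: str, v3_obj: dict) -> dict:
    """Return a v4-style copy of one v3 output type object (sketch)."""
    v4_obj = copy.deepcopy(v3_obj)

    if name == "sample":
        # is_required moves from output_type_id_params up to sample.
        params = v4_obj["output_type_id_params"]
        v4_obj["is_required"] = params.pop("is_required")
        return v4_obj

    ids = v4_obj["output_type_id"]
    required = ids.get("required") or []
    optional = ids.pop("optional", None) or []

    # Assumption: any required ID in v3 makes the whole type required.
    v4_obj["is_required"] = bool(required)

    if name in POINT_TYPES:
        ids["required"] = None  # null instead of ["NA"]
    else:
        # Order is NOT fixed up here; review ordered types manually.
        ids["required"] = required + optional
    return v4_obj


v3_quantile = {"output_type_id": {"required": None, "optional": [0, 0.5, 1]}}
v3_mean = {"output_type_id": {"required": ["NA"], "optional": None}}
print(migrate_output_type("quantile", v3_quantile))
print(migrate_output_type("mean", v3_mean))
```

Applied to the v3 quantile and mean objects shown earlier, the sketch yields the same v4 objects as in this announcement. The result should still be checked with hubAdmin::validate_hub_config().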

Full list of updates

BREAKING CHANGE: Introduction of is_required boolean property at the output_type level to configure whether the output type is required for submissions to be considered valid (#99).
BREAKING CHANGE: Disallowed optional property in output_type_id objects. As such, when a given output type is submitted, values for all output type IDs must be submitted (#100, #101, #102).
BREAKING CHANGE: To improve cross-platform interoperability, missing values in the required property of point estimate output_type_ids are now encoded with null instead of ["NA"] (#109).
Introduction of optional derived_task_ids properties to enable hub administrators to define derived task IDs (i.e. task IDs whose values depend on the values of other task IDs). The higher level derived_task_ids property sets the property globally at the hub level but can be overridden by the round level derived_task_ids property. The property primarily allows validation functionality to ignore such task IDs when appropriate, which can significantly improve validation efficiency (#96). For more information see hubValidations documentation on ignoring derived task IDs.
Added more specific schema for target_keys to ensure only string properties are allowed (#97)
Removed the requirement for a minimum value of zero in cdf numeric output_type_ids (#113).
Ensured that additional properties are not allowed in lower level properties (e.g. individual task IDs, output_type objects etc). Custom additional properties are only allowed at the round level, while additional task ID objects that match the expected task ID schema are allowed in the task_ids object (#114).
