Problems in auto-detecting `compound_taskid_set`s because of derived task IDs #26

annakrystalli · 2024-07-24T09:44:52Z

annakrystalli
Jul 24, 2024
Maintainer

While trying to finish off the work to support validation of coarser
samples, I ran up against an issue with a class of task IDs that is
problematic in other respects too: derived task IDs, i.e. task
IDs whose values depend on the values of other task IDs.

A common example already found in active hubs is the target_end_date
task ID, which is most commonly derived from the reference_date (or
origin_date) and horizon task IDs.
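As a minimal sketch of what "derived" means here, the snippet below computes target_end_date from reference_date and horizon. It assumes weekly horizons, which matches the example data shown later in this post but is not true of every hub.

```r
# Toy task ID values. Weekly horizons are an assumption taken from the
# example data in this post, not a general hubverse rule.
task_ids <- data.frame(
  reference_date = as.Date(c("2022-10-29", "2022-10-29", "2022-10-29")),
  horizon = c(0L, 1L, 2L)
)

# target_end_date is fully determined by the other two columns
task_ids$target_end_date <- task_ids$reference_date + 7L * task_ids$horizon

format(task_ids$target_end_date)
# "2022-10-29" "2022-11-05" "2022-11-12"
```

Because the derived column carries no independent information, it always varies in lockstep with the columns it is computed from.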

Problems in determining `compound_taskid_set` from submitted data

The first step to validating coarser samples is to determine the
compound_taskid_set from the submitted data and compare it to the
compound_taskid_set defined in the config. For a coarser sample to be
valid, the detected compound_taskid_set must be a subset of the
compound_taskid_set defined in the config. The non-independence of
derived task IDs is causing problems in the method used to auto-detect
the compound_taskid_set from a given submission file.
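The subset rule can be sketched as follows. `is_valid_coarser_set()` is a hypothetical helper written for illustration only; it is not a hubValidations function.

```r
# Hypothetical helper: a detected set is acceptable when it contains no
# task ID that is missing from the configured compound_taskid_set.
is_valid_coarser_set <- function(detected, configured) {
  length(setdiff(detected, configured)) == 0
}

configured <- c("reference_date", "horizon", "location", "variant")

# Coarser but valid: a subset of the configured set
ok <- is_valid_coarser_set(c("reference_date", "location"), configured)

# Invalid: contains a task ID the config does not allow
bad <- is_valid_coarser_set(
  c("reference_date", "horizon", "target_end_date"), configured
)

c(ok = ok, bad = bad)
# ok: TRUE, bad: FALSE
```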

Detecting `compound_taskid_set` from data

The approach developed to detect compound_taskid_sets from data is
to group submission data by output_type_id (i.e. sample index) and
count the number of unique values in each column within each sample.
The columns with a single unique value are then considered to be part
of the compound_taskid_set.

However, this alone can return false positives, for example in cases
where there is only a single possible value for a given task ID or
where a single value is supplied from a set of optional task ID
values. An additional check therefore looks at whether columns
identified as compound_taskid_set members at the sample level also have
a single unique value across all samples (i.e. the number of unique
values across the entire table is also 1). Such columns cannot by
definition be part of the compound_taskid_set (unless explicitly defined
in the config as compound_taskid_set members, which is also checked) and
are therefore discarded as potential compound_taskid_set members.
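The two steps described above can be sketched in base R as below. This is a simplified illustration, not the hubValidations implementation: `detect_compound_taskid_set()` and the toy table are made up for this example.

```r
# Illustrative only: a simplified version of the two-step detection.
detect_compound_taskid_set <- function(tbl, task_ids, configured = character()) {
  n_unique <- function(x) length(unique(x))

  # Step 1: task IDs with exactly one unique value within every sample
  single_within <- vapply(task_ids, function(col) {
    all(tapply(tbl[[col]], tbl$output_type_id, n_unique) == 1)
  }, logical(1))

  # Step 2: task IDs that also have one unique value across the whole table
  single_overall <- vapply(task_ids, function(col) {
    n_unique(tbl[[col]]) == 1
  }, logical(1))

  # Discard step 2 columns unless the config explicitly lists them
  keep <- single_within & (!single_overall | task_ids %in% configured)
  task_ids[keep]
}

# Toy submission: each sample covers one location and both variants
tbl <- data.frame(
  output_type_id = c("1", "1", "2", "2"),
  reference_date = "2022-10-29",
  location = c("US", "US", "01", "01"),
  variant = c("AA", "BB", "AA", "BB")
)
task_ids <- c("reference_date", "location", "variant")

# reference_date is constant across the table, so it is discarded ...
detected_a <- detect_compound_taskid_set(tbl, task_ids, configured = "location")
# ... unless the config explicitly lists it
detected_b <- detect_compound_taskid_set(
  tbl, task_ids, configured = c("reference_date", "location")
)
```

Here `detected_a` is `"location"` and `detected_b` is `c("reference_date", "location")`.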

The problems arise when the task IDs a derived task ID depends on are
members of the compound_taskid_set and the derived task ID is not.

In such cases the derived task ID is incorrectly identified as a
compound_taskid_set member: it has a single unique value within each
sample as a result of its direct relationship to the unique values in
the task IDs it is derived from, but it is not excluded by the second
check because its value varies across samples, just as the values of
the task IDs it is derived from do.
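A toy reproduction of this false positive, using the same simplified two-step logic described above (again a sketch with made-up data, not the package code):

```r
# Toy submission: each sample covers one horizon and both locations.
# target_end_date is derived from reference_date and horizon.
tbl <- data.frame(
  output_type_id = c("1", "1", "2", "2"),
  reference_date = "2022-11-05",
  horizon = c("0", "0", "1", "1"),
  location = c("US", "01", "US", "01"),
  target_end_date = c("2022-11-05", "2022-11-05", "2022-11-12", "2022-11-12")
)
task_ids <- c("reference_date", "horizon", "location", "target_end_date")
configured <- c("reference_date", "horizon", "location", "variant")

n_unique <- function(x) length(unique(x))

# Check 1: single unique value within every sample
single_within <- vapply(task_ids, function(col) {
  all(tapply(tbl[[col]], tbl$output_type_id, n_unique) == 1)
}, logical(1))

# Check 2: single unique value across the whole table
single_overall <- vapply(task_ids, function(col) {
  n_unique(tbl[[col]]) == 1
}, logical(1))

detected <- task_ids[single_within & (!single_overall | task_ids %in% configured)]

# Task IDs detected as compound but not allowed by the config
not_allowed <- setdiff(detected, configured)
```

`not_allowed` is `"target_end_date"`: the derived column passes the first check because it follows horizon, and it survives the second check because it varies across samples.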

This is not currently a straightforward problem to solve
programmatically because there is no simple, efficient and robust way
to detect such task IDs from the data, nor are such task IDs identified
in the config. It is currently a blocker to rolling out the coarse
samples functionality.

Example

The following demo of the problem uses the version of the hubValidations package and
test data on the
coarser-spl-checks
branch. If you want to reproduce any of it, you’ll need to be working
in the hubValidations repo in the coarser-spl-checks branch.

library(dplyr)
library(testthat)
devtools::load_all()

The demo test data and config are in the
tests/testthat/testdata/hub-spl directory and the sample expectations
are defined in the config as follows:

                        "sample": {
                            "output_type_id_params": {
                                "is_required": true,
                                "type": "integer",
                                "min_samples_per_task": 90,
                                "max_samples_per_task": 100,
                                "compound_taskid_set" : ["reference_date", "horizon", "location", "variant"]
                            },
                            "value": {
                                "type": "integer",
                                "minimum": 0
                            }
                        }
                    }

There are three files that have been created for testing:

- 2022-10-22-Flusight-baseline.parquet: compound_taskid_set
  structure matching that of the config,
  i.e. c("reference_date", "horizon", "location", "variant")
- 2022-10-29-Flusight-baseline.parquet: coarser sample structure and
  compound_taskid_set: c("reference_date", "location")
- 2022-11-05-Flusight-baseline.parquet: coarser sample structure and
  compound_taskid_set: c("reference_date", "horizon")

hub_path <- here::here("tests/testthat/testdata/hub-spl")

# File with the same compound_taskid_set structure as the config
tbl <- read_model_out_file(
  file_path = create_file_path("2022-10-22"),
  hub_path = hub_path, coerce_types = "chr"
)
# Coarser test files
tbl_coarse_location <- read_model_out_file(
  file_path = create_file_path("2022-10-29"),
  hub_path = hub_path, coerce_types = "chr"
)
tbl_coarse_horizon <- read_model_out_file(
  file_path = create_file_path("2022-11-05"),
  hub_path = hub_path, coerce_types = "chr"
)

Coarser spl structure that does not include all task IDs a derived task ID depends on

tbl_coarse_location has been created with a coarser sample structure
and compound_taskid_set c("reference_date", "location")
which is a valid subset of the compound_taskid_set defined in the
config and should pass validation.

print(tbl_coarse_location, width = Inf, n = 25)

## # A tibble: 8,000 × 9
##    reference_date target          horizon location variant target_end_date
##    <chr>          <chr>           <chr>   <chr>    <chr>   <chr>          
##  1 2022-10-29     wk inc flu hosp 0       US       AA      2022-10-29     
##  2 2022-10-29     wk inc flu hosp 0       US       BB      2022-10-29     
##  3 2022-10-29     wk inc flu hosp 0       US       CC      2022-10-29     
##  4 2022-10-29     wk inc flu hosp 0       US       DD      2022-10-29     
##  5 2022-10-29     wk inc flu hosp 1       US       AA      2022-11-05     
##  6 2022-10-29     wk inc flu hosp 1       US       BB      2022-11-05     
##  7 2022-10-29     wk inc flu hosp 1       US       CC      2022-11-05     
##  8 2022-10-29     wk inc flu hosp 1       US       DD      2022-11-05     
##  9 2022-10-29     wk inc flu hosp 2       US       AA      2022-11-12     
## 10 2022-10-29     wk inc flu hosp 2       US       BB      2022-11-12     
## 11 2022-10-29     wk inc flu hosp 2       US       CC      2022-11-12     
## 12 2022-10-29     wk inc flu hosp 2       US       DD      2022-11-12     
## 13 2022-10-29     wk inc flu hosp 3       US       AA      2022-11-19     
## 14 2022-10-29     wk inc flu hosp 3       US       BB      2022-11-19     
## 15 2022-10-29     wk inc flu hosp 3       US       CC      2022-11-19     
## 16 2022-10-29     wk inc flu hosp 3       US       DD      2022-11-19     
## 17 2022-10-29     wk inc flu hosp 0       01       AA      2022-10-29     
## 18 2022-10-29     wk inc flu hosp 0       01       BB      2022-10-29     
## 19 2022-10-29     wk inc flu hosp 0       01       CC      2022-10-29     
## 20 2022-10-29     wk inc flu hosp 0       01       DD      2022-10-29     
## 21 2022-10-29     wk inc flu hosp 1       01       AA      2022-11-05     
## 22 2022-10-29     wk inc flu hosp 1       01       BB      2022-11-05     
## 23 2022-10-29     wk inc flu hosp 1       01       CC      2022-11-05     
## 24 2022-10-29     wk inc flu hosp 1       01       DD      2022-11-05     
## 25 2022-10-29     wk inc flu hosp 2       01       AA      2022-11-12     
##    output_type output_type_id value
##    <chr>       <chr>          <chr>
##  1 sample      1              39   
##  2 sample      1              159  
##  3 sample      1              752  
##  4 sample      1              209  
##  5 sample      1              374  
##  6 sample      1              818  
##  7 sample      1              34   
##  8 sample      1              516  
##  9 sample      1              13   
## 10 sample      1              69   
## 11 sample      1              895  
## 12 sample      1              755  
## 13 sample      1              409  
## 14 sample      1              308  
## 15 sample      1              278  
## 16 sample      1              89   
## 17 sample      2              928  
## 18 sample      2              537  
## 19 sample      2              291  
## 20 sample      2              424  
## 21 sample      2              880  
## 22 sample      2              286  
## 23 sample      2              908  
## 24 sample      2              671  
## 25 sample      2              121  
## # ℹ 7,975 more rows

Indeed, because the compound_taskid_set does not include all task IDs
the derived task ID target_end_date depends on, the validation passes.

check_tbl_spl_compound_taskid_set(
  tbl_coarse_location, "2022-10-29",
  create_file_path("2022-10-29"), hub_path
)

## <message/check_success>
## Message:
## All samples in a model task conform to single, unique compound task ID set that
## matches or is coarser than the configured `compound_taksid_set`.

Coarser spl structure that DOES include all task IDs a derived task ID depends on

tbl_coarse_horizon has been created with a coarser sample structure
and compound_taskid_set c("reference_date", "horizon") which is also a
valid subset of the compound_taskid_set defined in the config and
should also pass validation.

print(tbl_coarse_horizon, width = Inf, n = 25)

## # A tibble: 8,000 × 9
##    reference_date target          horizon location variant target_end_date
##    <chr>          <chr>           <chr>   <chr>    <chr>   <chr>          
##  1 2022-11-05     wk inc flu hosp 0       US       AA      2022-11-05     
##  2 2022-11-05     wk inc flu hosp 0       01       AA      2022-11-05     
##  3 2022-11-05     wk inc flu hosp 0       02       AA      2022-11-05     
##  4 2022-11-05     wk inc flu hosp 0       04       AA      2022-11-05     
##  5 2022-11-05     wk inc flu hosp 0       05       AA      2022-11-05     
##  6 2022-11-05     wk inc flu hosp 0       US       BB      2022-11-05     
##  7 2022-11-05     wk inc flu hosp 0       01       BB      2022-11-05     
##  8 2022-11-05     wk inc flu hosp 0       02       BB      2022-11-05     
##  9 2022-11-05     wk inc flu hosp 0       04       BB      2022-11-05     
## 10 2022-11-05     wk inc flu hosp 0       05       BB      2022-11-05     
## 11 2022-11-05     wk inc flu hosp 0       US       CC      2022-11-05     
## 12 2022-11-05     wk inc flu hosp 0       01       CC      2022-11-05     
## 13 2022-11-05     wk inc flu hosp 0       02       CC      2022-11-05     
## 14 2022-11-05     wk inc flu hosp 0       04       CC      2022-11-05     
## 15 2022-11-05     wk inc flu hosp 0       05       CC      2022-11-05     
## 16 2022-11-05     wk inc flu hosp 0       US       DD      2022-11-05     
## 17 2022-11-05     wk inc flu hosp 0       01       DD      2022-11-05     
## 18 2022-11-05     wk inc flu hosp 0       02       DD      2022-11-05     
## 19 2022-11-05     wk inc flu hosp 0       04       DD      2022-11-05     
## 20 2022-11-05     wk inc flu hosp 0       05       DD      2022-11-05     
## 21 2022-11-05     wk inc flu hosp 1       US       AA      2022-11-12     
## 22 2022-11-05     wk inc flu hosp 1       01       AA      2022-11-12     
## 23 2022-11-05     wk inc flu hosp 1       02       AA      2022-11-12     
## 24 2022-11-05     wk inc flu hosp 1       04       AA      2022-11-12     
## 25 2022-11-05     wk inc flu hosp 1       05       AA      2022-11-12     
##    output_type output_type_id value
##    <chr>       <chr>          <chr>
##  1 sample      1              843  
##  2 sample      1              932  
##  3 sample      1              238  
##  4 sample      1              764  
##  5 sample      1              339  
##  6 sample      1              985  
##  7 sample      1              39   
##  8 sample      1              822  
##  9 sample      1              986  
## 10 sample      1              137  
## 11 sample      1              455  
## 12 sample      1              738  
## 13 sample      1              560  
## 14 sample      1              589  
## 15 sample      1              83   
## 16 sample      1              696  
## 17 sample      1              879  
## 18 sample      1              994  
## 19 sample      1              196  
## 20 sample      1              769  
## 21 sample      2              680  
## 22 sample      2              286  
## 23 sample      2              606  
## 24 sample      2              500  
## 25 sample      2              784  
## # ℹ 7,975 more rows

However, this fails validation: because of its dependence on the
values of "reference_date" and "horizon", target_end_date is being
detected as a compound_taskid_set member but is not allowed in the
config.

check_tbl_spl_compound_taskid_set(
  tbl_coarse_horizon, "2022-11-05",
  create_file_path("2022-11-05"), hub_path
)

## <error/check_error>
## Error:
## ! All samples in a model task do not conform to single, unique compound
##   task ID set that matches or is coarser than the configured
##   `compound_taksid_set`.  mt 2: Finer `compound_taskid_set` than allowed
##   detected. "target_end_date" identified as compound task ID in file but not
##   allowed in config. Compound task IDs should be one of "reference_date",
##   "horizon", "location", and "variant".

Sample structure that matches the `compound_taskid_set`

To allow for the validation of coarser samples, the validation workflow
plan has been to auto-detect and validate the compound_taskid_set in a
file at the start of sample validation and use it to inform subsequent
checks. However, the problematic auto-detection is now affecting
previously successful validation of files which conform to the
compound_taskid_set (i.e. are not coarser) when all task IDs a
derived task ID depends on are members of the compound_taskid_set, for
the same reasons demonstrated above.

This is the case with tbl which was generated with a
compound_taskid_set structure matching that of the config,
i.e. c("reference_date", "horizon", "location", "variant").

check_tbl_spl_compound_taskid_set(
  tbl, "2022-10-22",
  create_file_path("2022-10-22"), hub_path
)

## <error/check_error>
## Error:
## ! All samples in a model task do not conform to single, unique compound
##   task ID set that matches or is coarser than the configured
##   `compound_taksid_set`.  mt 2: Finer `compound_taskid_set` than allowed
##   detected. "target_end_date" identified as compound task ID in file but not
##   allowed in config. Compound task IDs should be one of "reference_date",
##   "horizon", "location", and "variant".

The only solution for successful validation is in the
compound_taskid_set setting in the config. If all task IDs a derived
task ID depends on are members of the compound_taskid_set, then the only
way to robustly validate the compound_taskid_set is for the derived
task ID to also be a member of the compound_taskid_set.

Here I’m temporarily modifying the compound_taskid_set to include
"target_end_date" through mocking:

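# Copy of the tasks config with "target_end_date" added to the
# compound_taskid_set of the sample output type in model task 2
config_tasks <- purrr::modify_in(
  hubUtils::read_config_file(
    fs::path(hub_path, "hub-config", "tasks.json")
  ),
  list(
    "rounds", 1, "model_tasks", 2,
    "output_type", "sample",
    "output_type_id_params", "compound_taskid_set"
  ),
  ~ c("reference_date", "horizon", "location", "variant", "target_end_date")
)

# Make calls to hubUtils::read_config during the check return the
# modified config
mockery::stub(
  check_tbl_spl_compound_taskid_set,
  "hubUtils::read_config",
  config_tasks,
  2
)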

This results in successful validation of both files that were previously
failing:

check_tbl_spl_compound_taskid_set(
  tbl, "2022-10-22",
  create_file_path("2022-10-22"), hub_path
)

## <message/check_success>
## Message:
## All samples in a model task conform to single, unique compound task ID set that
## matches or is coarser than the configured `compound_taksid_set`.

check_tbl_spl_compound_taskid_set(
  tbl_coarse_horizon, "2022-11-05",
  create_file_path("2022-11-05"), hub_path
)

## <message/check_success>
## Message:
## All samples in a model task conform to single, unique compound task ID set that
## matches or is coarser than the configured `compound_taksid_set`.

CAUTION

While this does provide a solution, I’d consider it brittle. Because
derived variables and the task IDs they depend on are not recorded
anywhere, there is no way to detect an effectively mis-configured
compound_taskid_set when validating the config, and this could cause
validation problems further down the line. Hub admins would likely
only discover the problem in the first round of submissions/validations
and would then need to dig into the documentation and update the config
to fix it.

Potential solution

Record derived variables

Given the problems with derived task IDs, it would be useful to record
derived task IDs and perhaps also the task IDs they depend on in the
config. This would allow us to:

- ignore them where appropriate in certain validation tests, especially
  when expanding grids of valid values, with the understanding that
  custom validations would be validating requirements of such task IDs.
- exclude them from the compound_taskid_set where appropriate
  (unless they are explicitly included).
- exclude them from the submission template where appropriate.

In an ideal world we would also be able to encode the relationship of
derived variables to other task IDs, but this is a more complex problem
and not necessary to solve the immediate problems.
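As a rough sketch of how recorded derived task IDs could feed into detection: the `derived_task_ids` vector below is hypothetical and stands in for whatever the config would end up recording. Nothing like it exists in the config discussed in this post.

```r
# Hypothetical: derived task IDs as they might be recorded in the config
derived_task_ids <- c("target_end_date")

configured <- c("reference_date", "horizon", "location", "variant")

# Result of auto-detection on a file such as tbl_coarse_horizon above
detected <- c("reference_date", "horizon", "target_end_date")

# Drop derived task IDs from the detected set unless the config
# explicitly lists them as compound_taskid_set members
ignorable <- setdiff(derived_task_ids, configured)
detected_clean <- setdiff(detected, ignorable)

# The cleaned set is now a valid subset of the configured set
is_valid <- all(detected_clean %in% configured)
```

With this adjustment `detected_clean` is `c("reference_date", "horizon")` and `is_valid` is `TRUE`, without having to add target_end_date to the configured compound_taskid_set.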

annakrystalli · 2024-07-25T08:53:25Z

annakrystalli
Jul 25, 2024
Maintainer Author

I've thought about this a bit more and I think a warning in the documentation, explaining that derived task IDs need to be included in the compound_taskid_set if all the task IDs they are derived from are members, should be enough to release the validation of coarser samples functionality.

As an additional solution to the compound task id set issues described I've also suggested the following in: hubverse-org/hubValidations#88 (comment)

Even if we solved the issues described above, I generally worry that auto-detecting compound_taskid_set during validation might cause further unforeseen errors down the line.

As such, I was wondering if this is something submitting teams should actually record and communicate, perhaps in their model metadata file? Of course this assumes a single model would have a single compound task ID structure.

Whether we can implement this largely depends on how stable sample structures are within a model, or whether any variation can be captured in the model metadata effectively enough to be used by validation functions. If this were possible, I would consider this declarative approach much more robust.

Derived task IDs in general

I still feel that thinking about how to record/handle such task IDs could be beneficial in a number of areas, so let's keep this discussion open and hopefully active.
