Problems in auto-detecting `compound_taskid_set`s because of derived task IDs
#26
Replies: 1 comment
-
I've thought about this a bit more and I think a warning in the documentation about the need to include derived task IDs in the …

As an additional solution to the compound task ID set issues described, I've also suggested the following in: hubverse-org/hubValidations#88 (comment)

Whether we can implement this largely depends on how stable sample structures are within a model, or whether any variation can be captured in the model metadata effectively enough to be used by validation functions. If this were possible, I would consider this declarative approach much more robust.

Derived task IDs in general

I still feel that thinking about how to record/handle such task IDs could be beneficial in a number of areas, so let's keep this discussion open and hopefully active.
-
While trying to finish off the work to support validation of coarser
samples I ran up against an issue with a class of task IDs that is
problematic in other respects too: derived task IDs, i.e. task
IDs whose values depend on the values of other task IDs.
A common example already found in active hubs is the `target_end_date` task ID, which is most commonly derived from the `reference_date` or `origin_date` and `horizon` task IDs.

Problems in determining `compound_taskid_set` from submitted data

The first step to validating coarser samples is to determine the
`compound_taskid_set` from the submitted data and compare it to the `compound_taskid_set` defined in the config. For a coarser sample to be valid, the detected `compound_taskid_set` must be a subset of the `compound_taskid_set` defined in the config. The non-independence of derived task IDs is causing problems in the method used to auto-detect the `compound_taskid_set` from a given submission file.

Detecting `compound_taskid_set` from data

The approach developed to detect `compound_taskid_set`s from data is to group submission data by `output_type_id` (i.e. sample index) and count the number of unique values in each column within each sample. The columns with single unique values are then considered to be part of the
`compound_taskid_set`.

However, to ensure false positives are not returned, for example in cases where there might only be a single possible value for a given task ID or where a single value is supplied from a set of optional task ID values, an additional check is performed that looks at whether columns identified as `compound_taskid_set` members at the sample level also have a single unique value across all samples (i.e. the number of unique values across the entire table is also 1). Such columns cannot by definition be part of the `compound_taskid_set` (unless explicitly defined in the config as `compound_taskid_set` members, which is also checked) and are therefore discarded as potential `compound_taskid_set`
members.

The problems arise when the task IDs a derived task ID depends on are members of the `compound_taskid_set` and the derived task ID is not.

In such cases the derived task ID will be incorrectly identified as a `compound_taskid_set` member: it has a single unique value within each sample as a result of its direct relationship to the unique values in the task IDs it's derived from, but it fails to be excluded by the second check because its value varies across samples, as the values of the task IDs it's derived from also do.
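The two-step check described above, and the way a derived task ID slips through it, can be illustrated with a minimal sketch. This is a language-agnostic illustration written in plain Python, not the hubValidations implementation; the function name `detect_compound_taskid_set`, the toy rows and the assumption that `target_end_date` equals `reference_date` plus `horizon` weeks are all made up for the example.

```python
from collections import defaultdict
from datetime import date, timedelta


def detect_compound_taskid_set(rows, task_ids, configured, sample_col="output_type_id"):
    """Hypothetical helper: task IDs constant within every sample that either
    vary across the whole table or are explicitly configured as members."""
    within = defaultdict(lambda: defaultdict(set))  # sample -> task ID -> values
    overall = defaultdict(set)                      # task ID -> values in whole table
    for row in rows:
        for tid in task_ids:
            within[row[sample_col]][tid].add(row[tid])
            overall[tid].add(row[tid])

    detected = []
    for tid in task_ids:
        # Check 1: a single unique value within each sample.
        constant_within = all(len(vals[tid]) == 1 for vals in within.values())
        # Check 2: discard columns with a single value across the table,
        # unless the config explicitly lists them as members.
        keeps = len(overall[tid]) > 1 or tid in configured
        if constant_within and keeps:
            detected.append(tid)
    return detected


# Toy submission: one sample per (reference_date, horizon), shared across locations.
configured = ["reference_date", "horizon"]
rows, sample = [], 0
for ref in (date(2022, 11, 5), date(2022, 11, 12)):
    for horizon in (1, 2):
        sample += 1
        for location in ("US", "01"):
            rows.append({
                "output_type_id": f"s{sample}",
                "reference_date": ref,
                "horizon": horizon,
                "location": location,
                # Assumed derivation, for illustration only.
                "target_end_date": ref + timedelta(weeks=horizon),
            })

task_ids = ["reference_date", "horizon", "location", "target_end_date"]
detected = detect_compound_taskid_set(rows, task_ids, configured)
unexpected = [tid for tid in detected if tid not in configured]
print(detected)    # includes the derived target_end_date
print(unexpected)  # task IDs detected but not allowed by the config
```

In this toy data `target_end_date` is constant within each sample and varies across samples, so both checks let it through even though it is not an independent member of the set.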
This is not currently a straightforward problem to solve
programmatically because there’s no simple/efficient/robust way to
detect such task IDs from the data, nor are such task IDs identified in the config. It
is currently a blocker to rolling out the coarse samples functionality.
Example
The demo test data and config are in the `tests/testthat/testdata/hub-spl` directory and the sample expectations are defined in the config as such:
There are three files that have been created for testing:
- `2022-10-22-Flusight-baseline.parquet`: `compound_taskid_set` structure matching that of the config, i.e. `c("reference_date", "horizon", "location", "variant")`
- `2022-10-29-Flusight-baseline.parquet`: coarser sample structure and `compound_taskid_set`: `c("reference_date", "location")`
- `2022-11-05-Flusight-baseline.parquet`: coarser sample structure and `compound_taskid_set`: `c("reference_date", "horizon")`
Coarser sample structure that does not include all task IDs a derived task ID depends on

`tbl_coarse_location` has been created with a coarser sample structure and `compound_taskid_set` `c("reference_date", "location")`, which is a valid subset of the `compound_taskid_set` defined in the config and should pass validation.

Indeed, because the `compound_taskid_set` does not include all task IDs the derived task ID `target_end_date` depends on, the validation passes.

Coarser sample structure that DOES include all task IDs a derived task ID depends on
`tbl_coarse_horizon` has been created with a coarser sample structure and `compound_taskid_set` `c("reference_date", "horizon")`, which is also a valid subset of the `compound_taskid_set` defined in the config and should also pass validation.

However, this fails validation because `target_end_date`, because of its dependence on the values of `"reference_date"` and `"horizon"`, is being detected as a `compound_taskid_set` member but is not allowed in the config.
Sample structure that matches the `compound_taskid_set`
To allow for the validation of coarser samples, the validation workflow plan has been to auto-detect and validate the `compound_taskid_set` in a file at the start of sample validation and use it to inform subsequent checks. However, the problematic auto-detection is now affecting previously successful validation of files which conform to the `compound_taskid_set` (i.e. were not coarser) when all task IDs a derived task ID depends on are members of the `compound_taskid_set`, for the same reasons demonstrated above.
This is the case with `tbl`, which was generated with a `compound_taskid_set` structure matching that of the config, i.e. `c("reference_date", "horizon", "location", "variant")`.

The only solution for successful validation lies in the `compound_taskid_set` setting in the config. If all task IDs a derived task ID depends on are members of the `compound_taskid_set`, then the only way for robust validation of the `compound_taskid_set` is for the derived task ID to also be a member of the `compound_taskid_set`.

Here I’m temporarily modifying the `compound_taskid_set` to include `"target_end_date"` through mocking.

This results in successful validation of both files that were previously failing:
CAUTION
While this does provide a solution, I’d consider it brittle because there is no way, when validating the config, to detect an effectively mis-configured `compound_taskid_set` (which could cause validation problems further down the line), because derived variables and the task IDs they depend on are not recorded anywhere. So this would be something hub admins would likely discover in the first round of submissions/validations, and they would need to do some digging into the documentation and update the config to fix it.
Other derived task ID problem areas
It has already been noted that derived task IDs are problematic in other areas too.

Most notably, they pollute the expanded grid of task IDs that is generated when validating value combinations, assigning submitted data to modeling tasks and creating the submission template. This leads to:
- expanded grids used in validation of valid values being much, much larger than they need to be, an area which already needs to be optimized. #93
- submission templates also being much larger than they need to be and containing value combinations that are not valid.
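To make the grid inflation concrete, here is a small sketch in plain Python (not the hub tooling itself). The task ID values and the weekly derivation of `target_end_date` are assumptions chosen for illustration; the point is only to compare a full cross-product grid with the grid of combinations that are actually consistent.

```python
from datetime import date, timedelta
from itertools import product

# Illustrative task ID values (assumed, not taken from a real hub config).
reference_dates = [date(2022, 10, 22), date(2022, 10, 29)]
horizons = [1, 2, 3]
locations = ["US", "01"]

# Every target_end_date value that can legitimately occur.
target_end_dates = sorted(
    {ref + timedelta(weeks=h) for ref, h in product(reference_dates, horizons)}
)

# Naive expansion: the derived task ID is treated like an independent one.
full_grid = list(product(reference_dates, horizons, locations, target_end_dates))

# Only rows where the derived value agrees with the IDs it depends on are valid.
valid_grid = [
    (ref, h, loc, ted)
    for ref, h, loc, ted in full_grid
    if ted == ref + timedelta(weeks=h)
]

print(len(full_grid), len(valid_grid))  # 48 12
```

Here three quarters of the expanded grid consists of combinations that can never be valid, and the ratio grows with the number of distinct derived values.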
Potential solution
Record derived variables
Given the problems with derived task IDs, it would be useful to record derived task IDs, and perhaps also the task IDs they depend on, in the config. This would allow us to:
- ignore them where appropriate in certain validation tests, especially when expanding grids of valid values, with the understanding that custom validations would be validating requirements of such task IDs.
- … `compound_taskid_set` where appropriate (unless they are explicitly excluded).
- … template where appropriate.
In an ideal world we would also be able to encode the relationship of
derived variables to other task IDs but this is a more complex problem and not necessary
to solve the immediate problems.
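As a rough sketch of how a recorded list of derived task IDs could be used, the snippet below filters them out before comparing a detected set against the configured one. The `derived_task_ids` config field and the `extra_members` helper are hypothetical names invented for this illustration; nothing here reflects an existing config schema or validation function.

```python
def extra_members(detected, config):
    """Hypothetical check: detected task IDs not allowed by the config,
    ignoring recorded derived task IDs unless they are configured members."""
    configured = set(config["compound_taskid_set"])
    # Derived IDs are only ignored when the config does not list them explicitly.
    ignorable = set(config.get("derived_task_ids", [])) - configured
    return [tid for tid in detected if tid not in configured and tid not in ignorable]


# What auto-detection returns when all parents of target_end_date are members.
detected = ["reference_date", "horizon", "target_end_date"]

config_without = {"compound_taskid_set": ["reference_date", "horizon"]}
config_with = {
    "compound_taskid_set": ["reference_date", "horizon"],
    "derived_task_ids": ["target_end_date"],  # hypothetical field
}

extra_without = extra_members(detected, config_without)
extra_with = extra_members(detected, config_with)
print(extra_without)  # ['target_end_date'] -> validation would fail
print(extra_with)     # [] -> detected set is a valid subset
```

The same recorded list could in principle be consulted when expanding grids or building templates, but that is beyond what this sketch shows.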