Bs/fix target date schema/265 #267
Conversation
Prior to this change, target data files were created using the Polars `write_parquet` method. However, that operation results in column datatypes that are interpreted by Arrow as `large_string`. Because we're including our partition fields in the parquet file, the `large_string` results in a type mismatch when R packages use `arrow::open_dataset` to read the files (the value in the partition key is read as a `string`, not a `large_string`).
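As context, here's a minimal sketch of the kind of write-path change this implies; the data frame, schema, and file name below are illustrative, not taken from this PR:

```python
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

df = pl.DataFrame({"nowcast_date": ["2024-06-01"], "value": [1.0]})

# df.write_parquet(...) would produce Utf8 columns that Arrow reads back
# as large_string; casting to an explicit PyArrow schema avoids that.
schema = pa.schema([("nowcast_date", pa.string()), ("value", pa.float64())])
pq.write_table(df.to_arrow().cast(schema), "timeseries.parquet")
```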
Looks good, though I had one question about using `string` as the data type for some columns that are dates. Requesting a change to either update the data types to dates, or add a comment where the schema is defined explaining why we can't use dates.
src/get_target_data.py (outdated excerpt):

```python
("nowcast_date", pa.string()),
("sequence_as_of", pa.string()),
("tree_as_of", pa.string()),
```
Should these be a `pa.date32()` data type?
This may be addressed with an answer of "no" in your comment on the unit tests below, because it's being used as a partition key? If so, maybe worth adding a comment here explaining the reason for the string data type for these date columns.
You're right that the dates used as partition keys should be strings to prevent a data type mismatch when they're read by Arrow's `open_dataset()` function. Additionally, `nowcast_date` is a string in `tasks.json` (I think? The values are in quotes), and the target data format RFC states that column types should match their config counterparts. Which makes me wonder if `target_date` should also be a string?

So we're left with `sequence_as_of` and `tree_as_of`. Those have been strings in the target data PRs we've been merging all along, though it wasn't explicit. These two columns aren't part of the hub config, so we can set their types to whatever is convenient. Would it be helpful for downstream analysis if they were `pa.date32`?

[edited after I remembered that `sequence_as_of` is also used as a partitioning key (in the time series)]
it seems so unfortunate that partitioning on a particular column impacts what data types you're allowed to use for that column!
I'm now inclined to just say all of these date columns should be stored as strings? It seems better to be consistent about this than to have different data types for different columns, right?
Here's my current thinking (see the sketch after this list):

- Drop `sequence_as_of` as a partition key in the time series data. It's functionally dependent on `nowcast_date`, so as things now stand, it doesn't provide any partitioning value. If we eventually decide to add additional daily `sequence_as_of` data for each round (e.g., for charting), we can revisit; there's no need for premature optimization.
- Cast `target_date`, `sequence_as_of`, and `tree_as_of` as `date32`.
- Keep `nowcast_date` as a string. Regardless of whether or not we include this partition key in the parquet file, Arrow will read it in as a string (presumably the Hubverse tools that read target data will have to cast it).
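A hypothetical PyArrow rendering of that proposal (field names come from the discussion above; the order is illustrative):

```python
import pyarrow as pa

# Sketch of the proposed column types at this point in the discussion.
proposed_schema = pa.schema([
    ("nowcast_date", pa.string()),   # partition key: Arrow reads it as string
    ("target_date", pa.date32()),
    ("sequence_as_of", pa.date32()),
    ("tree_as_of", pa.date32()),
])
```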
Additional detail

> it seems so unfortunate that partitioning on a particular column impacts what data types you're allowed to use for that column!
To be fair, that limitation exists for us because we want to keep the partition columns in the physical parquet file.
When not provided with an explicit schema, Arrow will apply a string data type to the partitioning column. If that column also exists in the parquet file itself, the data type must match.
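To illustrate with PyArrow rather than R arrow (a hypothetical sketch; the path is made up):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# With partitioning="hive" and no explicit schema, date-like partition
# values such as nowcast_date=2024-06-01 are inferred as string.
inferred = ds.dataset("target-data/", format="parquet", partitioning="hive")

# Declaring the partition schema explicitly overrides that inference,
# but then the type must match the column stored in the file itself.
part = ds.partitioning(pa.schema([("nowcast_date", pa.date32())]), flavor="hive")
explicit = ds.dataset("target-data/", format="parquet", partitioning=part)
```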
By the way, I was wrong about `nowcast_date` and `target_date` being strings based on how they're expressed in `tasks.json`. When reading the model output files, `hubData` sets the schema as follows (presumably it infers it from the data):
```
── Connection schema
hub_connection with 48 Parquet files
8 columns
nowcast_date: date32[day]
target_date: date32[day]
location: string
clade: string
output_type: string
output_type_id: string
value: double
model_id: string
```
The mismatch doesn't matter for this convo, because even if we didn't include the partition columns in our file, Arrow would cast them to string unless told otherwise. Just wanted to acknowledge I was wrong 😄
I confirmed that when directly reading in the model output files with Polars (skipping Arrow), the column data types are what you listed above. I think it would be really ideal if we could figure out how to get at least those 7 columns (other than `model_id`) to have the same data types in the target/oracle output data and the model output data.
> When not provided with an explicit schema, Arrow will apply a string data type to the partitioning column. If that column also exists in the parquet file itself, the data type must match.
I would expect that any arrow-based tools we eventually build for loading target data would apply a schema for the oracle output (this is trickier for the nowcast data). Does that resolve this problem, so that we could save the things with a date data type and expect anyone using arrow to load them in to specify a schema with date data types?
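For example, the reader-side fix might look like this PyArrow sketch (the schema fields and path are hypothetical; R's `arrow::open_dataset()` similarly accepts a `schema` argument):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Supplying the schema up front means the date columns (including the
# hive partition key) come back as date32 instead of inferred strings.
schema = pa.schema([("nowcast_date", pa.date32()), ("value", pa.float64())])
dataset = ds.dataset("target-data/oracle-output/", format="parquet",
                     schema=schema, partitioning="hive")
```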
> I would expect that any arrow-based tools we eventually build for loading target data would apply a schema for the oracle output (this is trickier for the nowcast data). Does that resolve this problem, so that we could save the things with a date data type and expect anyone using arrow to load them in to specify a schema with date data types?
Yes! If we're operating under the assumption that Hubverse tools will apply a schema when reading target data, then this script can type everything as you suggest (i.e., match the model-output schema).
Just pushed some changes for this.
Assuming that any Hubverse tools reading target-data files apply a schema via `arrow::open_dataset()`, we should be able to use the hub's model output schema when writing them. hubDataPy isn't online yet (presumably it will be able to determine a hub's model-output schema like its R counterpart), so this changeset manually creates a schema to use.
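A sketch of what that manually-created schema could look like, mirroring the `hubData` connection schema quoted above (the constant name here is made up; `model_id` is omitted since it isn't a target-data column):

```python
import pyarrow as pa

# Column types mirror the hubData connection schema for model output.
TARGET_DATA_SCHEMA = pa.schema([
    ("nowcast_date", pa.date32()),
    ("target_date", pa.date32()),
    ("location", pa.string()),
    ("clade", pa.string()),
    ("output_type", pa.string()),
    ("output_type_id", pa.string()),
    ("value", pa.float64()),
])
```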
lgtm! thanks for your patience with my severe waffling about how to best set all this up
Resolves #265
Background
We want Arrow to read the `nowcast_date` and `sequence_as_of` columns as a `string` data type rather than `large_string` (see the issue linked above for more details). This PR changes the method of writing both oracle and time series target data parquet files to ensure that all text columns will be read by Arrow as `string`.
.Testing
The tests in `get_target_data.py` now read the target data files with PyArrow to confirm the schema. Some manual poking around confirmed this in R.
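For a sense of the PyArrow check, a hypothetical schema assertion (the path and fields are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Read back a written file and confirm the column types.
schema = pq.read_schema("timeseries/nowcast_date=2024-06-01/part-0.parquet")
assert schema.field("location").type == pa.string()      # not large_string
assert schema.field("target_date").type == pa.date32()
```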
Reading a single time series file before the change (partitioned reads didn't work, which is the reason for the PR):
Reading partitioned time series files after the change: