Bs/fix target date schema/265 #267
Conversation
Prior to this change, target data files were created using the Polars `write_parquet` method. However, that operation results in column datatypes that are interpreted by Arrow as `large_string`. Because we're including our partition fields in the parquet file, the `large_string` results in a type mismatch when R packages use `arrow::open_dataset` to read the files (the value in the partition key is read as a `string`, not a `large_string`).
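As context, here's a minimal sketch of the kind of write-path change this implies; the data frame, schema, and file name below are illustrative, not taken from this PR:

```python
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

df = pl.DataFrame({"nowcast_date": ["2024-06-01"], "value": [1.0]})

# df.write_parquet(...) would produce Utf8 columns that Arrow reads back
# as large_string; casting to an explicit PyArrow schema avoids that.
schema = pa.schema([("nowcast_date", pa.string()), ("value", pa.float64())])
pq.write_table(df.to_arrow().cast(schema), "timeseries.parquet")
```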
Looks good, though I had one question about using `string` as the data type for some columns that are dates. Requesting a change to either update the data types to dates, or add a comment where the schema is defined explaining why we can't use dates.
src/get_target_data.py (outdated excerpt):

```python
("nowcast_date", pa.string()),
("sequence_as_of", pa.string()),
("tree_as_of", pa.string()),
```
Should these be a `pa.date32()` data type?
This may be addressed with an answer of "no" in your comment on the unit tests below, because it's being used as a partition key? If so, maybe worth adding a comment here explaining the reason for the string data type for these date columns.
You're right that the dates used as partition keys should be strings to prevent a data type mismatch when they're read by Arrow's `open_dataset()` function. Additionally, `nowcast_date` is a string in `tasks.json` (I think? The values are in quotes), and the target data format RFC states that column types should match their config counterparts. Which makes me wonder if `target_date` should also be a string?

So we're left with `sequence_as_of` and `tree_as_of`. Those have been strings in the target data PRs we've been merging all along, though it wasn't explicit. These two columns aren't part of the hub config, so we can set their types to whatever is convenient. Would it be helpful for downstream analysis if they were `pa.date32`?

[edited after I remembered that `sequence_as_of` is also used as a partitioning key (in the time series)]
it seems so unfortunate that partitioning on a particular column impacts what data types you're allowed to use for that column!
I'm now inclined to just say all of these date columns should be stored as strings? It seems better to be consistent about this than to have different data types for different columns, right?
Here's my current thinking (see the sketch after this list):

- Drop `sequence_as_of` as a partition key in the time series data. It's functionally dependent on `nowcast_date`, so as things now stand, it doesn't provide any partitioning value. If we eventually decide to add additional daily `sequence_as_of` data for each round (e.g., for charting), we can revisit; there's no need for premature optimization.
- Cast `target_date`, `sequence_as_of`, and `tree_as_of` as `date32`.
- Keep `nowcast_date` as a string. Regardless of whether or not we include this partition key in the parquet file, Arrow will read it in as a string (presumably the Hubverse tools that read target data will have to cast it).
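A hypothetical PyArrow rendering of that proposal (field names come from the discussion above; the order is illustrative):

```python
import pyarrow as pa

# Sketch of the proposed column types at this point in the discussion.
proposed_schema = pa.schema([
    ("nowcast_date", pa.string()),   # partition key: Arrow reads it as string
    ("target_date", pa.date32()),
    ("sequence_as_of", pa.date32()),
    ("tree_as_of", pa.date32()),
])
```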
Additional detail

> it seems so unfortunate that partitioning on a particular column impacts what data types you're allowed to use for that column!
To be fair, that limitation exists for us because we want to keep the partition columns in the physical parquet file.
When not provided with an explicit schema, Arrow will apply a string data type to the partitioning column. If that column also exists in the parquet file itself, the data type must match.
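To illustrate with PyArrow rather than R arrow (a hypothetical sketch; the path is made up):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# With partitioning="hive" and no explicit schema, date-like partition
# values such as nowcast_date=2024-06-01 are inferred as string.
inferred = ds.dataset("target-data/", format="parquet", partitioning="hive")

# Declaring the partition schema explicitly overrides that inference,
# but then the type must match the column stored in the file itself.
part = ds.partitioning(pa.schema([("nowcast_date", pa.date32())]), flavor="hive")
explicit = ds.dataset("target-data/", format="parquet", partitioning=part)
```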
By the way, I was wrong about `nowcast_date` and `target_date` being strings based on how they're expressed in `tasks.json`. When reading the model output files, `hubData` sets the schema as follows (presumably it infers it from the data):
```
── Connection schema
hub_connection with 48 Parquet files
8 columns
nowcast_date: date32[day]
target_date: date32[day]
location: string
clade: string
output_type: string
output_type_id: string
value: double
model_id: string
```
The mismatch doesn't matter for this convo, because even if we didn't include the partition columns in our file, Arrow would cast them to string unless told otherwise. Just wanted to acknowledge I was wrong 😄
I confirmed that when directly reading in the model output files with Polars (skipping Arrow), the column data types are what you listed above. I think it would be really ideal if we could figure out how to get at least those 7 columns (other than `model_id`) to have the same data types in the target/oracle output data and the model output data.
> When not provided with an explicit schema, Arrow will apply a string data type to the partitioning column. If that column also exists in the parquet file itself, the data type must match.
I would expect that any arrow-based tools we eventually build for loading target data would apply a schema for the oracle output (this is trickier for the nowcast data). Does that resolve this problem, so that we could save the things with a date data type and expect anyone using arrow to load them in to specify a schema with date data types?
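For example, the reader-side fix might look like this PyArrow sketch (the schema fields and path are hypothetical; R's `arrow::open_dataset()` similarly accepts a `schema` argument):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Supplying the schema up front means the date columns (including the
# hive partition key) come back as date32 instead of inferred strings.
schema = pa.schema([("nowcast_date", pa.date32()), ("value", pa.float64())])
dataset = ds.dataset("target-data/oracle-output/", format="parquet",
                     schema=schema, partitioning="hive")
```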
> I would expect that any arrow-based tools we eventually build for loading target data would apply a schema for the oracle output (this is trickier for the nowcast data). Does that resolve this problem, so that we could save the things with a date data type and expect anyone using arrow to load them in to specify a schema with date data types?
Yes! If we're operating under the assumption that Hubverse tools will apply a schema when reading target data, then this script can type everything as you suggest (i.e., match the model-output schema).
Just pushed some changes for this.
Assuming that any Hubverse tools reading target-data files apply a schema via `arrow::open_dataset()`, we should be able to use the hub's model output schema when writing them. hubDataPy isn't online yet (presumably it will be able to determine a hub's model-output schema like its R counterpart), so this changeset manually creates a schema to use.
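A sketch of what that manually-created schema could look like, mirroring the `hubData` connection schema quoted above (the constant name here is made up; `model_id` is omitted since it isn't a target-data column):

```python
import pyarrow as pa

# Column types mirror the hubData connection schema for model output.
TARGET_DATA_SCHEMA = pa.schema([
    ("nowcast_date", pa.date32()),
    ("target_date", pa.date32()),
    ("location", pa.string()),
    ("clade", pa.string()),
    ("output_type", pa.string()),
    ("output_type_id", pa.string()),
    ("value", pa.float64()),
])
```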
lgtm! thanks for your patience with my severe waffling about how to best set all this up
Resolves #265
Background
We want Arrow to read the `nowcast_date` and `sequence_as_of` columns as a `string` data type rather than `large_string` (see the issue linked above for more details). This PR changes the method of writing both oracle and time series target data parquet files to ensure that all text columns will be read by Arrow as `string`.
.Testing
The tests in `get_target_data.py` now read the target data files with PyArrow to confirm the schema. Some manual poking around confirmed this in R.
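For a sense of the PyArrow check, a hypothetical schema assertion (the path and fields are illustrative):

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Read back a written file and confirm the column types.
schema = pq.read_schema("timeseries/nowcast_date=2024-06-01/part-0.parquet")
assert schema.field("location").type == pa.string()      # not large_string
assert schema.field("target_date").type == pa.date32()
```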
Reading a single time series file before the change (partitioned reads didn't work, which is the reason for the PR):
Reading partitioned time series files after the change: