
Bs/fix target date schema/265 #267

Merged 4 commits from bs/fix-target-date-schema/265 into main on Jan 14, 2025

Conversation

@bsweger (Collaborator) commented Jan 10, 2025

Resolves #265

Background

We want Arrow to read the nowcast_date and sequence_as_of columns as a string data type rather than large_string (see the issue linked above for more details).

This PR changes how both the oracle and time series target data parquet files are written, to ensure that all text columns will be read by Arrow as string.
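For illustration, a minimal sketch of the writing approach (the column subset, values, and file name here are invented; the actual implementation lives in src/get_target_data.py and may differ):

```python
import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

# Illustrative frame; the real data comes from the hub's target data scripts.
df = pl.DataFrame(
    {
        "location": ["US", "US"],
        "clade": ["24A", "24B"],
        "nowcast_date": ["2025-01-08", "2025-01-08"],
    }
)

# An explicit Arrow schema pins the text columns to string (not large_string).
schema = pa.schema(
    [
        ("location", pa.string()),
        ("clade", pa.string()),
        ("nowcast_date", pa.string()),
    ]
)

# Convert to Arrow, cast to the explicit schema, and write with pyarrow
# rather than polars' write_parquet.
pq.write_table(df.to_arrow().cast(schema), "time-series.parquet")
```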

Testing

The tests in get_target_data.py now read the target data files with PyArrow to confirm the schema.
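A hedged sketch of what such a schema check can look like (write_time_series is a hypothetical stand-in for the code under test; the real tests may be structured differently):

```python
import pyarrow as pa
import pyarrow.parquet as pq

def test_text_columns_read_as_string(tmp_path):
    # write_time_series is a hypothetical helper that writes a target data file
    path = write_time_series(tmp_path)
    schema = pq.read_schema(path)
    for column in ("location", "clade", "nowcast_date",
                   "sequence_as_of", "tree_as_of"):
        assert schema.field(column).type == pa.string()
```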

Some manual poking around confirmed this in R.

Reading a single time series file before the change (partitioned reads didn't work, which is the reason for this PR):

FileSystemDataset with 1 Parquet file
7 columns
target_date: date32[day]
location: large_string
clade: large_string
observation: uint32
nowcast_date: large_string
sequence_as_of: large_string
tree_as_of: large_string

Reading partitioned time series files after the change:

> ds <- arrow::open_dataset("/Users/rsweger/code/variant-nowcast-hub/target-data/time-series/", format="parquet")
> ds
FileSystemDataset with 2 Parquet files
7 columns
target_date: date32[day]
location: string
clade: string
observation: int32
nowcast_date: string
sequence_as_of: string
tree_as_of: string

Prior to this change, target data files were created using the Polars
write_parquet method. However, that operation results in column datatypes
that are interpreted by Arrow as large_string. Because we're including our
partition fields in the parquet file, the large_string results in a type
mismatch when R packages use arrow::open_dataset to read the files (the value
in the partition key is read as a string, not a large_string).
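The behavior is easy to reproduce in a few lines (file name invented; this just illustrates the mismatch described above):

```python
import polars as pl
import pyarrow.parquet as pq

# Written via polars' write_parquet; Arrow readers then see the
# text column as large_string rather than string.
pl.DataFrame({"location": ["US"]}).write_parquet("demo.parquet")
print(pq.read_schema("demo.parquet"))  # location: large_string
```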
trobacker previously approved these changes Jan 11, 2025
@elray1 (Collaborator) left a comment:

Looks good, though I had one question about using string as the data type for some columns that are dates. Requesting a change to either update the data types to dates, or add a comment where the schema is defined explaining why we can't use dates.

Comment on lines 392 to 394
("nowcast_date", pa.string()),
("sequence_as_of", pa.string()),
("tree_as_of", pa.string()),
Collaborator:

should these be a pa.date32() data type?

Collaborator:

This may be addressed with an answer of "no" in your comment on the unit tests below, because it's being used as a partition key? If so, maybe worth adding a comment here explaining the reason for string data type for these date columns.

@bsweger (Collaborator, Author) commented Jan 13, 2025:

You're right that the dates used as partition keys should be strings, to prevent a data type mismatch when they're read by Arrow's open_dataset() function. Additionally, nowcast_date is a string in tasks.json (I think? The values are in quotes), and the target data format RFC states that column types should match their config counterparts.

Which makes me wonder whether target_date should also be a string?

So we're left with sequence_as_of and tree_as_of. Those have been strings in the target data PRs we've been merging all along, though it wasn't explicit. These two columns aren't part of the hub config, so we can set their types to whatever is convenient. Would it be helpful for downstream analysis if they were pa.date32?

[edited after I remembered that sequence_as_of is also used as a partitioning key (in the time series)]

Collaborator:

it seems so unfortunate that partitioning on a particular column impacts what data types you're allowed to use for that column!

Collaborator:

i'm now inclined to just say all of these date columns should be stored as a string?? It seems better to be consistent about this than to have different data types for different columns, right?

@bsweger (Collaborator, Author):

Here's my current thinking (sketched below):

  • Drop sequence_as_of as a partition key in the time series data. It's functionally dependent on nowcast_date, so as things now stand, it doesn't provide any partitioning value. If we eventually decide to add additional daily sequence_as_of data for each round (e.g., for charting), there's no need for premature optimization now.
  • Cast target_date, sequence_as_of, and tree_as_of as date32.
  • Keep nowcast_date as a string. Regardless of whether or not we include this partition key in the parquet file, Arrow will read it in as a string (presumably the Hubverse tools that read target data will have to cast it).
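Sketched as a PyArrow schema, that proposal would look roughly like this (column set taken from the R output above; the actual definition in get_target_data.py may differ):

```python
import pyarrow as pa

# Proposed time series schema: date columns as date32, except the
# nowcast_date partition key, which stays a string.
time_series_schema = pa.schema(
    [
        ("target_date", pa.date32()),
        ("location", pa.string()),
        ("clade", pa.string()),
        ("observation", pa.int32()),
        ("nowcast_date", pa.string()),  # partition key -> string
        ("sequence_as_of", pa.date32()),
        ("tree_as_of", pa.date32()),
    ]
)
```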

Additional detail

> it seems so unfortunate that partitioning on a particular column impacts what data types you're allowed to use for that column!

To be fair, that limitation exists for us because we want to keep the partition columns in the physical parquet file.
When not provided with an explicit schema, Arrow will apply a string data type to the partitioning column. If that column also exists in the parquet file itself, the data type must match.
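The same behavior, sketched in Python with pyarrow.dataset (paths illustrative; the time series is also partitioned by sequence_as_of, omitted here for brevity):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# With no explicit partitioning schema, hive-style keys such as
# nowcast_date=2025-01-08 have their types inferred, and date-like
# values come back as string.
inferred = ds.dataset("target-data/time-series/", format="parquet",
                      partitioning="hive")

# An explicit partitioning schema pins the key's type; whatever we
# choose must match the column's type inside the parquet files.
hive = ds.partitioning(pa.schema([("nowcast_date", pa.string())]),
                       flavor="hive")
pinned = ds.dataset("target-data/time-series/", format="parquet",
                    partitioning=hive)
```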

By the way, I was wrong about nowcast_date and target_date being strings based on how they're expressed in tasks.json. When reading the model output files, hubData sets the schema as follows (presumably inferring it from the data):

── Connection schema 
hub_connection with 48 Parquet files
8 columns
nowcast_date: date32[day]
target_date: date32[day]
location: string
clade: string
output_type: string
output_type_id: string
value: double
model_id: string

The mismatch doesn't matter for this convo, because even if we didn't include the partition columns in our file, Arrow would cast them to string unless told otherwise. Just wanted to acknowledge I was wrong 😄

@elray1 (Collaborator) commented Jan 14, 2025:

I confirmed that when directly reading the model output files with Polars (skipping Arrow), the column data types are what you listed above. I think it would be ideal if we could figure out how to get at least those 7 columns (other than model_id) to have the same data types in the target/oracle output data and the model output data.

> When not provided with an explicit schema, Arrow will apply a string data type to the partitioning column. If that column also exists in the parquet file itself, the data type must match.

I would expect that any arrow-based tools we eventually build for loading target data would apply a schema for the oracle output (this is trickier for the nowcast data). Does that resolve this problem, so that we could save the things with a date data type and expect anyone using arrow to load them in to specify a schema with date data types?
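For example, a loader could pass an explicit schema when opening the dataset (a sketch only; the path and the exact oracle-output column list are assumptions based on the hubData connection schema above, minus model_id):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Hypothetical loader-side schema mirroring the model-output data types.
target_schema = pa.schema(
    [
        ("nowcast_date", pa.date32()),
        ("target_date", pa.date32()),
        ("location", pa.string()),
        ("clade", pa.string()),
        ("output_type", pa.string()),
        ("output_type_id", pa.string()),
        ("value", pa.float64()),
    ]
)

# Passing a schema overrides Arrow's inference, so date columns stay date32.
oracle = ds.dataset("target-data/oracle-output/", format="parquet",
                    schema=target_schema)
```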

@bsweger (Collaborator, Author):

> I would expect that any arrow-based tools we eventually build for loading target data would apply a schema for the oracle output (this is trickier for the nowcast data). Does that resolve this problem, so that we could save the things with a date data type and expect anyone using arrow to load them in to specify a schema with date data types?

Yes! If we're operating under the assumption that Hubverse tools will apply a schema when reading target data, then this script can type everything as you suggest (i.e., match the model-output schema).

Just pushed some changes for this.

src/get_target_data.py
Assuming that Hubverse tools will use arrow::open_dataset() when reading
target-data files, we should be able to use the hub's model output schema when
writing them. hubDataPy isn't online yet (presumably it will be able to
determine a hub's model-output schema like its R counterpart does), so this
changeset manually creates a schema to use.
@elray1 (Collaborator) left a comment:

lgtm! thanks for your patience with my severe waffling about how to best set all this up

@bsweger merged commit a13571e into main on Jan 14, 2025 (1 check passed).
@bsweger deleted the bs/fix-target-date-schema/265 branch on January 14, 2025.
Successfully merging this pull request may close these issues.

Resolve string vs large_string type error when opening target data files in R