A snippet from the GISAID ingest log:
+ ./bin/check-locations data/gisaid/metadata.tsv data/gisaid/location_hierarchy.tsv gisaid_epi_isl
sys:1: DtypeWarning: Columns (9,28,37,39,43) have mixed types.Specify dtype option on import or set low_memory=False
This is probably caused by unexpected types in the data, in either metadata.tsv or location_hierarchy.tsv. For example, we might be assuming a column contains numbers, when in reality it contains mostly numbers plus a few stray strings. But it might also be something more subtle.
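One way to check the "mostly numbers plus a few stray strings" hypothesis is to count what kinds of values each flagged column holds. The sketch below uses only the standard library rather than pandas, so its `float()` test only approximates pandas' own type inference. The names `classify` and `profile_columns` and the sample columns are hypothetical, and the indices from the warning (9, 28, 37, 39, 43) are assumed to be zero-based column positions.

```python
import csv
import io
from collections import Counter


def classify(value):
    """Roughly classify a raw TSV cell as empty, number, or string."""
    if value == "":
        return "empty"
    try:
        float(value)  # approximation: pandas' inference rules differ in detail
        return "number"
    except ValueError:
        return "string"


def profile_columns(handle, columns):
    """Count value kinds for the given zero-based column positions."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)
    counts = {i: Counter() for i in columns}
    for row in reader:
        for i in columns:
            if i < len(row):  # tolerate short rows
                counts[i][classify(row[i])] += 1
    return {header[i]: dict(counts[i]) for i in columns}


# Tiny made-up sample; for the real file, open data/gisaid/metadata.tsv
# and pass the column positions listed in the warning.
sample = "strain\tage\tlength\nA\t34\t29903\nB\tunknown\t29811\nC\t\t29850\n"
result = profile_columns(io.StringIO(sample), [1, 2])
print(result)
```

A column that reports both "number" and "string" counts is a candidate for the mixed-type warning, and the minority kind is usually the one worth inspecting.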
How to investigate:
This can be investigated in isolation from the pipeline, by running the ./bin/check-locations script.
download data/gisaid/metadata.tsv from S3
review input data files for obvious defects
review ./bin/check-locations for obvious defects
try a binary search on the input data files to find the rows that trigger the issue: delete half of the file; if the problem goes away, then the problem is in the deleted half; if not, delete half of what remains. Repeat until the minimal set of rows that reproduces the issue is found. Inspect these rows.
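The binary search step can be sketched roughly as follows. This is a minimal illustration, not a tested tool: `bisect_rows` and `warning_reproduces` are hypothetical names, and the predicate assumes the warning text shows up in the script's output. One caveat: pandas infers types chunk by chunk when low_memory is enabled, so the warning may stop appearing on a small file even if the offending rows are still present; in that case a predicate based on the row contents may be more reliable.

```python
import os
import subprocess
import tempfile


def bisect_rows(rows, reproduces):
    """Halve `rows` repeatedly, keeping a half that still reproduces the issue."""
    while len(rows) > 1:
        mid = len(rows) // 2
        first, second = rows[:mid], rows[mid:]
        if reproduces(first):
            rows = first
        elif reproduces(second):
            rows = second
        else:
            break  # the issue needs rows from both halves; stop here
    return rows


def warning_reproduces(header, rows):
    """Hypothetical predicate: rerun check-locations on a reduced metadata file."""
    with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
        tmp.write(header)
        tmp.writelines(rows)
    try:
        result = subprocess.run(
            ["./bin/check-locations", tmp.name,
             "data/gisaid/location_hierarchy.tsv", "gisaid_epi_isl"],
            capture_output=True, text=True,
        )
    finally:
        os.unlink(tmp.name)
    return "DtypeWarning" in result.stderr + result.stdout


# Toy demonstration with an in-memory predicate instead of the real script.
def has_non_numeric(rows):
    return any(not r.strip().isdigit() for r in rows)


culprit = bisect_rows(["1\n", "2\n", "x\n", "4\n", "5\n"], has_non_numeric)
print(culprit)
```

For the real file, the predicate would be something like `lambda subset: warning_reproduces(header, subset)`, with `header` being the first line of metadata.tsv and `rows` the remaining lines.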
There are a few places throughout the ingest scripts where these warnings were silenced by setting low_memory=False, as proposed in the warning message. But this might not be what we need: it might just hide programming mistakes and generate bogus outputs. We should search for occurrences of low_memory in the codebase to see what's going on there.
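A plain text search (for example with grep) is enough to find these occurrences. For completeness, here is a small Python sketch of the same idea. The function name `find_occurrences` is hypothetical, and the choice to scan `.py` files plus extensionless files (such as the scripts under ./bin) is an assumption about the repository layout.

```python
import tempfile
from pathlib import Path


def find_occurrences(root, needle="low_memory", suffixes=(".py", "")):
    """Return (path, line number, line) for every line containing `needle`."""
    hits = []
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in suffixes:
            continue
        if ".git" in path.relative_to(root).parts:
            continue  # skip version-control internals
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), number, line.strip()))
    return hits


# Demonstration on a throwaway directory with one made-up script.
with tempfile.TemporaryDirectory() as root:
    script = Path(root) / "bin" / "example-script"
    script.parent.mkdir()
    script.write_text(
        "import pandas as pd\n"
        "df = pd.read_csv(path, sep='\\t', low_memory=False)\n"
    )
    hits = find_occurrences(root)

for path, number, line in hits:
    print(f"{path}:{number}: {line}")
```

Each hit is a place to check whether the column types are actually known and could be declared explicitly, instead of relying on the warning being silenced.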