SNOW-1025489: inconsistent timestamp downscaling #1868

jwyang-qraft · 2024-01-31T05:43:46Z

Python version

Python 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0]

Operating system and processor architecture

Linux-5.4.0-165-generic-x86_64-with-glibc2.31

Installed packages

numba==0.58.1
numpy @ file:///work/mkl/numpy_and_numpy_base_1682953417311/work
pandas==2.1.4
python-dateutil @ file:///tmp/build/80754af9/python-dateutil_1626374649649/work
pytz==2022.7.1
requests==2.31.0
snowballstemmer @ file:///tmp/build/80754af9/snowballstemmer_1637937080595/work
snowflake-connector-python==3.7.0
snowflake-sqlalchemy==1.5.1
SQLAlchemy==1.4.50
tqdm==4.66.1

What did you do?

TIMESTAMP_NTZ(9) column with values that overflow int64 with ns precision (eg. '9999-12-31 00:00:00.000')

What did you expect to see?

I tried to fetch a column with TIMESTAMP_NTZ(9) dtype and the max datetime is '9999-12-31 00:00:00.000' and minimum is '1987-01-30 23:59:59.000'.

I get following error when I select from that column.

  File "/home/jwyang/anaconda3/lib/python3.11/site-packages/snowflake/connector/result_batch.py", line 79, in _create_nanoarrow_iterator
    else PyArrowTableIterator(
         ^^^^^^^^^^^^^^^^^^^^^
  File "src/snowflake/connector/nanoarrow_cpp/ArrowIterator/nanoarrow_arrow_iterator.pyx", line 239, in snowflake.connector.nanoarrow_arrow_iterator.PyArrowTableIterator.__cinit__
  File "pyarrow/table.pxi", line 4116, in pyarrow.lib.Table.from_batches
  File "pyarrow/error.pxi", line 154, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
  
  pyarrow.lib.ArrowInvalid: Schema at index 2 was different:
  DT: timestamp[us]
  vs
  DT: timestamp[ns]

Because '9999-12-31 00:00:00.000' doesn't fit in int64 with ns precision, it seems like it is downcast to us precision on a batch basis in

snowflake-connector-python/src/snowflake/connector/nanoarrow_cpp/ArrowIterator/CArrowTableIterator.cpp

Line 562 in 6a2a5b6

if (epoch > (INT64_MAX / powTenSB4) ||

I am guessing downcasting is not applied to all batches and it results in different data types between batches which pyarrow does not allow.

Can you set logging to DEBUG and collect the logs?

import logging
import os

for logger_name in ('snowflake.connector',):
    logger = logging.getLogger(logger_name)
    logger.setLevel(logging.DEBUG)
    ch = logging.StreamHandler()
    ch.setLevel(logging.DEBUG)
    ch.setFormatter(logging.Formatter('%(asctime)s - %(threadName)s %(filename)s:%(lineno)d - %(funcName)s() - %(levelname)s - %(message)s'))
    logger.addHandler(ch)

The text was updated successfully, but these errors were encountered:

sfc-gh-aling · 2024-02-29T19:31:59Z

thanks @jwyang-qraft for reaching out! we will look into the issue

sfc-gh-mkeller · 2024-12-10T20:12:45Z

As others have pointed this out our code sees datetime.datetime(9999, 12, 31, 23, 59, 59) and realizes that this will not fit into ns precision and automatically determines that we can safely cast this object down to us precision and then arrow refuses to mix the us and ns precisions into the same column.

The real problem is that the data you are requesting from Snowflake cannot be represented in Arrow. 9999-12-31 23:59:59.000000000 is technically supported by our server-side (albeit extreme timestamps are not suggested to be used).
So this timestamp is impossible to represent in Arrow with the same precision.

I got around the issue by explicitly casting the column down to us precision like: SELECT id, ts::TIMESTAMP_NTZ(6) FROM table instead of SELECT * FROM table.

Now I agree that us automatically trying to fit the data into a lower precision only leads to issues in the long run, as it boxes us into 2 options:

When we see a single cell that doesn't fit into nanosecond precision we go back and update every row in the current result batch to micro second precision.
Not doing this smart down casting even if the data could safely be downcast and simply throwing this exception every time we detect data that cannot be represented.

I much prefer option number 2, as it makes the need for precision loss explicit and allows the users to evaluate if this is okay, or if some computation needs to be moved into Snowflake before the data is pulled out.

However; it's important to note that both of these options are technically backwards incompatible, so a major bump will be necessary either way.

sfc-gh-areddy · 2025-01-09T20:44:23Z

What is the maximum supported range of the timestamp value ? Is there an acceptable range?

sfc-gh-mkeller · 2025-01-13T19:33:44Z

@sfc-gh-areddy please see https://arrow.apache.org/docs/python/timestamps.html for explanation.

Arrow timestamps are stored as a 64-bit integer with column metadata to associate a time unit (e.g. milliseconds, microseconds, or nanoseconds), and an optional time zone.

So for nanosecond precision you can see the constants defined in pandas min and max.

64 bit 2s complement min and max numbers are:

>>> -(2**(64-1))                                                                                                                                                                                                                                                                                                                                                                           -9223372036854775808
>>> (2**(64-1)-1)
9223372036854775807

Plugging these numbers into datetime object (will approximately, because there's not nanoseconds support for datetime objects):

>>> datetime.datetime(year=1970, month=1, day=1) + datetime.timedelta(microseconds=-9223372036854775)
datetime.datetime(1677, 9, 21, 0, 12, 43, 145225)
>>> datetime.datetime(year=1970, month=1, day=1) + datetime.timedelta(microseconds=9223372036854775)
datetime.datetime(2262, 4, 11, 23, 47, 16, 854775)

Then similarly you can do the same calculation for microseconds, but I get an overflow.

>>> datetime.datetime(year=1970, month=1, day=1) + datetime.timedelta(microseconds=9223372036854775807)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
OverflowError: date value out of range

dennis-wey · 2025-01-17T08:12:39Z

@sfc-gh-mkeller thanks for the detailed explanation of the bug. We experienced the bug ourself and I think none of your proposed solutions is a good fit for us.
In our case we do a lot of select * queries, in which we analyze & transform the data outside of snowflake due to performance reasons (not to say that snowflake isn't performant but in our use case it's rather small data so we can process them in-memory).
Right now this would mean:

Getting the schema & look for timestamps
Querying their maximum value
Adjust timestamp resolution

With your Box solution number 2 this would mean:

Querying the data
Getting the Error message
Adjusting the sql statement
Getting the error message again for a different column . . .

Both solutions seem quite inconvenient and I think there are many use cases in which people don't care so much about timestamp resolution.

So I would suggest two different kind of solution:

Do number 1 of your suggested solution but hide behind an optional flag in the cursor to not introduce breaking changes
Introduce an option to overwrite timestamp precision when fetching the data to apply this to all fetched timestamps

sfc-gh-mkeller · 2025-01-21T18:44:27Z

I like your option 2 Dennis, it provides true convenience to the users without confusion.

I just want to catch anyone new to this issue, we will stop doing this magic of downscaling timestamps like 9999-12-31 00:00:00.000_000_000 from nanoseconds to microseconds (remove last 3 digits) because you don't necessarily lose any meaningful precision.
We have been doing this since Arrow result format was introduced, but it leads to issues where if there's only 1 row that needs downscaling Arrow format will not allow us to mix precisions in a single column.
So we will discontinue doing this and will just throw an error where you can either adjust your SQL to request less precision from the back-end, or we will introduce a client-level setting to overwrite all timestamps to a specific scale.

jwyang-qraft added bug needs triage labels Jan 31, 2024

github-actions bot changed the title ~~inconsistent timestamp downscaling~~ SNOW-1025489: inconsistent timestamp downscaling Jan 31, 2024

sfc-gh-aling added triaged and removed needs triage labels Feb 29, 2024

sfc-gh-dszmolka assigned sfc-gh-aling Mar 11, 2024

sfc-gh-dszmolka added the status-triage_done Initial triage done, will be further handled by the driver team label Dec 16, 2024

sfc-gh-aling assigned sfc-gh-sostrowski and unassigned sfc-gh-aling Jan 18, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

SNOW-1025489: inconsistent timestamp downscaling #1868

SNOW-1025489: inconsistent timestamp downscaling #1868

jwyang-qraft commented Jan 31, 2024 •

edited

Loading

sfc-gh-aling commented Feb 29, 2024

sfc-gh-mkeller commented Dec 10, 2024

sfc-gh-areddy commented Jan 9, 2025

sfc-gh-mkeller commented Jan 13, 2025

dennis-wey commented Jan 17, 2025

sfc-gh-mkeller commented Jan 21, 2025

SNOW-1025489: inconsistent timestamp downscaling #1868

SNOW-1025489: inconsistent timestamp downscaling #1868

Comments

jwyang-qraft commented Jan 31, 2024 • edited Loading

Python version

Operating system and processor architecture

Installed packages

What did you do?

What did you expect to see?

Can you set logging to DEBUG and collect the logs?

sfc-gh-aling commented Feb 29, 2024

sfc-gh-mkeller commented Dec 10, 2024

sfc-gh-areddy commented Jan 9, 2025

sfc-gh-mkeller commented Jan 13, 2025

dennis-wey commented Jan 17, 2025

sfc-gh-mkeller commented Jan 21, 2025

jwyang-qraft commented Jan 31, 2024 •

edited

Loading