Arrow: Avoid buffer-overflow by avoid doing a sort #1539

Fokko · 2025-01-20T11:58:56Z

This was already being discussed back here: #208 (comment)

This PR replaces the sort followed by a single pass over the table with an approach where we determine the unique partition tuples and then filter on them one by one.

Fixes #1491

The sort caused buffers to be joined to the point where they would overflow in Arrow. I think this is an issue on the Arrow side, and it should automatically break up into smaller buffers. The `combine_chunks` method does this correctly.

Now:

Run 0 took: 0.42877754200890195
Run 1 took: 0.2507691659993725
Run 2 took: 0.24833179199777078
Run 3 took: 0.24401691700040828
Run 4 took: 0.2419595829996979
Average runtime of 0.28 seconds

Before:

Run 0 took: 1.0768639159941813
Run 1 took: 0.8784021250030492
Run 2 took: 0.8486490420036716
Run 3 took: 0.8614017910003895
Run 4 took: 0.8497851670108503
Average runtime of 0.9 seconds

So it comes with a nice speedup as well :)
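The output above has the shape of a simple repeat-and-average loop. A minimal sketch of such a harness, assuming the workload is wrapped in a zero-argument callable; the name `time_runs` and the stand-in workload are hypothetical and not taken from this PR:

```python
import time
from typing import Callable


def time_runs(workload: Callable[[], object], runs: int = 5) -> float:
    """Hypothetical harness: time `workload` `runs` times and return the mean in seconds."""
    durations = []
    for i in range(runs):
        start = time.perf_counter()
        workload()
        elapsed = time.perf_counter() - start
        durations.append(elapsed)
        print(f"Run {i} took: {elapsed}")
    average = sum(durations) / len(durations)
    print(f"Average runtime of {round(average, 2)} seconds")
    return average


# Stand-in workload; in practice this would be the partitioned write being measured.
average_seconds = time_runs(lambda: sum(range(100_000)), runs=3)
```

Note that the first run is noticeably slower in both sets of timings, so the average includes whatever warm-up cost that run carries.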

Fokko force-pushed the fd-fix-overflowing-buffer branch from 0043889 to e548117 · January 20, 2025 13:18

Fokko force-pushed the fd-fix-overflowing-buffer branch from e548117 to 4658c3c · January 20, 2025 13:25

Fokko mentioned this pull request Jan 20, 2025

[Bug] Error in overwrite(): pyarrow.lib.ArrowInvalid: offset overflow with large dataset (~3M rows) #1491

Open

3 tasks
