
use arrow::read_parquet instead of nanoparquet #462

Open
BenoitLondon opened this issue Jan 16, 2025 · 3 comments


BenoitLondon commented Jan 16, 2025

In my benchmarks I've found nanoparquet to be much less efficient than arrow in terms of speed and RAM usage:

        expression median mem_alloc   name   size
            <char>  <num>     <num> <char> <char>
 1:     df_parquet  1.153     5.578  write  small
 2: df_nanoparquet  0.674   183.986  write  small
 3:     dt_parquet  5.172     0.018  write  small
 4: dt_nanoparquet  0.656   183.876  write  small
 5:     df_parquet 10.878     0.015  write    big
 6: df_nanoparquet 10.182  2068.884  write    big
 7:     dt_parquet 11.461     0.015  write    big
 8: dt_nanoparquet 10.038  2068.947  write    big
 9:     df_parquet  0.088    34.901   read  small
10: df_nanoparquet  0.414   183.187   read  small
11:     df_parquet  1.187     0.009   read    big
12: df_nanoparquet  5.180  1324.072   read    big

Speed and RAM usage when reading big files are not very good.
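
For illustration, a minimal sketch of how such a comparison can be set up, assuming the bench package (whose mark() output matches the columns above); placeholder data and temp files, not my actual script:

    library(bench)

    # placeholder data; the real benchmark used separate "small"/"big" data sets
    df <- data.frame(x = rnorm(1e6), y = sample(letters, 1e6, replace = TRUE))
    f_arrow <- tempfile(fileext = ".parquet")
    f_nano  <- tempfile(fileext = ".parquet")

    # write comparison; check = FALSE because the two writers return different values
    bench::mark(
      arrow       = arrow::write_parquet(df, f_arrow),
      nanoparquet = nanoparquet::write_parquet(df, f_nano),
      iterations  = 3,
      check       = FALSE
    )

    # read comparison; check = FALSE because arrow returns a tibble,
    # nanoparquet a plain data frame
    bench::mark(
      arrow       = arrow::read_parquet(f_arrow),
      nanoparquet = nanoparquet::read_parquet(f_nano),
      iterations  = 3,
      check       = FALSE
    )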

On the nanoparquet repo they say:

Being single-threaded and not fully optimized, nanoparquet is probably not suited well for large data sets. It should be fine for a couple of gigabytes. Reading or writing a ~250MB file that has 32 million rows and 14 columns takes about 10-15 seconds on an M2 MacBook Pro. For larger files, use Apache Arrow or DuckDB.

rio already uses arrow for feather, so I'm not sure why we rely on nanoparquet for parquet.

If you keep nanoparquet as the default, maybe we could have an option to use arrow instead?


chainsawriot (Collaborator) commented Jan 16, 2025

@BenoitLondon Thank you for the benchmark.

As you asked the why question: to make the long story of #315 short, we wanted Parquet support by default. At first, rio 1.0.0 shipped with arrow, but that was quickly reverted due to installation concerns. Later, nanoparquet by @gaborcsardi was supposed to be installed by default, because it is dependency-free; that was, again, reverted due to insufficient support for big-endian platforms (r-lib/nanoparquet#21). That is how we ended up in the funny state of reading parquet with nanoparquet but feather with arrow. Going back to pre-1.1, arrow was used for both parquet and feather in the so-called Suggests tier.

We are somewhat reluctant to introduce options for choosing which package to use; we are still cleaning such options up from the pre-1.0 era. I don't mind switching back to arrow altogether. At the same time, I also believe that @gaborcsardi is actively developing nanoparquet to make it more efficient.

@gaborcsardi

Can you share the code for the benchmark?

Some notes:

  • The dev version of nanoparquet has a completely rewritten read_parquet(), which is much faster. (See below)
  • I suspect that you can't really compare mem_alloc, because it only includes memory allocated within R, and arrow probably allocates most of its memory in C/C++ (see the sketch after this list).
  • I am not totally sure how to interpret the results. E.g. does
            expression median mem_alloc   name   size
     3:     dt_parquet  5.172     0.018  write  small
     4: dt_nanoparquet  0.656   183.876  write  small
    
    mean that nanoparquet is 8 times faster here? Or 8 times slower?
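
One hedged way to see the mem_alloc caveat, assuming arrow's MemoryPool API: bench::mark() tracks R-level allocations only, while arrow's C++ allocations can be watched via its memory pool (the path below is a placeholder):

    pool   <- arrow::default_memory_pool()
    before <- pool$bytes_allocated
    tbl    <- arrow::read_parquet("big.parquet")   # placeholder path
    after  <- pool$bytes_allocated
    (after - before) / 1024^2   # MiB held in arrow's C++ pool, invisible to mem_alloc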

Not really a good benchmark, but I just ran arrow and nanoparquet on the mentioned 33 million row data set (10x flights from nycflights13), and nanoparquet is about 2 times faster when writing, and about the same when reading. (This is with options(arrow.use_altrep = FALSE), so that arrow actually reads the data.)
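
A rough sketch of that setup, assuming nycflights13; the replication factor is illustrative (adjust to reach the target row count), not the exact script used:

    options(arrow.use_altrep = FALSE)   # make arrow materialize the data on read

    # stack copies of flights to build a large data set
    big <- do.call(rbind, replicate(10, as.data.frame(nycflights13::flights),
                                    simplify = FALSE))
    f <- tempfile(fileext = ".parquet")

    system.time(nanoparquet::write_parquet(big, f))
    system.time(arrow::write_parquet(big, f))
    system.time(nanoparquet::read_parquet(f))
    system.time(arrow::read_parquet(f))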

It would be great to have a proper benchmark, but nevertheless I'll update the note in the nanoparquet README, because it is actually competitive in terms of speed. I suspect that it is also competitive in terms of memory, but we'd need a better way to measure that.


BenoitLondon commented Jan 17, 2025

Oh, thanks guys for the explanations, very much appreciated!
I guess my benchmarks were not very well designed. I suspected there was some ALTREP magic behind those numbers, and the RAM figures didn't look right either. I will compute some summary after reading to make sure the data is actually loaded into R.

median is the median time of 3 iterations, so yes, in the small data set case nanoparquet is 8 times faster than arrow.

I'm very happy to use nanoparquet if there's no downside. (My use case is basically writing/reading biggish files (1-5 GB) in R and also reading them in Python or Julia, so I wanted compatibility, speed, and low RAM usage if possible.)

Thanks again.
I will share my benchmark when it's fixed ;)
