Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Performance and array semantics #33

Open
simonbyrne opened this issue May 29, 2019 · 3 comments
Open

Performance and array semantics #33

simonbyrne opened this issue May 29, 2019 · 3 comments

Comments

@simonbyrne
Copy link

simonbyrne commented May 29, 2019

Calling functions which iterate over array elements (e.g. Statistics.mean) on netcdf datasets can be very slow. It would be useful to have some performance tips to e.g. first convert a dataset to an Array.

On that note, I noticed that simply calling Array(dataset) is also slow. I take it from the examples in the manual that the suggested way to convert is to call dataset[:]. However this has different behaviour from ordinary Julia multidimensional arrays, which return a vector:

julia> X = rand(3,3);

julia> X[:]
9-element Array{Float64,1}:
 0.1443316789104081
 0.8768272466421989
 0.01240381950170022
 0.6732249425627772
 0.4128628021781906
 0.2112718766221251
 0.9459472658879167
 0.8996044010964837
 0.47806340748050236
@Alexander-Barth
Copy link
Member

Thank you for your insightful comments!

Indeed, Statistics.mean (and friends) are not overloaded and thus they are fetching every element individually from the netCDF file which is slow. I added a section in the documentation about performance as suggested.

It never occurred to me, to load an array in RAM with the Array constructor, but it makes sense and I added a method for this which give a speed-up as expected. If myvar is a 100x100 Float64 Matrix, before we had:

julia> @btime Array(ds["myvar"]) 
  115.359 ms (170042 allocations: 8.93 MiB)

Now after my recent commit:

julia> @btime Array(ds["myvar"]);
  69.404 μs (49 allocations: 170.43 KiB)

It is true that ncvar[:] is not consistent with the behaviour of regular Julia arrays, but it was a price that I was willing to pay for a concise syntax. In fact, the first NetCDF library that I used was the netcdf matlab toolbox (now abandoned) which implemented exactly the same (slightly inconsistent) behaviour in matlab.

What alternatives could be implemented?

ds["myvar"][:,:]  # does not work so well if the dimensionality of myvar is not known a priori
Array(ds["myvar"]) # works now, but it feels a bit wordy to me
ds["myvar"][] # like references, but it looks a bit obscure to most people I guess

Alexander-Barth added a commit that referenced this issue May 29, 2019
@simonbyrne
Copy link
Author

Great, thanks!

I don't think there's anything wrong with Array(ds["myvar"]). If you want something that would work for scalars as well, it would probably work to overload copy?

@Alexander-Barth
Copy link
Member

I had a look at HDF5.jl (https://github.com/JuliaIO/HDF5.jl/blob/master/doc/hdf5.md#reading-and-writing-data) and they use read, for example:

variable = read(ds["myvar"])

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants