Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

anndata initializer does not verify indices match on layers #1769

Open
3 tasks done
amcpherson opened this issue Nov 16, 2024 · 6 comments
Open
3 tasks done

anndata initializer does not verify indices match on layers #1769

amcpherson opened this issue Nov 16, 2024 · 6 comments
Labels

Comments

@amcpherson
Copy link

Please make sure these conditions are met

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of anndata.
  • (optional) I have confirmed this bug exists on the master branch of anndata.

Report

The AnnData Initializer verifies index and columns of the input X match var and obs, but does not do the same for layers resulting in possible bugs especially when initializing with layers but no X.

Code:

import numpy as np
import pandas as pd
import anndata as ad

var = pd.DataFrame(index=["gene1", "gene2"])
different_indices = ["gene2", "gene1"]
layer_data = pd.DataFrame(np.array([[5, 6], [7, 8]]), columns=different_indices)
layers = {"layer1": layer_data}
adata = ad.AnnData(obs=obs, var=var, layers=layers)

Versions

-----
anndata             0.11.1.dev10+gaf6480e
session_info        1.0.0
-----
cython_runtime      NA
dateutil            2.9.0.post0
exceptiongroup      1.2.2
h5py                3.12.1
natsort             8.4.0
numpy               2.0.2
packaging           24.2
pandas              2.2.3
pytz                2024.2
scipy               1.14.1
six                 1.16.0
zoneinfo            NA
-----
Python 3.10.15 | packaged by conda-forge | (main, Oct 16 2024, 01:24:20) [Clang 17.0.6 ]
macOS-14.6.1-arm64-arm-64bit
-----
Session information updated at 2024-11-15 21:08
@amcpherson
Copy link
Author

Happy to try to fix and submit a PR given some advice on how to proceed

@ilan-gold
Copy link
Contributor

@amcpherson I think this should be ok, although I am not sure the correct intention for layers. The docs only say "Dictionary-like object with values of the same dimensions as X." At this point people might actually be relying on the behavior you describe since the docs only specify that the dimensions be the same. I would need a stronger counter-example or @flying-sheep should weigh in

@amcpherson
Copy link
Author

Layers can be accessed in the same way as X for instance adata[obs_indices, var_indices].layers['somelayer']. This implies that the layers arent just the same dimension but they are aligned. However this alignment isnt enforced on set or anndata creation.

@ilan-gold
Copy link
Contributor

@amcpherson Those global indices are internally converted to integer-like indices (np array, slice, int). So I don't think we would have a great way of special-casing this checking, but I will wait for @flying-sheep to weigh in. I would probably say this should be allowed to be honest but not sure.

@amcpherson
Copy link
Author

amcpherson commented Nov 27, 2024

@ilan-gold , yes, exactly. The global indices are converted to integer like indices, which is how it should be. The anndata encapsulates a set of axes aligned objects but the underlying representations are just raw numpy arrays. This is the same as for .X. However, if we are trusting anndata to manage the obs, var, X and layers to be axis aligned, then it is really helpful to have some checks to ensure we dont do something wrong. I use anndata extensively and this is the number 1 cause of bugs especially by junior developers who do not know all the caveats of anndatas. Another gotcha is this:

adata.obs = adata.obs.iloc[::-1]

This completely scrambles the relationship between obs and X/layers. It should raise an exception because the indices are not exactly equal including order, or align the indices of the new adata.obs. Ideally the former.

Looking forward to hearing from @flying-sheep.

@ilan-gold
Copy link
Contributor

In that case, the flipside is people who expect

obs_index_changed = adata.obs.set_index('foo', inplace=False)
adata.obs = obs_index_changed

to work. We could distinguish by looking at the set of index values and ensuring they match or something. But it may also be better to simply document that people need to be aware of this tradeoff we have made.

I agree that this behavior is confusing, though. I think this then requires a broader discussion because we have no way of knowing what the intentions of people are.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants