Add anndata factory #255

Open
wants to merge 12 commits into base: development

@mschwoer mschwoer commented Nov 22, 2024

Add first version of anndata conversion.

Contributor

@lucas-diedrich lucas-diedrich left a comment

Yeah 🥳 Fantastic!

  • I think how to handle duplicated protein groups is a good question. Is this expected to happen? If not, I would raise a warning or error, drop the duplicates, and use the "first" strategy while pivoting.
  • For some downstream analyses it might be good to include additional information from the PSM files (e.g. gene names). Would it be possible to add additional metadata to the metadata attributes (e.g. via a list of columns for the .obs and .var attributes)?

index=PsmDfCols.RAW_NAME,
columns=PsmDfCols.PROTEINS,
values=PsmDfCols.INTENSITY,
aggfunc=np.nanmean, # how to aggregate intensities for same protein in same raw file TODO first?
Contributor

@lucas-diedrich lucas-diedrich Nov 22, 2024

Are there scenarios in which the same protein occurs multiple times in a file? I tested the diann_test_input_mDIA.tsv with the DiannReader class and did not find any.

I think aggregating by the mean might be dangerous. One could add a test on whether there are duplicates, and at least raise a warning.

import warnings

duplicated_proteins = self._psm_df[PsmDfCols.PROTEINS].duplicated()
if duplicated_proteins.sum() > 0:
    warnings.warn(f"{duplicated_proteins.sum()} duplicated protein groups")
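
To make the concern concrete, here is a minimal, self-contained sketch on a toy table. The column names raw_name, proteins, and intensity are hypothetical stand-ins for the PsmDfCols constants, and counting duplicates per raw file (rather than over the whole protein column) is an assumption about what the check should do, not something taken from the PR.

```python
import warnings

import pandas as pd


def count_duplicates(psm_df: pd.DataFrame) -> int:
    """Count protein groups that occur more than once within one raw file."""
    # Hypothetical column names standing in for the PsmDfCols constants.
    duplicated = psm_df.duplicated(subset=["raw_name", "proteins"])
    n_duplicated = int(duplicated.sum())
    if n_duplicated > 0:
        warnings.warn(f"{n_duplicated} duplicated protein groups")
    return n_duplicated


def pivot_intensities(psm_df: pd.DataFrame, aggfunc: str) -> pd.DataFrame:
    """Pivot to a raw-file x protein-group matrix with the given aggregation."""
    return psm_df.pivot_table(
        index="raw_name", columns="proteins", values="intensity", aggfunc=aggfunc
    )


# Toy data: protein group P1 appears twice in run1.
toy_psm_df = pd.DataFrame(
    {
        "raw_name": ["run1", "run1", "run1", "run2"],
        "proteins": ["P1", "P1", "P2", "P1"],
        "intensity": [10.0, 30.0, 5.0, 7.0],
    }
)
```

On this toy table, pivot_intensities(toy_psm_df, "mean") silently merges the two run1/P1 entries into 20.0, while "first" keeps 10.0. Either way the duplicate disappears from the matrix without a trace unless a check like count_duplicates is run first.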

Alternatively, this could be an optional argument agg_duplicates: Literal["mean", "drop", "raise"], with "raise" raising a ValueError, "drop" dropping the duplicated entries, and "mean" aggregating them by their mean.
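
A rough sketch of how such an argument might behave, assuming pandas and the hypothetical column names raw_name, proteins, and intensity. The function name resolve_duplicates and the choice that "drop" keeps the first occurrence of each duplicate are illustrative assumptions, not part of the PR.

```python
from typing import Literal

import pandas as pd


def resolve_duplicates(
    psm_df: pd.DataFrame,
    agg_duplicates: Literal["mean", "drop", "raise"] = "raise",
) -> pd.DataFrame:
    """Pivot to a raw-file x protein-group matrix, handling duplicates as requested."""
    keys = ["raw_name", "proteins"]  # hypothetical column names
    # Marks every repeat of a (raw file, protein group) pair after its first occurrence.
    is_duplicated = psm_df.duplicated(subset=keys, keep="first")

    if agg_duplicates == "raise":
        if is_duplicated.any():
            raise ValueError(f"{int(is_duplicated.sum())} duplicated protein groups")
        aggfunc = "first"
    elif agg_duplicates == "drop":
        # Assumption: "drop" keeps the first occurrence and discards the repeats.
        psm_df = psm_df[~is_duplicated]
        aggfunc = "first"
    elif agg_duplicates == "mean":
        aggfunc = "mean"
    else:
        raise ValueError(f"Unknown agg_duplicates option: {agg_duplicates!r}")

    return psm_df.pivot_table(
        index="raw_name", columns="proteins", values="intensity", aggfunc=aggfunc
    )
```

Defaulting to "raise" in this sketch means a caller has to opt in to any lossy behavior explicitly.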

Contributor Author

@mschwoer mschwoer Nov 25, 2024

if missing_cols:
    raise ValueError(f"Missing required columns: {missing_cols}")

self._psm_df = psm_df
Contributor

Would it be possible to add optional metadata columns to the .obs and .var attributes by passing obs_columns: Optional[Union[str, List[str]]] and var_columns: Optional[Union[str, List[str]]] to the factory class?

This would add some complexity, as one would have to validate that the columns are in the data frame, but other than that one could just use .pivot_table while passing the list of columns.
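
A minimal sketch of that idea, assuming pandas and the hypothetical column names raw_name, proteins, and intensity. The helper names and the returned (matrix, obs, var) triple are illustrative only; the resulting pieces could then be handed to an AnnData constructor.

```python
from typing import List, Optional, Tuple, Union

import numpy as np
import pandas as pd

ColumnSpec = Optional[Union[str, List[str]]]


def _as_list(columns: ColumnSpec) -> List[str]:
    """Normalize None, a single column name, or a list of names to a list."""
    if columns is None:
        return []
    return [columns] if isinstance(columns, str) else list(columns)


def pivot_with_metadata(
    psm_df: pd.DataFrame,
    obs_columns: ColumnSpec = None,
    var_columns: ColumnSpec = None,
) -> Tuple[np.ndarray, pd.DataFrame, pd.DataFrame]:
    """Pivot intensities and carry metadata columns along as obs/var frames."""
    obs_list, var_list = _as_list(obs_columns), _as_list(var_columns)

    # Validate that the requested metadata columns are present.
    missing = [c for c in obs_list + var_list if c not in psm_df.columns]
    if missing:
        raise ValueError(f"Missing metadata columns: {missing}")

    # Passing the metadata columns as extra pivot keys keeps them on the axes.
    pivot = psm_df.pivot_table(
        index=["raw_name", *obs_list],
        columns=["proteins", *var_list],
        values="intensity",
        aggfunc="first",
    )
    obs = pivot.index.to_frame(index=False).set_index("raw_name")
    var = pivot.columns.to_frame(index=False).set_index("proteins")
    return pivot.to_numpy(), obs, var
```

One caveat of this approach: a metadata column that is not constant within a raw file (or within a protein group) would produce several rows (or columns) for it, so a uniqueness check on the resulting obs and var indices may be worth adding.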


@mschwoer mschwoer force-pushed the add_anndata_factory branch from 36f0ce4 to c19e9ce Compare November 25, 2024 14:01
@@ -1,3 +1,4 @@
anndata==0.11.1
Contributor

👍

Contributor Author

The problem is that, in order to use 0.11.1, we would need to drop support for Python 3.8 (the latest supported versions are 0.10.9, 0.11.0rc1, and 0.11.0rc2). That would be an argument for moving this module out of alphabase.

lucas-diedrich approved these changes Nov 25, 2024
@mschwoer mschwoer force-pushed the add_anndata_factory branch from f6bb454 to 427e64a Compare November 25, 2024 16:06
@mschwoer mschwoer changed the base branch from refactor_readers_XI to add_alphadia_reader November 25, 2024 16:09
@mschwoer mschwoer force-pushed the add_alphadia_reader branch from 3e745b3 to 239e2ef Compare November 26, 2024 09:35
@mschwoer mschwoer force-pushed the add_anndata_factory branch from 427e64a to 8ce3ce5 Compare November 26, 2024 09:35
@mschwoer mschwoer marked this pull request as ready for review November 26, 2024 10:36
Base automatically changed from add_alphadia_reader to development January 9, 2025 16:27