feat(openchallenges): add EDAM Extract and Transform Processes #2564

Merged
merged 16 commits into from
Mar 22, 2024

Conversation

mdsage1
Contributor

@mdsage1 mdsage1 commented Mar 13, 2024

Description

EDAM ETL processes need to be developed to incorporate the EDAM ontology into MariaDB and link the ontology to existing data. This PR addresses the extract and transform portions.

Related Issue

Contribute to #2524
Contribute to #2548

Fixes #2547
Fixes #2563

Changelog

  • Add
  1. Download a specified version of the EDAM ontology from https://github.com/Sage-Bionetworks/edamontology
  2. Transform the raw data into a pandas DataFrame that matches the content of this file
  3. Start id values from 1 to mimic the behavior of SQL AUTO_INCREMENT.
  4. Print info and statistics about the data to stdout:
     • Version of EDAM processed
     • Number of concepts transformed (overall, operation, data, etc.)
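
The transform step and the 1-based ids can be sketched as follows. This is a minimal illustration, not the PR's actual code: the download is omitted, the sample rows are made up, and the raw column names ("Class ID", "Preferred Label") and the `transform` helper are assumptions.

```python
import io

import pandas as pd

# Tiny stand-in for the downloaded EDAM CSV. The column names are assumptions.
RAW_CSV = """Class ID,Preferred Label
http://edamontology.org/data_0005,Resource type
http://edamontology.org/format_1915,Format
http://edamontology.org/operation_0004,Operation
"""


def transform(raw_csv: str) -> pd.DataFrame:
    """Hypothetical helper: parse the raw CSV text and add a 1-based id column."""
    df = pd.read_csv(io.StringIO(raw_csv))
    df = df.rename(
        columns={"Class ID": "class_id", "Preferred Label": "preferred_label"}
    )
    # Number the rows from 1, mimicking SQL AUTO_INCREMENT.
    df.insert(0, "id", range(1, len(df) + 1))
    return df


concepts = transform(RAW_CSV)
print(concepts)
```

Writing an explicit id column, rather than relying on the default 0-based DataFrame index, keeps the ids stable when the frame is later written to a table.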

Preview

image

@mdsage1 mdsage1 added the sonar-scan-approved-deprecated Ready for Sonar code analysis label Mar 13, 2024
@mdsage1 mdsage1 changed the title feat(edam): Add EDAM ETL feat(openchallenges): Add EDAM ETL Mar 13, 2024
@mdsage1 mdsage1 changed the title feat(openchallenges): Add EDAM ETL feat(openchallenges): add EDAM ETL Mar 13, 2024
@mdsage1 mdsage1 self-assigned this Mar 13, 2024
@mdsage1
Contributor Author

mdsage1 commented Mar 14, 2024

@tschaffter Quality Gate doesn't seem to be performing checks for this PR.

@mdsage1 mdsage1 marked this pull request as ready for review March 15, 2024 15:32
@mdsage1 mdsage1 changed the title feat(openchallenges): add EDAM ETL feat(openchallenges): add EDAM Extract and Transform Processes Mar 15, 2024
@mdsage1
Contributor Author

mdsage1 commented Mar 15, 2024

@tschaffter This is ready for review. Version is now an environment variable and the Description includes a preview.

@vpchung
Member

vpchung commented Mar 15, 2024

@mdsage1 thanks for working on this!! You didn't ask me to, but I added some comments to the PR. Feel free to use them or ignore 😄

@mdsage1 mdsage1 requested a review from vpchung March 15, 2024 21:09
@tschaffter
Member

tschaffter commented Mar 18, 2024

Fixes #2546

This PR is part of #2546, so this PR should not be configured to close this ticket. I will remove it from the list.

Member

@tschaffter tschaffter left a comment

@mdsage1 Why do you show the std, mean and other metrics for the id column?

Update: here is the information that the script should print:

  • Version of EDAM processed
  • Number of concepts that will be added to the table
    • Total number of concepts
    • Number of concepts for the following categories
      • Data concepts
      • Operation concepts
      • Format concepts
      • Operation concepts
      • Other concepts
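
One way to produce the summary listed above is sketched here. It is an assumption-laden illustration rather than the script's real code: the `summarize` helper, the sample rows, the version string, and the shape of the `class_id` values are all hypothetical.

```python
import pandas as pd

CATEGORIES = ("data", "operation", "format")


def summarize(df: pd.DataFrame, version: str) -> dict:
    """Hypothetical helper: print and return the concept counts per category."""
    # Pull the category out of the identifier, e.g. "data" from ".../data_0005".
    category = df["class_id"].str.extract(r"/([a-z]+)_\d+$", expand=False)
    counts = {"total": len(df)}
    for name in CATEGORIES:
        counts[name] = int((category == name).sum())
    # Everything that is not one of the named categories is reported as "other".
    counts["other"] = counts["total"] - sum(counts[name] for name in CATEGORIES)

    print(f"EDAM version: {version}")
    for name, value in counts.items():
        print(f"Number of {name} concepts: {value}")
    return counts


sample = pd.DataFrame(
    {
        "class_id": [
            "http://edamontology.org/data_0005",
            "http://edamontology.org/data_0006",
            "http://edamontology.org/operation_0004",
            "http://edamontology.org/format_1915",
            "http://edamontology.org/topic_0003",
        ]
    }
)
summary = summarize(sample, version="1.25")
```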

@tschaffter tschaffter self-requested a review March 18, 2024 22:25
Member

@tschaffter tschaffter left a comment

The config parameters should be validated before using them, otherwise the following behavior will occur:

$ nx serve-detach openchallenges-edam-etl

> nx run openchallenges-edam-etl:serve-detach

 Container openchallenges-mariadb  Recreate
 Container openchallenges-mariadb  Recreated
 Container openchallenges-edam-etl  Recreate
 Container openchallenges-edam-etl  Recreated
 Container openchallenges-mariadb  Starting
 Container openchallenges-mariadb  Started
 Container openchallenges-mariadb  Waiting
 Container openchallenges-mariadb  Healthy
 Container openchallenges-edam-etl  Starting
 Container openchallenges-edam-etl  Started

 ————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————————

 >  NX   Successfully ran target serve-detach for project openchallenges-edam-etl (33s)
 
   View logs and investigate cache misses at https://cloud.nx.app/runs/Ivfcx2Zd35

vscode@dee30b82cf44:/workspaces/sage-monorepo$ docker logs openchallenges-edam-etl
EDAM Version: None
OC DB URL: jdbc:mysql://openchallenges-mariadb:3306/challenge_service
Downloading the EDAM concepts from GitHub (CSV file)...
Error downloading EDAM concepts: 404 Client Error: Not Found for url: https://github.com/edamontology/edamontology/raw/main/releases/EDAM_None.csv
Processing the EDAM concepts...
File EDAM_None.csv not found.
No data available.
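
A fail-fast check along these lines would stop the run before "EDAM_None.csv" is ever requested. This is only a sketch: the variable name `EDAM_VERSION`, the `validate_edam_version` helper, and the dotted-number version format are assumptions, not details taken from the PR.

```python
import re
from typing import Mapping


def validate_edam_version(env: Mapping[str, str], key: str = "EDAM_VERSION") -> str:
    """Hypothetical helper: return the configured version or raise ValueError."""
    value = env.get(key)
    if not value:
        raise ValueError(f"{key} is not set")
    # Assumed format: dot-separated numbers such as "1.25".
    if not re.fullmatch(r"\d+(\.\d+)*", value):
        raise ValueError(f"{key} has an unexpected format: {value!r}")
    return value


# Demonstration with plain dicts; a real script would pass os.environ instead.
print(validate_edam_version({"EDAM_VERSION": "1.25"}))
try:
    validate_edam_version({})
except ValueError as error:
    print(f"Configuration error: {error}")
```

Raising before any download starts means the log shows one clear configuration error instead of a 404 followed by a missing-file message.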

@mdsage1
Contributor Author

mdsage1 commented Mar 20, 2024

closed in error

@mdsage1 mdsage1 reopened this Mar 20, 2024
@mdsage1 mdsage1 marked this pull request as ready for review March 21, 2024 16:14
@tschaffter
Member

You could get this information from the column class_id and a regex.

See the suggestion I made above.

@vpchung
Member

vpchung commented Mar 21, 2024

You could get this information from the column class_id and a regex.

@mdsage1 alternatively, you can also do a replace() to remove the substring you don't need from class_id, if you're not comfortable with regex.

EDIT: Since you're interested in the number of concepts per category, you can actually use pandas' contains to get you closer to the count 🙂 e.g.

>>> df["class_id"].str.contains("data")
0        True
1        True
2        True
3        True
4        True
        ...  
3468    False
3469    False
3470    False
3471    False
3472    False

@tschaffter
Member

tschaffter commented Mar 21, 2024

Prefer exact match to using contains (more future proof): contains would not work if the ontology were to have the concept Data and DataFormat, for example.
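
The difference can be seen on a small made-up series. The `dataformat_` identifier below is hypothetical and only stands in for a possible future category; the URL shape of the identifiers is also an assumption.

```python
import pandas as pd

ids = pd.Series(
    [
        "http://edamontology.org/data_0005",
        "http://edamontology.org/dataformat_0001",  # hypothetical future category
        "http://edamontology.org/format_1915",
    ]
)

# Substring match: also counts the hypothetical "dataformat" concept.
substring_hits = int(ids.str.contains("data", regex=False).sum())

# Exact match: isolate the category name, then compare it for equality.
category = ids.str.rsplit("/", n=1).str[-1].str.split("_").str[0]
exact_hits = int((category == "data").sum())

print(f"contains: {substring_hits}, exact: {exact_hits}")
```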

@vpchung
Member

vpchung commented Mar 21, 2024

if the ontology were to have the concept Data and DataFormat, for example.

Good point. Just shooting my shot here, but this can be overcome by using data_ (assuming they use "dataformat_"). Also, you can use regex with contains().

@mdsage1 mdsage1 marked this pull request as draft March 21, 2024 17:48
@mdsage1
Contributor Author

mdsage1 commented Mar 21, 2024

@tschaffter I've updated the concept counts to use the class_id column and a regex. The match ignores case to avoid future issues. Instead of contains, I used the search() function from the regex module. Following @vpchung's suggestion, I added the underscore to the regex so that data, or any other concept name, no longer matches when another word follows the word of interest.
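
The approach described in this comment might look roughly like the sketch below. It assumes Python's standard `re` module; the helper name `count_by_pattern`, the pattern, and the sample identifiers are illustrative and not copied from main.py.

```python
import re

import pandas as pd


def count_by_pattern(identifier_pattern: str, df: pd.DataFrame) -> int:
    """Hypothetical helper: count class_id values matching the pattern, ignoring case."""
    pattern = re.compile(identifier_pattern, flags=re.IGNORECASE)
    matches = df["class_id"].apply(lambda value: bool(pattern.search(value)))
    return int(matches.sum())


sample = pd.DataFrame(
    {
        "class_id": [
            "http://edamontology.org/data_0005",
            "http://edamontology.org/DATA_0006",  # upper case still matches
            "http://edamontology.org/dataformat_0001",  # hypothetical, must not match
            "http://edamontology.org/operation_0004",
        ]
    }
)

# The trailing underscore keeps "dataformat_" from counting as "data_".
data_count = count_by_pattern(r"/data_\d+", sample)
print(data_count)
```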

@mdsage1 mdsage1 marked this pull request as ready for review March 21, 2024 18:17
@mdsage1 mdsage1 requested review from tschaffter and vpchung March 21, 2024 18:20
Member

@vpchung vpchung left a comment

I have one final suggestion, but otherwise, the script looks good on my end!

Contributor Author

@mdsage1 mdsage1 left a comment

tested the link and updated it

@mdsage1 mdsage1 requested a review from tschaffter March 21, 2024 23:05
@mdsage1 mdsage1 marked this pull request as draft March 22, 2024 15:54
@mdsage1 mdsage1 marked this pull request as ready for review March 22, 2024 16:02
apps/openchallenges/edam-etl/src/main.py
return None


def count_occurrences(identifier_pattern: str, df) -> int:
Member

Suggested change
- def count_occurrences(identifier_pattern: str, df) -> int:
+ def count_occurrences(identifier_pattern: str, df: pd.DataFrame) -> np.int64:

^^ IIRC. May need to double-check whether it is a numpy type being returned.

Contributor Author

A NumPy int is being returned; I've made the changes.

Member

Cool cool. May need to add import numpy as np in this case!
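
Whether the annotation should be int or a NumPy type depends on what the function body returns. The sketch below assumes the count comes from summing a boolean pandas Series, which yields a NumPy integer rather than a built-in int; the body is illustrative and not the merged implementation.

```python
import numpy as np  # required because the annotation refers to np.int64
import pandas as pd


def count_occurrences(identifier_pattern: str, df: pd.DataFrame) -> np.int64:
    """Illustrative body: count class_id values that match the pattern."""
    matches = df["class_id"].str.contains(identifier_pattern, case=False, regex=True)
    # Summing a boolean Series returns a NumPy integer, not a built-in int.
    return matches.sum()


sample = pd.DataFrame(
    {
        "class_id": [
            "http://edamontology.org/data_0005",
            "http://edamontology.org/data_0006",
            "http://edamontology.org/format_1915",
        ]
    }
)

result = count_occurrences(r"/data_\d+", sample)
print(type(result).__name__, int(result))
```

Wrapping the sum in int() would be an alternative that keeps the original -> int annotation and avoids the extra NumPy import.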

Member

Nice catch! I need to submit a patch for this project in #2594 and will fix this issue at the same time.

@tschaffter tschaffter merged commit 3c7933e into Sage-Bionetworks:main Mar 22, 2024
9 checks passed
@mdsage1 mdsage1 deleted the edam-etl branch March 22, 2024 19:45
Labels
sonar-scan-approved-deprecated Ready for Sonar code analysis
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Task] Start EDAM concept ID from 1 instead of 0
[Task] Extract and transform the EDAM ontology data
3 participants