Bootstrap the library #1

Merged
merged 18 commits into from
May 8, 2024

@cwognum cwognum commented Apr 26, 2024

While @zhu0619 and I were working through polaris-hub/polaris#98, we realized that the curation module within the Polaris library should really be a stand-alone package, since it is independent of the broader Polaris codebase. Keeping it inside Polaris has been fine up until now, but as we go public with Polaris we will need to be more careful about our release strategy. Tying the curation module to that release cycle would unnecessarily slow down its development.

This PR moves the curation module into its own package, temporarily named alchemy.

Refactoring

As part of this PR, we decided to refactor the code base so that it is better set up for future maintenance.

In summary:

  • We added a Curator class. This class specifies the curation process as a sequence of steps. It can be serialized to and loaded from JSON, which makes the curation process easier to reproduce.
  • We reimplemented all the curation steps as individual, serializable BaseAction objects.
  • We then added the curation.functional module, inspired by torch.nn.functional, so that any curation step can be used in isolation, outside of the object-based approach.
  • We added a CurationReport, which is produced by the Curator and holds relevant information about the curation process (for now, just logs and images). It can be exported to different formats through Broadcasters, such as the LoggerBroadcaster and HTMLBroadcaster.
  • We added the visualization module and refactored the visualizations to be more general.
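To illustrate the idea behind serializable steps, here is a minimal, standard-library-only sketch of a pipeline that round-trips through JSON. It is not the code from this PR: the names `REGISTRY`, `register`, `DropMissing`, `Discretize`, and `Pipeline` are hypothetical stand-ins, and the real Curator and BaseAction classes may be structured quite differently.

```python
import json
from dataclasses import asdict, dataclass, field

# Hypothetical registry mapping a step's class name to its class, so that
# steps can be rebuilt from JSON. An assumption, not the package's mechanism.
REGISTRY = {}


def register(cls):
    REGISTRY[cls.__name__] = cls
    return cls


@register
@dataclass
class DropMissing:
    """Toy step: drop rows where `column` is None."""

    column: str

    def __call__(self, rows, log):
        kept = [r for r in rows if r.get(self.column) is not None]
        log.append(f"DropMissing({self.column}): removed {len(rows) - len(kept)} rows")
        return kept


@register
@dataclass
class Discretize:
    """Toy step: label each row by how many thresholds its value reaches."""

    column: str
    thresholds: list

    def __call__(self, rows, log):
        out = []
        for r in rows:
            label = sum(r[self.column] >= t for t in self.thresholds)
            out.append({**r, f"{self.column}_class": label})
        log.append(f"Discretize({self.column}): labelled {len(out)} rows")
        return out


@dataclass
class Pipeline:
    """Toy counterpart of a curator: an ordered list of serializable steps."""

    steps: list = field(default_factory=list)

    def __call__(self, rows):
        log = []  # stands in for a report that collects messages per step
        for step in self.steps:
            rows = step(rows, log)
        return rows, log

    def to_json(self):
        return json.dumps(
            [{"name": type(s).__name__, "params": asdict(s)} for s in self.steps]
        )

    @classmethod
    def from_json(cls, text):
        return cls(steps=[REGISTRY[d["name"]](**d["params"]) for d in json.loads(text)])


pipeline = Pipeline(steps=[DropMissing("SOL"), Discretize("SOL", [-3])])
restored = Pipeline.from_json(pipeline.to_json())  # rebuild from JSON alone

rows = [{"SOL": -4.0}, {"SOL": None}, {"SOL": -2.5}]
result, log = restored(rows)
```

Because every step is plain data plus a `__call__`, the JSON string is enough to rebuild and rerun the same sequence of steps on the same input.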

The object-based API now looks like:

# Define the curation workflow
curator = Curator(
    steps=[
        MoleculeCuration(input_column="smiles"),
        OutlierDetection(method="zscore", columns=["SOL"]),
        Discretization(input_column="SOL", thresholds=[-3]),
    ],
    parallelized_kwargs={"n_jobs": -1},
)

# Run the curation
dataset, report = curator(dataset)

You could do something similar with the functional API, e.g.:

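import numpy as np

y = np.random.normal(0, 1, 1000)  # 1000 samples from a standard normal
is_outlier = detect_outliers(y)  # detect_outliers comes from the functional API
visualize_distribution_with_outliers(y, is_outlier)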
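For readers unfamiliar with z-score outlier detection (the method named in the object-based example), a dependency-free sketch follows. The signature and defaults of the PR's `detect_outliers` are not shown here, so `zscore_outliers` and its `threshold=3.0` default are assumptions made for illustration only.

```python
import random
from statistics import fmean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return one boolean per value: True where |z-score| exceeds `threshold`.

    Hypothetical helper, not the package's `detect_outliers`.
    """
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # All values are identical, so nothing can be an outlier.
        return [False] * len(values)
    return [abs(v - mean) / std > threshold for v in values]


random.seed(0)
# 1000 standard-normal samples plus one obviously extreme value at the end.
y = [random.gauss(0, 1) for _ in range(1000)] + [12.0]
flags = zscore_outliers(y)
```

The function takes plain values and returns plain flags, which is the same stateless shape the functional API aims for: no Curator or step object is needed to call it.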

Relation to Polaris Recipes

Now that we have this system, I'm thinking we should require every polaris-recipe to provide the serialized Curator object as JSON. That way, anyone who has access to the base dataset could reproduce the process with the CLI we added.

TODO

(Note: We don't need to do all of these in this PR; they are listed for future reference.)

  • Decide on a name
  • Extensively test by porting the code in the Polaris Recipes repo
  • Add more documentation (bit of an empty shell right now)
  • Make public and release officially!

@cwognum cwognum requested a review from zhu0619 April 26, 2024 20:44
@cwognum cwognum self-assigned this Apr 26, 2024
@cwognum cwognum added the feature label (annotates any PR that adds new features; used in the release process) Apr 26, 2024
* dev refactor

* fix test

* fix tests

* remove unused code

* refactor repo name

* Reviewed @zhu0619's changes. Switched to jinja2 for HTML broadcaster

* Fixed bug in LoggerBroadcaster

* Fixed release CICD

* Fix CICD

---------

Co-authored-by: cwognum <[email protected]>
@cwognum cwognum merged commit 01210d9 into main May 8, 2024
4 checks passed
@cwognum cwognum deleted the bootstrap branch May 8, 2024 00:14