Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🐕 Batch: An LLM-improved model incorporation pipeline #1382

Open
miquelduranfrigola opened this issue Nov 15, 2024 · 4 comments
Open

🐕 Batch: An LLM-improved model incorporation pipeline #1382

miquelduranfrigola opened this issue Nov 15, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@miquelduranfrigola
Copy link
Member

miquelduranfrigola commented Nov 15, 2024

Improved model incorporation experience (with LLMs)

Hi @DhanshreeA, I wanted to suggest a few improvements to the pipeline for model incorporation.

Background

The current model incorporation pipeline (based on GitHub Issues) works nicely but, consistently, we get poor metadata from our contributors. We should have a better annotation of the models from the inception, and I think that LLMs can greatly help us here.

In particular, we already have two scripts to (a) summarize a publication PDF and (b) generate metadata from a raw metadata file that, in my opinion, can be exploited for this.

Therefore, I suggest an update of the pipeline as specified below.

Suggested pipeline

  1. The user opens a model request. In that model request, the form can be more simplified than it is now (for example, no need to ask for tags yet). Actually, what we need is a suggested title, a drafted description, and the links to the publication and code. This is all.

  2. As soon as a model request is posted, we could trigger an action that modifies the title of the issue and directly assigns it a model identifier. This will have two advantages: (a) it will make it easier for all of us to find issues associated with models and (b) it will already tell the Ersilia team which is the associated identifier to this thread, which will have good implications for the next point (3).

  3. As soon as a model identifier is available (in the title of the issue) and a link to the publication is found. The Ersilia maintainers can download the publication and store it in the S3 bucket, correspondingly. This is a necessary manual step that will ensure that we have the publication in PDF form available already at this early stage.

  4. Then, we can start discussion in the thread as normal but, at some point, the Ersilia team can trigger a /stage command. I am calling it /stage but we can call it /pre-approve or /annotate (I like /stage - unless it is reserved by GitHub?). The goal of this command will be to launch the following LLM-based workflows: (a) produce a summary of the publication and (b) produce an improved version of the metadata based on the summary of the publication and the raw metadata provided by the user on the first message of the issue. As a result of this workflow, we will have the summary of the publication stored in S3 and the improved metadata, in Markdown format, posted as a comment in the issue thread.

  5. The contributor will be asked to revise the improved metadata, and edit the comment correspondingly, until we are all satisfied with the content. In this improved metadata comment, a slug is provided, an improved description, tags, etc.

  6. Finally, we can /approve the incorporation of the model. In this case, the workflows will proceed normally, that is, a new model repository will be created corresponding to the model identifier, and a comment will be posted encouraging the contributor to work on it. There are a few things to be considered. First, note that the model identifier was already generated in point 1, so there is no need to generate a new one. Second, the metadata to generate the metadata.yml file should be obtained from the improved metadata comment, not from the first comment of the issue. Finally, we may want to add a comment somewhere notifying that this procedure is assisted with an LLM.

About the LLM

I would definitely use OpenAI GPT-4o (or more) for this. There is no need to use any custom LLM or any self-hosted solution.

Objective(s)

  • Improve the model incorporation workflow with LLM assistance.

Documentation

N/A

@staru09
Copy link

staru09 commented Nov 23, 2024

Hii is this issue open to work ?

@DhanshreeA DhanshreeA added the enhancement New feature or request label Nov 25, 2024
@DhanshreeA
Copy link
Member

Closely related to, but not a duplicate of: #1354

@GemmaTuron
Copy link
Member

Hi @staru09

Thanks for your interest in Ersilia! Please have a look at issues labelled as good-first-contributions to start working with Ersilia rather than complex batch issues. Thanks again :)

@DhanshreeA DhanshreeA removed their assignment Nov 28, 2024
@miquelduranfrigola
Copy link
Member Author

Update: I have drafted a project proposal for Harvard T4SG. You can find it here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
Status: On Hold
Development

No branches or pull requests

4 participants