
🐕 Batch: Add Output Validation for Model API Results in Playground and Model Tester Pipelines #1488

Open
9 of 12 tasks
Abellegese opened this issue Jan 5, 2025 · 4 comments
@Abellegese (Contributor) commented Jan 5, 2025

Description

The current models process SMILES inputs (a string, a list, or a .csv file) and generate several types of outputs, including numeric values, text (e.g., for generative models), and boolean values. Checking and validating the outputs of these models helps ensure consistent functionality as they go through refactors and other changes. Ersilia currently has two main testing pipelines to check whether models work properly and to report issues otherwise: the test command and the playground test. Neither pipeline performs in-depth checks on the outputs of the model APIs. This issue focuses on defining and implementing validation checks for outputs returned by the Ersilia Model Hub APIs.

The goal is to:

  1. Define a set of rules for validating outputs in the playground test pipeline.
  2. Add a new check function in the CheckService class for the model tester pipeline to validate results programmatically.

Details
To get the most out of checking and validating model outputs, and to cover the different output types, we can use the information in the metadata, a file that holds all the necessary information about the model.

  • For instance, the metadata includes the following information about a model's inputs and outputs:
  "Input": [
      "Compound"
  ],
  "Input Shape": "Single",
  "Output": [
      "Descriptor"
  ],
  "Output Type": [
      "Integer"
  ],
  "Output Shape": "List"
  1. Ersilia Test Command Features (Double-Layer Checks)

    a) First-level checks: for now these focus on two pieces of metadata, Output Type and Output Shape, with inputs given as a string, a list, and a .csv file. This can be achieved with a simple POST request to the model API after serving it. It matters because this command tests model healthiness, so accessing the raw output from the API and validating it would be invaluable (see the sketch after this list).

    • The first check is that the output returned from the API contains no null, '', None, or NaN values.
    • If that check passes, we check whether the JSON returned by the API has the required data type (for instance integer, float, bool) and data structure (List, Single, ...).
    • Then we match the length of the input against the length of the output.

    b) Second-level checks: once all first-level checks pass, we check:

    • Whether the requested output file type (csv, json, h5) is correctly created and contains proper values. Since the values themselves are validated in the first level, a failure at this stage is likely caused by Ersilia rather than by the models themselves. Decoupling the two systems is really important.
    • Then we again match the length of the input against the length of the output.
  2. Playground Test:

    • Create rules that get executed at the run command: run an Ersilia run command and validate three inputs against the three output file types. The validation checks are the same as in the model tester: no null, '', None, or NaN values, plus data type, data structure, and length matching between input and output.
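
To make the first-level check concrete, here is a minimal Python sketch. It is an illustration only: the endpoint URL, the payload shape, and the one-result-per-input response layout are assumptions, not the actual Ersilia serve API.

# Hypothetical sketch of the first-level check: POST raw inputs to a served
# model API and validate the raw JSON response against the model metadata.
import math

import requests

INVALID_VALUES = (None, "", "null")
# Map the metadata "Output Type" field to Python types (Float accepts ints too).
TYPE_MAP = {"Integer": int, "Float": (int, float), "String": str, "Boolean": bool}


def first_level_check(smiles_list, metadata, url="http://localhost:3000/run"):
    # The URL and payload format are placeholders for the served model API.
    response = requests.post(url, json={"input": smiles_list}, timeout=60)
    response.raise_for_status()
    results = response.json()

    # Length matching between input and output.
    if len(results) != len(smiles_list):
        return False

    expected_type = TYPE_MAP[metadata["Output Type"][0]]
    for result in results:
        values = result if isinstance(result, list) else [result]
        for value in values:
            # No null, empty, or NaN values allowed.
            if value in INVALID_VALUES:
                return False
            if isinstance(value, float) and math.isnan(value):
                return False
            # The declared data type from the metadata must match.
            if not isinstance(value, expected_type):
                return False
    return True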

Tasks

  1. Playground Test: Add Rule-Based Checks and Feature Update

    • Define output validation rules for numeric, text, and boolean outputs.
    • Update the playground test pipeline to incorporate these rules.
    • Ensure rules can be easily updated or extended for new models.
    • Support macOS for the Docker status check (we cannot manipulate Docker programmatically on macOS).
    • Allow passing the full config.yml from a nox session.
  2. Model Tester: Add CheckService Function

    • Add a check function in the CheckService class.
    • Implement logic to validate (a rough sketch follows after this task list):
      • A check function that POSTs directly to the model API.
      • Numeric outputs: range checks, type validation, and NaN handling.
      • Text outputs: non-empty checks and expected substring checks.
      • Boolean outputs: enforce strict True/False results.
      • The Ersilia run command with the three output file types (csv, json, h5), validating each result.
    • Add appropriate error logging for invalid outputs (reporting them as a table would be better).
    • Write unit tests for the check function to ensure its correctness.
  3. Integration and Documentation

    • Integrate the updated pipelines into the CI/CD workflow.
    • Update the project documentation with details about validation rules and usage examples.
    • Add examples to the playground test and ersilia test command user guides showing how the rules apply during testing.
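
As a starting point for the CheckService addition, here is a rough sketch of what such a check could look like. The method name check_model_output_content, the error-row format, and the type handling are placeholders, not part of the existing CheckService API.

import math


class CheckService:  # simplified stand-in for the existing model tester class
    def check_model_output_content(self, inputs, outputs, output_type):
        errors = []  # collected as (index, value, reason) rows for table reporting

        # Length matching between input and output.
        if len(inputs) != len(outputs):
            errors.append(("-", "-", "input/output length mismatch"))

        for i, value in enumerate(outputs):
            if value is None or value == "":
                errors.append((i, value, "null or empty value"))
            elif output_type == "Integer" and not isinstance(value, int):
                errors.append((i, value, "expected an integer"))
            elif output_type == "Float":
                if not isinstance(value, (int, float)) or math.isnan(value):
                    errors.append((i, value, "expected a non-NaN float"))
            elif output_type == "String" and not str(value).strip():
                errors.append((i, value, "empty text output"))
            elif output_type == "Boolean" and not isinstance(value, bool):
                errors.append((i, value, "expected strict True/False"))

        return errors  # an empty list means the check passed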

Objective(s)

No response

Documentation

No response

@GemmaTuron (Member) commented:

Hi @Abellegese

I am a bit confused about the changes that will be implemented in the test command itself vs the testing playground. Wasn't the test command already checking that the output was correct? Would the existing test, in the version prior to the refactoring, have passed if the output was None or null?

@Abellegese (Contributor, Author) commented:

Hi @GemmaTuron, yes, the test indeed checks the output, but it reads the values from a csv only (json and h5 are missing). Below is the code that checks the values extracted from the csv output. It compares two outputs: 1) from the ersilia run CLI and 2) from running the model's run.sh file in its isolated venv.

# Helper from the model tester: self, texc, compute_rmse, spearmanr and
# _compare_output_strings come from the enclosing class/module scope.
def validate_output(output1, output2):
    # The two outputs (ersilia CLI vs. run.sh) must be of the same type.
    if not isinstance(output1, type(output2)):
        raise texc.InconsistentOutputTypes(self.model_id)

    if output1 is None:
        return

    # Numeric outputs: compare with RMSE and Spearman correlation.
    if isinstance(output1, (float, int)):
        rmse = compute_rmse([output1], [output2])
        if rmse > 0.1:
            raise texc.InconsistentOutputs(self.model_id)

        rho, _ = spearmanr([output1], [output2])
        if rho < 0.5:
            raise texc.InconsistentOutputs(self.model_id)

    # List outputs: same comparison over the whole list.
    elif isinstance(output1, list):
        rmse = compute_rmse(output1, output2)
        if rmse > 0.1:
            raise texc.InconsistentOutputs(self.model_id)

        rho, _ = spearmanr(output1, output2)
        if rho < 0.5:
            raise texc.InconsistentOutputs(self.model_id)

    # String outputs: fuzzy string similarity must exceed 95%.
    elif isinstance(output1, str):
        if _compare_output_strings(output1, output2) <= 95:
            raise texc.InconsistentOutputs(self.model_id)
@Abellegese (Contributor, Author) commented Jan 7, 2025

So @GemmaTuron, here I want to build a more general testing system for the three output file types, also guided by Ersilia's general data structure definitions:

# Supported data structures
ds = {
    "Single": lambda x: isinstance(x, list) and len(x) == 1,
    "List": lambda x: isinstance(x, list)
    and len(x) > 1
    and all(isinstance(item, (int, float)) for item in x),
    "Flexible List": lambda x: isinstance(x, list)
    and all(isinstance(item, (str, int, float)) for item in x),
    "Matrix": lambda x: isinstance(x, list)
    and all(
        isinstance(row, list)
        and all(isinstance(item, (int, float)) for item in row)
        for row in x
    ),
    "Serializable Object": lambda x: isinstance(x, dict),
}

In the model tester I proposed a double-layer check, meaning we decouple the model API result (the raw response from the model) from the Ersilia system that converts those API results into the specified output files. In the first layer we run several checks on the model API result, guided by the data structures above, and also verify that the invalid-value rule (no None, null, ...) is not violated (a sketch follows below). The reason is that the three output files are generated by Ersilia, so if something changes in the Ersilia system we might conclude that the model itself is broken, which could be entirely false: the final files depend on the Ersilia system and are not sufficient on their own to judge model healthiness. Once the first-layer checks pass, any file check that fails afterwards is not a model problem, and this tells us where the problem comes from.
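
A sketch of how the first-layer check could combine the ds validators above with the invalid-value rule, assuming the raw API response is a list of per-input results (the response layout is an assumption):

import math


def is_invalid(value):
    # null, empty string, or NaN are all treated as invalid model outputs.
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


def first_layer_check(api_results, output_shape):
    # `ds` is the data structure dictionary defined above; output_shape comes
    # from the metadata, e.g. "List" for "Output Shape": "List".
    validator = ds[output_shape]
    for result in api_results:
        if not validator(result):
            return False
        items = result.values() if isinstance(result, dict) else (
            result if isinstance(result, list) else [result]
        )
        if any(is_invalid(v) for v in items):
            return False
    return True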

In the playground test, on the other hand, we have all of those checks except the double-layer check, which matters most for testing models that are under development or maintenance.

Does it make sense?

@DhanshreeA (Member) commented:

Hi @Abellegese, I completely understand your point about decoupling the systems and testing them individually rather than testing the integration of the "ersilia CLI + model" setup. However, it does not do much for us, since the final outcome, i.e. whether the model is working or not, will be judged in light of the entire system. That is, if the model API does return a correct response but Ersilia does not handle it correctly, the eventual effect is that the model will not be considered "working", because it will not be useful for the users.

I am weighing the tradeoff here between usefulness and implementation time - as a proxy measure, you can already see that the model API returned something by looking at the response status code in the logs of the prediction request.

In any case, the test that runs the bash script and generates a CSV file is good enough evidence that the model itself is working. I don't see a strong argument for the decoupling approach.

Labels
None yet
Projects
Status: On Hold

3 participants