feat: Integrating ChemTEB #1708

Open · HSILA wants to merge 79 commits into base: main
Conversation

@HSILA (Contributor) commented on Jan 5, 2025

As discussed in issue #1585, I have integrated ChemTEB into the MTEB codebase and submitted this PR for merging. I have added 28 tasks and two model families: amazon_models.py and cohere_bedrock_models.py. These models were included to meet our internal requirements, as they are not currently available in MTEB. I thought it might be useful to have them, but I can remove them if necessary. I have tested the mentioned models on all the tasks; you can see the results here: chemteb_results_pr.csv

Closes #1585

Checklist

  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
  • Update tables in docs/benchmarks.md and docs/tasks.md
  • Add ChemTEB as a benchmark in benchmark.py

Adding datasets checklist

Reason for dataset addition: To address the lack of benchmarks tailored to the chemical sciences, where text involves complex entity names, expressions, and specialized representations such as SMILES codes that are not typically found in general-domain datasets.

  • I have run the following models on the tasks (adding the results to the PR). These can be run using the mteb -m {model_name} -t {task_name} command (see the Python sketch after this checklist).
    • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2
    • intfloat/multilingual-e5-small
  • I have checked that the performance is neither trivial (both models gain close to perfect scores) nor random (both models gain close to random scores).
  • If the dataset is too big (e.g. >2048 examples), consider using self.stratified_subsampling() under dataset_transform()
  • I have filled out the metadata object in the dataset file (find documentation on it here).
  • Run tests locally to make sure nothing is broken using make test.
  • Run the formatter to format the code using make lint.
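
For reference, a minimal sketch of running the two baseline models above on one of the new ChemTEB tasks through the Python API (equivalent to the mteb -m ... -t ... command); the task name and output folder below are illustrative choices, not part of the checklist:

import mteb

# One of the baseline models listed in the checklist above.
model = mteb.get_model("intfloat/multilingual-e5-small")

# A ChemTEB task added in this PR (illustrative choice).
tasks = mteb.get_tasks(tasks=["PubChemSMILESBitextMining"])

# Run the evaluation and write the JSON result files that would be
# submitted to the results repository.
evaluation = mteb.MTEB(tasks=tasks)
evaluation.run(model, output_folder="results")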

Adding a model checklist

  • I have filled out the ModelMeta object to the extent possible
  • I have ensured that my model can be loaded using mteb.get_model(model_name, revision) and mteb.get_model_meta(model_name, revision) (see the sketch after this checklist)
  • I have tested the implementation works on a representative set of tasks.
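
A minimal sketch of that loading check, using one of the baseline models above as a stand-in (the newly added Amazon and Cohere Bedrock model names would be checked the same way):

import mteb

# Look up the registered ModelMeta entry first.
meta = mteb.get_model_meta("intfloat/multilingual-e5-small")
print(meta.name, meta.revision)

# Then load the model implementation itself through the same registry.
model = mteb.get_model(meta.name, revision=meta.revision)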

HSILA and others added 30 commits August 10, 2024 17:47
HSILA and others added 10 commits January 4, 2025 17:25
- Move `PubChemSMILESBitextMining` to `eng` folder
- Add citations for tasks involving SDS, NQ, Hotpot, PubChem data
- Update Clustering tasks `category`
- Change `main_score` for `PubChemAISentenceParaphrasePC`
- `eval_langs` fixed
- Dataset path was fixed for two datasets
- Metadata was completed for all tasks, mainly the following fields: `date`, `task_subtypes`, `dialect`, `sample_creation`
- ruff lint
- rename `nomic_bert_models.py` to `nomic_bert_model.py` and update it.
@Samoed (Collaborator) left a comment

Great! Can you submit your results to https://github.com/embeddings-benchmark/results and compare them with the results from the paper?

Review comments (outdated, resolved) on:
- mteb/models/cohere_bedrock_models.py
- mteb/tasks/BitextMining/eng/PubChemSMILESBitextMining.py
- mteb/tasks/PairClassification/eng/PubChemSynonymPC.py
@HSILA (Contributor, Author) commented on Jan 5, 2025

> Great! Can you submit your results to https://github.com/embeddings-benchmark/results and compare them with the results from the paper?

Hi, thank you for the review. I haven't worked with the mentioned repository before. Could you clarify how I should proceed? What do you mean by "paper"?

@Samoed (Collaborator) commented on Jan 5, 2025

You should provide the results of your run to this repo (a directory with JSON files). By "paper" I mean that you should check whether your results match the results from https://arxiv.org/abs/2412.00532v1

@HSILA (Contributor, Author) commented on Jan 5, 2025

> You should provide the results of your run to this repo (a directory with JSON files). By "paper" I mean that you should check whether your results match the results from https://arxiv.org/abs/2412.00532v1

The detailed score for each task-model pair is not reported in our paper (I have it locally though); instead, an average per category is provided. Additionally, the current PR does not include the exact same tasks: three tasks have been removed, one new multilingual task has been added, and eight tasks have been merged into two. Therefore, the average scores may not align perfectly. However, I can run the benchmark and push the scores to the results repository.

- Text should be truncated for amazon text embedding models.
- `text-embedding-ada-002` returns null embeddings for some inputs with 8192 tokens.
- Two datasets are updated, dropping very long samples (len > 99th percentile)
@HSILA (Contributor, Author) commented on Jan 8, 2025

I updated bedrock_models.py to truncate sentences that exceed the models' context limits. For Cohere models on Bedrock, the limit is exactly 2048 characters, while for Amazon models it is 8192 tokens. However, there is no straightforward way to count the tokens in a piece of text (as tiktoken does for OpenAI models), unless you send a long text to the API and it returns an error with the exact token count :)

Error Example:

ValidationException: An error occurred (ValidationException) when calling the InvokeModel operation: 
400 Bad Request: Too many input tokens. Max input tokens: 8192, request input token count: 11530

To handle long text, there are two approaches to truncation:

Static Approach
Use a smaller, safe character-to-token ratio. According to Amazon's documentation, "the characters-to-token ratio in English is 4.7 characters per token, on average," so a conservative ratio such as 4.5 can be used.
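
A minimal sketch of this static approach, assuming a conservative 4.5 characters-per-token estimate and a max_tokens value such as the 8192-token Amazon limit (the helper name is hypothetical, not the actual bedrock_models.py code):

# Conservative characters-per-token estimate for English text.
CHARS_PER_TOKEN = 4.5

def truncate_by_ratio(sentence: str, max_tokens: int) -> str:
    # Keep at most roughly max_tokens worth of characters; anything
    # beyond that is assumed to exceed the model's input limit.
    return sentence[: int(max_tokens * CHARS_PER_TOKEN)]

# e.g. truncate_by_ratio(sentence, 8192) before calling the embedding API.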

Dynamic Approach
Wrap the call in a try/except block, let the API raise an error for long text, and then truncate based on the exact token count it reports (maybe I'm overthinking this, though):

try:
    all_embeddings.append(self._embed_amazon(sentence))
except ValidationException as e:  # the Bedrock error shown above
    # Extract the exact token count from the error message.
    match = re.search(r"request input token count:\s*(\d+)", str(e))
    if match:
        num_tokens = int(match.group(1))
        # Retry with the sentence shortened proportionally, keeping a 10% margin.
        ratio = 0.9 * (self._max_tokens / num_tokens)
        max_sequence_length = int(len(sentence) * ratio)
        all_embeddings.append(self._embed_amazon(sentence[:max_sequence_length]))
    else:
        raise

I used the first method, but if you think the second one is better, let me know.

@HSILA (Contributor, Author) commented on Jan 8, 2025

While running the benchmark again for all tasks, I ran into an error while evaluating a classification task. Digging deeper, I noticed something odd. OpenAI text embedding models generally have a context length of 8192 tokens (if you go over that, the API throws an error), but it looks like the text-embedding-ada-002 model actually has a context length of 8191. If you push it to 8192 tokens, the API doesn't throw an error (probably to align with the text-embedding-3-* models), but the model sometimes returns null values instead. I think we should update openai_models.py to fix this. Here's a script to reproduce the issue; change 8191 to 8192 and compare the results.

from openai import OpenAI
import tiktoken
import numpy as np

client = OpenAI()
encoding = tiktoken.get_encoding("cl100k_base")

# Far more than the 8192-token limit.
sn = 'Hello World' * 5000

print(f'Num tokens: {len(encoding.encode(sn))}')

# Truncate to 8191 tokens; change this to 8192 to reproduce the null embeddings.
truncated_sentence = encoding.encode(sn)[:8191]
truncated_sentence = encoding.decode(truncated_sentence)

response = client.embeddings.create(
    input=truncated_sentence,
    model="text-embedding-ada-002",
    encoding_format="float",
)

em = np.array(response.data[0].embedding)
print(f'Null values: {np.isnan(em.astype(np.float32)).sum()}')

@Samoed (Collaborator) commented on Jan 9, 2025

Yes, we have some inconsistency with max_tokens in OpenAI models. In ModelMeta, it states max_tokens=8191, but in the class, it is set to 8192. I think you could create a separate PR to fix this.

For the Bedrock token count, I suggest adding the static token-count estimate along with the dynamic approach to ensure the models work correctly.
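
A sketch of what that combination could look like, reusing the hypothetical truncate_by_ratio helper and the _embed_amazon/_max_tokens attributes from the snippets above (ValidationException and re are assumed to be imported; this is not the actual bedrock_models.py implementation):

def _embed_with_fallback(self, sentence):
    # Static pass: pre-truncate with the conservative character ratio.
    sentence = truncate_by_ratio(sentence, self._max_tokens)
    try:
        return self._embed_amazon(sentence)
    except ValidationException as e:
        # Dynamic pass: the API reports the exact token count, so retry
        # once with a proportionally shorter input (10% safety margin).
        match = re.search(r"request input token count:\s*(\d+)", str(e))
        if not match:
            raise
        ratio = 0.9 * (self._max_tokens / int(match.group(1)))
        return self._embed_amazon(sentence[: int(len(sentence) * ratio)])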

Additionally, could you create a table here comparing the results from the paper with your results?

@Samoed mentioned this pull request on Jan 12, 2025
@Samoed changed the title from "Integrating ChemTEB" to "feat: Integrating ChemTEB" on Jan 12, 2025
@Samoed (Collaborator) left a comment

Great!

Review comment (outdated, resolved) on mteb/tasks/PairClassification/eng/PubChemSMILESPC.py
@Samoed (Collaborator) commented on Jan 12, 2025

Hi, hope you are doing fine.

As I mentioned earlier here, a direct comparison with the paper is not possible because the task-model scores are not presented in the paper. The paper reports an average score per category, but the combination of tasks has changed in my PR, so we cannot directly compare per-category averages.

However, one approach we can take is to compare the tasks that are present in both my PR and the ChemTEB results. I have shared my local JSON results, which lets us compare the shared tasks and evaluate how performance has changed on average.

To ensure that the JSON files I shared in chemteb-results correspond to the same results used to produce Table 2, you can refer to table2.ipynb and reproduce it for verification.

The mteb.ipynb notebook compares the main score for shared tasks (using an average score for tasks that were merged as subsets of a bigger task in MTEB) and reports the difference.
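
For illustration, a minimal sketch of that comparison, assuming two score tables with hypothetical model/task/main_score columns (the actual mteb.ipynb notebook may be organized differently):

import pandas as pd

# Hypothetical inputs: one row per (model, task) pair with its main score,
# one file from the paper's local results and one from this PR's run.
paper = pd.read_csv("paper_scores.csv")  # columns: model, task, main_score
pr = pd.read_csv("pr_scores.csv")        # columns: model, task, main_score

# Keep only the tasks present in both runs and compare the main scores.
merged = paper.merge(pr, on=["model", "task"], suffixes=("_paper", "_pr"))
merged["diff"] = (merged["main_score_pr"] - merged["main_score_paper"]).abs()

print("Average difference:", merged["diff"].mean())
print(merged.groupby("task")["diff"].mean().sort_values(ascending=False))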

The observed difference is 0.0045 overall (average across all the models and tasks), and it ranges from approximately 0 to 0.02 for most tasks. Notably, the PubChemWikiParagraphsPC task shows a significant difference because it was updated later. This update involved masking exact chemical compound names in each text pair to make the problem more challenging.

Other observed changes, particularly in the Classification and Clustering tasks, can be attributed to updates that supplemented the label column (which was not sufficiently descriptive) with a label_text column. Additionally, these tasks may have undergone reordering, potentially affecting the train-test splits.

That said, we can always revert all revisions to match those used in the paper. However, the current revisions in the PR provide more detailed information, such as the label_text column and an updated README.

Originally posted by @HSILA in embeddings-benchmark/results#89 (comment)
