
fix: update max tokens for OpenAI #1772

Merged 1 commit into main from update_openai_num_tokens on Jan 12, 2025
Conversation

@Samoed (Collaborator) commented on Jan 12, 2025

While re-running the benchmark for all tasks, I ran into an error while evaluating a classification task. Digging deeper, I noticed something odd. OpenAI text embedding models generally have a context length of 8192 tokens (go over that and the API throws an error), but the text-embedding-ada-002 model actually has a context length of 8191. If you send it 8192 tokens, the API does not throw an error (probably to stay consistent with the text-embedding-3-* models), but the model sometimes returns null values instead. I think we should update openai_models.py to fix this. Here is a script that reproduces the issue; change 8191 to 8192 and compare the results.

```py
from openai import OpenAI
import tiktoken
import numpy as np

client = OpenAI()
encoding = tiktoken.get_encoding("cl100k_base")

# Build a string that is well over the model's context length.
sn = 'Hello World' * 5000

print(f'Num tokens: {len(encoding.encode(sn))}')

# Truncate to 8191 tokens; change this to 8192 to reproduce the null-value issue.
truncated_sentence = encoding.encode(sn)[:8191]
truncated_sentence = encoding.decode(truncated_sentence)

response = client.embeddings.create(
    input=truncated_sentence,
    model="text-embedding-ada-002",
    encoding_format="float",
)

# Count NaN entries in the returned embedding.
em = np.array(response.data[0].embedding)
print(f'Null values: {np.isnan(em.astype(np.float32)).sum()}')
```

Originally posted by @HSILA in #1708 (comment)

The loader and the model meta had different values for max_tokens, which caused this problem.
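For illustration, here is a minimal sketch of the consistency the fix aims for: the loader and the model metadata should agree on the real 8191-token limit for text-embedding-ada-002. The constant and helper names below (ADA_002_MAX_TOKENS, truncate_to_max_tokens) are hypothetical, not the actual openai_models.py API.

```py
import tiktoken

# Hypothetical constant mirroring the corrected metadata: text-embedding-ada-002
# accepts at most 8191 tokens, while the text-embedding-3-* models accept 8192.
ADA_002_MAX_TOKENS = 8191


def truncate_to_max_tokens(text: str, max_tokens: int = ADA_002_MAX_TOKENS) -> str:
    """Truncate text so a request never exceeds the model's real token limit."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)[:max_tokens]
    return encoding.decode(tokens)
```

As long as the loader and the model meta read the same value, truncation stays below the point where the API silently returns null embeddings.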

Checklist

  • Run the tests locally using make test to make sure nothing is broken.
  • Run the formatter using make lint.

@KennethEnevoldsen merged commit 0c5c3a5 into main on Jan 12, 2025
11 checks passed
@KennethEnevoldsen deleted the update_openai_num_tokens branch on January 12, 2025 at 20:46
KennethEnevoldsen added a commit that referenced this pull request Jan 15, 2025
* fix: update max tokens for OpenAI (#1772)

update max tokens

* ci: skip AfriSentiLID for now (#1785)

* skip AfriSentiLID for now

* skip relevant test case instead

---------

Co-authored-by: Isaac Chung <[email protected]>

* 1.28.7

Automatically generated by python-semantic-release

* ci: fix model loading test (#1775)

* pass base branch into the make command as an arg

* test a file that has custom wrapper

* what about overview

* just dont check overview

* revert instance check

* explicitly omit overview and init

* remove test change

* try on a lot of models

* revert test model file

---------

Co-authored-by: Isaac Chung <[email protected]>

* feat: Update task filtering, fixing bug which included cross-lingual tasks in overly many benchmarks (#1787)

* feat: Update task filtering, fixing bug on MTEB

- Updated task filtering, adding exclusive_language_filter and hf_subset
- Fixed a bug in MTEB where cross-lingual splits were included
- Added missing language filtering to MTEB(europe, beta) and MTEB(indic, beta)

The following code outlines the problems:

```py
import mteb
from mteb.benchmarks import MTEB_ENG_CLASSIC

task = [t for t in MTEB_ENG_CLASSIC.tasks if t.metadata.name == "STS22"][0]
# was eq. to:
task = mteb.get_task("STS22", languages=["eng"])
task.hf_subsets
# correct filtering to English datasets:
# ['en', 'de-en', 'es-en', 'pl-en', 'zh-en']
# However it should be:
# ['en']

# with the changes it is:
task = [t for t in MTEB_ENG_CLASSIC.tasks if t.metadata.name == "STS22"][0]
task.hf_subsets
# ['en']
# eq. to
task = mteb.get_task("STS22", hf_subsets=["en"])
# which you can also obtain using exclusive_language_filter (though not if there were multiple English splits):
task = mteb.get_task("STS22", languages=["eng"], exclusive_language_filter=True)
```

* format

* remove "en-ext" from AmazonCounterfactualClassification

* fixed mteb(deu)

* fix: simplify in a few areas

* fix: Add gritlm

* 1.29.0

Automatically generated by python-semantic-release

* fix: Added more annotations!

* fix: Added C-MTEB (#1786)

Added C-MTEB

* 1.29.1

Automatically generated by python-semantic-release

* docs: Add contact to MMTEB benchmarks (#1796)

* Add myself to MMTEB benchmarks
* lint

* fix: loading pre 11 (#1798)

* fix loading pre 11

* add similarity

* lint

* run all task types

* 1.29.2

Automatically generated by python-semantic-release

* fix: allow to load no revision available (#1801)

* fix allow to load no revision available

* lint

* add require_model_meta to leaderboard

* lint

* 1.29.3

Automatically generated by python-semantic-release

---------

Co-authored-by: Roman Solomatin <[email protected]>
Co-authored-by: Isaac Chung <[email protected]>
Co-authored-by: Isaac Chung <[email protected]>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: Márton Kardos <[email protected]>