
fix: update max tokens for OpenAI #1772

Merged 1 commit into main from update_openai_num_tokens on Jan 12, 2025
Conversation

@Samoed (Collaborator) commented on Jan 12, 2025

While re-running the benchmark for all tasks, I ran into an error while evaluating a classification task. Digging deeper, I noticed something odd. OpenAI text embedding models generally have a context length of 8192 tokens (go over that and the API throws an error), but the text-embedding-ada-002 model actually has a context length of 8191. If you send it 8192 tokens, the API does not throw an error (probably to stay consistent with the text-embedding-3-* models), but the model sometimes returns null values instead. I think we should update openai_models.py to fix this. Here is a script that reproduces the issue; change 8191 to 8192 and compare the results.

```py
from openai import OpenAI
import tiktoken
import numpy as np

client = OpenAI()
encoding = tiktoken.get_encoding("cl100k_base")

# Build a string that is well over the model's context length.
sn = 'Hello World' * 5000

print(f'Num tokens: {len(encoding.encode(sn))}')

# Truncate to 8191 tokens; change this to 8192 to reproduce the null-value issue.
truncated_sentence = encoding.encode(sn)[:8191]
truncated_sentence = encoding.decode(truncated_sentence)

response = client.embeddings.create(
    input=truncated_sentence,
    model="text-embedding-ada-002",
    encoding_format="float",
)

# Count NaN entries in the returned embedding.
em = np.array(response.data[0].embedding)
print(f'Null values: {np.isnan(em.astype(np.float32)).sum()}')
```

Originally posted by @HSILA in #1708 (comment)

The loader and the model meta had different values for max_tokens, which caused this problem.
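For illustration, here is a minimal sketch of the consistency the fix aims for: the loader and the model metadata should agree on the real 8191-token limit for text-embedding-ada-002. The constant and helper names below (ADA_002_MAX_TOKENS, truncate_to_max_tokens) are hypothetical, not the actual openai_models.py API.

```py
import tiktoken

# Hypothetical constant mirroring the corrected metadata: text-embedding-ada-002
# accepts at most 8191 tokens, while the text-embedding-3-* models accept 8192.
ADA_002_MAX_TOKENS = 8191


def truncate_to_max_tokens(text: str, max_tokens: int = ADA_002_MAX_TOKENS) -> str:
    """Truncate text so a request never exceeds the model's real token limit."""
    encoding = tiktoken.get_encoding("cl100k_base")
    tokens = encoding.encode(text)[:max_tokens]
    return encoding.decode(tokens)
```

As long as the loader and the model meta read the same value, truncation stays below the point where the API silently returns null embeddings.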

Checklist

  • Run the tests locally using make test to make sure nothing is broken.
  • Run the formatter using make lint.

@KennethEnevoldsen merged commit 0c5c3a5 into main on Jan 12, 2025
11 checks passed
@KennethEnevoldsen deleted the update_openai_num_tokens branch on January 12, 2025 at 20:46
KennethEnevoldsen added a commit that referenced this pull request Jan 15, 2025
* fix: update max tokens for OpenAI (#1772)

update max tokens

* ci: skip AfriSentiLID for now (#1785)

* skip AfriSentiLID for now

* skip relevant test case instead

---------

Co-authored-by: Isaac Chung <[email protected]>

* 1.28.7

Automatically generated by python-semantic-release

* ci: fix model loading test (#1775)

* pass base branch into the make command as an arg

* test a file that has custom wrapper

* what about overview

* just dont check overview

* revert instance check

* explicitly omit overview and init

* remove test change

* try on a lot of models

* revert test model file

---------

Co-authored-by: Isaac Chung <[email protected]>

* feat: Update task filtering, fixing bug which included cross-lingual tasks in overly many benchmarks (#1787)

* feat: Update task filtering, fixing bug on MTEB

- Updated task filtering, adding exclusive_language_filter and hf_subset
- Fixed a bug in MTEB where cross-lingual splits were included
- Added missing language filtering to MTEB(europe, beta) and MTEB(indic, beta)

The following code outlines the problems:

```py
import mteb
from mteb.benchmarks import MTEB_ENG_CLASSIC

task = [t for t in MTEB_ENG_CLASSIC.tasks if t.metadata.name == "STS22"][0]
# was eq. to:
task = mteb.get_task("STS22", languages=["eng"])
task.hf_subsets
# correct filtering to English datasets:
# ['en', 'de-en', 'es-en', 'pl-en', 'zh-en']
# However it should be:
# ['en']

# with the changes it is:
task = [t for t in MTEB_ENG_CLASSIC.tasks if t.metadata.name == "STS22"][0]
task.hf_subsets
# ['en']
# eq. to
task = mteb.get_task("STS22", hf_subsets=["en"])
# which you can also obtain using exclusive_language_filter (though not if there were multiple English splits):
task = mteb.get_task("STS22", languages=["eng"], exclusive_language_filter=True)
```

* format

* remove "en-ext" from AmazonCounterfactualClassification

* fixed mteb(deu)

* fix: simplify in a few areas

* fix: Add gritlm

* 1.29.0

Automatically generated by python-semantic-release

* fix: Added more annotations!

* fix: Added C-MTEB (#1786)

Added C-MTEB

* 1.29.1

Automatically generated by python-semantic-release

* docs: Add contact to MMTEB benchmarks (#1796)

* Add myself to MMTEB benchmarks
* lint

* fix: loading pre 11 (#1798)

* fix loading pre 11

* add similarity

* lint

* run all task types

* 1.29.2

Automatically generated by python-semantic-release

* fix: allow to load no revision available (#1801)

* fix allow to load no revision available

* lint

* add require_model_meta to leaderboard

* lint

* 1.29.3

Automatically generated by python-semantic-release

---------

Co-authored-by: Roman Solomatin <[email protected]>
Co-authored-by: Isaac Chung <[email protected]>
Co-authored-by: Isaac Chung <[email protected]>
Co-authored-by: github-actions <[email protected]>
Co-authored-by: Márton Kardos <[email protected]>