Leaderboard: SFR-Embedding results don't match between old and new #1754

Open
Muennighoff opened this issue Jan 10, 2025 · 4 comments

@Muennighoff (Contributor)

See #1571 (comment) & #1753; it is probably best to check task by task which scores match and which don't.
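
A minimal sketch of such a task-by-task check, assuming two hypothetical CSV exports (`old_leaderboard.csv`, `new_leaderboard.csv`, each with `task` and `score` columns) rather than any existing leaderboard export format:

```python
import pandas as pd

# Hypothetical dumps of the two leaderboards; file and column names are assumptions.
old = pd.read_csv("old_leaderboard.csv")  # columns: task, score
new = pd.read_csv("new_leaderboard.csv")  # columns: task, score

merged = old.merge(new, on="task", how="outer", suffixes=("_old", "_new"))
# eq() treats NaN != NaN, so tasks missing on either side are also flagged.
merged["mismatch"] = ~merged["score_old"].eq(merged["score_new"])
print(merged[merged["mismatch"]].to_string(index=False))
```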

@Samoed (Collaborator) commented Jan 11, 2025

Model: Salesforce/SFR-Embedding-2_R

| Task Name | Old Leaderboard | New Leaderboard |
| --- | --- | --- |
| AmazonCounterfactualClassification | 92.72 | 92.72 |
| AmazonPolarityClassification | 97.31 | 97.31 |
| AmazonReviewsClassification | 61.04 | 61.04 |
| ArguAna | 62.34 | 62.34 |
| ArxivClusteringP2P | 54.02 | 54.02 |
| ArxivClusteringS2S | 48.82 | 48.82 |
| AskUbuntuDupQuestions | 66.71 | 66.71 |
| BIOSSES | 87.6 | 87.6 |
| Banking77Classification | 90.02 | 90.02 |
| BiorxivClusteringP2P | 50.76 | 50.76 |
| BiorxivClusteringS2S | 46.57 | 46.57 |
| CQADupstackAndroidRetrieval | N/A | 54.78 |
| CQADupstackEnglishRetrieval | N/A | 51.9 |
| CQADupstackGamingRetrieval | N/A | 60.42 |
| CQADupstackGisRetrieval | N/A | 41.13 |
| CQADupstackMathematicaRetrieval | N/A | 33.6 |
| CQADupstackPhysicsRetrieval | N/A | 48.85 |
| CQADupstackProgrammersRetrieval | N/A | 46.37 |
| CQADupstackStatsRetrieval | N/A | 38.53 |
| CQADupstackTexRetrieval | N/A | 32.98 |
| CQADupstackUnixRetrieval | N/A | 48.31 |
| CQADupstackWebmastersRetrieval | N/A | 44.44 |
| CQADupstackWordpressRetrieval | N/A | 35.12 |
| ClimateFEVER | 34.43 | 34.43 |
| DBPedia | 51.21 | 51.21 |
| EmotionClassification | 93.37 | 93.37 |
| FEVER | 92.16 | 92.16 |
| FiQA2018 | 61.77 | 61.77 |
| HotpotQA | 81.36 | 81.36 |
| ImdbClassification | 96.8 | 96.8 |
| MTOPDomainClassification | 98.58 | 98.58 |
| MTOPIntentClassification | 91.3 | 91.3 |
| MassiveIntentClassification | 85.97 | 85.97 |
| MassiveScenarioClassification | 90.61 | 90.61 |
| MedrxivClusteringP2P | 46.66 | 46.66 |
| MedrxivClusteringS2S | 44.18 | 44.18 |
| MindSmallReranking | 31.26 | 31.26 |
| NFCorpus | 41.34 | 41.34 |
| NQ | 73.96 | 73.96 |
| QuoraRetrieval | 89.58 | 89.58 |
| RedditClustering | 72.74 | 62.92 |
| RedditClusteringP2P | 72.74 | 72.74 |
| SCIDOCS | 24.87 | 24.87 |
| SICK-R | 77.01 | 77.01 |
| STS12 | 75.67 | 75.67 |
| STS13 | 82.4 | 82.4 |
| STS14 | 79.93 | 79.93 |
| STS15 | 85.82 | 85.82 |
| STS16 | 84.5 | 84.5 |
| STS17 | 88.93 | 88.93 |
| STS22 | 67.1 | 67.1 |
| STSBenchmark | 83.6 | 83.6 |
| SciDocsRR | 87.29 | 87.29 |
| SciFact | 85.91 | 85.91 |
| SprintDuplicateQuestions | 97.62 | 97.66 |
| StackExchangeClustering | 48.29 | 76.48 |
| StackExchangeClusteringP2P | 48.29 | 48.29 |
| StackOverflowDupQuestions | 55.32 | 55.32 |
| SummEval | 30.71 | 30.71 |
| TRECCOVID | 87.27 | 87.27 |
| Touche2020 | 28.18 | 28.18 |
| ToxicConversationsClassification | 91.14 | 91.14 |
| TweetSentimentExtractionClassification | 79.7 | 79.7 |
| TwentyNewsgroupsClustering | 66.42 | 66.42 |
| TwitterSemEval2015 | 78.57 | 78.57 |
| TwitterURLCorpus | 88.03 | 88.03 |
| MSMARCO | 42.18 | 42.18 |

@KennethEnevoldsen (Contributor)

It seems to me that the scores match, but the aggregation is different (the old benchmark aggregates the "CQADupstack*Retrieval" tasks).

@x-tabdeveloping we could manually aggregate these for MTEB (that would be a hotfix). A proper solution is #1231.
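
A minimal sketch of what such a manual aggregation could look like, assuming the old leaderboard's single CQADupstackRetrieval score is an unweighted mean of the subtask scores (values copied from the table above):

```python
# Per-subtask scores for Salesforce/SFR-Embedding-2_R on the new leaderboard
# (taken from the table above).
cqadupstack_scores = {
    "CQADupstackAndroidRetrieval": 54.78,
    "CQADupstackEnglishRetrieval": 51.90,
    "CQADupstackGamingRetrieval": 60.42,
    "CQADupstackGisRetrieval": 41.13,
    "CQADupstackMathematicaRetrieval": 33.60,
    "CQADupstackPhysicsRetrieval": 48.85,
    "CQADupstackProgrammersRetrieval": 46.37,
    "CQADupstackStatsRetrieval": 38.53,
    "CQADupstackTexRetrieval": 32.98,
    "CQADupstackUnixRetrieval": 48.31,
    "CQADupstackWebmastersRetrieval": 44.44,
    "CQADupstackWordpressRetrieval": 35.12,
}

# Collapse the 12 subtasks into one aggregated CQADupstackRetrieval entry
# (assumed here to be a simple unweighted mean).
aggregated = sum(cqadupstack_scores.values()) / len(cqadupstack_scores)
print(f"CQADupstackRetrieval (mean of {len(cqadupstack_scores)} subtasks): {aggregated:.2f}")
```

With the numbers above this works out to roughly 44.70.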

@x-tabdeveloping (Collaborator)

I think this is also related to #1757. As far as @KennethEnevoldsen and I understand, MTEB(eng, classic) is incorrectly defined and uses some splits which do contain English but are not part of the original benchmark.
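
To double-check which splits and languages each task in MTEB(eng, classic) actually evaluates, something along these lines may help. This is only a sketch, assuming a recent mteb version where `mteb.get_benchmark()` and the `eval_splits` / `eval_langs` metadata fields are available:

```python
import mteb

# Inspect the benchmark definition itself rather than any result files.
benchmark = mteb.get_benchmark("MTEB(eng, classic)")
for task in benchmark.tasks:
    meta = task.metadata
    print(f"{meta.name}: splits={meta.eval_splits}, langs={meta.eval_langs}")
```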

@KennethEnevoldsen (Contributor)

I have a solution outlined in #1771 which actually solves #1231. It is not polished yet, though.
