FinMTEB: Finance Massive Text Embedding Benchmark

The Finance Massive Text Embedding Benchmark (FinMTEB) is an embedding benchmark consisting of 64 finance-domain text datasets in English and Chinese, spanning seven task types. Every dataset is specific to the finance domain: each was either used in previous financial NLP research or newly developed by the authors.

Usage

  • The basic pipeline is built upon MTEB.

Install

conda create -n finmteb python=3.10
conda activate finmteb
git clone https://github.com/yixuantt/FinMTEB.git
cd FinMTEB
pip install -r requirements.txt

Task selection

FinMTEB offers 7 task types and 64 datasets; select the task types you need.

import finance_mteb

# All 7 task types; shorten the list to select a subset.
task_types = ["Clustering", "Retrieval", "PairClassification", "Reranking", "STS", "Summarization", "Classification"]
tasks = finance_mteb.get_tasks(task_types=task_types)
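
Because task types are selected by name, a misspelled name is easy to miss. The following is a minimal sketch of a pre-check that could run before calling `get_tasks`. The helper `validate_task_types` and the constant `SUPPORTED_TASK_TYPES` are hypothetical names introduced here for illustration, not part of `finance_mteb`. The seven names are copied from the call above.

```python
from typing import Iterable, List

# The seven task-type names listed in this README.
SUPPORTED_TASK_TYPES = (
    "Clustering",
    "Retrieval",
    "PairClassification",
    "Reranking",
    "STS",
    "Summarization",
    "Classification",
)


def validate_task_types(requested: Iterable[str]) -> List[str]:
    """Hypothetical helper: return the requested names de-duplicated in
    their original order, or raise ValueError if any name is unknown."""
    accepted: List[str] = []
    unknown: List[str] = []
    for name in requested:
        if name not in SUPPORTED_TASK_TYPES:
            unknown.append(name)
        elif name not in accepted:
            accepted.append(name)
    if unknown:
        raise ValueError(
            f"Unknown task types: {unknown}; "
            f"expected a subset of {list(SUPPORTED_TASK_TYPES)}"
        )
    return accepted


subset = validate_task_types(["STS", "Retrieval", "STS"])
print(subset)
# The validated list could then be passed on, for example:
# tasks = finance_mteb.get_tasks(task_types=subset)
```

The check is case-sensitive by design, since the names are passed through unchanged.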

Running a Benchmark

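from sentence_transformers import SentenceTransformer
from finance_mteb import MTEB
# Assumption: any MTEB-compatible model exposing encode() can be used; this model name is only illustrative.
model_name_or_path = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(model_name_or_path)
task = "FinSTS"
evaluation = MTEB(tasks=[task])
evaluation.run(model, output_folder=f"results/{model_name_or_path.split('/')[-1]}")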
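
The `output_folder` expression above takes the last `/`-separated segment of the model name, so a local path ending in a slash would yield an empty folder name. Here is a small sketch of a more defensive version. `results_dir_for` is a hypothetical helper introduced for illustration and is not part of FinMTEB.

```python
def results_dir_for(model_name_or_path: str, root: str = "results") -> str:
    """Hypothetical helper: build the results folder from the last path
    segment of a model name or path, ignoring trailing slashes."""
    stripped = model_name_or_path.rstrip("/")
    short_name = stripped.split("/")[-1]
    if not short_name:
        raise ValueError(f"Cannot derive a folder name from {model_name_or_path!r}")
    return f"{root}/{short_name}"


print(results_dir_for("BAAI/bge-en-icl"))     # results/bge-en-icl
print(results_dir_for("./models/my-model/"))  # results/my-model
```

The return value could be passed as `output_folder` in the `evaluation.run` call.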

Example Usage

  • There is an example Python script for your reference:
python eval_FinanceMTEB.py --model_name_or_path BAAI/bge-en-icl --pooling_method last
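
The `--pooling_method last` flag presumably selects last-token pooling, where the sentence embedding is the hidden state of the final non-padding token. This is an assumption based on the flag name alone, not on the contents of `eval_FinanceMTEB.py`. The dependency-free sketch below illustrates the idea on plain Python lists, with mean pooling shown only for contrast. Both function names are hypothetical.

```python
from typing import List, Sequence


def last_token_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Return the embedding of the last token whose mask value is 1.
    Assumes at least one token is unmasked."""
    last_index = max(i for i, m in enumerate(attention_mask) if m == 1)
    return list(token_embeddings[last_index])


def mean_pool(
    token_embeddings: Sequence[Sequence[float]],
    attention_mask: Sequence[int],
) -> List[float]:
    """Average the embeddings of all unmasked tokens, for comparison."""
    kept = [vec for vec, m in zip(token_embeddings, attention_mask) if m == 1]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Three token vectors; the third is padding (mask value 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]
mask = [1, 1, 0]
print(last_token_pool(tokens, mask))  # [3.0, 4.0]
print(mean_pool(tokens, mask))        # [2.0, 3.0]
```

In practice these operations run on batched tensors inside the model wrapper. The list-based version only shows which token each strategy selects.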

Citation

@misc{tang2024needdomainspecificembeddingmodels,
      title={Do We Need Domain-Specific Embedding Models? An Empirical Investigation}, 
      author={Yixuan Tang and Yi Yang},
      year={2024},
      eprint={2409.18511},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.18511}, 
}

Thanks to the MTEB Benchmark.
