
MoraBench (Model Ranking Benchmark)

Check out the paper:

How Many Validation Labels Do You Need? Exploring the Design Space of Label-Efficient Model Ranking

🤔 What is it?

MoraBench (Model Ranking Benchmark) is a benchmark platform that comprises a collection of model outputs generated under diverse scenarios. It also provides a common, easy-to-use framework for developing and evaluating your own model ranking methods within the benchmark.

🏁 What is model ranking?

Model ranking orders a set of trained models according to their performance on the target task. Traditionally, a fully labeled validation set is used to rank the models; here we explore how to rank models under a limited annotation budget.
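
As a toy illustration of the problem (a minimal sketch, not MoraBench's own API; the models, predictions, and labeling budget below are hypothetical), the snippet ranks a set of models by accuracy on a small labeled subset of the validation data and checks how well that cheap ranking agrees with the ranking from the full validation set:

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Hypothetical setup: 5 models and a validation set of 1000 binary-labeled examples.
num_models, num_val = 5, 1000
labels = rng.integers(0, 2, size=num_val)
# Hypothetical model predictions (in MoraBench these would come from a downloaded model set).
preds = rng.integers(0, 2, size=(num_models, num_val))

# Full-validation ranking: accuracy of each model on all labels.
full_acc = (preds == labels).mean(axis=1)

# Budget-limited ranking: accuracy on a small randomly labeled subset (e.g., 50 labels).
budget = 50
idx = rng.choice(num_val, size=budget, replace=False)
budget_acc = (preds[:, idx] == labels[idx]).mean(axis=1)

# How well does the cheap ranking agree with the full one?
rho, _ = spearmanr(full_acc, budget_acc)
print("Spearman rank correlation between rankings:", rho)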

🔧 Installation

Using conda

[1] Install Anaconda following the instructions here: https://www.anaconda.com/download/

[2] Clone the repository:

git clone https://github.com/ppsmk388/MoraBench.git
cd MoraBench

[3] Create virtual environment:

conda env create -f environment.yml
source activate MoraBench

Using pip

pip install -r requirements.txt

📊 Available Model-sets

MoraBench assembles outputs from models operating under different learning paradigms:

Weak Supervision:

We generated model outputs in a weak supervision setting using the WRENCH framework, covering 48 distinct weak supervision configurations on five datasets: SMS, AGNews, Yelp, IMDB, and Trec.

Semi-supervised Learning:

Leveraging the USB benchmark, we obtained model outputs from 12 semi-supervised methods across five datasets: IMDB, Amazon Review, Yelp Review, AGNews, and Yahoo! Answer.

Prompt Selection:

We employed large language models with various prompts to generate diverse outputs, evaluated on the T0 benchmark.

The table below lists the initial model sets included in MoraBench and the combined size of the validation and test sets (# Data). For the semi-supervised learning setting, the number in parentheses after the dataset name indicates the number of labels used during semi-supervised training.

| Training Setting | Task Type | Dataset | Model Number | # Data |
| --- | --- | --- | --- | --- |
| Weak Supervision | Sentiment Classification | Yelp | 480 | 3800 |
| Weak Supervision | Sentiment Classification | IMDB | 480 | 2500 |
| Weak Supervision | Spam Classification | SMS | 480 | 500 |
| Weak Supervision | Spam Classification | IMDB | 480 | 2500 |
| Weak Supervision | Topic Classification | AGNews | 480 | 12000 |
| Weak Supervision | Question Classification | Trec | 480 | 500 |
| Semi-supervised Learning | Sentiment Classification | IMDB (20) | 400 | 2000 |
| Semi-supervised Learning | Sentiment Classification | IMDB (100) | 400 | 2000 |
| Semi-supervised Learning | Sentiment Classification | Yelp Review (250) | 400 | 25000 |
| Semi-supervised Learning | Sentiment Classification | Yelp Review (1000) | 400 | 25000 |
| Semi-supervised Learning | Sentiment Classification | Amazon Review (250) | 400 | 25000 |
| Semi-supervised Learning | Sentiment Classification | Amazon Review (1000) | 400 | 25000 |
| Semi-supervised Learning | Topic Classification | Yahoo! Answer (500) | 400 | 50000 |
| Semi-supervised Learning | Topic Classification | Yahoo! Answer (2000) | 400 | 50000 |
| Semi-supervised Learning | Topic Classification | AGNews (40) | 400 | 10000 |
| Semi-supervised Learning | Topic Classification | AGNews (200) | 400 | 10000 |
| Prompt Selection | Coreference Resolution | WSC | 10 | 104 |
| Prompt Selection | Word Sense Disambiguation | WiC | 10 | 638 |
| Prompt Selection | Sentence Completion | Story | 6 | 3742 |
| Prompt Selection | Natural Language Inference | CB | 15 | 56 |
| Prompt Selection | Natural Language Inference | RTE | 10 | 277 |
| Prompt Selection | Natural Language Inference | ANLI1 | 15 | 1000 |
| Prompt Selection | Natural Language Inference | ANLI2 | 15 | 1000 |
| Prompt Selection | Natural Language Inference | ANLI3 | 15 | 1200 |

Details of these datasets can be found in our paper, and all of the model sets can be downloaded via this link. We plan to add more model sets soon.

📙 Quick examples

All example code can be found in the examples directory. For example, for the LEMR framework, we can reproduce its results in the prompt selection setting with the following steps:

1. Generate plot data:

We can directly run ./examples/LEMR/run.sh:

bash ./examples/LEMR/run.sh num_split

where num_split is the number of splits to generate; if not provided, it defaults to 50.

Here are the details of run.sh:

#!/bin/bash

# The number of splits defaults to 50; the first command-line argument overrides it.
total_split_number=50
if [ ! -z "$1" ]; then
    total_split_number=$1
fi

# Sweep over ensemble methods (hard or soft), datasets, and
# model committee selection types (z_score or all_model).
for Ensemble_method in hard soft
do
    for dataset_name in story wsc cb rte wic anli1 anli2 anli3
    do
        for model_committee_type in z_score all_model
        do
            python run_lemr.py \
                --Ensemble_method $Ensemble_method \
                --dataset_name $dataset_name \
                --total_split_number $total_split_number \
                --model_committee_type $model_committee_type
        done
    done
done

2. Results visualization

python ./examples/LEMR/show_lemr.py --metric rc # rc for ranking correlation, og for optimal gap
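
For reference, the two metrics can be read roughly as follows (a simplified sketch of our reading, not the repository's exact implementation; the score arrays are hypothetical): ranking correlation measures how well the estimated model ranking agrees with the ground-truth ranking, and the optimal gap measures how much true performance is lost by selecting the top model of the estimated ranking instead of the actual best model.

import numpy as np
from scipy.stats import spearmanr

# Hypothetical true test accuracies and budget-estimated scores for 6 models.
true_acc = np.array([0.82, 0.78, 0.91, 0.85, 0.80, 0.88])
est_score = np.array([0.80, 0.75, 0.87, 0.89, 0.79, 0.86])

# Ranking correlation: agreement between the estimated and true model orderings.
rc, _ = spearmanr(true_acc, est_score)

# Optimal gap: true performance of the best model minus that of the model
# selected by the estimated ranking.
selected = int(np.argmax(est_score))
og = true_acc.max() - true_acc[selected]

print(f"ranking correlation: {rc:.3f}, optimal gap: {og:.3f}")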

📧 Contact

Contact person: Zhengyu Hu, [email protected]

Don't hesitate to send us an e-mail if you have any questions.

We're also open to any collaboration!

✨ Contributing Dataset and Model

We sincerely welcome any contributions to the methods or model sets!

👆 Citation

@article{hu2023many,
  title={How Many Validation Labels Do You Need? Exploring the Design Space of Label-Efficient Model Ranking},
  author={Hu, Zhengyu and Zhang, Jieyu and Yu, Yue and Zhuang, Yuchen and Xiong, Hui},
  journal={arXiv preprint arXiv:2312.01619},
  year={2023}
}
