Check out the paper:
⭐ How Many Validation Labels Do You Need? Exploring the Design Space of Label-Efficient Model Ranking
MoraBench (Model Ranking Benchmark) is a benchmark platform comprising a collection of model outputs generated under diverse scenarios. It also provides a common, easy-to-use framework for developing and evaluating your own model ranking methods within the benchmark.
Model ranking orders a set of trained models according to their performance on the target task. Traditionally, a fully labeled validation set is used to rank the models; here we explore how to rank models under a limited annotation budget.
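To make the setting concrete, here is a minimal, self-contained sketch (with made-up arrays and names, not MoraBench's API) of ranking models by accuracy estimated from a small labeled subset instead of the full validation set:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 models, 2000 validation examples, 4 classes.
# preds[m, i] is model m's predicted label for example i.
num_models, num_examples, num_classes = 5, 2000, 4
preds = rng.integers(0, num_classes, size=(num_models, num_examples))
true_labels = rng.integers(0, num_classes, size=num_examples)

# Full-supervision baseline: rank models by accuracy on all labels.
full_acc = (preds == true_labels).mean(axis=1)
full_ranking = np.argsort(-full_acc)

# Label-efficient variant: only a small annotation budget is available,
# so rank models using the labels of a random subset of examples.
budget = 100
labeled_idx = rng.choice(num_examples, size=budget, replace=False)
budget_acc = (preds[:, labeled_idx] == true_labels[labeled_idx]).mean(axis=1)
budget_ranking = np.argsort(-budget_acc)

print("ranking with full labels :", full_ranking)
print("ranking with", budget, "labels:", budget_ranking)
```

A good label-efficient method should produce a budget-based ranking that closely agrees with the full-label ranking while using far fewer annotations.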
[1] Install Anaconda: instructions here: https://www.anaconda.com/download/
[2] Clone the repository:
git clone https://github.com/ppsmk388/MoraBench.git
cd MoraBench
[3] Create virtual environment:
conda env create -f environment.yml
source activate MoraBench
pip install -r requirements.txt
MoraBench assembles outputs from models operating under different learning paradigms:
We generated model outputs in a weak supervision setting using the WRENCH framework, covering 48 distinct weak supervision configurations on five datasets: SMS, AGNews, Yelp, IMDB, and Trec.
Leveraging the USB benchmark, we obtained model outputs from 12 semi-supervised methods across five datasets: IMDB, Amazon Review, Yelp Review, AGNews, and Yahoo! Answer.
We employed large language models with various prompts to generate diverse outputs, assessed using the T0 benchmark.
The table below shows the initial model sets included in MoraBench, along with the combined size of the validation and test sets (# Data). For the semi-supervised learning datasets, the number in parentheses after the dataset name indicates the number of labels used in the semi-supervised training stage.
| Training Setting | Task Type | Dataset | Model Number | # Data |
| --- | --- | --- | --- | --- |
| Weak Supervision | Sentiment Classification | Yelp | 480 | 3800 |
| | Sentiment Classification | IMDB | 480 | 2500 |
| | Spam Classification | SMS | 480 | 500 |
| | Topic Classification | AGNews | 480 | 12000 |
| | Question Classification | Trec | 480 | 500 |
| Semi-supervised Learning | Sentiment Classification | IMDB (20) | 400 | 2000 |
| | Sentiment Classification | IMDB (100) | 400 | 2000 |
| | Sentiment Classification | Yelp Review (250) | 400 | 25000 |
| | Sentiment Classification | Yelp Review (1000) | 400 | 25000 |
| | Sentiment Classification | Amazon Review (250) | 400 | 25000 |
| | Sentiment Classification | Amazon Review (1000) | 400 | 25000 |
| | Topic Classification | Yahoo! Answer (500) | 400 | 50000 |
| | Topic Classification | Yahoo! Answer (2000) | 400 | 50000 |
| | Topic Classification | AGNews (40) | 400 | 10000 |
| | Topic Classification | AGNews (200) | 400 | 10000 |
| Prompt Selection | Coreference Resolution | WSC | 10 | 104 |
| | Word Sense Disambiguation | WiC | 10 | 638 |
| | Sentence Completion | Story | 6 | 3742 |
| | Natural Language Inference | CB | 15 | 56 |
| | Natural Language Inference | RTE | 10 | 277 |
| | Natural Language Inference | ANLI1 | 15 | 1000 |
| | Natural Language Inference | ANLI2 | 15 | 1000 |
| | Natural Language Inference | ANLI3 | 15 | 1200 |
Details of these datasets can be found in our paper, and all of these model sets can be downloaded via this link. We plan to add more model sets soon.
All example code can be found here. For example, for the LEMR framework, we can reproduce its results in the prompt selection setting with the following steps:
We can directly run `./examples/LEMR/run.sh`:
bash ./examples/LEMR/run.sh num_split
where `num_split` is the number of splits to generate; if omitted, the default is 50.
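For example, to generate 20 splits instead of the default:

```bash
bash ./examples/LEMR/run.sh 20
```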
Here are the details of `run.sh`:
#!/bin/bash
# Set the default number of splits to 50
total_split_number=50

# If a command-line argument is given, use it instead
if [ ! -z "$1" ]; then
    total_split_number=$1
fi

# Ensemble_method: ensemble method, hard or soft (see the sketch below)
# dataset_name: dataset name
# total_split_number: total number of splits we use
# model_committee_type: model committee selection type, z_score or all_model
for Ensemble_method in hard soft
do
    for dataset_name in story wsc cb rte wic anli1 anli2 anli3
    do
        for model_committee_type in z_score all_model
        do
            python run_lemr.py \
                --Ensemble_method $Ensemble_method \
                --dataset_name $dataset_name \
                --total_split_number $total_split_number \
                --model_committee_type $model_committee_type
        done
    done
done
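For intuition, `hard` and `soft` refer to the two standard ways of combining the committee's predictions into ensemble labels. The sketch below uses made-up probability arrays purely for illustration; it is not the code inside `run_lemr.py`:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical committee: 4 models, 6 examples, 3 classes.
# probs[m, i, c] is model m's predicted probability of class c for example i.
probs = rng.dirichlet(np.ones(3), size=(4, 6))

# Soft ensemble: average the class probabilities, then take the argmax.
soft_labels = probs.mean(axis=0).argmax(axis=1)

# Hard ensemble: each model votes with its argmax label; take the majority vote.
votes = probs.argmax(axis=2)  # shape (num_models, num_examples)
hard_labels = np.array([np.bincount(votes[:, i], minlength=3).argmax()
                        for i in range(votes.shape[1])])

print("soft ensemble labels:", soft_labels)
print("hard ensemble labels:", hard_labels)
```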
Then show the results:
python ./examples/LEMR/show_lemr.py --metric rc  # rc for rank correlation and og for optimal gap
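Roughly speaking, rank correlation measures how well the estimated ranking agrees with the ranking induced by full labels, and the optimal gap is the performance difference between the truly best model and the model the method ranks first. Here is a minimal sketch with made-up numbers (not the implementation in `show_lemr.py`):

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical true test accuracies of 6 models and the scores a
# label-efficient ranking method assigned to them.
true_acc = np.array([0.81, 0.78, 0.90, 0.85, 0.70, 0.88])
estimated_score = np.array([0.80, 0.79, 0.86, 0.87, 0.72, 0.84])

# Rank correlation: Spearman correlation between the two orderings.
rc, _ = spearmanr(true_acc, estimated_score)

# Optimal gap: accuracy of the truly best model minus the accuracy of
# the model ranked first by the method.
og = true_acc.max() - true_acc[estimated_score.argmax()]

print(f"rank correlation: {rc:.3f}")
print(f"optimal gap:      {og:.3f}")
```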
Contact person: Zhengyu Hu, [email protected]
Don't hesitate to send us an e-mail if you have any questions.
We're also open to any collaboration!
We sincerely welcome any contributions to the methods or model sets!
@article{hu2023many,
title={How Many Validation Labels Do You Need? Exploring the Design Space of Label-Efficient Model Ranking},
author={Hu, Zhengyu and Zhang, Jieyu and Yu, Yue and Zhuang, Yuchen and Xiong, Hui},
journal={arXiv preprint arXiv:2312.01619},
year={2023}
}