First, clone the repo and install the module:

```
git clone https://github.com/SebChw/NLP-Legal-eval.git
cd NLP-Legal-eval
pip install -e .
```
If you have a GPU, you should download the GPU RoBERTa weights here.
You can run the app by navigating to the `app/` directory and starting the gunicorn server:

```
cd app
gunicorn wsgi:app
```
Then you can make two types of API call:

- `process`: just processes the text and returns the result
- `process_vizualize`: returns an HTML file with a nice spaCy visualization

```
curl -X POST -H "Content-Type: application/json" -d '{"text": "The next contention revolves around the role of the AICTE and the decision of the Supreme Court in Bharathidasan University."}' http://127.0.0.1:8000/process_vizualize > visualization.html
```

or just:

```
curl -X POST -H "Content-Type: application/json" -d '{"text": "The next contention revolves around the role of the AICTE and the decision of the Supreme Court in Bharathidasan University."}' http://127.0.0.1:8000/process
```
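The same call can be made from Python using only the standard library. This is a minimal sketch that mirrors the curl command above; it assumes the gunicorn server is running locally on port 8000:

```python
import json
import urllib.request

# Build a POST request equivalent to the curl call above.
payload = json.dumps(
    {"text": "The next contention revolves around the role of the AICTE."}
).encode("utf-8")
request = urllib.request.Request(
    "http://127.0.0.1:8000/process",
    data=payload,
    headers={"Content-Type": "application/json"},
    method="POST",
)

# Uncomment once the server is up:
# with urllib.request.urlopen(request) as response:
#     print(json.loads(response.read()))
```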
## Docker
You can also use the API without installing anything. Just build and run the Docker container:

```
docker build -t legal_eval_gpu .
docker run --gpus all -p 8000:8000 legal_eval_gpu
```

Just remember to place your best GPU model in `/legal_eval/spacy/gpu-RoBERTa/output/model-best`.
To download and load the data:

```python
from pathlib import Path

from legal_eval.data import download_data, get_hf_dataset

DATA_PATH = Path("../data")
download_data(DATA_PATH)
dataset = get_hf_dataset()
```
The annotations are in a format like this (I unnested them!). The text is just plain text. I checked, and no word actually has more than one label, even though the labels are stored as lists.
I wrote a function that formats this dataset:

```python
from utils import parse_to_ner

dataset_ner = parse_to_ner(dataset)
```
So eventually you work with something like this. I split the text into words, as that is necessary for the Hugging Face evaluator. I know it's weird.
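For illustration, a record in the NER-formatted dataset is assumed to look roughly like this. The tag names below are made up; the tokens/tags layout is the standard one for token classification, but inspect `dataset_ner['train'][0]` for the real keys and labels:

```python
# Hypothetical example record after parse_to_ner; the label names here
# are invented for illustration only.
example = {
    "tokens": ["the", "Supreme", "Court", "in", "Bharathidasan", "University"],
    "ner_tags": ["O", "B-COURT", "I-COURT", "O", "B-ORG", "I-ORG"],
}

# One tag per word -- this is what the HF evaluator expects.
assert len(example["tokens"]) == len(example["ner_tags"])
```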
If you prefer integers over strings as `ner_tags`, run:

```python
from utils import cast_ner_labels_to_int

casted = cast_ner_labels_to_int(dataset_ner['train'])
```
After this you can easily take the `int2str` and `str2int` functions from the `ClassLabel` feature that was created:

```python
casted.features['ner_tags'].feature.int2str(1)    # int id -> label name
casted.features['ner_tags'].feature.str2int("O")  # label name -> int id
```
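Conceptually, those two functions are just the forward and inverse label mappings. A plain-Python sketch of what they boil down to (the label names here are hypothetical):

```python
# What ClassLabel's int2str/str2int amount to (sketch with made-up labels).
names = ["O", "B-COURT", "I-COURT"]
int2str = dict(enumerate(names))                     # e.g. 1 -> "B-COURT"
str2int = {name: i for i, name in enumerate(names)}  # e.g. "B-COURT" -> 1
```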
For evaluation we use the HF `evaluate` library. Any model or pipeline should follow its API; check this to implement a custom Pipeline. See `baselines.py` and `test_stuff.ipynb` for an example of a simple baseline with an API good enough for the HF evaluator.
Suggestion: maybe we can try to use spaCy for this?
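As a rough illustration of what a minimal baseline can look like, here is a toy sketch. It is not the actual class in `baselines.py` (that one follows the HF evaluator's contract), just the general shape of the idea:

```python
class MajorityBaseline:
    """Toy baseline that predicts one constant tag for every token.

    Purely illustrative; see baselines.py for a baseline whose API
    actually matches what the HF evaluator expects.
    """

    def __init__(self, tag="O"):
        self.tag = tag

    def __call__(self, tokens):
        # Same tag for every input token.
        return [self.tag] * len(tokens)
```

Even a trivial constant predictor like this is useful as a floor when comparing real models.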
Installing fastText:

```
git clone https://github.com/facebookresearch/fastText.git
cd fastText
pip install .
```
Run the model creation:

```python
from utils import create_fasttext_model

create_fasttext_model(dataset_ner['train'], "legal_eval.bin")
```
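fastText's supervised mode trains on plain-text lines of the form `__label__TAG token`. A sketch of producing that format from token/tag pairs; whether `create_fasttext_model` formats its input exactly this way is an assumption on my part:

```python
def to_fasttext_lines(tokens, tags):
    # One training line per token: "__label__TAG token" -- fastText's
    # supervised input format. That create_fasttext_model does exactly
    # this is an assumption made for illustration.
    return [f"__label__{tag} {token}" for token, tag in zip(tokens, tags)]
```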
For transformers, labels are matched to tokens using the following function. It assigns every token inside a label's interval that interval's label:

```python
from utils import parse_to_ner_custom_tokenizer

dataset_ner = parse_to_ner_custom_tokenizer(dataset, tokenizer)
```
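The matching rule described above can be sketched like this: every token whose character span falls inside a labelled interval receives that interval's label, and everything else gets `"O"`. A simplified illustration, not the actual implementation of `parse_to_ner_custom_tokenizer`:

```python
def align_labels(token_spans, label_spans):
    """Assign each (start, end) token span the label of the interval that
    contains it, or "O" if no interval does (simplified sketch)."""
    tags = []
    for t_start, t_end in token_spans:
        tag = "O"
        for l_start, l_end, label in label_spans:
            if t_start >= l_start and t_end <= l_end:
                tag = label
                break
        tags.append(tag)
    return tags
```

For example, `align_labels([(0, 3), (4, 9)], [(4, 9, "B-COURT")])` yields `["O", "B-COURT"]`: the second token lies inside the labelled interval, the first does not.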
To train the DistilBERT model, simply run:

```
python -m legal_eval.transformers.train
```
To test it, modify the model checkpoint path and run:

```
python -m legal_eval.transformers.test
```
To train the spaCy model you need to have spaCy installed; then run:

```
python -m spacy train cpu_final.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
```

If you are on a GPU, replace `cpu_final.cfg` with `gpu_final.cfg`.
Generative models were not designed to solve NER tasks, but nowadays they are used for everything, so let's give them a chance. You can use them by providing your `OPENAI_API_TOKEN` in the system environment variables. Then run the `legal_eval/GPT/openai.ipynb` notebook to see how it works.