Mistral Harmlessness Evaluator

This module is a part of the Mistral Self-Alignment project, aimed at aligning the Mistral model to be harmless and prevent dangerous and harmful responses. For more details, explore related GitHub repositories like The Lab.

Overview

The Mistral Harmlessness Evaluator is designed to test the harmlessness of the trained Mistral 7b Peft Adapters. The evaluation is based on a one-shot prompt test available here, resulting in an 82% agreement with labeled data.

Usage

Installation

Clone the repository and install the necessary dependencies:

pip install requirements.txt

Please note that the requirements.txt file was written in the context of running the module in a Kaggle notebook and may not contain all the packages necessary for a local notebook.

Running the Script

Use the following command to run the script:

python harmlessness_self_evaluator.py \
    --model_path "path/to/mistral" \
    --peft_path "path/to/peft_adapter" \
    --num_of_eval_prompts 200

In the example above:

--model_path should be the file path or the Hugging Face Hub's ID of Mistral 7b.
--peft_path should be the file path or the Hugging Face Hub's ID of the Peft Adapter.
--num_of_eval_prompts is the number of red team prompts in the train and test dataset for the model to be evaluated on. Keep in mind that the number of red team prompts is limited.

Note

This module was initially developed for personal use. If you find it useful, feel free to clone, modify, and adapt it to your use case.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

README.md

README.md

Mistral Harmlessness Evaluator

Overview

Usage

Installation

Running the Script

Note

Files

README.md

Latest commit

History

README.md

File metadata and controls

Mistral Harmlessness Evaluator

Overview

Usage

Installation

Running the Script

Note