Project: Exam Benchmark Testing with LLM Providers

Overview

This project benchmarks the flagship models of several LLM providers, such as OpenAI, Anthropic, and Google, by running them through multiple-choice test questions. It evaluates response accuracy and latency, providing comparative insight into each provider's model performance.

Features

  • Loads multiple-choice test questions from a CSV file.
  • Sends questions to the selected LLM provider’s flagship model.
  • Manages retries for potential API errors.
  • Records response times and correctness (see the timing sketch after this list).
  • Saves results in a CSV format for analysis.
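
The timing and scoring details are not spelled out above; the sketch below shows one plausible way to record per-question latency and correctness (the function and field names are illustrative, not taken from the repository):

    import time

    def ask_and_score(question, expected_answer, ask_model):
        # `ask_model` stands in for whatever function actually queries the
        # provider; it is assumed to return the chosen option as a string.
        start = time.perf_counter()
        reply = ask_model(question)
        latency = time.perf_counter() - start
        correct = reply.strip() == str(expected_answer).strip()
        return {"latency_s": latency, "correct": correct, "reply": reply}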

Installation

  1. Clone this repository.

  2. Install the required Python packages:

    pip install -r requirements.txt
  3. Create a .env file and add API keys for each LLM provider:

    OPENAI_API_KEY=your_openai_api_key_here
    ANTHROPIC_API_KEY=your_anthropic_api_key_here
    GEMINI_API_KEY=your_gemini_api_key_here
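
The keys are presumably loaded with python-dotenv when the script starts; a minimal sketch of what that might look like (the real handling in main.py may differ):

    import os
    from dotenv import load_dotenv

    load_dotenv()  # reads the .env file from the working directory

    # Warn early if a key needed by the chosen provider is missing.
    for key in ("OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GEMINI_API_KEY"):
        if not os.getenv(key):
            print(f"Warning: {key} is not set")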
    

Usage

To run the benchmarking tool, use the following command:

python3 main.py <test_file> --num_runs <number_of_runs> --provider <provider_name> --temperature <int:temperature>

Arguments:

  • <test_file>: The name of the CSV file containing the exam questions (must be in the tests folder).
  • --num_runs: The number of times to run the test (default: 10).
  • --provider: The LLM provider to use for this test (options: openai, anthropic, google).
  • --temperature: The sampling temperature for the LLM, where 0 is less random and 1 is more random.

Example:

python3 main.py CDRE.csv --num_runs 10 --provider openai --temperature 0

This command runs the sample exam questions in CDRE.csv through OpenAI’s flagship model 10 times at temperature 0.
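
A command line like this is typically wired up with argparse; the sketch below mirrors the documented flags (defaults and argument types are assumptions, not read from main.py):

    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(description="Run an LLM exam benchmark.")
        parser.add_argument("test_file", help="CSV file located in the tests folder")
        parser.add_argument("--num_runs", type=int, default=10,
                            help="number of times to run the test")
        parser.add_argument("--provider", choices=["openai", "anthropic", "google"],
                            help="LLM provider to benchmark")
        # The usage line suggests an integer temperature (0 or 1).
        parser.add_argument("--temperature", type=int, default=0,
                            help="sampling temperature: 0 (less random) or 1 (more random)")
        return parser.parse_args()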

CSV File Format

The test file should be in CSV format with the following columns (an illustrative example follows the list):

  • ID: A unique identifier for each question.
  • Context: Any context needed for the question.
  • Question: The question text.
  • Options: The answer options, typically numbered.
  • Answer: The correct answer number.
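
Purely for illustration, a file following this layout might look like the fragment below; the row content is hypothetical, and the exact formatting of the Options column in the real test files may differ:

    ID,Context,Question,Options,Answer
    1,"A short scenario or reference passage.","Which option is correct?","1) First 2) Second 3) Third 4) Fourth",2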

Project Structure

  • main.py: The main script to run the benchmarking tests.
  • tests/: Folder containing the multiple-choice question CSV files.
  • results/: Folder where the test results CSV files are saved.
  • .env: File to store API keys for each LLM provider.

Functionality

  • load_test(file_path): Loads test questions from a CSV file.
  • exam(system, user, provider, max_retries, retry_delay): Runs a test question through the selected LLM provider's model and retries on error (a retry sketch follows this list).
  • run_test(test_data, num_runs, provider): Orchestrates running all test questions for the specified number of runs.
  • save_results(results, output_file): Saves the results to a CSV file.
  • main(test_file, num_runs, provider): Main entry point for running the benchmark tests.
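
The repository's exam() is not reproduced here, but the retry behaviour described above would look roughly like this sketch (call_provider is a placeholder, not the project's actual client code):

    import time

    def call_provider(provider, system, user):
        # Placeholder for the real SDK call (OpenAI, Anthropic, or Google).
        raise NotImplementedError

    def exam(system, user, provider, max_retries=3, retry_delay=2):
        # Ask one question, retrying on transient API errors.
        last_error = None
        for _ in range(max_retries):
            try:
                return call_provider(provider, system, user)
            except Exception as exc:  # the real code likely catches specific API errors
                last_error = exc
                time.sleep(retry_delay)
        raise last_error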

Requirements

  • Python 3.7+
  • OpenAI, Anthropic, and Google SDKs
  • Pandas
  • tqdm
  • python-dotenv

License

This project is licensed under the MIT License.

Disclosure

This file was generated using GPT-4o.
