This project benchmarks the flagship models of various language model providers, such as OpenAI, Anthropic, and Google, by running them through multiple-choice test questions. It evaluates response accuracy and latency, providing comparative insights into each provider's model performance.
- Loads multiple-choice test questions from a CSV file.
- Sends questions to the selected LLM provider’s flagship model.
- Manages retries for potential API errors.
- Records response times and correctness.
- Saves results in a CSV format for analysis.
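For orientation, here is a minimal sketch of the measurement loop the features above describe. All names here (`benchmark`, `ask`) are hypothetical illustrations, not the project's actual API:

```python
# Illustrative sketch of the benchmark loop; names are hypothetical.
import time

import pandas as pd


def benchmark(questions: pd.DataFrame, ask) -> pd.DataFrame:
    """Run each question through `ask` and record latency and correctness."""
    records = []
    for _, q in questions.iterrows():
        start = time.perf_counter()
        answer = ask(q["Question"])  # call the provider's model
        latency = time.perf_counter() - start
        records.append({
            "ID": q["ID"],
            "correct": str(answer).strip() == str(q["Answer"]).strip(),
            "latency_s": latency,
        })
    return pd.DataFrame(records)
```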
- Clone this repository.
- Install the required Python packages:

  ```bash
  pip install -r requirements.txt
  ```

- Create a `.env` file and add API keys for each LLM provider:

  ```
  OPENAI_API_KEY=your_openai_api_key_here
  ANTHROPIC_API_KEY=your_anthropic_api_key_here
  GEMINI_API_KEY=your_gemini_api_key_here
  ```
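Since `python-dotenv` is among the requirements, the keys are presumably read along these lines (a minimal sketch, not the project's exact code):

```python
# Minimal sketch: loading the provider keys from .env with python-dotenv.
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the working directory

openai_key = os.getenv("OPENAI_API_KEY")
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
gemini_key = os.getenv("GEMINI_API_KEY")
```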
To run the benchmarking tool, use the following command:

```bash
python3 main.py <test_file> --num_runs <number_of_runs> --provider <provider_name> --temperature <temperature>
```
- `<test_file>`: The name of the CSV file containing the exam questions (must be in the `tests` folder).
- `--num_runs`: The number of times to run the test (default: 10).
- `--provider`: The LLM provider to use for this test (options: `openai`, `anthropic`, `google`).
- `--temperature`: The sampling temperature for the LLM, from 0 (less random) to 1 (more random).
For example:

```bash
python3 main.py CDRE.csv --num_runs 10 --provider openai --temperature 0
```

This command runs the sample exam questions in `CDRE.csv` through OpenAI's flagship model 10 times.
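The flags above could be parsed with `argparse` along these lines; this is an assumption about how `main.py` is implemented, not a copy of it:

```python
# Hypothetical sketch of main.py's CLI parsing with argparse.
import argparse

parser = argparse.ArgumentParser(
    description="Benchmark LLM providers on a multiple-choice CSV exam."
)
parser.add_argument("test_file", help="CSV file in the tests/ folder")
parser.add_argument("--num_runs", type=int, default=10,
                    help="number of times to run the test")
parser.add_argument("--provider", choices=["openai", "anthropic", "google"],
                    help="LLM provider to benchmark")
parser.add_argument("--temperature", type=float, default=0.0,
                    help="sampling temperature, 0 (less random) to 1 (more random)")
args = parser.parse_args()
```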
The test file should be in CSV format with the following columns:
- `ID`: A unique identifier for each question.
- `Context`: Any context needed for the question.
- `Question`: The question text.
- `Options`: The answer options, typically numbered.
- `Answer`: The correct answer number.
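For illustration only, a test file might begin like this (the question content is invented):

```
ID,Context,Question,Options,Answer
1,"A short passage giving background.","Which option is correct?","1) First 2) Second 3) Third 4) Fourth",3
```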
The repository is laid out as follows:

- `main.py`: The main script to run the benchmarking tests.
- `tests/`: Folder containing the multiple-choice question CSV files.
- `results/`: Folder where the test results CSV files are saved.
- `.env`: File to store API keys for each LLM provider.
The main functions in `main.py` are:

- `load_test(file_path)`: Loads test questions from a CSV file.
- `exam(system, user, provider, max_retries, retry_delay)`: Runs a test question through the selected LLM provider's model and retries on error (see the sketch after this list).
- `run_test(test_data, num_runs, provider)`: Orchestrates running all test questions for the specified number of runs.
- `save_results(results, output_file)`: Saves the results to a CSV file.
- `main(test_file, num_runs, provider)`: Main entry point for running the benchmark tests.
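As referenced above, here is a minimal sketch of the retry pattern that `exam` is described as using. The helper name `call_with_retries` and the broad `except` are illustrative assumptions; the real code presumably catches each provider's specific API errors:

```python
# Hypothetical sketch of exam()'s retry behavior.
import time


def call_with_retries(send, max_retries=3, retry_delay=2.0):
    """Call `send()` and retry on failure, sleeping between attempts."""
    for attempt in range(1, max_retries + 1):
        try:
            return send()
        except Exception:  # in practice, catch the provider's API errors
            if attempt == max_retries:
                raise  # give up after the final attempt
            time.sleep(retry_delay)
```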
- Python 3.7+
- OpenAI, Anthropic, and Google SDKs
- Pandas
- tqdm
- python-dotenv
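A `requirements.txt` along these lines would cover the dependencies above; the exact package names (especially the Gemini SDK, assumed here to be `google-generativeai`) are assumptions:

```
# Assumed package names; pin versions as needed.
openai
anthropic
google-generativeai
pandas
tqdm
python-dotenv
```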
This project is licensed under the MIT License.
This file was generated using GPT-4o