Commit: review comments
arjunattam committed May 22, 2024
1 parent 4260dcf commit b7e3759
Showing 7 changed files with 85 additions and 159 deletions.
44 changes: 23 additions & 21 deletions README.md
@@ -3,52 +3,55 @@
[![npm](https://img.shields.io/npm/v/empiricalrun)](https://npmjs.com/package/empiricalrun)
[![Discord](https://img.shields.io/badge/discord-empirical.run-blue?logo=discord&logoColor=white&color=5d68e8)](https://discord.gg/NeR6jj8dw9)

-Empirical is the fastest way to test your LLM app and iterate over prompts and other model configuration.
+Empirical is the fastest way to test different LLMs and model configurations, across
+all the scenarios that matter for your application.

-With Empirical, you can:
+With Empirical, you can

-- Run your test datasets locally against off-the-shelf models
-- Test your own custom models and RAG applications (see [how-to](https://docs.empirical.run/models/custom))
-- Reports to view, compare, analyze outputs on a web UI
+- Run your test datasets locally against [off-the-shelf](https://docs.empirical.run/models/model) or [custom models](https://docs.empirical.run/models/custom)
+- Compare model outputs on a web UI, and [test changes quickly](https://docs.empirical.run/reporter)
+- Score your outputs with [scoring functions](https://docs.empirical.run/scoring/basics)
+- Run [tests on CI/CD](https://docs.empirical.run/running-in-ci)

https://github.com/empirical-run/empirical/assets/284612/3309283c-ddad-4c4e-8175-08a32460686c

## Usage

-[See quick start on docs →](https://docs.empirical.run/quickstart)
+[**See all docs →**](https://docs.empirical.run/quickstart)

-Empirical bundles together a CLI and a web app. The CLI handles running tests and
-the web app visualizes results.
+Empirical bundles together a test runner and a web app. These can be used through
+the CLI in your terminal window.

-Everything runs locally, with a JSON configuration file, `empiricalrc.json`.
-
-> Required: [Node.js](https://nodejs.org/en) 20+ needs to be installed on your system.
+Empirical relies on a configuration file, typically located at `empiricalrc.js`
+which describes the test to run.

### Start with a basic example

-In this example, we will ask an LLM to parse user messages to extract entities and
+In this example, we will ask an LLM to extract entities from user messages and
give us a structured JSON output. For example, "I'm Alice from Maryland" will
become `"{name: 'Alice', location: 'Maryland'}"`.
become `{name: 'Alice', location: 'Maryland'}`.

Our test will succeed if the model outputs valid JSON.
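
To make the example concrete, here is a minimal sketch of what an `empiricalrc.js` for this test could contain. The shape (`runs`, `dataset`, `scorers`) follows the JSON configuration shown in the quickstart diff later in this commit; the exact generated file may differ.

```js
// Sketch only: keys mirror the empiricalrc.json example elsewhere in this commit
export default {
  runs: [
    {
      type: "model",
      provider: "openai",
      model: "gpt-3.5-turbo",
      prompt:
        "Extract the name, age and location from the message, and respond with a JSON object. If an entity is missing, respond with null.\n\nMessage: {{user_message}}",
    },
  ],
  dataset: {
    samples: [{ inputs: { user_message: "I'm Alice from Maryland" } }],
  },
  // The test passes when the model output parses as valid JSON
  scorers: [{ type: "json-syntax" }],
};
```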

-1. Use the CLI to create a sample configuration file called `empiricalrc.json`.
+1. Use the CLI to create a sample configuration file called `empiricalrc.js`.

```sh
-npx empiricalrun init
-cat empiricalrc.json
+npm init empirical
+
+# For TypeScript
+npm init empirical -- --using-ts
```

-2. Run the test samples against the models with the `run` command. This step requires
-the `OPENAI_API_KEY` environment variable to authenticate with OpenAI. This
-execution will cost $0.0026, based on the selected models.
+2. Run the example dataset against the selected models.

```sh
npx empiricalrun
```

+This step requires the `OPENAI_API_KEY` environment variable to
+authenticate with OpenAI. This execution will cost $0.0026, based
+on the selected models.

3. Use the `ui` command to open the reporter web app and see side-by-side results.

```sh
npx empiricalrun ui
```

@@ -57,13 +60,12 @@ Our test will succeed if the model outputs valid JSON.

### Make it yours

-Edit the `empiricalrc.json` file to make Empirical work for your use-case.
+Edit the `empiricalrc.js` file to make Empirical work for your use-case.

- Configure which [models to use](https://docs.empirical.run/models/basics)
- Configure [your test dataset](https://docs.empirical.run/dataset/basics)
- Configure [scoring functions](https://docs.empirical.run/scoring/basics) to grade output quality


## Contribution guide

See [development docs](development/README.md).
8 changes: 4 additions & 4 deletions docs/config/json.mdx
@@ -3,11 +3,11 @@ title: 'JSON'
description: 'Use a JSON file to configure your tests'
---

-For simpler configurations, you can use a JSON file to configure your tests.
+For simpler configurations, you can use a JSON file to configure your tests, instead
+of [JavaScript](./javascript) or [TypeScript](./typescript).

-Empirical uses the config file at `empiricalrc.json` to describe
-the test to run. This configuration is declarative, in the sense that you define what you
-want to test, and Empirical will internally implement the expected behavior.
+To set this up, create the config file at `empiricalrc.json` which describes
+the test to run.
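
For orientation, a minimal `empiricalrc.json` could look like the sketch below; the keys follow the example configuration that appears in the quickstart changes of this same commit.

```json
{
  "runs": [
    {
      "type": "model",
      "provider": "openai",
      "model": "gpt-3.5-turbo",
      "prompt": "Extract the name and location from the message.\n\nMessage: {{user_message}}"
    }
  ],
  "dataset": {
    "samples": [{ "inputs": { "user_message": "This is Alice from Maryland." } }]
  },
  "scorers": [{ "type": "json-syntax" }]
}
```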

## Configuration reference

18 changes: 8 additions & 10 deletions docs/introduction.mdx
@@ -1,22 +1,20 @@
---
title: Introduction
-description: 'Welcome to empirical.run'
+description: 'Welcome to Empirical'
---

-Empirical is the fastest way to test different LLMs, prompts and other model configurations, across all the scenarios
-that matter for your application.
+Empirical is the fastest way to test different LLMs and model configurations, across
+all the scenarios that matter for your application.

-[Try it out!](./quickstart)
+[Quick start →](./quickstart)

## With Empirical, you can

-- Run your test datasets locally against off-the-shelf models
-- Test your own custom models and RAG applications (see [how-to](./models/custom))
-- Reports to view, compare, analyze outputs on a web UI
+- Run your test datasets locally against [off-the-shelf](./models/model) or [custom models](./models/custom)
+- Compare model outputs on a web UI, and [test changes quickly](./reporter)
+- Score your outputs with [scoring functions](./scoring/basics)
+- Run [tests on CI/CD](./running-in-ci)


## Walk through

Watch a 6 mins demo video showing how Empirical can run the [HumanEval benchmark](https://github.com/empirical-run/empirical/tree/main/examples/humaneval).
@@ -33,5 +31,5 @@ Watch a 6 mins demo video showing how Empirical can run the [HumanEval benchmark

## Open source

-Empirical is open source on [GitHub](https://github.com/empirical-run/empirical). Star the repo, file issues or pull requests
-to contribute to the project.
+Empirical is open source on [GitHub](https://github.com/empirical-run/empirical). Star
+the repo, file issues or pull requests to contribute to the project.
103 changes: 17 additions & 86 deletions docs/quickstart.mdx
@@ -3,129 +3,60 @@ title: 'Quick start'
description: 'Try Empirical in 3 steps'
---

-Empirical bundles together a CLI and a web app. The CLI handles running tests and
-the web app visualizes results.
+Empirical bundles together a test runner and a web app, as the [test reporter](./reporter). These
+can be used through the CLI in your terminal window.

-Everything runs locally, with a JSON configuration file, `empiricalrc.json`.
+Empirical relies on a configuration file, typically located at `empiricalrc.js`
+which describes the test to run. This configuration is declarative, which means
+that you define what you want to test, and Empirical will internally implement
+the expected behavior.

-Required: Node.js 20+ needs to be installed on your system.
+> Required: Node.js 20+ needs to be installed on your system.
## Start with a basic example

-In this example, we will ask an LLM to parse user messages to extract entities and
+In this example, we will ask an LLM to extract entities from user messages and
give us a structured JSON output. For example, "I'm Alice from Maryland" will
become `"{name: 'Alice', location: 'Maryland'}"`.
become `{name: 'Alice', location: 'Maryland'}`.

Our test will succeed if the model outputs valid JSON.

<Steps>
<Step title="Set up Empirical">
-Use the CLI to create a sample configuration file in `empiricalrc.json`.
+Use the CLI to create a sample configuration file in `empiricalrc.js`.

```sh
-npx empiricalrun init
-```
-
-Read the file to see the configured models and dataset samples that we will test
-for. The default configuration uses models from OpenAI.
-
-```sh
-cat empiricalrc.json
+npm init empirical
+
+# For TypeScript
+npm init empirical -- --using-ts
```
</Step>

<Step title="Run the test">
-Run the test samples against the models with the `run` command.
+Run the example dataset against the selected models.

```sh
npx empiricalrun
```

This step requires the `OPENAI_API_KEY` environment variable to authenticate with
-OpenAI. This execution will cost $0.0026, based on the selected models.
+OpenAI. This run will cost $0.0026, based on the selected models.
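
If the variable is not already set in your shell, exporting it before the run is enough; the key value below is a placeholder.

```sh
# Placeholder key; use your own OpenAI API key
export OPENAI_API_KEY="sk-..."
npx empiricalrun
```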
</Step>

<Step title="See results">
-Use the `ui` command to open the reporter web app in your web browser and see
-side-by-side results.
+Use the `ui` command to open the reporter web app and see side-by-side results.

```sh
npx empiricalrun ui
```
</Step>

<Step title="[Bonus] Fix GPT-4 Turbo">
GPT-4 Turbo tends to fail our JSON syntax check, because it returns outputs
in markdown syntax (with backticks ` ```json `). We can fix this behavior by enabling
[JSON mode](https://platform.openai.com/docs/guides/text-generation/json-mode).

```json
{
"model": "gpt-4-turbo-preview",
// ...
// Existing properties
"parameters": {
"response_format": {
"type": "json_object"
}
}
}
```

<Accordion title="empiricalrc.json: Updated with JSON mode">
```json empiricalrc.json
{
"runs": [
{
"type": "model",
"provider": "openai",
"model": "gpt-3.5-turbo",
"prompt": "Extract the name, age and location from the message, and respond with a JSON object. If an entity is missing, respond with null.\n\nMessage: {{user_message}}"
},
{
"type": "model",
"provider": "openai",
"model": "gpt-4-turbo-preview",
"parameters": {
"response_format": {
"type": "json_object"
}
},
"prompt": "Extract the name, age and location from the message, and respond with a JSON object. If an entity is missing, respond with null.\n\nMessage: {{user_message}}"
}
],
"dataset": {
"samples": [
{
"inputs": {
"user_message": "Hi my name is John Doe. I'm 26 years old and I work in real estate."
}
},
{
"inputs": {
"user_message": "This is Alice. I am a nurse from Maryland. I was born in 1990."
}
}
]
},
"scorers": [
{
"type": "json-syntax"
}
]
}
```
</Accordion>

Re-running the test with `npx empiricalrun` will give us better results
for GPT-4 Turbo.
</Step>
</Steps>


## Make it yours

-Edit the `empiricalrc.json` file to make Empirical work for your use-case.
+Edit the `empiricalrc.js` file to make Empirical work for your use-case.

- Configure which [models to use](./models/basics)
- Configure [your test dataset](./dataset/basics)
16 changes: 6 additions & 10 deletions docs/scoring/javascript.mdx
@@ -11,7 +11,7 @@ are supported.

<CodeGroup>

-```js Inline
+```js Inline function
export default {
scorers: [
function ({ output, inputs }) {
@@ -44,12 +44,9 @@
```

```js With types
-import { Config, AsyncScoringFn } from "empiricalrun";
+import { Config, JSScriptScorer } from "empiricalrun";

-async function customScorer({
-output,
-inputs,
-}: Parameters<AsyncScoringFn>[0]): ReturnType<AsyncScoringFn> {
+const customScorer: JSScriptScorer = async({ output, inputs }) => {
// Use output and inputs to calculate the score
// ...
return {
@@ -61,7 +58,6 @@
export default {
scorers: [customScorer],
};

```

</CodeGroup>
@@ -70,14 +66,14 @@ The function has the following signature:

- **Arguments**
- Object with
-- output: object with key `value` to get the output value (string) and key `metadata` to get metadata (object)
+- output: object with key `value` to get the output value (string) and key `metadata` to get metadata (object); see [output object](./../models/output)
- inputs: object of key-value pairs from the dataset sample
- **Returns**
-- List of scores: each result is an object with score (number between 0 to 1), message (optional, string) and name (optional, string)
+- Score object: object with `score` (number between 0 to 1), `message` (optional, string) and `name` (optional, string)

## Multiple scores

-It is possible for the method to return an array of scores. Use `name` to distinguish
+It is possible for the method to return an array of score objects. Use `name` to distinguish
between them.
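
As an illustration, here is a sketch of an inline scorer that returns two named score objects; the check names are invented for the example.

```js
export default {
  scorers: [
    function ({ output }) {
      // Return one score object per check; `name` tells them apart in reports
      return [
        {
          name: "non-empty",
          score: output.value ? 1 : 0,
          message: "Output should be non-empty",
        },
        {
          name: "mentions-location",
          score: output.value && output.value.includes("location") ? 1 : 0,
        },
      ];
    },
  ],
};
```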

5 changes: 2 additions & 3 deletions docs/scoring/python.mdx
@@ -19,7 +19,7 @@ the `scorers` section of the configuration. The `path` key should be the path to
In the script, you need to define an `evaluate` method, with the following signature:

- **Arguments**
-- output: dict with key `value` to get the output value (string) and key `metadata` to get metadata (dict)
+- output: dict with key `value` to get the output value (string) and key `metadata` to get metadata (dict); see [output object](./../models/output)
- inputs: dict of key-value pairs from the dataset sample
- **Returns**
- List of results: each result is dict with score (number between 0 to 1), message (optional, string) and name (optional, string)
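
For reference, an `evaluate` implementation matching this signature might look like the following sketch, using a JSON validity check as the example.

```python
import json

def evaluate(output, inputs):
    # output["value"] holds the model's raw response string
    try:
        json.loads(output["value"])
        return [{"score": 1, "name": "json-check"}]
    except (json.JSONDecodeError, TypeError):
        return [{"score": 0, "name": "json-check", "message": "Output is not valid JSON"}]
```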
@@ -70,5 +70,4 @@ npx empiricalrun --python-path PATH_TO_PYTHON_BINARY

## Limitations

-- The Python script must complete execution within 10 seconds
-- `async` Python functions are not supported
+- The Python script must complete execution within 20 seconds
