Commit: review comments
arjunattam committed May 22, 2024
1 parent 4260dcf commit b7e3759
Showing 7 changed files with 85 additions and 159 deletions.
44 changes: 23 additions & 21 deletions README.md
@@ -3,52 +3,55 @@
[![npm](https://img.shields.io/npm/v/empiricalrun)](https://npmjs.com/package/empiricalrun)
[![Discord](https://img.shields.io/badge/discord-empirical.run-blue?logo=discord&logoColor=white&color=5d68e8)](https://discord.gg/NeR6jj8dw9)

-Empirical is the fastest way to test your LLM app and iterate over prompts and other model configuration.
+Empirical is the fastest way to test different LLMs and model configurations, across
+all the scenarios that matter for your application.

-With Empirical, you can:
+With Empirical, you can

-- Run your test datasets locally against off-the-shelf models
-- Test your own custom models and RAG applications (see [how-to](https://docs.empirical.run/models/custom))
-- Reports to view, compare, analyze outputs on a web UI
+- Run your test datasets locally against [off-the-shelf](https://docs.empirical.run/models/model) or [custom models](https://docs.empirical.run/models/custom)
+- Compare model outputs on a web UI, and [test changes quickly](https://docs.empirical.run/reporter)
+- Score your outputs with [scoring functions](https://docs.empirical.run/scoring/basics)
+- Run [tests on CI/CD](https://docs.empirical.run/running-in-ci)

https://github.com/empirical-run/empirical/assets/284612/3309283c-ddad-4c4e-8175-08a32460686c

## Usage

-[See quick start on docs →](https://docs.empirical.run/quickstart)
+[**See all docs →**](https://docs.empirical.run/quickstart)

-Empirical bundles together a CLI and a web app. The CLI handles running tests and
-the web app visualizes results.
+Empirical bundles together a test runner and a web app. These can be used through
+the CLI in your terminal window.

-Everything runs locally, with a JSON configuration file, `empiricalrc.json`.
-
-> Required: [Node.js](https://nodejs.org/en) 20+ needs to be installed on your system.
+Empirical relies on a configuration file, typically located at `empiricalrc.js`
+which describes the test to run.

### Start with a basic example

-In this example, we will ask an LLM to parse user messages to extract entities and
+In this example, we will ask an LLM to extract entities from user messages and
give us a structured JSON output. For example, "I'm Alice from Maryland" will
become `"{name: 'Alice', location: 'Maryland'}"`.
become `{name: 'Alice', location: 'Maryland'}`.

Our test will succeed if the model outputs valid JSON.
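
To make the example concrete, here is a minimal sketch of what an `empiricalrc.js` for this test could contain. The shape (`runs`, `dataset`, `scorers`) follows the JSON configuration shown in the quickstart diff later in this commit; the exact generated file may differ.

```js
// Sketch only: keys mirror the empiricalrc.json example elsewhere in this commit
export default {
  runs: [
    {
      type: "model",
      provider: "openai",
      model: "gpt-3.5-turbo",
      prompt:
        "Extract the name, age and location from the message, and respond with a JSON object. If an entity is missing, respond with null.\n\nMessage: {{user_message}}",
    },
  ],
  dataset: {
    samples: [{ inputs: { user_message: "I'm Alice from Maryland" } }],
  },
  // The test passes when the model output parses as valid JSON
  scorers: [{ type: "json-syntax" }],
};
```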

-1. Use the CLI to create a sample configuration file called `empiricalrc.json`.
+1. Use the CLI to create a sample configuration file called `empiricalrc.js`.

```sh
-npx empiricalrun init
-cat empiricalrc.json
+npm init empirical
+
+# For TypeScript
+npm init empirical -- --using-ts
```

-2. Run the test samples against the models with the `run` command. This step requires
-the `OPENAI_API_KEY` environment variable to authenticate with OpenAI. This
-execution will cost $0.0026, based on the selected models.
+2. Run the example dataset against the selected models.

```sh
npx empiricalrun
```

+This step requires the `OPENAI_API_KEY` environment variable to
+authenticate with OpenAI. This execution will cost $0.0026, based
+on the selected models.

3. Use the `ui` command to open the reporter web app and see side-by-side results.

```sh
npx empiricalrun ui
```

@@ -57,13 +60,12 @@ Our test will succeed if the model outputs valid JSON.

### Make it yours

-Edit the `empiricalrc.json` file to make Empirical work for your use-case.
+Edit the `empiricalrc.js` file to make Empirical work for your use-case.

- Configure which [models to use](https://docs.empirical.run/models/basics)
- Configure [your test dataset](https://docs.empirical.run/dataset/basics)
- Configure [scoring functions](https://docs.empirical.run/scoring/basics) to grade output quality


## Contribution guide

See [development docs](development/README.md).
8 changes: 4 additions & 4 deletions docs/config/json.mdx
@@ -3,11 +3,11 @@ title: 'JSON'
description: 'Use a JSON file to configure your tests'
---

-For simpler configurations, you can use a JSON file to configure your tests.
+For simpler configurations, you can use a JSON file to configure your tests, instead
+of [JavaScript](./javascript) or [TypeScript](./typescript).

-Empirical uses the config file at `empiricalrc.json` to describe
-the test to run. This configuration is declarative, in the sense that you define what you
-want to test, and Empirical will internally implement the expected behavior.
+To set this up, create the config file at `empiricalrc.json` which describes
+the test to run.
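
For orientation, a minimal `empiricalrc.json` could look like the sketch below; the keys follow the example configuration that appears in the quickstart changes of this same commit.

```json
{
  "runs": [
    {
      "type": "model",
      "provider": "openai",
      "model": "gpt-3.5-turbo",
      "prompt": "Extract the name and location from the message.\n\nMessage: {{user_message}}"
    }
  ],
  "dataset": {
    "samples": [{ "inputs": { "user_message": "This is Alice from Maryland." } }]
  },
  "scorers": [{ "type": "json-syntax" }]
}
```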

## Configuration reference

18 changes: 8 additions & 10 deletions docs/introduction.mdx
@@ -1,22 +1,20 @@
---
title: Introduction
-description: 'Welcome to empirical.run'
+description: 'Welcome to Empirical'
---

-Empirical is the fastest way to test different LLMs, prompts and other model configurations, across all the scenarios
-that matter for your application.
+Empirical is the fastest way to test different LLMs and model configurations, across
+all the scenarios that matter for your application.

-[Try it out!](./quickstart)
+[Quick start →](./quickstart)

## With Empirical, you can

-- Run your test datasets locally against off-the-shelf models
-- Test your own custom models and RAG applications (see [how-to](./models/custom))
-- Reports to view, compare, analyze outputs on a web UI
+- Run your test datasets locally against [off-the-shelf](./models/model) or [custom models](./models/custom)
+- Compare model outputs on a web UI, and [test changes quickly](./reporter)
+- Score your outputs with [scoring functions](./scoring/basics)
+- Run [tests on CI/CD](./running-in-ci)


## Walk through

Watch a 6 mins demo video showing how Empirical can run the [HumanEval benchmark](https://github.com/empirical-run/empirical/tree/main/examples/humaneval).
@@ -33,5 +31,5 @@ Watch a 6 mins demo video showing how Empirical can run the [HumanEval benchmark

## Open source

-Empirical is open source on [GitHub](https://github.com/empirical-run/empirical). Star the repo, file issues or pull requests
-to contribute to the project.
+Empirical is open source on [GitHub](https://github.com/empirical-run/empirical). Star
+the repo, file issues or pull requests to contribute to the project.
103 changes: 17 additions & 86 deletions docs/quickstart.mdx
@@ -3,129 +3,60 @@ title: 'Quick start'
description: 'Try Empirical in 3 steps'
---

-Empirical bundles together a CLI and a web app. The CLI handles running tests and
-the web app visualizes results.
+Empirical bundles together a test runner and a web app, as the [test reporter](./reporter). These
+can be used through the CLI in your terminal window.

-Everything runs locally, with a JSON configuration file, `empiricalrc.json`.
+Empirical relies on a configuration file, typically located at `empiricalrc.js`
+which describes the test to run. This configuration is declarative, which means
+that you define what you want to test, and Empirical will internally implement
+the expected behavior.

-Required: Node.js 20+ needs to be installed on your system.
+> Required: Node.js 20+ needs to be installed on your system.
## Start with a basic example

-In this example, we will ask an LLM to parse user messages to extract entities and
+In this example, we will ask an LLM to extract entities from user messages and
give us a structured JSON output. For example, "I'm Alice from Maryland" will
become `"{name: 'Alice', location: 'Maryland'}"`.
become `{name: 'Alice', location: 'Maryland'}`.

Our test will succeed if the model outputs valid JSON.

<Steps>
<Step title="Set up Empirical">
-Use the CLI to create a sample configuration file in `empiricalrc.json`.
+Use the CLI to create a sample configuration file in `empiricalrc.js`.

```sh
-npx empiricalrun init
-```
-
-Read the file to see the configured models and dataset samples that we will test
-for. The default configuration uses models from OpenAI.
-
-```sh
-cat empiricalrc.json
+npm init empirical
+
+# For TypeScript
+npm init empirical -- --using-ts
```
</Step>

<Step title="Run the test">
-Run the test samples against the models with the `run` command.
+Run the example dataset against the selected models.

```sh
npx empiricalrun
```

This step requires the `OPENAI_API_KEY` environment variable to authenticate with
-OpenAI. This execution will cost $0.0026, based on the selected models.
+OpenAI. This run will cost $0.0026, based on the selected models.
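
If the variable is not already set in your shell, exporting it before the run is enough; the key value below is a placeholder.

```sh
# Placeholder key; use your own OpenAI API key
export OPENAI_API_KEY="sk-..."
npx empiricalrun
```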
</Step>

<Step title="See results">
-Use the `ui` command to open the reporter web app in your web browser and see
-side-by-side results.
+Use the `ui` command to open the reporter web app and see side-by-side results.

```sh
npx empiricalrun ui
```
</Step>

<Step title="[Bonus] Fix GPT-4 Turbo">
GPT-4 Turbo tends to fail our JSON syntax check, because it returns outputs
in markdown syntax (with backticks ` ```json `). We can fix this behavior by enabling
[JSON mode](https://platform.openai.com/docs/guides/text-generation/json-mode).

```json
{
"model": "gpt-4-turbo-preview",
// ...
// Existing properties
"parameters": {
"response_format": {
"type": "json_object"
}
}
}
```

<Accordion title="empiricalrc.json: Updated with JSON mode">
```json empiricalrc.json
{
"runs": [
{
"type": "model",
"provider": "openai",
"model": "gpt-3.5-turbo",
"prompt": "Extract the name, age and location from the message, and respond with a JSON object. If an entity is missing, respond with null.\n\nMessage: {{user_message}}"
},
{
"type": "model",
"provider": "openai",
"model": "gpt-4-turbo-preview",
"parameters": {
"response_format": {
"type": "json_object"
}
},
"prompt": "Extract the name, age and location from the message, and respond with a JSON object. If an entity is missing, respond with null.\n\nMessage: {{user_message}}"
}
],
"dataset": {
"samples": [
{
"inputs": {
"user_message": "Hi my name is John Doe. I'm 26 years old and I work in real estate."
}
},
{
"inputs": {
"user_message": "This is Alice. I am a nurse from Maryland. I was born in 1990."
}
}
]
},
"scorers": [
{
"type": "json-syntax"
}
]
}
```
</Accordion>

Re-running the test with `npx empiricalrun` will give us better results
for GPT-4 Turbo.
</Step>
</Steps>


## Make it yours

-Edit the `empiricalrc.json` file to make Empirical work for your use-case.
+Edit the `empiricalrc.js` file to make Empirical work for your use-case.

- Configure which [models to use](./models/basics)
- Configure [your test dataset](./dataset/basics)
16 changes: 6 additions & 10 deletions docs/scoring/javascript.mdx
@@ -11,7 +11,7 @@ are supported.

<CodeGroup>

-```js Inline
+```js Inline function
export default {
scorers: [
function ({ output, inputs }) {
@@ -44,12 +44,9 @@
```

```js With types
-import { Config, AsyncScoringFn } from "empiricalrun";
+import { Config, JSScriptScorer } from "empiricalrun";

-async function customScorer({
-output,
-inputs,
-}: Parameters<AsyncScoringFn>[0]): ReturnType<AsyncScoringFn> {
+const customScorer: JSScriptScorer = async({ output, inputs }) => {
// Use output and inputs to calculate the score
// ...
return {
@@ -61,7 +58,6 @@
export default {
scorers: [customScorer],
};

```

</CodeGroup>
@@ -70,14 +66,14 @@ The function has the following signature:

- **Arguments**
- Object with
-- output: object with key `value` to get the output value (string) and key `metadata` to get metadata (object)
+- output: object with key `value` to get the output value (string) and key `metadata` to get metadata (object); see [output object](./../models/output)
- inputs: object of key-value pairs from the dataset sample
- **Returns**
-- List of scores: each result is an object with score (number between 0 to 1), message (optional, string) and name (optional, string)
+- Score object: object with `score` (number between 0 to 1), `message` (optional, string) and `name` (optional, string)

## Multiple scores

-It is possible for the method to return an array of scores. Use `name` to distinguish
+It is possible for the method to return an array of score objects. Use `name` to distinguish
between them.
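
As an illustration, here is a sketch of an inline scorer that returns two named score objects; the check names are invented for the example.

```js
export default {
  scorers: [
    function ({ output }) {
      // Return one score object per check; `name` tells them apart in reports
      return [
        {
          name: "non-empty",
          score: output.value ? 1 : 0,
          message: "Output should be non-empty",
        },
        {
          name: "mentions-location",
          score: output.value && output.value.includes("location") ? 1 : 0,
        },
      ];
    },
  ],
};
```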

5 changes: 2 additions & 3 deletions docs/scoring/python.mdx
@@ -19,7 +19,7 @@ the `scorers` section of the configuration. The `path` key should be the path to
In the script, you need to define an `evaluate` method, with the following signature:

- **Arguments**
-- output: dict with key `value` to get the output value (string) and key `metadata` to get metadata (dict)
+- output: dict with key `value` to get the output value (string) and key `metadata` to get metadata (dict); see [output object](./../models/output)
- inputs: dict of key-value pairs from the dataset sample
- **Returns**
- List of results: each result is dict with score (number between 0 to 1), message (optional, string) and name (optional, string)
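
For reference, an `evaluate` implementation matching this signature might look like the following sketch, using a JSON validity check as the example.

```python
import json

def evaluate(output, inputs):
    # output["value"] holds the model's raw response string
    try:
        json.loads(output["value"])
        return [{"score": 1, "name": "json-check"}]
    except (json.JSONDecodeError, TypeError):
        return [{"score": 0, "name": "json-check", "message": "Output is not valid JSON"}]
```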
@@ -70,5 +70,4 @@ npx empiricalrun --python-path PATH_TO_PYTHON_BINARY

## Limitations

-- The Python script must complete execution within 10 seconds
-- `async` Python functions are not supported
+- The Python script must complete execution within 20 seconds
