Update user guide and doc #202

Merged
merged 12 commits on Mar 7, 2024
49 changes: 43 additions & 6 deletions README.md
@@ -30,15 +30,52 @@ _Oncology FM Evaluation Framework by KAIKO.ai_

### _About_

`eva` is [Kaiko](https://kaiko.ai/)'s evaluation framework for oncology foundation models.

## Installation

TBD

## How To Use

*Note: this section will be revised for the public package when publishing eva*


### Download the eva repo

First, install Git LFS on your machine (see [reference](https://git-lfs.com/)); it is used to track assets
such as the sample images used for tests.
```
brew install git-lfs
```
Navigate to the directory where you'd like to clone *eva* and set up Git LFS:
```
git lfs install
```
Now clone the repo:
```
git clone git@github.com:kaiko-ai/eva.git
```

### Environment and dependencies

Now install *eva* and its dependencies in a virtual environment. This can be done with the Python
package and dependency manager PDM (see [documentation](https://pdm-project.org/latest/)).

Install PDM on your machine:
```
brew install pdm
```
Navigate to the eva root directory and run:
```
pdm install
```
This will install eva and all its dependencies in a virtual environment. Activate the venv with:
```
source .venv/bin/activate
```
Now you are ready to start! Serve the documentation locally with:
```
mkdocs serve
```
and explore it in your [browser](http://127.0.0.1:8000/). Read through the main page and navigate
to [How to use eva](http://127.0.0.1:8000/user-guide/how_to_use/) to learn how to run *eva*.

## Datasets

60 changes: 41 additions & 19 deletions docs/index.md
@@ -29,61 +29,83 @@ hide:

_Oncology FM Evaluation Framework by Kaiko_

With the first release, ***eva*** supports performance evaluation for vision Foundation Models ("FMs") and supervised machine learning ("ML") models on WSI-patch-level image classification and radiology (CT-scan) segmentation tasks.

The goal of this project is to provide the open-source community with an easy-to-use framework that follows industry best practices and delivers a robust, reproducible and fair evaluation benchmark across FMs of different sizes and architectures.

Support for additional modalities and tasks will be added in future releases.

## Use cases

### 1. Evaluate your own FMs on public benchmark datasets

With a trained FM as input, you can run ***eva*** on several publicly available datasets & tasks for which ***eva*** provides out-of-the-box support. One ***eva*** run will automatically download and preprocess the relevant data, compute embeddings with the trained FM, fit and evaluate a classification head, and report the mean and standard deviation of the relevant performance metrics for the selected task.

Supported datasets & tasks include:

- **[Patch Camelyon](datasets/patch_camelyon.md)**: binary breast cancer classification
- **[BACH](datasets/bach.md)**: multiclass breast cancer classification
- **[CRC](datasets/crc.md)**: multiclass colorectal cancer classification
- **[MHIST](datasets/mhist.md)**: binary colorectal cancer classification
- **[TotalSegmentator](datasets/total_segmentator.md)**: radiology/CT-scan for segmentation of anatomical structures

To evaluate FMs, ***eva*** provides support for several formats. These include model checkpoints saved with PyTorch Lightning, models available from HuggingFace, and ONNX models.

### 2. Evaluate ML models on your own dataset & task

If you have your own labelled dataset, all that is needed is to implement a dataset class tailored to your source data. Start from one of our out-of-the-box dataset classes, adapt it to your data and run ***eva*** to see how different publicly available models perform on your task.

## Evaluation results

We evaluated the following seven FMs with ***eva*** on the four supported WSI-patch-level image classification tasks:

| FM-backbone | PCam - val* | PCam - test* | BACH - val** | CRC - val** | MHIST - val* |
|----------------------------------------------------------------------------|------------------|-----------------|-----------------|------------------|--------------|
| DINO ViT-S16 random weights | 0.765 (±0.0036) | 0.726 (±0.0024) | 0.416 (±0.014) | 0.643 (±0.0046) | TBD |
| DINO ViT-S16 imagenet | 0.871 (±0.0039) | 0.856 (±0.0044) | 0.673 (±0.0041) | 0.936 (±0.0009) | TBD |
| DINO ViT-B8 imagenet | 0.872 (±0.0013) | 0.854 (±0.0015) | 0.704 (±0.008) | 0.942 (±0.0005) | TBD |
| Kaiko DINO ViT-S16 | 0.911 (±0.0017) | 0.899 (±0.002) | 0.773 (±0.0069) | 0.954 (±0.0012) | TBD |
| Kaiko DINO ViT-B8 | 0.902 (±0.0013) | 0.887 (±0.0031) | 0.798 (±0.0063) | 0.949 (±0.0001) | TBD |
| Lunit - ViT-S16 | 0.89 (±0.0009) | 0.897 (±0.0029) | 0.765 (±0.0108) | TBD | TBD |
| Owkin - ViT base (from [HuggingFace](https://huggingface.co/owkin/phikon)) | 0.914 (±0.0012) | 0.919 (±0.0082) | 0.717 (±0.0031) | TBD | TBD |

The reported performance metrics are *balanced binary accuracy* (\*) and *balanced multiclass accuracy* (\*\*).

The runs used the default setup described in the section below. The table shows the average performance & standard deviation over 5 runs. To replicate those results yourself, refer to the [Tutorials](user-guide/tutorials.md).

***eva*** trains the decoder on the "train" split and uses the "validation" split for monitoring, early stopping and checkpoint selection. Evaluation results are reported on the "validation" split and, if available, on the "test" split.

## Evaluation setup

For WSI-patch-level/microscopy image classification tasks, FMs that produce image embeddings are evaluated with a single linear layer MLP with embeddings as inputs and label-predictions as output.

To standardize evaluations, the default configurations ***eva*** uses are based on the evaluation protocol proposed by Virchow [1] and on dataset/task-specific characteristics. To stop training at an appropriate point, we use early stopping after 10% of the maximal number of steps [2].

| | |
|-------------------------|---------------------------|
| **Backbone** | frozen |
| **Hidden layers** | none |
| **Dropout** | 0.0 |
| **Activation function** | none |
| **Number of steps** | 12,500 |
| **Base batch size** | 4,096 |
| **Batch size** | dataset specific* |
| **Base learning rate** | 0.01 |
| **Learning rate** | [Base learning rate] * [Batch size] / [Base batch size] |
| **Max epochs** | [Number of steps] * [Batch size] / [n samples] |
| **Early stopping** | 10% * [Max epochs] |
| **Optimizer** | SGD |
| **Momentum** | 0.9 |
| **Weight decay** | 0.0 |
| **Nesterov momentum** | true |
| **LR schedule** | Cosine without warmup |

*For smaller datasets (e.g. BACH with 400 samples) we reduce the batch size to 256 and scale the learning rate accordingly.
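
As a worked example of the scaling rules above (a sketch; the training-split size of roughly 260,000 patches is an assumption used only for illustration), at the base batch size of 4,096 the defaults give

$$
\text{LR} = 0.01 \times \tfrac{4096}{4096} = 0.01, \qquad
\text{Max epochs} = \tfrac{12{,}500 \times 4096}{260{,}000} \approx 197, \qquad
\text{Early stopping} \approx 0.1 \times 197 \approx 20 \text{ epochs,}
$$

while for BACH with batch size 256 the learning rate scales down to $0.01 \times 256 / 4096 = 0.000625$.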

- [1]: [Virchow: A Million-Slide Digital Pathology Foundation Model, 2024](https://arxiv.org/pdf/2309.07778.pdf)
- [2]: [Scaling Self-Supervised Learning for Histopathology with Masked Image Modeling](https://www.medrxiv.org/content/10.1101/2023.07.21.23292757v1.full.pdf)

## Next steps

Check out the [User Guide](user-guide/index.md) to get started with ***eva***.
32 changes: 28 additions & 4 deletions docs/user-guide/getting_started.md
@@ -1,11 +1,35 @@
# Getting Started

*Note: this section applies in the current form only to Kaiko-internal user testing and will be revised for the public package when publishing eva*

## Installation


- Create and activate a virtual environment with Python 3.10+

- Install ***eva*** and the ***eva-vision*** package with:

```
pip install git+ssh://git@github.com/kaiko-ai/eva.git
pip install "eva[vision]"
```

- To be able to use the existing configs, you have to first download them from the [***eva*** GitHub repo](https://github.com/kaiko-ai/eva/tree/main):

- Download the repo as zip file by clicking on `Code` > `Download ZIP`
- Unzip the file and copy the "configs" folder into the directory where you installed ***eva*** (or fetch the configs with the git-based sketch below)
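
Alternatively, if you have git available, the following sketch fetches the same configs (the clone URL and the `configs` folder name are inferred from the commands used elsewhere in this guide):
```
# Clone the repo and copy its configs folder into your working directory
git clone https://github.com/kaiko-ai/eva.git
cp -r eva/configs .
```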


## Run ***eva***

Run a complete ***eva*** workflow with:
```
python -m eva fit --config configs/vision/tests/online/bach.yaml
```
This will:

- Download and extract the BACH dataset to `./data/bach`, if it has not been downloaded before.
- Fit a complete model consisting of the frozen FM-backbone (a pretrained `dino_vits16`) and a downstream head (single layer MLP) on the BACH-train split.
- Evaluate the trained model on the val split and report the results

To learn more about how to run ***eva*** and customize your runs, familiarize yourself with [How to use ***eva***](how_to_use.md) and get started with the [Tutorials](tutorials.md).
53 changes: 53 additions & 0 deletions docs/user-guide/how_to_use.md
@@ -0,0 +1,53 @@
# How to use ***eva***

Before starting to use ***eva***, it's important to get familiar with the different workflows, subcommands and configurations.


## ***eva*** subcommands

To run an evaluation, we call:
```
python -m eva <subcommand> --config <path-to-config-file>
```

The *eva* interface supports the subcommands: `predict`, `fit` and `predict_fit`.

- **`fit`**: is used to train a decoder for a specific task and subsequently evaluate the performance. This can be done *online* or *offline* \*
- **`predict`**: is used to compute embeddings for input images with a provided FM-checkpoint. This is the first step of the *offline* workflow
- **`predict_fit`**: runs `predict` and `fit` sequentially. Like an online `fit` run, it runs a complete evaluation with images as input.

### \* *online* vs. *offline* workflows

We distinguish between the *online* and *offline* workflow:

- *online*: This mode uses raw images as input and generates the embeddings on the fly with a frozen FM backbone to train a downstream head network.
- *offline*: In this mode, embeddings are pre-computed and stored locally in a first step, then loaded from disk in a second step to train the downstream head network.

The *online* workflow can be used to quickly run a complete evaluation without saving and tracking embeddings. The *offline* workflow runs faster (the FM-backbone forward pass is computed only once per image) and is ideal for experimenting with different decoders on the same FM-backbone.
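
For illustration, the two workflows map onto the subcommands as follows; this is a sketch that reuses the config paths from the tutorials:
```
# Online workflow: a single run, embeddings are computed on the fly
python -m eva fit --config configs/vision/dino_vit/online/bach.yaml

# Offline workflow, step 1: compute and store embeddings with the FM-backbone
python -m eva predict --config configs/vision/dino_vit/offline/patch_camelyon.yaml

# Offline workflow, step 2: fit and evaluate the downstream head on the stored embeddings
python -m eva fit --config configs/vision/dino_vit/offline/patch_camelyon.yaml

# Or run both offline steps in one go
python -m eva predict_fit --config configs/vision/dino_vit/offline/patch_camelyon.yaml
```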


## Run configurations

### Config files

The setup for an ***eva*** run is provided in a `.yaml` config file specified with the `--config` flag.

A config file specifies the setup for the *trainer* (including callbacks for the model backbone), the *model* (setup of the trainable decoder) and the *data* module.

To get a better understanding, inspect some of the provided [config files](https://github.com/kaiko-ai/eva/tree/main/configs/vision) (which you will download if you run the tutorials).


### Environment variables

To customize runs, you can override the config parameters listed below by setting them as environment variables.

| | |
|-------------------------|---------------------------|
| `OUTPUT_ROOT` | the directory to store logging outputs and recorded results |
| `DINO_BACKBONE` | the backbone architecture, e.g. "dino_vits16" |
| `PRETRAINED` | whether to load FM-backbone weights from a pretrained model |
| `MONITOR_METRIC` | the metric to monitor for early stopping and model checkpoint loading |
| `EMBEDDINGS_DIR` | the directory to store the computed embeddings |
| `IN_FEATURES` | the input feature dimension (embedding) |
| `BATCH_SIZE` | the batch size for a training step |
| `PREDICT_BATCH_SIZE` | the batch size for a predict step |
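
As an illustrative sketch (variable names are taken from the table above; the specific values are example settings, not recommended defaults), a customized run could look like:
```
# Override selected config parameters via environment variables before launching a run
export OUTPUT_ROOT=./logs/my_experiment
export DINO_BACKBONE=dino_vits16
export PRETRAINED=true
export BATCH_SIZE=256
python -m eva fit --config configs/vision/dino_vit/online/bach.yaml
```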
3 changes: 2 additions & 1 deletion docs/user-guide/index.md
@@ -1,4 +1,5 @@
# User Guide

- [Getting started](getting_started.md)
- [How to use eva](how_to_use.md)
- [Tutorials](tutorials.md)
68 changes: 68 additions & 0 deletions docs/user-guide/tutorials.md
@@ -0,0 +1,68 @@
# Tutorials


If you haven't done so already, download the configs for these tutorials from the [***eva*** GitHub repo](https://github.com/kaiko-ai/eva/tree/main):

1. Download the repo as zip file by clicking on `Code` > `Download ZIP`
2. Unzip the file and copy the "configs" folder into the directory where you installed ***eva***


## 1. Run an *online*-evaluation

*Note: This step executes the same command & configuration as in the section "Getting started"*

Run a complete online workflow with the following command:
```
python -m eva fit --config configs/vision/dino_vit/online/bach.yaml
```

The `fit` run will:

- Download and extract the BACH dataset to `./data/bach`, if it has not been downloaded before.
- Fit a complete model - the frozen FM-backbone (a pretrained `dino_vits16`) and a downstream head (single layer MLP) - on the BACH-train split.
- Evaluate the trained model on the val split and report the results

Once the run is complete:

- check out some of the raw images in `<...>/eva/data/bach/ICIAR2018_BACH_Challenge` (this can already be done once the data download step is complete)
- check out the evaluation results json file in `<...>/eva/logs/dino_vit/online/bach` [**TBD**]



## 2. Run a complete *offline*-evaluation

Now, run a complete offline workflow with the following command:
```
python -m eva predict_fit --config configs/vision/dino_vit/offline/patch_camelyon.yaml
```

The `predict_fit` run will:

- Download and extract the BACH dataset to `./data/bach`, if it has not been downloaded before. If you ran the *online*-evaluation above before, this step will be skipped.
- ("predict") Computes the embeddings for all input images with the FM-backbone (a pretrained `dino_vits16`) and stores them in `./data/embeddings/bach` along with a `manifest.csv` file that keeps track of the mapping between input images and embeddings.
- ("fit") Fit a downstream head (single layer MLP) on the BACH-train split, using the computed embeddings and provided labels as input.
- Evaluate the trained model on the val split and report the results

Once the run is complete:

- check out the evaluation results json file in `<...>/eva/logs/dino_vit/online/bach` [**TBD**]

Note: comparing the results with the run from 1., you will notice a difference in performance. This is because we ran the online workflow with fewer epochs. Optionally, to verify that both workflows produce identical results, change the `max_steps` parameter in `configs/vision/dino_vit/offline/patch_camelyon.yaml` to [**TBD**], and rerun the `predict_fit` command above.

## 3. Run the fit step of the *offline*-evaluation

If you ran the complete *offline*-evaluation above, you have already computed and stored all the embeddings for the BACH dataset with the pretrained `dino_vits16` FM-backbone. (In case you skipped the previous step, generate them now by running `python -m eva predict --config configs/vision/dino_vit/offline/patch_camelyon.yaml`)

Now, run the fit step of the *offline* workflow with the following command:
```
python -m eva fit --config configs/vision/dino_vit/offline/patch_camelyon.yaml
```

The *offline*-`fit` run will:

- ("fit") Fit a downstream head (single layer MLP) on the BACH-train split, using the computed embeddings and provided labels as input.
- Evaluate the trained model on the val split and report the results

Once the run is complete:

- check out the evaluation results json file in `<...>/eva/logs/dino_vit/online/bach` [**TBD**] and verify that the results are identical to those from the previous `predict_fit` run.
3 changes: 2 additions & 1 deletion mkdocs.yml
@@ -32,6 +32,7 @@ nav:
- user-guide/index.md
- Getting started: user-guide/getting_started.md
- How to use eva: user-guide/how_to_use.md
- Tutorials: user-guide/tutorials.md
- Reference API:
- reference/index.md
- Interface: reference/interface.md
@@ -60,6 +61,6 @@ nav:
- Datasets:
- datasets/index.md
- BACH: datasets/bach.md
- CRC: datasets/crc.md
- PatchCamelyon: datasets/patch_camelyon.md
- TotalSegmentator: datasets/total_segmentator.md