We aim to provide support in NeMo Guardrails for a wide range of LLMs from different providers, with a focus on open models. However, due to the complexity of the tasks required for employing dialog rails and most of the predefined input and output rails (e.g. moderation or fact-checking), not all LLMs are capable enough to be used.
This document summarizes the evaluation experiments we have run to assess the performance of various LLMs for the different types of rails.
For more details about the evaluation of guardrails, including datasets and quantitative results, please read this document. The tools used for evaluation are described in the same file; for a summary of topics, read this section from the user guide. Any new LLM made available in Guardrails should be evaluated using at least this set of tools.
The following tables summarize the LLM support for the main features of NeMo Guardrails, focusing on the different rails available out of the box. If you want to use an LLM and cannot find a prompt for it in the prompts folder, please also check the configurations defined in the LLM examples.
Feature | gpt-3.5-turbo-instruct | text-davinci-003 | nemollm-43b | llama-2-13b-chat | falcon-7b-instruct | gpt-3.5-turbo | gpt-4 | gpt4all-13b-snoozy | vicuna-7b-v1.3 | mpt-7b-instruct | dolly-v2-3b | HF Pipeline model |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Dialog Rails | ✔️ (0.74) | ✔️ (0.83) | ✔️ (0.82) | ✔️ (0.77) | ✔️ (0.76) | ❗ (0.45) | ❗ | ❗ (0.54) | ❗ (0.54) | ❗ (0.50) | ❗ (0.40) | ❗ (DEPENDS ON MODEL) |
• Single LLM call | ✔️ (0.83) | ✔️ (0.81) | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
• Multi-step flow generation | EXPERIMENTAL | EXPERIMENTAL | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
Streaming | ✔️ | ✔️ | ✔️ | - | - | ✔️ | ✔️ | - | - | - | - | ✔️ |
Hallucination detection (SelfCheckGPT with AskLLM) | ✔️ | ✔️ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
AskLLM rails | ||||||||||||
• Jailbreak detection | ✔️ (0.88) | ✔️ (0.88) | ✔️ (0.86) | ❌ | ❌ | ✔️ (0.85) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
• Output moderation | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ (0.85) | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |
• Fact-checking | ✔️ (0.81) | ✔️ (0.82) | ✔️ (0.81) | ✔️ (0.80) | ❌ | ✔️ (0.83) | ❌ | ❌ | ❌ | ❌ | ❌ | ❗ (DEPENDS ON MODEL) |
AlignScore fact-checking (LLM independent) | ✔️ (0.89) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
ActiveFence moderation (LLM independent) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Llama Guard moderation (LLM independent) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Got It AI RAG TruthChecker (LLM independent) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Patronus Lynx RAG Hallucination detection (LLM independent) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
Table legend:
- ✔️ - Supported (The feature is fully supported by the LLM based on our experiments and tests)
- ❗ - Limited Support (Experiments and tests show that the LLM is under-performing for that feature)
- ❌ - Not Supported (Experiments show very poor performance or no experiments have been done for the LLM-feature pair)
- - - Not Applicable (e.g. the model itself supports streaming, but availability depends on how it is deployed)
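
As a rough illustration of how the legend above can be applied programmatically, the sketch below parses one row of the table into a support status and an optional score per model. The names `LEGEND`, `Cell`, `parse_cell` and `parse_row`, as well as the status labels, are hypothetical helpers introduced for this example; they are not part of NeMo Guardrails.

```python
import re
from typing import Dict, NamedTuple, Optional

# Symbols come from the table legend; the label strings are assumptions
# made for this example only.
LEGEND = {
    "\u2714": "supported",       # check mark
    "\u2757": "limited",         # exclamation mark
    "\u274c": "not supported",   # cross mark
    "-": "not applicable",
}


class Cell(NamedTuple):
    status: str
    score: Optional[float]


def parse_cell(text: str) -> Cell:
    """Split a cell such as '✔️ (0.74)' into a status and an optional score."""
    # Drop the emoji variation selector so both forms of a symbol match.
    cleaned = text.replace("\ufe0f", "").strip()
    match = re.search(r"\((\d+\.\d+)\)", cleaned)
    score = float(match.group(1)) if match else None
    symbol = cleaned.split(" ", 1)[0] if cleaned else ""
    # Cells outside the legend (e.g. EXPERIMENTAL) keep their raw text.
    status = LEGEND.get(symbol, cleaned or "unknown")
    return Cell(status, score)


def parse_row(header: str, row: str) -> Dict[str, Cell]:
    """Map each model name in the header to the parsed cell of one row."""
    names = [c.strip() for c in header.strip().strip("|").split("|")][1:]
    cells = row.strip().strip("|").split("|")[1:]
    return {name: parse_cell(cell) for name, cell in zip(names, cells)}


if __name__ == "__main__":
    # A three-column excerpt of the Dialog Rails row from the table.
    header = "Feature | gpt-3.5-turbo-instruct | llama-2-13b-chat | gpt-3.5-turbo |"
    row = "Dialog Rails | ✔️ (0.74) | ✔️ (0.77) | ❗ (0.45) |"
    for model, cell in parse_row(header, row).items():
        print(model, cell.status, cell.score)
```

A parsed row can then be filtered, for instance to keep only the models whose status is `"supported"` for a given feature. Cells that carry no score, such as those for streaming, yield `None` for the score.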
The performance numbers reported in the table above for each LLM-feature pair come from the following evaluations:
- the banking dataset evaluation for dialog (topical) rails
- fact-checking using the MSMARCO dataset and the moderation rails experiments

More details are available in the evaluation docs.