diff --git a/README.md b/README.md
index 00c86d4..b5bada2 100644
--- a/README.md
+++ b/README.md
@@ -1,5 +1,7 @@
 # JudgeIt - An Auto Eval Framework for Scaling Gen AI Pipelines
+The single biggest challenge in scaling any GenAI solution (such as RAG, multi-turn conversations, or query rewriting) from PoC to production is the last-mile problem of evaluation. Statistical metrics (such as BLEU, ROUGE, or METEOR) have proven ineffective at accurately judging the quality of GenAI outputs, leaving human evaluation as the final step. However, human evaluation is slow and expensive, making it difficult to scale quickly. This is where JudgeIt comes in: a framework designed to mimic human judgment with high precision and recall.
+
 Accuracy and reliability are paramount when building Generative AI pipelines. Therefore, it is critical to establish robust evaluation frameworks to test these pipelines as they are built. While human evaluation may be accurate, it often takes a high degree of manual effort and is difficult to scale efficiently. JudgeIt is an automated evaluation framework designed to accurately and efficiently assess various Generative AI pipelines, including RAG, multi-turn query rewriting, text-to-SQL conversion, and more. This service enables users to conduct batch evaluations across these different Generative AI pipelines. Users can input datasets containing generated text along with corresponding golden text. JudgeIt then employs an LLM as a judge to perform similarity evaluations between these inputs, providing an accurate assessment of the AI pipeline's performance.