Merge pull request #85 from gomate-community/pipeline

Pipeline
gomate-community · Jan 14, 2025 · df8ad59
2 parents a28fff2 + d00b023
commit df8ad59

Showing 16 changed files with 776 additions and 696 deletions.
diff --git a/README.md b/README.md
@@ -1,95 +1,93 @@
-# TrustRAG
+# TrustRAG: The RAG Framework with Reliable Input, Trusted Output
+A Configurable and Modular RAG Framework.
+
+\[ English | [中文](README_zh.md) \]
 
-可配置的模块化RAG框架。
 
 [![Python](https://img.shields.io/badge/Python-3.10.0-3776AB.svg?style=flat)](https://www.python.org)
 ![workflow status](https://github.com/gomate-community/rageval/actions/workflows/makefile.yml/badge.svg)
 [![codecov](https://codecov.io/gh/gomate-community/TrustRAG/graph/badge.svg?token=eG99uSM8mC)](https://codecov.io/gh/gomate-community/TrustRAG)
 [![pydocstyle](https://img.shields.io/badge/pydocstyle-enabled-AD4CD3)](http://www.pydocstyle.org/en/stable/)
 [![PEP8](https://img.shields.io/badge/code%20style-pep8-orange.svg)](https://www.python.org/dev/peps/pep-0008/)
 
-## 🔥TrustRAG 简介
 
-TrustRAG是一款配置化模块化的Retrieval-Augmented Generation (RAG) 框架，旨在提供**可靠的输入与可信的输出**
-，确保用户在检索问答场景中能够获得高质量且可信赖的结果。
+## 🔥Introduction to TrustRAG
 
-TrustRAG框架的设计核心在于其**高度的可配置性和模块化**，使得用户可以根据具体需求灵活调整和优化各个组件，以满足各种应用场景的要求。
+TrustRAG is a configurable and modular Retrieval-Augmented Generation (RAG) framework designed to provide **reliable input and trusted output**, ensuring users can obtain high-quality and trustworthy results in retrieval-based question-answering scenarios.
 
-## 🔨TrustRAG 框架
+The core design of TrustRAG lies in its **high configurability and modularity**, allowing users to flexibly adjust and optimize each component according to specific needs to meet the requirements of various application scenarios.
 
-![framework.png](resources%2Fframework.png)
+## 🔨TrustRAG Framework
 
-## ✨主要特色
+![framework.png](resources%2Fframework.png)
 
-**“Reliable input,Trusted output”**
+## ✨Key Features
 
-可靠的输入，可信的输出
+**“Reliable input, Trusted output”**
 
-## 🎉 更新记录
+## 🎉 Changelog
 
-- 支持多模态RAG问答，API使用**GLM-4V-Flash**，代码见[trustrag/applications/rag_multimodal.py](trustrag/applications/rag_multimodal.py)
-- TrustRAG 打包构建，支持pip和source两种方式安装
-- 添加[MinerU文档解析](https://github.com/gomate-community/TrustRAG/blob/main/docs/mineru.md)
-  ：一站式开源高质量数据提取工具，支持PDF/网页/多格式电子书提取`[20240907] `
-- RAPTOR:递归树检索器实现
-- 支持多种文件解析并且模块化目前支持解析的文件类型包括：`text`,`docx`,`ppt`,`excel`,`html`,`pdf`,`md`等
-- 优化了`DenseRetriever`，支持索引构建，增量追加以及索引保存，保存内容包括文档、向量以及索引
-- 添加`ReRank`的BGE排序、Rewriter的`HyDE`
-- 添加`Judge`的BgeJudge,判断文章是否有用 `20240711`
+- Added multimodal RAG question answering through the **GLM-4V-Flash** API; code available at [trustrag/applications/rag_multimodal.py](trustrag/applications/rag_multimodal.py)
+- Packaged TrustRAG for installation via pip or from source
+- Added [MinerU document parsing](https://github.com/gomate-community/TrustRAG/blob/main/docs/mineru.md): a one-stop, open-source, high-quality data extraction tool that supports PDF, webpage, and multi-format e-book extraction `[20240907]`
+- RAPTOR: recursive tree retriever implementation
+- Modular parsing for multiple file types; currently supported types include `text`, `docx`, `ppt`, `excel`, `html`, `pdf`, `md`, etc.
+- Optimized `DenseRetriever` to support index building, incremental appending, and index saving (documents, vectors, and indexes are all saved)
+- Added BGE reranking to `ReRank` and `HyDE` to the Rewriter
+- Added `BgeJudge` to `Judge` to determine whether a document is useful `20240711`
 
-## 🚀快速上手
+## 🚀Quick Start
 
-## 🛠️ 安装
+## 🛠️ Installation
 
-### 方法1：使用`pip`安装
+### Method 1: Install via `pip`
 
-1. 创建conda环境（可选）
+1. Create a conda environment (optional)
 
-```sehll
+```shell
 conda create -n trustrag python=3.9
 conda activate trustrag
 ```
 
-2. 使用`pip`安装依赖
+2. Install dependencies using `pip`
 
-```sehll
+```shell
 pip install trustrag   
 ```
 
-### 方法2：源码安装
+### Method 2: Install from source
 
-1. 下载源码
+1. Download the source code
 
 ```shell
 git clone https://github.com/gomate-community/TrustRAG.git
 ```
 
-2. 安装依赖
+2. Install dependencies
 
 ```shell
 pip install -e . 
 ```
 
-## 🚀 快速上手
+## 🚀 Quick Start
 
-### 1 模块介绍📝
+### 1 Module Overview📝
 
 ```text
 ├── applications
 ├── modules
-|      ├── citation:答案与证据引用
-|      ├── document：文档解析与切块，支持多种文档类型
-|      ├── generator：生成器
-|      ├── judger：文档选择
-|      ├── prompt：提示语
-|      ├── refiner：信息总结
-|      ├── reranker：排序模块
-|      ├── retrieval：检索模块
-|      └── rewriter：改写模块
+|      ├── citation: Answer and evidence citation
+|      ├── document: Document parsing and chunking, supports multiple document types
+|      ├── generator: Generator
+|      ├── judger: Document selection
+|      ├── prompt: Prompts
+|      ├── refiner: Information summarization
+|      ├── reranker: Ranking module
+|      ├── retrieval: Retrieval module
+|      └── rewriter: Rewriting module
 ```
 
-
-### 2 导入模块
+### 2 Import Modules
 
 ```python
 import pickle
@@ -106,12 +104,11 @@ from trustrag.modules.retrieval.dense_retriever import DenseRetrieverConfig
 from trustrag.modules.retrieval.hybrid_retriever import HybridRetriever, HybridRetrieverConfig
 ```
 
-
-### 3 文档解析以及切片
+### 3 Document Parsing and Chunking
 
 ```text
 def generate_chunks():
-    tp = TextParser()# 代表txt格式解析
+    tp = TextParser()  # Represents txt format parsing
     tc = TextChunker()
     paragraphs = tp.parse(r'H:/2024-Xfyun-RAG/data/corpus.txt', encoding="utf-8")
     print(len(paragraphs))
@@ -123,16 +120,15 @@ def generate_chunks():
     with open(f'{PROJECT_BASE}/output/chunks.pkl', 'wb') as f:
         pickle.dump(chunks, f)
 ```
->corpus.txt每行为一段新闻，可以自行选取paragraph读取的逻辑,语料来自[大模型RAG智能问答挑战赛](https://challenge.xfyun.cn/topic/info?type=RAG-quiz&option=zpsm)
+> Each line in `corpus.txt` is one news paragraph; you can customize how paragraphs are read. The corpus comes from the [Large Model RAG Intelligent Question-Answering Challenge](https://challenge.xfyun.cn/topic/info?type=RAG-quiz&option=zpsm).
 
-`TextChunker`为文本块切块程序，主要特点使用[InfiniFlow/huqie](https://huggingface.co/InfiniFlow/huqie)作为文本检索的分词器，适合RAG场景。
+`TextChunker` is the text chunker. It uses [InfiniFlow/huqie](https://huggingface.co/InfiniFlow/huqie) as its tokenizer for text retrieval, which makes it well suited to RAG scenarios.
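
To make the chunking step concrete, here is a minimal sketch of greedy paragraph packing. It is not TrustRAG's `TextChunker`: the helper name `chunk_paragraphs` and the `max_chars` parameter are hypothetical, and length is measured in raw characters as an assumption, whereas the README says `TextChunker` relies on a tokenizer.

```python
from typing import List


def chunk_paragraphs(paragraphs: List[str], max_chars: int = 200) -> List[str]:
    """Greedily pack consecutive paragraphs into chunks of at most max_chars.

    Hypothetical helper for illustration only; it is not TrustRAG's
    TextChunker and it counts characters instead of tokens.
    """
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue  # skip empty lines
        candidate = f"{current}\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate  # the paragraph still fits in the open chunk
        else:
            if current:
                chunks.append(current)  # close the open chunk
            # Hard-split a paragraph that is itself longer than max_chars.
            while len(para) > max_chars:
                chunks.append(para[:max_chars])
                para = para[max_chars:]
            current = para
    if current:
        chunks.append(current)
    return chunks
```

Calling `chunk_paragraphs(paragraphs, max_chars=200)` on the parsed paragraphs would give a list of strings that can be pickled the same way as `chunks` above.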
 
+### 4 Building the Retriever
 
-### 4 构建检索器
+**Configuring the Retriever:**
 
-**配置检索器：**
-
-下面是一个混合检索器`HybridRetriever`配置参考，其中`HybridRetrieverConfig`需要由`BM25RetrieverConfig`和`DenseRetrieverConfig`配置构成。
+Below is a reference configuration for a hybrid retriever `HybridRetriever`, where `HybridRetrieverConfig` is composed of `BM25RetrieverConfig` and `DenseRetrieverConfig`.
 
 ```python
 # BM25 and Dense Retriever configurations
@@ -152,57 +148,57 @@ dense_config = DenseRetrieverConfig(
 config_info = dense_config.log_config()
 print(config_info)
 # Hybrid Retriever configuration
-# 由于分数框架不在同一维度，建议可以合并
+# BM25 and dense scores are not on the same scale, so merging them with weights is recommended
 hybrid_config = HybridRetrieverConfig(
     bm25_config=bm25_config,
     dense_config=dense_config,
-    bm25_weight=0.7,  # bm25检索结果权重
-    dense_weight=0.3  # dense检索结果权重
+    bm25_weight=0.7,  # BM25 retrieval result weight
+    dense_weight=0.3  # Dense retrieval result weight
 )
 hybrid_retriever = HybridRetriever(config=hybrid_config)
 ```
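
One common way to combine scores that live on different scales is to min-max normalize each result set and then take a weighted sum. The sketch below illustrates that idea with the 0.7/0.3 weights from the configuration; it is an assumption about how such a merge can work, not a description of `HybridRetriever`'s internals, and the names `min_max_normalize` and `fuse_scores` are hypothetical.

```python
from typing import Dict, List, Tuple


def min_max_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1]; if all scores are equal, map them to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse_scores(
    bm25: Dict[str, float],
    dense: Dict[str, float],
    bm25_weight: float = 0.7,
    dense_weight: float = 0.3,
) -> List[Tuple[str, float]]:
    """Weighted sum of normalized scores (hypothetical helper).

    A document missing from one result set contributes 0.0 for that set.
    """
    bm25_n = min_max_normalize(bm25)
    dense_n = min_max_normalize(dense)
    fused = {
        doc_id: bm25_weight * bm25_n.get(doc_id, 0.0)
        + dense_weight * dense_n.get(doc_id, 0.0)
        for doc_id in set(bm25_n) | set(dense_n)
    }
    # Highest fused score first; ties are broken by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

Normalizing first keeps an unbounded BM25 score from drowning out a cosine similarity that stays within a small range, so the weights express the intended balance.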
 
-**构建索引：**
+**Building the Index:**
 
 ````python
-# 构建索引
+# Build the index
 hybrid_retriever.build_from_texts(corpus)
-# 保存索引
+# Save the index
 hybrid_retriever.save_index()
 ````
 
-如果构建好索引之后，可以多次使用，直接跳过上面步骤，加载索引
+If the index is already built, you can skip the above steps and directly load the index:
 ```text
 hybrid_retriever.load_index()
 ```
 
-**检索测试：**
+**Retrieval Test:**
 
 ```python
-query = "支付宝"
+query = "Alipay"
 results = hybrid_retriever.retrieve(query, top_k=10)
 print(len(results))
 # Output results
 for result in results:
     print(f"Text: {result['text']}, Score: {result['score']}")
 ```
 
-### 5 排序模型
+### 5 Ranking Model
 ```python
 reranker_config = BgeRerankerConfig(
     model_name_or_path=reranker_model_path
 )
 bge_reranker = BgeReranker(reranker_config)
 ```
-### 6 生成器配置
+### 6 Generator Configuration
 ```python
 glm4_chat = GLM4Chat(llm_model_path)
 ```
 
-### 6 检索问答
+### 7 Retrieval Question-Answering
 
 ```python
-# ====================检索问答=========================
+# ====================Retrieval Question-Answering=========================
 test = pd.read_csv(test_path)
 answers = []
 for question in tqdm(test['question'], total=len(test)):
@@ -212,7 +208,7 @@ for question in tqdm(test['question'], total=len(test)):
         documents=[doc['text'] for idx, doc in enumerate(search_docs)]
     )
     # print(search_docs)
-    content = '\n'.join([f'信息[{idx}]：' + doc['text'] for idx, doc in enumerate(search_docs)])
+    content = '\n'.join([f'Information[{idx}]: ' + doc['text'] for idx, doc in enumerate(search_docs)])
     answer = glm4_chat.chat(prompt=question, content=content)
     answers.append(answer[0])
     print(question)
@@ -223,9 +219,9 @@ test['answer'] = answers
 test[['answer']].to_csv(f'{PROJECT_BASE}/output/gomate_baseline.csv', index=False)
 ```
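
The retrieve, build-context, generate flow in the loop above can be sketched in a self-contained form. Everything below is a stand-in for illustration: `fake_retrieve` and `fake_generate` are hypothetical placeholders for the retriever and the chat model, and `answer_question` is not part of TrustRAG.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]


def build_context(docs: List[Doc]) -> str:
    """Number each retrieved passage so the answer can refer back to it."""
    return "\n".join(
        f"Information[{idx}]: {doc['text']}" for idx, doc in enumerate(docs)
    )


def answer_question(
    question: str,
    retrieve: Callable[[str, int], List[Doc]],
    generate: Callable[[str, str], str],
    top_k: int = 5,
) -> str:
    """Retrieve passages, join them into a context, then generate an answer."""
    docs = retrieve(question, top_k)
    context = build_context(docs)
    return generate(question, context)


def fake_retrieve(query: str, top_k: int) -> List[Doc]:
    """Hypothetical retriever: scores passages by how often the query occurs."""
    corpus = ["Alipay is a payment platform.", "BM25 is a ranking function."]
    hits = [
        {"text": text, "score": float(text.lower().count(query.lower()))}
        for text in corpus
    ]
    hits.sort(key=lambda d: d["score"], reverse=True)
    return hits[:top_k]


def fake_generate(question: str, context: str) -> str:
    """Hypothetical generator: echoes the prompt instead of calling a model."""
    return f"Q: {question}\n{context}"
```

Passing the retriever and generator as callables keeps the pipeline testable without loading a model; real components can be swapped in as long as they accept the same arguments.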
 
-## 🔧定制化RAG
+## 🔧Customizing RAG
 
-> 构建自定义的RAG应用
+> Building a custom RAG application
 
 ```python
 import os
@@ -253,14 +249,14 @@ class RagApplication():
         pass
 ```
 
-模块可见[rag.py](trustrag/applications/rag.py)
+The module can be found at [rag.py](trustrag/applications/rag.py)
 
-### 🌐体验RAG效果
+### 🌐Try the RAG Demo
 
-可以配置本地模型路径
+You can configure the local model path
 
 ```text
-# 修改成自己的配置！！！
+# Modify to your own configuration!!!
 app_config = ApplicationConfig()
 app_config.docs_path = "./docs/"
 app_config.llm_model_path = "/data/users/searchgpt/pretrained_models/chatglm3-6b/"
@@ -284,31 +280,31 @@ application.init_vector_store()
 python app.py
 ```
 
-浏览器访问：[127.0.0.1:7860](127.0.0.1:7860)
+Access via browser: [127.0.0.1:7860](http://127.0.0.1:7860)
 ![trustrag_demo.png](resources%2Ftrustrag_demo.png)
 
-app后台日志：
+App backend logs:
 ![app_logging3.png](resources%2Fapp_logging3.png)
 
 ## ⭐️ Star History
 
 [![Star History Chart](https://api.star-history.com/svg?repos=gomate-community/TrustRAG&type=Date)](https://star-history.com/#gomate-community/TrustRAG&Date)
 
-## 研究与开发团队
-
-本项目由网络数据科学与技术重点实验室[`GoMate`](https://github.com/gomate-community)团队完成，团队指导老师为郭嘉丰、范意兴研究员。
+## Research and Development Team
 
-## 技术交流群
+This project was developed by the [`GoMate`](https://github.com/gomate-community) team at the Key Laboratory of Network Data Science and Technology, under the guidance of researchers Jiafeng Guo and Yixing Fan.
 
-欢迎多提建议、Bad cases，欢迎进群及时交流，也欢迎大家多提PR</br>
+## Technical Exchange Group
 
-<img src="https://github.com/gomate-community/TrustRAG/blob/pipeline/resources/wechat.png" width="180px" height="270px">
+Suggestions, bad-case reports, and PRs are all welcome. Join the group to discuss in real time.<br>
 
+<img src="https://raw.githubusercontent.com/gomate-community/TrustRAG/pipeline/resources/trustrag_group.png" width="180px">
 
-群满或者合作交流可以联系：
+If the group is full, or for collaboration inquiries, please contact:
 
 <img src="https://raw.githubusercontent.com/yanqiangmiffy/Chinese-LangChain/master/images/personal.jpg" width="180px">
 
-## 致谢
-- 文档解析：[infiniflow/ragflow](https://github.com/infiniflow/ragflow/blob/main/deepdoc/README.md)
-- PDF文件解析[opendatalab/MinerU](https://github.com/opendatalab/MinerU)
+## Acknowledgments
+> This project thanks the following open-source projects for their support and contributions:
+- Document parsing: [infiniflow/ragflow](https://github.com/infiniflow/ragflow/blob/main/deepdoc/README.md)
+- PDF file parsing: [opendatalab/MinerU](https://github.com/opendatalab/MinerU)
diff --git a/README_en.md b/README_en.md