Merge pull request #85 from gomate-community/pipeline
Pipeline
yanqiangmiffy authored Jan 14, 2025
2 parents a28fff2 + d00b023 commit df8ad59
Showing 16 changed files with 776 additions and 696 deletions.
166 changes: 81 additions & 85 deletions README.md
@@ -1,95 +1,93 @@
# TrustRAG: The RAG Framework with Reliable Input, Trusted Output
A Configurable and Modular RAG Framework.

\[ English | [中文](README_zh.md) \]

[![Python](https://img.shields.io/badge/Python-3.10.0-3776AB.svg?style=flat)](https://www.python.org)
![workflow status](https://github.com/gomate-community/rageval/actions/workflows/makefile.yml/badge.svg)
[![codecov](https://codecov.io/gh/gomate-community/TrustRAG/graph/badge.svg?token=eG99uSM8mC)](https://codecov.io/gh/gomate-community/TrustRAG)
[![pydocstyle](https://img.shields.io/badge/pydocstyle-enabled-AD4CD3)](http://www.pydocstyle.org/en/stable/)
[![PEP8](https://img.shields.io/badge/code%20style-pep8-orange.svg)](https://www.python.org/dev/peps/pep-0008/)

## 🔥Introduction to TrustRAG

TrustRAG is a configurable and modular Retrieval-Augmented Generation (RAG) framework designed to provide **reliable input and trusted output**, ensuring users can obtain high-quality and trustworthy results in retrieval-based question-answering scenarios.

The core design of TrustRAG lies in its **high configurability and modularity**, allowing users to flexibly adjust and optimize each component according to specific needs to meet the requirements of various application scenarios.

## 🔨TrustRAG Framework

![framework.png](resources%2Fframework.png)

## ✨Key Features

**“Reliable input, Trusted output”**

## 🎉 Changelog

- Added multimodal RAG question answering via the **GLM-4V-Flash** API; see [trustrag/applications/rag_multimodal.py](trustrag/applications/rag_multimodal.py)
- Packaged TrustRAG for installation via pip or from source
- Added [MinerU document parsing](https://github.com/gomate-community/TrustRAG/blob/main/docs/mineru.md): a one-stop, open-source, high-quality data extraction tool that supports PDF, web page, and multi-format e-book extraction `[20240907]`
- RAPTOR: recursive tree retriever implementation
- Modular parsing for multiple file types; currently supported: `text`, `docx`, `ppt`, `excel`, `html`, `pdf`, `md`, etc.
- Optimized `DenseRetriever`: supports index building, incremental appending, and index saving (documents, vectors, and the index are all saved)
- Added BGE reranking to `ReRank` and `HyDE` to the Rewriter
- Added BgeJudge to `Judge` to determine whether an article is useful `20240711`

## 🚀Quick Start

## 🛠️ Installation

### Method 1: Install via `pip`

1. Create a conda environment (optional)

```shell
conda create -n trustrag python=3.9
conda activate trustrag
```

2. Install TrustRAG using `pip`

```shell
pip install trustrag
```

### Method 2: Install from source

1. Download the source code

```shell
git clone https://github.com/gomate-community/TrustRAG.git
```

2. Install dependencies

```shell
pip install -e .
```

## 🚀 Quick Start

### 1 Module Overview📝

```text
├── applications
├── modules
| ├── citation: Answer and evidence citation
| ├── document: Document parsing and chunking, supports multiple document types
| ├── generator: Generator
| ├── judger: Document selection
| ├── prompt: Prompts
| ├── refiner: Information summarization
| ├── reranker: Ranking module
| ├── retrieval: Retrieval module
| └── rewriter: Rewriting module
```


### 2 Import Modules

```python
import pickle
# ... (further imports collapsed in the diff view) ...
from trustrag.modules.retrieval.dense_retriever import DenseRetrieverConfig
from trustrag.modules.retrieval.hybrid_retriever import HybridRetriever, HybridRetrieverConfig
```


### 3 Document Parsing and Chunking

```python
def generate_chunks():
    tp = TextParser()  # Parser for txt-format files
    tc = TextChunker()
    paragraphs = tp.parse(r'H:/2024-Xfyun-RAG/data/corpus.txt', encoding="utf-8")
    print(len(paragraphs))
    # ... (lines collapsed in the diff view) ...
    with open(f'{PROJECT_BASE}/output/chunks.pkl', 'wb') as f:
        pickle.dump(chunks, f)
```
> Each line in `corpus.txt` is one news paragraph; you can customize how paragraphs are read. The corpus comes from the [Large Model RAG Intelligent Question-Answering Challenge](https://challenge.xfyun.cn/topic/info?type=RAG-quiz&option=zpsm).

`TextChunker` splits text into chunks. It uses [InfiniFlow/huqie](https://huggingface.co/InfiniFlow/huqie) as its tokenizer for text retrieval, which makes it suitable for RAG scenarios.
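
As a rough illustration of what chunking does, the sketch below greedily packs paragraphs into size-limited chunks. It is not `TextChunker`'s actual implementation (which relies on the huqie tokenizer); `chunk_paragraphs` and `max_chars` are hypothetical names used only for this example.

```python
def chunk_paragraphs(paragraphs, max_chars=200):
    """Greedily pack consecutive paragraphs into chunks.

    A chunk stays within max_chars unless a single paragraph is longer,
    in which case that paragraph becomes a chunk of its own.
    """
    chunks, current = [], ""
    for paragraph in paragraphs:
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty lines
        candidate = f"{current}\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


demo = ["first news item", "", "second news item", "x" * 50]
demo_chunks = chunk_paragraphs(demo, max_chars=40)
print(len(demo_chunks))
```

Packing by character count keeps the example dependency-free; a token-based limit would probably be closer to what a retrieval-oriented chunker needs.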

### 4 Building the Retriever

**Configuring the Retriever:**

Below is a reference configuration for a hybrid retriever `HybridRetriever`, where `HybridRetrieverConfig` is composed of `BM25RetrieverConfig` and `DenseRetrieverConfig`.

```python
# BM25 and Dense Retriever configurations
# ... (lines collapsed in the diff view) ...
config_info = dense_config.log_config()
print(config_info)
# Hybrid Retriever configuration
# BM25 and dense scores are not on the same scale, so merging them with weights is recommended
hybrid_config = HybridRetrieverConfig(
    bm25_config=bm25_config,
    dense_config=dense_config,
    bm25_weight=0.7,  # Weight of the BM25 retrieval results
    dense_weight=0.3  # Weight of the dense retrieval results
)
hybrid_retriever = HybridRetriever(config=hybrid_config)
```
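
To make the role of `bm25_weight` and `dense_weight` concrete, here is a minimal sketch of weighted score fusion. It assumes min-max normalization before the weighted sum, which is one common choice and not necessarily what `HybridRetriever` does internally; `min_max` and `fuse` are hypothetical helpers.

```python
def min_max(scores):
    """Rescale a {doc_id: score} mapping to the [0, 1] range."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def fuse(bm25_scores, dense_scores, bm25_weight=0.7, dense_weight=0.3):
    """Combine two score mappings into one ranking, best document first."""
    bm25_norm, dense_norm = min_max(bm25_scores), min_max(dense_scores)
    doc_ids = set(bm25_norm) | set(dense_norm)
    fused = {
        # A document missing from one retriever contributes 0 for that retriever.
        d: bm25_weight * bm25_norm.get(d, 0.0) + dense_weight * dense_norm.get(d, 0.0)
        for d in doc_ids
    }
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)


bm25 = {"a": 12.0, "b": 4.0, "c": 8.0}   # raw BM25 scores (unbounded)
dense = {"a": 0.2, "b": 0.9, "d": 0.55}  # raw similarity scores
ranking = fuse(bm25, dense)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Normalizing first matters because raw BM25 scores are unbounded while similarity scores are usually confined to a small range, so an unnormalized weighted sum would be dominated by BM25.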

**Building the Index:**

```python
# Build the index
hybrid_retriever.build_from_texts(corpus)
# Save the index
hybrid_retriever.save_index()
```

Once built, the index can be reused. On later runs, skip the steps above and load it directly:

```python
hybrid_retriever.load_index()
```
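
The build-once, load-later pattern can be sketched generically as follows. This example uses `pickle` and a toy dictionary in place of a real index; it does not reflect how `save_index()` and `load_index()` store data, and `load_or_build` is a hypothetical helper.

```python
import os
import pickle
import tempfile


def load_or_build(path, build_fn):
    """Return the object cached at `path`, building and saving it on first use."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj


calls = []


def build_toy_index():
    calls.append("build")  # record how often the expensive step runs
    return {"alipay": [0, 2], "payment": [1]}


with tempfile.TemporaryDirectory() as tmp:
    index_path = os.path.join(tmp, "toy_index.pkl")
    first = load_or_build(index_path, build_toy_index)   # builds and saves
    second = load_or_build(index_path, build_toy_index)  # loads from disk

print(first == second, len(calls))
```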

**Retrieval Test:**

```python
query = "Alipay"
results = hybrid_retriever.retrieve(query, top_k=10)
print(len(results))
# Output results
for result in results:
    print(f"Text: {result['text']}, Score: {result['score']}")
```

### 5 Ranking Model
```python
reranker_config = BgeRerankerConfig(
    model_name_or_path=reranker_model_path
)
bge_reranker = BgeReranker(reranker_config)
```
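
Conceptually, a reranker rescores the retrieved candidates against the query and reorders them. The sketch below shows that flow with a toy token-overlap score standing in for the BGE model; `overlap_score` and `rerank` are hypothetical and are not part of `BgeReranker`.

```python
def overlap_score(query, text):
    """Toy relevance score: fraction of query tokens that also appear in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0


def rerank(query, documents, top_k=None, score_fn=overlap_score):
    """Score every candidate against the query and return them best first."""
    scored = [{"text": doc, "score": score_fn(query, doc)} for doc in documents]
    scored.sort(key=lambda item: item["score"], reverse=True)  # stable for ties
    return scored if top_k is None else scored[:top_k]


candidates = [
    "mobile payment with alipay",
    "weather report for today",
    "alipay launches new payment feature",
]
top = rerank("alipay payment", candidates, top_k=2)
for item in top:
    print(item["score"], item["text"])
```

In practice the scoring function would be a cross-encoder that reads the query and the document together, which is slower than retrieval but usually more precise on a short candidate list.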
### 6 Generator Configuration
```python
glm4_chat = GLM4Chat(llm_model_path)
```

### 7 Retrieval Question-Answering

```python
# ==================== Retrieval Question-Answering ====================
test = pd.read_csv(test_path)
answers = []
for question in tqdm(test['question'], total=len(test)):
    # ... (lines collapsed in the diff view) ...
        documents=[doc['text'] for idx, doc in enumerate(search_docs)]
    )
    # print(search_docs)
    content = '\n'.join([f'Information[{idx}]:' + doc['text'] for idx, doc in enumerate(search_docs)])
    answer = glm4_chat.chat(prompt=question, content=content)
    answers.append(answer[0])
    print(question)
    # ... (lines collapsed in the diff view) ...
test['answer'] = answers
test[['answer']].to_csv(f'{PROJECT_BASE}/output/gomate_baseline.csv', index=False)
```
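
The context-assembly step in the loop above can be isolated as a small function. This is a sketch that mirrors the `Information[idx]:` format used in the README; `build_context` and its `max_docs` parameter are hypothetical additions for illustration.

```python
def build_context(search_docs, max_docs=None):
    """Join retrieved documents into one numbered context string for the generator."""
    docs = search_docs if max_docs is None else search_docs[:max_docs]
    return "\n".join(
        f"Information[{idx}]:" + doc["text"] for idx, doc in enumerate(docs)
    )


docs = [
    {"text": "Alipay is a payment platform.", "score": 0.9},
    {"text": "BM25 is a ranking function.", "score": 0.4},
]
context = build_context(docs)
print(context)
```

Numbering each passage gives the generator stable labels to refer to, which is useful when answers need to cite their evidence.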

## 🔧Customizing RAG

> Building a custom RAG application
```python
import os
# ... (lines collapsed in the diff view) ...
class RagApplication():
    # ... (lines collapsed in the diff view) ...
    pass
```

The module can be found at [rag.py](trustrag/applications/rag.py)

### 🌐Try the RAG Demo

You can configure the local model path

```python
# Modify to your own configuration!!!
app_config = ApplicationConfig()
app_config.docs_path = "./docs/"
app_config.llm_model_path = "/data/users/searchgpt/pretrained_models/chatglm3-6b/"
# ... (lines collapsed in the diff view) ...
application.init_vector_store()
```

```shell
python app.py
```

Open in a browser: [127.0.0.1:7860](http://127.0.0.1:7860)
![trustrag_demo.png](resources%2Ftrustrag_demo.png)

App backend logs:
![app_logging3.png](resources%2Fapp_logging3.png)

## ⭐️ Star History

[![Star History Chart](https://api.star-history.com/svg?repos=gomate-community/TrustRAG&type=Date)](https://star-history.com/#gomate-community/TrustRAG&Date)

## Research and Development Team

This project was developed by the [`GoMate`](https://github.com/gomate-community) team from the Key Laboratory of Network Data Science and Technology, under the guidance of researchers Jiafeng Guo and Yixing Fan.

## Technical Exchange Group

Suggestions and bad-case reports are welcome. Join the group to discuss in real time; PRs are welcome too.<br>

<img src="https://raw.githubusercontent.com/gomate-community/TrustRAG/pipeline/resources/trustrag_group.png" width="180px">

If the group is full, or to discuss collaboration, please contact:

<img src="https://raw.githubusercontent.com/yanqiangmiffy/Chinese-LangChain/master/images/personal.jpg" width="180px">

## Acknowledgments
> Thanks to the following open-source projects for their support and contributions:
- Document parsing: [infiniflow/ragflow](https://github.com/infiniflow/ragflow/blob/main/deepdoc/README.md)
- PDF file parsing: [opendatalab/MinerU](https://github.com/opendatalab/MinerU)
23 changes: 0 additions & 23 deletions README_en.md

This file was deleted.
