一个可将HtmlRAG应用于你自己的检索增强生成(RAG)系统的工具包。
🔔重要提示:
- 类 GenHTMLPruner 的参数 max_node_words 在 v0.1.0 版本中被移除。
- 如果从 htmlrag v0.0.4 升级到 v0.0.5,请下载最新的模型文件,它们位于 modeling_llama.py 和 modeling_phi3.py。
使用pip安装该软件包:
pip install htmlrag
或者从源代码进行安装:
pip install -e .
# Example: clean a raw HTML document. As the expected output below shows,
# clean_html drops the <script>, the HTML comment, and the empty <p> while
# keeping the meaningful text content.
from htmlrag import clean_html
question = "When was the bellagio in las vegas built?"
html = """
<html>
<head>
<h1>Bellagio Hotel in Las</h1>
</head>
<body>
<p class="class0">The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
</body>
<div>
<div>
<p>Some other text</p>
<p>Some other text</p>
</div>
</div>
<p class="class1"></p>
<!-- Some comment -->
<script type="text/javascript">
document.write("Hello World!");
</script>
</html>
"""
# Alternatively, you can read several HTML files and merge them:
# html_files=["/path/to/html/file1.html", "/path/to/html/file2.html"]
# htmls=[open(file).read() for file in html_files]
# html = "\n".join(htmls)
simplified_html = clean_html(html)
print(simplified_html)
# <html>
# <h1>Bellagio Hotel in Las</h1>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# </html>
示例中的HTML文档相当简短。现实世界中的HTML文档可能更长、更复杂。为了处理这类情况,我们可以配置以下参数:
# Maximum number of words per node when building the block tree for pruning
# with the embedding model.
MAX_NODE_WORDS_EMBED = 10
# MAX_NODE_WORDS_EMBED = 256 # recommended setting for real-world HTML documents
# Maximum number of tokens in the output HTML document after pruning with the
# embedding model.
MAX_CONTEXT_WINDOW_EMBED = 60
# MAX_CONTEXT_WINDOW_EMBED = 6144 # recommended setting for real-world HTML documents
# Maximum number of words per node when building the block tree for pruning
# with the generative model.
MAX_NODE_WORDS_GEN = 5
# MAX_NODE_WORDS_GEN = 128 # recommended setting for real-world HTML documents
# Maximum number of tokens in the output HTML document after pruning with the
# generative model.
MAX_CONTEXT_WINDOW_GEN = 32
# MAX_CONTEXT_WINDOW_GEN = 4096 # recommended setting for real-world HTML documents
from htmlrag import build_block_tree

# Build the block tree the pruners operate on. Each entry is a tuple of
# (block content, block path, is-leaf flag), as the printed output shows.
# block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=MAX_NODE_WORDS_EMBED)
block_tree, simplified_html = build_block_tree(simplified_html, max_node_words=MAX_NODE_WORDS_GEN, zh_char=True)  # for Chinese text
for block in block_tree:
    # NOTE: the loop body was un-indented in the original snippet, which is a
    # SyntaxError when run; the indentation is restored here.
    print("Block Content: ", block[0])
    print("Block Path: ", block[1])
    print("Is Leaf: ", block[2])
    print("")
# Block Content:  <h1>Bellagio Hotel in Las</h1>
# Block Path:  ['html', 'title']
# Is Leaf:  True
#
# Block Content:  <div>
# <p>Some other text</p>
# <p>Some other text</p>
# </div>
# Block Path:  ['html', 'div']
# Is Leaf:  True
#
# Block Content:  <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# Block Path:  ['html', 'p']
# Is Leaf:  True
from htmlrag import EmbedHTMLPruner
embed_model = "BAAI/bge-large-zh"
# Query instruction prepended for retrieval. This is a runtime string passed to
# the embedder — deliberately kept in Chinese, presumably to match the Chinese
# bge embedding model above.
query_instruction_for_retrieval = "为这个句子生成表示以用于检索相关文章:"
embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=True,
query_instruction_for_retrieval=query_instruction_for_retrieval)
# Alternatively, you can point at a remote TEI (text-embeddings-inference) model;
# see https://github.com/huggingface/text-embeddings-inference.
# tei_endpoint="http://YOUR_TEI_ENDPOINT"
# embed_html_pruner = EmbedHTMLPruner(embed_model=embed_model, local_inference=False, query_instruction_for_retrieval = query_instruction_for_retrieval, endpoint=tei_endpoint)
# Rank the blocks by embedding similarity against the question.
block_rankings = embed_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)
# [2, 0, 1]
# Alternatively, you can rank the blocks with BM25:
from htmlrag import BM25HTMLPruner
bm25_html_pruner = BM25HTMLPruner()
block_rankings = bm25_html_pruner.calculate_block_rankings(question, simplified_html, block_tree)
print(block_rankings)
# [2, 0, 1]
from transformers import AutoTokenizer
# Tokenizer of the downstream chat model — used to measure the pruned
# document against the MAX_CONTEXT_WINDOW_EMBED token budget.
chat_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-70B-Instruct")
# Prune the HTML, keeping the highest-ranked blocks that fit in the budget.
pruned_html = embed_html_pruner.prune_HTML(simplified_html, block_tree, block_rankings, chat_tokenizer, MAX_CONTEXT_WINDOW_EMBED)
print(pruned_html)
# <html>
# <h1>Bellagio Hotel in Las</h1>
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# </html>
from htmlrag import GenHTMLPruner
import torch

# Build a finer-grained block tree for the generative pruning pass.
# block_tree, pruned_html = build_block_tree(pruned_html, max_node_words=MAX_NODE_WORDS_GEN)
block_tree, pruned_html = build_block_tree(pruned_html, max_node_words=MAX_NODE_WORDS_GEN, zh_char=True)  # for Chinese text
for block in block_tree:
    # NOTE: the loop body was un-indented in the original snippet, which is a
    # SyntaxError when run; the indentation is restored here.
    print("Block Content: ", block[0])
    print("Block Path: ", block[1])
    print("Is Leaf: ", block[2])
    print("")
# Block Content:  <h1>Bellagio Hotel in Las</h1>
# Block Path:  ['html', 'title']
# Is Leaf:  True
#
# Block Content:  <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>
# Block Path:  ['html', 'p']
# Is Leaf:  True
ckpt_path = "zstanjj/HTML-Pruner-Phi-3.8B"
# ckpt_path = "zstanjj/HTML-Pruner-Llama-1B"
# Run the generative pruner on GPU when one is available, else fall back to CPU.
# NOTE: the original if/else had un-indented bodies (a SyntaxError when run);
# rewritten as a conditional expression.
device = "cuda" if torch.cuda.is_available() else "cpu"
gen_html_pruner = GenHTMLPruner(gen_model=ckpt_path, device=device)
# Re-rank the finer blocks with the generative model, then prune again down to
# the MAX_CONTEXT_WINDOW_GEN token budget.
block_rankings = gen_html_pruner.calculate_block_rankings(question, pruned_html, block_tree)
print(block_rankings)
# [1, 0]
pruned_html = gen_html_pruner.prune_HTML(pruned_html, block_tree, block_rankings, chat_tokenizer, MAX_CONTEXT_WINDOW_GEN)
print(pruned_html)
# <p>The Bellagio is a luxury hotel and casino located on the Las Vegas Strip in Paradise, Nevada. It was built in 1998.</p>