A few suggestions for extraction #16

Open
riddle911 opened this issue Sep 25, 2024 · 3 comments

Comments

@riddle911

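  1. Testing shows that only the page title and body text are extracted. I suggest adding image extraction.
  2. To ensure extraction quality, I suggest adding a way to configure XPath rules, so that users can customize how a page's title, body, img, and other data are extracted.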
@sixgad
Collaborator

sixgad commented Sep 25, 2024

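@riddle911 Thanks for the suggestions.

  1. Image link extraction fits better downstream, which is why it has not been added to the current extraction stage. Downstream tasks involve not only img link extraction but also PDF links, audio/video links, special handling of tables and code blocks, and so on.

  2. Custom XPath rule configuration is currently partially supported. Usage is as follows:

Create and customize rule.json

Rules are scoped per domain: pages under a given domain are extracted with that domain's custom rules.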

extractor = GeneralExtractor(config_path='./rule.json')

"""
demo rule config file json:
{
    "www.***.com": {
        "clean": ["//script", "//style"],
        "title": {
            "mode": "xpath",
            "value": "//div[@class='media-body']/h4/text()"
        },
        "content": {
            "mode": "xpath",
            "value": "//div[@class='message break-all']"
        }
    }
}
"""
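
For anyone following along, here is a minimal sketch of generating rule.json from Python instead of editing it by hand. It uses only the standard library. The domain `www.example.com`, the helper names `check_rules` and `write_rules`, and the checks themselves are assumptions based on the shape of the demo config above; they are not part of the extractor, whose own validation may differ.

```python
import json
from pathlib import Path

# Rule set keyed by domain. "www.example.com" is a placeholder domain.
rules = {
    "www.example.com": {
        "clean": ["//script", "//style"],
        "title": {"mode": "xpath", "value": "//div[@class='media-body']/h4/text()"},
        "content": {"mode": "xpath", "value": "//div[@class='message break-all']"},
    }
}


def check_rules(rules):
    """Return a list of problems found in a rule dict (empty if none).

    The checks only mirror the demo config's shape; they are an assumption,
    not the extractor's real validation.
    """
    problems = []
    for domain, rule in rules.items():
        if not isinstance(rule.get("clean", []), list):
            problems.append(f"{domain}: 'clean' should be a list of XPath strings")
        for field in ("title", "content"):
            entry = rule.get(field)
            if entry is None:
                continue  # a field without a rule is simply skipped here
            if not isinstance(entry, dict) or "mode" not in entry or "value" not in entry:
                problems.append(f"{domain}: '{field}' needs both 'mode' and 'value'")
    return problems


def write_rules(rules, path="rule.json"):
    """Validate the rules, write them as UTF-8 JSON, and return the file path."""
    problems = check_rules(rules)
    if problems:
        raise ValueError("; ".join(problems))
    target = Path(path)
    target.write_text(json.dumps(rules, indent=4, ensure_ascii=False), encoding="utf-8")
    return target
```

Calling `write_rules(rules)` would produce a `./rule.json` that can then be passed as `config_path` in the snippet above.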

@riddle911
Author

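Thanks, I tested it and it works. But at the moment only the title and content fields can have rules defined, is that right?
Are there plans to let users add and customize multiple fields and extraction rules later on? That way, extracting img, video, or anything else would be much more flexible.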
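
Until custom fields exist, the downstream approach mentioned earlier in the thread can be sketched with the standard library alone. This is a rough illustration, not the project's implementation: `MediaLinkCollector` and `collect_media_links` are hypothetical names, and it assumes the extracted content is available as an HTML string.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MediaLinkCollector(HTMLParser):
    """Collect image, audio/video, and PDF links from an HTML fragment."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = {"img": [], "media": [], "pdf": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        src = attrs.get("src") or ""
        href = attrs.get("href") or ""
        if tag == "img" and src:
            self._add("img", src)
        elif tag in ("video", "audio", "source") and src:
            self._add("media", src)
        elif tag == "a" and href.lower().endswith(".pdf"):
            self._add("pdf", href)

    def _add(self, kind, url):
        # Resolve relative links against the page URL.
        self.links[kind].append(urljoin(self.base_url, url))


def collect_media_links(html, base_url=""):
    """Return a dict of absolute img/media/pdf links found in `html`."""
    parser = MediaLinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links
```

The PDF check is a plain suffix match on the link, so PDFs served from URLs without a `.pdf` ending would be missed. Tables and code blocks would need separate handling.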

@sixgad
Collaborator

sixgad commented Sep 25, 2024

@riddle911 We can consider shipping this in the next release, together with the fix for the cchardet version issue.
