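@riddle911 Thanks for the suggestion, riddle911.
Image link extraction fits better in a downstream stage, which is why it has not been added to the current extraction step. Downstream tasks involve more than img link extraction: there is also PDF link extraction, audio/video link extraction, special handling of tables and code blocks, and so on.
Custom XPath rule configuration is currently partially supported. Usage is as follows: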
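As a rough illustration of what such a downstream step could look like, the sketch below collects image, PDF, and audio/video links from an extracted content HTML string using only the Python standard library. The names `MediaLinkCollector` and `collect_media_links` are hypothetical and not part of this project; the input is assumed to be the HTML of the extracted main content.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class MediaLinkCollector(HTMLParser):
    """Collects image, PDF, and audio/video URLs from an HTML fragment."""

    def __init__(self, base_url=""):
        super().__init__()
        self.base_url = base_url
        self.links = {"img": [], "pdf": [], "media": []}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.links["img"].append(urljoin(self.base_url, attrs["src"]))
        elif tag == "a" and (attrs.get("href") or "").lower().endswith(".pdf"):
            # Treat any anchor whose href ends in .pdf as a PDF link.
            self.links["pdf"].append(urljoin(self.base_url, attrs["href"]))
        elif tag in ("video", "audio", "source") and attrs.get("src"):
            self.links["media"].append(urljoin(self.base_url, attrs["src"]))


def collect_media_links(content_html, base_url=""):
    """Return a dict of absolute URLs grouped by kind (img, pdf, media)."""
    parser = MediaLinkCollector(base_url)
    parser.feed(content_html)
    parser.close()
    return parser.links
```

Keeping this as a separate pass over the extracted content means the extractor itself stays focused on locating the main body, and each downstream task can apply its own link rules.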
extractor = GeneralExtractor(config_path='./rule.json')

"""
demo rule config file json:
{
    "www.***.com": {
        "clean": ["//script", "//style"],
        "title": {
            "mode": "xpath",
            "value": "//div[@class='media-body']/h4/text()"
        },
        "content": {
            "mode": "xpath",
            "value": "//div[@class='message break-all']"
        }
    }
}
"""
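Before passing a rule file to the extractor, it can help to sanity-check its shape. The sketch below assumes the schema is exactly what the demo config shows (a per-domain object with an optional `clean` list and optional `title`/`content` entries whose `mode` is `"xpath"`); the library may accept more than this. `check_rules` and `write_rules` are hypothetical helpers, not part of the project.

```python
import json

# Assumption: only the rule fields that appear in the demo config.
RULE_FIELDS = ("title", "content")


def check_rules(rules):
    """Return a list of problems in a rule dict shaped like the demo config."""
    problems = []
    for domain, rule in rules.items():
        clean = rule.get("clean", [])
        if not isinstance(clean, list) or not all(isinstance(x, str) for x in clean):
            problems.append(f"{domain}: 'clean' should be a list of XPath strings")
        for field in RULE_FIELDS:
            spec = rule.get(field)
            if spec is None:
                continue  # assumed optional
            ok = (
                isinstance(spec, dict)
                and spec.get("mode") == "xpath"
                and isinstance(spec.get("value"), str)
            )
            if not ok:
                problems.append(f"{domain}: '{field}' needs mode 'xpath' and a string value")
    return problems


def write_rules(rules, path="./rule.json"):
    """Validate the rules and write them as JSON, raising on any problem."""
    problems = check_rules(rules)
    if problems:
        raise ValueError("; ".join(problems))
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(rules, fh, ensure_ascii=False, indent=2)
```

The resulting file path is what would be passed as `config_path` in the call shown above.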
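Thanks, I have tested it and it works. It looks like rules can currently be defined only for the title and content fields, is that right? Is there any plan to let users add and customize multiple fields and extraction rules? That would make it flexible to extract img, video, or anything else.
@riddle911 We can consider shipping this in the next release, together with the fix for the cchardet version issue.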