fix issue where a text block is extracted multiple times by different patterns #108

Open
fqrious opened this issue Jan 13, 2025 · 0 comments
Assignees
Labels
enhancement New feature or request

Contributor

fqrious commented Jan 13, 2025

When running pattern extractors, each extractor is run separately over the document. Because of this, both a URL and a domain can be extracted from the same text block. For example:

host_url = "https://example.org/file.html"

extracts both

{
    "type": "url",
    "spec_version": "2.1",
    "id": "url--de2cb18e-af1f-59d9-ada1-8b9c89661257",
    "value": "https://example.org/file.html"
}
 

AND

{
    "type": "domain-name",
    "spec_version": "2.1",
    "id": "domain-name--3500c14b-9bde-5393-b4e5-b260aea7e058",
    "value": "example.org"
}

By using the start_index of each extraction, we can skip all extractions that overlap.
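A minimal sketch of how this could work, not the project's actual code. The `Extraction` class and `drop_overlapping` function are hypothetical names, and the "longest match wins" rule is an assumption about which of two overlapping extractions to keep. Each extraction is assumed to carry the `start_index` of its value in the document.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    # Hypothetical container; the real extractor output may look different.
    type: str
    value: str
    start_index: int

    @property
    def end_index(self) -> int:
        # Exclusive end offset of the matched text.
        return self.start_index + len(self.value)


def drop_overlapping(extractions):
    """Keep only non-overlapping extractions, preferring longer matches.

    Assumption: when two extractions overlap, the longer one (e.g. the url)
    is the one we want, and the shorter one (e.g. the domain-name) is skipped.
    """
    kept = []
    # Visit longest matches first so they claim their span before shorter ones.
    for cand in sorted(extractions, key=lambda e: (-len(e.value), e.start_index)):
        overlaps = any(
            cand.start_index < k.end_index and k.start_index < cand.end_index
            for k in kept
        )
        if not overlaps:
            kept.append(cand)
    # Return in document order.
    return sorted(kept, key=lambda e: e.start_index)


text = 'host_url = "https://example.org/file.html"'
url = Extraction("url", "https://example.org/file.html", text.index("https://"))
domain = Extraction("domain-name", "example.org", text.index("example.org"))

for e in drop_overlapping([domain, url]):
    print(e.type, e.value)  # prints: url https://example.org/file.html
```

With this rule the domain-name is skipped because its span lies entirely inside the url's span. A different tie-break (for example, a fixed priority per extractor type) would fit the same structure.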

@fqrious fqrious added this to Roadmap Jan 13, 2025
@github-project-automation github-project-automation bot moved this to Todo in Roadmap Jan 13, 2025
@fqrious fqrious added the enhancement New feature or request label Jan 13, 2025