fix issue where a text block is extracted multiple times by different patterns #108

fqrious · 2025-01-13T10:06:39Z

When running pattern extractors, each extractor is run separately over the document. Because of this, both a URL and a domain can be extracted from a single text block. For example:

host_url = "https://example.org/file.html"

extracts both

{
    "type": "url",
    "spec_version": "2.1",
    "id": "url--de2cb18e-af1f-59d9-ada1-8b9c89661257",
    "value": "https://example.org/file.html"
}

and

{
    "type": "domain-name",
    "spec_version": "2.1",
    "id": "domain-name--3500c14b-9bde-5393-b4e5-b260aea7e058",
    "value": "example.org"
}

By using each extraction's start_index, we can detect extractions that overlap and skip them.
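A minimal sketch of how such an overlap check might look, not the project's actual implementation. The `Extraction` class, its `end_index` property, and the `drop_overlaps` helper are hypothetical names introduced for illustration. It also assumes that the longest match should win (so the URL is kept and the domain inside it is dropped), which the issue does not specify.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Extraction:
    """Hypothetical record for one pattern match (illustrative only)."""
    type: str
    value: str
    start_index: int

    @property
    def end_index(self) -> int:
        # Exclusive end offset, derived from the start and the matched text.
        return self.start_index + len(self.value)


def drop_overlaps(extractions):
    """Keep the longest extraction for each overlapping region of text."""
    kept = []
    # Assumption: longer matches take priority; ties go to the earlier match.
    ordered = sorted(extractions, key=lambda e: (-len(e.value), e.start_index))
    for cand in ordered:
        # Two spans are disjoint if one ends before the other starts.
        if all(
            cand.end_index <= k.start_index or cand.start_index >= k.end_index
            for k in kept
        ):
            kept.append(cand)
    # Return results in document order.
    return sorted(kept, key=lambda e: e.start_index)


text = 'host_url = "https://example.org/file.html"'
url = "https://example.org/file.html"
domain = "example.org"

found = [
    Extraction("domain-name", domain, text.index(domain)),
    Extraction("url", url, text.index(url)),
]
result = drop_overlaps(found)
print([(e.type, e.value) for e in result])
```

With the example from this issue, only the `url` extraction survives, because the `domain-name` span lies entirely inside it. If the opposite priority is wanted (for instance, first extractor wins), only the sort key would need to change.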

fqrious assigned fqrious and himynamesdave Jan 13, 2025

fqrious added this to Roadmap Jan 13, 2025

github-project-automation bot moved this to Todo in Roadmap Jan 13, 2025

fqrious added the enhancement New feature or request label Jan 13, 2025
