Document Splitter always returns 1 document for split_type="passage" in pdfs #8491

rmrbytes · 2024-10-24T10:07:13Z

Describe the bug
When using Document Splitter with pdf and split_type="passage", the result is always one document. This is using pypdf.

Expected behavior
The understanding I have is that it splits based on at least two line breaks \n\n

Additional context
When I tested using plain text it seems to be splitting correctly

To Reproduce

dir = '...'
files = [
{"filename": "rules.pdf", "meta": {"split_by" : "passage", "split_length":1, "split_overlap":0, "split_threshold":0}},
{"filename": "rules.txt", "meta": {"split_by" : "passage", "split_length":1, "split_overlap":0, "split_threshold":0}}
]
for file in files:
# set the filepath
file_path = Path(dir) / file["filename"]
router_res = file_type_router.run(sources=[file_path])
txt_docs = []
if 'text/plain' in router_res:
txt_docs = text_file_converter.run(sources=router_res['text/plain'])
elif 'application/pdf' in router_res:
txt_docs = pdf_converter.run(sources=router_res['application/pdf'])
elif 'text/markdown' in router_res:
txt_docs = markdown_converter.run(sources=router_res['text/markdown'])
document_splitter = DocumentSplitter(
split_by=file['meta']['split_by'],
split_length=file['meta']['split_length'],
split_overlap=file['meta']['split_overlap'],
split_threshold=file['meta']['split_threshold']
)
splitter_res = document_splitter.run([txt_docs['documents'][0]])
print(len(splitter_res['documents']))

System:

OS: Mac OS 14.6.1
GPU/CPU: CPU
Haystack version (commit or version number): 2.6.0
DocumentStore: Chromadb
Splitter: DocumentSplitter

The text was updated successfully, but these errors were encountered:

anakin87 · 2024-10-28T09:11:25Z

Hello!

The fact that your input is PDF or text does not have an impact on splitting.
Try to play with the parameters of DocumentSplitter.

This works, for example:

from haystack import Document
from haystack.components.preprocessors import DocumentSplitter

doc = Document(content="Hello World.\n\nMy name is John Doe.\n\nI live in Berlin", meta={"name": "doc1"})
splitter = DocumentSplitter(split_length=1, split_by="passage")

docs = splitter.run([doc])
print(docs)
# I get 3 documents

rmrbytes · 2024-10-28T09:23:55Z

Thanks @anakin87 for your response. I did mention above that it works with txt files as confirmed by your example. My issue is with PDFs (using pypdf). I will revisit by creating a simpler pdf with distinct two line returns.

lbux · 2024-11-08T05:16:07Z

I also ran into the issue, but I think it might just be the fact that the PDF format is a bit of a mess when people create documents. Some of the PDFs I have can correctly be split but some can't.

rmrbytes · 2024-11-08T05:18:42Z

Hmm. This was a simple text based PDF created via google docs. I thought it would be a "simple" pdf :-). For now, am testing using txt and md files till I get a handle on this. Thanks @lbux for sharing.

lbux · 2024-11-08T05:23:18Z

I'll give it another shot tomorrow just to be sure. I do remember the same document I used was having no issues with splitting by words or sentences, but it did fail when I did passage. I wonder if it's failing to read the \n\n somehow. Or maybe the converter is not properly handling it.

rmrbytes · 2024-11-08T05:39:30Z

For what it is worth, I did experiment with 3 pdfs and found that the entire document was a single chunk always.

lbux · 2024-11-09T19:23:43Z

You are correct. The issue persists regardless of what converter is used (Miner or PyPDF). I tried several files, and they all report 1. However, the issue also occurs if we use the NLTKDocumentSplitter. I'd have to dig a bit deeper to determine what the root cause is.

lbux · 2024-11-09T20:14:14Z

The issue is with the converters. PyPDF is a bit worse in handling PDFs when no parameters are added compared to PDFMiner. However, both implementations are not configured to infer paragraphs breaks based on spacing or layout analysis. When we see something we consider a paragraph, it is stored as "\n" as opposed to "\n\n" which the splitter expects.

Here is a screenshot of a PDF

And this is how PDFMiner converts it:
This is a test document. Here we have a sentence.\nThis is the start of a new paragraph.\nThis is not the start of a new paragraph.\nThe number should be 3.\n

This is how PyPDF converts it:
"Thisisatest document. Herewehaveasentence.\nThisisthestart of anewparagraph.Thisisnot thestart of anewparagraph.\nThenumbershouldbe3.

@anakin87 Would love if someone from the team could take a deeper look into it. If my findings are right, many customers could be incorrectly converting their documents.

anakin87 · 2024-11-10T00:00:21Z

@sjrl Have you by any chance encountered this problem in the past?

sjrl · 2024-11-11T07:11:35Z

@anakin87 We are also using PyPDF a lot but we either typically use:

split by word and respect sentence boundaries using the NLTKDocumentSplitter
split by page using the normal DocumentSplitter

so we don't have much experience with splitting by passage so we may have not noticed the missing newlines.

sjrl · 2024-11-11T07:20:01Z

It seems PDFMiner provides more easily understandable options to customize. Specifically, if you mess with the line_margin parameter (I think make it smaller) then it might preserve the extra newlines, but it's difficult to tell just looking at their documentation.

lbux · 2024-11-11T19:59:39Z

I think the main issue (with PDFMiner and probably PyPDF) is in the implementation of the converter:

def _converter(self, extractor) -> Document:
        """
        Extracts text from PDF pages then convert the text into Documents

        :param extractor:
            Python generator that yields PDF pages.

        :returns:
            PDF text converted to Haystack Document
        """
        pages = []
        for page in extractor:
            text = ""
            for container in page:
                # Keep text only
                if isinstance(container, LTTextContainer):
                    text += container.get_text()
            pages.append(text)

        # Add a page delimiter
        concat = "\f".join(pages)

        return Document(content=concat)

Our downstream task expects "\n\n" but the current implementation does not add the delineators. The current implementation seems to concatenate all the text of LTTextContainer expecting it to already have the delineators which is not the case. If we explicitly add them like so:

def _converter(self, extractor) -> Document:
        """
        Extracts text from PDF pages then convert the text into Documents

        :param extractor:
            Python generator that yields PDF pages.

        :returns:
            PDF text converted to Haystack Document
        """
        pages = []
        for page in extractor:
            text = ""
            for container in page:
                # Keep text only
                if isinstance(container, LTTextContainer):
                    container_text = container.get_text().strip()
                    if text:
                        text += "\n\n"
                    text += container_text
            pages.append(text.strip())

        # Add a page delimiter
        concat = "\f".join(pages)

        return Document(content=concat)

It is true that line_margin is important and that is what actually controls our LTTextContainer object, but if we just concatenate, it has no impact.

With these changes, I was able to get the correct, expected output:

This is a test document. Here we have a sentence.\n\n
This is the start of a new paragraph.\nThis is not the start of a new paragraph.\n\n
The number should be 3.

Further testing would probably have to be done, and I stripped some of the text to make it easier to debug (which should actually be done in the cleaner); however, I am confident this is the root cause.

anakin87 · 2024-11-11T20:08:14Z

@lbux thanks for debugging this issue!

We should probably investigate better, but your proposal sounds reasonable.

lbux · 2024-11-11T20:17:43Z

Haystack's PyPDF's conversion method is this:

def _default_convert(self, reader: "PdfReader") -> Document:
        text = "\f".join(page.extract_text() for page in reader.pages)
        return Document(content=text)

This also has the same limitation where we don't add the \n\n. Complicated logic can be added to manually insert the delineators and I added some basic heuristics to test it myself (if previous line ends with period and new line starts with uppercase letter). It worked fine, but it was ugly.

I then looked more into extract_text(), and it seems we are using the legacy extractor. They have an experimental version that we can enable by adding "extraction_mode="layout" as a parameter in extract_text() and it works out of the box for my test without any additional changes needed... Just another data point for possible solutions!

lbux · 2024-11-11T20:21:25Z

@lbux thanks for debugging this issue!

We should probably investigate better, but your proposal sounds reasonable.

Of course! My current project is explicitly dependent on retrieving exact paragraphs, so I had to fix this one way or another haha. Hoping for a permanent solution when time allows.

davidsbatista · 2025-01-06T17:13:01Z

@lbux, do you have any more examples (of PDFs) of where this fails? I'm trying to come up with a fix, and your suggested fix/approach for PDFMiner seems to be working so far. I still would like to test it with more examples.

lbux · 2025-01-09T20:12:02Z

@lbux, do you have any more examples (of PDFs) of where this fails? I'm trying to come up with a fix, and your suggested fix/approach for PDFMiner seems to be working so far. I still would like to test it with more examples.

Unfortunately, I do not have much to share that isn't copyrighted by my university. However, while testing this, I did use this document provided by Stanford since it has a mix of difficult content to parse but it is well structured.

When I test page 10 (just a random page), I get great results with PDFMiner and my changes. It fails a bit when it comes to the bullet points, but not as bad as PyPDF which I've had worse luck with. Overall, PDFMiner with my changes ended up working fine for my task. I've since stepped aside from the project I was working on, so I don't have much to add. I will be picking it back up soon.

davidsbatista · 2025-01-16T12:23:17Z

I have a PR/fix based on your suggestions - still in draft: #8729

I've tested on a different PDF file - one that's already part of our test_files (https://github.com/deepset-ai/haystack/blob/main/test/test_files/pdf/sample_pdf_2.pdf).

It seems to be OK to me. Each passage is correctly identified, and each page header and footer also seem to be OK; each is identified as a passage.

I understand that having one single solution that always correctly extracts text from PDF and builds correct Documents is hard. The only concern I have is whether this breaks something else. I've also added one more test page detection to see if this update changes its behavior; it seems not.

I will handle PyPDF in a different PR.

davidsbatista · 2025-01-16T14:31:28Z

Regarding PyPDFToDocument, it seems that defining it with extraction_mode layout, i.e.:

PyPDFToDocument(extraction_mode=PyPDFExtractionMode.LAYOUT)

works perfectly!

davidsbatista · 2025-01-16T15:14:48Z

the PR with the test for PyPDFToDocument - #8739

anakin87 added the community-triage label Oct 28, 2024

anakin87 added type:bug Something isn't working topic:preprocessing and removed community-triage labels Nov 11, 2024

julian-risch added the P2 Medium priority, add to the next sprint if no P1 available label Nov 25, 2024

julian-risch assigned davidsbatista Dec 9, 2024

davidsbatista mentioned this issue Jan 16, 2025

fix: PDFMinerToDocument convert function - adding double new lines between each container_text so that passages can be detected. #8729

Merged

davidsbatista mentioned this issue Jan 16, 2025

test: test that PyPDF can extract passages so that they are detect by DocumentSplitter #8739

Merged

davidsbatista closed this as completed in #8739 Jan 17, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Document Splitter always returns 1 document for split_type="passage" in pdfs #8491

Document Splitter always returns 1 document for split_type="passage" in pdfs #8491

rmrbytes commented Oct 24, 2024

anakin87 commented Oct 28, 2024

rmrbytes commented Oct 28, 2024

lbux commented Nov 8, 2024

rmrbytes commented Nov 8, 2024

lbux commented Nov 8, 2024 •

edited

Loading

rmrbytes commented Nov 8, 2024

lbux commented Nov 9, 2024

lbux commented Nov 9, 2024 •

edited

Loading

anakin87 commented Nov 10, 2024

sjrl commented Nov 11, 2024

sjrl commented Nov 11, 2024

lbux commented Nov 11, 2024

anakin87 commented Nov 11, 2024

lbux commented Nov 11, 2024

lbux commented Nov 11, 2024

davidsbatista commented Jan 6, 2025

lbux commented Jan 9, 2025

davidsbatista commented Jan 16, 2025

davidsbatista commented Jan 16, 2025

davidsbatista commented Jan 16, 2025

Document Splitter always returns 1 document for split_type="passage" in pdfs #8491

Document Splitter always returns 1 document for split_type="passage" in pdfs #8491

Comments

rmrbytes commented Oct 24, 2024

anakin87 commented Oct 28, 2024

rmrbytes commented Oct 28, 2024

lbux commented Nov 8, 2024

rmrbytes commented Nov 8, 2024

lbux commented Nov 8, 2024 • edited Loading

rmrbytes commented Nov 8, 2024

lbux commented Nov 9, 2024

lbux commented Nov 9, 2024 • edited Loading

anakin87 commented Nov 10, 2024

sjrl commented Nov 11, 2024

sjrl commented Nov 11, 2024

lbux commented Nov 11, 2024

anakin87 commented Nov 11, 2024

lbux commented Nov 11, 2024

lbux commented Nov 11, 2024

davidsbatista commented Jan 6, 2025

lbux commented Jan 9, 2025

davidsbatista commented Jan 16, 2025

davidsbatista commented Jan 16, 2025

davidsbatista commented Jan 16, 2025

lbux commented Nov 8, 2024 •

edited

Loading

lbux commented Nov 9, 2024 •

edited

Loading