fix: skip comment blocks in `DOCXToDocument` #8764

Night-Quiet · 2025-01-23T04:27:49Z

Related Issues

fixes Loading issues with docx files saved using the latest version of WPS #8759

Proposed Changes:

issue: Loading issues with docx files saved using the latest version of WPS
solve: Added a judgment to skip invalid content

if isinstance(element, _Comment):
    continue

How did you test it?

Previous test code

import docx
from enum import Enum
from typing import List
from lxml.etree import _Comment
from docx.text.paragraph import Paragraph
from docx.document import Document as DocxDocument
from haystack.components.converters import DOCXToDocument


class DOCXTableFormat(Enum):
    """
    Supported formats for storing DOCX tabular data in a Document.
    """

    MARKDOWN = "markdown"
    CSV = "csv"

    def __str__(self):
        return self.value

    @staticmethod
    def from_str(string: str) -> "DOCXTableFormat":
        """
        Convert a string to a DOCXTableFormat enum.
        """
        enum_map = {e.value: e for e in DOCXTableFormat}
        table_format = enum_map.get(string.lower())
        if table_format is None:
            msg = f"Unknown table format '{string}'. Supported formats are: {list(enum_map.keys())}"
            raise ValueError(msg)
        return table_format



class DOCXToDocumentCustom(DOCXToDocument):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
    
    def _extract_elements(self, document: "DocxDocument") -> List[str]:
        """
        Extracts elements from a DOCX file.

        :param document: The DOCX Document object.
        :returns: List of strings (paragraph texts and table representations) with page breaks added as '\f' characters.
        """
        elements = []
        for element in document.element.body:
            if isinstance(element, _Comment):
                continue
            
            if element.tag.endswith("p"):
                paragraph = Paragraph(element, document)
                if paragraph.contains_page_break:
                    para_text = self._process_paragraph_with_page_breaks(paragraph)
                else:
                    para_text = paragraph.text
                elements.append(para_text)
            elif element.tag.endswith("tbl"):
                table = docx.table.Table(element, document)
                table_str = (
                    self._table_to_markdown(table)
                    if self.table_format == DOCXTableFormat.MARKDOWN
                    else self._table_to_csv(table)
                )
                elements.append(table_str)

        return elements

no

Notes for the reviewer

@anakin87

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
I documented my code
I ran pre-commit hooks and fixed any issue

CLAassistant · 2025-01-23T04:27:56Z

All committers have signed the CLA.

anakin87

Thanks for the contribution!

I left a few minor comments.
Please also add a release note, as explained here.

anakin87 · 2025-01-23T10:44:11Z

haystack/components/converters/docx.py

+with LazyImport("Run 'pip install lxml'") as lxml_import:
+    from lxml.etree import _Comment


Suggested change

with LazyImport("Run 'pip install lxml'") as lxml_import:

from lxml.etree import _Comment

from lxml.etree import _Comment

Since lxml is a dependency of python-docx (see https://github.com/python-openxml/python-docx/blob/master/requirements.txt), we can safely put this import in the docx_import lazy import block.

anakin87 · 2025-01-23T10:44:52Z

haystack/components/converters/docx.py

@@ -119,6 +121,7 @@ def __init__(self, table_format: Union[str, DOCXTableFormat] = DOCXTableFormat.C
            If False, only the file name is stored.
        """
        docx_import.check()
+        lxml_import.check()


Suggested change

lxml_import.check()

if we incorporate the lxml import in the docx lazy import block, we can remove this line.

coveralls · 2025-01-23T10:47:31Z

Pull Request Test Coverage Report for Build 12922272483

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage decreased (-0.006%) to 91.294%

Files with Coverage Reduction	New Missed Lines	%
components/converters/docx.py	1	99.31%

Totals
Change from base Build 12888271194:	-0.006%
Covered Lines:	8851
Relevant Lines:	9695

💛 - Coveralls

fix bug deepset-ai#8759

cb22742

Night-Quiet requested a review from a team as a code owner January 23, 2025 04:27

Night-Quiet requested review from Amnah199 and removed request for a team January 23, 2025 04:27

anakin87 requested review from anakin87 and removed request for Amnah199 January 23, 2025 10:42

anakin87 requested changes Jan 23, 2025

View reviewed changes

anakin87 changed the title ~~fix bug #8759~~ fix: skip comment blocks in DOCXToDocument Jan 23, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: skip comment blocks in `DOCXToDocument` #8764

fix: skip comment blocks in `DOCXToDocument` #8764

Night-Quiet commented Jan 23, 2025

CLAassistant commented Jan 23, 2025 •

edited

Loading

anakin87 left a comment

anakin87 Jan 23, 2025

anakin87 Jan 23, 2025

coveralls commented Jan 23, 2025

		with LazyImport("Run 'pip install lxml'") as lxml_import:
		from lxml.etree import _Comment

fix: skip comment blocks in DOCXToDocument #8764

Are you sure you want to change the base?

fix: skip comment blocks in DOCXToDocument #8764

Conversation

Night-Quiet commented Jan 23, 2025

Related Issues

Proposed Changes:

How did you test it?

Notes for the reviewer

Checklist

CLAassistant commented Jan 23, 2025 • edited Loading

anakin87 left a comment

Choose a reason for hiding this comment

anakin87 Jan 23, 2025

Choose a reason for hiding this comment

anakin87 Jan 23, 2025

Choose a reason for hiding this comment

coveralls commented Jan 23, 2025

Pull Request Test Coverage Report for Build 12922272483

Warning: This coverage report may be inaccurate.

Details

💛 - Coveralls

fix: skip comment blocks in `DOCXToDocument` #8764

fix: skip comment blocks in `DOCXToDocument` #8764

CLAassistant commented Jan 23, 2025 •

edited

Loading