Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: skip comment blocks in DOCXToDocument #8764

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

Night-Quiet
Copy link

Related Issues

Proposed Changes:

issue: Loading issues with docx files saved using the latest version of WPS
solve: Added a judgment to skip invalid content

if isinstance(element, _Comment):
    continue

How did you test it?

Previous test code

import docx
from enum import Enum
from typing import List
from lxml.etree import _Comment
from docx.text.paragraph import Paragraph
from docx.document import Document as DocxDocument
from haystack.components.converters import DOCXToDocument


class DOCXTableFormat(Enum):
    """
    Supported formats for storing DOCX tabular data in a Document.
    """

    MARKDOWN = "markdown"
    CSV = "csv"

    def __str__(self):
        return self.value

    @staticmethod
    def from_str(string: str) -> "DOCXTableFormat":
        """
        Convert a string to a DOCXTableFormat enum.
        """
        enum_map = {e.value: e for e in DOCXTableFormat}
        table_format = enum_map.get(string.lower())
        if table_format is None:
            msg = f"Unknown table format '{string}'. Supported formats are: {list(enum_map.keys())}"
            raise ValueError(msg)
        return table_format



class DOCXToDocumentCustom(DOCXToDocument):
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
    
    def _extract_elements(self, document: "DocxDocument") -> List[str]:
        """
        Extracts elements from a DOCX file.

        :param document: The DOCX Document object.
        :returns: List of strings (paragraph texts and table representations) with page breaks added as '\f' characters.
        """
        elements = []
        for element in document.element.body:
            if isinstance(element, _Comment):
                continue
            
            if element.tag.endswith("p"):
                paragraph = Paragraph(element, document)
                if paragraph.contains_page_break:
                    para_text = self._process_paragraph_with_page_breaks(paragraph)
                else:
                    para_text = paragraph.text
                elements.append(para_text)
            elif element.tag.endswith("tbl"):
                table = docx.table.Table(element, document)
                table_str = (
                    self._table_to_markdown(table)
                    if self.table_format == DOCXTableFormat.MARKDOWN
                    else self._table_to_csv(table)
                )
                elements.append(table_str)

        return elements

no

Notes for the reviewer

@anakin87

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@Night-Quiet Night-Quiet requested a review from a team as a code owner January 23, 2025 04:27
@Night-Quiet Night-Quiet requested review from Amnah199 and removed request for a team January 23, 2025 04:27
@CLAassistant
Copy link

CLAassistant commented Jan 23, 2025

CLA assistant check
All committers have signed the CLA.

@anakin87 anakin87 requested review from anakin87 and removed request for Amnah199 January 23, 2025 10:42
Copy link
Member

@anakin87 anakin87 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution!

I left a few minor comments.
Please also add a release note, as explained here.

Comment on lines +26 to +27
with LazyImport("Run 'pip install lxml'") as lxml_import:
from lxml.etree import _Comment
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
with LazyImport("Run 'pip install lxml'") as lxml_import:
from lxml.etree import _Comment
from lxml.etree import _Comment

Since lxml is a dependency of python-docx (see https://github.com/python-openxml/python-docx/blob/master/requirements.txt), we can safely put this import in the docx_import lazy import block.

@@ -119,6 +121,7 @@ def __init__(self, table_format: Union[str, DOCXTableFormat] = DOCXTableFormat.C
If False, only the file name is stored.
"""
docx_import.check()
lxml_import.check()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lxml_import.check()

if we incorporate the lxml import in the docx lazy import block, we can remove this line.

@coveralls
Copy link
Collaborator

Pull Request Test Coverage Report for Build 12922272483

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 1 unchanged line in 1 file lost coverage.
  • Overall coverage decreased (-0.006%) to 91.294%

Files with Coverage Reduction New Missed Lines %
components/converters/docx.py 1 99.31%
Totals Coverage Status
Change from base Build 12888271194: -0.006%
Covered Lines: 8851
Relevant Lines: 9695

💛 - Coveralls

@anakin87 anakin87 changed the title fix bug #8759 fix: skip comment blocks in DOCXToDocument Jan 23, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Loading issues with docx files saved using the latest version of WPS
4 participants