Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: update meta data before initializing new Document in DocumentSplitter #8745

Merged
merged 6 commits into from
Jan 20, 2025

Conversation

nickprock
Copy link
Contributor

Related Issues

Proposed Changes:

Updated code as illustrated in the issue.

How did you test it?

unit tests

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@nickprock nickprock requested review from a team as code owners January 17, 2025 14:08
@nickprock nickprock requested review from dfokina and anakin87 and removed request for a team January 17, 2025 14:08
@nickprock
Copy link
Contributor Author

Hi @anakin87 ,
on my local environment the code pass all checks but here there are some files (I haven't worked on they) that don't pass the format checker.
I don't understand.

@anakin87
Copy link
Member

Sometimes it's related to linter updates... I'll take a look later...

@github-actions github-actions bot added the topic:DX Developer Experience label Jan 17, 2025
@coveralls
Copy link
Collaborator

coveralls commented Jan 17, 2025

Pull Request Test Coverage Report for Build 12856711277

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 91.306%

Totals Coverage Status
Change from base Build 12856629462: 0.0%
Covered Lines: 8853
Relevant Lines: 9696

💛 - Coveralls

@anakin87 anakin87 force-pushed the dev/DocumentSplitter_update branch from 2deb45a to cb2bc32 Compare January 17, 2025 17:34
@anakin87
Copy link
Member

anakin87 commented Jan 17, 2025

  1. I understood the format problem, that I will fix in another PR (chore: update ruff version in pre-commit hook #8746)
  2. I'm having a hard time testing this bugfix: I'm not sure that we are creating duplicate IDs with the current code, so I'll ask @julian-risch to take a look

Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks quite good to me already! Thank you @nickprock for contributing to Haystack!
There is a small bug in the code, which is also the reason why @anakin87 couldn't produce any duplicate documents. We're currently re-using meta throughout multiple iterations of the for loop, which we shouldn't. In particular, overwriting meta in meta = deepcopy(meta) is problematic.

If you do something like the following, that problem will be fixed:

copied_meta = deepcopy(meta)
copied_meta["page_number"] = splits_pages[i]
copied_meta["split_id"] = i
copied_meta["split_idx_start"] = split_idx
doc = Document(content=txt, meta=copied_meta)

Here is an example of a unit test you could please add to test/components/preprocessors/test_document_splitter.py

def test_duplicate_pages_get_different_doc_id(self):
    splitter = DocumentSplitter(split_by="page", split_length=1)
    doc1 = Document(content="This is some text.\fThis is some text.\fThis is some text.\fThis is some text.")
    splitter.warm_up()
    result = splitter.run(documents=[doc1])

    assert len({doc.id for doc in result["documents"]}) == 4 

@nickprock
Copy link
Contributor Author

Thanks @julian-risch , I'll work on it tomorrow.

in _create_docs_from_splits function initialize a new variable copied_mete instead to overwrite meta
test_duplicate_pages_get_different_doc_id
@nickprock
Copy link
Contributor Author

Hi, @anakin87 I applied the changes requested by @julian-risch and added the test.
On my local environment everything is fine, here there is always yesterday's error about formatting.

@julian-risch julian-risch self-requested a review January 20, 2025 08:49
Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 👍 Thanks for adding a test too!

@julian-risch julian-risch changed the title document splitter update fix: update meta data before initializing new Document in DocumentSplitter Jan 20, 2025
@julian-risch julian-risch merged commit 542a7f7 into deepset-ai:main Jan 20, 2025
18 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:core topic:DX Developer Experience topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DocumentSplitter updates Document's meta data after initializing the Document
4 participants