fix: update meta data before initializing new Document in DocumentSplitter #8745
Hi @anakin87,
Sometimes it's related to linter updates... I'll take a look later...
Pull Request Test Coverage Report for Build 12856711277 (Coveralls)
2deb45a to cb2bc32
Looks quite good to me already! Thank you @nickprock for contributing to Haystack!
There is a small bug in the code, which is also the reason why @anakin87 couldn't produce any duplicate documents. We're currently re-using `meta` across multiple iterations of the for loop, which we shouldn't. In particular, overwriting `meta` in `meta = deepcopy(meta)` is problematic.
If you do something like the following, that problem will be fixed:
copied_meta = deepcopy(meta)
copied_meta["page_number"] = splits_pages[i]
copied_meta["split_id"] = i
copied_meta["split_idx_start"] = split_idx
doc = Document(content=txt, meta=copied_meta)
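To illustrate why the order matters, here is a minimal, self-contained sketch. It does not use Haystack: `ToyDocument` is a hypothetical stand-in that, by assumption, derives its `id` from content and meta at construction time, and both helper functions are invented for illustration only.

```python
import hashlib
import json
from copy import deepcopy
from dataclasses import dataclass, field


@dataclass
class ToyDocument:
    """Hypothetical stand-in (not Haystack's Document).

    Assumption: the id is hashed from content and meta once, at construction.
    """

    content: str
    meta: dict = field(default_factory=dict)
    id: str = field(init=False)

    def __post_init__(self):
        payload = json.dumps({"content": self.content, "meta": self.meta}, sort_keys=True)
        self.id = hashlib.sha256(payload.encode("utf-8")).hexdigest()


def split_meta_after_init(splits, meta):
    """Sets split-specific meta after construction, so the id cannot reflect it."""
    docs = []
    for i, txt in enumerate(splits):
        doc = ToyDocument(content=txt, meta=deepcopy(meta))
        doc.meta["split_id"] = i  # too late: the id was already computed
        docs.append(doc)
    return docs


def split_meta_before_init(splits, meta):
    """Fills a fresh per-iteration copy of meta before construction."""
    docs = []
    for i, txt in enumerate(splits):
        copied_meta = deepcopy(meta)  # new name, so the source meta is never rebound
        copied_meta["split_id"] = i
        docs.append(ToyDocument(content=txt, meta=copied_meta))
    return docs


splits = ["This is some text."] * 4
source_meta = {"file_name": "example.txt"}

ids_after = {doc.id for doc in split_meta_after_init(splits, source_meta)}
ids_before = {doc.id for doc in split_meta_before_init(splits, source_meta)}

print(len(ids_after), len(ids_before))
```

Under these assumptions, identical splits collapse to a single id when the meta is set after construction, while the copy-then-construct variant gives each split its own id and leaves `source_meta` untouched.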
Here is an example of a unit test that you could add to test/components/preprocessors/test_document_splitter.py:
def test_duplicate_pages_get_different_doc_id(self):
    splitter = DocumentSplitter(split_by="page", split_length=1)
    doc1 = Document(content="This is some text.\fThis is some text.\fThis is some text.\fThis is some text.")
    splitter.warm_up()
    result = splitter.run(documents=[doc1])
    assert len({doc.id for doc in result["documents"]}) == 4
Thanks @julian-risch, I'll work on it tomorrow.
in the _create_docs_from_splits function, initialize a new variable copied_meta instead of overwriting meta
test_duplicate_pages_get_different_doc_id
Hi @anakin87, I applied the changes requested by @julian-risch and added the test.
LGTM! 👍 Thanks for adding a test too!
Related Issues
Proposed Changes:
Updated code as illustrated in the issue.
How did you test it?
unit tests
Checklist
fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:, and added ! in case the PR includes breaking changes.