Skip to content

Commit

Permalink
feat: add RecursiveSplitter component for Document preprocessing (#…
Browse files Browse the repository at this point in the history
…8605)

* initial import

* adding initial version + tests

* adding more tests

* more tests

* incorporating SentenceSplitter based on NLTK

* adding more tests

* adding release notes

* adding LICENSE header

* removing unused imports

* fixing example docstring

* addding docstrings

* fixing tests and returning a dictionary

* updating release notes

* attending PR comments

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* wip: updating tests for split_idx_start and _split_overlap

* adding tests for split_idx and split_start and overlaps

* adjusting file for LICENSE checking

* adding more tests

* adding tests for page numbering

* adding tests for min split lenghts and falling back to character-level chunking based on size

* fixing linting issue

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* wip

* wip

* updating tests

* wip: fixing all tests after changes

* more tests

* wip: debugging sentence overlap

* wip: debugging page number

* wip

* wip; fixed bug with sentence tokenizer, needs to keep white spaces

* adding tests for counting pages on different split approaches

* NLTK checks done on SentenceSplitter

* fixing types

* adding detecting for full overlap with previous chunks

* fixing types

* improving docstring

* improving docstring

* adding custom lenght, 'character' use case

* customising overlap function for word and adding a few tests

* updating docstring

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* wip: adding more tests for word unit length

* fix

* feat: `Tool` dataclass - unified abstraction to represent tools (#8652)

* draft

* del HF token in tests

* adaptations

* progress

* fix type

* import sorting

* more control on deserialization

* release note

* improvements

* support name field

* fix chatpromptbuilder test

* port Tool from experimental

* release note

* docs upd

* Update tool.py

---------

Co-authored-by: Daria Fokina <[email protected]>

* fix: fix deserialization issues in multi-threading environments (#8651)

* adding 'word' as default length

* fixing types

* handing both default strategies

* wip

* \f was not being counted properly

* updating tests

* fixing the overlap bug

* adding more tests

* refactoring _apply_overlap

* further refactoring

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* Update haystack/components/preprocessors/recursive_splitter.py

Co-authored-by: Sebastian Husch Lee <[email protected]>

* adding ticks to close code block

* fixing comments

* applying changes: split with space and force keep_white_spaces=True

* fixing some tests and replacing count words approach in more places

* keep_white_spaces = True only if not defined

* cleaning docs

* handling some more edge cases, when split is still too big and all separators ran

* fixing fallback whitespaces count to fixed word/char split based on split size

* cleaning

---------

Co-authored-by: Sebastian Husch Lee <[email protected]>
Co-authored-by: Stefano Fiorucci <[email protected]>
Co-authored-by: Daria Fokina <[email protected]>
Co-authored-by: Tobias Wochinger <[email protected]>
  • Loading branch information
5 people authored Jan 10, 2025
1 parent 741ce5d commit 4f73b19
Show file tree
Hide file tree
Showing 5 changed files with 1,246 additions and 2 deletions.
4 changes: 2 additions & 2 deletions haystack/components/preprocessors/__init__.py
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@
from .document_cleaner import DocumentCleaner
from .document_splitter import DocumentSplitter
from .nltk_document_splitter import NLTKDocumentSplitter
from .sentence_tokenizer import SentenceSplitter
from .recursive_splitter import RecursiveDocumentSplitter
from .text_cleaner import TextCleaner

__all__ = ["DocumentSplitter", "DocumentCleaner", "NLTKDocumentSplitter", "SentenceSplitter", "TextCleaner"]
__all__ = ["DocumentSplitter", "DocumentCleaner", "RecursiveDocumentSplitter", "TextCleaner", "NLTKDocumentSplitter"]
Loading

0 comments on commit 4f73b19

Please sign in to comment.