Switch split task to token based splitting #283
base: main
Conversation
```diff
@@ -0,0 +1,28 @@
# SPDX-FileCopyrightText: Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.
```
Please add at least one more test that actually sets all the parameters of the schema to ensure it's working as expected, plus tests that confirm the parameter limits are handled correctly (e.g. tokens can't be set to negative numbers).
```python
if df_filtered.empty:
    gdf = cudf.from_pandas(df)
    message_meta = MessageMeta(df=gdf)
```
It doesn't look like we've mutated the original message at all at this point. We can probably just return the message without rebuilding the payload.
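A minimal sketch of that suggestion (function and filter names here are hypothetical, not the PR's actual code; the point is the early return on an empty filter result):

```python
import pandas as pd


def split_or_passthrough(message, df):
    """Hypothetical helper: when the filter matches no rows, the incoming
    message was never mutated, so return it as-is instead of rebuilding
    the payload (pandas -> cudf -> MessageMeta)."""
    df_filtered = df[df["document_type"] == "text"]  # assumed filter criterion
    if df_filtered.empty:
        return message  # passthrough: nothing to split, no rebuild needed
    # ... token-based splitting of the matched rows would happen here ...
    return message
```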
```python
for _, row in df_filtered.iterrows():
    content = row["metadata"]["content"]

    if content is None:
```
Does this necessarily need to be an error? Or could we just set it to an empty string and continue processing?
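If the answer is to continue rather than error, the normalization could be as simple as this (hypothetical helper name):

```python
def normalize_content(content):
    # Treat missing content as an empty string so the row simply yields
    # zero chunks, instead of raising an error and aborting the task.
    return "" if content is None else content
```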
```python
from typing_extensions import Annotated


class TextSplitterSchema(BaseModel):
```
Probably want some additional validation so that chunk_overlap has to be < chunk_size.
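A plain-Python sketch of that constraint (the real schema would express it as a pydantic validator; the field names and defaults below are assumptions, not the PR's actual values):

```python
from dataclasses import dataclass


@dataclass
class TextSplitterConfig:
    """Illustrates the validation the review asks for: bounds on each
    field, plus the cross-field rule chunk_overlap < chunk_size."""

    chunk_size: int = 1024
    chunk_overlap: int = 150

    def __post_init__(self):
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if self.chunk_overlap < 0:
            raise ValueError("chunk_overlap must be non-negative")
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be less than chunk_size")
```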
```python
offsets = encoding["offset_mapping"]

# Split the tokens into chunks of the desired size
chunks = [tokens[i : i + chunk_size] for i in range(0, len(tokens), chunk_size - chunk_overlap)]
```
Related to my comment in the schema, we will want to be sure that chunk_overlap is strictly less than chunk_size. Otherwise we could conceivably get a negative size here.
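To illustrate the failure mode and the guard, a standalone sketch (not the PR's exact code): `range()` raises on a zero step, so `chunk_overlap == chunk_size` would crash, and validating upfront gives a clearer error.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    # Guard first: if chunk_overlap >= chunk_size, the range() step below
    # would be zero or negative, so the list comprehension could never
    # make forward progress through the token sequence.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be strictly less than chunk_size")
    step = chunk_size - chunk_overlap
    return [tokens[i : i + chunk_size] for i in range(0, len(tokens), step)]
```

With `chunk_size=4` and `chunk_overlap=2` over ten tokens, this produces five windows, each sharing two tokens with its predecessor.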
@drobison00 Thanks for the reviews! I have a couple of questions. How do you think we should go about preloading the vocab files in the case where the user doesn't want to allow downloads? And in the case where they do, how should we pass along the Hugging Face token to access gated models? My thought was to pull it from an environment variable on the client side, like we do with unstructured and Adobe, and pass it along as another parameter in the schema.
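For reference, the environment-variable approach described above might look like this on the client side (the variable name is an assumption, not the project's actual convention):

```python
import os


def get_hf_token():
    # Hypothetical: read the Hugging Face token from the client
    # environment, mirroring how the unstructured and Adobe integrations
    # pass secrets; None means no gated-model access is configured.
    return os.environ.get("HUGGINGFACE_ACCESS_TOKEN")
```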
Also, I can't seem to reproduce the test failure locally.
We have a flaky test. I meant to look into fixing it, but for now you can go into Actions and rerun the test.