
Switch split task to token based splitting #283

Open
wants to merge 13 commits into
base: main

Conversation

ChrisJar
Collaborator

@ChrisJar ChrisJar commented Dec 13, 2024

Description

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.


copy-pr-bot bot commented Dec 13, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ChrisJar ChrisJar marked this pull request as ready for review January 9, 2025 01:27
@@ -0,0 +1,28 @@
# SPDX-FileCopyrightText: Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.
Collaborator

Please add at least one more test that explicitly sets all of the parameters for the schema, to ensure it's working as expected, and tests that confirm the parameter limits are handled correctly (e.g., tokens can't be set to negative numbers).
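A sketch of what those tests could look like; the schema here is a hypothetical stand-in for the PR's `TextSplitterSchema`, and its field names, defaults, and bounds are assumptions for illustration:

```python
from pydantic import BaseModel, Field, ValidationError

# Hypothetical stand-in for the PR's schema; fields are assumptions.
class TextSplitterSchema(BaseModel):
    chunk_size: int = Field(default=1024, gt=0)
    chunk_overlap: int = Field(default=150, ge=0)

# One test that sets every schema parameter explicitly.
schema = TextSplitterSchema(chunk_size=512, chunk_overlap=64)
assert schema.chunk_size == 512 and schema.chunk_overlap == 64

# And one that confirms the limits: negative token counts are rejected.
try:
    TextSplitterSchema(chunk_size=-1)
    raise AssertionError("negative chunk_size should have been rejected")
except ValidationError:
    pass
```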


if df_filtered.empty:
gdf = cudf.from_pandas(df)
message_meta = MessageMeta(df=gdf)
Collaborator

It doesn't look like we've mutated the original message at all at this point. We can probably just return the message without rebuilding the payload.
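A minimal sketch of that short-circuit, using plain pandas (the real code uses cudf and `MessageMeta`; the function name and column names here are hypothetical):

```python
import pandas as pd

def split_or_passthrough(df: pd.DataFrame, doc_type: str = "text") -> pd.DataFrame:
    """Hypothetical splitter entry point: skip the payload rebuild when
    no rows match, since the original frame has not been mutated."""
    df_filtered = df[df["document_type"] == doc_type]
    if df_filtered.empty:
        return df  # early exit: nothing to split, nothing was changed
    # ... token-based splitting of df_filtered rows would happen here ...
    return df
```

When nothing matches, the caller gets the same object back (`split_or_passthrough(df) is df`), so no new payload is constructed.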

for _, row in df_filtered.iterrows():
content = row["metadata"]["content"]

if content is None:
Collaborator

Does this necessarily need to be an error? Or could we just set it to an empty string and continue processing?
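One way that could look (a sketch; the `or` coercion shown treats both `None` and a missing key as an empty string):

```python
# Sketch: instead of raising on missing content, coerce None to an empty
# string so the row passes through the splitter and simply yields no chunks.
rows = [{"content": "some text"}, {"content": None}, {}]
for row in rows:
    content = row.get("content") or ""
    assert isinstance(content, str)  # always safe to tokenize / split
```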

from typing_extensions import Annotated


class TextSplitterSchema(BaseModel):
Collaborator

Probably want some additional validation so that chunk_overlap has to be < chunk_size.
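With pydantic, that cross-field check could live on the model itself. A sketch, assuming pydantic v2's `model_validator` and field names that mirror the discussion:

```python
from pydantic import BaseModel, Field, model_validator

class TextSplitterSchema(BaseModel):
    # Field names, defaults, and bounds are assumptions for illustration.
    chunk_size: int = Field(default=1024, gt=0)
    chunk_overlap: int = Field(default=150, ge=0)

    @model_validator(mode="after")
    def overlap_less_than_size(self):
        # Enforce chunk_overlap < chunk_size so the splitting stride
        # (chunk_size - chunk_overlap) is always positive.
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be less than chunk_size")
        return self
```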

offsets = encoding["offset_mapping"]

# Split the tokens into chunks of the desired size
chunks = [tokens[i : i + chunk_size] for i in range(0, len(tokens), chunk_size - chunk_overlap)]
Collaborator

Related to my comment in the schema, we will want to be sure that chunk_overlap is strictly less than chunk_size. Otherwise we could conceivably get a negative size here.
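The arithmetic in question, worked through on a toy token list (the values are assumed for illustration):

```python
tokens = list(range(10))
chunk_size, chunk_overlap = 4, 2

# A valid configuration: stride = 4 - 2 = 2, chunks overlap by two tokens.
stride = chunk_size - chunk_overlap
chunks = [tokens[i:i + chunk_size] for i in range(0, len(tokens), stride)]
assert chunks[0] == [0, 1, 2, 3] and chunks[1] == [2, 3, 4, 5]

# chunk_overlap == chunk_size makes the stride 0, and range() raises
# "range() arg 3 must not be zero"; chunk_overlap > chunk_size makes the
# stride negative, and the comprehension silently produces no chunks.
assert [tokens[i:i + 4] for i in range(0, len(tokens), -1)] == []
```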

@ChrisJar
Collaborator Author

@drobison00 Thanks for the reviews! I have a couple of questions: how should we go about preloading the vocab files when the user doesn't want to allow downloads, and when they do, how should we pass along the Hugging Face token to access gated models? My thought was to pull it from an environment variable on the client side, as we do with unstructured and Adobe, and pass it along as another parameter in the schema.
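A sketch of that client-side pattern; the environment-variable name, config keys, and model name are assumptions, not part of this PR:

```python
import os

# Hypothetical: read the Hugging Face token from the client environment,
# mirroring how the unstructured/Adobe credentials are handled, and pass
# it through the task schema so the pipeline can fetch gated tokenizers.
hf_access_token = os.environ.get("HF_ACCESS_TOKEN")  # None if unset

task_params = {
    "tokenizer": "meta-llama/Llama-3.2-1B",  # hypothetical gated model
    "hf_access_token": hf_access_token,
}
```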

@ChrisJar
Collaborator Author

Also, I can't seem to reproduce the test failure locally.

@ChrisJar ChrisJar requested a review from drobison00 January 14, 2025 18:19
@edknv
Collaborator

edknv commented Jan 14, 2025

Also I can't seem to reproduce the test failure locally

We have a flaky test. I meant to look into fixing it, but for now you can go into Actions and rerun the test.
