Switch split task to token based splitting #283
base: main
Conversation
```diff
@@ -0,0 +1,28 @@
# SPDX-FileCopyrightText: Copyright (c) 2024, NVIDIA CORPORATION & AFFILIATES.
```
Please add at least one more test that actually sets all the parameters of the schema to ensure it's working as expected, plus tests that confirm the parameter limits are handled correctly (e.g. tokens can't be set to negative numbers).
```python
if df_filtered.empty:
    gdf = cudf.from_pandas(df)
    message_meta = MessageMeta(df=gdf)
```
It doesn't look like we've mutated the original message at all at this point. We can probably just return the message without rebuilding the payload.
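A minimal sketch of that suggestion (function and filter names here are hypothetical, not the PR's actual code; the point is the early return on an empty filter result):

```python
import pandas as pd


def split_or_passthrough(message, df):
    """Hypothetical helper: when the filter matches no rows, the incoming
    message was never mutated, so return it as-is instead of rebuilding
    the payload (pandas -> cudf -> MessageMeta)."""
    df_filtered = df[df["document_type"] == "text"]  # assumed filter criterion
    if df_filtered.empty:
        return message  # passthrough: nothing to split, no rebuild needed
    # ... token-based splitting of the matched rows would happen here ...
    return message
```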
```python
for _, row in df_filtered.iterrows():
    content = row["metadata"]["content"]

    if content is None:
```
Does this necessarily need to be an error? Or could we just set it to an empty string and continue processing?
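If the answer is to continue rather than error, the normalization could be as simple as this (hypothetical helper name):

```python
def normalize_content(content):
    # Treat missing content as an empty string so the row simply yields
    # zero chunks, instead of raising an error and aborting the task.
    return "" if content is None else content
```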
```python
from typing_extensions import Annotated


class TextSplitterSchema(BaseModel):
```
Probably want some additional validation so that chunk_overlap has to be < chunk_size.
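A plain-Python sketch of that constraint (the real schema would express it as a pydantic validator; the field names and defaults below are assumptions, not the PR's actual values):

```python
from dataclasses import dataclass


@dataclass
class TextSplitterConfig:
    """Illustrates the validation the review asks for: bounds on each
    field, plus the cross-field rule chunk_overlap < chunk_size."""

    chunk_size: int = 1024
    chunk_overlap: int = 150

    def __post_init__(self):
        if self.chunk_size <= 0:
            raise ValueError("chunk_size must be positive")
        if self.chunk_overlap < 0:
            raise ValueError("chunk_overlap must be non-negative")
        if self.chunk_overlap >= self.chunk_size:
            raise ValueError("chunk_overlap must be less than chunk_size")
```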
```python
offsets = encoding["offset_mapping"]

# Split the tokens into chunks of the desired size
chunks = [tokens[i : i + chunk_size] for i in range(0, len(tokens), chunk_size - chunk_overlap)]
```
Related to my comment in the schema, we will want to be sure that chunk_overlap is strictly less than chunk_size. Otherwise we could conceivably get a negative size here.
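To illustrate the failure mode and the guard, a standalone sketch (not the PR's exact code): `range()` raises on a zero step, so `chunk_overlap == chunk_size` would crash, and validating upfront gives a clearer error.

```python
def chunk_tokens(tokens, chunk_size, chunk_overlap):
    # Guard first: if chunk_overlap >= chunk_size, the range() step below
    # would be zero or negative, so the list comprehension could never
    # make forward progress through the token sequence.
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be strictly less than chunk_size")
    step = chunk_size - chunk_overlap
    return [tokens[i : i + chunk_size] for i in range(0, len(tokens), step)]
```

With `chunk_size=4` and `chunk_overlap=2` over ten tokens, this produces five windows, each sharing two tokens with its predecessor.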
@drobison00 Thanks for the reviews! I have a couple of questions. How do you think we should go about preloading the vocab files in the case where the user doesn't want to allow downloads? And in the case where they do, how should we pass along the Hugging Face token to access gated models? My thought was to pull it from an environment variable on the client side, like we do with unstructured and Adobe, and pass it along as another parameter in the schema.
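For reference, the environment-variable approach described above might look like this on the client side (the variable name is an assumption, not the project's actual convention):

```python
import os


def get_hf_token():
    # Hypothetical: read the Hugging Face token from the client
    # environment, mirroring how the unstructured and Adobe integrations
    # pass secrets; None means no gated-model access is configured.
    return os.environ.get("HUGGINGFACE_ACCESS_TOKEN")
```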
Also, I can't seem to reproduce the test failure locally.
We have a flaky test. I meant to look into fixing it, but for now you can go into Actions and rerun the test.