Fix newline and apostrophe handling for BPE #574

Merged
sayanshaw24 merged 9 commits into main from sayanshaw/bpe-fix on Oct 19, 2023

Conversation

@sayanshaw24 (Contributor) commented Oct 14, 2023

Fixes newline and apostrophe handling in the BPE tokenizer; the issues were discovered during long-text testing.

Update: CLIPTokenizer tokenizes apostrophes separately (['you', "'", 're']) but CLIPTokenizerFast tokenizes them together (['you', "'re"]), unless ftfy is installed: huggingface/transformers#22166.
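
The two apostrophe behaviours described above can be reproduced with a small standard-library sketch. This is only an illustration: both patterns are simplified, ASCII-only assumptions, and the function names are hypothetical. It is not the code used by Hugging Face or by this PR.

```python
import re

# Simplified, ASCII-only stand-in for a contraction-aware pre-tokenization
# pattern (an assumption for illustration, not any library's actual pattern).
# Contraction suffixes are tried before plain letters and punctuation.
CONTRACTION_AWARE = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d|[a-z]+|[0-9]|[^\sa-z0-9]+")

# Basic punctuation splitting: each non-word, non-space character
# becomes its own token.
PUNCTUATION_SPLIT = re.compile(r"\w+|[^\w\s]")


def split_contraction_aware(text):
    """Keep the apostrophe attached to a contraction suffix."""
    return CONTRACTION_AWARE.findall(text.lower())


def split_on_punctuation(text):
    """Split the apostrophe off as a separate token."""
    return PUNCTUATION_SPLIT.findall(text.lower())


print(split_contraction_aware("you're"))  # ['you', "'re"]
print(split_on_punctuation("you're"))     # ['you', "'", 're']
```

Because the BPE merges run on these pre-tokens, the two splits can produce different token IDs for the same input, which is why the reference tokenizer used in a test matters.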

@sayanshaw24 requested a review from a team as a code owner October 14, 2023 00:51
operators/tokenizer/bpe_utils.hpp (outdated; resolved thread)
operators/tokenizer/bpe_kernels.cc (resolved thread)
@sayanshaw24 changed the title from "Fix certain BPE issues" to "Fix newline and apostrophe handling for BPE" on Oct 18, 2023
@@ -3,6 +3,7 @@
import unittest

import numpy as np
import ftfy
Member

Why do you need to import it here? I think it was used by HF tokenizers.

Contributor Author

Yeah, but they did not check in that PR, so HF does not import it natively and the test fails locally without that import. It might still pass on the CI, but it will cause issues for local testing in the future, so I think it is better to have it in the code.

Member

I don't quite follow. If you want others to be aware that ftfy should be installed to avoid the test failure, you could check for the ftfy installation here and raise a warning if it is missing.

test/test_cliptok.py (outdated; resolved thread)

@sayanshaw24 (Contributor Author)

Sounds good, will add the ftfy warning in a separate PR.
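
The installation check suggested in the review could look roughly like the sketch below. It is an assumption about one possible shape, not the follow-up PR's actual code. The helper name `warn_if_missing` and the warning text are hypothetical; only `importlib.util.find_spec` and `warnings.warn` are standard-library calls.

```python
import importlib.util
import warnings


def warn_if_missing(package):
    """Return True if `package` is importable; otherwise warn and return False.

    Hypothetical helper: it checks for the package without importing it,
    so a missing optional dependency produces a warning, not an ImportError.
    """
    if importlib.util.find_spec(package) is not None:
        return True
    warnings.warn(
        f"{package} is not installed; tokenizer comparison tests may fail. "
        f"Install it with `pip install {package}`.",
        RuntimeWarning,
        stacklevel=2,
    )
    return False


# At the top of a test module, in place of a bare `import ftfy`:
HAS_FTFY = warn_if_missing("ftfy")
```

A flag such as `HAS_FTFY` could then feed `unittest.skipUnless` if skipping is preferred to failing, though whether to skip or only warn is a project decision.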

@sayanshaw24 merged commit 4d2930e into main Oct 19, 2023
41 checks passed
@sayanshaw24 deleted the sayanshaw/bpe-fix branch October 19, 2023 07:17
2 participants