-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix newline and apostrophe handling for BPE #574
Conversation
@@ -3,6 +3,7 @@ | |||
import unittest | |||
|
|||
import numpy as np | |||
import ftfy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you need import it here? I think it was used by HF tokenizers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah but they did not check in that PR so HF natively does not import it so it fails without that import locally - might still pass on the ci but it will cause issues for locally testing in the future so i think it is better to have it in the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not quite follow here, If you want other be aware that ftfy should be installed to to avoid the test failure. You may check ftfy installation here and raise a warning it is missed.
@@ -3,6 +3,7 @@ | |||
import unittest | |||
|
|||
import numpy as np | |||
import ftfy |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not quite follow here, If you want other be aware that ftfy should be installed to to avoid the test failure. You may check ftfy installation here and raise a warning it is missed.
sounds good, will add the ftfy warning in a separate PR. |
fixes newline and apostrophe handling for issues discovered by long text testing.
Update: CLIPTokenizer tokenizes apostrophes separately (['you', ''', 're']) but CLIPTokenizerFast tokenizes together (['you', ''re']), unless ftfy is installed: huggingface/transformers#22166.