Zero width spaces (U+200b) inside the token #1010
Comments
In English our policy is to retain the exact original string for each sentence. UniversalDependencies/UD_English-EWT#83 explains how we mark the token with |
@nschneid yes, I see this case (U+00AD) in English-EWT. Is the Should we distinguish between normal spaces (U+0020) and all other spaces ( |
Does it separate two words? If so I would say it should be encoded in UD with the |
I just fixed the Portuguese-PUD dataset. Thank you |
The guidelines say that "spaces" cannot occur in columns other than FORM, LEMMA and MISC, and that in FORM and LEMMA they can occur only in expressions specifically defined for the language. However, it is not specified what types of spaces are meant. The validator uses the Nevertheless, the character is still intended to separate words rather than to be part of them. So if it is preserved in treebanks, it should probably be a separate token. And if it is a separate token, I cannot see it tagged and attached as anything other than punctuation, although I cannot say I like such a solution. It would also be possible to say that the validator should consider both |
Here is the list of all instances of
|
There is only one character from br_keb-ud-test.conllu:
All other spaces in FORM and LEMMA columns are normal spaces ( |
A refined rule for columns FORM and LEMMA could probably be: "The value cannot begin or end with a space-like character". |
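As a rough illustration of the refined rule, here is a minimal Python sketch of such a check. The thread does not define "space-like", so this assumes it means any character for which `str.isspace()` is true plus a small hand-picked set of zero-width characters (`ZERO_WIDTH`). The helper names are hypothetical and this is not the validator's actual code.

```python
# Assumption: "space-like" = Unicode whitespace plus a few zero-width
# characters. The thread leaves the exact definition open.
ZERO_WIDTH = {"\u200b", "\u2060", "\ufeff"}


def is_space_like(ch: str) -> bool:
    """Return True for whitespace and for the assumed zero-width characters."""
    return ch.isspace() or ch in ZERO_WIDTH


def violates_edge_rule(value: str) -> bool:
    """Return True if a FORM or LEMMA value begins or ends with a space-like character."""
    if not value:
        return False
    return is_space_like(value[0]) or is_space_like(value[-1])
```

Under these assumptions, a value such as `"in spite of"` passes because its spaces are internal, while `"\u200bword"` is flagged.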
According to the standard, should space-like characters such as the zero width space (U+200B) be included in tokens, or skipped like the normal space character?
Here are some examples:
be_hse-ud-dev.conllu
tr_penn-ud-train.conllu
pt_pud-ud-test.conllu
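To find cases like these in other treebanks, a scan along the following lines might help. It is only a sketch: it assumes the standard 10-column, tab-separated CoNLL-U layout with FORM and LEMMA in the second and third columns, and it uses a hand-picked set of zero-width characters (`ZERO_WIDTH`) in addition to Unicode whitespace. The function names are hypothetical.

```python
from typing import Iterator, List, Tuple

# Assumption: characters worth reporting are any whitespace other than
# U+0020, plus these zero-width characters.
ZERO_WIDTH = {"\u200b", "\u2060", "\ufeff"}


def unusual_spaces(value: str) -> List[str]:
    """List code points of space-like characters other than the normal space."""
    return [
        f"U+{ord(ch):04X}"
        for ch in value
        if ch != " " and (ch.isspace() or ch in ZERO_WIDTH)
    ]


def scan_conllu(text: str) -> Iterator[Tuple[int, str, str, List[str]]]:
    """Yield (line number, token ID, column name, code points) for each hit."""
    # Split on "\n" only, so that unusual separators inside a line survive.
    for line_no, line in enumerate(text.split("\n"), start=1):
        if not line or line.startswith("#"):
            continue  # skip sentence boundaries and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        for name, idx in (("FORM", 1), ("LEMMA", 2)):
            found = unusual_spaces(cols[idx])
            if found:
                yield line_no, cols[0], name, found
```

Reading a file with `open(path, encoding="utf-8").read()` and passing the result to `scan_conllu` would list every FORM or LEMMA value that contains such a character, together with its line number and token ID.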