New model for unsupported language (Albanian: sq) #1360
Comments
Random request: this is really hard to read. Please check the formatting of the stack traces next time. |
Try adding
|
… a note on how to develop a new language's Pipeline. Related to #1360
many thanks AngledLuffa, now it works. And sorry about the awful formatting :( |
unfortunately, the new option and/or the dev branch doesn't seem to work. If I load models using the config dictionary, I get the following
but the pos processor is actually loaded!
│ ├── sq_nel_nocharlm_parser_checkpoint.pt |
I can see that you're loading the POS model first before the depparse. Sanity check first: is the POS model labeling either upos or xpos? If somehow it was trained to only label the features, I could see it throwing this kind of error. Otherwise, it really looks from the code that this particular error shouldn't happen, since it only triggers if both upos and xpos are missing for a word.
If the POS model should be working, what happens if you run the pipeline without the depparse and print out the results? Are there any sentences for which the POS is actually missing? I wonder if that can happen if the POS model has blank tags in the dataset it's learning from |
Many thanks for the detailed answer! This is really strange: I have tried to load the pipeline as I do in the script, and it worked correctly on a few sentences. I have also tried passing the script a small txt file with some sentences, and it worked too. |
If it "misses" things to be incorrect, that's one thing. But I do very much wonder why it would label anything Are you able to send the data + the data you are trying to test on, or maybe just send the model and the test data? I'd really like to see it in action myself to debug this issue. Another possible debugging step would be to examine the output of just the tokenizer and the POS w/o any of the subsequent models and check for any words which are missing both xpos and upos. |
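The debugging step suggested above (look at the tagger output alone and check for words missing both xpos and upos) could be sketched roughly as follows. This is only an illustration: `words_missing_pos` is a made-up helper name, and the stand-in document below just mimics the `sentences` / `words` / `upos` / `xpos` attributes of a Stanza document so the snippet runs without any models.

```python
from types import SimpleNamespace


def words_missing_pos(doc):
    """Return (sentence_index, word_text) for words with neither upos nor xpos."""
    missing = []
    for sent_idx, sentence in enumerate(doc.sentences):
        for word in sentence.words:
            if not word.upos and not word.xpos:
                missing.append((sent_idx, word.text))
    return missing


# Stand-in for pipeline output (an assumption, for illustration only).
# With real models, `doc` would be the result of running a Pipeline that
# contains only the tokenizer and the POS tagger on the test text.
fake_doc = SimpleNamespace(sentences=[
    SimpleNamespace(words=[
        SimpleNamespace(text='"', upos=None, xpos=None),
        SimpleNamespace(text="word", upos="NOUN", xpos=None),
    ]),
])

print(words_missing_pos(fake_doc))
```

Any word reported here is one that would trip the "both upos and xpos are missing" check in a downstream processor.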
Fascinating. I ran an experiment on English with DET/DT replaced with blanks. Apparently, giving the tagger empty tags for the POS tag results in it labeling words with None as tags. This must be what's happening to you - there are entries in your training data which don't have either UPOS or XPOS. Is this something you want to fix on your end? Maybe the tagger is supposed to ignore those items, or learn to tag them with |
... to be more precise, it IS learning to tag words w/o tags with |
thing is, I have already used these data to train a model two or three times last November and it worked fine. I have just added a few sentences for teaching the parser to recognize mwt like Albanian ta = të + e. |
It will successfully train a tagger even if there are empty tags. However, it's learned to recognize some words as having the empty tag, and that's the label the tagger gives those words. Did I express that clearly? I did the following experiment. Instead of sentences such as this in English, where
I changed all instances of
Now the tagger I trained labels I think it might make more sense to either throw an error when training a tagger on a partially complete file, or possibly treat single blank tags as masked out. Learning to recognize the blank tag doesn't seem very useful... In the meantime, if you find and eliminate those blank tags from your dataset, I believe this error will go away. |
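For finding those blank tags in the training data, a small scan over the CoNLL-U file may be enough. The sketch below is not part of Stanza; `find_blank_pos` is a hypothetical helper, and the sample sentence is a toy English example in the spirit of the DET/DT experiment described above. It assumes the standard ten tab-separated CoNLL-U columns, with UPOS in column 4 and XPOS in column 5.

```python
def find_blank_pos(conllu_text):
    """Return (line_number, form) for tokens whose UPOS and XPOS are both blank."""
    blanks = []
    for line_no, line in enumerate(conllu_text.splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) < 5:
            continue  # not a token line
        token_id, form, upos, xpos = cols[0], cols[1], cols[3], cols[4]
        if "-" in token_id or "." in token_id:
            continue  # MWT ranges and empty nodes carry no tags by design
        if upos in ("", "_") and xpos in ("", "_"):
            blanks.append((line_no, form))
    return blanks


# Toy sample (illustrative only): the determiner has had its tags blanked.
sample = "\n".join([
    "# text = The dog barks",
    "1\tThe\tthe\t_\t_\t_\t2\tdet\t_\t_",
    "2\tdog\tdog\tNOUN\tNN\t_\t3\tnsubj\t_\t_",
    "3\tbarks\tbark\tVERB\tVBZ\t_\t0\troot\t_\t_",
    "",
])

for line_no, form in find_blank_pos(sample):
    print(f"line {line_no}: {form!r} has no UPOS or XPOS")
```

To run it on a real treebank, read the file with `open(path, encoding="utf-8").read()` and pass the text in.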
ok, I have successfully parsed a file with just the pos tagging. Indeed, there are some tokens without UPOS. Actually, just one i.e., the stupid " punctuation 🔝 |
Indeed. I just need to figure out what the right approach is. The two leading candidates in my mind are to stop the tagger from training if there are blank UPOS, so as to give the user a chance to go back and fix the issue, or to treat the blanks as unlabeled tokens in the tagger which don't get a label of any kind. The second one is more appealing to me ideologically, but the problem is that in a case similar to yours where maybe all the punctuation was unlabeled, then they would all get tagged with the most likely known tag at test time (perhaps NOUN, for example). If you have an alternate suggestion, happy to hear it. |
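To make the two candidates concrete, here is a toy sketch of both policies applied to one sentence's UPOS tags. It is not how the Stanza tagger is implemented; `prepare_upos_labels` and the `IGNORE` marker are hypothetical names used only for illustration.

```python
IGNORE = None  # hypothetical marker meaning "do not train on this token"


def prepare_upos_labels(tags, policy="error"):
    """Apply one of two policies to a sentence's UPOS tags.

    Blank tags are "", "_" or None.
    policy="error": refuse to continue, so the user can fix the data.
    policy="mask":  keep the token but mark its label as ignored.
    """
    blank_positions = [i for i, tag in enumerate(tags) if tag in ("", "_", None)]
    if policy == "error":
        if blank_positions:
            raise ValueError(f"blank UPOS at token positions {blank_positions}")
        return list(tags)
    if policy == "mask":
        return [IGNORE if i in blank_positions else tag
                for i, tag in enumerate(tags)]
    raise ValueError(f"unknown policy: {policy}")


print(prepare_upos_labels(["_", "NOUN", "VERB"], policy="mask"))
```

With masking, a training loop would skip the ignored positions when computing the loss, which is exactly why unlabeled punctuation would end up with the most likely known tag at test time.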
I have corrected the dataset, retrained the model and now the parser works fine. |
…ly labeled training data was causing problems when training a non-UD dataset #1360
This error message is now part of the 1.8.2 release. Is there anything else you need addressed? |
great! thank you, everything looks good! |
@rahonalab I'm wondering - there is only a very small Albanian UD dataset on universaldependencies.org, and I don't see any planned Albanian expansions. Can I ask what dataset you used for this? If there is any publicly available data (larger than the UD dataset) we could add this language as a standard language to Stanza. |
hello! I have used two datasets which we plan to release as UD treebanks soon. I'll keep you posted |
That would be excellent! Looking forward to it. |
hello, the first of the two datasets for Albanian has been released in UD 2.15: https://github.com/UniversalDependencies/UD_Albanian-STAF It's a bit tiny (200 sentences, 3.3K tokens), but I hope it can already serve as a training model. I am not responsible for the second dataset, but here's a paper describing it: https://aclanthology.org/2024.clib-1.7.pdf |
Thanks for the heads up! Do you know if the second treebank you just used will also be part of UD? If you don't know, I can contact the authors. Furthermore, do you have any thoughts on the interoperability of the two treebanks? Are we able to just add the training data from the two of them together, or will there be significant differences in the annotation schemes? My guess would be they will be interoperable, since from reading your work it appears you have used models trained from their dataset to bootstrap the annotation of your dataset. Such a situation would be ideal, as we can easily combine the two datasets in that case. |
Hello! Yes, they have plans for a UD release. |
Excellent, thanks for the heads up. It's possible to train the tagger on UPOS and XPOS from both treebanks, but just the features from your treebank, so that's what I'll do for Albanian. If'n the other treebank gets added, I'll add that to the mix as well. Incidentally, you might very well be able to update their treebank with your newer feature scheme by starting with such a tagger, silver tagging their dataset with your feature versions, and then hand correcting them. 60 sentences might not be too many sentences for such a project, and often other treebank maintainers are happy to get improved annotation schemes. What do you mean by syntactic relations - do you mean there are dependency types which appear in your treebank but don't appear in the other treebank? That would be harder to make use of with our current model, although perhaps we could make a version of the dependency parser which has the same input layers but two prediction heads, and therefore can train the bottom and middle layers so it is possible to learn from different dependency annotation schemes. Do you have a recommendation for word vectors to use for these models? Fasttext has word vectors: https://fasttext.cc/docs/en/crawl-vectors.html but frequently I have found that a dedicated project to building embeddings will produce something that performs better on downstream tasks than those embeddings. Also, if you can think of NER, sentiment, coref, or (doubtful) constituency datasets for Albanian, we can build models for that as well. |
Thank you! I have used fasttext for training Albanian models. |
I find that the fasttext vectors are better than random initialization, although not by a huge amount. Would you clarify what you mean by the syntactic relations are different - is it that the dependency trees have different dependency types? In that case, the trees probably shouldn't mix together, right? |
Yes, STAF and TSA differ in a few dependency types. The new sub-dependency types introduced in STAF are listed in the documentation, while the difference between the two treebanks is in the treatment of the clitic pronouns. In Albanian, indirect object and, to a lesser extent, direct object are marked twice: on the nominal argument and on a co-referring pronoun. STAF annotates both for obj/iobj, while TSA annotates the former for obj/iobj and the latter for expl - see 'Clitic Doubling' in TSA's paper. |
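One rough way to see such differences is to compare the dependency label inventories of the two treebanks. The sketch below is an illustration, not an existing tool: `deprel_inventory` is a made-up helper, and the two three-token samples are invented placeholders (not real treebank sentences) that only imitate the clitic-doubling contrast described above. It assumes standard CoNLL-U with DEPREL in column 8.

```python
def deprel_inventory(conllu_text):
    """Return the set of DEPREL labels used on regular tokens."""
    labels = set()
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 8 or "-" in cols[0] or "." in cols[0]:
            continue  # skip malformed lines, MWT ranges, empty nodes
        labels.add(cols[7])
    return labels


# Placeholder rows: a clitic and a nominal argument both attached to a verb.
staf_like = "\n".join([
    "1\tclitic\t_\tPRON\t_\t_\t2\tiobj\t_\t_",
    "2\tverb\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tnoun\t_\tNOUN\t_\t_\t2\tiobj\t_\t_",
])
tsa_like = "\n".join([
    "1\tclitic\t_\tPRON\t_\t_\t2\texpl\t_\t_",
    "2\tverb\t_\tVERB\t_\t_\t0\troot\t_\t_",
    "3\tnoun\t_\tNOUN\t_\t_\t2\tiobj\t_\t_",
])

first, second = deprel_inventory(staf_like), deprel_inventory(tsa_like)
print("only in first: ", sorted(first - second))
print("only in second:", sorted(second - first))
```

A label-set comparison like this only catches labels that exist in one treebank and not the other; the same label used for different constructions would need a per-construction count instead.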
Ultimately our depparse doesn't have any capacity to learn from two different labeling schemes for trees. We can add it (as we did for the POS and NER tags) but in the meantime I'll make tokenize, mwt, lemma, and pos from both treebanks and just make the depparse from STAF |
How about the MWT? I notice that the STAF dataset has MWT labeled, whereas TSA does not. I think some of the tokens are the same across datasets, though. For example: STAF, MWT label on
TSA, no MWT label on
There's also this in TSA, although it doesn't show up in STAF:
|
Yes, there is also support for MWTs in the STAF treebank. I was able to compile the MWT processor with the unreleased data and it worked pretty well. |
That's good to hear - my concern here is that the TSA treebank doesn't have MWT, and apparently does have tokens which could have been labeled MWT. Therefore it probably shouldn't be combined with STAF for training MWT ... although I wonder if it'd be a simple and useful improvement to just add them. For example, the couple I linked above look like obvious candidates. Would you like to file such an issue? I can also look into figuring out what I can based on examples like the above, where |
Do you want me to open an issue on TSA? I can, but I am not sure that the treebank is still maintained… |
I did try, but I'm not sure we'll hear back any time soon either. Still I believe if the treebank is truly unmaintained, it's possible for someone else to offer to help with it - otherwise eventually the treebank will fall behind some updated requirement and not be published any more. As it stands now it's a bit less clear how much we can use of that data - without MWT, the tokenization also can't be combined or it will be learning not to mark things as MWT. So that leaves pretty much the lemmas and the UPOS, which is at least something, I suppose. One thing that would help a lot if you have the time to look for other MWT in that treebank aside from |
I took a stab at the MWT change: UniversalDependencies/UD_Albanian-TSA#8 I looked for all the ones in STAF, and all the words with |
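For anyone wanting to repeat that kind of search, a minimal sketch is below: collect the surface forms of multi-word tokens from a treebank that marks them, then list single tokens in another treebank with the same surface form. The helper names are hypothetical, the samples are toy rows built around the ta = të + e example from earlier in this thread, and every column other than ID and FORM is left as a `_` placeholder.

```python
def mwt_forms(conllu_text):
    """Collect lowercased surface forms of MWT range lines (IDs like 1-2)."""
    forms = set()
    for line in conllu_text.splitlines():
        cols = line.split("\t")
        if line.startswith("#") or len(cols) < 2:
            continue
        if "-" in cols[0]:
            forms.add(cols[1].lower())
    return forms


def mwt_candidates(conllu_text, known_forms):
    """Return (line_number, form) for plain tokens matching a known MWT form."""
    hits = []
    for line_no, line in enumerate(conllu_text.splitlines(), start=1):
        cols = line.split("\t")
        if line.startswith("#") or len(cols) < 2:
            continue
        if cols[0].isdigit() and cols[1].lower() in known_forms:
            hits.append((line_no, cols[1]))
    return hits


blank = "\t".join(["_"] * 8)  # placeholder for LEMMA..MISC columns
with_mwt = "\n".join([f"1-2\tta\t{blank}", f"1\ttë\t{blank}", f"2\te\t{blank}"])
without_mwt = "\n".join([f"1\tTa\t{blank}", f"2\tword\t{blank}"])

print(mwt_candidates(without_mwt, mwt_forms(with_mwt)))
```

Matching on surface form is only a heuristic, since a homograph may not be a contraction in every context, so the hits still need a manual check.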
I just pushed a new version of Stanza with an Albanian model built from both of the treebanks. If you have time, would you double check the MWT updates in that PR I posted above? Thanks! |
I'll check the new model and let you know. |
Sorry for the double bug report.
Can you please tell me the right procedure to load a model for a language that is not currently supported, i.e., Albanian (sq)?
I have tried the following two things:
pipeline = stanza.Pipeline("sq", dir="DIR_TO_THE_MODEL",download_method=None)
It doesn't work:
2024-03-02 15:25:18 WARNING: Unsupported language: sq.
Traceback (most recent call last):
  File "/tools/ud-stanza-other.py", line 149, in <module>
    main()
  File "/tools/ud-stanza-other.py", line 105, in main
    nlp = stanza.Pipeline(**config, logging_level="DEBUG")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/stanza/pipeline/core.py", line 268, in __init__
    logger.info(f'Loading these models for language: {lang} ({lang_name}):\n{load_table}')
                                                              ^^^^^^^^^
UnboundLocalError: cannot access local variable 'lang_name' where it is not associated with a value
    # Language code for the language to build the Pipeline in
    'lang': 'sq',
    # Processor-specific arguments are set with keys "{processor_name}_{argument_name}"
    # You only need model paths if you have a specific model outside of stanza_resources
    'tokenize_model_path': '/corpus/models/stanza/sq/tokenize/sq_nel_tokenizer.pt',
    'pos_model_path': '/corpus/models/stanza/sq/pos/sq_nel_tagger.pt',
    'lemma_model_path': '/corpus/models/stanza/sq/lemma/sq_nel_lemmatizer.pt',
    'depparse_model_path': '/corpus/models/stanza/sq/depparse/sq_nel_parser.pt',
    'pos_pretrain_path': '/corpus/models/stanza/sq/pretrain/sq_fasttext.pretrain.pt',
    'depparse_pretrain_path': '/corpus/models/stanza/sq/pretrain/sq_fasttext.pretrain.pt',
})
But, again, it doesn't work:
2024-03-02 16:00:25 WARNING: Unsupported language: sq.
Traceback (most recent call last):
  File "/tools/ud-stanza-other.py", line 149, in <module>
    main()
  File "/tools/ud-stanza-other.py", line 105, in main
    nlp = stanza.Pipeline(**config, logging_level="DEBUG")
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/site-packages/stanza/pipeline/core.py", line 268, in __init__
    logger.info(f'Loading these models for language: {lang} ({lang_name}):\n{load_table}')
                                                              ^^^^^^^^^
UnboundLocalError: cannot access local variable 'lang_name' where it is not associated with a value
As a workaround, I have put a code of a supported language, but it's not ideal, as it might load other models...
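For reference, the config dictionary in the second attempt follows the "{processor_name}_{argument_name}" key pattern, and it can be assembled programmatically. The sketch below only illustrates that naming scheme: `build_pipeline_config` is a hypothetical helper, not a Stanza function, and building the dictionary this way does not by itself avoid the UnboundLocalError shown above.

```python
def build_pipeline_config(lang, model_paths, pretrain_paths=None):
    """Build a Pipeline keyword dict from processor -> file path mappings."""
    config = {"lang": lang, "processors": ",".join(model_paths)}
    for processor, path in model_paths.items():
        config[f"{processor}_model_path"] = path
    for processor, path in (pretrain_paths or {}).items():
        config[f"{processor}_pretrain_path"] = path
    return config


base = "/corpus/models/stanza/sq"  # paths copied from the report above
pretrain = f"{base}/pretrain/sq_fasttext.pretrain.pt"
config = build_pipeline_config(
    "sq",
    {
        "tokenize": f"{base}/tokenize/sq_nel_tokenizer.pt",
        "pos": f"{base}/pos/sq_nel_tagger.pt",
        "lemma": f"{base}/lemma/sq_nel_lemmatizer.pt",
        "depparse": f"{base}/depparse/sq_nel_parser.pt",
    },
    {"pos": pretrain, "depparse": pretrain},
)

for key, value in config.items():
    print(f"{key}: {value}")

# With Stanza installed, the dict would then be passed as keyword arguments:
#   nlp = stanza.Pipeline(**config, download_method=None)
```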
Thanks!