Almost-nonsense output with rnnmorph #39

Open
adamdrazsky opened this issue Nov 22, 2022 · 1 comment

adamdrazsky commented Nov 22, 2022

Hi, I'm using rnnmorph on Python 3.9 to attach morphological tags to the input of russian-g2p for improved accenting.
The output is very strange: some words contain multiple pluses as well as accents on top of letters. See below.

The code:

from russian_g2p.Accentor import Accentor
from rnnmorph.predictor import RNNMorphPredictor

accentor = Accentor()
predictor = RNNMorphPredictor(language="ru")

testing_corpus = [
    # testing дома
    'Я сегодня остался дома',  # до+ма
    'рабочие строят высокие бетонные дома',  # дома+
    # testing козы
    'На поляне пасутся козы',  # ко+зы
    'У моей козы сломана нога',  # козы+
    # testing уже
    'я уже пришёл из школы',  # уже+
    'эта юбка уже, чем полоска'  # у+же
]

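# Tag each sentence: strip commas, split on spaces, run the morphological predictor.
sentences_tagged = [predictor.predict(testing_sentence.replace(",", "").split(' ')) for testing_sentence in testing_corpus]

for sent in sentences_tagged:
    # Pair each word with its "POS tag" string, the form passed to do_accents.
    sent_tagged = [[word.word, f"{word.pos} {word.tag}"] for word in sent]
    # Accent one word at a time.
    sent_accented = [accentor.do_accents([[word[0], word[1]]]) for word in sent_tagged]
    print(sent_accented)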

Output:

[[['я+']], [['сего+дня']], [['оста+лся']], [['+д+о+́+м+а+']]]
[[['рабо+чие']], [['стро+ят']], [['высо+кие']], [['бето+нные']], [['дома+']]]
[[['на']], [['поля+не']], [['пасу+тся']], [['+к+о+́+з+ы+']]]
[[['у']], [['мое+й']], [['+к+о+з+ы+́+']], [['сло+мана']], [['нога+']]]
[[['я+']], [['у+же']], [['пришё+л']], [['из']], [['шко+лы']]]
[[['э+та']], [['ю+бка']], [['у+же']], [['че+м']], [['поло+ска']]]

As you can see, some words have multiple accent markers; I'm expecting only one per word at most.

What's going on here?
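
To make the malformed entries easy to spot, a check along these lines could be run over the nested lists printed above. This is only a minimal sketch: `count_stress_marks` and `find_malformed` are hypothetical helper names, not part of russian_g2p or rnnmorph, and it assumes a stress mark is either a `+` or a combining acute accent (U+0301).

```python
COMBINING_ACUTE = "\u0301"  # the accent that shows up on top of letters


def count_stress_marks(word):
    """Count '+' markers and combining acute accents in one word."""
    return word.count("+") + word.count(COMBINING_ACUTE)


def find_malformed(sent_accented):
    """Return words from the nested results that carry more than one stress mark."""
    malformed = []
    for result in sent_accented:   # one do_accents() result per word
        for variant in result:     # each result is a list of lists of strings
            for word in variant:
                if count_stress_marks(word) > 1:
                    malformed.append(word)
    return malformed


# Third sentence from the output above, with the accent written as an escape.
sample = [[["на"]], [["поля+не"]], [["пасу+тся"]], [["+к+о+\u0301+з+ы+"]]]
print(find_malformed(sample))
```

Only the last word of that sentence is flagged; words with zero or one `+` pass through.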

adamdrazsky commented Nov 28, 2022

For a quick test, run:

accentor.do_accents([['дома', 'ADV Degree=Pos']])

Output:
[['+д+о+́+м+а+']]

Expected:
[['до+ма']]
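
Until the cause is found, the garbled form can be mapped back to the expected one in post-processing: removing every `+` from the output above leaves the word with a single combining acute accent (U+0301) right after the stressed vowel. The sketch below is a workaround under that assumption, not a fix for the accentor itself, and `repair_accent` is a hypothetical helper name.

```python
COMBINING_ACUTE = "\u0301"


def repair_accent(word):
    """Turn a garbled word such as '+д+о+<U+0301>+м+а+' into the 'до+ма' form.

    Words without a combining acute accent are returned unchanged, so
    already well-formed results like 'дома+' are not touched.
    """
    if COMBINING_ACUTE not in word:
        return word
    # Drop the stray pluses, then use '+' in place of the combining accent.
    return word.replace("+", "").replace(COMBINING_ACUTE, "+")


print(repair_accent("+д+о+\u0301+м+а+"))
print(repair_accent("дома+"))
```

This only hides the symptom. It relies on the combining accent marking the intended stress position, which matches the two garbled forms shown in this issue but may not hold for every word.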
