
WebSocket Closure, Background Noise Interruption, and ASR Resource Management in vosk-asterisk #53

Open
siship opened this issue Oct 28, 2024 · 0 comments


Hello,

We have configured vosk-asterisk and built some demos with custom Kaldi models. We have been using the Vosk offline API and it's great. Thank you, @nshmyrev!

While building and testing a demo for a streaming application where interruption is important, we ran into a few issues.

Vosk WebSocket streaming ASR server:

• Number of cores: 12
• RAM: 56GB
• Total number of instances: 51
• Load averages:
    ◦ For ~50 active calls: ~17
    ◦ For ~67 active calls: ~22
    ◦ For ~30 active calls: ~12
    ◦ For ~15 active calls: ~5
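
To put the load figures in perspective, here is a rough sketch that divides each reported load average by the number of active calls and compares it with the 12 available cores. The numbers are the approximate values listed above; the helper name `load_per_call` is only illustrative.

```python
# Approximate figures reported above: (active calls, load average).
CORES = 12
samples = [(15, 5.0), (30, 12.0), (50, 17.0), (67, 22.0)]

def load_per_call(calls, load):
    """Average load contributed by one active call."""
    return load / calls

for calls, load in samples:
    per_call = load_per_call(calls, load)
    above_cores = load > CORES  # load average exceeds the number of cores
    print(f"{calls:>3} calls: load {load:>4.1f}, "
          f"{per_call:.2f} per call, above core count: {above_cores}")
```

Each streaming call adds roughly 0.33 to 0.40 to the load average, and the load average is already above the core count at 50 calls.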

model.config

--min-active=200
--max-active=7000
--beam=11.0
--lattice-beam=6.0
--acoustic-scale=1.0
--frame-subsampling-factor=3
--endpoint.silence-phones=1:2:3:4:5:
--endpoint.rule1.min-trailing-silence=20
--endpoint.rule2.min-trailing-silence=0.5
--endpoint.rule3.min-trailing-silence=1.0
--endpoint.rule4.min-trailing-silence=2.0

A few observations:
• With the resources above, the server handles only about 50 active streaming calls without delay (the offline Vosk API handles at least 300 calls at a time on the same hardware).
• The WebSocket closes unexpectedly, without any error, before the final text result is sent. As a result, we receive None from Vosk.
• Even slight background noise interrupts the call.
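
For reproducing the missing final result outside Asterisk, a minimal sketch of a test client shutdown may help. It assumes the vosk-server convention that the client sends `{"eof" : 1}` and the server answers with the final result before the connection is closed. The `ws` object is hypothetical: anything with async `send` and `recv` methods, such as a connection from a WebSocket client library. The function name and the timeout value are illustrative as well.

```python
import asyncio
import json

async def finish_stream(ws, timeout=5.0):
    """Signal end of audio, then wait for a result before closing.

    Returns the first "text" value received after the eof message,
    or None if nothing arrives within `timeout` seconds.
    Errors raised by `ws` itself (e.g. connection closed) are not handled.
    """
    await ws.send('{"eof" : 1}')  # end-of-stream message used by vosk-server
    loop = asyncio.get_running_loop()
    deadline = loop.time() + timeout
    while True:
        remaining = deadline - loop.time()
        if remaining <= 0:
            return None
        try:
            raw = await asyncio.wait_for(ws.recv(), remaining)
        except asyncio.TimeoutError:
            return None
        msg = json.loads(raw)
        if "text" in msg:  # skip pending {"partial": ...} messages
            return msg["text"]
```

If a client like this always gets the final text but the Asterisk side does not, the socket is probably being closed on the client side before the reply to the eof message is read.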

Questions:
1. How can we prevent the IVR hold tone from triggering the ASR model? Is there any easy solution?
2. The ASR runs continuously, even during long silences (e.g., while the user is listening to long IVR questions), which consumes extra resources. Is there a more efficient way to handle this?
3. [Related to point 2] Is it possible to integrate an external VAD algorithm, such as Silero-VAD, with vosk-asterisk to detect silence and noise before audio is sent to the ASR?
4. We noticed that when speaking on speakerphone (hands-free), the recognition results are somewhat inconsistent and inaccurate. Could this be an issue with how Asterisk is recording the audio?
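
To illustrate what we mean by gating audio in questions 2 and 3, here is a minimal sketch of a pre-ASR gate. It uses a plain RMS energy threshold as a stand-in, not Silero-VAD, and assumes 16-bit signed PCM frames in native byte order. The class name, threshold, and hangover length are hypothetical and would need tuning.

```python
import array
import math

def rms(frame: bytes) -> float:
    """Root-mean-square amplitude of a 16-bit signed PCM frame.

    The frame length must be a multiple of 2 bytes.
    """
    samples = array.array("h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

class EnergyGate:
    """Forward frames only while recent energy is above a threshold."""

    def __init__(self, threshold=500.0, hangover_frames=10):
        self.threshold = threshold              # hypothetical value
        self.hangover_frames = hangover_frames  # frames kept after speech ends
        self._remaining = 0

    def should_forward(self, frame: bytes) -> bool:
        if rms(frame) >= self.threshold:
            self._remaining = self.hangover_frames
            return True
        if self._remaining > 0:  # keep the tail of an utterance
            self._remaining -= 1
            return True
        return False

# Usage: 20 ms frames at 8 kHz are 160 samples (320 bytes).
gate = EnergyGate(threshold=500.0, hangover_frames=2)
silence = bytes(320)
print(gate.should_forward(silence))
```

An energy gate like this would drop silence but would still pass a loud hold tone, which is why we are asking about a model-based VAD.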

Versions:
Asterisk 18.21.0
CentOS 7

Thanks in advance.
