Added the Speech Module #14

Open · wants to merge 7 commits into base: master
8 changes: 8 additions & 0 deletions Assets/SpeechSDK.meta

8 changes: 8 additions & 0 deletions Assets/StreamingAssets.meta

8 changes: 8 additions & 0 deletions Assets/i5 Toolkit for Unity/Runtime/Speech Module.meta

46 changes: 46 additions & 0 deletions Assets/i5 Toolkit for Unity/Runtime/Speech Module/README.md
@@ -0,0 +1,46 @@
# Speech Module
The speech module provides extendable Speech-To-Text (speech recognition) and Text-To-Speech (speech synthesis) functionality for Unity programs on the Windows Standalone, UWP, and Android platforms.

## Components
The speech module consists of three components: speech recognizers, speech synthesizers, and the `SpeechProvider`. You can implement your own recognizers and synthesizers if needed. All licenses of third-party libraries can be found in `THIRD-PARTY-NOTICES` under the `Third Party Plugins` folder.

### Speech Recognizer
All speech recognizers should implement the `ISpeechRecognizer` interface, i.e. its `StartRecordingAsync()` and `StopRecordingAsync()` methods, its `Language` and `IsApplicable` properties, and its `OnRecognitionResultReceived` event. A recognizer should also inherit from `MonoBehaviour`; a minimal skeleton is sketched at the end of this section. There are two pre-implemented recognizers in the module:
- `AzureSpeechRecognizer`, which uses Microsoft's [Azure Cognitive Services](https://azure.microsoft.com/en-us/services/cognitive-services/#overview). It provides two modes: SingleShot and Continuous. In SingleShot mode, it stops automatically when it detects silence, while in Continuous mode, the user must stop it manually. You need a subscription key, a service region, and an internet connection to use the Azure service. This recognizer also requires the [Speech SDK](https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/quickstarts/setup-platform?pivots=programming-language-csharp&tabs=windows%2Cubuntu%2Cunity%2Cjre%2Cmaven%2Cbrowser%2Cmac%2Cpypi).
- `NativeSpeechRecognizer`, which is neural-network based and runs offline on the device. It uses the [Vosk](https://alphacephei.com/vosk/index) library and therefore requires a neural-network model, which can be downloaded [here](https://alphacephei.com/vosk/models). It is suggested to download the small models, which are typically around 40 to 50 MB. The models must be placed under the `Assets/StreamingAssets` folder. In the Inspector, you need to specify the path to the model; if the model is placed directly under `StreamingAssets`, the path is just the model's file name ending in ".zip". You can add any language for which a model is available.

The Speech SDK is not included in the package; you need to import it yourself. See the chapter below for details.
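
For orientation, here is a minimal, do-nothing skeleton of a custom recognizer. It is only a sketch: the class name is a placeholder, and the member signatures are assumed to match those of the `AzureSpeechRecognizer` shipped with this module.

```csharp
using System;
using System.Threading.Tasks;
using UnityEngine;
using i5.Toolkit.Core.SpeechModule;

// Placeholder recognizer that does nothing useful; it only illustrates the ISpeechRecognizer members.
public class MyCustomSpeechRecognizer : MonoBehaviour, ISpeechRecognizer
{
    // Fired when a recognition result is available.
    public event Action<RecognitionResult> OnRecognitionResultReceived;

    // The language to recognize.
    public Language Language { get; set; }

    // Add your own availability checks here (e.g. network reachability, platform).
    public bool IsApplicable => enabled;

    public Task<RecognitionResult> StartRecordingAsync()
    {
        // Start your recognition backend here and raise OnRecognitionResultReceived when text arrives.
        RecognitionResult result = new RecognitionResult();
        OnRecognitionResultReceived?.Invoke(result);
        return Task.FromResult(result);
    }

    public Task StopRecordingAsync()
    {
        // Recognizers that stop automatically (e.g. on silence) can leave this method empty.
        return Task.CompletedTask;
    }
}
```
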
### Speech Synthesizer
All speech synthesizers should implement the `ISpeechSynthesizer` interface, i.e. its `StartSynthesizingAndSpeakingAsync()` method, its `Language`, `IsApplicable`, and `OutputForm` properties, and its `OnSynthesisResultReceived` event. `OutputForm` has two values: `To Speaker` and `As Byte Stream`. Since some APIs let developers access the raw byte stream, we suggest using `As Byte Stream` if you need spatial sound. In that case, the stream is converted to an `AudioClip` and played by an `AudioSource` on the attached `GameObject`. This is especially useful when you develop an agent, since spatial sound makes it more human-like. If the API you want to call does not support this, you can ignore this property; the `SpeechProvider` will take care of it.

Again, there are two implemented instances:
- `AzureSpeechSynthesizer`, which works similarly to the `AzureSpeechRecognizer`.
- `NativeSpeechSynthesizer`, which is an offline synthesizer. For Windows Standalone, it uses the [Microsoft Speech API (SAPI)](https://docs.microsoft.com/en-us/previous-versions/windows/desktop/ee125663(v=vs.85)) through the `interop.speechlib.dll`. For UWP, it uses the `Windows.Media.SpeechSynthesis` API through the `TextToSpeechUWP` script, which is a slightly modified version of the [`TextToSpeech` script from Microsoft's `MixedRealityToolkit`](https://github.com/microsoft/MixedRealityToolkit-Unity/blob/main/Assets/MRTK/SDK/Features/Audio/TextToSpeech.cs). For Android, it uses the scripts and Android plugins from the GitHub repository [nir-takemi/UnityTTS](https://github.com/nir-takemi/UnityTTS). The native synthesizer only supports English on all platforms.

None of these third-party libraries are included; you need to import them yourself. See the chapter below for details.

### Speech Provider
The `SpeechProvider` requires at least one `ISpeechRecognizer` and one `ISpeechSynthesizer` on the same `GameObject`. Recognizers and synthesizers with higher priority should be placed above the others in the Inspector. The provider manages the attached `ISpeechRecognizer` and `ISpeechSynthesizer` components and exposes their functionality to users, so you only need to implement your own `ISpeechRecognizer` and `ISpeechSynthesizer` if needed and do not have to handle the user-interaction aspects for each of them. There may also be further settings (serialized fields) on the individual recognizers and synthesizers. If the selected recognizer or synthesizer is not applicable (checked via its `IsApplicable` property), the provider automatically switches to another applicable one. For synthesizers, it then repeats the synthesis of the given text; for recognizers, users must repeat what they said, since the audio data is not buffered on the device. Moreover, by setting the provider's properties and subscribing to its events, the values are propagated to all recognizers and synthesizers, so you do not need to configure them one by one.
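
A rough usage sketch follows. The `SpeechResultHandler` class and the button wiring are invented for illustration, and it is assumed that the provider exposes the same `StartRecordingAsync()` method as its recognizers; `StopRecordingAsync()`, `Language`, and `OnRecognitionResultReceived` are described elsewhere in this README.

```csharp
using UnityEngine;
using i5.Toolkit.Core.SpeechModule;

// Hypothetical consumer script, placed on the same GameObject as the SpeechProvider.
public class SpeechResultHandler : MonoBehaviour
{
    private SpeechProvider speechProvider;

    // Subscribe in Start(), not Awake(), so the provider has already initialized its recognizers and synthesizers.
    private void Start()
    {
        speechProvider = GetComponent<SpeechProvider>();
        speechProvider.Language = Language.en_US;                       // propagated to all recognizers and synthesizers
        speechProvider.OnRecognitionResultReceived += OnTextRecognized; // results arrive via this event
    }

    // Hook these two methods to UI buttons to start and stop recording.
    public async void StartListening()
    {
        await speechProvider.StartRecordingAsync();   // the return value is usually meaningless; rely on the event instead
    }

    public async void StopListening()
    {
        await speechProvider.StopRecordingAsync();
    }

    private void OnTextRecognized(RecognitionResult result)
    {
        if (result.State == ResultState.Succeeded)
        {
            Debug.Log($"Recognized: {result.Text}");
        }
    }
}
```
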

## Import the Libraries for Pre-implemented Recognizers and Synthesizers
To keep the package small and avoid forcing users to download external resources they will never use, the third-party libraries of the pre-implemented recognizers and synthesizers must be imported manually, and custom scripting symbols are defined for them. This way, developers who use other modules of the i5 Toolkit but not the speech module do not need to download those resources. Note that the scripts for the recognizers and synthesizers themselves are contained in the package.

To import the `AzureSpeechRecognizer/Synthesizer` and `NativeSpeechRecognizer/Synthesizer` introduced above, navigate to _i5 Toolkit - Import Speech Module_ in the menu bar at the top of the Unity Editor. When you click on a recognizer/synthesizer, an importer automatically downloads and imports all required resources and then sets the corresponding custom scripting symbol; the snippet after the following list shows how these symbols are used. The custom scripting symbols are:
- I5_TOOLKIT_USE_AZURE_SPEECH_RECOGNIZER
- I5_TOOLKIT_USE_AZURE_SPEECH_SYNTHESIZER
- I5_TOOLKIT_USE_NATIVE_SPEECH_RECOGNIZER
- I5_TOOLKIT_USE_NATIVE_SPEECH_SYNTHESIZER
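
For example, the `AzureSpeechRecognizer` included in this pull request wraps its SDK-dependent code in the corresponding symbol, so that the project still compiles when the Speech SDK has not been imported:

```csharp
// Only compiled once the importer has defined the symbol; without it, the SDK
// namespaces below would not exist and the project would fail to compile.
#if I5_TOOLKIT_USE_AZURE_SPEECH_RECOGNIZER
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
#endif
```
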

During the import process, the importer first downloads the required resources; a progress bar is displayed on top of the editor window, but you can keep working while the download runs. Afterwards, the importer imports the downloaded package (except for the Vosk neural-network models). No import pop-up window appears; the import happens automatically. The imported packages end up under the `Assets/SpeechSDK` and `Assets/i5 Toolkit for Unity Speech Module Plugin` folders for the Azure and native recognizers/synthesizers, respectively. After the import, the downloaded package file is deleted automatically.
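
The importer itself ships with the toolkit and is not reproduced here; the following editor-script sketch merely illustrates the mechanism described above (download, silent import, define symbol, cleanup). The menu path, URL, file paths, and class name are placeholders, not the actual implementation.

```csharp
#if UNITY_EDITOR
using System.IO;
using System.Net;
using UnityEditor;

// Illustrative sketch only: downloads a .unitypackage, imports it without a pop-up window,
// and registers a custom scripting symbol, similar to what the toolkit's importer does.
public static class SpeechModuleImportSketch
{
    [MenuItem("i5 Toolkit Sketch/Import Azure Speech Recognizer")]
    private static void ImportAzureRecognizer()
    {
        const string packageUrl = "https://example.com/SpeechSDK.unitypackage"; // placeholder URL
        const string localPath = "Temp/SpeechSDK.unitypackage";                 // placeholder path
        const string defineSymbol = "I5_TOOLKIT_USE_AZURE_SPEECH_RECOGNIZER";

        using (var client = new WebClient())
        {
            client.DownloadFile(packageUrl, localPath);  // download the package
        }

        AssetDatabase.ImportPackage(localPath, false);   // false = import without the interactive dialog

        // Append the scripting define symbol for the currently selected build target group.
        BuildTargetGroup group = EditorUserBuildSettings.selectedBuildTargetGroup;
        string defines = PlayerSettings.GetScriptingDefineSymbolsForGroup(group);
        if (!defines.Contains(defineSymbol))
        {
            PlayerSettings.SetScriptingDefineSymbolsForGroup(group, defines + ";" + defineSymbol);
        }

        File.Delete(localPath);                          // delete the downloaded package afterwards
    }
}
#endif
```
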

If you no longer want to use a specific recognizer/synthesizer, delete the corresponding custom scripting symbol manually after removing the third-party packages. You can find those symbols under _PlayerSettings - Other Settings - Scripting Define Symbols_.

## What You Should Notice
- Don't subscribe to the events or set the properties of `SpeechProvider` in `Awake()`, since it may not have initialized all recognizers and synthesizers yet due to the script execution order; subscribe in `Start()` instead.
- Although some recognizers don't require a manual stop, e.g. the `AzureSpeechRecognizer` in SingleShot mode, it is still good practice to add a stop button to the UI and call the `StopRecordingAsync()` method of the `SpeechProvider`. When you implement such a recognizer, you can simply leave the `StopRecordingAsync()` method empty.
- If you are quite sure about your use cases and only want to use one recognizer/synthesizer, you can also omit the `SpeechProvider` and interact directly with the recognizer/synthesizer.
- The `PrimaryAudioOutputForm` and `Language` properties of `SpeechProvider` may not influence all recognizers or synthesizers, because not all of them support these settings.
- Although the methods for recognizing and synthesizing do have return values, these are not guaranteed to be meaningful. In fact, they are meaningless in most cases and should only be used for `await`. Instead, subscribe to the `OnRecognitionResultReceived` and `OnSynthesisResultReceived` events to handle the results.
- The neural-network models for the `NativeSpeechRecognizer` must be stored in the `StreamingAssets` folder, because they are decompressed to the `PersistentDataPath` on first start; they therefore need to remain "as is" after the build and must not be compressed by Unity.
- The `NativeSpeechSynthesizer` only works with the `Mono` scripting backend and the `.NET 4.x` API compatibility level on Windows Standalone. For Android, the project must be built with API level 21 or higher.
- Although some third-party libraries contain plugins for other platforms, e.g. macOS or iOS, these plugins have been removed; you can still find them on the corresponding websites.

@@ -0,0 +1,171 @@
using UnityEngine;
using System.Threading.Tasks;
using System;

#if I5_TOOLKIT_USE_AZURE_SPEECH_RECOGNIZER
using Microsoft.CognitiveServices.Speech;
using Microsoft.CognitiveServices.Speech.Audio;
#endif

namespace i5.Toolkit.Core.SpeechModule
{
/// <summary>
/// A speech recognizer (Speech-To-Text) using the Azure Cognitive Services Speech SDK. See https://docs.microsoft.com/en-us/azure/cognitive-services/speech-service/index-speech-to-text
/// Needs a subscription key and a service region.
/// </summary>
public class AzureSpeechRecognizer : MonoBehaviour, ISpeechRecognizer
{
[Tooltip("You can find your subscription key on Azure Portal.")]
[SerializeField] private string subscriptionKey;
[Tooltip("You can find your service region on Azure Portal.")]
[SerializeField] private string serviceRegion;
[Tooltip("The Single Shot mode receives a silence as a stop symbol and only supports audio up to 15 seconds. The Continuous mode requires a manually stop.")]
[SerializeField] private AzureRecognitionMode mode;

#if I5_TOOLKIT_USE_AZURE_SPEECH_RECOGNIZER
private SpeechRecognizer speechRecognizer;
private SpeechConfig speechConfig;
#endif
void Start() {
#if I5_TOOLKIT_USE_AZURE_SPEECH_RECOGNIZER
speechConfig = SpeechConfig.FromSubscription(subscriptionKey, serviceRegion);
#else
Debug.LogError("The required Speech SDK for AzureSpeechRecognizer cannot be found, or the I5_TOOLKIT_USE_AZURE_SPEECH_RECOGNIZER directive is not defined on current platform.");
#endif
}

/// <summary>
/// Fires when the recognizer receives the result.
/// Please subscribe to the OnRecognitionResultReceived event and avoid using the return value, because the result is empty for a successful continuous recognition.
/// </summary>
public event Action<RecognitionResult> OnRecognitionResultReceived;

/// <summary>
/// The language to recognize. You may also add any other language you want.
/// </summary>
public Language Language { get; set; }

/// <summary>
/// Applicable if the component is enabled and there is an internet connection.
/// </summary>
public bool IsApplicable => enabled && Application.internetReachability != NetworkReachability.NotReachable;

/// <summary>
/// Start recording and recognizing according to the recognition mode.
/// Please subscribe to the OnRecognitionResultReceived event and avoid using the return value, because the result is empty for a successful continuous recognition.
/// Note that continuous recognition runs on another thread, so you cannot call Unity APIs directly in the OnRecognitionResultReceived event handler.
/// However, you can dispatch such calls to the main thread, e.g. via a Queue<Action> that is processed in Update().
/// </summary>
/// <returns>The result of the recognition.</returns>
public async Task<RecognitionResult> StartRecordingAsync() {
#if I5_TOOLKIT_USE_AZURE_SPEECH_RECOGNIZER
RecognitionResult result;
SourceLanguageConfig sourceLanguageConfig;
switch (Language) {
case Language.en_US:
sourceLanguageConfig = SourceLanguageConfig.FromLanguage("en-US");
break;
case Language.de_DE:
sourceLanguageConfig = SourceLanguageConfig.FromLanguage("de-DE");
break;
default:
sourceLanguageConfig = SourceLanguageConfig.FromLanguage("en-US");
break;
}
var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
speechRecognizer = new SpeechRecognizer(speechConfig, sourceLanguageConfig, audioConfig);
if (mode == AzureRecognitionMode.SingleShot) {
result = await StartSingleShotRecordingAsync();
}
else {
result = await StartContinuousRecordingAsync();
}
return result;
#else
await Task.Run(() => Debug.LogError("The required Speech SDK for AzureSpeechRecognizer cannot be found, or the I5_TOOLKIT_USE_AZURE_SPEECH_RECOGNIZER directive is not defined on current platform."));
return RecognitionResult.RequiredModulesNotFoundResult;
#endif
}

/// <summary>
/// Stops recording. Only used for continuous recognition.
/// </summary>
public async Task StopRecordingAsync() {
#if I5_TOOLKIT_USE_AZURE_SPEECH_RECOGNIZER
if (mode == AzureRecognitionMode.Continuous) {
await speechRecognizer.StopContinuousRecognitionAsync();
}
#else
await Task.Run(() => Debug.LogError("The required Speech SDK for AzureSpeechRecognizer cannot be found, or the I5_TOOLKIT_USE_AZURE_SPEECH_RECOGNIZER directive is not defined on current platform."));
#endif
}

#if I5_TOOLKIT_USE_AZURE_SPEECH_RECOGNIZER
private async Task<RecognitionResult> StartSingleShotRecordingAsync() {
Debug.Log("Speak into your microphone.");
var speechRecognitionResult = await speechRecognizer.RecognizeOnceAsync();
RecognitionResult result = ParseAzureRecognitionResult(speechRecognitionResult);
OnRecognitionResultReceived?.Invoke(result);
Debug.Log("Recognition Stopped.");
return result;
}

private async Task<RecognitionResult> StartContinuousRecordingAsync() {
Debug.Log("Speak into your microphone. Stop recording when finished.");
RecognitionResult result = new RecognitionResult();
var stopRecognition = new TaskCompletionSource<int>();
speechRecognizer.Recognizing += (s, e) => Debug.Log($"RECOGNIZING: Text={e.Result.Text}");

speechRecognizer.Recognized += (s, e) => OnRecognitionResultReceived?.Invoke(ParseAzureRecognitionResult(e.Result));

speechRecognizer.Canceled += (s, e) =>
{
stopRecognition.TrySetResult(0);
result = ParseAzureRecognitionResult(e.Result);
};

speechRecognizer.SessionStopped += (s, e) =>
{
Debug.Log("Recognition Stopped");
stopRecognition.TrySetResult(0);
};
await speechRecognizer.StartContinuousRecognitionAsync();
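// Recognition keeps running on a background thread after this call returns; the result
// returned below stays empty, and recognized text is delivered through the
// OnRecognitionResultReceived event until StopRecordingAsync() is called.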
return result;

}

// Parses the Azure SpeechRecognitionResult into our RecognitionResult.
private RecognitionResult ParseAzureRecognitionResult(SpeechRecognitionResult speechRecognitionResult) {
RecognitionResult result = new RecognitionResult();
switch (speechRecognitionResult.Reason) {
case ResultReason.RecognizedSpeech:
result.State = ResultState.Succeeded;
result.Text = speechRecognitionResult.Text;
result.Message = "Recognition Succeeded." + $" Text: {result.Text}";
break;
case ResultReason.NoMatch:
result.State = ResultState.NoMatch;
result.Message = "No Match: Speech could not be recognized.";
break;
case ResultReason.Canceled:
var cancellation = CancellationDetails.FromResult(speechRecognitionResult);
result.State = ResultState.Failed;
result.Message = $"Failed: Reason: {cancellation.Reason}";
if (cancellation.Reason == CancellationReason.Error) {
result.Message += $" AzureErrorCode={cancellation.ErrorCode}.\nDid you set the speech resource key and region values?";
}
break;
default:
result.Message = result.Text;
break;
}
Debug.Log(result.Message);
return result;
}
#endif

private enum AzureRecognitionMode
{
SingleShot,
Continuous
}
}
}
