Background
Many business scenarios in the company use speech recognition. Our voice team independently developed a speech recognition model, with a solution based on interaction between a cloud model and a device-side SDK: the device side was responsible for audio capture, Voice Activity Detection (VAD), Opus encoding, and real-time transmission to the cloud, and the cloud returned the recognition results after recognizing the speech. While adapting these business scenarios to HarmonyOS, we discovered that HarmonyOS native intelligence provides a local speech recognition SDK, which we can encapsulate for our own use.
Scenario Introduction
The native voice recognition capability supports two modes:
- Short voice mode (not exceeding 60 seconds)
- Long voice mode (not exceeding 8 hours)
API Interface Introduction
1. Engine Initialization
speechRecognizer.createEngine
import { speechRecognizer } from '@kit.CoreSpeechKit';
import { BusinessError } from '@kit.BasicServicesKit';

let asrEngine: speechRecognizer.SpeechRecognitionEngine;
// Set the parameters for creating the engine
let extraParam: Record<string, Object> = { "locate": "CN", "recognizerMode": "short" };
let initParamsInfo: speechRecognizer.CreateEngineParams = {
  language: 'zh-CN',
  online: 1,
  extraParams: extraParam
};
// Create the engine; the result is returned through a callback
speechRecognizer.createEngine(initParamsInfo, (err: BusinessError, speechRecognitionEngine: speechRecognizer.SpeechRecognitionEngine) => {
  if (!err) {
    console.info('Succeeded in creating engine.');
    // Receive the instance of the created engine
    asrEngine = speechRecognitionEngine;
  } else {
    // For example, error code 1002200008 means the engine is being destroyed
    console.error(`Failed to create engine. Code: ${err.code}, message: ${err.message}.`);
  }
});
The main requirement is to construct the engine parameters `speechRecognizer.CreateEngineParams`:
- `language`: Language, e.g. 'zh-CN'.
- `online`: Mode. 1 means offline; currently only the offline engine is supported.
- `extraParams`: Region information, etc.
  - `locate`: Region, optional. Defaults to "CN" if not set; currently only "CN" is supported.
  - `recognizerMode`: Recognition mode, "short" for short voice and "long" for long voice.

In the callback, error information can be viewed:
- Error code 1002200001: engine creation failed because of an unsupported language, an unsupported mode, an initialization timeout, or missing resources.
- Error code 1002200006: the engine is busy, usually triggered when multiple applications call the speech recognition engine at the same time.
- Error code 1002200008: the engine is being destroyed.
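For readable error handling, the codes above can be collected into a small lookup table. This is a plain TypeScript sketch built only from the codes and reasons listed above; the helper itself is illustrative, not part of the SDK:

```typescript
// Map of createEngine error codes to the reasons described above (illustrative helper).
const CREATE_ENGINE_ERRORS: Record<number, string> = {
  1002200001: 'Creation failed: unsupported language or mode, initialization timeout, or missing resources.',
  1002200006: 'Engine busy: another application is currently using the speech recognition engine.',
  1002200008: 'The engine is being destroyed.'
};

// Return a readable reason for a createEngine failure, or a generic fallback.
function describeCreateEngineError(code: number): string {
  return CREATE_ENGINE_ERRORS[code] ?? `Unknown engine error (code ${code}).`;
}
```

Such a helper keeps the error branch of the `createEngine` callback to a single log line.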
2. Setting the RecognitionListener Callback
The callback mainly handles events during the recognition process. The most important one is `onResult`, which delivers the recognition content. Different conversations correspond to different `sessionId`s:
// Create a callback object
let setListener: speechRecognizer.RecognitionListener = {
  // Callback for successful start of recognition
  onStart(sessionId: string, eventMessage: string) {
  },
  // Event callback
  onEvent(sessionId: string, eventCode: number, eventMessage: string) {
  },
  // Recognition result callback, including intermediate and final results
  onResult(sessionId: string, result: speechRecognizer.SpeechRecognitionResult) {
  },
  // Recognition completion callback
  onComplete(sessionId: string, eventMessage: string) {
  },
  // Error callback. Error codes are returned through this method.
  // For example, error code 1002200006 indicates that the engine is busy and is currently recognizing.
  onError(sessionId: string, errorCode: number, errorMessage: string) {
  }
}
// Set the callback
asrEngine.setListener(setListener);
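Since results are delivered per session, one simple pattern is to keep the latest text per `sessionId` inside `onResult`. The accumulator below is a plain TypeScript sketch, not an SDK type:

```typescript
// Illustrative accumulator: keeps the latest recognized text per session.
class SessionResults {
  private results = new Map<string, string>();

  // Called from onResult: store the newest (intermediate or final) text for this session.
  update(sessionId: string, text: string): void {
    this.results.set(sessionId, text);
  }

  // Read back the latest text for a session, or '' if nothing has arrived yet.
  latest(sessionId: string): string {
    return this.results.get(sessionId) ?? '';
  }
}
```

Because intermediate results replace each other, storing only the newest string per session is usually enough for UI display.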
3. Starting Recognition
let sessionId: string = '123456'; // Example session ID; must match the sessionId seen in the callbacks
let audioParam: speechRecognizer.AudioInfo = { audioType: 'pcm', sampleRate: 16000, soundChannel: 1, sampleBit: 16 };
let extraParam: Record<string, Object> = { "vadBegin": 2000, "vadEnd": 3000, "maxAudioDuration": 40000 };
let recognizerParams: speechRecognizer.StartParams = {
  sessionId: sessionId,
  audioInfo: audioParam,
  extraParams: extraParam
};
// Call the method to start listening
asrEngine.startListening(recognizerParams);
The main task is to set the relevant parameters for starting recognition:
- `sessionId`: Session ID, which corresponds to the `sessionId` in the `onResult` callback.
- `audioInfo`: Audio configuration information, optional.
  - `audioType`: Currently only 'pcm' is supported. To recognize MP3 files and the like, decode them first and pass the PCM data to the engine.
  - `sampleRate`: Sampling rate of the audio. Currently only 16000 is supported.
  - `sampleBit`: Sampling bit depth of the audio. Currently only 16 bits is supported.
  - `soundChannel`: Channel count of the audio. Currently only 1 channel is supported.
  - `extraParams`: Compression ratio of the audio. For PCM-format audio, the default is 0.
- `extraParams`: Additional configuration information, mainly including:
  - `recognitionMode`: Real-time speech recognition mode (defaults to 1 if not passed).
    - 0: Real-time recording recognition (the application needs the recording permission `ohos.permission.MICROPHONE`). To end the recording, call the `finish` method.
    - 1: Real-time audio-to-text recognition. In this mode, you must additionally call the `writeAudio` method and pass in the audio stream to be recognized.
  - `vadBegin`: Voice Activity Detection (VAD) front endpoint. The range is [500, 10000] ms; the default is 10000 ms if not passed.
  - `vadEnd`: Voice Activity Detection (VAD) back endpoint. The range is [500, 10000] ms; the default is 800 ms if not passed.
  - `maxAudioDuration`: Maximum supported audio duration.
    - In short voice mode, the supported range is [20000, 60000] ms, and the default is 20000 ms if not passed.
    - In long voice mode, the supported range is [20000, 8 * 60 * 60 * 1000] ms.

The role of VAD is voice activity detection: silent data is not recognized.
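The ranges above can be checked before calling `startListening`. The sketch below is a plain TypeScript helper built only from the limits listed above; it is illustrative and not an SDK API:

```typescript
// Validate the documented extraParams ranges before startListening (illustrative helper).
function validateStartParams(
  mode: 'short' | 'long',
  vadBegin: number,
  vadEnd: number,
  maxAudioDuration: number
): string[] {
  const errors: string[] = [];
  if (vadBegin < 500 || vadBegin > 10000) errors.push('vadBegin must be within [500, 10000] ms');
  if (vadEnd < 500 || vadEnd > 10000) errors.push('vadEnd must be within [500, 10000] ms');
  // Short voice mode caps out at 60 s; long voice mode at 8 hours.
  const maxDuration = mode === 'short' ? 60000 : 8 * 60 * 60 * 1000;
  if (maxAudioDuration < 20000 || maxAudioDuration > maxDuration) {
    errors.push(`maxAudioDuration must be within [20000, ${maxDuration}] ms in ${mode} mode`);
  }
  return errors;
}
```

Calling it with the sample values from the code above (`'short', 2000, 3000, 40000`) returns an empty array, i.e. no violations.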
4. Passing in the Audio Stream
asrEngine.writeAudio(sessionId, uint8Array);
Write audio data to the engine. The audio stream can come from the microphone or from audio files.
Note: each write only supports an audio chunk length of 640 or 1280 bytes.
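Because `writeAudio` only accepts 640- or 1280-byte chunks, an arbitrary PCM buffer has to be split before writing. The helper below is a plain TypeScript sketch (the function name and shape are illustrative):

```typescript
// Split a PCM byte buffer into fixed-size chunks suitable for writeAudio (illustrative).
// Bytes that do not fill a whole chunk are returned as `rest` and should be
// prepended to the next buffer read from the microphone or file.
function chunkAudio(data: Uint8Array, chunkSize: 640 | 1280 = 640): { chunks: Uint8Array[]; rest: Uint8Array } {
  const chunks: Uint8Array[] = [];
  let offset = 0;
  while (offset + chunkSize <= data.length) {
    chunks.push(data.subarray(offset, offset + chunkSize));
    offset += chunkSize;
  }
  return { chunks, rest: data.subarray(offset) };
}
```

Each full chunk can then be passed to `asrEngine.writeAudio(sessionId, chunk)` in order.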
5. Other Interfaces
- `listLanguages`: Query the languages supported by the speech recognition service.
- `finish`: End the recognition.
- `cancel`: Cancel the recognition.
- `shutdown`: Release the resources of the recognition engine.
Best Practice
In real-time recognition scenarios, audio must be read from the microphone in real time, written into the `asrEngine`, and the recognition results obtained in the `onResult` callback.
Configure the audio capture parameters and create an `AudioCapturer` instance:
import { audio } from '@kit.AudioKit';

let audioStreamInfo: audio.AudioStreamInfo = {
  samplingRate: audio.AudioSamplingRate.SAMPLE_RATE_16000, // Sampling rate
  channels: audio.AudioChannel.CHANNEL_1, // Channel
  sampleFormat: audio.AudioSampleFormat.SAMPLE_FORMAT_S16LE, // Sampling format
  encodingType: audio.AudioEncodingType.ENCODING_TYPE_RAW // Encoding format
};
let audioCapturerInfo: audio.AudioCapturerInfo = {
  source: audio.SourceType.SOURCE_TYPE_MIC,
  capturerFlags: 0
};
let audioCapturerOptions: audio.AudioCapturerOptions = {
  streamInfo: audioStreamInfo,
  capturerInfo: audioCapturerInfo
};
audio.createAudioCapturer(audioCapturerOptions, (err, data) => {
  if (err) {
    console.error(`Invoke createAudioCapturer failed, code is ${err.code}, message is ${err.message}`);
  } else {
    console.info('Invoke createAudioCapturer succeeded.');
    let audioCapturer = data;
  }
});
Note that the sampling rate, channel count, and bit depth must meet the ASR engine's requirements: 16 kHz sampling rate, mono channel, and 16-bit samples.
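As a sanity check, the 640-byte write size corresponds to exactly 20 ms of audio at these capture settings (16000 Hz, mono, 16-bit, i.e. 2 bytes per sample). The small calculation below is illustrative:

```typescript
// Bytes produced by `ms` milliseconds of PCM audio at the given settings (illustrative).
function pcmBytes(ms: number, sampleRate = 16000, channels = 1, bytesPerSample = 2): number {
  return (sampleRate * channels * bytesPerSample * ms) / 1000;
}
// 20 ms at 16 kHz mono 16-bit → 640 bytes; 40 ms → 1280 bytes.
```

So the 640/1280-byte limits of `writeAudio` simply correspond to 20 ms or 40 ms frames of captured audio.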
Then call the `on('readData')` method to subscribe to the callback for reading audio data:
// Write each buffer captured by the microphone to the ASR engine
let readDataCallback = (buffer: ArrayBuffer) => {
  asrEngine.writeAudio(sessionId, new Uint8Array(buffer));
};
audioCapturer.on('readData', readDataCallback);
Note the size of each buffer written here: the ASR engine only accepts 640 or 1280 bytes per write.
Summary
This article introduced the speech recognition capability officially provided by HarmonyOS, explained the ASR engine interfaces in detail, and finally implemented real-time speech recognition based on microphone data collection.