Background
Many business scenarios in the company use speech recognition. Our voice team independently developed a speech recognition model, with a solution based on interaction between a cloud model and a device-side SDK: the device side was responsible for audio capture, Voice Activity Detection (VAD), Opus encoding, and real-time transmission to the cloud, and the cloud returned the recognition results after recognizing the speech. While adapting these business scenarios to HarmonyOS, we discovered that HarmonyOS native intelligence provides a local speech recognition SDK, which we can encapsulate for our own use.
Scenario Introduction
The native voice recognition capability supports two modes:
- Short voice mode (not exceeding 60 seconds)
- Long voice mode (not exceeding 8 hours)
API Interface Introduction
1. Engine Initialization
speechRecognizer.createEngine
import { speechRecognizer } from '@kit.CoreSpeechKit';
import { BusinessError } from '@kit.BasicServicesKit';

let asrEngine: speechRecognizer.SpeechRecognitionEngine;
// Set the parameters for creating the engine
let extraParam: Record<string, Object> = { "locate": "CN", "recognizerMode": "short" };
let initParamsInfo: speechRecognizer.CreateEngineParams = {
  language: 'zh-CN',
  online: 1,
  extraParams: extraParam
};
// Create the engine; the result is returned through a callback
speechRecognizer.createEngine(initParamsInfo, (err: BusinessError, speechRecognitionEngine: speechRecognizer.SpeechRecognitionEngine) => {
  if (!err) {
    console.info('Succeeded in creating engine.');
    // Receive the instance of the created engine
    asrEngine = speechRecognitionEngine;
  } else {
    // For example, error code 1002200008 means the engine is being destroyed
    console.error(`Failed to create engine. Code: ${err.code}, message: ${err.message}.`);
  }
});
The main requirement is to construct the engine parameters `speechRecognizer.CreateEngineParams`:
- `language`: Language, e.g. 'zh-CN'.
- `online`: Mode. 1 means offline; currently only the offline engine is supported.
- `extraParams`: Region information, etc.
  - `locate`: Region, optional. Defaults to "CN" if not set; currently only "CN" is supported.
  - `recognizerMode`: Recognition mode, "short" for short voice and "long" for long voice.

In the callback, error information can be viewed:
- Error code 1002200001: engine creation failed because of an unsupported language, an unsupported mode, an initialization timeout, or missing resources.
- Error code 1002200006: the engine is busy, usually triggered when multiple applications call the speech recognition engine at the same time.
- Error code 1002200008: the engine is being destroyed.
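For readable error handling, the codes above can be collected into a small lookup table. This is a plain TypeScript sketch built only from the codes and reasons listed above; the helper itself is illustrative, not part of the SDK:

```typescript
// Map of createEngine error codes to the reasons described above (illustrative helper).
const CREATE_ENGINE_ERRORS: Record<number, string> = {
  1002200001: 'Creation failed: unsupported language or mode, initialization timeout, or missing resources.',
  1002200006: 'Engine busy: another application is currently using the speech recognition engine.',
  1002200008: 'The engine is being destroyed.'
};

// Return a readable reason for a createEngine failure, or a generic fallback.
function describeCreateEngineError(code: number): string {
  return CREATE_ENGINE_ERRORS[code] ?? `Unknown engine error (code ${code}).`;
}
```

Such a helper keeps the error branch of the `createEngine` callback to a single log line.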
2. Setting the RecognitionListener Callback
The callback mainly handles events during the recognition process. The most important one is `onResult`, which delivers the recognition content. Different conversations correspond to different `sessionId`s:
// Create a callback object
let setListener: speechRecognizer.RecognitionListener = {
  // Callback for successful start of recognition
  onStart(sessionId: string, eventMessage: string) {
  },
  // Event callback
  onEvent(sessionId: string, eventCode: number, eventMessage: string) {
  },
  // Recognition result callback, including intermediate and final results
  onResult(sessionId: string, result: speechRecognizer.SpeechRecognitionResult) {
  },
  // Recognition completion callback
  onComplete(sessionId: string, eventMessage: string) {
  },
  // Error callback. Error codes are returned through this method.
  // For example, error code 1002200006 indicates that the engine is busy and is currently recognizing.
  onError(sessionId: string, errorCode: number, errorMessage: string) {
  }
}
// Set the callback
asrEngine.setListener(setListener);
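Since results are delivered per session, one simple pattern is to keep the latest text per `sessionId` inside `onResult`. The accumulator below is a plain TypeScript sketch, not an SDK type:

```typescript
// Illustrative accumulator: keeps the latest recognized text per session.
class SessionResults {
  private results = new Map<string, string>();

  // Called from onResult: store the newest (intermediate or final) text for this session.
  update(sessionId: string, text: string): void {
    this.results.set(sessionId, text);
  }

  // Read back the latest text for a session, or '' if nothing has arrived yet.
  latest(sessionId: string): string {
    return this.results.get(sessionId) ?? '';
  }
}
```

Because intermediate results replace each other, storing only the newest string per session is usually enough for UI display.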
3. Starting Recognition
let sessionId: string = '123456'; // Example session ID; must match the sessionId seen in the callbacks
let audioParam: speechRecognizer.AudioInfo = { audioType: 'pcm', sampleRate: 16000, soundChannel: 1, sampleBit: 16 };
let extraParam: Record<string, Object> = { "vadBegin": 2000, "vadEnd": 3000, "maxAudioDuration": 40000 };
let recognizerParams: speechRecognizer.StartParams = {
  sessionId: sessionId,
  audioInfo: audioParam,
  extraParams: extraParam
};
// Call the method to start listening
asrEngine.startListening(recognizerParams);
The main task is to set the relevant parameters for starting recognition:
- `sessionId`: Session ID, which corresponds to the `sessionId` in the `onResult` callback.
- `audioInfo`: Audio configuration information, optional.
  - `audioType`: Currently only 'pcm' is supported. To recognize MP3 files and the like, decode them first and pass the PCM data to the engine.
  - `sampleRate`: Sampling rate of the audio. Currently only 16000 is supported.
  - `sampleBit`: Sampling bit depth of the audio. Currently only 16 bits is supported.
  - `soundChannel`: Channel count of the audio. Currently only 1 channel is supported.
  - `extraParams`: Compression ratio of the audio. For PCM-format audio, the default is 0.
- `extraParams`: Additional configuration information, mainly including:
  - `recognitionMode`: Real-time speech recognition mode (defaults to 1 if not passed).
    - 0: Real-time recording recognition (the application needs the recording permission `ohos.permission.MICROPHONE`). To end the recording, call the `finish` method.
    - 1: Real-time audio-to-text recognition. In this mode, you must additionally call the `writeAudio` method and pass in the audio stream to be recognized.
  - `vadBegin`: Voice Activity Detection (VAD) front endpoint. The range is [500, 10000] ms; the default is 10000 ms if not passed.
  - `vadEnd`: Voice Activity Detection (VAD) back endpoint. The range is [500, 10000] ms; the default is 800 ms if not passed.
  - `maxAudioDuration`: Maximum supported audio duration.
    - In short voice mode, the supported range is [20000, 60000] ms, and the default is 20000 ms if not passed.
    - In long voice mode, the supported range is [20000, 8 * 60 * 60 * 1000] ms.

The role of VAD is voice activity detection: silent data is not recognized.
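The ranges above can be checked before calling `startListening`. The sketch below is a plain TypeScript helper built only from the limits listed above; it is illustrative and not an SDK API:

```typescript
// Validate the documented extraParams ranges before startListening (illustrative helper).
function validateStartParams(
  mode: 'short' | 'long',
  vadBegin: number,
  vadEnd: number,
  maxAudioDuration: number
): string[] {
  const errors: string[] = [];
  if (vadBegin < 500 || vadBegin > 10000) errors.push('vadBegin must be within [500, 10000] ms');
  if (vadEnd < 500 || vadEnd > 10000) errors.push('vadEnd must be within [500, 10000] ms');
  // Short voice mode caps out at 60 s; long voice mode at 8 hours.
  const maxDuration = mode === 'short' ? 60000 : 8 * 60 * 60 * 1000;
  if (maxAudioDuration < 20000 || maxAudioDuration > maxDuration) {
    errors.push(`maxAudioDuration must be within [20000, ${maxDuration}] ms in ${mode} mode`);
  }
  return errors;
}
```

Calling it with the sample values from the code above (`'short', 2000, 3000, 40000`) returns an empty array, i.e. no violations.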
4. Passing in the Audio Stream
asrEngine.writeAudio(sessionId, uint8Array);
Write audio data to the engine. The audio stream can come from the microphone or from audio files.
Note: each write only supports an audio chunk length of 640 or 1280 bytes.
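Because `writeAudio` only accepts 640- or 1280-byte chunks, an arbitrary PCM buffer has to be split before writing. The helper below is a plain TypeScript sketch (the function name and shape are illustrative):

```typescript
// Split a PCM byte buffer into fixed-size chunks suitable for writeAudio (illustrative).
// Bytes that do not fill a whole chunk are returned as `rest` and should be
// prepended to the next buffer read from the microphone or file.
function chunkAudio(data: Uint8Array, chunkSize: 640 | 1280 = 640): { chunks: Uint8Array[]; rest: Uint8Array } {
  const chunks: Uint8Array[] = [];
  let offset = 0;
  while (offset + chunkSize <= data.length) {
    chunks.push(data.subarray(offset, offset + chunkSize));
    offset += chunkSize;
  }
  return { chunks, rest: data.subarray(offset) };
}
```

Each full chunk can then be passed to `asrEngine.writeAudio(sessionId, chunk)` in order.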
5. Other Interfaces
- `listLanguages`: Query the languages supported by the speech recognition service.
- `finish`: End the recognition.
- `cancel`: Cancel the recognition.
- `shutdown`: Release the resources of the recognition engine.
Best Practice
In real-time recognition scenarios, audio must be read from the microphone in real time, written into the `asrEngine`, and the recognition results obtained in the `onResult` callback.
Configure the audio capture parameters and create an `AudioCapturer` instance:
import { audio } from '@kit.AudioKit';

let audioStreamInfo: audio.AudioStreamInfo = {
  samplingRate: audio.AudioSamplingRate.SAMPLE_RATE_16000, // Sampling rate
  channels: audio.AudioChannel.CHANNEL_1, // Channel
  sampleFormat: audio.AudioSampleFormat.SAMPLE_FORMAT_S16LE, // Sampling format
  encodingType: audio.AudioEncodingType.ENCODING_TYPE_RAW // Encoding format
};
let audioCapturerInfo: audio.AudioCapturerInfo = {
  source: audio.SourceType.SOURCE_TYPE_MIC,
  capturerFlags: 0
};
let audioCapturerOptions: audio.AudioCapturerOptions = {
  streamInfo: audioStreamInfo,
  capturerInfo: audioCapturerInfo
};
audio.createAudioCapturer(audioCapturerOptions, (err, data) => {
  if (err) {
    console.error(`Invoke createAudioCapturer failed, code is ${err.code}, message is ${err.message}`);
  } else {
    console.info('Invoke createAudioCapturer succeeded.');
    let audioCapturer = data;
  }
});
Note that the sampling rate, channel count, and bit depth must meet the ASR engine's requirements: 16 kHz sampling rate, mono channel, and 16-bit samples.
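As a sanity check, the 640-byte write size corresponds to exactly 20 ms of audio at these capture settings (16000 Hz, mono, 16-bit, i.e. 2 bytes per sample). The small calculation below is illustrative:

```typescript
// Bytes produced by `ms` milliseconds of PCM audio at the given settings (illustrative).
function pcmBytes(ms: number, sampleRate = 16000, channels = 1, bytesPerSample = 2): number {
  return (sampleRate * channels * bytesPerSample * ms) / 1000;
}
// 20 ms at 16 kHz mono 16-bit → 640 bytes; 40 ms → 1280 bytes.
```

So the 640/1280-byte limits of `writeAudio` simply correspond to 20 ms or 40 ms frames of captured audio.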
Then call the `on('readData')` method to subscribe to the callback for reading audio data:
// Write each buffer captured by the microphone to the ASR engine
let readDataCallback = (buffer: ArrayBuffer) => {
  asrEngine.writeAudio(sessionId, new Uint8Array(buffer));
};
audioCapturer.on('readData', readDataCallback);
Note the size of each buffer written here: the ASR engine only accepts 640 or 1280 bytes per write.
Summary
This article introduced the speech recognition capability officially provided by HarmonyOS, explained the ASR engine interfaces in detail, and finally implemented real-time speech recognition based on microphone data collection.