Practical Application of AI Image Recognition and Speech Recognition in the Smart Photo Album Application of HarmonyOS Next
This article explores the practical application of AI image recognition and speech recognition technologies in building a smart photo album application on the Huawei HarmonyOS Next system (up to API 12 at the time of writing), and summarizes lessons from hands-on development. It is mainly intended as a vehicle for technical sharing and communication; there may be mistakes and omissions, and colleagues are welcome to raise valuable opinions and questions so that we can improve together. This article is original content, and any form of reprint must credit the source and the original author.
I. Requirements and Architecture Design of the Smart Photo Album Application
(1) In-depth Analysis of Functional Requirements
- Requirements for Image Classification: Image classification is a core function of the smart photo album, intended to help users manage and browse photos more conveniently. Users typically accumulate large numbers of photos, and classifying them manually is time-consuming and tedious. AI image recognition can automatically identify key information in photos, such as scenes, people, and objects, and classify the photos accordingly: photos containing people go into a "People" folder, landscape photos into a "Landscape" folder, photos containing pets into a "Pets" folder, and so on. Users can then quickly find specific types of photos, improving the efficiency of album management and browsing. For special events or themes, such as wedding or travel photos, finer-grained classification can also be applied through AI image recognition, making it easy for users to revisit and share these precious moments.
- Requirements for Voice Search in the Photo Album: Voice search gives users a more convenient way to find photos in the album. Instead of typing keywords, users simply say what they are looking for, such as "Find the photos taken at the beach last summer" or "Show the group photos of me and my family." The system understands the user's intent through speech recognition and then uses AI image recognition to search the album for matching photos and display them. This is especially useful when the user's hands are busy or manual operation of the phone is inconvenient, such as while driving or cooking, greatly improving the convenience and efficiency of album search.
(2) Architecture Design Based on HarmonyOS Next
- Considerations for Data Storage Design
- Photo Storage Structure: To manage photos efficiently, designing a reasonable photo storage structure is crucial. A hierarchical storage method can be adopted to classify and store photos according to dimensions such as shooting time, location, and people. For example, create a structure in the file system with the year as the top-level directory and the month as the subdirectory, and classify and store photos according to the shooting time. At the same time, for photos of people, a storage structure with the person's name as the directory can be created according to the result of person recognition, which is convenient for quickly finding photos of a specific person. In addition, metadata can be added to the photos, such as shooting equipment, shooting parameters, photo descriptions, and other information, which is convenient for subsequent search and management.
- Index Creation: Establishing an effective index is the key to improving the efficiency of photo album search and management. Use AI image recognition technology to extract the feature information of photos, such as scene features, person features, etc., and create indexes based on these features. For example, for landscape photos, features such as color, texture, and landform can be extracted to create indexes; for photos of people, features such as facial features and identity information of people can be extracted to create indexes. At the same time, combined with the keywords of speech recognition, create text indexes and associate the keywords in the voice commands with the metadata and recognition results of the photos. In this way, when searching the photo album, photos that meet the conditions can be quickly located according to the indexes, improving the search speed.
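Based on the storage and index design above, a minimal sketch of the photo metadata and index structures might look as follows. The `PhotoMetadata` interface, the tag-based inverted index, and the `indexPhoto` helper are hypothetical names introduced here for illustration; they are not part of any HarmonyOS API.

```typescript
// Hypothetical metadata record attached to each photo.
interface PhotoMetadata {
  path: string;          // e.g. 'Photos/2025/07/IMG_0001.jpg' (year/month directory layout)
  takenAt: Date;         // shooting time
  location?: string;     // optional shooting location
  people: string[];      // person names from people recognition
  sceneTags: string[];   // scene labels from AI image recognition, e.g. 'beach', 'mountain'
  description?: string;  // free-form user description
}

// Simple inverted index: tag or keyword -> paths of photos carrying that tag.
const photoIndex = new Map<string, Set<string>>();

// Add one photo's metadata to the inverted index so that voice-search
// keywords can later be matched against scene tags, people, and locations.
function indexPhoto(meta: PhotoMetadata): void {
  const keys = [...meta.sceneTags, ...meta.people];
  if (meta.location) {
    keys.push(meta.location);
  }
  for (const key of keys) {
    const bucket = photoIndex.get(key) ?? new Set<string>();
    bucket.add(meta.path);
    photoIndex.set(key, bucket);
  }
}
```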
- Architecture Planning of Functional Modules
- Image Recognition Module: This module is responsible for implementing the AI image recognition function, using the AI image recognition capabilities of HarmonyOS Next to analyze and process the photos in the photo album. It includes an image scene recognition sub-module, which is used to recognize the scene types in the photos, such as beaches, mountains, urban streets, etc.; a subject segmentation sub-module, which separates the main subject objects (such as people, animals, objects, etc.) in the photos from the background, facilitating subsequent editing and classification; a feature extraction sub-module, which extracts the key features of the photos for creating indexes and classification. Through the collaborative work of these sub-modules, the intelligent recognition and analysis of photos are realized.
- Speech Recognition Module: Focuses on implementing the speech recognition function. It receives the user's voice commands through Core Speech Kit and converts the voice signals into text. It then performs semantic understanding and analysis of the text and extracts key information such as search keywords and operation commands. For example, when the user says "Find my pet photos," the speech recognition module recognizes "pet" as the search keyword and passes it to the photo album search module for subsequent processing.
- User Interaction Module: Responsible for interacting with the user and providing a friendly user interface and operation experience. It includes a photo album display interface, which displays the classification of photos, search results, etc. in an intuitive way; a voice input interface, which is convenient for users to input voice commands; an operation feedback interface, which promptly feeds back the operation results and prompt information of the system to the user. At the same time, the user interaction module is also responsible for handling the user's gesture operations, such as swiping to browse photos, clicking to view details, long-pressing for editing, etc., to achieve multimodal user interaction.
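To make the division of responsibilities concrete, the three modules above could be expressed as interfaces along the following lines. This is a minimal sketch under the assumptions of this article; the interface and method names are illustrative, not an actual HarmonyOS API.

```typescript
// Hypothetical module contracts for the smart photo album.
interface ImageRecognitionModule {
  recognizeScene(photoPath: string): Promise<string[]>;        // e.g. ['beach', 'sunset']
  segmentSubjects(photoPath: string): Promise<string[]>;       // main subjects in the photo
  extractFeatures(photoPath: string): Promise<Float32Array>;   // feature vector for indexing
}

interface SpeechRecognitionModule {
  // Converts a voice command into text and extracts search keywords.
  recognizeCommand(): Promise<{ text: string; keywords: string[] }>;
}

interface UserInteractionModule {
  showPhotos(photoPaths: string[]): void;   // album display / search results
  showMessage(message: string): void;       // operation feedback and prompts
}
```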
(3) Technical Integration to Enhance the User Experience
In the system architecture, AI image recognition and speech recognition technologies are integrated in the following ways to provide a convenient photo album management and browsing experience.
When the user opens the smart photo album application, the AI image recognition module automatically scans and recognizes the photos in the album, extracts feature information, and creates indexes. At the same time, the user can issue voice commands through the speech recognition module. After the voice is converted into text, it is associated with the recognition results and indexes produced by the AI image recognition module. For example, if the user says "Show the most recent landscape photos," the speech recognition module passes "recent" and "landscape" as keywords to the photo album search module, which looks up matching landscape photos in the indexes and displays them through the user interaction module. While the user browses photos, the AI image recognition module can recognize people, scenes, and other information in real time and show relevant tags and prompts on the interface, helping the user understand the photo content. Editing and sharing can also be completed through voice commands or gestures, such as saying "Crop this photo" or long-pressing a photo to select the editing option. In this way, AI image recognition and speech recognition are deeply integrated to provide a more intelligent and convenient album management and browsing experience.
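The end-to-end voice search flow described above can be sketched roughly as follows, reusing the hypothetical `photoIndex` and module interfaces from the earlier sketches; the function names here are illustrative assumptions rather than real APIs.

```typescript
// Hypothetical orchestration of a voice-driven photo search.
async function handleVoiceSearch(
  speech: SpeechRecognitionModule,
  ui: UserInteractionModule
): Promise<void> {
  // 1. Convert the voice command into text and keywords, e.g. ['recent', 'landscape'].
  const { keywords } = await speech.recognizeCommand();

  // 2. Look up each keyword in the inverted index and intersect the results.
  let matches: Set<string> | undefined;
  for (const keyword of keywords) {
    const bucket = photoIndex.get(keyword) ?? new Set<string>();
    matches = matches
      ? new Set([...matches].filter((path) => bucket.has(path)))
      : bucket;
  }

  // 3. Display the matching photos, or a prompt if nothing was found.
  const result = [...(matches ?? [])];
  if (result.length > 0) {
    ui.showPhotos(result);
  } else {
    ui.showMessage('No photos matching the command were found.');
  }
}
```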
II. Implementation of Core Functions and Technical Integration
(1) Implementation and Optimization of the AI Image Recognition Function
- Implementation Process Using the Capabilities of HarmonyOS Next: Although the official documentation does not name a specific AI image recognition development library, we can assume a comparable capability exists (similar to TensorFlow Lite or OpenCV on other platforms). The following simplified, conceptual example shows the basic flow of image scene recognition (the library and functions are assumed):
```typescript
// Conceptual example only: 'AIImageRecognitionLibrary' is an assumed module, not an actual HarmonyOS API.
import { AIImageRecognitionLibrary } from '@ohos.aiimagerecognition';

// Load the photo (assuming the photo file path has been obtained)
let photoPath = 'photo.jpg';
let photo = AIImageRecognitionLibrary.loadImage(photoPath);

// Perform image scene recognition
let sceneResult = AIImageRecognitionLibrary.recognizeScene(photo);
console.log('Scene recognition result:', sceneResult.scene);
```
In this example, first, the photo is loaded, then the image scene recognition function is called to recognize the photo, and finally, the recognition result is output. In actual development, detailed parameter settings and function calls need to be made according to the specific library and API used, including model selection, recognition threshold setting, etc.
- Optimization of the Deep Learning Model and Examples of Effect Improvement: To improve the effect of AI image recognition, the deep learning model itself can be optimized, for example by applying model compression to reduce the model size and speed up on-device inference. The following is a simple post-training quantization example (assuming TensorFlow Lite is used for model quantization):
```python
import tensorflow as tf

# Load the original (un-quantized) model from a SavedModel directory.
# A .tflite file cannot be re-converted, so quantization starts from the source model.
saved_model_dir = 'original_saved_model'
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)

# Enable default post-training quantization
converter.optimizations = [tf.lite.Optimize.DEFAULT]
quantized_model = converter.convert()

# Save the quantized model
with open('quantized_model.tflite', 'wb') as f:
    f.write(quantized_model)
```
Through model quantization, the model can run faster on HarmonyOS Next devices without significantly reducing the recognition accuracy, improving the efficiency of AI image recognition. At the same time, increasing the diversity of training data and collecting more photos under different scenes and shooting conditions for training can also improve the generalization ability of the model, thereby enhancing the recognition effect.
(2) Implementation and Linkage of the Speech Recognition Function
- Implementation Process through Core Speech Kit: The following simplified example shows how voice command recognition might be implemented with Core Speech Kit (assuming the relevant interfaces and classes have been correctly imported; the exact class and method names may differ from the actual Kit API):
```typescript
// Simplified, conceptual usage of Core Speech Kit; the class and method names below are illustrative.
import { SpeechRecognizer } from '@kit.CoreSpeechKit';

// Create a speech recognizer instance
let recognizer = SpeechRecognizer.createSpeechRecognizer();

// Set the recognition parameters (such as language, sampling rate, etc.)
let params = {
  language: 'zh_CN',
  sampleRate: 16000
};
recognizer.setRecognitionParams(params);

// Start speech recognition
recognizer.startRecognition();

// Register the recognition result callback function
recognizer.on('result', (result) => {
  console.log('Recognition result:', result.text);
});
```
In this example, first, a speech recognizer instance is created and the recognition parameters are set, then speech recognition is started, and the recognition result is obtained through the callback function.
- Implementation Methods of the Linkage between AI Image Recognition and Speech Recognition: The key to linking AI image recognition and speech recognition lies in the transfer and collaborative processing of data. When the speech recognition module obtains the text of a voice command, it passes it to the photo album search module, which uses the indexes created by the AI image recognition module to search the album according to the keywords in the command. For example, if the voice command is "Find the photos containing flowers," the search module looks up photo features related to "flowers" in the indexes and returns a list of matching photos, which are then displayed through the user interaction module. When displaying the photos, the AI image recognition module can further analyze them, for example recognizing the types and colors of the flowers, and show relevant prompts on the interface to help the user understand the photos.
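As a complement to the search flow sketched earlier, the following fragment illustrates the second half of the linkage: after matching photos are found, each one is passed back to the image recognition module so that tags (for example flower types and colors) can be shown alongside it. It reuses the hypothetical module interfaces introduced above and is a sketch only.

```typescript
// Hypothetical enrichment step: attach recognition tags to each matched photo before display.
async function showMatchesWithTags(
  matches: string[],
  recognition: ImageRecognitionModule,
  ui: UserInteractionModule
): Promise<void> {
  for (const path of matches) {
    const tags = await recognition.recognizeScene(path);   // e.g. ['flower', 'rose', 'red']
    ui.showMessage(`${path}: ${tags.join(', ')}`);          // show tags as prompts next to the photo
  }
  ui.showPhotos(matches);
}
```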
(3) Data Caching and Processing Strategies
- Design of the Data Caching Mechanism: To improve system performance and reduce repeated computation and data loading time, a data caching mechanism can be designed. The results of AI image recognition, such as scene recognition results and subject segmentation results, can be cached, so that when the user views an already-recognized photo again, the results are read directly from the cache instead of being recomputed. A combination of memory caching and disk caching can be adopted: results for recently and frequently accessed photos are kept in the memory cache for fast reading, while results for rarely accessed photos are stored in the disk cache to save memory. A cache validity period and an eviction strategy should also be set, so that expired entries, or old entries when memory runs low, are evicted according to defined rules.
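A minimal sketch of such a two-tier cache is shown below. The memory tier is a small LRU map; a plain `Map` stands in for the disk tier here, whereas a real application would use the platform's file or key-value storage. All names are illustrative assumptions.

```typescript
// Hypothetical two-tier cache for recognition results: in-memory LRU plus a "disk" tier.
const MEMORY_CAPACITY = 200;
const memoryCache = new Map<string, string[]>();  // photo path -> recognition tags (LRU order)
const diskCache = new Map<string, string[]>();    // stand-in for persistent storage

function putRecognitionResult(photoPath: string, tags: string[]): void {
  memoryCache.delete(photoPath);                  // refresh LRU position
  memoryCache.set(photoPath, tags);
  if (memoryCache.size > MEMORY_CAPACITY) {       // evict the least recently used entry
    const oldest = memoryCache.keys().next().value as string;
    memoryCache.delete(oldest);
  }
  diskCache.set(photoPath, tags);                 // always keep a durable copy
}

function getRecognitionResult(photoPath: string): string[] | undefined {
  const inMemory = memoryCache.get(photoPath);
  if (inMemory) {
    return inMemory;                              // memory hit: fastest path
  }
  const onDisk = diskCache.get(photoPath);
  if (onDisk) {
    putRecognitionResult(photoPath, onDisk);      // promote back into the memory tier
  }
  return onDisk;                                  // undefined means the photo must be re-recognized
}
```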
- Optimization of the Data Processing Strategy: In terms of data processing, asynchronous processing and multi-threading can improve the system's responsiveness. For example, when the user opens the album, the AI image recognition module can recognize photos and build indexes asynchronously in a background thread without blocking the user's foreground operations. Speech recognition results can likewise be processed asynchronously to avoid freezes during recognition. During data transmission, the data format and transfer method can be optimized to reduce data volume and transfer time; for example, photo feature vectors can be compressed before storage and transmission and decompressed when needed, improving data processing efficiency.
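The sketch below shows one simple way to index photos asynchronously in small batches so the UI thread is not blocked. It reuses the hypothetical `PhotoMetadata` type and `indexPhoto` helper from the storage sketch, and `analyzePhoto` is an assumed stand-in for the real recognition call.

```typescript
// Hypothetical stand-in for running AI recognition on one photo and returning its metadata.
async function analyzePhoto(photoPath: string): Promise<PhotoMetadata> {
  // In a real app this would call the image recognition module.
  return {
    path: photoPath,
    takenAt: new Date(),
    people: [],
    sceneTags: ['unknown'],
  };
}

// Index the whole album in small batches, yielding between batches so the UI stays responsive.
async function indexAlbumInBackground(photoPaths: string[], batchSize = 20): Promise<void> {
  for (let i = 0; i < photoPaths.length; i += batchSize) {
    const batch = photoPaths.slice(i, i + batchSize);
    const metadataList = await Promise.all(batch.map(analyzePhoto));
    metadataList.forEach(indexPhoto);
    // Yield control back to the event loop before the next batch.
    await new Promise((resolve) => setTimeout(resolve, 0));
  }
}
```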
III. User Experience Optimization and Application Expansion
(1) User Experience Evaluation and Feedback Processing
- Evaluation Indicators and Methods: The user experience of the smart photo album can be evaluated from several angles. Recognition accuracy is a key indicator: manually annotate the true classification and content of a sample of photos, compare them with the AI image recognition results, and calculate the accuracy rate. For example, select 100 landscape photos and check what proportion is correctly classified as landscape. Ease of operation can be evaluated through user testing by recording the time and number of steps needed for common operations (searching, classifying, and editing photos), for example the average time from issuing a voice search command to seeing the results, or the number of clicks needed to classify photos manually. User satisfaction can be collected through questionnaires or user feedback to understand how satisfied users are with the album's functions, interface design, and recognition quality.
- User Feedback Collection and Optimization Measures: Collecting user feedback is an important basis for improving the user experience. A feedback entry can be provided in the application to encourage users to submit opinions and suggestions; for example, when users encounter recognition errors or awkward operations, they can report them directly in the app. Optimization measures are then taken based on the feedback. If users report low recognition accuracy for certain scenes, more training data for those scenes can be collected and the AI image recognition model retrained or optimized. If users find the interface unintuitive, it can be redesigned to simplify the operation flow and improve ease of use.
(2) Optimization Measures to Improve the User Experience
- Optimization Strategies for Interface Design: Interface design can be optimized from two angles: visual effects and interaction design. Visually, choose a simple, attractive layout and color scheme so the album interface looks comfortable and uncluttered, for example large thumbnails for easy browsing and soft background colors that highlight the photo content. In interaction design, simplify the operation flow and reduce the number of steps: provide quick-action buttons such as one-click search and one-click classification on the album display interface, and support gestures such as swiping and pinch-to-zoom on the photo viewing interface, matching the user's operation habits.
- Implementation of the Personalized Recommendation Function: AI image recognition and the user's historical behavior data can be combined to implement personalized recommendation. The AI image recognition module analyzes the content of the photos in the user's album and extracts the user's interest preferences, such as preferred scenes, people, and shooting styles. Based on these preferences, relevant photos or albums are recommended when the user opens the album or browses photos; for example, a user who often takes landscape photos might be shown notable landscape photography or photos of popular nearby scenic spots. Combining this with browsing history and operation behavior, such as frequently viewing photos of a certain person or editing certain types of photos, further refines the recommendations and improves their accuracy and appeal.
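One simple way to derive such preferences is to count how often each recognition tag appears across the user's photos, then rank candidate photos by how well their tags match the strongest preferences. The sketch below is illustrative only and reuses the hypothetical `PhotoMetadata` type from the earlier storage sketch.

```typescript
// Count how often each scene tag appears in the user's album to approximate interest preferences.
function buildPreferenceProfile(album: PhotoMetadata[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const photo of album) {
    for (const tag of photo.sceneTags) {
      counts.set(tag, (counts.get(tag) ?? 0) + 1);
    }
  }
  return counts;
}

// Rank candidate photos by how strongly their tags overlap with the preference profile.
function recommendPhotos(
  candidates: PhotoMetadata[],
  profile: Map<string, number>,
  topN = 10
): PhotoMetadata[] {
  return candidates
    .map((photo) => ({
      photo,
      score: photo.sceneTags.reduce((sum, tag) => sum + (profile.get(tag) ?? 0), 0),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map((entry) => entry.photo);
}
```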
- Improvement of the Voice Interaction Process: Improving the voice interaction flow raises both recognition accuracy and the overall experience. In the voice input stage, provide voice prompts and guidance to help users express their needs more precisely; for example, when the user taps the voice input button, the system can prompt "Please say the content of the photos you want to find." During recognition, display the progress and intermediate results in real time so the user knows whether the system has understood the command, and if the result is ambiguous, confirm with the user or offer alternative options. In the command execution stage, give clear, accurate feedback; for example, after a search the system can announce "A total of [X] photos matching the conditions have been found" and display the photo list.
(3) Discussion of Extended Functions and Scene Demonstration
- Photo Editing Suggestion Function Based on AI Image Recognition: AI image recognition can also power a photo editing suggestion function. When the user selects a photo to edit, the AI image recognition module analyzes its content and features, such as the subject's expression and posture or the photo's color and composition, and offers editing suggestions based on the analysis. For example, if the subject's expression looks unnatural, the system might suggest applying a filter or adjusting the facial expression; if the composition is weak, it might suggest cropping the photo or adjusting the framing. These suggestions can be presented as text prompts or preset editing templates, helping the user quickly produce high-quality edits.
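A very small sketch of how analysis results might be mapped to suggestions is shown below; the `PhotoAnalysis` shape and the threshold rules are assumptions made for illustration, not an actual recognition output format.

```typescript
// Hypothetical analysis output for a single photo.
interface PhotoAnalysis {
  expressionScore: number;   // 0..1, how natural the subject's expression looks
  compositionScore: number;  // 0..1, how well-composed the photo is
  brightness: number;        // 0..1, overall brightness
}

// Map analysis results to human-readable editing suggestions via simple threshold rules.
function suggestEdits(analysis: PhotoAnalysis): string[] {
  const suggestions: string[] = [];
  if (analysis.expressionScore < 0.5) {
    suggestions.push('Try a softening filter or retake the photo with a more natural expression.');
  }
  if (analysis.compositionScore < 0.5) {
    suggestions.push('Consider cropping the photo to improve the composition.');
  }
  if (analysis.brightness < 0.3) {
    suggestions.push('Increase the exposure or brightness.');
  }
  return suggestions;
}
```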
- Application Scenarios and Effects of Integration with Social Platforms: Integrating the smart photo album application with social platforms can expand the sharing and social