Visit https://github.com/langhuihui/monibuca to see the source code.
Author's Foreword
As developers of the Monibuca streaming server, we have been continuously seeking to provide more efficient and flexible streaming solutions. With the evolution of Web frontend technologies, especially the widespread application of Media Source Extensions (MSE), we gradually recognized that traditional streaming transmission solutions can no longer meet the demands of modern applications. During our exploration and practice, we discovered that fMP4 (fragmented MP4) technology effectively bridges traditional media formats with modern Web technologies, providing users with a smoother video experience.
In the implementation of the MP4 plugin for the Monibuca project, we faced the challenge of efficiently converting recorded MP4 files into a format compatible with MSE playback. Through in-depth research on the HLS v7 protocol and fMP4 container format, we ultimately developed a comprehensive solution supporting real-time conversion from MP4 to fMP4, seamless merging of multiple MP4 segments, and optimizations for frontend MSE playback. This article shares our technical exploration and implementation approach during this process.
Introduction
As streaming media technology evolves, video distribution methods continue to advance. From traditional complete downloads to progressive downloads, and now to widely used adaptive bitrate streaming technology, each advancement has significantly enhanced the user experience. This article will explore the implementation of fMP4 (fragmented MP4) technology based on HLS v7, and how it integrates with Media Source Extensions (MSE) in modern Web frontends to create efficient and smooth video playback experiences.
Evolution of HLS Protocol and Introduction of fMP4
Traditional HLS and Its Limitations
HTTP Live Streaming (HLS) is an HTTP adaptive bitrate streaming protocol developed by Apple. In earlier versions, HLS primarily used TS (Transport Stream) segments as the media container format. Although the TS format has good error resilience and streaming characteristics, it also has several limitations:
- Larger file size compared to container formats like MP4
- Each TS segment needs to contain complete initialization information, causing redundancy
- Poor integration with the rest of the Web technology stack
HLS v7 and fMP4
HLS v7 introduced support for fMP4 (fragmented MP4) segments, marking a significant advancement in the HLS protocol. As a media container format, fMP4 offers the following advantages over TS:
- Smaller file size, higher transmission efficiency
- Shares the same underlying container format with other streaming protocols like DASH, facilitating a unified technology stack
- Better support for modern codecs
- Better compatibility with MSE (Media Source Extensions)
In HLS v7, seamless playback of fMP4 segments is achieved by specifying the initialization segment with the #EXT-X-MAP tag in the playlist.
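For illustration, a minimal HLS v7 playlist built around an fMP4 initialization segment might look like the following (all URIs are hypothetical):

```
#EXTM3U
#EXT-X-VERSION:7
#EXT-X-TARGETDURATION:4
#EXT-X-MAP:URI="init.mp4"
#EXTINF:4.000,
segment0.m4s
#EXTINF:4.000,
segment1.m4s
#EXT-X-ENDLIST
```

The #EXT-X-MAP entry is fetched once and carries the ftyp and moov boxes; each subsequent segment then only needs to contain moof+mdat fragments.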
MP4 File Structure and fMP4 Basic Principles
Traditional MP4 Structure
Traditional MP4 files follow the ISO Base Media File Format (ISO BMFF) specification and mainly consist of the following parts:
- ftyp (File Type Box): Indicates the format and compatibility information of the file
- moov (Movie Box): Contains metadata about the media, such as track information, codec parameters, etc.
- mdat (Media Data Box): Contains the actual media data
In a traditional MP4, the moov box contains all the metadata and index tables for the entire video, and it is written either at the beginning of the file or, commonly for recordings, at the end (the sample tables are only complete once writing finishes). This structure is unfriendly to streaming: the player must obtain the complete moov before playback can begin, which in the worst case means downloading the entire file first.
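The box layout is easy to inspect programmatically: every box starts with a 4-byte big-endian size followed by a 4-character type. Below is a minimal Go sketch (not Monibuca's actual parser; "video.mp4" is a hypothetical input) that walks the top-level boxes of a file:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"io"
	"os"
)

// Walk the top-level boxes of an MP4 file, printing each box's type and
// payload size.
func main() {
	f, err := os.Open("video.mp4")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var header [8]byte
	for {
		if _, err := io.ReadFull(f, header[:]); err != nil {
			break // io.EOF: no more boxes
		}
		size := int64(binary.BigEndian.Uint32(header[:4]))
		boxType := string(header[4:8])
		if size == 0 {
			break // box extends to the end of the file
		}
		payload := size - 8
		if size == 1 { // a 64-bit "largesize" follows the box type
			var large [8]byte
			if _, err := io.ReadFull(f, large[:]); err != nil {
				break
			}
			payload = int64(binary.BigEndian.Uint64(large[:])) - 16
		}
		fmt.Printf("%s: %d-byte payload\n", boxType, payload)
		// Skip the payload to reach the next top-level box.
		if _, err := f.Seek(payload, io.SeekCurrent); err != nil {
			break
		}
	}
}
```

Running this against a typical camera recording will often print ftyp, mdat, moov in that order, illustrating exactly the "moov at the end" problem described above.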
Below is a diagram of the MP4 file box structure:
graph TD
MP4[MP4 File] --> FTYP[ftyp box]
MP4 --> MOOV[moov box]
MP4 --> MDAT[mdat box]
MOOV --> MVHD[mvhd: Movie header]
MOOV --> TRAK1[trak: Video track]
MOOV --> TRAK2[trak: Audio track]
TRAK1 --> TKHD1[tkhd: Track header]
TRAK1 --> MDIA1[mdia: Media info]
TRAK2 --> TKHD2[tkhd: Track header]
TRAK2 --> MDIA2[mdia: Media info]
MDIA1 --> MDHD1[mdhd: Media header]
MDIA1 --> HDLR1[hdlr: Handler]
MDIA1 --> MINF1[minf: Media info container]
MDIA2 --> MDHD2[mdhd: Media header]
MDIA2 --> HDLR2[hdlr: Handler]
MDIA2 --> MINF2[minf: Media info container]
MINF1 --> STBL1[stbl: Sample table]
MINF2 --> STBL2[stbl: Sample table]
STBL1 --> STSD1[stsd: Sample description]
STBL1 --> STTS1[stts: Time-to-sample]
STBL1 --> STSC1[stsc: Sample-to-chunk]
STBL1 --> STSZ1[stsz: Sample size]
STBL1 --> STCO1[stco: Chunk offset]
STBL2 --> STSD2[stsd: Sample description]
STBL2 --> STTS2[stts: Time-to-sample]
STBL2 --> STSC2[stsc: Sample-to-chunk]
STBL2 --> STSZ2[stsz: Sample size]
STBL2 --> STCO2[stco: Chunk offset]
fMP4 Structural Characteristics
fMP4 (fragmented MP4) restructures the traditional MP4 format with the following key features:
- Divides media data into multiple fragments
- Each fragment contains its own metadata and media data
- The file structure is more suitable for streaming transmission
The main components of fMP4:
- ftyp: Same as traditional MP4, located at the beginning of the file
- moov: Contains overall track information, but not specific sample information
- moof (Movie Fragment Box): Contains metadata for specific fragments
- mdat: Contains media data associated with the preceding moof
Below is a diagram of the fMP4 file box structure:
graph TD
FMP4[fMP4 File] --> FTYP[ftyp box]
FMP4 --> MOOV[moov box]
FMP4 --> MOOF1[moof 1: Fragment 1 metadata]
FMP4 --> MDAT1[mdat 1: Fragment 1 media data]
FMP4 --> MOOF2[moof 2: Fragment 2 metadata]
FMP4 --> MDAT2[mdat 2: Fragment 2 media data]
FMP4 -.- MOOFN[moof n: Fragment n metadata]
FMP4 -.- MDATN[mdat n: Fragment n media data]
MOOV --> MVHD[mvhd: Movie header]
MOOV --> MVEX[mvex: Movie extends]
MOOV --> TRAK1[trak: Video track]
MOOV --> TRAK2[trak: Audio track]
MVEX --> TREX1[trex 1: Track extends]
MVEX --> TREX2[trex 2: Track extends]
MOOF1 --> MFHD1[mfhd: Fragment header]
MOOF1 --> TRAF1[traf: Track fragment]
TRAF1 --> TFHD1[tfhd: Track fragment header]
TRAF1 --> TFDT1[tfdt: Track fragment decode time]
TRAF1 --> TRUN1[trun: Track run]
This structure allows the player to begin processing subsequent moof+mdat fragments immediately after receiving the initial ftyp and moov, making it highly suitable for streaming transmission and real-time playback.
Conversion Principles from MP4 to fMP4
The MP4 to fMP4 conversion process can be illustrated by the following sequence diagram:
sequenceDiagram
participant MP4 as Source MP4 File
participant Demuxer as MP4 Parser
participant Muxer as fMP4 Muxer
participant fMP4 as Target fMP4 File
MP4->>Demuxer: Read MP4 file
Note over Demuxer: Parse file structure
Demuxer->>Demuxer: Extract ftyp info
Demuxer->>Demuxer: Parse moov box
Demuxer->>Demuxer: Extract tracks info<br>(video, audio tracks)
Demuxer->>Muxer: Pass track metadata
Muxer->>fMP4: Write ftyp box
Muxer->>Muxer: Create streaming-friendly moov
Muxer->>Muxer: Add mvex extension
Muxer->>fMP4: Write moov box
loop For each media sample
Demuxer->>MP4: Read sample data
Demuxer->>Muxer: Pass sample
Muxer->>Muxer: Create moof box<br>(time and position info)
Muxer->>Muxer: Create mdat box<br>(actual media data)
Muxer->>fMP4: Write moof+mdat pair
end
Note over fMP4: Conversion complete
As shown in the diagram, the conversion process consists of three key steps:
- Parse the source MP4 file: read and parse the structure of the original MP4 file, extracting video and audio track information including codec type, frame rate, resolution, and other metadata.
- Create the initialization part of the fMP4: build the file header and initialization section, consisting of the ftyp and moov boxes. These serve as the initialization segment, containing all the information the decoder needs but no actual media sample data.
- Create fragments for each sample: read the sample data from the original MP4 one by one, then create a corresponding moof and mdat box pair for each sample (or group of samples).
This conversion method transforms MP4 files that were only suitable for download-and-play into fMP4 format suitable for streaming transmission.
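The loop below sketches these steps in Go. The TrackInfo, Sample, Demuxer, and Muxer definitions are hypothetical stand-ins for illustration, not Monibuca's actual API:

```go
package fmp4sketch

import "io"

// TrackInfo carries the per-track metadata extracted from the source moov.
type TrackInfo struct {
	Codec     string // e.g. "h264", "aac"
	ExtraData []byte // codec configuration, e.g. SPS/PPS for H.264
}

// Sample is one frame of media data with its timing information.
type Sample struct {
	TrackID   uint32
	Timestamp uint64 // milliseconds
	KeyFrame  bool
	Data      []byte
}

// Demuxer reads a traditional MP4; Muxer writes fMP4.
type Demuxer interface {
	Tracks() []TrackInfo
	ReadSample() (*Sample, error) // returns io.EOF when exhausted
}

type Muxer interface {
	WriteInitSegment(tracks []TrackInfo) error // ftyp + moov (with mvex)
	WriteFragment(s *Sample) error             // one moof+mdat pair
}

// Convert mirrors the three steps above: parse the source, write the
// initialization segment, then emit one fragment per sample.
func Convert(d Demuxer, m Muxer) error {
	if err := m.WriteInitSegment(d.Tracks()); err != nil {
		return err
	}
	for {
		s, err := d.ReadSample()
		if err == io.EOF {
			return nil // conversion complete
		}
		if err != nil {
			return err
		}
		if err := m.WriteFragment(s); err != nil {
			return err
		}
	}
}
```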
Multiple MP4 Segment Merging Technology
User Requirement: Time-Range Recording Downloads
In scenarios such as video surveillance, course playback, and live broadcast recording, users often need to download recorded content within a specific time range. For example, a security system operator might only need to export video segments containing specific events, or a student on an educational platform might only want to download key parts of a course. However, since systems typically divide recorded files by fixed durations (e.g., 30 minutes or 1 hour) or specific events (such as the start/end of a live broadcast), the time range needed by users often spans multiple independent MP4 files.
In the Monibuca project, we developed a solution based on time range queries and multi-file merging to address this need. Users only need to specify the start and end times of the content they require, and the system will:
- Query the database to find all recording files that overlap with the specified time range
- Extract relevant time segments from each file
- Seamlessly merge these segments into a single downloadable file
This approach greatly enhances the user experience, allowing them to precisely obtain the content they need without having to download and browse through large amounts of irrelevant video content.
Database Design and Time Range Queries
To support time range queries, our recording file metadata in the database includes the following key fields:
- Stream Path: Identifies the video source
- Start Time: The start time of the recording segment
- End Time: The end time of the recording segment
- File Path: The storage location of the actual recording file
- Type: The file format, such as "mp4"
When a user requests recordings within a specific time range, the system executes a query similar to the following:
SELECT * FROM record_streams
WHERE stream_path = ? AND type = 'mp4'
AND start_time <= ? AND end_time >= ?
With the first placeholder bound to the requested end time and the second to the requested start time, this query returns every recording segment that overlaps the requested range; the system then extracts the relevant parts from each file and merges them.
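A minimal sketch of this lookup in Go, assuming database/sql and the record_streams schema described above (the struct and column names are illustrative):

```go
package recordquery

import (
	"database/sql"
	"time"
)

// Recording mirrors the metadata fields described above.
type Recording struct {
	FilePath  string
	StartTime time.Time
	EndTime   time.Time
}

// FindOverlapping returns all MP4 recordings for streamPath that intersect
// the requested [start, end] range. Note the bind order: a segment overlaps
// the range if it starts before the range ends AND ends after it starts.
func FindOverlapping(db *sql.DB, streamPath string, start, end time.Time) ([]Recording, error) {
	rows, err := db.Query(
		`SELECT file_path, start_time, end_time
		   FROM record_streams
		  WHERE stream_path = ? AND type = 'mp4'
		    AND start_time <= ? AND end_time >= ?`,
		streamPath, end, start,
	)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	var recs []Recording
	for rows.Next() {
		var r Recording
		if err := rows.Scan(&r.FilePath, &r.StartTime, &r.EndTime); err != nil {
			return nil, err
		}
		recs = append(recs, r)
	}
	return recs, rows.Err()
}
```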
Technical Challenges of Multiple MP4 Merging
Merging multiple MP4 files is not a simple file concatenation but requires addressing the following technical challenges:
- Timestamp Continuity: Ensuring that the timestamps in the merged video are continuous, without jumps or overlaps
- Codec Consistency: Handling cases where different MP4 files may use different encoding parameters
- Metadata Merging: Correctly merging the moov box information from various files
- Precise Cutting: Precisely extracting content within the user-specified time range from each file
In practical applications, we implemented two merging strategies: regular MP4 merging and fMP4 merging. These strategies each have their advantages and are suitable for different application scenarios.
Regular MP4 Merging Process
sequenceDiagram
participant User as User
participant API as API Service
participant DB as Database
participant MP4s as Multiple MP4 Files
participant Muxer as MP4 Muxer
participant Output as Output MP4 File
User->>API: Request time-range recording<br>(stream, startTime, endTime)
API->>DB: Query records within specified range
DB-->>API: Return matching recording list
loop For each MP4 file
API->>MP4s: Read file
MP4s->>Muxer: Parse file structure
Muxer->>Muxer: Parse track info
Muxer->>Muxer: Extract media samples
Muxer->>Muxer: Adjust timestamps for continuity
Muxer->>Muxer: Record sample info and offsets
Note over Muxer: Skip samples outside time range
end
Muxer->>Output: Write ftyp box
Muxer->>Output: Write adjusted sample data
Muxer->>Muxer: Create moov containing all sample info
Muxer->>Output: Write merged moov box
Output-->>User: Provide merged file to user
In this approach, the merging process primarily involves arranging media samples from different MP4 files in sequence and adjusting timestamps to ensure continuity; finally, a new moov box containing all sample information is generated. The advantage of this method is its good compatibility: almost all players can play the merged file, making it suitable for download and offline playback scenarios.
It is worth noting that the implementation handles the overlap between the requested time range and each recording's actual time span, extracting only the content the user asked for:
// In the first file, seek to the requested start time and use the sample
// found there as the new zero point for output timestamps.
if i == 0 {
	startTimestamp := startTime.Sub(stream.StartTime).Milliseconds()
	var startSample *box.Sample
	if startSample, err = demuxer.SeekTime(uint64(startTimestamp)); err != nil {
		// The requested start lies beyond this file; skip it entirely.
		tsOffset = 0
		continue
	}
	tsOffset = -int64(startSample.Timestamp)
}
// In the last file, frames beyond the requested end time are skipped.
if i == streamCount-1 && int64(sample.Timestamp) > endTime.Sub(stream.StartTime).Milliseconds() {
	break
}
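Between files, the same tsOffset keeps the output timeline continuous: each written sample's timestamp is shifted by the offset, and when one file is exhausted the offset advances so the next file starts where the previous one stopped. A simplified sketch of the idea (lastTimestamp is hypothetical, standing for the timestamp of the last sample written from the current file):

```go
// Every output sample is shifted onto the merged timeline.
outTimestamp := int64(sample.Timestamp) + tsOffset

// After the current file is exhausted, advance the offset so the next
// file's first sample (file-relative timestamp ~0) continues the timeline.
// This ignores the duration of the final frame for simplicity.
tsOffset += int64(lastTimestamp)
```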
fMP4 Merging Process
sequenceDiagram
participant User as User
participant API as API Service
participant DB as Database
participant MP4s as Multiple MP4 Files
participant Muxer as fMP4 Muxer
participant Output as Output fMP4 File
User->>API: Request time-range recording<br>(stream, startTime, endTime)
API->>DB: Query records within specified range
DB-->>API: Return matching recording list
Muxer->>Output: Write ftyp box
Muxer->>Output: Write initial moov box<br>(including mvex)
loop For each MP4 file
API->>MP4s: Read file
MP4s->>Muxer: Parse file structure
Muxer->>Muxer: Parse track info
Muxer->>Muxer: Extract media samples
loop For each sample
Note over Muxer: Check if sample is within target time range
Muxer->>Muxer: Adjust timestamp
Muxer->>Muxer: Create moof+mdat pair
Muxer->>Output: Write moof+mdat pair
end
end
Output-->>User: Provide merged file to user
The fMP4 merging is more flexible: each sample is packed into an independent moof+mdat fragment that remains independently decodable, which is better suited to streaming transmission and random access. This approach pairs particularly well with MSE and HLS, supporting real-time streaming playback, so users can play the merged content directly in the browser without waiting for the entire file to download.
Handling Codec Compatibility in Merging
In the process of merging multiple recordings, a key challenge we face is handling potential codec parameter differences between files. For example, during long-term recording, a camera might adjust video resolution due to environmental changes, or an encoder might reinitialize, causing changes in encoding parameters.
To solve this problem, Monibuca implements a smart track version management system that identifies changes by comparing encoder-specific data (ExtraData):
sequenceDiagram
participant Muxer as Merger
participant Track as Track Manager
participant History as Track Version History
loop For each new track
Muxer->>Track: Check track encoding parameters
Track->>History: Compare with existing track versions
alt Found matching track version
History-->>Track: Return existing track
Track-->>Muxer: Use existing track
else No matching version
Track->>Track: Create new track version
Track->>History: Add to version history
Track-->>Muxer: Use new track
end
end
This design ensures that even if there are encoding parameter changes in the original recordings, the merged file can maintain correct decoding parameters, providing users with a smooth playback experience.
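A minimal sketch of this version check, assuming each track exposes its codec ExtraData (e.g. avcC/SPS/PPS for H.264); the types and names are illustrative rather than Monibuca's actual implementation:

```go
package merge

import "bytes"

// TrackVersion records one set of encoding parameters seen while merging.
type TrackVersion struct {
	ID        uint32
	ExtraData []byte // codec-specific config, e.g. avcC for H.264
}

// TrackManager keeps every parameter set encountered so far.
type TrackManager struct {
	versions []TrackVersion
	nextID   uint32
}

// Resolve returns the existing track version whose ExtraData matches, or
// registers a new version when the encoding parameters have changed.
func (m *TrackManager) Resolve(extra []byte) TrackVersion {
	for _, v := range m.versions {
		if bytes.Equal(v.ExtraData, extra) {
			return v // same parameters: reuse the existing track
		}
	}
	m.nextID++
	v := TrackVersion{ID: m.nextID, ExtraData: append([]byte(nil), extra...)}
	m.versions = append(m.versions, v)
	return v
}
```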
Performance Optimization
When processing large video files or a large number of concurrent requests, the performance of the merging process is an important consideration. We have adopted the following optimization measures:
- Streaming Processing: Process samples frame by frame to avoid loading entire files into memory
- Parallel Processing: Use parallel processing for multiple independent tasks (such as file parsing)
- Smart Caching: Cache commonly used encoding parameters and file metadata
- On-demand Reading: Only read and process samples within the target time range
These optimizations enable the system to efficiently process large-scale recording merging requests, completing processing within a reasonable time even for long-term recordings spanning hours or days.
The multiple MP4 merging functionality greatly enhances the flexibility and user experience of Monibuca as a streaming server, allowing users to precisely obtain the recorded content they need, regardless of how the original recordings are segmented and stored.
Media Source Extensions (MSE) and fMP4 Compatibility Implementation
MSE Technology Overview
Media Source Extensions (MSE) is a JavaScript API that allows web developers to directly manipulate media stream data. It enables custom adaptive bitrate streaming players to be implemented entirely in the browser without relying on external plugins.
The core working principle of MSE is:
- Create a MediaSource object
- Create one or more SourceBuffer objects
- Append media fragments to the SourceBuffer
- The browser is responsible for decoding and playing these fragments
Perfect Integration of fMP4 with MSE
The fMP4 format is naturally compatible with MSE, mainly in the following respects:
- Each fragment of fMP4 can be independently decoded
- The clear separation of initialization segments and media segments conforms to MSE's buffer management model
- Precise timestamp control enables seamless splicing
The following sequence diagram shows how fMP4 works with MSE:
sequenceDiagram
participant Client as Browser Client
participant Server as Server
participant MSE as MediaSource API
participant Video as HTML5 Video Element
Client->>Video: Create video element
Client->>MSE: Create MediaSource object
Client->>Video: Set video.src = URL.createObjectURL(mediaSource)
MSE-->>Client: sourceopen event
Client->>MSE: Create SourceBuffer
Client->>Server: Request initialization segment (ftyp+moov)
Server-->>Client: Return initialization segment
Client->>MSE: appendBuffer(initialization segment)
loop During playback
Client->>Server: Request media segment (moof+mdat)
Server-->>Client: Return media segment
Client->>MSE: appendBuffer(media segment)
MSE-->>Video: Decode and render frames
end
In Monibuca's implementation, we have made a specific optimization for MSE: an independent moof and mdat is created for each frame. Although this adds some container overhead, it provides high flexibility, making it particularly suitable for low-latency real-time streaming and precise frame-level operations.
Integration of HLS and fMP4 in Practical Applications
In practical applications, we combine fMP4 technology with the HLS v7 protocol to implement time-range-based on-demand playback. Given a user-specified time range, the system looks up the matching MP4 records in the database and then generates an HLS playlist of fMP4 segments:
sequenceDiagram
participant Client as Client
participant Server as HLS Server
participant DB as Database
participant MP4Plugin as MP4 Plugin
Client->>Server: Request fMP4.m3u8<br>with time range parameters
Server->>DB: Query MP4 records within specified range
DB-->>Server: Return record list
Server->>Server: Create HLS v7 playlist<br>Version: 7
loop For each record
Server->>Server: Calculate duration
Server->>Server: Add media segment URL<br>/mp4/download/{stream}.fmp4?id={id}
end
Server->>Server: Add #EXT-X-ENDLIST marker
Server-->>Client: Return HLS playlist
loop For each segment
Client->>MP4Plugin: Request fMP4 segment
MP4Plugin->>MP4Plugin: Convert to fMP4 format
MP4Plugin-->>Client: Return fMP4 segment
end
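A sketch of the playlist-generation step in Go, following the URL pattern shown in the diagram (the Record type and its fields are illustrative):

```go
package playlist

import (
	"fmt"
	"strings"
	"time"
)

// Record is an illustrative stand-in for one MP4 record row.
type Record struct {
	ID        int64
	StartTime time.Time
	EndTime   time.Time
}

// Build emits an HLS v7 playlist whose media segments are fMP4 files
// converted on demand by the MP4 plugin.
func Build(streamPath string, records []Record) string {
	var b strings.Builder
	b.WriteString("#EXTM3U\n#EXT-X-VERSION:7\n")

	// Target duration must cover the longest segment, rounded up.
	var maxDur float64
	for _, r := range records {
		if d := r.EndTime.Sub(r.StartTime).Seconds(); d > maxDur {
			maxDur = d
		}
	}
	fmt.Fprintf(&b, "#EXT-X-TARGETDURATION:%d\n", int(maxDur)+1)

	for _, r := range records {
		dur := r.EndTime.Sub(r.StartTime).Seconds()
		fmt.Fprintf(&b, "#EXTINF:%.3f,\n", dur)
		fmt.Fprintf(&b, "/mp4/download/%s.fmp4?id=%d\n", streamPath, r.ID)
	}
	b.WriteString("#EXT-X-ENDLIST\n")
	return b.String()
}
```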
Through this approach, we maintain compatibility with existing HLS clients while leveraging the advantages of the fMP4 format to provide more efficient streaming services.
Conclusion
As a modern media container format, fMP4 combines the efficient compression of MP4 with the flexibility of streaming transmission, making it highly suitable for video distribution needs in modern web applications. Through integration with HLS v7 and MSE technologies, more efficient and flexible streaming services can be achieved.
In the practice of the Monibuca project, we have successfully built a complete streaming solution by implementing MP4 to fMP4 conversion, merging multiple MP4 files, and optimizing fMP4 fragment generation for MSE. The application of these technologies enables our system to provide a better user experience, including faster startup times, smoother quality transitions, and lower bandwidth consumption.
As video technology continues to evolve, fMP4, as a bridge connecting traditional media formats with modern Web technologies, will continue to play an important role in the streaming media field. The Monibuca project will also continue to explore and optimize this technology to provide users with higher quality streaming services.