DeepSeek 3FS: A High-Performance Distributed File System for Modern Workloads

In this blog post, we’ll dive deep into the design and implementation of DeepSeek 3FS, a distributed file system engineered for high-performance workloads like data analytics and machine learning. We’ll explore its architecture, components, file system interfaces, metadata management, and chunk storage system, with detailed explanations, diagrams, and flowcharts to break down the complexity.

Introduction to DeepSeek 3FS

DeepSeek 3FS is a distributed file system designed to provide strong consistency, high throughput, and fault tolerance, leveraging RDMA networks (InfiniBand or RoCE) and SSDs for optimal performance. It aims to bridge the gap between traditional file system semantics and modern object stores, offering a unified namespace and flexible data placement for applications.

The system comprises four main components:

  • Cluster Manager: Handles membership changes and distributes cluster configurations.
  • Metadata Service: Manages file metadata using a transactional key-value store.
  • Storage Service: Stores file chunks with strong consistency using Chain Replication with Apportioned Queries (CRAQ).
  • Client: Provides two interfaces—FUSE client for ease of adoption and a native client for performance-critical applications.

Let’s break down each component and their interactions.

System Architecture

The 3FS architecture is designed for scalability and fault tolerance, with all components communicating over an RDMA network for low-latency, high-bandwidth data transfers.

Components and Their Roles

  1. Cluster Manager:

    • Manages membership and configuration changes.
    • Multiple managers are deployed; one is elected as the primary using a distributed coordination service (e.g., ZooKeeper or etcd).
    • Receives heartbeats from metadata and storage services to detect failures.
    • Distributes updated cluster configurations to services and clients.
  2. Metadata Service:

    • Stateless and scalable, handling file metadata operations (e.g., open, create).
    • Stores metadata in a transactional key-value store (FoundationDB in production).
    • Clients can connect to any metadata service for load balancing.
  3. Storage Service:

    • Manages local SSDs and provides a chunk store interface.
    • Implements CRAQ for strong consistency and high read throughput.
    • File chunks are replicated across multiple SSDs for fault tolerance.
  4. Client:

    • FUSE Client: Integrates with applications via the FUSE kernel module for ease of use.
    • Native Client: Offers asynchronous zero-copy I/O for performance-critical applications.

Architecture Diagram

Below is a high-level architecture diagram of 3FS, showing the interactions between components:

+-------------------------+        +-------------------------+
|                         |        |                         |
|     Cluster Manager     | <----> |    Metadata Service     |
|                         |        |                         |
+-------------------------+        +-------------------------+
            |                                  |
            |                                  |
            v                                  v
+-------------------------+        +-------------------------+
|                         |        |                         |
|     Storage Service     | <----> |  Client (FUSE/Native)   |
|                         |        |                         |
+-------------------------+        +-------------------------+




File System Interfaces

3FS provides a POSIX-like file system interface with enhancements for modern workloads, addressing limitations of object stores while maintaining compatibility with existing applications.

Why File System Semantics?

Unlike object stores, 3FS offers:

  • Atomic Directory Manipulation: Supports operations like moving or deleting directories atomically, critical for workflows involving temporary directories.
  • Symbolic and Hard Links: Enables lightweight snapshots for dynamic datasets.
  • Familiar Interface: Simplifies adoption by supporting file-based data formats (e.g., CSV, Parquet) without requiring new APIs.

Limitations of FUSE

While the FUSE client simplifies integration, it introduces performance overheads:

  • Memory Copy Overhead: Data transfers between kernel and user space increase latency.
  • Multi-threading Bottlenecks: Lock contention in the FUSE shared queue limits scalability (benchmarks show ~400K 4KiB reads/sec).
  • Concurrent Writes: Linux FUSE (v5.x) does not support concurrent writes to the same file, requiring workarounds like writing to multiple files.

Native Client with Asynchronous Zero-Copy API

To address FUSE limitations, 3FS implements a native client with an asynchronous zero-copy API inspired by Linux io_uring. Key data structures include:

  • Iov: A shared memory region for zero-copy read/write operations, registered with InfiniBand for RDMA.
  • Ior: A ring buffer for request queuing, supporting batched and parallel I/O operations.

The native client spawns multiple threads to fetch and dispatch I/O requests to storage services, minimizing RPC overhead for small reads.
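To make the Iov/Ior interaction concrete, here is a minimal Python sketch of the submission flow. The real native client is C++ operating on RDMA-registered shared memory; the class and method names below are simplified stand-ins, not the actual 3FS API.

```python
# Illustrative sketch of the io_uring-style submission flow. The real native
# client is C++ with RDMA-registered shared memory; Iov, Ior, and the method
# names here are simplified stand-ins, not the actual 3FS API.
from collections import deque
from dataclasses import dataclass

@dataclass
class Iov:
    """Shared memory region used for zero-copy reads/writes (RDMA-registered in 3FS)."""
    buffer: bytearray

@dataclass
class ReadRequest:
    chunk_id: int
    offset: int
    length: int
    iov_offset: int          # where the result lands inside the shared Iov buffer

class Ior:
    """Ring buffer of pending I/O requests, drained in batches by client worker threads."""
    def __init__(self) -> None:
        self.queue: deque = deque()

    def submit(self, req: ReadRequest) -> None:
        self.queue.append(req)               # enqueue without blocking the caller

    def drain_batch(self, max_batch: int = 64) -> list:
        batch = []
        while self.queue and len(batch) < max_batch:
            batch.append(self.queue.popleft())
        return batch                         # one RPC per batch amortizes small-read overhead

ring = Ior()
ring.submit(ReadRequest(chunk_id=1, offset=0, length=4096, iov_offset=0))
print(ring.drain_batch())
```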

Flowchart: Native Client I/O Operation

A flowchart showing the lifecycle of an I/O request in the native client.

File Metadata Management

File metadata in 3FS is stored in FoundationDB, a distributed transactional key-value store providing Serializable Snapshot Isolation (SSI).

Metadata Structures

  • Inodes: Store attributes (e.g., ownership, permissions, timestamps) with a unique 64-bit ID.
    • File inodes include chunk size, chain table range, and shuffle seed.
    • Directory inodes include parent inode ID and layout configurations.
  • Directory Entries: Map parent inode IDs and entry names to target inode IDs.

Key Encoding

  • Inode keys: "INOD" + inode_id (little-endian for distribution across FoundationDB nodes).
  • Directory entry keys: "DENT" + parent_inode_id + entry_name.
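To make the encoding concrete, here is a small sketch of how such keys could be built. Only the "INOD"/"DENT" prefixing scheme comes from the design above; the exact byte packing is illustrative.

```python
# Sketch of the metadata key layout. Only the "INOD"/"DENT" prefixing scheme
# comes from the design above; the exact byte packing is illustrative.
import struct

def inode_key(inode_id: int) -> bytes:
    # Little-endian inode IDs scatter adjacent inodes across FoundationDB nodes.
    return b"INOD" + struct.pack("<Q", inode_id)

def dirent_key(parent_inode_id: int, entry_name: str) -> bytes:
    # Entries under one parent share a prefix, so listdir becomes a range scan.
    return b"DENT" + struct.pack("<Q", parent_inode_id) + entry_name.encode("utf-8")

print(inode_key(42).hex())
print(dirent_key(42, "model.ckpt"))
```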

Metadata Operations

  • Read-only Transactions: Used for queries (e.g., fstat, listdir).
  • Read-write Transactions: Used for updates (e.g., create, rename), with automatic retries on conflicts.
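The retry-on-conflict pattern is easy to sketch. The ConflictError and begin_transaction() names below are placeholders (as are db and do_rename), not the FoundationDB client API that 3FS uses in production.

```python
# Retry-on-conflict pattern for read-write metadata transactions. The
# ConflictError type and begin_transaction() are placeholders, not the
# FoundationDB client API that 3FS uses in production.
class ConflictError(Exception):
    """Raised when snapshot isolation detects a conflicting concurrent update."""

def run_transaction(db, op, max_retries: int = 10):
    for _ in range(max_retries):
        tr = db.begin_transaction()
        try:
            result = op(tr)      # e.g. read both dirents, write the new one, delete the old
            tr.commit()
            return result
        except ConflictError:
            tr.rollback()        # another metadata service won the race; retry the whole op
    raise RuntimeError("metadata transaction failed after repeated conflicts")

# A rename would run as: run_transaction(db, lambda tr: do_rename(tr, src, dst))
```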

Dynamic File Attributes

  • File Deletion: For write-opened files, deletion is deferred until all file descriptors are closed.
  • File Length Updates: Clients periodically report maximum write positions; final length is computed on close or fsync by querying storage services.
  • Optimizations: Uses rendezvous hashing to distribute length updates and hints in inodes to avoid querying all chains for small files.
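Rendezvous (highest-random-weight) hashing is simple to sketch: every client independently computes the same "winner" for a given inode, so length updates for one file converge on one service without central coordination. The hash choice below is illustrative, not necessarily what 3FS uses.

```python
# Rendezvous (highest-random-weight) hashing: every client independently picks
# the same metadata service for a given inode, so length updates for one file
# converge without central coordination. The hash choice is illustrative.
import hashlib

def pick_metadata_service(inode_id: int, services: list) -> str:
    def weight(svc: str) -> int:
        digest = hashlib.sha256(f"{svc}:{inode_id}".encode()).digest()
        return int.from_bytes(digest[:8], "big")
    return max(services, key=weight)

print(pick_metadata_service(42, ["meta-0", "meta-1", "meta-2"]))
```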

Chunk Storage System

The chunk storage system is designed for high bandwidth and fault tolerance, using CRAQ for replication and balanced data placement across SSDs.

Data Placement with CRAQ

Files are split into chunks, replicated across storage targets using CRAQ:

  • Write Path: Requests propagate from the head to the tail of a chain.
  • Read Path: Requests can be served by any target, balancing load across replicas.

Chain Table Example

Chain   Version   Target 1 (head)   Target 2   Target 3 (tail)
  1        1           A1               B1          C1
  2        1           D1               E1          F1

Each chain has a version number, incremented on updates by the cluster manager.
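The sketch below shows how a chunk index, together with the inode's chain range and shuffle seed, can deterministically select a chain from the table. The shuffle logic is a simplified stand-in for the real placement scheme.

```python
# Deterministic chunk-to-chain mapping: the inode's chain range and shuffle
# seed pick a chain for each chunk index. The shuffle logic is a simplified
# stand-in for the real placement scheme.
import random

def chain_for_chunk(chunk_index: int, chain_range: range, shuffle_seed: int) -> int:
    chains = list(chain_range)
    random.Random(shuffle_seed).shuffle(chains)   # same seed -> same order on every client
    return chains[chunk_index % len(chains)]

# Chunk 7 of a file whose inode references chains 0..15 with shuffle seed 1234:
print(chain_for_chunk(7, range(16), shuffle_seed=1234))
```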

Balanced Traffic During Recovery

To mitigate bottlenecks during failures, 3FS distributes read traffic across multiple SSDs using a balanced incomplete block design. For example, if node A fails, its traffic is split evenly among the other nodes.
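The toy example below uses every possible 3-node chain over 5 nodes, the simplest balanced layout, to show the effect: when one node fails, its read traffic fans out evenly over the survivors. It only illustrates the idea; it is not the actual chain table construction.

```python
# Toy illustration of balanced recovery traffic: with chains laid out so every
# pair of nodes co-occurs equally often, a failed node's reads fan out evenly
# over the survivors. Real 3FS chain tables come from a balanced incomplete
# block design; using all 3-node combinations here is just the simplest
# balanced layout.
from collections import Counter
from itertools import combinations

nodes = ["A", "B", "C", "D", "E"]
chains = list(combinations(nodes, 3))

failed = "A"
fallback_reads = Counter()
for chain in chains:
    if failed in chain:
        for survivor in chain:
            if survivor != failed:
                fallback_reads[survivor] += 1   # survivors absorb A's read traffic

print(fallback_reads)   # each of B, C, D, E picks up an equal share (3 chains)
```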

Recovery Traffic Flowchart

A flowchart showing read traffic redirection during failure.

Data Replication with CRAQ

Write Process

  1. Validate chain version.
  2. Fetch data via RDMA.
  3. Serialize writes at the head using a lock.
  4. Propagate writes along the chain.
  5. Commit at the tail and propagate acknowledgments.
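Here is a compact, transport-free sketch of that write path: writes propagate head to tail as pending versions, then acknowledgments commit them tail to head. Chain-version checks, locking, and RDMA transfers are omitted, and the class names are illustrative.

```python
# Transport-free sketch of the CRAQ write path: writes propagate head -> tail
# as pending versions, then acks commit them tail -> head. Chain-version
# checks, locking, and RDMA transfers are omitted.
class Target:
    def __init__(self, name: str):
        self.name = name
        self.committed = {}   # chunk_id -> committed data
        self.pending = {}     # chunk_id -> newest uncommitted data

def chain_write(chain: list, chunk_id: int, data: bytes) -> None:
    for target in chain:                  # steps 1-4: head serializes, write propagates to tail
        target.pending[chunk_id] = data
    for target in reversed(chain):        # step 5: tail commits, acks flow back up the chain
        target.committed[chunk_id] = target.pending.pop(chunk_id)

chain = [Target("A1"), Target("B1"), Target("C1")]
chain_write(chain, chunk_id=7, data=b"hello")
```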

Read Process

  1. Return committed version if available.
  2. Handle pending versions with a status code, allowing retries or relaxed reads.
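Continuing the sketch above (reusing the same Target class and chain), a read served by any replica returns committed data, while a pending version surfaces as a status code so the client can retry or accept a relaxed read.

```python
# Companion read-path sketch (reusing Target and chain from the write sketch):
# committed data can be served by any replica; a pending version surfaces as a
# status code so the client can retry or accept a relaxed read.
def chain_read(target, chunk_id: int, relaxed: bool = False):
    if chunk_id in target.pending and not relaxed:
        return ("PENDING", None)               # client retries, or retries in relaxed mode
    data = target.pending.get(chunk_id, target.committed.get(chunk_id))
    return ("OK", data)

print(chain_read(chain[1], chunk_id=7))        # -> ('OK', b'hello')
```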

Failure Detection and Recovery

  • Heartbeats: Cluster manager detects failures if heartbeats are missed for a configurable interval.
  • State Transitions: Storage targets transition between public states (e.g., serving, syncing, offline) based on local states.
  • Recovery: Offline targets are moved to the end of chains; data is synced using full-chunk-replace writes.
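A minimal sketch of heartbeat-based failure detection follows; the timeout value and state names are illustrative, not the actual 3FS configuration.

```python
# Minimal sketch of heartbeat-based failure detection in the cluster manager.
# The timeout value and state names are illustrative.
import time

HEARTBEAT_TIMEOUT = 10.0   # seconds without a heartbeat before a target is marked offline

class ClusterManager:
    def __init__(self):
        self.last_heartbeat = {}    # service_id -> last heartbeat timestamp
        self.public_state = {}      # service_id -> "serving" | "syncing" | "offline"

    def on_heartbeat(self, service_id: str) -> None:
        self.last_heartbeat[service_id] = time.monotonic()
        self.public_state.setdefault(service_id, "serving")

    def sweep(self) -> None:
        now = time.monotonic()
        for service_id, seen in self.last_heartbeat.items():
            if now - seen > HEARTBEAT_TIMEOUT and self.public_state.get(service_id) != "offline":
                # In 3FS the offline target is also moved to the end of its chains
                # and an updated chain table is distributed to services and clients.
                self.public_state[service_id] = "offline"
```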

Chunk Engine

The chunk engine manages persistent storage on SSDs:

  • Data Files: Store chunk data in physical blocks (64KiB to 64MiB).
  • RocksDB: Stores chunk metadata.
  • Allocator: Uses bitmaps for efficient block allocation and reclamation.
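An illustrative bitmap allocator for fixed-size blocks, in the spirit of the description above; the on-disk layout, block-size classes, and RocksDB metadata handling are omitted.

```python
# Illustrative bitmap allocator for fixed-size physical blocks, in the spirit
# of the chunk engine's allocator. On-disk layout, block-size classes, and the
# RocksDB metadata handling are omitted; one byte per block is used for clarity.
class BitmapAllocator:
    def __init__(self, num_blocks: int):
        self.bitmap = bytearray(num_blocks)   # 0 = free, 1 = allocated

    def allocate(self) -> int:
        idx = self.bitmap.find(0)             # first free block
        if idx == -1:
            raise RuntimeError("no free blocks")
        self.bitmap[idx] = 1
        return idx

    def free(self, idx: int) -> None:
        self.bitmap[idx] = 0                  # reclaimed blocks are reused by later writes

alloc = BitmapAllocator(num_blocks=1024)
block = alloc.allocate()
alloc.free(block)
```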

Write Operation Flowchart

A flowchart for chunk write.

Check out the GitHub repo. All credit for this research goes to the researchers of this project.

Conclusion

DeepSeek 3FS is a robust distributed file system tailored for modern workloads, combining the familiarity of file system semantics with the scalability of object stores. Its use of RDMA, CRAQ, and FoundationDB ensures high performance, strong consistency, and fault tolerance. Whether you're running data analytics or machine learning pipelines, 3FS offers a flexible and efficient storage solution.

Feel free to experiment with 3FS in your projects! If you have questions or insights, drop them in the comments below.

You can find me on X
