In virtualized cloud environments, supporting the intensive data demands of AI workloads requires a robust and scalable storage solution. Parallel file systems, such as Lustre and pNFS, provide the distributed data handling needed for these environments, allowing data to scale seamlessly across multiple nodes with minimal performance degradation. By integrating xiRAID Opus with these parallel file systems, Xinnor delivers enhanced storage performance, ensuring both random and sequential workloads achieve low-latency, high-throughput access. This blog explores how Lustre and pNFS, optimized with xiRAID Opus, create a flexible and high-performing storage architecture for AI-focused cloud environments.
To address the scalability challenges posed by AI workloads in virtualized cloud environments, integrating parallel file systems like Lustre and pNFS becomes essential. These systems enable distributed data handling, ensuring that workloads can scale across numerous compute and storage nodes without a significant performance hit. By leveraging the underlying block device performance delivered by xiRAID Opus, parallel file systems further optimize both random and sequential workloads, ensuring low-latency, high-throughput access to shared storage resources.
Xinnor Lustre Solution for Cloud Environments
Lustre is a well-known parallel file system used primarily in HPC environments, but it can also be leveraged for AI workloads, thanks to its scalability and high throughput. Lustre provides high availability over shared storage, making it ideal for cloud environments where reliability and performance are paramount.
At Xinnor, we have extensive experience with Lustre, having successfully deployed it in numerous production environments. Our expertise extends into virtualized environments, where we enable the deployment of Lustre to provide high-performance storage solutions.
In these setups, both the OSS and MDS components of Lustre are tuned by us for optimal performance. The architecture is built around disaggregated storage resources, which we transform into high-performance volumes using xiRAID Opus. These volumes are then passed through to virtual machines (VMs), forming the foundation for a highly scalable and efficient storage solution suited for AI workloads.
To validate our solution, we implemented a virtualized Lustre environment and conducted performance tests to demonstrate its scalability and efficiency for AI workloads in cloud environments.
Testing Environment Details:
- CPU: 64-Core Processor per node (AMD 7702P)
- Memory: 256 GB RAM per node
- Networking: 1 x MT28908 Family [ConnectX-6] per node
- Drives: 24x KIOXIA CM6-R 3.84TB (Gen 4)
- Aggregated drive performance per node:
  - 9M IOPS (4k random read)
  - 3M IOPS (4k random write)
  - 70 GB/s (128k sequential write/read)
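For readers who want to sanity-check comparable raw drive numbers on their own hardware, a simple fio run directly against the NVMe namespaces gives a baseline. This is only a sketch: the device list, job counts and runtimes below are placeholders, not our exact test parameters.

```bash
# Sketch: approximate the aggregated raw NVMe performance of a node.
# Device list, job count and runtime are placeholders; adjust per system.
fio --name=raw-randread --filename=/dev/nvme0n1:/dev/nvme1n1 \
    --ioengine=libaio --direct=1 --rw=randread --bs=4k \
    --iodepth=64 --numjobs=8 --group_reporting --runtime=60 --time_based

# Same idea for the 128k sequential throughput figure:
fio --name=raw-seqread --filename=/dev/nvme0n1:/dev/nvme1n1 \
    --ioengine=libaio --direct=1 --rw=read --bs=128k \
    --iodepth=32 --numjobs=4 --group_reporting --runtime=60 --time_based
```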
Implementation Overview:
Host Configuration:
We deployed three virtual machines (VMs) across two hosts: two Lustre OSS and one Lustre MDS. Each VM was configured with a dedicated RAID setup:
- OSS VMs use RAID 6 (16+2).
- The MDS VM uses RAID 1+1.

Resource Allocation:
Each storage controller within the VMs is assigned a single CPU core. In total, only three CPU cores are used to manage the block storage system, maximizing efficiency without compromising performance.

VM Configuration:
Each OSS and MDS VM is assigned three virtual cores for processing. The Lustre Client VMs are deployed on an external host, with each client VM provisioned with 32 cores, ensuring sufficient computational power for handling intensive workloads.
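The exact way xiRAID Opus exposes its volumes is product-specific, but conceptually each volume is presented to a VM as a vhost-user-blk device over a shared-memory socket, similar to an SPDK vhost target. The sketch below shows what the QEMU side of such an attachment could look like; the socket path, memory sizing and queue count are assumptions for illustration, not our actual deployment parameters.

```bash
# Sketch: attaching a vhost-user-blk volume to an OSS VM with QEMU.
# The socket path is whatever the vhost target (here, xiRAID Opus) exposes;
# vhost-user devices require shareable guest memory (hugepages or memfd).
qemu-system-x86_64 \
  -machine q35,accel=kvm -cpu host -smp 3 -m 16G \
  -object memory-backend-file,id=mem0,size=16G,mem-path=/dev/hugepages,share=on \
  -numa node,memdev=mem0 \
  -chardev socket,id=oss0_blk,path=/var/tmp/vhost_oss0.sock \
  -device vhost-user-blk-pci,chardev=oss0_blk,num-queues=4
  # ...plus the usual boot disk, network device and so on.
```

Inside the guest, the volume appears as an ordinary virtio block device, which the Lustre OSS and MDS formatting tools can then target directly.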
Lustre Solution Performance
When testing sequential workloads (1M block size, 32 jobs), we achieved the following performance with xiRAID Opus: 44 GB/s for reads and 43 GB/s for writes.
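As a rough illustration, a streaming profile of this kind can be generated with a fio job along the following lines; the mount point, file sizes and runtime are placeholders rather than our exact job file.

```bash
# Illustrative sequential-read job against a Lustre client mount
# (1M blocks, 32 jobs); use --rw=write to exercise the write path.
fio --name=lustre-seq --directory=/mnt/lustre \
    --ioengine=libaio --direct=1 --rw=read --bs=1M \
    --iodepth=16 --numjobs=32 --size=16G \
    --group_reporting --runtime=120 --time_based
```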
In addition to sequential workloads, we also tested random workloads, where xiRAID Opus scaled significantly better at higher I/O depths than the same Lustre setup without it. The comparison used MDRAID (RAID 0) as the baseline against xiRAID Opus (RAID 6), and it shows a substantial boost in both read and write performance when xiRAID Opus is incorporated into the solution. Lustre with xiRAID Opus achieves remarkable performance growth as the I/O depth increases, which can be attributed to the efficiency of the multithreaded vhost-user-blk architecture: it distributes I/O tasks more effectively, leading to substantial improvements in throughput.
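The depth scaling itself is easy to reproduce with a small sweep, for example along these lines (paths, depths and sizes are illustrative):

```bash
# Illustrative sweep of 4k random-read performance over increasing I/O depth
# on a Lustre client mount; values are placeholders, not our test parameters.
for depth in 1 4 16 64 128; do
  fio --name=randread-qd${depth} --directory=/mnt/lustre \
      --ioengine=libaio --direct=1 --rw=randread --bs=4k \
      --iodepth=${depth} --numjobs=32 --size=4G \
      --group_reporting --runtime=60 --time_based
done
```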
However, one of the primary limitations in maximizing streaming throughput lies in the network interface capacity, which often acts as a bottleneck. Despite this constraint, xiRAID Opus ensures high performance by maximizing network utilization, effectively mitigating the impact of network limitations.
Moreover, while Lustre has traditionally been considered unsuitable for small block I/O operations, recent advancements have significantly enhanced its capabilities. With improved asynchronous I/O support and the integration of high-performance interfaces, low-latency devices can now be passed directly into the MDS. This innovation, in combination with xiRAID Opus, delivers strong small block I/O performance, addressing a critical pain point for AI and cloud workloads that demand efficient data handling at scale.
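As an indication of what this kind of client-side tuning involves, the snippet below lists a few generic Lustre parameters that are commonly raised for deep asynchronous I/O. The values are illustrative only and are not our production settings.

```bash
# Generic Lustre client tunables often adjusted for deep asynchronous I/O;
# the values shown are illustrative, not recommendations.
lctl set_param 'osc.*.max_rpcs_in_flight=32'     # more concurrent RPCs per OST
lctl set_param 'osc.*.max_dirty_mb=512'          # larger per-OST dirty cache
lctl set_param 'llite.*.max_read_ahead_mb=1024'  # more aggressive readahead
```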
Reducing the Complexity of Lustre Administration with VirtioFS
When managing file systems in virtualized environments, one of the key challenges is reducing administrative complexity while maintaining performance. To address this, we implemented VirtioFS, a solution for sharing file systems directly between hosts and VMs. VirtioFS eliminates the need for installing client software within the VMs by sharing a mounted file system from the host. This simplification makes it an ideal solution for cloud service providers, reducing administrative burden without sacrificing performance.
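A minimal sketch of how a Lustre file system already mounted on the host could be handed to a guest over VirtioFS is shown below. The socket path, tag, cache mode and sizes are assumptions for illustration, and the exact virtiofsd flags vary between versions.

```bash
# Host: export the host-mounted Lustre file system over virtiofs.
/usr/libexec/virtiofsd --socket-path=/var/run/vm001-fs.sock \
    --shared-dir /mnt/lustre --cache auto --thread-pool-size 64 &

# Host: QEMU options for the corresponding VM (vhost-user-fs requires
# shareable guest memory); all other VM options are omitted here.
qemu-system-x86_64 \
  -object memory-backend-memfd,id=mem,size=8G,share=on -numa node,memdev=mem \
  -chardev socket,id=char0,path=/var/run/vm001-fs.sock \
  -device vhost-user-fs-pci,chardev=char0,tag=lustrefs

# Guest: mount the shared file system; no Lustre client software is needed.
mount -t virtiofs lustrefs /mnt/lustre
```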
Xinnor-tuned VirtioFS: Performance Results
To fully optimize file system performance in virtualized environments, we’ve applied tuning to VirtioFS. This tuning allows VirtioFS to deliver performance on par with native Lustre clients, even in heavily virtualized environments. The performance improvements are especially significant in high-throughput workloads.
Sequential operations results:
These results show that with the right optimizations, VirtioFS can match the performance of native Lustre clients in sequential workloads while still providing the simplicity of a virtualized file system environment. However, in random operations VirtioFS is not able to demonstrate the same level of scalability as the native Lustre client.
Xinnor Lustre Solution Outcomes
The Xinnor Lustre solution demonstrates powerful performance capabilities, even with a virtualized setup. By pairing xiRAID Opus with virtualized Lustre OSS and MDS components, our solution is capable of handling both sequential and random I/O operations with minimal overhead. Key outcomes:
Performance:
- With only two virtualized OSS, Lustre delivers impressive sequential and random I/O performance.
- Critical to this performance is the high-performance block device provided by xiRAID Opus, which is passed directly to the OSS and MDS virtual machines.

Skill Requirements:
- While Lustre configuration requires advanced expertise to set up the system and client VMs, VirtioFS offers a simplified alternative for workloads with primarily sequential patterns, reducing complexity without sacrificing throughput.

Solution for Cloud Environments:
- Xinnor can deliver this high-performance Lustre solution for cloud-based environments, tailored to AI workloads as well as HPC.
While Lustre has a legacy in HPC environments, it’s also highly effective for AI-centric workloads. However, Lustre can be complex to administer, particularly in cloud environments, where configurations like LNET and client setups add layers of complexity. Additionally, Lustre supports a limited number of operating systems, making expert configuration essential.
Our Vision for the Future: pNFS Block Layout
pNFS (Parallel NFS) Block Layout is a part of the pNFS extension in NFSv4.1, designed to enable parallel access to storage devices, improving scalability and performance.
The block layout specifically focuses on enabling clients to access storage blocks directly, bypassing the NFS server for data transfers. This layout is ideal for environments where block storage devices (like SANs) are used, providing high-performance parallel access to large datasets.
This approach allows VMs to directly interact with xiRAID Opus block volumes, while a pNFS MDS server manages scalability. This flexible design minimizes the complexity of shared storage setups in cloud environments, ensuring both scalability and high performance.
Key Features of pNFS Block Layout:
- Direct Data Access: Clients can bypass the NFS server and read/write directly to storage volumes using block-level protocols (e.g., iSCSI, Fibre Channel), reducing bottlenecks.
- Separate Data and Metadata Paths: The NFS server manages metadata, but the data itself flows directly between clients and storage, streamlining performance.
- Parallel Access: pNFS allows multiple clients to read/write to different sections of a file simultaneously, improving throughput for large datasets.
- Scalability: By offloading data transfers to the storage devices themselves, pNFS supports high-scale operations, making it a perfect fit for cloud environments handling AI workloads or massive data sets.
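On Linux, the client-side pieces for this already ship with the kernel and nfs-utils, and the kernel NFS server can hand out block/SCSI layouts for suitable exports. The sketch below shows roughly what an export and a client mount could look like; device names, paths and export options are placeholders, and a production-grade MDS setup involves considerably more than this (see the challenges noted later).

```bash
# Server (MDS) side, sketch: export an XFS file system that lives on a block
# device the clients can also reach (for example over NVMe-oF or iSCSI).
# The 'pnfs' export option lets knfsd hand out layouts to capable clients.
mkfs.xfs /dev/shared_vol0
mount /dev/shared_vol0 /export/data
echo '/export/data *(rw,no_subtree_check,sync,pnfs)' >> /etc/exports
exportfs -ra

# Client side, sketch: blkmapd (nfs-utils) resolves block layouts, and the
# mount must be NFSv4.1 or newer for layouts to be used at all.
systemctl start nfs-blkmap
mount -t nfs -o vers=4.1 mds.example.com:/export/data /mnt/data
```

Once a layout is granted, reads and writes go straight to the shared block device, while only metadata operations travel through the NFS server.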
pNFS Architecture in Cloud Environments
The beauty of pNFS is its simplicity, offering high-performance shared storage while requiring minimal system resources. It needs neither third-party client software nor direct passthrough of a high-performance network interface to the VMs, making it highly versatile.
- Shared Storage Support: pNFS can efficiently manage high-performance storage with low CPU overhead.
- No Third-Party Software: Data volumes can be shared across compute nodes without needing additional software, simplifying the overall architecture.

What makes this architecture especially appealing is that it leverages the same hardware we used in our Lustre testing earlier, showing just how adaptable and powerful pNFS can be.
pNFS Performance Results
Sequential operations (1M, 32 jobs):
- Sequential Read:
  - Without xiRAID Opus: 34.8 GB/s
  - With xiRAID Opus: 47 GB/s
- Sequential Write:
  - Without xiRAID Opus: 32.7 GB/s
  - With xiRAID Opus: 46 GB/s
By integrating xiRAID Opus, we further optimized the performance of the pNFS block layout. Comparing pNFS with and without xiRAID Opus clearly demonstrates its value in high-performance environments: the baseline configuration used MDRAID (RAID 0), while the xiRAID Opus configuration used RAID 6, yet xiRAID Opus still delivered a significant boost in both read and write performance.
pNFS vs Lustre: Accelerated by Xinnor Solutions
When comparing the pNFS block layout to Lustre, our solutions provide significant acceleration in both setups. Both Lustre and pNFS, when paired with xiRAID Opus, are capable of delivering strong, near-equal performance in high-throughput environments:
Sequential Performance Comparison (1M, 32 jobs):
- Sequential Read:
  - Lustre: 44 GB/s
  - pNFS: 47 GB/s
- Sequential Write:
  - Lustre: 43 GB/s
  - pNFS: 46 GB/s
These results demonstrate that both Lustre and pNFS, when optimized by xiRAID Opus, are powerful solutions, capable of delivering outstanding performance in high-performance cloud environments.
pNFS in Cloud Environments: Conclusions
We believe pNFS represents the future of scalable, high-performance storage in cloud environments. With proper configuration, pNFS block layout can achieve tens or even hundreds of gigabytes per second in throughput, with minimal resource consumption.
Key Benefits:
- Scalability: Supports large-scale environments, offering massive throughput potential with low system overhead.
- High Performance: Delivers exceptional performance for both sequential and random small block operations, with minimal latency due to direct interaction with storage devices.
- No Third-Party Client Software: Simplifies setup and management by removing the need for additional software on client machines.
Challenges:
While pNFS is highly promising, the current open-source pNFS MDS implementation is not production-ready, so it is suitable for proofs of concept (POCs) but not yet for full production environments.
Conclusions
Xinnor offers two robust solutions tailored for AI workloads in cloud environments: xiRAID Opus and the Xinnor Lustre Solution. These high-performance tools are engineered to handle the demanding nature of AI applications. Our comparison of Lustre and pNFS, accelerated by xiRAID Opus, demonstrates that both parallel file systems provide exceptional scalability and performance for AI workloads in virtualized cloud settings. Lustre offers high throughput and reliability, making it suitable for complex cloud environments. On the other hand, pNFS presents a simpler, versatile alternative that minimizes setup complexity without sacrificing performance. While each solution has unique strengths, xiRAID Opus consistently enhances both, supporting fast, efficient data access across multiple cloud-based nodes. Together, these parallel file systems and xiRAID Opus form a powerful foundation for AI workloads.
You can read the original blogpost here