Why did we develop HPFS?

Why did we develop HPFS? When training with ResNet-50, we use a large number of images, reaching a scale of hundreds of millions. We also need to train the Stable Diffusion model, which requires a staggering amount of data, ranging from tens of billions to hundreds of billions of datasets. We have tried using Lustre, Ceph, GlusterFS, and IBM's GPFS, but all have failed because, at this scale, the performance of all file systems degrades severely, to the point of being unusable. Moreover, our concurrency is very high, with up to hundreds of threads running simultaneously. This is determined by the nature of model training, where a batch of data is simultaneously given to several training models to run concurrently. You can think of it as multiple clients running highly concurrently. In this context, we developed HPFS to meet our high-load training tasks. We do not use the FUSE version of HPFS because of performance limitations; instead, we modified the training model's interface to call the HPFS API for training. In summary, the performance difference between the API interface and the FUSE client is significant.

Today, I will share the performance of these open-source file systems in our use. CephFS with multi-MDS does not scale linearly and requires enough memory to maintain performance, which is quite awkward. Lustre is limited by the central node's performance bottleneck. GlusterFS is limited by the performance of the local file system it relies on, such as EXT4 or XFS, which degrades significantly with many files. GPFS's performance is stable from start to finish but very slow. In the future, we may open source HPFS to help solve the storage problems of massive datasets.

Our performance in use is approximately as follows:

Eight clients, each running 128 threads. HPFS-SRV uses three machines, each running 16 HPFS-SRV instances. The storage data is a cluster of six machines with 8 NVMe each.
Test case: open, write 4096 bytes, close, then open, read 4096 bytes, close.