
Chaim Rand for AWS Community Builders

Posted on • Originally published at chaimrand.Medium

Streaming Data from Cloud Storage with Mountpoint for Amazon S3

A First Look at a New Solution for Mounting Cloud Based Data


These days, AI has become pretty much synonymous with collecting and maintaining large amounts of data. This data is typically stored in a central location and accessed at multiple different phases of the AI application development. An important factor in designing a data storage solution is the speed and efficiency at which the data can be accessed, as this can have a meaningful impact on the speed and cost of development. In our AI development team, we use cloud object storage services such as Amazon S3 to store enormous amounts of data. Consequently, we are obsessed with finding the fastest (and cheapest) ways of consuming cloud-based data for a variety of different scenarios.

In previous posts (e.g., here, here, and here), we described a number of different tools and techniques for pulling data from the cloud and demonstrated their application to various use cases. It is only natural, then, that with the introduction of a new option for accessing cloud-based data, we would eagerly set out to explore its capabilities.

In this post, we will describe our first impressions of Mountpoint for Amazon S3 - a new open-source solution for interfacing with cloud storage - and assess its performance on two use cases that are of particular interest to us: streaming sequential blocks of relatively large data files (as detailed here) and consuming a large number of relatively small data files (as detailed here). For the sake of brevity, we will refer to Mountpoint for Amazon S3 simply as Mountpoint.

Importantly, keep in mind that, as of the time of this writing, Mountpoint remains under active development. You are strongly advised to stay up to date with the latest release of this tool (and all alternative tools) in order to make the most informed design decisions for your AI projects.

Although it can support other endpoints, Mountpoint prioritizes performance against Amazon S3. As such, the examples below will be run using Amazon's cloud services. However, our choice of cloud service provider - or the mention of any other tools, frameworks, or APIs - should not be viewed as an endorsement over their alternatives. Furthermore, please do not view this post as a replacement for the existing official documentation (e.g., here and here).

Yet Another FUSE Based Object Storage Access Solution

While there are many solutions for reading from and writing to file objects in the cloud, we can broadly divide them into two categories:

  1. Explicit Data Transfer - Solutions that involve explicitly downloading data from the cloud for reading and uploading data for writing.
  2. File-System Abstraction - Solutions that abstract cloud storage interactions behind a file-system-style interface, allowing seamless access to cloud-hosted files.

The second approach is often implemented using the Filesystem in Userspace (FUSE) interface, enabling cloud-based data buckets to be mounted as local directories. This allows existing applications to interact with cloud storage just as they would with a traditional file system - requiring little to no modification.
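To illustrate the appeal of the second category: once a bucket is mounted, ordinary file I/O is all an application needs. The sketch below uses a local temporary directory as a stand-in for a mounted bucket (e.g., one created by `mount-s3 my-bucket /mnt/my-bucket`); the file name and contents are hypothetical.

```python
import os
import tempfile

# Stand-in for a mount directory; in practice this would be the
# directory a FUSE client mounted the bucket at.
mount_dir = tempfile.mkdtemp()

# Simulate an object that already exists in the "bucket".
with open(os.path.join(mount_dir, 'sample.bin'), 'wb') as f:
    f.write(b'\x00' * 1024)

# The application reads it with plain file I/O - no cloud SDK calls
# and no code changes relative to reading a local file.
with open(os.path.join(mount_dir, 'sample.bin'), 'rb') as f:
    data = f.read()

print(len(data))  # 1024
```

With a real FUSE mount, only the value of `mount_dir` would change.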

Mountpoint is a new FUSE-based solution written in the Rust programming language and based on a Rust version of the Linux FUSE library. See here for an explanation of the choice of Rust. Other popular tools in the FUSE-based family of solutions are goofys and s3fs.

Using Mountpoint

To install Mountpoint, please follow the guidelines in the official documentation. The usage instructions of Mountpoint can be retrieved by running mount-s3 with the --help flag. The text block below includes the first few lines of the output as well as a few of the many options that allow us to tune the behavior of the client.

Mountpoint for Amazon S3

Usage: mount-s3 [OPTIONS] <BUCKET_NAME> <DIRECTORY>

Arguments:
  <BUCKET_NAME>
          Name of bucket to mount

  <DIRECTORY>
          Directory or FUSE file descriptor to mount the bucket at.

Mount options:
      --read-only
          Mount file system in read-only mode

Client options:
      --maximum-throughput-gbps <N>
          Maximum throughput in Gbps [default: auto-detected on EC2
          instances, 10 Gbps elsewhere]

      --max-threads <N>
          Maximum number of FUSE daemon threads

          [default: 16]

      --part-size <SIZE>
          Part size for multi-part GET and PUT in bytes

          [default: 8388608]

Caching options:
      --cache <DIRECTORY>
          Enable caching of object content to the given directory and set
          metadata TTL to 60 seconds

Mountpoint assumes appropriate configuration of your AWS credentials. Make sure to be aware of the current documented limitations of Mountpoint, as well as any special configuration that might be required.
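Mountpoint resolves credentials through the standard AWS mechanisms (environment variables, the shared credentials file, or an attached IAM role). As a rough sketch - the values below are placeholders - two common setups on a development machine are:

```shell
# Option 1: environment variables (values are placeholders)
export AWS_ACCESS_KEY_ID=<your_access_key_id>
export AWS_SECRET_ACCESS_KEY=<your_secret_access_key>
export AWS_REGION=<bucket_region>

# Option 2: a named profile from ~/.aws/credentials
mount-s3 --profile <profile_name> --read-only <s3_bucket_name> <local_path>
```

On EC2, an instance profile with S3 read permissions avoids managing keys altogether.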

In the next sections we will demonstrate the use of Mountpoint for Amazon S3 for two different use cases and compare its performance with goofys. The experiments we will describe were conducted on an Amazon EC2 c5.4xlarge instance (with 16 vCPUs). For the sake of simplicity, we chose an Ubuntu (22.04) AWS Deep Learning AMI, preinstalled with Python (3.11) and PyTorch (2.5.1). To install mount-s3 and goofys we ran the following commands:

# install goofys
sudo curl -Lo \
  /usr/local/bin/goofys \
  https://github.com/kahing/goofys/releases/latest/download/goofys
sudo chmod +x /usr/local/bin/goofys

# install mount-s3
wget \
  https://s3.amazonaws.com/mountpoint-s3-release/latest/x86_64/mount-s3.deb
sudo dpkg -i mount-s3.deb
sudo apt-get install -f -y

We used the following command lines for mounting and un-mounting:

# Mountpoint
mount-s3 --read-only <s3_bucket_name> <local_path>

# goofys
goofys -o ro <s3_bucket_name> <local_path>

# unmount
fusermount -z -u <local_path>

Keep in mind that the comparative performance results we will share are very much dependent on the details of the environment in which they were run. Furthermore, it is quite likely that with appropriate tuning of the command line controls we could have improved the performance of both the Mountpoint and goofys trials. We strongly encourage you to conduct your own experiments before drawing conclusions for your own project.
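As an example of such tuning, the client options shown in the help text above can be set at mount time. The values below are purely illustrative, not recommendations:

```shell
# Larger part size and more FUSE daemon threads (illustrative values)
mount-s3 --read-only \
  --part-size 16777216 \
  --max-threads 32 \
  <s3_bucket_name> <local_path>
```

The right values will depend on your instance's network bandwidth, CPU count, and access pattern.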

An Important Note About the Costs of Pulling Data from Amazon S3

Amazon S3 pricing consists of several components, one of which is based on the number of API calls (e.g., GET, SELECT, PUT, etc.). When using FUSE-based solutions such as Mountpoint or goofys, these API calls are abstracted away from the user, making it more difficult to directly assess the cost of reading data from Amazon S3 compared to explicitly pulling the data. Additionally, the number of API calls - and their associated costs - can be affected by the choice of command-line options.

A comparative cost analysis of different methods for streaming data from Amazon S3 is beyond the scope of this post, but conducting such an analysis is highly recommended before selecting the best approach for your needs.
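As a rough illustration of how a client setting feeds into request counts: sequentially reading a 2 GiB object with the default 8 MiB part size requires on the order of 256 ranged GET requests. The sketch below computes this; the per-request price is a placeholder that should be replaced with your region's current S3 pricing.

```python
import math

GiB = 1024 ** 3
MiB = 1024 ** 2

object_size = 2 * GiB   # size of the streamed file
part_size = 8 * MiB     # mount-s3 default --part-size

# Number of ranged GET requests needed to read the whole object once
num_gets = math.ceil(object_size / part_size)
print(num_gets)  # 256

# Hypothetical per-1,000-requests GET price - check current S3 pricing
price_per_1000_gets = 0.0004
print(f'cost: ${num_gets / 1000 * price_per_1000_gets:.6f}')
```

A smaller part size would multiply the request count (and the request cost) accordingly.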

Streaming Large Data Files

In our first experiment, we evaluated the performance of traversing a 2 GB binary file stored in the cloud. This file was assumed to contain 2,048 blocks of data (e.g., frames or data samples), each 1 MB in size.

The code block below demonstrates routines for sequentially reading through the file and for sampling data at non-sequential file offsets. Please see our previous posts for more details on how we designed the experiment and how we chose the metrics for comparison.

import time

KB = 1024
MB = KB * KB

def read_sequential(f, t0):
    # time the first 1 MB read separately, then average the rest
    t1 = time.time()
    x = f.read(MB)
    print(f'time of first sample: {time.time() - t1}')
    print(f'total to first sample: {time.time() - t0}')
    t1 = time.time()
    count = 0
    while True:
        x = f.read(MB)
        if len(x) == 0:
            break
        count += 1
    print(f'time of avg read: {(time.time() - t1)/count}')

def fast_forward(f):
    # read 1 MB blocks at ten non-sequential 100 MB offsets
    total = 10
    t1 = time.time()
    for i in range(total):
        f.seek(i * 100 * MB)
        x = f.read(MB)
    print(f'time of avg random read: {(time.time() - t1)/total}')

key = '<s3 key>'
mount_dir = '<local mount>'
sequential = True  # toggle flag to run fast_forward

t0 = time.time()
with open(f'{mount_dir}/{key}', 'rb') as f:
    if sequential:
        read_sequential(f, t0)
        print(f'total time: {time.time()-t0}')
    else:
        fast_forward(f)

In the table below we compare the results we received.

Comparative Results of Pulling 2 GB File from S3 (by Author)

Although Mountpoint is slightly slower than goofys in loading the first frame, it outperforms goofys in all other metrics. For a comparison with other methods for streaming large files from the cloud, please refer to our previous post.

Consuming a Large Number of Small Files

In our second experiment we assessed the speed of feeding hundreds of thousands of individual cloud-based data samples into a deep learning training environment. The code block below demonstrates the creation of a custom PyTorch Dataset for loading training samples from the local mount. We measured the speed of traversing thousands of image-label pairs, where each file was 1 MB in size. Please see this previous post for more details on how we designed the experiment.

import os
import time
from statistics import mean, variance

import torch
from torch.utils.data import Dataset

class SingleSampleDataset(Dataset):
    def __init__(self):
        super().__init__()
        self.base_path = '<local_mount>'

    def __len__(self):
        return 10000

    def get_from_files(self, image_path, label_path):
        # plain file reads against the local mount
        with open(image_path, 'rb') as image_file:
            image = image_file.read()
        with open(label_path, 'rb') as label_file:
            label = label_file.read()
        return {"image": image, "label": label}

    def __getitem__(self, index: int):
        image_path = os.path.join(self.base_path, f'{index}.image')
        label_path = os.path.join(self.base_path, f'{index}.label')
        return self.get_from_files(image_path, label_path)

def get_dataset():
    return SingleSampleDataset()

dataset = get_dataset()
dl = torch.utils.data.DataLoader(dataset, batch_size=4, num_workers=16)
stats_lst = []
t0 = time.perf_counter()
for batch_idx, batch in enumerate(dl, start=1):
    t = time.perf_counter() - t0
    print(f'Iteration {batch_idx} Time {t}')
    stats_lst.append(t)
    t0 = time.perf_counter()
mean_calc = mean(stats_lst[1:])
var_calc = variance(stats_lst[1:])
print(f'mean {mean_calc} variance {var_calc}')

When downloading large numbers of small files, goofys outperformed Mountpoint, averaging 0.08 seconds per sample compared to Mountpoint's 0.11 seconds. We surmise that the overhead observed at the start of a file in the previous experiment has a more significant impact when dealing with many small files. For results of other methods for consuming large numbers of small files from the cloud, refer to our previous post.
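One knob that may help with the many-small-files pattern is the caching option from the help text above, which keeps object content on local disk and relaxes the metadata TTL. Whether it helps will depend on your access pattern (repeated reads benefit most), and the cache directory path below is just an example:

```shell
# Cache object content locally (directory path is illustrative)
mkdir -p /tmp/mount-s3-cache
mount-s3 --read-only --cache /tmp/mount-s3-cache \
  <s3_bucket_name> <local_path>
```

We did not benchmark the cache in these experiments, so treat this as a direction for further tuning rather than a measured result.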

Summary

It's great to see a new actor in the cloud data streaming space, especially one explicitly intent on addressing the challenges faced by modern data applications. Some key highlights we found:

  • Performance Tuning - Mountpoint includes many controls that allow fine-tuning for improved performance.
  • Large File Streaming - When traversing large files, Mountpoint outperformed the solution we compared it to.
  • Ongoing Enhancements - Unlike other FUSE-based solutions, Mountpoint is under active development and expected to introduce further improvements.

One area for improvement we identified is mass-downloading many small files, where Mountpoint underperformed compared to its goofys counterpart.

As Mountpoint for Amazon S3 continues to evolve, we look forward to seeing it extend and enhance its capabilities.
