Frastyawan Nym

Boost Your Golang Skills: Writing Large Data Files for Performance

Go (often called Golang) is known for its simplicity and performance. If you are new to the language, you may be wondering how to write large amounts of data to files efficiently. This beginner-friendly guide explores three approaches: Sequential, Parallel, and Parallel Chunk. We'll break down the concepts, walk through code samples, and offer practical advice to help you get started. Let's embark on this coding journey! 📝

Introduction 🎉

Before we dive into these methods, let's start with some basics. File handling in Golang is essential for tasks like saving user data, generating reports, or storing logs. Imagine it as managing a virtual filing cabinet where you can add, retrieve, and organize information.

Basic Concepts 🧠

Go's goroutines (lightweight threads managed by the Go runtime) let you run tasks concurrently, which can help when you need to produce and write a lot of data. Why does efficient file writing matter? Writing is often one of the slowest steps in a program, so the approach you choose can noticeably affect how your application performs.
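
To make the goroutine idea concrete, here is a minimal sketch that is separate from the article's repository. The helper name buildLines is made up for illustration. Each goroutine fills its own slot of a slice, and a sync.WaitGroup waits until all of them have finished.

```go
package main

import (
	"fmt"
	"sync"
)

// buildLines is a hypothetical helper: it formats n lines concurrently,
// one goroutine per line, and returns them in index order.
func buildLines(n int) []string {
	lines := make([]string, n)
	var wg sync.WaitGroup

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			// Each goroutine writes only to its own slot, so no lock is needed.
			lines[i] = fmt.Sprintf("This is a line of data %d.", i)
		}(i)
	}

	wg.Wait() // Block until every goroutine has called Done.
	return lines
}

func main() {
	for _, line := range buildLines(3) {
		fmt.Println(line)
	}
}
```

The goroutines may run in any order, but because each one owns a fixed index, the returned slice is always in line order.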

Now, let's explore the three methods with simplified explanations, easy-to-follow code, and practical examples.

Sequential File Writer 📜

What is Sequential Writing?
Sequential file writing is like writing a story from start to finish, one sentence at a time. It's straightforward and easy to understand.

When to Use it?

  • Use sequential writing when you're just starting, and simplicity is crucial.
  • Ideal for smaller datasets that don't require super-fast write speeds.
  • Great for scenarios where data order is essential.

Pros:

  • Simplicity: It's easy to implement, perfect for beginners.
  • Data Integrity: Maintains the order of data, making it reliable.
  • Low Resource Usage: Doesn't require much memory.

Cons:

  • Slower for large datasets.
  • Not the best choice for blazing-fast writes.

Code Sample and Explanation:
filewriter/sequential/sequential.go

package sequential

import (
    "fmt"

    "github.com/frasnym/go-file-writer-example/filewriter"
)

// sequentialFileWriter writes lines to a file one after another.
type sequentialFileWriter struct {
    fileWriter filewriter.FileWriter
}

// NewSequentialFileWriter creates a new instance of SequentialFileWriter.
func NewSequentialFileWriter(fileWriter filewriter.FileWriter) filewriter.Writer {
    return &sequentialFileWriter{
        fileWriter: fileWriter,
    }
}

// Write writes the specified number of lines to the file sequentially.
func (w *sequentialFileWriter) Write(totalLines int, filename string) error {
    // Create the output file
    file, err := w.fileWriter.CreateFile(filename)
    if err != nil {
        return err
    }
    defer w.fileWriter.FileClose(file)

    writer := w.fileWriter.NewBufferedWriter(file)
    for i := 0; i < totalLines; i++ {
        data := fmt.Sprintf("This is a line of data %d.\n", i)
        _, err := w.fileWriter.BufferedWriteString(writer, data)
        if err != nil {
            return err
        }
    }

    // Flush the buffer to ensure all data is written to the file.
    err = w.fileWriter.BufferedFlush(writer)
    if err != nil {
        return err
    }

    return nil
}

In this code sample, we've created a sequential file writer that writes lines of data sequentially to a file. Here's what's happening step by step:

  • We create a new instance of sequentialFileWriter using NewSequentialFileWriter.
  • The Write method takes the total number of lines to write and the filename.
  • We create the output file using w.fileWriter.CreateFile(filename) and defer its closure.
  • To optimize writing, we use a buffered writer created with w.fileWriter.NewBufferedWriter.
  • Inside a loop, we generate data for each line and write it using w.fileWriter.BufferedWriteString.
  • After writing all the lines, we ensure all data is flushed to the file with w.fileWriter.BufferedFlush.

This code demonstrates how to write data sequentially to a file while maintaining data integrity.

Parallel File Writer 🚀

What is Parallel Writing?
Parallel writing is like having multiple authors working on different parts of a book at the same time. It speeds up the process significantly.

When to Use it?

  • Utilize parallel writing when dealing with large datasets that need to be written quickly.
  • Ideal for scenarios where write speed is a top priority.
  • When your system can efficiently manage concurrent writes.

Pros:

  • Speed: Can improve write throughput for large datasets, especially when generating the data is CPU-intensive.
  • Efficiency: Spreads the work across multiple CPU cores.
  • Scalability: Scales with the number of cores or goroutines until the disk becomes the bottleneck.

Cons:

  • A bit more complex to implement than sequential writing.
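  • Data order is not guaranteed: workers flush their buffers independently, so output from different goroutines interleaves in the file, and a line can even be split when a buffer fills mid-line (assuming the buffered writer behaves like bufio.Writer).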

Code Sample and Explanation:
filewriter/parallel/parallel.go

package parallel

import (
    "fmt"
    "os"
    "runtime"
    "sync"

    "github.com/frasnym/go-file-writer-example/filewriter"
)

type parallelFileWriter struct {
    fileWriter    filewriter.FileWriter
    maxGoRoutines int
}

func NewParallelFileWriter(fileWriter filewriter.FileWriter) filewriter.Writer {
    // Use the current GOMAXPROCS setting (by default the number of logical CPUs)
    maxGoRoutines := runtime.GOMAXPROCS(0)

    return &parallelFileWriter{
        fileWriter:    fileWriter,
        maxGoRoutines: maxGoRoutines,
    }
}

func (w *parallelFileWriter) Write(totalLines int, filename string) error {
    // Create the output file
    file, err := w.fileWriter.CreateFile(filename)
    if err != nil {
        return err
    }
    defer w.fileWriter.FileClose(file)

    // Calculate the number of lines to be written by each worker
    linesPerTask := totalLines / w.maxGoRoutines

    var wg sync.WaitGroup
    // Each worker sends at most one error, so this buffer never blocks a worker.
    errCh := make(chan error, w.maxGoRoutines)

    for i := 0; i < w.maxGoRoutines; i++ {
        wg.Add(1)
        go w.worker(i, file, &wg, linesPerTask, totalLines, errCh)
    }

    // Wait for every worker before returning, so the deferred FileClose
    // never closes the file while a worker is still writing to it.
    wg.Wait()
    close(errCh)

    // Collect and handle errors
    for err := range errCh {
        if err != nil {
            return err
        }
    }

    return nil
}

func (w *parallelFileWriter) worker(id int, file *os.File, wg *sync.WaitGroup, linesPerTask, totalLines int, errCh chan error) {
    defer wg.Done()
    startLine := id * linesPerTask
    endLine := startLine + linesPerTask
    if id == w.maxGoRoutines-1 {
        // The last worker also writes the remainder of the integer division.
        endLine = totalLines
    }

    writer := w.fileWriter.NewBufferedWriter(file)
    for i := startLine; i < endLine; i++ {
        data := fmt.Sprintf("This is a line of data %d.\n", i)

        _, err := w.fileWriter.BufferedWriteString(writer, data)
        if err != nil {
            errCh <- err
            return
        }
    }

    // Flush the buffer and report a failed flush instead of ignoring it.
    if err := w.fileWriter.BufferedFlush(writer); err != nil {
        errCh <- err
    }
}

In this code sample, we've created a parallel file writer that takes advantage of multiple goroutines to write lines of data concurrently to a file. Here's what's happening step by step:

  • We create a new instance of parallelFileWriter using NewParallelFileWriter. It reads runtime.GOMAXPROCS(0), which defaults to the number of logical CPUs, to decide how many goroutines to use.
  • The Write method takes the total number of lines to write and the filename.
  • We create the output file using w.fileWriter.CreateFile(filename) and defer its closure.
  • To split the work, we divide the total number of lines by the number of goroutines to get each worker's share.
  • We use a sync.WaitGroup to wait for all workers to complete their tasks and an error channel to collect and handle errors.
  • Inside a loop, we spawn multiple goroutines, each assigned to a worker function w.worker. These workers write their respective lines concurrently.
  • The worker function calculates the range of lines it's responsible for and writes data to the file using a buffered writer.
  • Once all workers are done, we close the error channel and collect and handle any errors.

This code demonstrates how to use goroutines to write data concurrently to a file. How much faster it is than sequential writing depends on your hardware and data, so measure before relying on it.

Parallel Chunk File Writer 🧩

What is Parallel Chunk Writing?
Think of it as several authors each writing one chapter of a book in a separate notebook: the chapters are written in parallel, and the order within each chapter is preserved.

When to Use it?

  • Choose parallel chunk writing for large datasets when both speed and data order are crucial.
  • Especially suitable when dealing with structured data that can be divided into chunks.
  • When your system can efficiently manage concurrent writes.

Pros:

  • Speed: Offers a significant speed boost for large datasets.
  • Data Order: Maintains data order within each chunk.
  • Scalability: Scales well with the number of cores or goroutines.

Cons:

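  • A bit more complex to implement than sequential writing.
  • Produces one file per chunk (named after the filename and the chunk's start line in the sample below) rather than a single output file.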

Code Sample and Explanation:
filewriter/parallelchunk/parallelchunk.go

package parallelchunk

import (
    "fmt"
    "runtime"
    "sync"

    "github.com/frasnym/go-file-writer-example/filewriter"
)

type parallelChunkFileWriter struct {
    fileWriter    filewriter.FileWriter
    maxGoRoutines int
}

func NewParallelChunkFileWriter(fileWriter filewriter.FileWriter) filewriter.Writer {
    // Use the current GOMAXPROCS setting (by default the number of logical CPUs)
    maxGoRoutines := runtime.GOMAXPROCS(0)

    return &parallelChunkFileWriter{
        fileWriter:    fileWriter,
        maxGoRoutines: maxGoRoutines,
    }
}

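func (w *parallelChunkFileWriter) Write(totalLines int, filename string) error {
    chunkSize := totalLines / w.maxGoRoutines
    var wg sync.WaitGroup
    // Each worker sends at most one error, so this buffer never blocks a worker.
    errCh := make(chan error, w.maxGoRoutines)

    for i := 0; i < w.maxGoRoutines; i++ {
        startLine := i * chunkSize
        endLine := startLine + chunkSize
        if i == w.maxGoRoutines-1 {
            // The last chunk also takes the remainder of the integer division.
            endLine = totalLines
        }

        wg.Add(1)
        go func() {
            // Deferring Done guarantees the WaitGroup is released even on error.
            defer wg.Done()
            if err := w.writeChunkToFile(startLine, endLine, filename); err != nil {
                errCh <- err
            }
        }()
    }

    wg.Wait()
    close(errCh)

    // Receiving from the closed channel yields the first reported error,
    // or nil if no worker failed.
    return <-errCh
}

func (w *parallelChunkFileWriter) writeChunkToFile(startLine, endLine int, filename string) error {
    // Each chunk goes to its own file, named "<filename>_<startLine>".
    file, err := w.fileWriter.CreateFile(fmt.Sprint(filename, "_", startLine))
    if err != nil {
        return err
    }
    defer w.fileWriter.FileClose(file)

    writer := w.fileWriter.NewBufferedWriter(file)

    for i := startLine; i < endLine; i++ {
        data := fmt.Sprintf("This is a line of data %d.\n", i)

        if _, err := w.fileWriter.BufferedWriteString(writer, data); err != nil {
            return err
        }
    }

    // Flush the buffer and report any error to the caller.
    return w.fileWriter.BufferedFlush(writer)
}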

In this code sample, we've created a parallel chunk file writer that divides the data into chunks and writes each chunk concurrently to its own file, maintaining data order within each chunk. Here's what's happening step by step:

  • We create a new instance of parallelChunkFileWriter using NewParallelChunkFileWriter. It reads runtime.GOMAXPROCS(0), which defaults to the number of logical CPUs, to decide how many goroutines to use.
  • The Write method takes the total number of lines to write and the filename.
  • To split the work, we divide the total number of lines by the number of goroutines to get the chunk size.
  • We use a sync.WaitGroup to wait for all workers to complete their tasks.
  • Inside a loop, we spawn multiple goroutines, each assigned to a worker function w.writeChunkToFile. These workers write their respective chunks of lines concurrently while maintaining data order within each chunk.
  • Each worker creates its own output file with w.fileWriter.CreateFile, naming it after filename and the chunk's startLine, and defers its closure.
  • The worker receives the range of lines it's responsible for and writes that data to its file using a buffered writer.
  • After launching all workers, Write calls wg.Wait() to block until every worker has completed its task.

This code demonstrates how to achieve both speed and data order within chunks by dividing data and utilizing parallelism effectively.

Conclusion 🤝

As a beginner developer, choosing the right method for writing large data into files depends on your specific use case:

  • Sequential Writing is suitable for simplicity and data integrity but is slower for large datasets.

  • Parallel Writing is ideal when write speed is crucial, and you can manage concurrent writes efficiently, even if data order isn't a top priority.

  • Parallel Chunk Writing is the sweet spot, offering both speed and data order within chunks for large datasets, but it writes one file per chunk and comes with moderate implementation complexity.

Consider your project's requirements, system capabilities, and the balance between speed and data integrity when making your choice. Happy coding! 🚀📦

Want to connect?

Got questions or just want to chat? Connect with me on Twitter/X anytime. Can't wait to chat with you! 💬
