Ashok Nagaraj
Data formats - how and when

Let's dive into the world of data formats. We'll cover CSV, JSON, Avro, and Parquet, detailing their advantages, disadvantages, and use cases, along with examples and pointers to Python libraries.

I've left out Protobuf and ORC, as I'm not very familiar with them at the moment.

CSV (Comma-Separated Values)

Advantages:

  • Simplicity: Easy to read and write, both manually and programmatically.
  • Widespread Support: Supported by almost all data analysis tools and programming languages.
  • Human-Readable: Easy to understand and edit with plain text editors.

Disadvantages:

  • No Schema: Doesn't enforce data types, leading to potential inconsistencies.
  • Inefficiency: Larger file sizes compared to binary formats.
  • Limited Data Structures: Cannot represent nested or complex data structures.

Use When:

  • Data is tabular and doesn't require complex structures.
  • File size and read/write speed are not critical.
  • Interoperability with various tools is needed.

Avoid When:

  • Data has complex nested structures.
  • Efficiency and compression are priorities.

Python Libraries:

  • Pandas: pandas.read_csv() and DataFrame.to_csv()
  • csv: Built-in Python library for reading and writing CSV files.

Example:

import pandas as pd

# Read CSV
df = pd.read_csv('data.csv')

# Write CSV
df.to_csv('output.csv', index=False)
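The built-in csv module mentioned above works with no third-party dependencies. A minimal sketch using csv.DictWriter and csv.DictReader (the filename and columns are illustrative); it also demonstrates the "no schema" caveat, since every value reads back as a string:

```python
import csv

rows = [{'name': 'Alice', 'age': '30'}, {'name': 'Bob', 'age': '25'}]

# Write CSV with the standard library
with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'age'])
    writer.writeheader()
    writer.writerows(rows)

# Read it back; note every value comes back as a string (no data types)
with open('output.csv', newline='') as f:
    data = list(csv.DictReader(f))

print(data[0]['age'])  # '30' -- a string, not an int
```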

JSON (JavaScript Object Notation)

Advantages:

  • Flexibility: Can represent complex and nested data structures.
  • Human-Readable: Text-based and easy to understand.
  • Interoperability: Widely used for web APIs and data exchange.

Disadvantages:

  • No Schema: Doesn't enforce data types, leading to potential inconsistencies.
  • Larger Size: Less efficient in terms of storage compared to binary formats.
  • Parsing Overhead: Slower to parse compared to binary formats.

Use When:

  • Data has a complex, nested structure.
  • Interoperability with web applications and APIs is needed.

Avoid When:

  • Efficiency and compression are critical.
  • Data schema enforcement is necessary.

Python Libraries:

  • json: Built-in Python library for reading and writing JSON.
  • orjson: A faster third-party JSON library (note that orjson.dumps() returns bytes, not str).
  • Pandas: pandas.read_json() and DataFrame.to_json()

Example:

import json

# Read JSON
with open('data.json') as f:
    data = json.load(f)

# Write JSON
with open('output.json', 'w') as f:
    json.dump(data, f, indent=4)
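To illustrate the nesting advantage: unlike CSV, a single JSON document can hold lists and objects inside objects, and it round-trips through the standard library unchanged (the record below is made up for illustration):

```python
import json

# A nested structure that flat CSV could not represent directly
user = {
    "name": "Alice",
    "tags": ["admin", "beta"],
    "address": {"city": "Bengaluru", "zip": "560001"},
}

# Serialize to a string and parse it back
text = json.dumps(user)
restored = json.loads(text)

print(restored["address"]["city"])  # nested access survives the round trip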

Avro

Advantages:

  • Schema Enforcement: Ensures data consistency with a defined schema.
  • Efficient Serialization: Binary format with compact storage.
  • Interoperability: Supports schema evolution and backward compatibility.

Disadvantages:

  • Complexity: Requires understanding of Avro schema.
  • Less Human-Readable: Binary format, harder to inspect manually.
  • Limited Tool Support: Not as widely supported as CSV or JSON.

Use When:

  • Data consistency and schema enforcement are needed.
  • Efficient storage and serialization are priorities.
  • Schema evolution is required.

Avoid When:

  • Simplicity and human-readability are needed.
  • Interoperability with many tools is required.

Python Libraries:

  • fastavro: Library for reading and writing Avro files.
  • avro: The official Apache Avro library for Python (the older avro-python3 package is deprecated).

Example:

import fastavro

# Define Schema
schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"}
    ]
}

records = [{'name': 'Alice', 'age': 30}, {'name': 'Bob', 'age': 25}]

# Write Avro (fastavro.writer writes to the file object and returns None)
with open('data.avro', 'wb') as out:
    fastavro.writer(out, schema, records)

# Read Avro
with open('data.avro', 'rb') as f:
    for record in fastavro.reader(f):
        print(record)

Parquet

Advantages:

  • Efficient Storage: Columnar format with compression, reducing file size.
  • Fast Read/Write: Optimized for analytical queries and batch processing.
  • Schema Enforcement: Ensures data consistency with a defined schema.

Disadvantages:

  • Complexity: Requires understanding of Parquet format.
  • Less Human-Readable: Binary format, harder to inspect manually.
  • Limited Tool Support: Not as widely supported as CSV or JSON.

Use When:

  • Data analysis and big data processing are priorities.
  • Efficient storage and fast read/write are required.
  • Schema enforcement is needed.

Avoid When:

  • Simplicity and human-readability are needed.
  • Interoperability with many tools is required.

Python Libraries:

  • PyArrow: Library for reading and writing Parquet files.
  • Pandas: pandas.read_parquet() and DataFrame.to_parquet()

Example:

import pandas as pd

# Create DataFrame
data = {'name': ['Alice', 'Bob'], 'age': [30, 25]}
df = pd.DataFrame(data)

# Write Parquet
df.to_parquet('data.parquet')

# Read Parquet
df = pd.read_parquet('data.parquet')
print(df)

Summary Table

| Feature            | CSV  | JSON   | Avro   | Parquet |
|--------------------|------|--------|--------|---------|
| Schema Enforcement | No   | No     | Yes    | Yes     |
| Human-Readable     | Yes  | Yes    | No     | No      |
| Storage Efficiency | Low  | Medium | High   | High    |
| Complex Data       | No   | Yes    | Yes    | Yes     |
| Tool Support       | High | High   | Medium | Medium  |

When to Use Each Format

  • CSV: When you need simplicity and compatibility with many tools, and your data is tabular.
  • JSON: When you need to represent complex, nested data structures and interoperability with web applications.
  • Avro: When you need efficient serialization, schema enforcement, and schema evolution.
  • Parquet: When you need efficient storage, fast read/write, and support for analytical queries and big data processing.

Cover picture credits

Photo by Ksenia Chernaya: https://www.pexels.com/photo/sticky-notes-on-wooden-table-6999650/
