Study Notes 5.6.1 - Connecting Spark to GCS
1. Overview
This guide explains how to connect a local Spark environment to Google Cloud Storage (GCS). It covers the steps required to:
- Upload data to GCS
- Download and configure the GCS connector for Hadoop
- Set up Spark configurations for GCS access
- Run Spark applications that read/write data from/to GCS
Understanding these steps is essential for integrating your Spark processing pipelines with cloud storage, a common scenario in production data engineering.
2. GCS Basics
- GCS is a scalable, secure, and durable object storage service provided by GCP.
- It is often used to store large datasets, logs, and other files that need to be processed by data pipelines.
- In Spark, GCS is accessed via special connectors that integrate with Hadoop’s file system APIs.
3. Step-by-Step Process
3.1 Uploading Data to GCS
Before connecting Spark to GCS, you need to upload your data to a GCS bucket.
- Using gsutil:
  The `gsutil cp` command is used to copy files (or entire directories) to a GCS bucket.
  Example Command:

```bash
gsutil -m cp -r /path/to/local/data gs://your-bucket/pq
```

  - `-m`: Enables multi-threading to speed up the transfer.
  - `-r`: Recursively copies the directory.
3.2 Downloading the GCS Connector for Hadoop
Spark requires a connector to interact with GCS. For Spark running on Hadoop 3, you need the corresponding GCS connector JAR.
- Download the Connector JAR:
  Locate and download the appropriate JAR file (e.g., version 2.2.5 for Hadoop 3) from the Google Cloud Storage connector repository.
  Example Command (using cp):

```bash
cp <remote-location>/gcs-connector-hadoop3-latest.jar /your/local/dir/gcs-connector.jar
```
3.3 Configuring Spark to Use the Connector
You need to instruct Spark to load the GCS connector by specifying its location in your Spark configuration.
- Spark Configuration:
  - `spark.jars`: Specify the path to the downloaded connector JAR.
  - `spark.hadoop.fs.gs.impl`: Set the implementation class for the GCS file system.

  Example Code Snippet:

```python
from pyspark.sql import SparkSession

# Define the path to the GCS connector JAR
gcs_connector_path = "/path/to/gcs-connector.jar"

# Set the credentials file location (see next section)
google_credentials = "/path/to/google_credentials.json"

spark = SparkSession.builder \
    .appName("GCS_Connection_Test") \
    .master("local[*]") \
    .config("spark.jars", gcs_connector_path) \
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem") \
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", google_credentials) \
    .getOrCreate()
```
3.4 Setting Up Google Cloud Credentials
Spark needs credentials to authenticate with GCS.
- Credentials File: Download or create a service account JSON key file from your GCP console.
- Spark Configuration:
  Specify the credentials file path using
  `.config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", "/path/to/google_credentials.json")`
3.5 Verifying the Connection
After configuring Spark, test the connection by reading or writing data from/to your GCS bucket.
- Example of Reading Data from GCS:

```python
# Assuming your bucket contains a folder 'pq' with data files
df = spark.read.format("parquet").load("gs://your-bucket/pq/green/")
df.show()
```
- If the data is loaded and displayed, your configuration is correct.
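To also confirm write access, here is a minimal sketch that reuses the `spark` session configured above and writes a tiny test dataset back to the bucket (the `test_write/` prefix is a hypothetical location; point it anywhere you are allowed to write):

```python
# Write a tiny DataFrame to the bucket and read it back to confirm both
# directions work. The test_write/ prefix is illustrative.
test_path = "gs://your-bucket/pq/test_write/"

test_df = spark.createDataFrame([(1, "ok")], ["id", "status"])
test_df.write.mode("overwrite").parquet(test_path)

spark.read.parquet(test_path).show()
```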
4. Best Practices and Considerations
4.1 Environment Setup
- Local vs. Cluster Mode: The steps above cover connecting from a local Spark instance. In a managed cluster (e.g., Google Cloud Dataproc), many configurations may be pre-set.
- JAR Placement: Ensure the connector JAR is accessible to Spark. In cluster mode, it may need to be distributed to all nodes or included via the submission command (`--jars`).
4.2 Managing Credentials
- Security: Keep your service account JSON file secure. Avoid hardcoding sensitive paths or keys directly in notebooks.
- Environment Variables: Consider setting the credentials path via environment variables to simplify configuration across different environments.
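One hedged pattern for this: read the key file path from an environment variable instead of writing it into the notebook. `GOOGLE_APPLICATION_CREDENTIALS` is the variable Google's tooling conventionally uses, but any agreed-upon name works:

```python
import os

from pyspark.sql import SparkSession

# Read the key file location from the environment, falling back to a local
# default only for development.
google_credentials = os.environ.get(
    "GOOGLE_APPLICATION_CREDENTIALS",
    "/path/to/google_credentials.json",
)

# The connector JAR and fs.gs.impl settings from section 3.3 still apply;
# only the credentials line changes.
spark = (
    SparkSession.builder
    .appName("GCS_Connection_Test")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile", google_credentials)
    .getOrCreate()
)
```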
4.3 Performance and Debugging
- Spark UI: Use the Spark Web UI (typically at `localhost:4040`) to monitor job stages and troubleshoot issues.
- Logging: Increase logging verbosity if connections to GCS fail, as error messages can provide insight into misconfigurations or permission issues (see the sketch below).
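For the logging point, a minimal sketch of raising verbosity on an existing session (these are standard Spark log levels; `DEBUG` is very noisy, so lower it again once the issue is found):

```python
# Temporarily raise driver-side log verbosity while debugging GCS access.
spark.sparkContext.setLogLevel("DEBUG")

# ... run the failing read/write here ...

# Restore a quieter level afterwards.
spark.sparkContext.setLogLevel("WARN")
```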
4.4 Future Steps
- Local Cluster & Spark Submit: Subsequent topics may include creating a local Spark cluster and using the `spark-submit` command to run jobs in a production environment.
- Managed Services: In cloud-managed services (e.g., Google Cloud Dataproc), many configurations are automated, reducing the need for manual setup.
5. Conclusion
Connecting Spark to Google Cloud Storage involves:
- Uploading your data to a GCS bucket.
- Downloading and including the correct GCS connector JAR.
- Configuring Spark with the necessary settings and credentials.
- Verifying the connection with test read/write operations.
Study Notes 5.6.2 - Creating a Local Spark Cluster
1. Overview
This session covers the process of creating and managing a local Spark cluster. The main topics include:
- Turning a Jupyter Notebook into a standalone Python script
- Using `spark-submit` to run Spark jobs
- Setting up a local Spark cluster (standalone mode)
- Connecting your script to the cluster and managing resources
These steps are fundamental for running Spark jobs in environments that mimic production clusters—whether on your laptop, a virtual machine, or within cloud orchestration systems like Airflow.
2. Why Create a Local Spark Cluster?
Even when testing on a local machine, it’s valuable to simulate a cluster environment to understand:
- Cluster Components: A typical Spark cluster consists of a master and one or more workers that host the executors; your application's driver connects to the master to request those resources. In standalone mode, you manually start the master and worker processes.
- Resource Management: When you run a Spark job, the cluster must have available resources (cores, memory). Running multiple applications can lead to resource contention, which is why configuration via `spark-submit` is crucial.
- Job Submission Process: Instead of running Spark code directly in a notebook, converting it to a script and submitting it via `spark-submit` enables you to define master URLs, executor configurations, and other runtime parameters externally. This is essential for integration with production schedulers like Airflow.
3. Step-by-Step Guide
3.1 Converting a Notebook to a Python Script
- Purpose: Notebooks are great for interactive development, but production jobs are typically run as scripts.
- Conversion Command: Use Jupyter's nbconvert tool to convert a notebook (`.ipynb`) into a Python script (`.py`):

```bash
jupyter nbconvert --to script your_notebook.ipynb
```
- Post-conversion Tips:
  - Open the generated script in an editor (e.g., VS Code) to clean up formatting.
  - Remove or adjust any notebook-specific code (like inline plotting or magic commands) that won't run in a standard Python environment, as in the sketch below.
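For illustration, nbconvert turns magics and shell commands into `get_ipython()` calls that fail outside IPython. A hedged sketch of the cleanup (the `converted_job` app name and the example lines are placeholders, not output from any particular notebook):

```python
# Notebook-only lines that nbconvert may emit and that should be removed:
#   get_ipython().run_line_magic('matplotlib', 'inline')
#   get_ipython().system('pip install pyspark')

# A cleaned script is typically reduced to a plain skeleton like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("converted_job").getOrCreate()

# ... the transformations that previously lived in notebook cells ...

spark.stop()
```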
3.2 Creating a Local Spark Cluster (Standalone Mode)
3.2.1 Starting the Spark Master
- Procedure:
  1. Open a terminal and navigate to your Spark installation directory (often referred to as `$SPARK_HOME`).
  2. Start the master node by executing the provided script:

```bash
./sbin/start-master.sh
```

  3. Once started, access the Spark master UI in your browser (typically at `http://localhost:8080`) to view details about the master, running applications, and resource status.
3.2.2 Starting a Spark Worker (Executor)
- Procedure:
  1. In another terminal window (or tab), navigate to your Spark home directory.
  2. Start a worker process and register it with the master using:

```bash
./sbin/start-worker.sh <master-url>
```

  Replace `<master-url>` with the URL shown in the master UI (e.g., `spark://localhost:7077`).
- Verification:
  Refresh the master UI to confirm that the worker (or "slave" in older Spark versions) appears. The worker provides the resources (CPU cores, memory) that Spark uses to execute tasks.
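If you'd rather verify from code than by refreshing the browser, a small sketch is shown below; it assumes the standalone master also serves its status as JSON at `http://localhost:8080/json/` (recent Spark versions do), otherwise just use the web UI:

```python
import json
import urllib.request

# Ask the standalone master for its status as JSON and list registered workers.
with urllib.request.urlopen("http://localhost:8080/json/") as resp:
    status = json.load(resp)

for worker in status.get("workers", []):
    print(worker.get("id"), worker.get("state"), worker.get("cores"), worker.get("memory"))
```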
3.3 Running Spark Jobs with spark-submit
3.3.1 Why Use spark-submit?
- Decoupling Configuration from Code: Instead of hardcoding the master URL or resource settings in your script, you can pass them as command-line arguments. This makes your application more portable and easier to configure in different environments.
- Resource Management: When using `spark-submit`, you can specify the number of executors, amount of memory per executor, number of cores, and more.
3.3.2 Example Command
- Basic Usage:

```bash
spark-submit \
    --master spark://localhost:7077 \
    --executor-memory 1G \
    --total-executor-cores 2 \
    path/to/your_script.py \
        --input_green /path/to/input_green \
        --input_yellow /path/to/input_yellow \
        --output /path/to/output
```
- Key Points:
  - `--master`: Specifies the URL of your Spark master.
  - `--executor-memory` and `--total-executor-cores`: Define resource allocations.
  - Arguments following the script name (e.g., `--input_green`) are parsed within your script (using packages like `argparse`).
3.3.3 Configuring the Script with CLI Arguments
- Using argparse:
  Make your script configurable by parsing command-line arguments. Example:

```python
import argparse

from pyspark.sql import SparkSession


def main(args):
    spark = SparkSession.builder.getOrCreate()

    # Example: Read input from a configurable location
    df_green = spark.read.parquet(args.input_green)
    df_yellow = spark.read.parquet(args.input_yellow)

    # Process data and write output
    df_report = df_green.unionAll(df_yellow)  # Example operation
    df_report.write.parquet(args.output)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Process input and output paths.")
    parser.add_argument("--input_green", required=True, help="Path to input green data")
    parser.add_argument("--input_yellow", required=True, help="Path to input yellow data")
    parser.add_argument("--output", required=True, help="Path for output report")
    args = parser.parse_args()
    main(args)
```
- Benefits:
  - Flexibility to run the same job for different months, years, or data partitions.
  - Seamless integration with orchestration tools like Airflow (see the sketch below).
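As one illustration of that Airflow integration, here is a minimal, hedged sketch of a DAG that submits the script once per month with `spark-submit` via Airflow's standard BashOperator (assuming Airflow 2.x; the DAG id, schedule, and paths are hypothetical placeholders, not anything prescribed by these notes):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="spark_report_monthly",      # hypothetical DAG id
    start_date=datetime(2021, 1, 1),
    schedule_interval="@monthly",
    catchup=False,
) as dag:
    submit_report = BashOperator(
        task_id="spark_submit_report",
        # {{ ds }} is Airflow's templated run date (YYYY-MM-DD); it is rendered
        # at runtime and passed through to the script's CLI arguments.
        bash_command=(
            "spark-submit --master spark://localhost:7077 "
            "/opt/jobs/your_script.py "
            "--input_green /data/green/{{ ds }} "
            "--input_yellow /data/yellow/{{ ds }} "
            "--output /data/report/{{ ds }}"
        ),
    )
```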
3.4 Managing the Local Cluster
- Stopping the Cluster:
  After job execution, it's important to shut down the cluster components to free resources.
  - To stop a worker (or slave):

```bash
./sbin/stop-worker.sh
```

  - To stop the master:

```bash
./sbin/stop-master.sh
```

- Resource Contention:
  If multiple Spark applications run concurrently, they may compete for resources. Ensure that only one active application occupies the cluster, or properly configure resource limits using spark-submit options (a per-application example follows below).
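If two applications genuinely must share the standalone cluster, one option is to cap what each application may claim. A minimal sketch with illustrative limits (when submitting with spark-submit you would normally pass these as `--conf` options instead of hardcoding the master in code):

```python
from pyspark.sql import SparkSession

# Cap this application's footprint so a second application can still be
# scheduled on the same standalone cluster. Values are illustrative.
spark = (
    SparkSession.builder
    .appName("resource_capped_app")
    .master("spark://localhost:7077")
    # Total cores this application may take from the cluster (standalone mode).
    .config("spark.cores.max", "2")
    # Memory allocated to each executor.
    .config("spark.executor.memory", "1g")
    .getOrCreate()
)
```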
4. Best Practices and Additional Insights
- Decouple Cluster Configuration from Code: Always use `spark-submit` to define cluster settings (like master URL, executor memory, and cores) rather than hardcoding these in your script. This ensures your code can run in various environments without changes.
- Script Conversion and Testing: When converting notebooks to scripts, test the script independently from the notebook environment to catch any issues with formatting or dependency management.
- Cluster Monitoring: Utilize the Spark UI (accessible via the master's web interface) to monitor resource utilization, job progress, and troubleshoot performance issues.
- Local vs. Cloud Environments: While this guide focuses on a local Spark cluster, the techniques and configurations discussed here are similar to those used when deploying Spark on cloud-managed clusters (e.g., Google Cloud Dataproc). Understanding the local setup builds a strong foundation for cloud-based deployments.
5. Conclusion
Creating a local Spark cluster involves:
- Converting interactive Jupyter notebooks into production-ready Python scripts.
- Starting and managing Spark cluster components (master and workers) locally.
- Using `spark-submit` to execute your Spark jobs with flexible configuration options.
- Monitoring and managing resources effectively, ensuring smooth job execution.