Study Notes 5.6.3 - Setting Up a Dataproc Cluster in GCP
1. Introduction
GCloud Dataproc is a managed service that simplifies running Apache Spark, Hadoop, and other big data workloads on GCP. Using Dataproc, you can quickly spin up clusters, run Spark jobs, and integrate with other GCP services (like GCS and BigQuery) with minimal administrative overhead.
Key Benefits:
- Managed Infrastructure: No need to worry about the underlying hardware or cluster maintenance.
- Seamless Integration: Easy connectivity to GCS and other GCP services.
- Scalability: Quickly scale your cluster up or down based on workload needs.
- Cost-Effective: Pay only for the resources you use.
2. Creating a Dataproc Cluster
2.1 Enabling Dataproc API
- First-time Setup: When you first access Dataproc in the GCloud Console, you must enable the Dataproc API. This is a one-time setup that allows your account to create and manage clusters.
2.2 Creating a New Cluster
- Navigate to Dataproc:
  - In the GCP Console, select Dataproc from the navigation menu.
- Click on "Create Cluster":
  - Choose a cluster name (e.g., data-engineering-cluster; Dataproc cluster names may only contain lowercase letters, digits, and hyphens).
- Region and Zone Selection:
  - Region: Choose the same region as your GCS bucket (e.g., europe-west6 for Zurich).
  - Zone: The specific zone can be left as default if not critical.
- Cluster Type:
  - Standard Mode: Includes one master node and multiple worker nodes.
  - Single Node: For experimenting or processing small datasets, a single-node cluster is sufficient.
- Additional Components:
  - Jupyter Notebook: Optionally include Jupyter Notebook for interactive experimentation.
  - Docker: Can be selected if you plan to containerize your workloads (though not used immediately).
- Finalize and Create:
  - Click Create and wait for the cluster to be provisioned.
  - The process also creates the underlying virtual machines (VMs) that form the cluster.
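If you prefer to script cluster creation instead of clicking through the Console, the snippet below is a minimal sketch using the google-cloud-dataproc Python client library; the project ID, machine types, and worker count are illustrative placeholders, not values from the lesson.
# Requires: pip install google-cloud-dataproc
from google.cloud import dataproc_v1

region = "europe-west6"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)
cluster = {
    "project_id": "your-project",
    "cluster_name": "data-engineering-cluster",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-4"},
    },
}
operation = client.create_cluster(
    request={"project_id": "your-project", "region": region, "cluster": cluster}
)
print(operation.result().cluster_name)  # blocks until provisioning finishes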
3. Submitting a Spark Job on Dataproc
3.1 Preparing Your Job
- Code Upload:
  Before submitting a job, your Spark application (e.g., a Python script) must be accessible to the cluster. A common approach is to upload your code to a GCS bucket.
  Example using gsutil:
  gsutil -m cp /local/path/your_spark_job.py gs://your-bucket/code/
  Tip: Organize your bucket so that code, data, and results are stored in separate folders.
- Parameterization:
  Adapt your script to accept command-line arguments for input and output paths. This makes it flexible for different datasets (e.g., processing data for different years); a short example follows below.
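A minimal sketch of that argument handling (the positional argument names are illustrative, not taken from the lesson's actual script):
import argparse

parser = argparse.ArgumentParser(description="Build the yearly report")
parser.add_argument("input_green", help="GCS path to the green dataset")
parser.add_argument("input_yellow", help="GCS path to the yellow dataset")
parser.add_argument("output", help="GCS path for the report output")
args = parser.parse_args()
# Later in the script, read and write using args.input_green, args.input_yellow,
# and args.output instead of hardcoded paths.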
3.2 Submitting via the Web UI
- Access the Cluster:
  - Click on your cluster from the Dataproc page.
- Submit a Job:
  - Choose the job type (e.g., PySpark).
  - Specify the main Python file by pointing to the location in your GCS bucket (e.g., gs://your-bucket/code/your_spark_job.py).
- Configure Job Arguments:
  - Pass parameters such as input and output paths. For example:
    - Input (green dataset): gs://your-bucket/green_dataset/
    - Input (yellow dataset): gs://your-bucket/yellow_dataset/
    - Output (report location): gs://your-bucket/reports/2021/
- Submit and Monitor:
  - Submit the job and monitor progress via the job's details in the Dataproc web interface.
  - You can view job logs and status messages to verify successful execution.
3.3 Submitting via GCloud SDK (gcloud)
For automation or integration (e.g., with Airflow), you can use the gcloud CLI to submit jobs.
Example Command:
gcloud dataproc jobs submit pyspark \
--cluster=data-engineering-cluster \
--region=europe-west6 \
gs://your-bucket/code/your_spark_job.py \
-- gs://your-bucket/green_dataset/ gs://your-bucket/yellow_dataset/ gs://your-bucket/reports/2021/
- Note:
  - The -- separator signals the end of gcloud options and the beginning of arguments to be passed to the job.
  - Make sure your Python script does not hardcode the Spark master URL; Dataproc handles that automatically.
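For illustration, a minimal sketch of a SparkSession created without a hardcoded master (the app name is a placeholder); spark-submit on Dataproc supplies the master configuration:
from pyspark.sql import SparkSession

# No .master(...) call here - Dataproc's spark-submit injects the correct master.
spark = SparkSession.builder \
    .appName("taxi_reports") \
    .getOrCreate()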
4. Managing Permissions and Service Accounts
4.1 Common Permission Issues
- Error: "Permission Denied" or "Not Authorized to Request Resource"
  - Occurs when the service account used does not have the required permissions to submit jobs to Dataproc.
4.2 Resolving Permission Issues
- Assign the Correct Roles:
  - The service account used (for example, one set up via Terraform) might need additional roles.
  - Grant roles such as Dataproc Editor or Dataproc Admin to the service account.
- Best Practice:
  - Use separate service accounts for infrastructure management (e.g., Terraform) and job submission (e.g., Airflow workers).
  - Apply the principle of least privilege, granting only the permissions needed for job submission if possible.
5. Verifying Job Execution and Output
- Job Output:
  After submission, check the designated GCS bucket for the output folder (e.g., gs://your-bucket/reports/2021/). The presence of output files indicates that the Spark job executed successfully (see the sketch after this list for a programmatic check).
- Monitoring Options:
  - Web UI: Use the Dataproc job details page to inspect logs and job status.
  - SSH/Port Forwarding: Optionally, connect to the master node for deeper troubleshooting.
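As a quick programmatic check, the sketch below lists the objects the job wrote, using the google-cloud-storage client library; the bucket name and prefix are placeholders.
# Requires: pip install google-cloud-storage
from google.cloud import storage

client = storage.Client()
for blob in client.list_blobs("your-bucket", prefix="reports/2021/"):
    print(blob.name, blob.size)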
6. Additional Use Cases and Integration
- Integration with Data Warehouses:
  - Although the job output is stored in GCS, many projects require data to be loaded into a data warehouse (e.g., BigQuery) for analytics and dashboarding.
  - Spark can be configured to write directly to BigQuery, streamlining the process.
- Automation:
  - For production workflows, consider integrating job submission via Apache Airflow. Use the gcloud or spark-submit operators to schedule and manage jobs (see the sketch after this list).
- Advanced Submission Methods:
  - Beyond the Web UI and CLI, jobs can also be submitted via REST APIs for more complex integrations.
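As a sketch of the Airflow route, one option is the Google provider's DataprocSubmitJobOperator; the project, region, cluster, bucket paths, and task ID below are placeholders, and the operator is assumed to sit inside a DAG definition.
from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator

submit_spark_report = DataprocSubmitJobOperator(
    task_id="submit_spark_report",
    project_id="your-project",
    region="europe-west6",
    job={
        "reference": {"project_id": "your-project"},
        "placement": {"cluster_name": "data-engineering-cluster"},
        "pyspark_job": {
            "main_python_file_uri": "gs://your-bucket/code/your_spark_job.py",
            "args": [
                "gs://your-bucket/green_dataset/",
                "gs://your-bucket/yellow_dataset/",
                "gs://your-bucket/reports/2021/",
            ],
        },
    },
)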
7. Best Practices
- Consistent Region Configuration:
  - Always align your Dataproc cluster's region with the location of your GCS buckets to minimize latency and data transfer costs.
- Organize Your Cloud Storage:
  - Maintain a clean structure for your code, data, and outputs to facilitate management and debugging.
- Parameterize Your Jobs:
  - Use CLI arguments (with libraries like argparse) to make your scripts flexible and environment agnostic.
- Monitor and Optimize:
  - Regularly review job logs and performance metrics through the Dataproc UI.
  - Adjust cluster configurations (e.g., number of workers, machine types) based on workload needs (see the resizing sketch after this list).
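A minimal sketch of resizing the worker pool programmatically with the google-cloud-dataproc client; the project, region, cluster name, and target worker count are placeholders, and the update-mask path follows the Dataproc update-cluster pattern, so check it against the current docs before relying on it.
from google.cloud import dataproc_v1

region = "europe-west6"
client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)
operation = client.update_cluster(
    request={
        "project_id": "your-project",
        "region": region,
        "cluster_name": "data-engineering-cluster",
        "cluster": {"config": {"worker_config": {"num_instances": 4}}},
        # FieldMask selecting which cluster field to change
        "update_mask": {"paths": ["config.worker_config.num_instances"]},
    }
)
operation.result()  # wait for the resize to complete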
8. Conclusion
In this session, you learned how to:
- Set Up a Dataproc Cluster: Create and configure a cluster in GCP using Dataproc, choosing appropriate settings such as region, cluster type, and additional components.
- Submit Spark Jobs: Upload your Spark application to GCS and submit jobs via both the Dataproc web UI and the gcloud CLI. You also saw how to parameterize your jobs for flexibility.
- Manage Permissions: Resolve common permission issues by properly configuring service account roles.
- Verify and Monitor Jobs: Ensure your job's successful execution by checking outputs and monitoring via the Dataproc interface.
Study Notes 5.6.4 - Connecting Spark on Dataproc to BigQuery
1. Overview
Purpose:
Integrate Spark (running on GCloud Dataproc) directly with BigQuery so that processed data can be written straight into your data warehouse. This approach bypasses the extra step of writing results to GCS and then loading them into BigQuery.
Context:
- Previously, you learned how to create a Dataproc cluster and submit Spark jobs that output to GCS.
- In this session, the focus is on modifying your Spark job to write results directly into BigQuery using the BigQuery connector.
2. The BigQuery Connector for Spark
Connector Role:
The BigQuery connector enables Spark to read from and write to BigQuery tables. Dataproc clusters often come pre-configured with the required connector JARs. However, if needed, you can add additional JARs through the job submission interface.
Key Considerations:
- Temporary GCS Bucket: Some configurations require specifying a temporary GCS bucket for staging data before it’s loaded into BigQuery.
- Automatic Table Creation: If the target table does not exist, the connector can automatically create it—depending on the options you provide.
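The write direction is covered in the next section; for completeness, here is a minimal sketch of the read direction (the table name is a placeholder, and the connector is assumed to already be on the cluster, as is typical on Dataproc):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq_read_example").getOrCreate()
df = spark.read.format("bigquery") \
    .option("table", "your-project:your_dataset.some_table") \
    .load()
df.printSchema()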
3. Code Changes to Write Directly to BigQuery
3.1 Modifying the Output Section
From GCS to BigQuery:
Instead of writing your DataFrame using a format like Parquet or CSV, you change the format to bigquery. For example:
# Original save to GCS (example):
# dataframe.write.format("parquet").save("gs://your-bucket/output-folder")

# Modified to write to BigQuery:
dataframe.write \
    .format("bigquery") \
    .option("table", "your-project:your_dataset.reports_2020") \
    .option("temporaryGcsBucket", "your-temp-bucket") \
    .save()
Points to Note:
- format("bigquery"): Tells Spark to use the BigQuery connector.
- option("table", ...): Specifies the destination table in BigQuery using the format project:dataset.table.
- option("temporaryGcsBucket", ...): Provides a temporary staging location (if required) for intermediate data during the load process.
3.2 Handling JAR Dependencies
- Default Setup: Dataproc clusters usually have the BigQuery connector available by default.
- Adding JARs (if needed):
  If your job fails with an error like "failed to find data source bigquery," you may need to explicitly supply the connector JAR via the job's configuration (e.g., using the --jars option when submitting a job).
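An alternative sketch is to pull the connector in at session start via spark.jars.packages; the Maven group and artifact below are the connector's published coordinates, but the Scala suffix and version shown are assumptions that must match your cluster's Spark build.
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("reports_to_bigquery") \
    .config("spark.jars.packages",
            "com.google.cloud.spark:spark-bigquery-with-dependencies_2.12:0.36.1") \
    .getOrCreate()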
4. Job Submission and Testing
4.1 Uploading Your Code
- Upload to GCS:
  Before submitting your job, upload your modified Python script to a GCS bucket:
  gsutil -m cp /local/path/your_spark_job.py gs://your-bucket/code/
4.2 Submitting the Job via the Dataproc Web UI
- Select Job Type: Choose “PySpark” as the job type.
- Specify the Main File: Provide the path to your Python script (e.g., gs://your-bucket/code/your_spark_job.py).
- Provide Job Arguments: Instead of hardcoded local paths, pass parameters for input datasets and the output destination (which, for BigQuery, is now the table name).
- Submit and Monitor: Use the web interface to submit the job and check logs. The connector should create the BigQuery table if it does not already exist, then load the results.
4.3 Submitting via GCloud SDK (gcloud)
For automated or programmatic job submissions, you can use the gcloud CLI:
gcloud dataproc jobs submit pyspark \
--cluster=your-dataproc-cluster \
--region=your-region \
gs://your-bucket/code/your_spark_job.py \
-- gs://your-bucket/input_data/ "your-project:your_dataset.reports_2020"
- Note:
  The -- separator distinguishes between gcloud options and the arguments passed to your script.
5. Troubleshooting and Best Practices
5.1 Common Issues
- Missing Connector JAR:
  - Error messages like "failed to find data source bigquery" indicate that the BigQuery connector might not be found.
  - Solution: Ensure the Dataproc cluster includes the connector, or supply the necessary JAR using the job submission options.
- Temporary Bucket Requirement:
  - If not specified, Spark might fail to stage data.
  - Solution: Include the temporaryGcsBucket option in your write configuration.
- Table Creation:
  - If the target BigQuery table does not exist, the connector should create it. Monitor the job output and BigQuery logs to confirm.
5.2 Best Practices
- Parameterize Your Job: Use command-line arguments (e.g., with argparse) to make your script flexible for different input datasets or output targets.
- Avoid Hardcoding Paths: This ensures that the job remains environment agnostic and easier to manage when moving between local, Dataproc, or Airflow-managed environments.
- Monitor Job Logs: Utilize the Dataproc job interface and BigQuery logs to troubleshoot issues quickly.
- Test Incrementally: Validate changes on a small dataset before scaling up to production-sized workloads.
6. Conclusion
By following these steps, you can connect Spark running on a Dataproc cluster directly to BigQuery. This integration streamlines your data pipeline by eliminating intermediate storage in GCS and writing processed results straight into your data warehouse for analytics and reporting.
Key Takeaways:
- Modify your Spark code to use format("bigquery") with appropriate options.
- Ensure that any required connector JARs are available.
- Use a temporary GCS bucket for staging if needed.
- Leverage the Dataproc web UI or gcloud CLI for flexible job submission.
- Monitor and adjust based on job performance and error messages.
This setup provides a powerful workflow for moving processed data into BigQuery, thereby enhancing the efficiency of your data engineering pipelines.
7. Further Reading and Resources
- GCloud Dataproc Documentation: Dataproc Docs
- BigQuery Connector for Spark: Detailed examples and configuration options.
- GCloud SDK Documentation: gcloud CLI Overview
- Apache Spark Documentation: For understanding Spark’s DataFrame API and write options.
- Airflow Integration: Learn how to automate job submissions using Airflow’s operators for Spark and gcloud.