
Chi Cong, Nguyen


Understanding Spark from a Data Engineering POV


  • Spark is currently one of the most popular tools for big data analytics

  • Spark is generally faster than Hadoop MapReduce. This is because Hadoop writes intermediate results to disk, whereas Spark keeps intermediate results in memory whenever possible.

The Hadoop ecosystem includes a distributed file storage system called HDFS (Hadoop Distributed File System). Spark, on the other hand, does not include a file storage system: you can run Spark on top of HDFS, but you do not have to, and Spark can also read data from other sources such as Amazon S3.
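
A rough sketch of what that looks like (assuming PySpark is installed and, for S3, that the hadoop-aws connector is configured; the bucket path below is a placeholder):

```python
from pyspark.sql import SparkSession

# Start a Spark session; in a real cluster, master would point at the
# cluster manager instead of local[*].
spark = (SparkSession.builder
         .appName("read-example")
         .master("local[*]")
         .getOrCreate())

# Read a Parquet dataset from S3 (placeholder path); the same API reads
# from HDFS ("hdfs://...") or the local filesystem.
df = spark.read.parquet("s3a://my-bucket/events/")

df.printSchema()
df.show(5)
```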

MapReduce

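The classic MapReduce example is word count. Here is a minimal PySpark sketch of the same pattern: flatMap and map play the role of the "map" phase, and reduceByKey plays the role of the "reduce" phase (the input lines are made up for illustration).

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("wordcount-sketch")
         .master("local[*]")
         .getOrCreate())
sc = spark.sparkContext

lines = sc.parallelize([
    "spark keeps data in memory",
    "hadoop writes intermediate results to disk",
    "spark is generally faster",
])

# Map phase: split each line into words and emit (word, 1) pairs.
# Reduce phase: sum the counts for each word across partitions.
counts = (lines
          .flatMap(lambda line: line.split())
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

print(counts.collect())
```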

I. The Spark ecosystem includes multiple components


  • Spark Core: The foundation for distributed data processing.
  • Spark SQL: Enables structured data processing using SQL-like queries. It allows you to query data stored in various formats like Hive tables, Parquet files, and relational databases (see the example after this list).
  • MLlib: Provides machine learning algorithms for tasks like classification, regression, and clustering.
  • GraphX: A library for graph processing, enabling analysis of large-scale graphs.

--> Think of Spark as a toolbox for big data. Each component provides specialized tools for different tasks, allowing you to analyze and manipulate data efficiently and effectively.
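
For instance, Spark SQL lets you register a DataFrame as a temporary view and query it with plain SQL. A minimal sketch (the table name, columns, and values are made up for illustration):

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("sql-sketch")
         .master("local[*]")
         .getOrCreate())

# A small in-memory DataFrame standing in for a real table.
orders = spark.createDataFrame(
    [("alice", 120.0), ("bob", 80.0), ("alice", 45.5)],
    ["customer", "amount"],
)

# Register it as a temporary view so it can be queried with SQL.
orders.createOrReplaceTempView("orders")

spark.sql("""
    SELECT customer, SUM(amount) AS total_spent
    FROM orders
    GROUP BY customer
""").show()
```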

II. Basic architecture of Apache Spark


  • Master Node: This node houses the "Driver Program" which contains the Spark Context. The Spark Context is responsible for initializing the Spark application and connecting to the cluster.
  • Cluster Manager: The Cluster Manager is responsible for allocating resources and managing the worker nodes. It can be a standalone manager or utilize systems like YARN or Mesos.
  • Worker Nodes: These nodes are the workhorses of the Spark cluster. They execute the tasks assigned by the Driver Program.
  • Tasks: These are individual units of work that are distributed across the worker nodes.
  • Cache: Worker nodes maintain a cache for storing frequently accessed data, speeding up processing.
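
In code, this cache corresponds to calling cache() (or persist()) on a DataFrame or RDD. A minimal sketch with a synthetic dataset:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("cache-sketch")
         .master("local[*]")
         .getOrCreate())

# A synthetic dataset for illustration.
df = spark.range(0, 1_000_000)

# Mark a transformed DataFrame for caching; after the first action,
# worker nodes keep its partitions in memory for reuse.
evens = df.filter(df["id"] % 2 == 0).cache()

evens.count()  # first action computes the data and populates the cache
evens.count()  # later actions read the cached partitions instead of recomputing
```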

Here is how it works:

  1. The Driver Program, running on the Master Node, submits a Spark application to the Cluster Manager.
  2. The Cluster Manager distributes the application's tasks across the worker nodes.
  3. Worker nodes execute the tasks in parallel, leveraging their resources and the data cached on their local storage.
  4. The Driver Program gathers and aggregates the results from the worker nodes.
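
A minimal driver program that exercises these steps end to end (local[*] stands in for a real cluster manager URL):

```python
from pyspark.sql import SparkSession

# 1. The driver program creates the SparkSession (and with it the Spark
#    Context), which registers the application with the cluster manager.
spark = (SparkSession.builder
         .appName("architecture-walkthrough")
         .master("local[*]")
         .getOrCreate())

# 2-3. This transformation is split into tasks that the worker nodes
#      (executors) run in parallel on partitions of the data.
squares = spark.sparkContext.parallelize(range(1000)).map(lambda x: x * x)

# 4. The action triggers execution and returns the aggregated result
#    to the driver program.
print(squares.sum())

spark.stop()
```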
