Ryan Zhi
Introduction to Apache Hadoop Distributed Computing Cluster Scheduling Framework

Excerpted from the Apache Hadoop Official Documentation

1. Introduction

Apache Hadoop is a software framework for the distributed processing of large datasets across clusters of computers using simple programming models. It is designed to scale from a single server to thousands of machines, each offering local computation and storage. Rather than relying on hardware for high availability, the library detects and handles failures at the application layer, so it can deliver a highly available service on top of a cluster in which any individual machine may fail.
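To make the idea of handling failures in software more concrete, here is a minimal sketch in Python. It is only a toy illustration, not Hadoop's actual mechanism: the names `run_on_worker`, `run_with_failover`, `WorkerFailure`, and the `node-*` labels are all hypothetical, and worker failure is simulated with a plain set.

```python
class WorkerFailure(Exception):
    """Raised when a (simulated) worker cannot finish a task."""


def run_on_worker(worker, task, failed_workers):
    # Hypothetical stand-in for remote execution: a worker listed in
    # failed_workers always fails; any other worker runs the task locally.
    if worker in failed_workers:
        raise WorkerFailure(worker)
    return task()


def run_with_failover(task, workers, failed_workers):
    """Try each worker in turn; return (worker, result) from the first success."""
    for worker in workers:
        try:
            return worker, run_on_worker(worker, task, failed_workers)
        except WorkerFailure:
            # The failure is detected in software, so move on to the next machine.
            continue
    raise RuntimeError("all workers failed")


worker, result = run_with_failover(
    task=lambda: sum(range(10)),
    workers=["node-1", "node-2", "node-3"],
    failed_workers={"node-1"},
)
print(worker, result)
```

The point of the sketch is that the caller still gets a result even though one machine is down, because the retry logic lives in the application layer rather than in the hardware.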

2. Components

Hadoop Common: The common utilities that support the other Hadoop modules. As a rough analogy, it corresponds to the shared system components of a single computer, except that here they serve a distributed cluster.
Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data. As a rough analogy, it corresponds to the file system of a single computer, except that it is distributed; it is also loosely comparable to the memory and cache storage hierarchy of a GPGPU.
Hadoop YARN: A framework for job scheduling and cluster resource management. It is loosely comparable to the task-scheduling module of a CPU, or to the cooperative stream-scheduling system of a GPGPU.
Hadoop MapReduce: A YARN-based system for parallel processing of large datasets.
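The MapReduce programming model can be sketched in a few lines of plain Python. This is an in-memory illustration of the map, shuffle, and reduce phases using the classic word-count example; it does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are chosen here for illustration only.

```python
from collections import defaultdict


def map_phase(line):
    # Emit a (word, 1) pair for every word in one input line.
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    # Group all values by key, which is what happens between map and reduce.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    # Combine all counts collected for one word.
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["hadoop stores data", "hadoop processes data"])
print(counts)
```

In a real cluster the map and reduce calls would run in parallel on many machines, with YARN allocating the resources and HDFS holding the input and output; here everything runs in a single process to keep the data flow visible.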

3. Related Projects

Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop. Ambari also provides a dashboard for viewing cluster health, such as heatmaps, and the ability to visually inspect MapReduce, Pig, and Hive applications, along with features to diagnose their performance characteristics in a user-friendly manner.
Avro™: A data serialization system.
Cassandra™: A scalable multi-master database with no single points of failure.
Chukwa™: A data collection system for managing large distributed systems.
HBase™: A scalable, distributed database that supports structured data storage for large tables.
Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: A scalable machine learning and data mining library.
Ozone™: A scalable, redundant, and distributed object store for Hadoop.
Pig™: A high-level data-flow language and execution framework for parallel computation.
Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
Submarine: A unified AI platform that allows engineers and data scientists to run machine learning and deep learning workloads in distributed clusters.
Tez™: A generalized data-flow programming framework built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez is being adopted by Hive™, Pig™, and other frameworks in the Hadoop ecosystem, as well as by other commercial software (e.g., ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.
ZooKeeper™: A high-performance coordination service for distributed applications.
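The phrase "an arbitrary DAG of tasks" in the Tez entry can be illustrated with a small Python sketch. It uses the standard-library `graphlib` module (Python 3.9+) rather than the Tez API, and the task names in `dag` are hypothetical; the sketch only shows how a task becomes runnable once all of its upstream tasks have finished.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dag = {
    "load": set(),
    "filter": {"load"},
    "aggregate": {"filter"},
    "join": {"filter"},
    "report": {"aggregate", "join"},
}


def run_dag(dag):
    """Return the tasks in an order that respects every dependency."""
    order = []
    sorter = TopologicalSorter(dag)
    sorter.prepare()
    while sorter.is_active():
        # Tasks whose inputs are all complete; sorted for a stable order.
        ready = sorted(sorter.get_ready())
        for task in ready:
            order.append(task)  # a real engine would execute the task here
            sorter.done(task)
    return order


order = run_dag(dag)
print(order)
```

Tasks that become ready in the same round, such as `aggregate` and `join` here, have no dependency on each other, which is where a DAG engine can run work in parallel.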

