Understanding Apache Hadoop Ecosystem and Components

25 Aug 2022

Hadoop is an ecosystem of Apache open-source projects and a wide range of commercial tools and solutions that fundamentally change the way big data is stored, processed, and analyzed. The most popular open-source projects in the Hadoop ecosystem include Spark, Hive, Pig, Oozie, and Sqoop.

Apache Hadoop Ecosystem Projects

  1. Apache Spark

    A fast, open-source engine for large-scale data processing. It supports SQL, data streaming, machine learning, and graph processing.

  2. Apache Hive

    A data warehouse that runs on top of Apache Hadoop. Apache Hive provides SQL-like syntax for reading, writing, and managing large structured datasets stored in distributed storage.

  3. Impala

    An open-source, parallel-processing SQL query engine that runs on Apache Hadoop and is used for querying data stored in HDFS and Apache HBase.

  4. Apache Drill

    An open-source, schema-free SQL query engine for Hadoop, NoSQL, and cloud storage.

  5. Apache HBase

    An open-source, non-relational, distributed database that runs on top of HDFS. HBase is used for random, real-time read/write access to your big data.

  6. Spark MLlib

    A scalable machine learning library built on top of Spark Core.

  7. Mahout

    Mahout is a machine learning library used for clustering, classification, and collaborative filtering of data. It is built on top of distributed data processing systems such as MapReduce.

  8. R

    R is a programming language used for data visualization, statistical computation, and data analysis.

  9. Apache Solr

    An open-source search platform used for full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, and more.

  10. Apache Pig

    A high-level platform for handling any kind of data that runs on Hadoop. Programs are written in the Pig Latin language, which lets us spend less time writing MapReduce programs for analyzing large data sets.

  11. Apache Kafka

    A distributed publish-subscribe messaging system designed for processing real-time activity stream data (logs, social media streams).

  12. Apache Sqoop

    A tool to transfer bulk data between Hadoop and structured data stores such as relational databases.

  13. Apache Storm

    A distributed real-time processing system for analyzing streams of data. Storm does for real-time processing what Hadoop did for batch processing.

  14. Apache ZooKeeper

    An open source configuration, synchronization and naming registry service for large distributed systems.

  15. Apache Ambari

    An open-source, web-based management tool for provisioning, managing, and monitoring the health of Apache Hadoop clusters.

  16. Cloudera Manager

    A commercial administration tool provided by Cloudera Inc. for provisioning, managing, and monitoring the health of Hadoop clusters.
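Several of the projects above are easiest to grasp through their core abstraction. Kafka's, for example, is a per-topic, append-only message log that each consumer group reads at its own pace. The following pure-Python sketch illustrates that publish-subscribe model; it is a toy for explanation only, not the real Kafka API, and all class and method names here are invented:

```python
from collections import defaultdict

# Toy publish-subscribe broker illustrating Kafka's model (NOT the Kafka API).
class ToyBroker:
    def __init__(self):
        self.topics = defaultdict(list)   # topic -> append-only message log
        self.offsets = defaultdict(int)   # (group, topic) -> next unread offset

    def publish(self, topic, message):
        # Producers only ever append to the end of a topic's log.
        self.topics[topic].append(message)

    def poll(self, group, topic):
        # Each consumer group tracks its own offset into the topic log,
        # so independent groups can read the same stream at different speeds.
        offset = self.offsets[(group, topic)]
        messages = self.topics[topic][offset:]
        self.offsets[(group, topic)] = len(self.topics[topic])
        return messages

broker = ToyBroker()
broker.publish("clicks", {"user": "a", "page": "/home"})
broker.publish("clicks", {"user": "b", "page": "/docs"})

print(broker.poll("analytics", "clicks"))  # both messages
print(broker.poll("analytics", "clicks"))  # [] -- this group is caught up
```

Because the log is retained rather than deleted on delivery, a brand-new consumer group can start at offset 0 and replay the whole stream, which is one reason Kafka suits real-time activity data.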

Hadoop Components

There are three main core components of the Apache Hadoop framework: HDFS, MapReduce, and YARN.

  1. HDFS

    The primary storage system of Hadoop. HDFS stores very large files on a cluster of commodity hardware. It works on the principle of storing a small number of large files rather than a huge number of small files.

  2. MapReduce

    A software programming model for processing large data sets stored in HDFS. It processes huge amounts of data in parallel.

  3. YARN

    YARN is the processing framework in Hadoop. It provides resource management and allows multiple data processing engines, for example real-time streaming, data science, and batch processing, to run on the same cluster.
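The MapReduce model above can be sketched in plain Python. This is a single-process illustration of the three phases (map, shuffle, reduce) applied to the classic word-count problem; real Hadoop distributes each phase across the cluster and exposes a Java API, so the function names below are ours, not Hadoop's:

```python
from collections import defaultdict

def map_phase(documents):
    # Map: emit (key, value) pairs from each input record.
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: group all values by key (Hadoop's framework does this for you).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: combine the values for each key into a final result.
    return {key: sum(values) for key, values in groups.items()}

docs = ["hadoop stores big data", "hadoop processes big data"]
word_counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(word_counts)
# {'hadoop': 2, 'stores': 1, 'big': 2, 'data': 2, 'processes': 1}
```

The parallelism comes from the fact that map calls are independent per record and reduce calls are independent per key, so Hadoop can run many of each at once on different nodes against data stored in HDFS.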

Commercial Apache Hadoop Distributions

Hadoop is a free, open-source platform and can be downloaded from www.hadoop.apache.org. There are also commercial distributions that combine the core Hadoop technology with additional features, functionality, and documentation. The most popular commercial distributions of Hadoop include Cloudera, Hortonworks, and MapR, covering Hadoop development, production, and maintenance tasks.

  1. Cloudera

    Cloudera Inc. was founded in 2008 by big data experts from Facebook, Google, Oracle, and Yahoo. It was the first company to develop and distribute Apache Hadoop-based software, much of which is now released as open source.

  2. MapR

    MapR was founded in 2009 and is one of the leading Hadoop vendors. It provides an Apache Hadoop distribution, a distributed file system, a database management system, a set of data management tools, and other related software.

  3. Hortonworks

    Hortonworks was founded in 2011 and quickly emerged as one of the leading Hadoop vendors. It provides an open-source platform based on Apache Hadoop for analyzing, storing, and managing big data.

What do you think?

Thank you for your time. I hope you enjoyed this article and found it useful. Please add your comments and questions below; your feedback is always welcome.

About Author
Shailendra Chauhan (Microsoft MVP, Founder & CEO at Scholarhat by DotNetTricks)

Shailendra Chauhan is the Founder and CEO of ScholarHat by DotNetTricks, a well-known e-learning brand. He provides training and consultation on an array of technologies such as Cloud, .NET, Angular, React, Node, Microservices, Containers, and mobile app development. He has been awarded the Microsoft MVP award eight times in a row (2016-2023). He has changed many lives with his writings and unique training programs, and several of his sought-after books have helped job aspirants crack tough interviews with ease.
