
Understanding Big Data and Hadoop

Author: Shubham Pandey
Posted On: 19 Sep 2017
Updated On: 21 Sep 2017
 

Big Data refers to collections of datasets so large that they cannot be stored, processed, or analyzed using traditional processes or tools, whether because of cost or because suitable mechanisms simply do not exist. Managing, maintaining, and analyzing Big Data becomes even more complex when very large volumes of data arrive from multiple sources.

This calls for a Big Data solution that offers real-time storage, processing, and analysis capabilities, so organizations can extract the meaningful, useful, and vital information they need to make decisions.


Characteristics of Big Data

In 2001, META Group (now Gartner) analyst Doug Laney described the challenges and opportunities of data growth along three dimensions, now widely known as the 3Vs of Big Data.

Volume

Volume refers to the vast amount of data generated every second, which organizations analyze to improve decision-making. Big Data solutions typically store and query hundreds of terabytes of data, and the total volume is probably growing by roughly ten times every five years.

For example, Facebook alone generates about 25 TB of data every day, and the New York Stock Exchange produces roughly 1 TB of new trade data per day. The storage capacity needed for Big Data has been growing fast, from terabytes to petabytes and from petabytes towards zettabytes.

Variety

Variety refers to the different types of data: structured, semi-structured, and unstructured. In the past, most data that was created was structured data that fits neatly into tables or relational databases, such as financial data. Today, around 90% of the data generated by an organization is unstructured data (text, images, video, voice, etc.).

This wide variety of data requires different approaches, techniques, and tools to store and analyse the raw data for business use. It means that applying a schema to the data before or during storage is no longer always a practical proposition.
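To make the three categories concrete, here is a minimal Python sketch. The field names and sample records are invented for illustration; the point is only that each kind of data demands a different access pattern:

```python
import json

# Structured: a fixed schema that fits a relational table (hypothetical columns)
structured_row = {"account_id": 1001, "balance": 2500.75, "currency": "USD"}

# Semi-structured: self-describing JSON whose fields can vary from record to record
semi_structured = json.loads(
    '{"user": "alice", "tags": ["ad", "click"], "extra": {"ref": "email"}}'
)

# Unstructured: free text with no schema at all
unstructured = "Great product but the delivery took two weeks."

print(structured_row["balance"])         # direct column-style access
print(semi_structured.get("extra", {}))  # defensive navigation of optional fields
print(len(unstructured.split()))         # text must be tokenized or parsed first
```

Structured data can be queried directly; semi-structured data must be navigated defensively because fields may be missing; unstructured data has to be parsed or tokenized before any analysis can start.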

Figure: The 3Vs of Big Data

Velocity

Velocity refers to the speed at which data is generated, stored, analyzed, and visualized. Big Data solutions allow you to analyze data while it is being generated (sometimes referred to as in-memory analytics), without ever putting it into a database.

For example, just think of social media like Facebook or Twitter messages going viral in seconds.

In the past, batch processing was common practice: an update arrived from the database every night, or even every week, because computers and servers needed substantial time to process the data and update the databases. In the Big Data era, data is created, analyzed, and visualized in real time or near real time.
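The contrast between the two styles can be sketched in a few lines of Python. This is an illustration of the idea only, not a Big Data tool, using a hypothetical stream of numeric events:

```python
def batch_average(records):
    # Batch style: wait for the complete dataset, then compute once
    return sum(records) / len(records)

def streaming_average(stream):
    # Streaming style: update the running result as each event arrives
    total, count = 0.0, 0
    for value in stream:
        total += value
        count += 1
        yield total / count  # an answer after every event, not at day's end

events = [10, 20, 30, 40]
print(batch_average(events))            # 25.0 - one answer, after all data is in
print(list(streaming_average(events)))  # [10.0, 15.0, 20.0, 25.0]
```

The batch version gives one answer only after all the data has landed; the streaming version yields an up-to-date answer after every event, which is the essence of velocity.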

Big Data Challenges

The major challenges associated with Big Data are as follows:

  1. Storage

  2. Cost

  3. Processing

  4. Querying

  5. Sharing

  6. Analysis

  7. Presentation

Hadoop

Apache Hadoop is an open-source software platform for storing, processing, and analyzing extremely large data sets in a distributed environment, using clusters of computers. Hadoop grew out of the Nutch search engine project created by Doug Cutting and Mike Cafarella, and became a separate project in 2006; it was inspired by Google's research paper "MapReduce: Simplified Data Processing on Large Clusters".


The name Hadoop came from Doug Cutting's son's yellow plush toy elephant. Hadoop 0.1.0 was released in April 2006, and after many years of development, Hadoop 1.0 was released in November 2012 as part of the Apache project sponsored by the Apache Software Foundation.
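The MapReduce model that the Google paper describes, and that Hadoop implements in Java, can be illustrated with a minimal in-memory word-count sketch in Python. The function names here are my own, and a real Hadoop job would run the map and reduce phases in parallel across the cluster rather than in one process:

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in an input split
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    # Shuffle: group all values by key, so each reducer sees one word's counts
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each word's list of counts into a total
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big cluster", "data moves to the cluster"]
pairs = [pair for doc in docs for pair in map_phase(doc)]
print(reduce_phase(shuffle(pairs)))
# {'big': 2, 'data': 2, 'cluster': 2, 'moves': 1, 'to': 1, 'the': 1}
```

The shape is the important part: mappers work independently on their own slice of the input, and reducers work independently on their own keys, which is what lets Hadoop scale the same program from one machine to thousands.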

Advantages of Hadoop

  1. Scalable

    Hadoop is a highly scalable data storage and processing platform, because it distributes work across thousands of nodes in a cluster that operate in parallel. Unlike a traditional RDBMS, which struggles to scale to such large amounts of data, Hadoop enables applications to run on thousands of nodes over thousands of terabytes of data.

  2. Fast

    Unlike a traditional RDBMS, Hadoop's data processing is very fast, because it is based on a distributed file system that essentially 'maps' data wherever it is located on the cluster. The data processing tools run on the same servers where the data is stored, resulting in much faster processing. Hadoop can efficiently process terabytes of data in minutes, and petabytes in hours.

  3. Cost effective

    Unlike a traditional RDBMS, which is extremely expensive to scale up to such massive volumes of data, Hadoop is open source and runs on low-cost commodity hardware.

  4. Reliable

    The main advantage of Hadoop is its fault tolerance. When data is sent to an individual node, it is also replicated to other nodes in the cluster, so if a node fails, processing is redirected to the remaining nodes that hold a copy of the data. This makes Hadoop extremely reliable.

  5. Flexible

    Hadoop is flexible in data processing: you can process different types of data (structured, semi-structured, and unstructured) to generate value from them. In this way, Hadoop can derive valuable business insights from data sources such as social media, email conversations, or clickstream data.
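The fault tolerance described under "Reliable" above can be sketched as a toy in Python. The placement logic, block id, and node names are invented for illustration; real HDFS placement is rack-aware and considerably more sophisticated, but the idea of spreading replicas so that reads survive a node failure is the same:

```python
REPLICATION_FACTOR = 3  # HDFS replicates each block three times by default

def place_replicas(block_id, nodes, replication=REPLICATION_FACTOR):
    # Toy placement: derive a deterministic starting node from the block id,
    # then take the next `replication` distinct nodes in rotation
    start = sum(map(ord, block_id)) % len(nodes)
    rotated = nodes[start:] + nodes[:start]
    return rotated[:replication]

def readable_copies(replicas, failed_nodes):
    # When a node fails, reads and processing are redirected to the survivors
    return [n for n in replicas if n not in failed_nodes]

nodes = ["node1", "node2", "node3", "node4", "node5"]
replicas = place_replicas("block-0042", nodes)
print(replicas)                                  # ['node2', 'node3', 'node4']
print(readable_copies(replicas, {replicas[0]}))  # ['node3', 'node4']
```

Because every block lives on three distinct nodes, losing any single node still leaves two readable copies, which is why a failed node can simply be routed around.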

What do you think?

Thank you for your time; I hope you enjoyed this article and found it useful. Please add your comments and questions below. Your valuable feedback is always welcome.



ABOUT AUTHOR

Shubham Pandey
Author, Trainer and Developer Evangelist

A Big Data Hadoop developer with over 7 years of solid real-time development experience in the banking, YouTube, and airlines domains, with in-depth exposure to Python, Java, Scala, and the Hadoop ecosystem: HDFS, Hive, Spark, HBase, Impala, Apex, and more. Shubham Pandey has gained vast experience working with leading Indian MNCs and has delivered corporate and online training across around 52 countries, covering Hadoop administration, Hadoop development, AWS development, data science in R, project support, and consultancy on digital technologies (Big Data, cloud, IoT, data science). He is a passionate, energetic trainer who believes in delivering real-time, industry-focused training on live data, providing hands-on expertise to make industry-ready professionals.
