Deep dive into Big Data: Hadoop (Part 1)

Accredian Publication

5 min read · Mar 24, 2022

by Pronay Ghosh and Hiren Rupchandani

Around 90% of the world’s data was created in the last two years, according to estimates.
Furthermore, 80 percent of the data is unstructured or available in a variety of forms, making analysis challenging.
That gives you an idea of how much data is being generated.
Such a massive volume of data is a substantial challenge in itself, but an even greater one stems from the fact that the data is not organised.
It includes images, live streaming data, videos, sensor records, and GPS tracking information.
In a nutshell, it’s data that hasn’t been structured.
Traditional systems work well with organised data (up to a certain scale).
But they can’t handle massive amounts of unstructured data.
One could wonder: why should anyone care about storing and processing this data? What is the aim?
The answer is that we need this information to make more informed and calculated decisions in whatever field we work in.
Forecasting is not a new concept in business.
Businesses have always done it, but with very little information.
To stay ahead of the competition, businesses must now use this data to make better-informed decisions.
These decisions range from anticipating consumer preferences to preventing fraud before it happens.
Professionals from all fields may discover their own motives for analysing this information.

The V’s of Big Data:

To figure out whether your next project needs a big data system, look at the data your app will generate and check it for the following characteristics.
In the big data world, these are commonly known as the 5 Vs; four of them are covered here.

Volume

Volume is the most obvious piece of the Big Data pie.
The internet-mobile cycle, which has brought with it a deluge of social media updates, has flooded every business with data.
This data can be incredibly valuable if you know how to deal with it.

Variety

Keeping all your data as structured rows in SQL tables is no longer the norm.
Most of the data produced today (80 to 90 percent, by common estimates) is unstructured.
This means it comes in all shapes and sizes.
It ranges from text, which can be analysed for content and sentiment, to visual data like images and videos.

Velocity

Every minute, people around the world upload 200 hours of video to YouTube, send 300,000 tweets, and send over 200 million emails.
And as internet speeds improve, these numbers will continue to rise.

Veracity

Veracity refers to the uncertainty of the data available to marketers.
It can also refer to the variability of data streams.
These can change at any time, making it difficult for businesses to respond swiftly and correctly.

Google’s impact in solving Big Data Explosion

This problem first caught Google’s attention because of its search engine data, which exploded as the internet industry took off.
Precise figures for that growth are hard to confirm.
They cleverly solved the problem by applying the principles of parallel processing.
They came up with a programming model called MapReduce.
It divides the workload into small chunks and distributes them across a network of computers.
It then combines the partial results from those computers to produce the final result.
This makes sense when you consider that I/O is the most expensive operation in data processing.
Database systems used to store data on a single machine, and you would send them commands in the form of SQL queries when you needed data.
These systems retrieve the data from storage, load it locally, process it, and return the result to the end user.
This works well as long as the amount of data is limited and a single machine has enough computing power.
When it comes to Big Data, though, you can’t fit all of the data on a single machine.
You have to spread it across many machines (possibly thousands of them).
And, because of the high I/O cost, you can’t pull all the data into a single location every time you need to execute a query.
So, MapReduce handles your query by sending the work to the separate nodes where the data lives, running it there, then aggregating the partial results into a final result and returning it to you.
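To make the idea concrete, here is a minimal single-machine sketch of the map-then-aggregate pattern, written in plain Python. It is not Hadoop code: the `partitions` list, the sample sentences, and the function names `map_partition` and `merge_counts` are all made up for illustration, and each inner list simply stands in for the data held on one node.

```python
from collections import Counter
from functools import reduce

# Hypothetical partitions: each inner list stands in for the data held on one node.
partitions = [
    ["big data needs big storage", "hadoop stores data"],
    ["hadoop processes data in parallel"],
    ["big clusters process big data"],
]


def map_partition(lines):
    """Map step: count words using only the lines stored on one node."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine two partial results into one."""
    return left + right


# Each "node" works on its own data; only the small partial counts are combined.
partial_counts = [map_partition(lines) for lines in partitions]
word_counts = reduce(merge_counts, partial_counts, Counter())

print(word_counts.most_common(3))
```

In a real cluster the map step would run on different machines at the same time. Here it runs in a simple loop, which is enough to show that the final answer is built from per-node partial results rather than from the raw data.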

This approach has two key benefits:
low I/O costs, because very little data has to move, and reduced processing time, because your operation is parallelized across numerous machines that each work on a smaller data set.
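The first benefit can be illustrated with a rough back-of-the-envelope comparison. The sketch below uses made-up log lines on three hypothetical nodes and treats the JSON-encoded size of an object as a stand-in for the bytes sent over the network. The numbers it prints only illustrate the principle and are not measurements of any real system.

```python
import json
from collections import Counter

# Hypothetical log lines held on three nodes (made-up data for illustration).
node_data = {
    "node-1": ["error disk full"] * 1000,
    "node-2": ["ok request served"] * 1500,
    "node-3": ["error timeout", "ok request served"] * 500,
}


def size_in_bytes(obj):
    """Rough size of an object if it were serialised as JSON and sent over the network."""
    return len(json.dumps(obj).encode("utf-8"))


# Centralised approach: every raw line is shipped to one machine before processing.
bytes_centralised = sum(size_in_bytes(lines) for lines in node_data.values())

# MapReduce-style approach: each node counts locally and ships only its summary.
local_summaries = {
    node: Counter(line.split()[0] for line in lines)
    for node, lines in node_data.items()
}
bytes_distributed = sum(
    size_in_bytes(dict(summary)) for summary in local_summaries.values()
)

# Combining the summaries gives the same answer as processing all lines in one place.
total = sum(local_summaries.values(), Counter())

print(bytes_centralised, bytes_distributed, dict(total))
```

Both approaches arrive at the same status counts, but the second one ships only a tiny summary per node instead of every raw line.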

The debut of Hadoop

Hadoop is a Java-based open-source framework that allows massive data sets to be processed in a distributed computing environment.
Its design was inspired by Google’s work on MapReduce and the Google File System (GFS).
Hadoop enables users to take advantage of the opportunities presented by Big Data while also overcoming the hurdles that come with it.

Why Hadoop ?

Hadoop runs applications on distributed systems with thousands of nodes and petabytes of data.
It has a distributed file system, known as the Hadoop Distributed File System or HDFS.
HDFS allows for quick data transfer between nodes.

Conclusion:

In this article, we covered an overview of Hadoop and why it came into the picture.
In the next article, we will learn about the architecture of Hadoop and its various key components.

Follow us for more upcoming articles related to Data Science, Machine Learning, and Artificial Intelligence.

Also, do give us a clap 👏 if you find this article useful; your encouragement inspires us to create more cool stuff like this.