Created by Doug Cutting, Hadoop software allows large sets of data to be broken into smaller pieces so that each piece can be processed individually. Once all of the smaller data sets have been analyzed, the results are recombined to provide insights into the original data.
The beauty of the Hadoop approach is that the processing can be distributed: it can run on a single computer or spread out over thousands of machines, even in different locations. Each of the new, smaller data sets can be worked on independently of the others. The Cloud is ideally suited for this type of task, as the Cloud is fundamentally a collection of data storage and computational nodes.
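The split, process, and recombine pattern described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Hadoop's actual API: the function names (`split_into_chunks`, `count_words`, `combine`, `word_count`) are made up for this example, and a local thread pool stands in for the machines of a cluster.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, chunk_size):
    """Break the full data set into smaller, independent pieces."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def count_words(chunk):
    """Process one piece on its own; it needs nothing from the other pieces."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def combine(partial_counts):
    """Merge the per-piece results into one answer for the original data."""
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total


def word_count(lines, chunk_size=2, workers=4):
    """Split, process each piece in parallel, then recombine."""
    chunks = split_into_chunks(lines, chunk_size)
    # Assumption: threads on one machine stand in for nodes in a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    return combine(partials)


if __name__ == "__main__":
    data = ["big data is big", "data about data", "hadoop splits data"]
    print(word_count(data).most_common(2))
```

Because `count_words` never looks outside its own chunk, the same work could be handed to one worker or to many without changing the result, which is the property that lets the approach scale.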
The cost of analyzing Big Data drops substantially in a Cloud computing environment compared with using a single, massive supercomputer. Companies can create their own Private Cloud by assembling racks of low-cost servers and disk storage; the Hadoop software will take care of breaking down the data, spreading it across the servers and handling any hardware failures. Alternatively, companies can pay to use a Public Cloud, such as those offered by Amazon or Google.
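To see why spreading copies of the data across low-cost servers helps with hardware failures, here is a small Python sketch. It assumes each block of data is copied to three different servers (three is the default replication factor in Hadoop's file system) and uses a simple round-robin placement. That placement rule and the names `place_blocks` and `surviving_replicas` are assumptions made for illustration; Hadoop's real placement policy is more involved.

```python
def place_blocks(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes, round-robin.

    Illustrative only: this is not Hadoop's actual placement policy.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for block in range(num_blocks):
        placement[block] = [
            nodes[(block + offset) % len(nodes)] for offset in range(replication)
        ]
    return placement


def surviving_replicas(placement, failed_nodes):
    """For each block, list the nodes that still hold a copy after failures."""
    return {
        block: [node for node in replicas if node not in failed_nodes]
        for block, replicas in placement.items()
    }


if __name__ == "__main__":
    servers = ["node1", "node2", "node3", "node4"]
    layout = place_blocks(num_blocks=4, nodes=servers)
    after_failure = surviving_replicas(layout, failed_nodes={"node1"})
    for block, replicas in after_failure.items():
        print(block, replicas)
```

In this toy layout, losing any single server still leaves at least two copies of every block, so processing can continue while the missing copies are rebuilt elsewhere.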
It was Google that first developed the basic technology of partitioning Big Data so that it can be spread over clusters of computers and memory arrays. Google needed a quick, inexpensive way to index all of the data it was collecting from the entire World Wide Web and to present that data as meaningful results for search requests. Hadoop was created as a framework to run analytics on Big Data. It was also designed to work with information that is complex and does not fit neatly into the rows and columns of a tabular spreadsheet.
From these requirements, the open source Apache Hadoop software library was created. It was the first implementation that made it easy to analyze large data sets without relying on expensive, proprietary hardware. More importantly, since this approach uses distributed parallel processing and its scale is limited only by the number of nodes in the cloud, in principle no data set is too big. That is critical when considering the exponential growth of data on the web.
Hadoop can handle all types of data from disparate systems: structured, unstructured, log files, pictures, audio files, communications records, email, and just about anything else you can think of, regardless of its native format.
Even when different types of data have been stored in unrelated systems, you can load it all into your Hadoop cluster with no prior need for a schema. In other words, you don’t need to know how you intend to query your data before you store it; Hadoop lets you decide later, and over time it can reveal questions you never even thought to ask.
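The idea of storing first and deciding on a schema later can be illustrated with a short Python sketch. It is not Hadoop code: the sample records, the `RAW_STORE` list and the `read_with_schema` function are invented for this example, and JSON is assumed as the record format only because it is easy to show.

```python
import json

# Raw records are stored exactly as they arrived; no schema was declared up front.
RAW_STORE = [
    '{"user": "ann", "action": "login", "ms": 120}',
    '{"user": "bob", "action": "search", "query": "hadoop"}',
    "2013-01-01 ERROR disk full",  # not JSON at all, but still stored
]


def read_with_schema(raw_records, fields):
    """Apply a schema at read time by picking only the fields wanted now.

    Records that cannot be parsed as JSON objects are skipped for this query,
    but they remain in the store for other queries to interpret differently.
    """
    for raw in raw_records:
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        # Missing fields become None instead of causing an error.
        yield {field: record.get(field) for field in fields}


if __name__ == "__main__":
    # Today's question: how long did each user's action take?
    for row in read_with_schema(RAW_STORE, ["user", "ms"]):
        print(row)
    # A later question can use different fields without reloading any data.
    for row in read_with_schema(RAW_STORE, ["user", "query"]):
        print(row)
```

The two queries at the bottom read the same stored records with different field lists, which is the sense in which the question can be decided after the data has been stored.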
By making all of your data usable, not just what’s in your databases, Hadoop lets you see relationships that were hidden before and reveals answers that have always been just out of reach. You can start making more decisions based on hard data instead of hunches and look at complete data sets, not just samples.
Readers often ask where names like “Hadoop” come from. In this case, it’s the name Doug Cutting’s son gave to his stuffed toy elephant.
(Image Credit – Cloudera )