Big Data

Touching on one of today's hot topics is simply fun, and big data is one of them. Why is it so important to know? What exactly is it? How is it useful? Curious to know? 🙂 Read on.
From the beginning, it has always been good to save things for the future. Look at history: the holy books, the pictures of our great leaders, spiritual gurus, and freedom fighters, and their weapons are still preserved in places like museums, and today they are of great value. Even in daily life, students keep notes of their studies for better exam preparation. Researchers write and save their papers to pass their work on to new researchers. All of this was going well and was well managed until the revolution known as the internet arrived.
What is the best thing about using the internet? Information! Having information and sharing information. But this information, or data, is increasing exponentially, and that has data scientists scratching their heads. That is what we call big data. Some statistics about big data:

  • Big data sizes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single data set.
  • If all sensor data from the LHC were recorded, the data flow would exceed an annual rate of 150 million petabytes, or nearly 500 exabytes per day, before replication.
  • When the Sloan Digital Sky Survey (SDSS) began collecting astronomical data in 2000, it amassed more in its first few weeks than all data collected in the history of astronomy. Continuing at a rate of about 200 GB per night, SDSS has amassed more than 140 terabytes of information.
  • The NASA Center for Climate Simulation (NCCS) stores 32 petabytes of climate observations and simulations on the Discover supercomputing cluster.
  • Facebook handles 50 billion photos from its user base.
  • The volume of business data worldwide, across all companies, doubles every 1.2 years, according to estimates.
  • and there are a lot more…
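
To get a feel for what "doubles every 1.2 years" means, here is a minimal sketch of the arithmetic. It assumes steady doubling, which is a simplification, and the function name `growth_factor` is made up for illustration.

```python
def growth_factor(years: float, doubling_time: float = 1.2) -> float:
    """Return how many times larger the data volume is after `years`,
    assuming it doubles once every `doubling_time` years."""
    return 2 ** (years / doubling_time)


if __name__ == "__main__":
    # Print the growth after one, two and five doubling periods.
    for years in (1.2, 2.4, 6.0):
        print(f"after {years} years: about x{growth_factor(years):.0f}")
```

Under this assumption, the growth over any period is just 2 raised to the number of doubling periods that fit into it.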

So, according to Wikipedia, “Big data is an all-encompassing term for any collection of data sets so large and complex that it becomes difficult to process using on-hand data management tools or traditional data processing applications.”
The above-mentioned statistics also give an idea of the major sources of big data. But how does it become so complex?
The 4 Vs define the model of big data:


Volume refers to the sheer amount of data. It used to be humans who created the data, but now, as the statistics above show, big data includes the data generated by networks, mobile sensing devices, human interaction with social media and mobile devices, machines, business processes, etc.


Variety refers to the many sources and types of data, both structured and unstructured. We used to store data from sources like spreadsheets and databases. Now data comes in the form of emails, photos, videos, graphs, tweets, monitoring devices, wearable electronics, PDFs, audio, etc.


Velocity means the pace at which the data flows in from different sources. The flow of data is massive and continuous, and this real-time data can help researchers and businesses make valuable decisions that provide strategic and competitive advantages.


Big data veracity refers to the accuracy of data. Is the data that is being stored and mined meaningful to the problem being analyzed, or is it inaccurate and of no use?
Also have a look at this amazingly well-defined infographic.

By definition, big data cannot be managed by our traditional data management tools such as an RDBMS.
So data scientists had to introduce a new technology: NoSQL (Not only SQL).
NoSQL is a non-relational and largely distributed kind of database system that was developed to address the challenges created by big data, i.e. the analysis of extremely high volumes of disparate data types. It is the main alternative to relational databases, with scalability, availability, and fault tolerance being the key deciding factors.

Advantages of NoSQL Databases over Relational Databases

Growth of big data: It is well said that if your data is not growing, your business is not growing. Storing and analyzing data is really important for business growth. It helps you make decisions about investments, marketing, sales, etc. NoSQL databases allow the data to grow by making it easy to store and manage.
Fault Tolerance: In today’s marketplace, where the competition is just a click away, downtime can be deadly to a company’s bottom line and reputation, and a single hardware failure can bring such a day at any time without warning. Fortunately, NoSQL database environments are built with a distributed architecture, so there is no single point of failure and there is built-in redundancy of both function and data. If one or more database servers, or nodes, go down, the other nodes in the system can continue operating without data loss, which is true fault tolerance. In this way, NoSQL database environments can provide continuous availability whether in a single location, across data centers, or in the cloud.
Real Location Independence: The term ‘location independence’ means the ability to read from and write to a database regardless of where that I/O operation physically occurs, and to have any write propagated out from that location so that it is available to users and machines at other sites. Such functionality is very difficult to architect for relational databases.
Flexible Data Models: The database schema of a relational database is strict and uniform. Such models can cause scalability and performance problems when managing the large data volumes that are becoming a fact of life in modern IT and business environments. NoSQL data models are often referred to as schema-less. They accept all types of data (structured, semi-structured, and unstructured) much more easily than a relational database. Performance also comes into play with an RDBMS data model, especially where rows are wide and updates are frequent. A NoSQL data model typically handles such situations well and can deliver fast performance for both read and write operations.
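The difference between a strict schema and a schema-less model can be sketched in a few lines of Python. This is only an illustration with plain lists and dicts standing in for a table and a document collection; the names `USER_COLUMNS`, `insert_row` and `insert_document` are made up and do not belong to any real database API.

```python
from typing import Any

# Relational style: every row must supply exactly these columns (hypothetical schema).
USER_COLUMNS = ("id", "name", "email")


def insert_row(table: list[tuple], row: dict[str, Any]) -> None:
    """Reject any row whose fields differ from the fixed schema."""
    if set(row) != set(USER_COLUMNS):
        raise ValueError(f"row must have exactly the columns {USER_COLUMNS}")
    table.append(tuple(row[column] for column in USER_COLUMNS))


def insert_document(collection: list[dict[str, Any]], doc: dict[str, Any]) -> None:
    """Schema-less style: accept a document with any set of fields."""
    collection.append(dict(doc))


if __name__ == "__main__":
    table: list[tuple] = []
    collection: list[dict[str, Any]] = []

    insert_row(table, {"id": 1, "name": "Asha", "email": "asha@example.com"})
    insert_document(collection, {"id": 1, "name": "Asha"})
    # A document with extra, nested fields is accepted as it is.
    insert_document(collection, {"id": 2, "tags": ["photo", "video"], "geo": {"lat": 30.9}})

    try:
        insert_row(table, {"id": 2, "name": "Ravi", "email": "r@example.com", "phone": "123"})
    except ValueError as error:
        print("relational insert rejected:", error)
    print(len(table), "row(s),", len(collection), "document(s)")
```

In the relational version the extra `phone` field would first need a schema change, while the document version simply stores whatever fields arrive.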
Analytics and Business Intelligence: A key strategic driver for implementing a NoSQL database environment is the ability to mine the data being collected, so as to derive insights that put your business at a competitive advantage. Extracting meaningful business intelligence from very high volumes of data is very difficult to achieve with traditional relational database systems. Modern NoSQL database systems not only store and manage business application data but also provide integrated analytics that help you understand complex data sets quickly and support flexible decision-making.
Now it is time to discuss the data management tools that use the non-relational data model, i.e. NoSQL.

NoSQL Data Models

For a well-illustrated explanation of NoSQL data models, refer to this.


Hadoop is open source software from the Apache Software Foundation that is used to store, process, index, and analyze large data sets. Hadoop enables distributed parallel processing of huge amounts of data across inexpensive, industry-standard servers that both store and process the data, and it scales out by adding more servers. With Hadoop, no data is too big. And in today’s hyper-connected world, where more and more data is being created every day, Hadoop’s breakthrough advantages mean that businesses and organizations can now find value in data that was recently considered useless.
Also watch this exquisite video on Hadoop.
Now the question arises: how does Hadoop do this magic?
The basic approach Hadoop follows is to break the data into smaller pieces. Not only the data but also the computation required to store, manage, and retrieve it is divided, and that is how Hadoop solves the problem by the divide and conquer method. The programming model used to divide the data and computation among the various nodes is known as MapReduce.
The other important thing to consider is the Hadoop file system.
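The idea of splitting both the data and the computation can be sketched with the classic word count example. This is a minimal single-process sketch in plain Python, not Hadoop's actual API; the function names `map_phase`, `shuffle`, `reduce_phase` and `word_count` are made up for illustration, and in a real cluster each chunk would be handled on a different node.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_phase(chunk: str) -> Iterator[tuple[str, int]]:
    """Map: turn one piece of the input into (word, 1) pairs."""
    for word in chunk.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[tuple[str, int]]) -> dict[str, list[int]]:
    """Shuffle: group all values that share the same key."""
    groups: dict[str, list[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce: combine the values of one key into a single result."""
    return key, sum(values)


def word_count(chunks: list[str]) -> dict[str, int]:
    """Run map, shuffle and reduce over the pieces of the input."""
    mapped = (pair for chunk in chunks for pair in map_phase(chunk))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    # Each string stands in for one piece of a much larger file.
    print(word_count(["big data is big", "data is everywhere"]))
```

Because each chunk is mapped independently and each key is reduced independently, both phases can be spread over many machines.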


The architecture of Hadoop includes a number of nodes across which the data and the computation are distributed. These nodes are known as slave nodes because they have a master. The master node is the main node that the client contacts and asks to solve its problem. The master node has a very important component known as the Job Tracker, which divides the task and distributes it among the slave nodes. It also maintains an index of which part of the task was given to which node.
Every slave node has a Task Tracker, which is responsible for processing the given task. Both of these components come under the umbrella term MapReduce.
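The division of labour between the Job Tracker and the Task Trackers can be sketched as below. This is a toy, single-process sketch under simplifying assumptions: the classes are made up for illustration and are not Hadoop's real classes, the pieces are handed out round-robin, the work is a simple sum, and everything runs sequentially instead of in parallel on separate machines.

```python
class TaskTracker:
    """Stands in for a slave node that processes the piece it is given."""

    def __init__(self, name: str) -> None:
        self.name = name

    def run(self, piece: list[int]) -> int:
        # The "work" in this sketch is just adding up the numbers.
        return sum(piece)


class JobTracker:
    """Stands in for the master: splits a job and records who got what."""

    def __init__(self, workers: list[TaskTracker]) -> None:
        self.workers = workers
        self.index: dict[int, str] = {}  # piece number -> worker name

    def submit(self, numbers: list[int], piece_size: int) -> int:
        pieces = [numbers[i:i + piece_size] for i in range(0, len(numbers), piece_size)]
        partial_results = []
        for piece_number, piece in enumerate(pieces):
            # Round-robin assignment (an assumption made for this sketch).
            worker = self.workers[piece_number % len(self.workers)]
            self.index[piece_number] = worker.name
            partial_results.append(worker.run(piece))
        # Combine the partial results into the final answer.
        return sum(partial_results)


if __name__ == "__main__":
    tracker = JobTracker([TaskTracker("node-a"), TaskTracker("node-b")])
    print("total:", tracker.submit(list(range(10)), piece_size=3))
    print("index:", tracker.index)
```

The `index` dictionary plays the role of the master's record of which part of the task went to which node.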
In addition, every node runs the Hadoop file system, where the data to be processed is stored.
Hadoop handles fault tolerance in a very nice way. By default it keeps three copies of each piece of a file and distributes them to different nodes. Whenever a hardware failure happens, Hadoop fetches the required data from another node that holds a copy of the lost data. Even the files and index tables on the master computers are backed up and saved on different computers. A special backup master can also be maintained to avoid a single point of failure.
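How keeping several copies survives a failure can be sketched as follows. This is only an illustration: the round-robin placement and the function names `place_replicas` and `read_block` are assumptions made for the sketch, and Hadoop's real placement policy is more elaborate.

```python
REPLICATION = 3  # number of copies kept of each piece of data


def place_replicas(block_id: int, nodes: list[str], copies: int = REPLICATION) -> list[str]:
    """Pick `copies` distinct nodes for a block, starting at a node chosen by block_id."""
    if copies > len(nodes):
        raise ValueError("not enough nodes for the requested number of copies")
    start = block_id % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(copies)]


def read_block(block_id: int, placement: dict[int, list[str]], alive: set[str]) -> str:
    """Return the first node that is still alive and holds a copy of the block."""
    for node in placement[block_id]:
        if node in alive:
            return node
    raise RuntimeError(f"all copies of block {block_id} are lost")


if __name__ == "__main__":
    nodes = ["n1", "n2", "n3", "n4"]
    placement = {block: place_replicas(block, nodes) for block in range(4)}
    # Simulate n1 and n2 failing: every block still has a live copy.
    alive = {"n3", "n4"}
    for block in placement:
        print(f"block {block} stored on {placement[block]}, read from {read_block(block, placement, alive)}")
```

With three copies on three different nodes, a block only becomes unreadable if all three of those nodes fail at once.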
Hadoop helps programmers by not making them worry about where a file is located, how to manage failures, how to break computations into pieces, or how to program for scaling.
A Hadoop system can consist of anything from one computer to thousands of computers to provide the required scalability and performance. So you can start with a few computers and, as your data grows, simply add more.
It should be noted that the processing speed of a Hadoop system grows roughly in proportion to the number of computers, at least for work that splits cleanly into independent pieces.


Cassandra is another NoSQL data management tool which, like Hadoop, is designed to address big data challenges. The architecture of Cassandra is a peer-to-peer distributed system in which data is distributed among all the nodes in the cluster. Click me to know more about the Cassandra database system and come back 😛
Likewise, there are many more NoSQL data management tools, such as MongoDB, Couchbase, and Bigtable. Which tool is appropriate depends on your application.
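One common way to spread data over equal peers without a master is a hash ring, which is the general idea behind Cassandra's data distribution. The sketch below is a simplified illustration only: the `Ring` class and `token` function are made up, MD5 is used here just as a convenient hash, and real Cassandra partitioning and replication are more involved.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a string to a position on the ring (MD5 chosen only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Every peer can build the same ring and find a key's owner by itself."""

    def __init__(self, nodes: list[str]) -> None:
        self._ring = sorted((token(node), node) for node in nodes)
        self._tokens = [position for position, _ in self._ring]

    def owner(self, key: str) -> str:
        # The owner is the first node clockwise from the key's position;
        # the modulo wraps around to the start of the ring.
        i = bisect.bisect_right(self._tokens, token(key)) % len(self._ring)
        return self._ring[i][1]


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c"])
    for key in ("user:1", "user:2", "photo:42"):
        print(key, "->", ring.owner(key))
```

Because the owner is computed from the key alone, any node can answer "where does this key live?" without asking a central master.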
Thanks 🙂
