
You got it or not? How to tell if Big Data is something you need to worry about in your application

Summary

So you have an application with lots of data. Lots and lots of data. But is it Big Data? That’s Data with a capital D. You wouldn’t be the first person to have trouble knowing when you’ve crossed the threshold into the modern, sometimes intimidating world of Big Data. Is the threshold 10,000 records? 10 million records? One terabyte of data?

The answer is more complicated than a number. Yes, the volume of data is a huge factor, but you also have to consider the complexity of the data, how quickly new records come in and how quickly the data needs to be accessible to users.

A simple definition of Big Data

Michael Driscoll, Founder and CEO of Metamarkets, has this simple definition of Big Data: data that is distributed. In a nutshell, he says that you cross the threshold when your data can no longer be stored on one computer. Here’s a chart he used to explain this:
 

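| Class  | Size         | Manage with                  | How it fits                  | Examples                   |
|--------|--------------|------------------------------|------------------------------|----------------------------|
| Small  | < 10 GB      | Excel, R                     | Fits in one machine’s memory | Thousands of sales figures |
| Medium | 10 GB – 1 TB | Indexed files, monolithic DB | Fits on one machine’s disk   | Millions of web pages      |
| Big    | > 1 TB       | Hadoop, distributed DBs      | Stored across many machines  | Billions of web clicks     |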
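
As a rough illustration, the size thresholds in Driscoll’s chart can be written as a small Python function. This is only a sketch: the function name is made up for this example, decimal units (1 GB = 10^9 bytes) are an assumption, and the chart itself treats these cutoffs as rules of thumb rather than hard limits.

```python
# Decimal units are an assumption; the chart does not say GB vs. GiB.
GB = 10**9
TB = 10**12


def classify_data_size(size_bytes: int) -> str:
    """Map a data size in bytes to the chart's Small / Medium / Big classes.

    Hypothetical helper for illustration only.
    """
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if size_bytes < 10 * GB:
        return "Small"    # fits in one machine's memory
    if size_bytes <= 1 * TB:
        return "Medium"   # fits on one machine's disk
    return "Big"          # stored across many machines


if __name__ == "__main__":
    for size in (2 * GB, 250 * GB, 40 * TB):
        print(size, classify_data_size(size))
```

Size is only one input, of course. As the rest of this article argues, how fast the data arrives and how it is used matter just as much.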

 

Another common definition says that you cross the threshold into Big Data at the point when the existing techniques and technology used to manage your data aren’t good enough anymore. This means normal hard drives can’t store it all, processing times slow down, searching or analysis takes too long, servers get too hot, or new records are coming in faster than you can ingest them. At that point you need to implement more sophisticated techniques and technologies: open source products like Spark or Hadoop, new ways to do ETL processing, more sophisticated load balancing, smarter search tools, and so on.
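
One of those warning signs, records arriving faster than you can ingest them, is easy to check with simple arithmetic. The Python sketch below assumes constant arrival and ingest rates and an empty queue at the start; the function name and the example rates are hypothetical.

```python
def backlog_after(arrival_per_sec: float, ingest_per_sec: float, seconds: float) -> float:
    """Estimate how many records are waiting after `seconds`.

    Assumes constant rates and an empty backlog at time zero
    (a simplification for illustration).
    """
    surplus = arrival_per_sec - ingest_per_sec
    # If ingest keeps up with arrivals, nothing accumulates.
    return max(0, surplus * seconds)


if __name__ == "__main__":
    # Hypothetical numbers: 12,000 records/s arriving, 10,000 records/s ingested.
    print(backlog_after(12_000, 10_000, 3600))
```

With those made-up rates, the backlog grows by 2,000 records every second, so after one hour 7.2 million records are waiting. A backlog that only ever grows is a sign that the current tooling no longer fits the data.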

Both of these definitions give us a simple starting point, but the ubiquitous use of distributed, cloud-based architectures for convenience and cost has significantly blurred the line between medium and big. There used to be a clear break: your server could no longer store or process all of your records, you had to move to a distributed architecture, and someone at your company could say “we’ve definitely got Big Data now!” Now the growth can be incremental, from a few AWS servers to more and more computing power and storage space, until one day… surprise! You’ve transitioned to Big Data without even realizing it.

The 4 Vs of Big Data

Another way of defining Big Data is what geek scholars have called the 4 Vs of Big Data: Volume, Variety, Velocity and Veracity. IBM’s Big Data & Analytics Hub has this helpful infographic, which explains the Vs: https://www.ibmbigdatahub.com/infographic/four-vs-big-data

Volume refers to the amount of data, Variety refers to its type and structure, Velocity is how quickly the new data is coming in and needing to be used, and Veracity is a measure of how accurate or trustworthy it is. Any combination of the first three factors can make an application cross that threshold into the Big Data realm, as shown in this diagram from Data Science Central: https://www.datasciencecentral.com/forum/topics/the-3vs-that-define-big-data

So, what do we make of all of this? At the rate we’re collectively creating new information, which a fascinating article and infographic from Cloudtweaks puts at 2.5 quintillion bytes of data every day (that’s 25 followed by 17 zeroes!), it won’t be long before every system or application we use will need to follow Big Data principles. Maybe in trying to define where the Big Data threshold is, and asking whether your data is just plain data… or Data, you’re asking the wrong question. Instead, you should be asking whether your systems, applications, technologies, search tools, and infrastructure are scaled properly for the amount of data you’re ingesting and the needs of your users. Or is your data just wasting space on your servers?
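
To put the 2.5 quintillion bytes per day figure in perspective, here is a quick unit conversion in Python. It assumes decimal units (1 TB = 10^12 bytes, 1 EB = 10^18 bytes), and the function name is invented for this example.

```python
# Decimal units are an assumption for this back-of-the-envelope conversion.
TB = 10**12
EB = 10**18
SECONDS_PER_DAY = 24 * 60 * 60

# 2.5 quintillion bytes per day, the figure quoted from Cloudtweaks.
BYTES_PER_DAY = 2_500_000_000_000_000_000


def daily_volume_summary(bytes_per_day: int = BYTES_PER_DAY) -> dict:
    """Express a daily byte count in larger, easier-to-read units."""
    return {
        "exabytes_per_day": bytes_per_day / EB,
        "terabytes_per_day": bytes_per_day // TB,
        "terabytes_per_second": bytes_per_day / SECONDS_PER_DAY / TB,
    }


if __name__ == "__main__":
    for unit, value in daily_volume_summary().items():
        print(f"{unit}: {value:,.2f}")
```

That works out to 2.5 exabytes, or 2.5 million terabytes, per day, which is roughly 29 terabytes every second. By the chart above, the world produces a “Big” data set many times over each second.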

In our upcoming articles in this Big Data series, we’ll explore this question more by looking at how to get the most value out of your Big Data, how reporting is impacted by the introduction of Big Data, how ERP systems help businesses deal with their data challenges, and some tools and techniques you can be using right now to improve your system’s performance. Stay tuned for Part 2, How Big Data Changes Reporting.
