Miletos: Center of Science and Philosophy in the Ancient Era

15,000 people used to watch this view during performances 2,500 years ago. The hill was an island, and the plain was sea. Thales, Anaximander, Aspasia, Anaximenes, Isidorus, and many more were there.

The Road Ahead

The long and winding road.

Pentax Digital SLR

Pentax K50, my digital SLR of choice.

Istanbul Bosphorus

Istanbul, the city where two continents meet. View from Europe to Asia.

Feb 1, 2015

How To Identify Big Data

Let me start by describing the amount of data generated by humans every day. Every day, 2.5 quintillion (10^18) bytes of data are created, which means that 90% of the data in the world today has been created in the last two years [1]. Some estimates put data growth as high as 50 times by the year 2020 [1].
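
To put that figure in perspective, here is a quick back-of-the-envelope calculation (the 2.5 quintillion bytes/day number comes from [1]; the unit conversions below are mine):

```python
# Scale of daily data creation, based on the figure of
# 2.5 quintillion (10^18) bytes created per day [1].

DAILY_BYTES = 2.5e18      # bytes created per day

EXABYTE   = 1e18          # 10^18 bytes
ZETTABYTE = 1e21          # 10^21 bytes

daily_exabytes    = DAILY_BYTES / EXABYTE
yearly_zettabytes = DAILY_BYTES * 365 / ZETTABYTE

print(f"Per day : {daily_exabytes:.1f} EB")     # ~2.5 EB per day
print(f"Per year: {yearly_zettabytes:.2f} ZB")  # ~0.91 ZB per year
```

A single year of today's data creation is already on the order of a zettabyte, which is why no single organization's warehouse comes close.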


The term “Big Data” is widely used by business professionals when talking about the amount of data they hold. Every organization believes the world spins around them, and most believe their customer database or data warehouse is “Big Data”, because the amount of data stored in their database increases every day, perhaps doubling or tripling in a short while. This growth in customer data pushes them to think their data is “Big Data”. Lately there have been popular phrases such as “X organization’s Big Data”, particularly in Turkish. Even for people who take “Big Data” purely as an indicator of data size, I recommend comparing the size of an organization’s data with the volume of data generated daily worldwide; the two are simply not comparable.

Ask a dozen CIOs to define big data, and you'll likely get a dozen different responses, Gartner analyst Mark Beyer says.

A study by IBM and Oxford University [2] shows that much of the confusion about big data begins with the definition itself. Respondents were asked to choose up to two descriptions of how their organizations view big data from the choices in the figure below. No single characteristic dominated among the choices.


Considering this chaos around the definition of “Big Data”, I want to give some scientifically accepted definitions and bust the myths about Big Data.

“Big Data” is not about the amount of data. Stop looking for a standard size of Big Data, because there is no standard data size. Even if there were a standard size defining “big data”, it would change rapidly. Hence a standard data size is not an identifier for Big Data, neither today nor tomorrow.

Big Data is a term applied to “data sets” whose size is beyond the capability of commonly used software tools to capture, manage, and process [3]. I’m sure most enterprises are using off-the-shelf, common software tools to capture, manage and process their data. Considering just the size parameter, if this is the case, then they are not dealing with big data.


As far back as 2001, there is the classic definition of Big Data with the 3V model -- high volume, high variety and high velocity -- as shown in the figure.

Volume: The amount of data. Perhaps the characteristic most associated with big data, volume refers to the mass quantities of data that organizations are trying to harness to improve decision-making across the enterprise. Data volumes continue to increase at an unprecedented rate. There is no standard data volume. If the volume of data you’re looking at is an order of magnitude or more larger than anything previously encountered in your industry, then you’re probably dealing with Big Data. In practice, the “high” volumes referenced today are not smaller than petabytes (10^15 bytes) and range up to zettabytes (10^21 bytes).
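
As a rough illustration of the "order of magnitude" rule of thumb above, here is a minimal sketch in Python; the helper name and the example sizes are my own, not from any standard:

```python
# Hypothetical helper for the "order of magnitude" heuristic described above:
# a data set is a volume candidate for Big Data if it is at least 10x larger
# than anything previously encountered in the industry.

PETABYTE = 10**15   # 10^15 bytes
TERABYTE = 10**12   # 10^12 bytes

def is_volume_candidate(dataset_bytes: float, largest_seen_bytes: float) -> bool:
    """True if the data set is an order of magnitude (or more) beyond the largest volume seen so far."""
    return dataset_bytes >= 10 * largest_seen_bytes

# Example: a 5 PB data set in an industry where 200 TB used to be the biggest warehouse.
print(is_volume_candidate(5 * PETABYTE, 200 * TERABYTE))    # True
print(is_volume_candidate(500 * TERABYTE, 200 * TERABYTE))  # False: bigger, but same order of magnitude
```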

Variety: Different types of data and data sources. Variety is about managing the complexity of multiple data types, including structured, semi-structured and unstructured data. Organizations need to integrate and analyze data from within and outside the enterprise. With the explosion of sensors, smart devices and social collaboration technologies, data is being generated in countless forms, including: text, web data, tweets, sensor data, geographic location, audio, video, click streams, log files and more.
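
The practical cost of variety is that the same event can arrive in several shapes. Below is a small illustrative sketch (the record fields and parsing rules are my own example, not from the article) that normalizes a structured CSV row, a semi-structured JSON document and an unstructured text snippet into one common form:

```python
import csv
import io
import json

# The same customer click arriving as structured, semi-structured and
# unstructured data, normalized into a single dict shape.

csv_row   = "42,click,2015-02-01T10:00:00"
json_blob = '{"customer_id": 42, "action": "click", "ts": "2015-02-01T10:00:00"}'
free_text = "customer 42 clicked the banner at 2015-02-01T10:00:00"

def from_csv(line: str) -> dict:
    cid, action, ts = next(csv.reader(io.StringIO(line)))
    return {"customer_id": int(cid), "action": action, "ts": ts}

def from_json(blob: str) -> dict:
    d = json.loads(blob)
    return {"customer_id": d["customer_id"], "action": d["action"], "ts": d["ts"]}

def from_text(text: str) -> dict:
    # Deliberately naive parsing, to show that unstructured data needs extra work.
    words = text.split()
    return {"customer_id": int(words[1]), "action": words[2].rstrip("ed"), "ts": words[-1]}

for record in (from_csv(csv_row), from_json(json_blob), from_text(free_text)):
    print(record)
```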

Velocity: Data in motion. The speed at which data is created, processed and analyzed continues to accelerate. Higher velocity reflects the real-time nature of data creation, as well as the need to incorporate streaming data into business processes and decision-making. Velocity impacts latency, the lag time between when data is created or captured and when it is accessible. Today, data is continually being generated at a pace that is impossible for traditional systems to capture, store and analyze. For time-sensitive processes such as real-time fraud detection or multi-channel “instant” marketing, certain types of data must be analyzed in real time to be of value to the business. As an example of instant marketing, a discount offer sent to a customer based on their location is less likely to be successful if they have already walked some distance past the store.
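
The instant-marketing example can be reduced to a latency check. The sketch below is only an illustration; the 30-second freshness window and the event shape are assumptions of mine, not figures from the article:

```python
from datetime import datetime, timedelta

# Assumed window in which a location-based offer is still relevant.
OFFER_WINDOW = timedelta(seconds=30)

def should_send_offer(event_created: datetime, processed_at: datetime) -> bool:
    """Send the offer only if processing latency stays within the freshness window."""
    latency = processed_at - event_created
    return latency <= OFFER_WINDOW

created = datetime(2015, 2, 1, 10, 0, 0)
print(should_send_offer(created, created + timedelta(seconds=5)))    # True: the customer is still near the store
print(should_send_offer(created, created + timedelta(minutes=10)))   # False: the customer has already walked past
```

A batch system that only processes yesterday's location events can never pass this check, which is what "data in motion" means in practice.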


This figure shows how the dimensions of big data have changed over time. As years pass, data volume goes from GB to PB, data velocity from batch processing to real-time processing, and data variety from text tables to video and social data.

Today, “Big Data” is increasingly being defined by the 4 Vs.  

Veracity, the 4th V, refers to data uncertainty: the level of reliability associated with certain types of data. Some data is inherently uncertain, for example sentiment and truthfulness in humans, weather conditions, economic factors, and the future. Despite the uncertainty, such data still contains valuable information, and simply ignoring it can create even more problems than the uncertainty itself. In the era of big data, executives will need to approach the dimension of uncertainty differently.
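
One common way to act on veracity, sketched below with made-up numbers, is to weight each record by an estimated reliability score instead of discarding uncertain data outright:

```python
# Illustrative only: sentiment scores in [-1, 1] paired with an assumed
# reliability weight in [0, 1] for each source.
readings = [
    ( 0.8, 0.9),   # verified customer review
    ( 0.4, 0.6),   # social media post
    (-0.2, 0.3),   # automatically scraped comment
]

weighted_sum = sum(score * weight for score, weight in readings)
total_weight = sum(weight for _, weight in readings)

print(f"Reliability-weighted sentiment: {weighted_sum / total_weight:.2f}")  # 0.50
```

This way the uncertain sources still contribute information, just with less influence on the final answer.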

Let me conclude with the following summary. I believe it will guide those who tend to use the term “Big Data” incorrectly to refer to any large database.

  • Big Data is not all about the data size. Data size is just one parameter to identify Big Data. 
  • In addition to the Volume parameter, Variety, Velocity and Veracity (uncertainty) must be in the nature of “Big Data”.
  • There is no standard data size for Big Data, but one should expect the volume of data to be larger than anything previously encountered in the industry, roughly petabytes (10^15 bytes) to zettabytes (10^21 bytes) today.
  • Big Data needs to be processed to support decision-making.
  • Big Data should include data sets such as text, web data, tweets, sensor data, geographic location, audio, video, click streams (such as likes), log files and more.
  • Big Data cannot be captured, managed, and processed by commonly used software tools.

Where the indicators above do not hold, it is a misuse of the term “Big Data” to describe the data set.
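
To make the checklist concrete, here is a hypothetical helper condensing the bullets above into code; the indicator names mirror the summary, and the rule that all of them should hold is my reading of this article rather than a formal definition:

```python
def looks_like_big_data(*, high_volume: bool, varied_sources: bool,
                        high_velocity: bool, uncertain_veracity: bool,
                        beyond_common_tools: bool) -> bool:
    """True only when the indicators from the summary apply, not merely when the data set is large."""
    return all([high_volume, varied_sources, high_velocity,
                uncertain_veracity, beyond_common_tools])

# A growing customer database that off-the-shelf tools still handle comfortably:
print(looks_like_big_data(high_volume=True, varied_sources=False,
                          high_velocity=False, uncertain_veracity=False,
                          beyond_common_tools=False))   # False: large, but not Big Data
```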

References:

  1. Performance and Capacity Implications for Big Data, IBM Redpaper, January 2014.
  2. Analytics: The Real-World Use of Big Data, IBM Institute for Business Value and Saïd Business School, University of Oxford, 2012.
  3. Information Management and Big Data: A Reference Architecture, Oracle White Paper, February 2013.