The term Big Data means a lot of things to a lot of people, and it isn’t always clear just what that is. My wife tells me that whenever I say it (mispronouncing the word “data” with the long a), it makes her picture Brent Spiner posing for Attack of the 50-Ft. Woman. Big Data. Seriously, though, it is a major component of the Business Intelligence world today and the topic of countless blogs, discussion groups, articles, and conferences. For those of us who are not analyzing data for Amazon or Google, why should we care?
We live in a data-driven world, and the volumes of data that we collect and store are already staggering. According to a recent International Data Corporation (IDC) study, the global data supply passed 2.8 zettabytes (ZB) last year. That is 2.8 trillion GB of data stored on computers around the world. And according to estimates, global business data double every 1.2 years. What is the source for all of these data?
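To make those figures concrete, here is a small Python sketch of the arithmetic. It assumes decimal (SI) units, where 1 ZB is 10^21 bytes and 1 GB is 10^9 bytes, and it treats the 1.2-year doubling estimate as a steady exponential rate, which is a simplification. The function names are mine and purely illustrative.

```python
# Assumption: decimal (SI) units, so 1 ZB = 10**21 bytes and 1 GB = 10**9 bytes.
GB_PER_ZB = 10**21 // 10**9  # one trillion gigabytes per zettabyte


def zettabytes_to_gigabytes(zb: float) -> float:
    """Convert a volume in zettabytes to gigabytes."""
    return zb * GB_PER_ZB


def growth_factor(years: float, doubling_period_years: float = 1.2) -> float:
    """Multiplier after `years`, assuming the volume doubles at a steady rate."""
    return 2 ** (years / doubling_period_years)


if __name__ == "__main__":
    print(f"2.8 ZB is about {zettabytes_to_gigabytes(2.8):,.0f} GB")
    print(f"Growth over 6 years: {growth_factor(6):.0f}x")
```

At that rate, six years is five doubling periods, so a body of business data would grow roughly 32-fold over that span if the estimate held steady.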
Perhaps an easier question to answer would be what is not a source for these data, because data are being created everywhere. Nevertheless, consider the following short list of Big Data facts from the Wikibon Blog.
- More than 5 billion people are calling, texting, tweeting, and browsing on mobile phones worldwide.
- YouTube users upload 48 hours of content every minute.
- 571 new web sites are created every minute.
- Walmart handles more than 1 million transactions per hour, which are imported into databases estimated to contain more than 2.5 petabytes of data. This is just one company.
- Facebook users upload 100 terabytes of data to Facebook each day.
And these are just illustrations. When you stop to add in the full volume of e-commerce transactions, online banking transactions, medical data, communications data, research data and much, much more being created every day, it is no wonder that the global data supply is so massive. But Big Data is not really about all of these data per se, because less than 1% are actually accessed for analysis. Big Data is really about achieving two independent objectives: processing massive amounts of information in a very short amount of time, and accessing unstructured data.
eBay is a prime example of the first. With anywhere from 300 to 350 million items listed for sale at any moment and over 180 million active buyers and sellers, eBay generates an enormous quantity of data. This includes 250 million queries per day through its search engine. eBay must process these queries in near real time, returning appropriate results while at the same time analyzing customer behavior in order to offer other relevant suggestions. Behind the scenes, the Hadoop engines perform massive amounts of real-time analysis in order to sharpen the user experience for eBay's customers. eBay also uses them to analyze its own internal IT infrastructure, capturing and processing detailed statistics in real time on every component in its data centers.
Unstructured data are those that cannot be captured in the tidy rows and columns of a relational database. These include email, text documents, videos, or anything digital that has no regular “shape” to it. This article itself is unstructured data and will eventually be stored in multiple places both online and off. Of those 2.8 zettabytes, 80% is unstructured. And while there is a certain “cool” factor in processing a lot of data rapidly, I think the unlocking of unstructured data offers enormous potential. In fact, my first hands-on introduction to Hadoop was with unstructured data. After installing and configuring the platform on a single-node Linux virtual machine, I exercised one of the sample MapReduce programs. I downloaded three large tomes in electronic format from Project Gutenberg (including Ulysses by James Joyce), and the MapReduce program crunched through all three books in just under one second, returning an aggregated list of all the words found along with the number of instances of each word, sorted alphabetically. That was Wow!
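For readers who have not seen the pattern, the word-count idea can be sketched in a few lines of Python. This is not the Hadoop sample program itself (that one runs on the Hadoop platform); it is a single-process illustration of the same map, group, and reduce steps, and every function name here is my own invention.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Assumption: a "word" is a run of letters or apostrophes, compared case-insensitively.
WORD = re.compile(r"[a-z']+")


def map_phase(text: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in the text."""
    for word in WORD.findall(text.lower()):
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group step: collect all the emitted counts under their word."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> List[Tuple[str, int]]:
    """Reduce step: sum the counts per word and sort the words alphabetically."""
    return sorted((word, sum(counts)) for word, counts in grouped.items())


def word_count(documents: Iterable[str]) -> List[Tuple[str, int]]:
    """Run the three steps over a collection of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return reduce_phase(shuffle(pairs))


if __name__ == "__main__":
    for word, total in word_count(["Big data, big ideas", "Data is big"]):
        print(word, total)
```

On a real cluster the map and reduce steps run in parallel across many machines, with the framework handling the grouping in between; the sketch only shows the shape of the computation.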
Social media companies are already using Big Data tools to scrub our posts for keywords and trends that are useful to their advertisers. Advertisers are using these data to understand better how consumers feel about their brand. The richness of information stored in unstructured data sources is, I think, only just beginning to be fully understood. There is a wealth of opportunity to better understand our customers and our world through analysis of these sources. I see tremendous risks as well, as more and more of our digital assets are outside our immediate control. I’m not certain that I would want the content of every email I’ve ever written to be available for scrutiny by either a corporation or a government.
Like BI though, Big Data is more a practice than a technology. It is a way of looking at new problems and finding new solutions for solving them. More important, it is driving an entirely new set of questions to be answered. But even though it is not about the technology, it could not exist without these latest technological tools. Quoting again from the Wikibon Blog (see link above), “Decoding the human genome originally took 10 years to process; now it can be achieved in a week.” That’s pretty big.
Not everyone is ready for Big Data. Many companies are struggling simply to manage the cost of making their products, and that rarely requires huge volumes of data. Nevertheless, as they begin turning their eyes to their twenty-first-century customers, many of the relevant questions are going to be answered increasingly by Big Data processes. We all need to be watching that horizon and thinking about it strategically, whether or not we are doing it yet. How do social media affect my company, my employees, and my products? How can we benefit by tapping those resources? What forms of unstructured data are relevant to us, and what can we learn from them? These and other questions should begin appearing in strategic planning conversations. One last question to consider: will we have the skills we need when we get there? Already there is a paucity of practitioners qualified to use these technologies.
I suspect that the definition of Big Data will continue to have fuzzy edges. Already the terminology is evolving as fast as the technology. I myself need to be careful to avoid causing confusion. At the recommendation of my wife, I shall return to pronouncing the word “data” properly with the short “a.” And if I enunciate clearly, there should be no problem. After all, no one would want to picture Marcel Duchamp posing for Attack of the 50-Ft. Woman. Big Dada?
How is your company using Big Data processes? Do you have experiences to share regarding unstructured data?