Some Thoughts on Big Data

The term Big Data means a lot of things to a lot of people, and it isn’t always clear just what that is.  My wife tells me that whenever I say it (mispronouncing the word “data” with the long a), it makes her picture Brent Spiner posing for Attack of the 50-Ft. Woman.  Big Data.  Seriously, though, it is a major component of the Business Intelligence world today and the topic of countless blogs, discussion groups, articles, and conferences.  For those of us who are not analyzing data for Amazon or Google, why should we care?

We live in a data-driven world, and the volumes of data that we collect and store are already staggering.  According to a recent International Data Corporation (IDC) study, the global data supply passed 2.8 zettabytes (ZB) last year.  That is 2.8 trillion GB of data stored on computers around the world.  And according to estimates, global business data double every 1.2 years.  What is the source for all of these data?
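As a quick sanity check on those figures, here is a minimal Python sketch of the arithmetic.  It assumes decimal (SI) units, where 1 ZB is 10^21 bytes and 1 GB is 10^9 bytes, and it assumes the doubling happens at a steady rate.  The function names are made up for illustration.

```python
import math

GB_PER_ZB = 10**12  # assumption: SI units, 10**21 bytes per ZB / 10**9 bytes per GB


def zb_to_gb(zettabytes):
    """Convert zettabytes to gigabytes using decimal (SI) units."""
    return zettabytes * GB_PER_ZB


def annual_growth_factor(doubling_years):
    """Yearly multiplier implied by a steady doubling period in years."""
    return 2 ** (1 / doubling_years)


if __name__ == "__main__":
    print(f"2.8 ZB = {zb_to_gb(2.8):,.0f} GB")
    print(f"Implied growth per year: x{annual_growth_factor(1.2):.2f}")
```

Under those assumptions, doubling every 1.2 years works out to roughly 78% growth per year.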

Perhaps an easier question to answer would be what is not a source for these data, because data are being created everywhere.  Nevertheless, consider the following short list of Big Data facts from the Wikibon Blog.

  • More than 5 billion people are calling, texting, tweeting, and browsing on mobile phones worldwide.
  • YouTube users upload 48 hours of content every minute.
  • 571 new web sites are created every minute.
  • Walmart handles more than 1 million transactions per hour, which are imported into databases estimated to contain more than 2.5 petabytes of data.  This is just one company.
  • Facebook users upload 100 terabytes of data to Facebook each day.

And these are just illustrations.  When you stop to add in the full volume of e-commerce transactions, online banking transactions, medical data, communications data, research data and much, much more being created every day, it is no wonder that the global data supply is so massive.  But Big Data is not really about all of these data per se, because less than 1% are actually accessed for analysis.  Big Data is really about achieving two independent objectives:  processing massive amounts of information in a very short amount of time, and accessing unstructured data.

eBay is a prime example of the first.  With anywhere from 300 to 350 million items listed for sale at any moment and over 180 million active buyers and sellers, eBay generates an enormous quantity of data.  This includes 250 million queries per day through its search engine.  eBay must process these queries in near real time, returning appropriate results while at the same time analyzing customer behavior in order to offer other relevant suggestions.  Behind the scenes, its Hadoop engines perform massive amounts of real-time analysis in order to sharpen the user experience for eBay's customers.  The same engines are also used to analyze eBay's own internal IT infrastructure, capturing and processing detailed statistics in real time on every component in its data centers.

Unstructured data are those that cannot be captured in the tidy rows and columns of a relational database.  These include email, text documents, videos, or anything digital that has no regular “shape” to it.  This article itself is unstructured data and will eventually be stored in multiple places both online and off.  Fully 80% of those 2.8 zettabytes is unstructured.  And while there is a certain “cool” factor in processing a lot of data rapidly, I think the unlocking of unstructured data offers enormous potential.  In fact, my first hands-on introduction to Hadoop was with unstructured data.  After installing and configuring the platform on a single-node Linux virtual machine, I exercised one of the sample MapReduce programs.  I downloaded three large tomes in electronic format from Project Gutenberg (including Ulysses by James Joyce), and the MapReduce program crunched through all three books in just under one second, returning an aggregated list of all the words found along with the number of instances of each word, sorted alphabetically.  That was Wow!
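For readers who have not seen a word count in the MapReduce style, here is a minimal sketch of the idea in plain Python.  It is not the Hadoop sample program itself and it runs on a single machine with no Hadoop involved; the function names and the simple rule for what counts as a word are assumptions made for illustration.

```python
import re
from collections import Counter


def map_words(text):
    """Map step: emit a (word, 1) pair for every word in one document."""
    # Assumption: a "word" is a run of letters or apostrophes, lowercased.
    return [(word, 1) for word in re.findall(r"[a-z']+", text.lower())]


def reduce_counts(pairs):
    """Reduce step: add up the counts emitted for each distinct word."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return totals


def word_count(documents):
    """Run map over every document, reduce the pairs, sort alphabetically."""
    pairs = [pair for doc in documents for pair in map_words(doc)]
    return sorted(reduce_counts(pairs).items())


if __name__ == "__main__":
    books = ["The cat sat.", "The dog sat on the cat."]
    for word, count in word_count(books):
        print(word, count)
```

In a real cluster the map step runs in parallel on many pieces of the input and the framework groups the pairs by word before the reduce step, which is where the speed comes from.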

Social media companies are already using Big Data tools to scrub our posts for keywords and trends that are useful to their advertisers.  Advertisers are using these data to understand better how consumers feel about their brand.  The richness of information stored in unstructured data sources is, I think, only just beginning to be fully understood.  There is a wealth of opportunity to better understand our customers and our world through analysis of these sources.  I see tremendous risks as well, as more and more of our digital assets are outside our immediate control.  I’m not certain that I would want the content of every email I’ve ever written to be available for scrutiny by either a corporation or a government.

Like BI though, Big Data is more a practice than a technology.  It is a way of looking at new problems and finding new solutions for solving them.  More important, it is driving an entirely new set of questions to be answered.  But even though it is not about the technology, it could not exist without these latest technological tools.  Quoting again from the Wikibon Blog (see link above), “Decoding the human genome originally took 10 years to process; now it can be achieved in a week.”  That’s pretty big.

Not everyone is ready for Big Data.  Many companies are struggling simply to manage the cost of making their products, and that rarely requires huge volumes of data.  Nevertheless, as they begin turning their eyes to their twenty-first-century customers, many of the relevant questions are going to be answered increasingly by Big Data processes. We all need to be watching that horizon and thinking about it strategically, whether or not we are doing it yet.  How do the social media affect my company, my employees, and my products?  How can we benefit by tapping those resources?  What forms of unstructured data are relevant to us and what can we learn from them?  These and other questions should begin appearing in strategic planning conversations.  One last question to consider is, will we have the skills we need when we get there? Already there is a paucity of practitioners qualified to use these technologies.

I suspect that the definition of Big Data will continue to have fuzzy edges.  Already the terminology is evolving as fast as the technology. I myself need to be careful to avoid causing confusion.  At the recommendation of my wife, I shall return to pronouncing the word “data” properly with the short “a.”  And if I enunciate clearly, there should be no problem.  After all, no one would want to picture Marcel Duchamp posing for Attack of the 50-Ft. Woman.  Big Dada?


How is your company using Big Data processes?  Do you have experiences to share regarding unstructured data?

6 thoughts on “Some Thoughts on Big Data”

  1. Nancy says:

    From a high level it’s not scary, it’s quite exciting to have so much data to analyze from so many angles. However, from a personal level, it can be quite scary and intimidating. I don’t think anyone really realizes the impact of their technological habits (e.g., Facebook, texts, chats, blogs, YouTube, etc.) and the footprint they create about one’s persona and the inherent need we’ve created to be so “connected.” When you stop and think that every key stroke, every swipe, every push of a button is leaving your fingerprint somewhere, that’s a little unnerving in my own environment. I choose to limit my grasp on social media platforms and how much I interact and spout about my thoughts, but then I don’t have an agenda or platform I need to make heard around the world. It’s a love/hate situation. We’ve created the need yet cry about the dependency. Advertising no longer relies on word of mouth or an ad in a newspaper or magazine or billboard – it literally is derived from our very keystrokes. And the more we tap away, the more of ourselves we open up for the masses to analyze. Great article, Steve!

  2. bimuse says:

    Thank you, Nancy.

    I agree that there really are consequences to our technology habits. Social media has brought us all closer together in a myriad of ways, but remaining cognizant of that exposure is very important. Personally, I don’t mind so much if companies are able to figure out my likes and dislikes from my posts or buying habits, up to a point. Shopping online, for instance, is much easier and more efficient than most brick and mortar shopping. The book I purchased from Amazon is probably not a problem most days. The personal items I buy online at Walgreen’s are probably not something I want shared. There is a lot to think about.

  3. Bill says:

    Steve, how do you see non-transaction level business intelligence analysis converging with unstructured data analysis? I can picture a somewhat fuzzy convergent technical solution, but I’m struggling to find the questions / use cases from a business perspective.

    • bimuse says:

      Bill, this is a very good question. I don’t necessarily see that aggregate BI analysis must converge with the analysis of unstructured data. I think it all depends on the circumstances, the questions being asked, and the application of the technology. If the application is real-time analysis of unstructured data aimed at responding immediately to intelligence harvested from a sea of social media posts (such as responding to the Super Bowl blackout), then a conventional tool would have little place in the mix because it would be too slow. However, if the application is the analysis over time of responses to marketing campaigns, then I could see a mix of technologies with the big data tools supplying the data.

      With respect to use cases for unstructured data, these are probably not your typical corporate business questions. For instance, I can’t imagine unstructured data being useful in managing costs or analyzing sales performance, both typical BI questions. But the social media companies are already mining unstructured data continually. Other fields such as legal depend on unstructured data, but have not previously had the technology to mine potentially thousands of documents, emails, and text messages looking for the answers to who did what and when. Again, it depends on the business questions to be answered, and that seems to be evolving.

      This is all very evolutionary and I have more questions than answers. At the TDWI conference in Las Vegas this year, 60% of the content was about these emerging technologies, but every single one of the actual solutions presented during the Executive Summit was based on conventional BI. I have yet to see the beef myself.

  4. Michael Mixon says:

    Great post, Steve. Everything about Big Data seems to be large…the potential, the potential abuse, the misunderstanding, the hype. As a BI practitioner and standard-issue geek, the idea of having vast volumes of data to analyze is very appealing. But many (if not all) of the same things that get in the way of small data being useful will get in the way of big data being useful, namely a) an ability to tease out and present the meaningful stories, and b) the need to cast a skeptical eye at anything generated by a fancy algorithm.

    This article makes a strong case, in my opinion, for why we should not rush to embrace the promise of Big Data too quickly.

    • bimuse says:

      Terrific points, Michael. I agree that teasing out the meaningful stories is the name of the game. I’m only partially in agreement with the article. I think there are strong use cases for big data and enterprises such as Amazon and eBay use them very effectively. In his Walmart example, they clearly chose a backhoe when it was a trowel that was needed. Thanks for the thoughts.
