July 9, 2015

Advanced Analytics: The Search for Answers in Unstructured Data

Call this the decade of big data: petabytes of it, daily, most of it in unstructured form, which means analytics must be applied to find any meaning at all. By some estimates, 70 to 80 percent of all data in many businesses is unstructured. It is a proverbial gold mine, but the bad news is that it is not easy to mine.

Today’s data volumes would have been unimaginable even 30 years ago, when computers sometimes shipped with just two floppy drives that held 360KB apiece. Today Google alone is said to process 20 petabytes of data every day, 365 days a year.

Many organizations also handle at least a petabyte daily, drawn from web queries, Tweets, emails, Facebook posts, transcripts of voicemails, spreadsheets, SMS messages, calendar entries, PowerPoint shows, blogs, and still more varieties of data. That variety is the problem with finding meaning. In bygone days, companies analyzed a single form of data, say an Oracle database. Where meaning was to be found was obvious and consistent from file to file. No more.

But know this: as data has grown, so too have the techniques for extracting meaning across diverse formats. Different techniques are popular in different organizations, and many use several to extract maximum value. Big data analysis tools have proliferated in the past few years. Some are free and some are expensive; some are easy to use and some require advanced coding skills. Many are custom developed for a specific company and use. There is no right tool for every situation, but trial and error will guide users to the tools they are comfortable with.

Effective analysis starts with knowing its purpose. What answers are wanted from the data? Any organization’s big data stores very probably hold a myriad of answers. The questions become how to dig through the data, with what tools, and how to recognize the answer when the job is done. Over time, as familiarity with a data set grows, it will almost certainly become clear that it holds important answers to questions nobody thought to ask at first. At the beginning, though, it is important to be comfortable asking the obvious questions.

Here is a classic big data problem: why did the stock of company XYZ fall 10 percent last week? Assume there is no single trigger event to explain it. So, why?

The route to an answer is to sort through social media chatter (on Twitter, especially), factor in commentary on the Web (investment-focused blogs in particular), and do a fast dive through pertinent news articles. The search will focus on the company name, which is an easy tag, and tools will be deployed to grade the sentiment of any chatter or commentary, ranking it as favorable, unfavorable, or neutral. The answers will be found in the data, and that’s the case for just about every question.
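
To make the approach concrete, here is a minimal sketch in Python of the two steps just described: tagging items by company name, then grading each one as favorable, unfavorable, or neutral. Everything in it is an assumption for illustration: the word lists, the function names (mentions, grade_sentiment, tally_sentiment), and the sample headlines are all made up. Production sentiment tools rely on far larger lexicons or trained models, so treat this as a toy, not as any particular product's method.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists for illustration only.
FAVORABLE_WORDS = {"beat", "strong", "growth", "upgrade", "bullish"}
UNFAVORABLE_WORDS = {"miss", "weak", "lawsuit", "downgrade", "bearish", "recall"}


def mentions(text, company):
    """Return True if the company name appears as a whole word (case-insensitive)."""
    pattern = r"\b" + re.escape(company) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def grade_sentiment(text):
    """Grade text as favorable, unfavorable, or neutral by counting lexicon hits."""
    words = re.findall(r"[a-z']+", text.lower())
    favorable_hits = sum(word in FAVORABLE_WORDS for word in words)
    unfavorable_hits = sum(word in UNFAVORABLE_WORDS for word in words)
    if favorable_hits > unfavorable_hits:
        return "favorable"
    if unfavorable_hits > favorable_hits:
        return "unfavorable"
    return "neutral"


def tally_sentiment(items, company):
    """Count sentiment grades across only the items that mention the company."""
    return Counter(grade_sentiment(text) for text in items if mentions(text, company))


# Made-up sample items standing in for tweets, blog posts, and headlines.
sample_items = [
    "XYZ faces a lawsuit over product recall",
    "Analysts downgrade XYZ after weak quarter",
    "XYZ announces strong growth in Asia",
    "XYZ to hold annual meeting next month",
    "ABC posts strong results",
]

counts = tally_sentiment(sample_items, "XYZ")
print(counts)
```

A tally that skews unfavorable during the week in question would point the analyst toward the specific items behind the drop, which is where the actual answer lies.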

The key throughout is this: the focus is on the search for answers, not on the data as such. And that is the beauty of this effort: it quickly produces tangible payoffs. Just keep trying. That’s the secret to how this gold is mined.