November 27, 2015

The Basics of Big Data Storage

Big data is a buzzphrase, but it’s also a concept that makes a big difference to your business. Its effects are fundamental enough that even though you may be asking IT staff to take care of the details, senior executives need to know enough to ask questions and set clear guidelines for what those staff are trying to achieve.


The Effects of Data Quantity Matter More Than the Quantity Itself

First we need to define what we mean by “big data”, which is a tricky task as it can mean different things to different people. It’s not so much about the amount of data companies handle and access these days (which continues to rocket) but more the effects of that quantity. Perhaps the best way to think of big data is that it has a drawback and a benefit that are two sides of the same coin.

The drawback is that there’s so much data that older or “traditional” data processing applications and methods simply can’t cope: processing either takes too long to be practical, needs more computing power than is available, or both. The benefit is that the breadth and depth of the data makes it possible to spot relationships and make more accurate predictions in a way that wasn’t previously possible.


It’s Not Enough to Have a Lot of Storage

In turn, big data has two significant effects on storage requirements, beyond the simple point that you need a lot of storage. One is that the storage needs to be quickly scalable, meaning you can add more capacity without disproportionate cost or logistical headaches. You should ask tech staff to research ways of scaling more efficiently, because simply buying more disk space every time your data grows may soon become financially unviable.

The second major requirement is that the data be accessible much more quickly. This is certainly the case when you are performing analysis based on live data, for example when looking at the locations of inbound calls to a call centre and cross-referencing those locations with the content of the calls (i.e. were they about a problem, and was it resolved?) to spot localized service problems that need fixing quickly.
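To make the call-centre example concrete, here is a minimal sketch of that kind of cross-referencing. The record fields (`location`, `about_problem`, `resolved`) and the threshold are illustrative assumptions, not a real system's schema:

```python
# Hypothetical sketch: flag call-centre locations where a high share of
# inbound calls were about problems that went unresolved.
from collections import defaultdict

def flag_problem_locations(calls, threshold=0.5):
    """Return locations whose share of unresolved problem calls
    exceeds `threshold`. Field names are illustrative only."""
    totals = defaultdict(int)
    unresolved = defaultdict(int)
    for call in calls:
        totals[call["location"]] += 1
        if call["about_problem"] and not call["resolved"]:
            unresolved[call["location"]] += 1
    return sorted(
        loc for loc, n in totals.items()
        if unresolved[loc] / n > threshold
    )

calls = [
    {"location": "Leeds", "about_problem": True, "resolved": False},
    {"location": "Leeds", "about_problem": True, "resolved": False},
    {"location": "Leeds", "about_problem": False, "resolved": True},
    {"location": "York", "about_problem": True, "resolved": True},
]
print(flag_problem_locations(calls))  # ['Leeds']
```

The point of the speed requirement is that a query like this must run over live call records, not last night's batch, which is what pushes the underlying storage toward fast access.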


The Requirements for Storage

Storage designed for faster access can be more expensive, so one option is a tiered approach: the data most in need of live analysis is stored on flash-based devices, while data held for future reference is stored on slower but more resilient and cheaper media such as traditional disks or even tape.
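A tiering policy can be as simple as a rule that looks at how recently and how often a piece of data is used. The sketch below is purely illustrative; the thresholds and tier names are assumptions, not a product feature:

```python
# Hypothetical sketch of a tiering rule: hot data goes to flash,
# warm data to disk, cold data to tape. Thresholds are assumptions.
def choose_tier(accesses_last_30_days, needed_for_live_analysis):
    if needed_for_live_analysis or accesses_last_30_days > 100:
        return "flash"   # fast but expensive
    if accesses_last_30_days > 0:
        return "disk"    # cheaper, slower
    return "tape"        # cheapest, for data held purely for reference

print(choose_tier(250, False))  # flash
print(choose_tier(3, False))    # disk
print(choose_tier(0, False))    # tape
```

Real tiering products automate exactly this kind of decision, migrating data between tiers as its access pattern changes.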

While there are countless ways to meet these requirements, they can usually be put into a few main categories:

  • Scale-out (also known as clustered NAS) means you effectively have multiple copies of files, stored in different parts of a network. That allows faster local access while providing a failsafe if any one copy is lost or becomes inaccessible.
  • Object storage usually involves a single copy of each file, but a particularly efficient index for finding and accessing the file quickly. It works more like the Internet, where the actual location of a file doesn’t matter so much, than it does a filing cabinet where everything must be arranged carefully by drawer and folder.
  • Hyperscale computing involves huge numbers of virtual servers, meaning each physical server machine that stores and processes data can quickly switch focus to a particular user. In simplified terms, it means you can access a virtual computer that’s exactly big enough to handle your data processing task, but only “exists” for the duration of the task, making it much more efficient in terms of cost and power demands.
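The object-storage idea in particular can be shown in a few lines. This is a toy sketch, not any vendor's API: each object gets a unique key through an index, so callers never care where the bytes physically live, much like fetching a page on the Internet by its address:

```python
# Toy sketch of object storage: objects are stored and fetched by a
# unique key via an index, not by a folder path. Illustrative only.
import hashlib

class ObjectStore:
    def __init__(self):
        self._index = {}  # key -> data; stands in for the efficient index

    def put(self, data: bytes) -> str:
        # Derive the key from the content itself, a common approach.
        key = hashlib.sha256(data).hexdigest()
        self._index[key] = data
        return key

    def get(self, key: str) -> bytes:
        return self._index[key]

store = ObjectStore()
key = store.put(b"quarterly call-centre report")
print(store.get(key))  # b'quarterly call-centre report'
```

Contrast this with the filing-cabinet model: there is no drawer or folder to get wrong, only a key to look up.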

As always, your IT staff will usually be the most qualified to help you find the right solution, but you can harness their skills much more effectively by thinking about exactly what you need from your data storage and how this might change in the short-, medium- and long-term.