The Dirty Truth About Big Data

Dirty — incomplete, outdated and/or duplicated — data can lead a company from false results to disastrous conclusions. But it’s not easy to clean big data. And when you’re talking about big, dirty data, that’s a lot of laundry.
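To make the three kinds of dirt concrete, here is a minimal Python sketch that labels each record as incomplete, outdated or duplicated. The record fields (`id`, `email`, `updated`), the cutoff date and the `classify` helper are all assumptions made up for illustration; real data sets will need their own rules.

```python
from datetime import date

# Hypothetical records; the field names are assumptions for illustration.
records = [
    {"id": 1, "email": "a@example.com", "updated": date(2024, 5, 1)},
    {"id": 2, "email": None, "updated": date(2024, 4, 12)},            # incomplete
    {"id": 3, "email": "c@example.com", "updated": date(2019, 1, 3)},  # outdated
    {"id": 1, "email": "a@example.com", "updated": date(2024, 5, 1)},  # repeats id 1
]


def classify(records, cutoff):
    """Label each record 'clean', 'incomplete', 'outdated' or 'duplicated'."""
    seen = set()
    labels = []
    for rec in records:
        # A record whose id has already appeared counts as a duplicate.
        if rec["id"] in seen:
            labels.append("duplicated")
            continue
        seen.add(rec["id"])
        if any(value is None for value in rec.values()):
            labels.append("incomplete")
        elif rec["updated"] < cutoff:
            labels.append("outdated")
        else:
            labels.append("clean")
    return labels


labels = classify(records, cutoff=date(2023, 1, 1))
print(labels)
```

Labeling records rather than deleting them right away keeps the decision about what to throw out separate from the work of finding the dirt.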

Massive data collections — commonly called big data — are really hard to scrub. Those data sets are so large and complex that they exceed the processing capacity of traditional database systems.

How white do those whites have to be?

Conventional wisdom says that data has to be clean — purged and reconciled — to generate precise analysis. But big data is not conventional. It’s all about capturing, curating and storing huge data sets so they can be searched, shared, analyzed and used to create business insights. And the larger the data set you draw from, the greater the potential for informative results.

Unfortunately, the converse is also true. The greater the volume, velocity and/or variability of your data sets, the tougher they are to get clean. You can count on dedicating 80 percent of your entire data mining effort to the cleansing process alone. So just how precise do those results have to be?

Does quality matter?

Data cleansing can take a long time. To invest your resources wisely, settle this question up front: Are you after meticulous results, or will a broader perspective provide all the insight you need?

  • The sheer magnitude of big-data aggregation with an analytical platform like Hadoop might offset a lack of data hygiene and provide all the truth you need.
  • Cleaning data means throwing some of it away, and many believe that takes you further from the truth, not closer. Semi-structured NoSQL database management can provide just enough structure to organize the data, but not enough to whittle it away.
  • Mission-critical applications, such as healthcare, likely require more exacting analysis. For those situations, avoid regulatory compliance exposure, and get that stuff clean before you put it through the wringer.
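One rough way to settle the question before committing to a full cleanup is to take a sample and compare a summary figure with and without the suspect records. The sketch below does this in Python; the `cleaning_impact` helper, the sample of order totals and the rule that negative totals are dirty are all hypothetical, chosen only to show the idea.

```python
from statistics import mean


def cleaning_impact(values, is_valid):
    """Return (raw_mean, clean_mean, share_discarded) for a sample."""
    kept = [v for v in values if is_valid(v)]
    if not values or not kept:
        raise ValueError("need at least one value and one valid value")
    raw_mean = mean(values)
    clean_mean = mean(kept)
    discarded = 1 - len(kept) / len(values)
    return raw_mean, clean_mean, discarded


# Hypothetical sample of order totals; negative totals are treated as dirty.
sample = [20.0, 22.0, 18.0, 21.0, -5.0, 19.0, 20.0, 24.0, 16.0, 20.0]
raw, clean, discarded = cleaning_impact(sample, lambda v: v >= 0)

# Relative shift in the mean caused by cleaning the sample.
shift = abs(clean - raw) / abs(clean)
print(f"raw={raw:.2f} clean={clean:.2f} discarded={discarded:.0%} shift={shift:.1%}")
```

If the shift is smaller than the precision your analysis needs, the broader perspective may be enough. If it is not, or if the application is mission-critical, that is a sign the data should be cleaned first. Where to draw that line is a business decision, not something the code can decide.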

Ann Newman

Dell Contributor at Tech Page One
Ann Newman lives in Austin and blogs about BYOD, virtualization, Windows 8, storage and mobility. She has no pets. You can contact her at [email protected]
Tags: Business,Business Intelligence,Data Center,Productivity,Technology,Virtualization