Dirty data — incomplete, outdated, or duplicated — can lead a company anywhere from false results to disastrous conclusions. But cleaning big data isn’t easy. And when you’re talking about big, dirty data, that’s a lot of laundry.
How white do those whites have to be?
Conventional wisdom says that data has to be clean — purged and reconciled — to generate precise analysis. But big data is not conventional. It’s all about capturing, curating and storing huge data sets so they can be searched, shared, analyzed and used to create business insights. And the larger the data set you draw from, the greater the potential for informative results.
Unfortunately, the converse is also true. The greater the volume, velocity, and variety of your data sets, the tougher they are to get clean. You can count on dedicating 80 percent of your entire data-mining effort to the cleansing process alone. So just how precise do those results have to be?
Does quality matter?
Data cleansing can take a long time. To invest your resources wisely, settle this question upfront: Are you after meticulous results or will a broader perspective provide all the insight you need?
- The sheer magnitude of big-data aggregation with an analytical platform like Hadoop might offset a lack of data hygiene and provide all the truth you need.
- Cleaning data means throwing some of it away, and many believe that takes you further from the truth, not closer. A semi-structured NoSQL database can provide just enough structure to organize the data without whittling it away.
- Mission-critical applications, such as healthcare, likely require more exacting analysis. For those situations, avoid regulatory compliance exposure and get that data clean before you put it through the wringer.
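When cleansing is worth the effort, the work itself is mundane: normalize formats, drop records that fail validation, and deduplicate what remains. A minimal sketch in Python — the field names and rules here are hypothetical, not drawn from any particular system:

```python
def clean(records):
    """Normalize, validate, and deduplicate a list of record dicts.

    Illustrative only: the "name"/"email" fields and the validation
    rules are assumptions for the example.
    """
    seen = set()
    cleaned = []
    for rec in records:
        # Normalize: trim whitespace, lowercase the email.
        name = rec.get("name", "").strip()
        email = rec.get("email", "").strip().lower()
        # Validate: throw away incomplete or malformed records.
        if not name or "@" not in email:
            continue
        # Deduplicate on the normalized email.
        if email in seen:
            continue
        seen.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned

raw = [
    {"name": "Ada Lovelace ", "email": "ADA@example.com"},
    {"name": "Ada Lovelace", "email": "ada@example.com "},  # duplicate
    {"name": "", "email": "ghost@example.com"},             # incomplete
]
print(clean(raw))  # one clean record survives
```

Note the trade-off the bullets above describe: two of the three raw records are discarded. That’s exactly the “throwing some of it away” that looser, semi-structured approaches try to avoid.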