The Value of Data Dredging

There are many people who doubt the value of data, and we shouldn’t just ignore them. It’s often presumed or claimed that more data, bigger data and a greater variety of data necessarily mean there is more value in mining data. You’ve probably come across the phrase “mining data for insights” many times.

However, there are still many data skeptics. While they may feel pressured into talking the talk, data skeptics are unwilling to walk the walk and make the investments needed for data mining and predictive analytics to pay off.

We should be careful about dismissing them as dinosaurs, though, because data mining and predictive analytics do not always pay off. More data, bigger data and more variety in data are by no means guaranteed to improve our decisions. “Who Cares About Evidence?” digs into how data are actually used by decision-makers.

First, what is mining? According to Wikipedia:

Mining is the extraction of valuable minerals or other geological materials from the earth usually from an orebody, lode, vein, seam, reef or placer deposits. These deposits form a mineralized package that is of economic interest to the miner.

What, then, is data mining? Again, citing Wikipedia:

Data mining is the computing process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems…The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. Data mining is the analysis step of the “knowledge discovery in databases” process, or KDD. 
 
The term is a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It also is a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support system, including artificial intelligence, machine learning, and business intelligence…Often the more general terms (large scale) data analysis and analytics – or, when referring to actual methods, artificial intelligence and machine learning – are more appropriate.

Statisticians have often used data mining to mean data dredging:

Data dredging (also data fishing, data snooping, and p-hacking) is the use of data mining to uncover patterns in data that can be presented as statistically significant, without first devising a specific hypothesis as to the underlying causality.

Data dredging, understandably, is very much frowned upon by the statistical community. I can remember when “data mining” was something you might accuse someone of but would never put on your resume!
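
To see why dredging earns that reputation, here is a minimal Python sketch. It is not taken from any of the sources quoted above. It assumes a made-up data set in which the outcome and every candidate predictor are pure random noise, and it uses the Fisher z approximation to get a p-value for each correlation. The function names (pearson_r, approx_p_value, dredge), the sample sizes and the seed are all illustrative assumptions.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(r, n):
    """Approximate two-sided p-value for r, via the Fisher z-transform."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def dredge(n_rows=200, n_predictors=100, alpha=0.05, seed=0):
    """Count noise predictors that look 'significant' against a noise outcome.

    No hypothesis is stated up front: every predictor is simply tested,
    which is the pattern the definition of data dredging describes.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits = 0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_rows)]
        r = pearson_r(predictor, outcome)
        if approx_p_value(r, n_rows) < alpha:
            hits += 1
    return hits


if __name__ == "__main__":
    n_predictors = 100
    hits = dredge(n_predictors=n_predictors)
    print(f"{hits} of {n_predictors} pure-noise predictors pass p < 0.05")
```

With a 0.05 threshold, roughly 5 in every 100 noise predictors should clear the bar by chance alone, so a wide enough search will nearly always turn up something that can be presented as a finding.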

“Dinosaurs” talk to one another and some have had direct experience with an expensive data mining and predictive analytics project that flopped. Frequently, one reason for perceived failure is a blind assumption that if you build it (data infrastructure) they (insights useful for decision-making) will come. Again, this is not guaranteed.

Unfortunately, some management teams have a very gung-ho mentality and, in essence, try to hit a grand slam home run before they’ve gotten a runner on base. Often it makes sense to start small and proceed incrementally. Exploratory analysis of samples of data, external as well as internal, can provide clues as to the potential value of the data. If your vendor tells you that you need to build a gigantic data lake and warehouse first, consider looking for a new vendor.

Those of us working in what is now often called data science need to be careful about what we promise. There are data and statistical issues as well as ethical ones, and many of us are concerned that a few well-publicized foul-ups will tarnish analytics as a whole. Moreover, there have been serious data breaches in recent years and growing concerns about privacy. Bob Hoffman’s Ad Contrarian blog also raises serious questions regarding what is vaguely called digital marketing.

If, like most people, you’re new to the general topic of data science, you may find “Predictive Analytics Demystified” helpful. “Stuff Happens” gives a quick overview of how easy it is for chance to pull the wool over our eyes. It’s also important to understand that small data are often more valuable than so-called big data – see “Preaching about Primary Research” for some reasons why.

“What Are Insights?” clears up some misconceptions about what insights really are. “What to Look for in a Statistician” and “Statistical Thinking and the Art of Lawnmower Maintenance” point out crucial differences between statisticians and number crunchers.

This topic is much too big for one short article, but I hope you’ve found it interesting and helpful!

 
