Which Comes First, the Data or the Egg?

Is there fool’s gold in them thar hills?

Any statistician or data scientist at one point or another will be asked: “I have this data. Can you tell me if there’s anything interesting in it?” I have heard variations of this question more times than I can remember. If you’ve had much training in statistics, questions like this can make you cringe.

Experienced data scientists will also understand my uneasiness with this approach to data analysis, which runs counter to the CRISP-DM framework many data scientists are familiar with:

[Figure: CRISP-DM process diagram]

“Data mining” still has negative connotations for some statisticians and, indeed, I recall statistics professors calling it data dredging or “shotgun empiricism.” See “Stuff Happens” for a brief look at some of the dangers inherent in trying to “find something” in data.
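To make the danger of dredging concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a method from the article: the function names, the sample sizes and the cutoff of 0.279 (roughly the two-sided 5% critical value for a correlation with 50 observations) are all choices made for this example. It correlates a pure-noise outcome with many pure-noise predictors and counts how many look “significant,” then gives the textbook chance of at least one false positive when the tests are independent.

```python
import math
import random


def pearson_r(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def count_spurious_hits(n_rows=50, n_predictors=200, r_critical=0.279, seed=0):
    """Count noise predictors that 'significantly' correlate with a noise outcome.

    r_critical is an assumed, approximate two-sided 5% cutoff for n_rows=50.
    Every variable is random noise, so every hit is a false positive.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits = 0
    for _ in range(n_predictors):
        predictor = [rng.gauss(0, 1) for _ in range(n_rows)]
        if abs(pearson_r(predictor, outcome)) > r_critical:
            hits += 1
    return hits


def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    print("Spurious 'findings' in pure noise:", count_spurious_hits())
    print("P(at least one false positive in 20 tests):",
          round(familywise_error_rate(20), 3))
```

With 200 unrelated predictors, about 5% of them (around ten) would be expected to clear the cutoff by chance alone, which is why an unguided search so often “finds something.”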

However, I have a confession to make: We all do it. We do it, though reluctantly, not just because we’re pressured into it but because sometimes we do get lucky and strike gold. The real stuff. How you go about it makes a difference, though. If you look closely at the CRISP-DM diagram, you’ll see that data analysis is an iterative process. There is much back and forth, and induction and deduction (loosely defined) can feed into one another many times.

As philosopher Karl Popper observed many years ago, theories and hypotheses often are sudden inspirations and not the result of rigorous, textbook applications of the scientific method. So, when asked if I can “find something,” I try to uncover what inspired the question and, in doing so, will learn at least something about the potential business questions the data may address. The analytics process, in turn, may uncover important questions that should be asked but no one had thought of. It is also imperative to learn what other data the organization might have, or would be able to obtain, that might have bearing on the questions you have begun to unearth.  

Seldom mentioned in the blogosphere or business media is that decision-makers in most industries typically have had little or no hands-on experience analyzing data beyond Excel or SQL, and sometimes not even that. What they mean by data and analysis may be very different from what you and your colleagues mean when you use these terms. Analysis, for instance, may refer to some neat-looking graphics that reveal little of business significance. Terms like machine learning and AI are badly abused and it’s usually hard to tell what is meant by them. Be on the lookout.

The data you receive may be patchy or otherwise not in the shape you’d need to even begin exploratory analysis. It is not unusual to be given aggregate data instead of case-level data. I’ve even been asked to perform multivariate analysis on data from a dozen focus group respondents! I have often made up hypothetical data in Excel to show the layout I think I’ll need, so that I can be sure I’ll get it.
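The same kind of mock-up can be sketched in Python instead of Excel. Everything below is hypothetical and made up purely for illustration: the column names, the four rows and the helper functions are not from any real dataset. The sketch builds a tiny case-level table (one row per customer), writes it out as CSV text that could be sent to the data owner as a template, and then shows the aggregate version of the same data to make the difference visible.

```python
import csv
import io
from collections import defaultdict

# Hypothetical case-level layout: one row per customer (invented values).
MOCK_CASES = [
    {"customer_id": 1, "region": "North", "spend": 120.0, "churned": 0},
    {"customer_id": 2, "region": "North", "spend": 80.0, "churned": 1},
    {"customer_id": 3, "region": "South", "spend": 200.0, "churned": 0},
    {"customer_id": 4, "region": "South", "spend": 40.0, "churned": 1},
]


def to_csv(rows):
    """Render a list of dict rows as CSV text, header first."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def aggregate_by_region(rows):
    """Collapse case-level rows into one summary row per region."""
    totals = defaultdict(lambda: {"n": 0, "spend": 0.0, "churned": 0})
    for row in rows:
        bucket = totals[row["region"]]
        bucket["n"] += 1
        bucket["spend"] += row["spend"]
        bucket["churned"] += row["churned"]
    return [
        {
            "region": region,
            "n": bucket["n"],
            "mean_spend": bucket["spend"] / bucket["n"],
            "churn_rate": bucket["churned"] / bucket["n"],
        }
        for region, bucket in sorted(totals.items())
    ]


if __name__ == "__main__":
    print("What I need (case level):")
    print(to_csv(MOCK_CASES))
    print("What I am often given (aggregate):")
    print(to_csv(aggregate_by_region(MOCK_CASES)))
```

In this made-up example both regions show the same churn rate once aggregated, while the case-level rows show that the customers who churned were the lower spenders. That relationship cannot be recovered from the summary table, which is the point of asking for case-level data up front.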

Be very careful about what you assume, particularly when you are working with someone for the first time.

If you flatly refuse you may be viewed as uncooperative or as a geeky purist and, in my experience, how you tackle these sorts of requests is an important part of a statistician’s job. With any kind of business analytics, it’s crucial to find out who will be using the results of your analysis, when they will use them and (to the extent possible) how the results will be used. Step lightly, though, as the person or persons posing the original question may react defensively or be reluctant to reveal details for reasons of confidentiality or, frankly, organizational politics. 

Setting expectations is also critical. Avoid saying, “yeah, sure, no problem.” If you do, there is a high risk that there will be a problem. Be clear that you will ask many questions, including clarifications about the data, and make certain they know that you may come up empty-handed. Also, try to gently convey the risks of data dredging without confusing them or giving the impression that the effort will at best be fruitless. We humans have a strong tendency to think dichotomously and statisticians, who have been trained to think in terms of conditional probabilities, need to be on guard for this.

Ethics should never be sacrificed, however, and if you feel you have strong reasons to avoid taking on the project, turn down the request and tactfully explain why. How you do this will depend on your personality, national culture and your relationship with the person or persons making the request. 

In analytics, the maths are the easier part!
