Statisticians and the Real World

Statistics is not simply plugging numbers into formulas and outputting the results. It’s really a set of reasoning tools. Though my own education in statistics put considerable emphasis on how to use statistics, I have heard more than a few statisticians complain that their academic training did not prepare them for the real world of data analysis. Their coursework heavily stressed mathematics and programming, and application was shortchanged in their view.

Have a peek inside a few recent stats textbooks to see what they mean. Many of the examples given probably employ simulated data, or real data that pertain to a very narrow specialization (e.g., the mating etiquette of crocodiles). How to communicate statistical concepts to lay persons is also overlooked in some stats programs, which is a serious shortcoming. Not surprisingly, some statisticians who excel in the classroom struggle once they leave it. 

In this short post I’ve jotted down a few thoughts based on my 30-plus years as a marketing scientist. Inbred thinking is a problem in many fields and, to minimize this, along the way I have interacted frequently with statisticians working in other industries as well as with academics. My sense is that my “world view” is quite typical of statisticians and a lot of what I’ll have to say here is just commonsense to an experienced statistician – for better or for worse.

Hopefully this post will be helpful to young statisticians as well as users of statistics – decision-makers – and data scientists too. Compared to data scientists, statisticians are usually more concerned with the uncertainty underlying the interpretation of data than with maximizing the utility of the data themselves. (Apologies to Peter Diggle, past president of the Royal Statistical Society, who phrased this contrast more eloquently.) Uncertainty lies at the heart of our discipline, in no small part due to its early origins in astronomy, gambling and actuarial science. Data are less important to us than the questions they raise and the answers they may offer. (I can highly recommend this 20-minute podcast by Chris Fonnesbeck, the chief developer of the PyMC package, to anyone. Especially cats.)

Now…reality. Not infrequently, statisticians are given a pile of data out of the blue and asked, in effect, "Can you find me something?" This is less than ideal. At the other, more positive, extreme, many statisticians are in a consulting role and involved early on in the design of the project. This probably happens less often than most statisticians would wish, however.

Either way, a sort of First Commandment for me goes something like this: before you do anything, understand what decisions will be made, who will be using your results and when they will be used. It's critical to understand what motivated the project, and it can be risky to make assumptions about this, about your own role, or about the expectations the key stakeholders have for you and the project. Begin with the end in mind.

Sometimes, though, we feel there is not enough time, or the answers we get to these questions are vague. However, if you skip or gloss over this crucial first step, you may take a serious wrong turn and never arrive at a destination anyone is happy with. It's always worth the hassle – and it is not always a hassle – as a statistician's input is increasingly valued by decision-makers, in my experience. One of the background issues I try to (tactfully) uncover is how research-savvy the decision-makers are and how data and analytics are used in general within the client organization. I'll return to this in a moment.

Secondly, even when a statistician has been heavily involved in the design of the research and in data collection and processing, they should never assume the data are clean. There is nearly always work to be done, even when the data arrive in pretty good shape. Some data fields (variables) may be accurate but must be combined with other data or recoded for analytics purposes, for example. I find the cleaning and data setup phase of the analytics process an excellent way to get to know the data – I clean, set up and explore the data all at once. It's not always fun but is absolutely essential.

Thirdly, don’t get set in your ways and always stick to analytics methods you’re comfortable with. Be open to learning and trying new things but don’t do something new just to be different – there should be substantive reasons for what you do. (Here I’m paraphrasing advice innovative jazz trumpeter Woody Shaw often gave his band members.)

Don't become methodologically driven – you might learn the hard way that a method you prefer and think best for a given situation is too complicated for decision-makers to absorb. In these cases, your statistically elegant results will most likely be ignored, and you may not be asked to do future work for the client. "Make things as simple as possible but no simpler" is often attributed to Albert Einstein and, regardless of who first uttered these words, it is sound advice.

For data mining and predictive analytics projects it might be best to stick with popular “machine learning” techniques such as stochastic gradient boosting. That said, when feasible, understanding the data generating process is always helpful and you might wish to consider using one method for predictions and another for helping you and your clients understand what is driving customers to behave in one way or another (for example).
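The post includes no code, but the two-model idea above can be sketched briefly. This is an illustrative example of my own (simulated data, scikit-learn assumed as the toolkit – the original names no libraries): a gradient boosting model used purely for prediction, paired with a plain logistic regression used to discuss which variables matter and in what direction.

```python
# Hedged sketch (not from the original post): pair a predictive "machine
# learning" model with a simpler, interpretable model on the same data.
# The data here are simulated; in practice you would use the client's data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Model 1: gradient boosting, used purely for prediction.
gbm = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)
gbm_auc = roc_auc_score(y_test, gbm.predict_proba(X_test)[:, 1])

# Model 2: logistic regression, used to talk about what drives the outcome
# (signs and magnitudes of coefficients), at some cost in predictive power.
logit = LogisticRegression(max_iter=1000).fit(X_train, y_train)
logit_auc = roc_auc_score(y_test, logit.predict_proba(X_test)[:, 1])

print(f"Boosting AUC: {gbm_auc:.3f}  Logistic AUC: {logit_auc:.3f}")
```

The point is not that either model "wins" on accuracy; it is that the two serve different audiences – one scores new cases, the other supports a conversation about the data generating process.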

Statisticians (myself included) often don’t give the right hand side of regression equations (the part to the right of Y = ) the attention it deserves. Regression Modeling Strategies (Harrell) and Applied Logistic Regression (Hosmer and Lemeshow) are two books that explain in depth how to build regression models that predict well on new data and are informative. To follow up on this thought, most decisions involve notions about causality. Frequently they are implicit and decision-makers are not really conscious of them. Bone up on causal analysis if you haven’t had much experience in that area. My post “Causal Analysis: The Next Frontier In Analytics?” lists a number of references and might be of help.
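To make the right-hand-side point concrete, here is a small sketch of my own (simulated data, plain NumPy – none of this appears in the original post): when the true process involves curvature and an interaction, entering the predictors linearly misses most of the signal, while a modestly more thoughtful right-hand side recovers it.

```python
# Hedged illustration (simulated data): the right-hand side of a regression
# deserves attention. A squared term and an interaction, when warranted by
# the data generating process, can markedly improve fit.
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# The (known, simulated) process involves curvature and an interaction.
y = 1.0 + 0.5 * x1 + x1**2 + 0.8 * x1 * x2 + rng.normal(scale=0.5, size=n)

def r_squared(X, y):
    """Fit OLS by least squares and return R-squared."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_linear = r_squared(np.column_stack([x1, x2]), y)                  # x1 + x2 only
r2_thoughtful = r_squared(np.column_stack([x1, x2, x1**2, x1 * x2]), y)
print(f"Linear RHS R^2: {r2_linear:.2f}  With square + interaction: {r2_thoughtful:.2f}")
```

In real work the candidate terms come from substantive knowledge, diagnostics and texts like those cited above, not from knowing the true model in advance.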

When communicating your results, even informally, never try to showcase your technical knowledge. Avoid jargon. An exception would be when your immediate client is a statistician and wants to be impressed by your technical skills and knowledge. In my experience, this is very rare, though.

Conversely, remember there are lies, damned lies and then there are data visualizations…which can take a lot of time to prepare and may confuse or mislead the audience. Graphics can be a godsend or get you in a heap of trouble. Academic statisticians often seem to prefer simple graphs and bullet points in their lectures and presentations and I think there is a lesson in that. (A sales pitch is a different matter.)

Since statistics is such a complicated discipline, it takes time and experience to understand how technical details that are important in the classroom will impact decision-making. It takes even longer to learn how to communicate to clients that some minutia are important and can affect their decisions! Try to get in the habit of asking yourself if and how a choice you are making regarding which method or option to use will influence the eventual decisions your client makes.

Space does not permit me to discuss at length the human tendency to think dichotomously, to reject evidence that does not support one's point of view or is not politically useful, groupthink, and the numerous other cognitive deficiencies we suffer from as a species. These also affect the work of statisticians and data scientists – having lots of data and elaborate analytics does not immunize us from human nature. We can be seen as threats, or merely as dweebs, and our massive data and sophisticated models ignored.

There are a lot of disagreements among statisticians, far more than most non-statisticians may realize. Though p-value bashing is now all the rage, we need to remember that null hypothesis significance testing (NHST) has been part of science for many decades, and we need to understand it in order to read the research literature. Furthermore, it remains the orthodoxy.

Bayes versus frequentism…Bayesian statistics certainly can come to the rescue in some circumstances, but I personally do not think it will replace frequentist approaches in the foreseeable future. Statisticians will need to learn both, as well as artificial neural networks (ANN), support vector machines (SVM), boosting, bagging and the other tools commonly used in data science. Time-series analysis, structural equation modeling, meta-analysis and propensity score analysis are other areas statisticians should now have a good understanding of but historically have tended to ignore. Likewise, some facility with the R language has become a must, and before long it may become the lingua franca of data analysis. (I am not making a veiled plug for R – I personally use many packages, including R, and find that more than one is needed for most of my projects.)

To sum up, my advice is to be non-ideological about methods and software and, without compromising your objectivity, to do your best to put yourself in the shoes of the people who will be using the results of your analyses. Remember that you're only one player on the field.

Hope you’ve found this interesting and helpful!

Kevin Gray is president of Cannon Gray, a marketing science and analytics consultancy.

