
Data Science and Data Science Fiction

Data Science is what Data Scientists really do, but much of the buzz is about Data Science Fiction. Here are a few tips for young people thinking about a career in Data Science.

Everything has its upsides and downsides. Even a true wonder drug like aspirin has side effects and should be avoided by some people. Unfortunately, in our exuberance, the downsides can be forgotten and the upsides blown out of proportion. Naturally, commercial interests also exert their influence.

Data Science is one such example. Articles that cross the border from journalism into science fiction abound, and it’s easy to imagine a faraway world populated by superhuman creatures. Often derided as “unicorns” by practicing data scientists, these beings seem to have the equivalent of a PhD in computer science, a PhD in statistics and at least two MBAs, plus 20 years or more of experience, and charms galore. Granted, even here on Earth there are real people with similar talents and experience, but not very many. There is, no doubt, a serious shortage of unicorns.

Happily, real Data Science doesn’t require unicorns. When I began my marketing research career, what is now called Data Science already existed, albeit in a less technically glamorous form and mostly hidden behind the scenes in the back room. Data scientists in those days were often called “Programmer/Statisticians” and you can still occasionally spot this occupational title in want ads. Some were diehard Fortran and Cobol folks with a good stats background, but many were statisticians of various stripes who used SAS.

The CRISP-DM flowchart shown below was developed a decade later and is still a very useful framework. While not as sexy as what we find on Planet Unicorn, Data Science is sexy enough for most of us and requires solid skills in many areas, computer science and statistics being the obvious ones. Business or organizational acumen, communication skills and subject matter experience are also needed to thrive, as there still are many Data Science doubters in the real world. Part of this skepticism, I suspect, is a reaction to the embellishments of Data Science most of us are exposed to in the business media and blogosphere. Some also see it as a threat, particularly those who’ve “gone with their gut” throughout their careers.

As the flowchart suggests, the data we need are frequently not organized the way we’d wish, or are unavailable altogether. This is true even in companies with data warehouses and data marts since, until recently, the principal focus has been on capturing and storing data, not on analyzing it beyond simple SQL queries. Moreover, a substantial amount of Data Science is grunt work, or “janitorial work” in computer science vernacular. It is not at all unusual to spend 80% or more of one’s time on data cleaning and setup – essentially data processing tasks.

Statisticians by any job title are accustomed to this and regard it as part of the job. An upside to the heavy lifting is that it’s an excellent way to get to know the data. This is essential, as the Data Science work requiring humans cannot be done entirely mechanically. We need to know what the data mean in order to catch errors in the data, recode and transform data, and interpret our models.
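To make the "janitorial work" a little more concrete, here is a minimal Python sketch of the three chores just mentioned: catching errors, recoding and transforming. Everything in it is an assumption made up for illustration (the records, the field names, the region codes and the 0 to 120 age range); it is not taken from any real project, and real cleaning rules depend on knowing what the data mean.

```python
import math

# Hypothetical raw records; field names and values are invented for illustration.
raw_records = [
    {"id": 1, "age": "34", "region": "north", "income": "52000"},
    {"id": 2, "age": "-1", "region": "N", "income": "61000"},
    {"id": 3, "age": "47", "region": "South ", "income": ""},
    {"id": 4, "age": "290", "region": "s", "income": "48000"},
]

# Assumed recoding scheme: collapse inconsistent spellings into one code each.
REGION_CODES = {"north": "N", "n": "N", "south": "S", "s": "S"}


def clean(record):
    """Return (cleaned_record, problems) for one raw record."""
    problems = []

    # Catch errors: an age outside a plausible range is flagged and set to missing.
    age = int(record["age"])
    if not 0 <= age <= 120:
        problems.append("age out of range")
        age = None

    # Recode: normalize whitespace and case, then map to a standard code.
    region = REGION_CODES.get(record["region"].strip().lower())
    if region is None:
        problems.append("unknown region")

    # Transform: log income, leaving missing values missing rather than guessing.
    income = float(record["income"]) if record["income"] else None
    if income is None:
        problems.append("income missing")
    log_income = math.log(income) if income is not None and income > 0 else None

    cleaned_record = {
        "id": record["id"],
        "age": age,
        "region": region,
        "income": income,
        "log_income": log_income,
    }
    return cleaned_record, problems


cleaned = [clean(r) for r in raw_records]
for row, problems in cleaned:
    print(row["id"], row["region"], row["age"], problems)
```

Note that the sketch flags problems instead of silently fixing them. Deciding whether an age of 290 is a typo for 29 or simply garbage is exactly the kind of judgment that requires a human who knows the data.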

Even if you’re one of the lucky handful who can do it all, you’re still going to need a team because of constraints on your time. Real Data Science is a team sport – no one has time to be an infielder, outfielder, pitcher, catcher and manager even if, astoundingly, they have a flair for all these positions. My own skill set and interests are in data analysis “positions” and consulting rather than data management, but there is room for many kinds of people on these teams, some of which are ad hoc.


There are incredibly gifted people – some high school dropouts – developing new Artificial Intelligence tools and working on radically new forms of computing. Of course, there are a few Elon Musks, Ray Kurzweils and other visionaries too. In part because of their collective genius and hard work, the world in 50 or 100 years will very likely be much different from ours. I do not know what this world will look like. I used to scoff at Transhumanism and Posthumanism but no longer – those sorts of worlds, creepy as they seem, may one day be the worlds of our grandchildren or great-grandchildren.

Most of us guys in Data Science do some cool stuff but, like most folks, we’re practical guys just earnin’ a livin’. Some of us will make it big but if there is a guaranteed road to riches, it ain’t Data Science. So, if you’re still in college or just getting a start on your career, keep hold of your ambition and dreams but, at the same time, be realistic as well. A lot of Data Science that is now done by humans will undoubtedly be automated in the future, and those jobs will disappear. Learning a dozen programming languages may not be enough.

We are humans and Earth is our planet.

Here is a short selection of books on Data Science and related topics you might want to have a look at. There are many more, but these are ones I can personally vouch for.

  • A Practitioner’s Guide to Business Analytics (Bartlett)
  • Data Mining Techniques (Linoff and Berry)
  • Data Mining (Witten et al.)
  • Applied Predictive Modeling (Kuhn and Johnson)
  • An Introduction to Statistical Learning (James et al.)
  • The Elements of Statistical Learning (Hastie et al.)
  • Data Mining: The Textbook (Aggarwal)
  • The Data Warehouse Toolkit (Kimball and Ross)
  • Data Architecture: A Primer for the Data Scientist (Inmon and Linstedt)
  • Building a Scalable Data Warehouse with Data Vault 2.0 (Linstedt)
  • Designing Data-Intensive Applications (Kleppmann)

My company library lists many more books as well as academic journals that pertain to subjects such as marketing, statistics, research methodology and computer science. In addition, there are MOOCs, seminars, webinars and other online sources such as KDnuggets you can explore.

An Analytics Toolbox provides an overview of the kinds of analytic methods I personally use most often. I’d urge you to study statistics before turning to machine learners such as Artificial Neural Networks. If you are able to design research and analyze and interpret data as a statistician can, more doors will open for you and there is less risk of being automated out of a job.

I find this comment by former Royal Statistical Society President Peter Diggle instructive:

“Informatics seeks to maximize the utility of data, whereas statistics seeks to minimize the uncertainty that is associated with the interpretation of data.”
 

Causal Analysis is the next frontier in analytics, in my opinion.

I hope you’ve found this interesting and helpful!

 
