The biggest headache in machine learning? Cleaning dirty data off the spreadsheets

If you imagine the life of a machine learning researcher, you might think it’s quite glamorous. You’ll program self-driving cars, work for the biggest names in tech, and your software could even lead to the downfall of humanity. So cool! But, as a new survey of data scientists and machine learners shows, those expectations need adjusting, because the biggest challenge in these professions is something quite mundane: cleaning dirty data.

This comes from a survey conducted by data science community Kaggle (which was acquired by Google earlier this year). Some 16,700 of the site’s 1.3 million members responded to the questionnaire, and when asked about the biggest barriers faced at work, the most common answer was “dirty data,” followed by a lack of talent in the field.

But what exactly is dirty data, and why is it such a problem?

It’s axiomatic to say that data is the new oil of the digital economy, but this is especially true in fields like machine learning. Contemporary AI systems generally learn by example, so if you show one lots of pictures of a cat, over time it’ll start to recognize characteristics that constitute ‘cattyness’. This is why companies like Google and Amazon have been able to build such effective image and speech recognition platforms: they have a ton of data from users.

But AI systems are still computer programs, which means they’re prone to flipping out if you press the wrong button at the wrong time. This inflexibility extends to the data they can learn from. Think of these programs like fussy infants who refuse to eat unless their bananas are mashed just so. But instead of prepping bananas, workers in the field have to comb through datasets with hundreds of thousands of entries, tracking down missing values and removing formatting errors. Making airplane noises while they do so is optional.
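What does that combing actually look like? A minimal sketch, in plain Python with hypothetical field names: normalize placeholder strings, strip stray whitespace, and coerce text to numbers before a model ever sees the data.

```python
# Hypothetical raw records, with the usual mess: placeholder strings,
# stray whitespace, and thousands separators.
raw_rows = [
    {"age": "34", "income": "52,000"},
    {"age": "n/a", "income": "61000"},
    {"age": " 29 ", "income": None},
]

# Values that should be treated as missing.
MISSING = {"", "n/a", "na", "null", None}

def to_number(value):
    """Return a float, or None if the value is missing or malformed."""
    if value in MISSING:
        return None
    try:
        # Strip whitespace and remove thousands separators first.
        return float(str(value).strip().replace(",", ""))
    except ValueError:
        return None

cleaned = [{k: to_number(v) for k, v in row.items()} for row in raw_rows]

# Drop rows that still contain missing values -- one possible policy;
# imputing a default value is another.
complete = [row for row in cleaned if None not in row.values()]
```

Real pipelines use libraries like pandas for this, but the shape of the work is the same: most of the effort goes into deciding what counts as "dirty" and what to do about it.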

“There’s the joke that 80 percent of data science is cleaning the data and 20 percent is complaining about cleaning the data,” Kaggle founder and CEO Anthony Goldbloom told The Verge over email. “In reality, it really varies. But data cleaning is a much higher proportion of data science than an outsider would expect. Actually training models is typically a relatively small proportion (less than 10 percent) of what a machine learner or data scientist does.”

Kaggle itself is intended to help. The site is best known for its competitions, where companies post a specific data-related challenge and then pay the person who comes up with the best solution. (The money itself isn’t great, but winning is a good way to get noticed by recruiters.) This means Kaggle has also become a repository of interesting datasets that users can play around with. These range from a collection of 22,000 graded high school essays to CT scans for lung cancer to a whole lot of pictures of fish. (Posted by a US environmental NGO hoping to hook a better fish-identifying AI.)

Kaggle’s survey wasn’t just about data, though, and it includes other interesting tidbits. For a start, a master’s degree was the most common level of educational attainment among respondents (followed by a bachelor’s and then a doctoral degree). And Python was both the most commonly used programming language and the top language recommended to individuals looking to break into the field. Also notable: despite the attention focused on newer techniques like neural networks, most practitioners rely more often on older, less glamorous statistical methods.

For example, a type of analysis known as “logistic regression” was the most commonly used (63.5 percent of respondents said they deployed it), while neural networks came in only fourth (used by 37.6 percent). The roots of logistic regression as a mathematical tool are centuries old, and it’s used to find the probability that a point in any given dataset belongs to a specific category. Goldbloom suggests that one reason for its popularity is the fact that it’s a mainstay of university courses, and used in a wide variety of fields.
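The idea behind it is simple enough to fit in a few lines. A logistic regression model squashes a weighted score through the logistic function to produce a probability between 0 and 1. Here’s a hedged sketch with a single feature and hand-set (not trained) parameters:

```python
import math

def logistic_probability(x, weight, bias):
    """Probability that input x belongs to the positive class,
    under a one-feature logistic regression model.
    The weight and bias here are illustrative, not learned."""
    score = weight * x + bias
    return 1.0 / (1.0 + math.exp(-score))
```

With `weight=1.0` and `bias=0.0`, an input of 0 sits exactly on the decision boundary (probability 0.5), and larger inputs push the probability toward 1. In practice the weights are learned from data, typically via a library like scikit-learn rather than by hand.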

“Linear regression and logistic regression are taught to every undergraduate that does any statistics related course,” he says. “Including machine learning, econometrics, psychology, bioinformatics…” Goldbloom notes that as a mathematical tool it can be “brittle and not very powerful,” but academic and industry inertia means it’s not going anywhere soon. As one high-ranking Kaggle “grandmaster” noted in response to the survey: “300,000 years from now, there will be stones, cockroaches, and logistic regressions left in this world.”

Neural networks, meanwhile, get the most attention because they’re particularly well-suited for tasks involving image, video, and audio data. (Aka, all the cool stuff happening in AI right now.) For text and numerical information, though, the older methods are more suitable. So if you’re planning on getting into machine learning or data science any time soon, be prepared to start cleaning those spreadsheets.

