“Mine and you shall find” is an implicit assumption (and sometimes explicit claim) underlying big data and data science as frequently practiced. The bigger the data, the more accurate it is and the more value it has to decision-makers.
Modern machine learning methods and Artificial Intelligence are now able to extract meaning from data without resorting to theory. Bigger is necessarily better.
Or so we are sometimes told. Statisticians have long cautioned against such beliefs and competent data scientists echo these warnings.
Python Data Science Essentials: A practitioner’s guide covering essential data science principles, tools, and techniques, by Alberto Boschetti and Luca Massaron, is now in its 3rd Edition. It covers a wide range of data science topics from the perspective of the Python programming language.
How well a model generalizes to new data – data not used when constructing the model – is one of these topics and is discussed in several places in the book. Below are some brief excerpts from the book related to these concerns. Any copy/paste and editing errors are mine.
“A machine learning algorithm, by observing a series of examples and pairing them with their outcome, is able to extract a series of rules that can be successfully generalized to new examples by correctly guessing their resulting outcome. Such is the supervised learning approach, where it applies a series of highly specialized learning algorithms that we expect can correctly predict (and generalize) on any new data.
But how can we correctly apply the learning process in order to achieve the best model for prediction to be generally used with similar yet new data? In data science, there are some best practices to be followed that can assure you the best results in the future generalization of your model to any new data…
A learning curve is a useful diagnostic graphic that depicts the behavior of your machine learning algorithm (your hypothesis) with respect to the available quantity of observations. The idea is to compare how the training performance (the error or accuracy of the in-sample cases) behaves with respect to the cross-validation (usually tenfold) using different in-sample sizes.
As far as the training error is concerned, you should expect it to be high at the start and then decrease. However, depending on the bias and variance level of the hypothesis, you will notice different behaviors:
- A high-bias hypothesis tends to start with average error performances, decreases rapidly on being exposed to more complex data, and then remains at the same level of performance no matter how many cases you further add.
- Low-bias learners tend to generalize better in the presence of many cases, but they are limited in their capability to approximate complex data structures, hence their limited performance.
- A high-variance hypothesis tends to start high in error performance and then slowly decreases as you add more cases. It tends to decrease slowly because it has a high capacity of recording the in-sample characteristics.
As for cross-validation, we can notice two behaviors:
- A high-bias hypothesis tends to start with low performance, but it grows very rapidly until it reaches almost the same performance as that of the training. Then, it stops growing.
- A high-variance hypothesis tends to start with very low performance. Then, steadily but slowly, it improves as more cases help generalize. It hardly reaches the in-sample performances, and there is always a gap between them.
Being able to estimate whether your machine learning solution is behaving as a high-bias or high-variance hypothesis immediately helps you in deciding how to improve your data science project…
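As an aside from the excerpt, the learning-curve diagnostic can be sketched with scikit-learn's `learning_curve`. This is a minimal illustration and not code from the book: the digits dataset, the fully grown decision tree (picked here as a plausibly high-variance learner), and the five sample sizes are all assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

# Assumed example data and model; swap in your own estimator and dataset.
X, y = load_digits(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# Score the model on growing training subsets, using tenfold cross-validation.
train_sizes, train_scores, cv_scores = learning_curve(
    model,
    X,
    y,
    train_sizes=np.linspace(0.1, 1.0, 5),
    cv=10,
    scoring="accuracy",
    shuffle=True,
    random_state=0,
)

# Average over the ten folds for each training size.
train_mean = train_scores.mean(axis=1)
cv_mean = cv_scores.mean(axis=1)

for n, tr, cv in zip(train_sizes, train_mean, cv_mean):
    print(f"{int(n):5d} cases  train={tr:.3f}  cv={cv:.3f}  gap={tr - cv:.3f}")
```

A gap between the training and cross-validation columns that stays wide as cases are added is the high-variance pattern described above. Two columns that converge early and then stay flat point toward high bias.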
As learning curves operate on different sample sizes, validation curves estimate the training and cross-validation performance with respect to the values that a hyper-parameter can take. As in learning curves, similar considerations can be applied, though this particular visualization will grant you further insight about the optimization behavior of your parameter, visually suggesting to you the part of the hyper-parameter space that you should concentrate your search on…
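Again as an aside from the excerpt, a validation curve can be sketched with scikit-learn's `validation_curve`. The dataset, the decision tree, and the choice of `max_depth` as the hyper-parameter to vary are assumptions for illustration, not the book's own example.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

# Assumed example data; the hyper-parameter grid below is illustrative only.
X, y = load_digits(return_X_y=True)
depths = [1, 2, 4, 8, 16]

# For each candidate max_depth, collect training and tenfold CV accuracy.
train_scores, cv_scores = validation_curve(
    DecisionTreeClassifier(random_state=0),
    X,
    y,
    param_name="max_depth",
    param_range=depths,
    cv=10,
    scoring="accuracy",
)

# Average over the folds for each hyper-parameter value.
train_mean = train_scores.mean(axis=1)
cv_mean = cv_scores.mean(axis=1)

for depth, tr, cv in zip(depths, train_mean, cv_mean):
    print(f"max_depth={depth:2d}  train={tr:.3f}  cv={cv:.3f}")
```

The range of values where the cross-validation score stops improving while the training score keeps climbing suggests where a finer search is likely to pay off.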
We strongly suggest that you use cross-validation just for optimization purposes and not for performance estimation (that is, to figure out what the error of the model might be on fresh data). Cross-validation just points out the best possible algorithm and parameter choice based on the best averaged result.
Using it for performance estimation would mean using the best result found, a more optimistic estimation than it should be. In order to report an unbiased estimation of your possible performance, you should prefer using a test set.”
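The split between tuning and estimation recommended in the excerpt can be sketched as follows. This is a minimal illustration under assumed choices (digits dataset, decision tree, a small hypothetical parameter grid, a 25% held-out test set), not the authors' code. Cross-validation on the training portion selects the parameters, and the untouched test set supplies the performance estimate.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed example data; hold out a test set before any tuning takes place.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Tenfold cross-validation is used only to choose among parameter settings.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 8, 16], "min_samples_leaf": [1, 5]},
    cv=10,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Best averaged CV score: the selection criterion, likely optimistic.
cv_estimate = search.best_score_
# Accuracy of the refitted best model on data never seen during tuning.
test_estimate = search.score(X_test, y_test)

print("chosen parameters:", search.best_params_)
print(f"best cross-validation accuracy: {cv_estimate:.3f}")
print(f"held-out test accuracy:         {test_estimate:.3f}")
```

The cross-validation figure is the maximum over all settings tried, which is why it tends to flatter the model. The test-set figure is the one to report.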
Source: Boschetti, Alberto and Massaron, Luca. Python Data Science Essentials: A practitioner’s guide covering essential data science principles, tools, and techniques, 3rd Edition, Packt Publishing.