Overfitting and underfitting are two essential concepts in machine learning and common reasons for poor model performance. The main objective of every machine learning model is to generalise well, producing suitable outputs for previously unseen inputs. In a nutshell, after training on a dataset the model should produce accurate and reliable predictions. If a model performs well on the training data but its performance drops on the test data, it is said to be overfitting. On the other hand, if the model performs poorly on both the training and the test sets, it is underfitting.
These concepts are closely related to the bias-variance trade-off. This article gives you a brief idea of the two concepts and how to avoid them. Let us understand bias and variance before we explore each concept in detail!
Overfitting refers to a model that performs well on the training data but poorly on test data. It happens when the model learns the details and noise of the training data to such an extent that this negatively affects its performance on new data: random fluctuations in the training data are picked up and learned as concepts by the model. However, these "concepts" do not apply to new data and therefore hurt the model's ability to generalise.
Overfitting is more likely with nonlinear and nonparametric models, which are more flexible when learning a target function. For this reason, many nonparametric machine learning algorithms include techniques or parameters to constrain and limit how much detail the model is able to learn. In short, an overfitted model has high variance and low bias.
For instance, decision trees are susceptible to overfitting because they are very flexible. The problem is commonly addressed by pruning the tree after training to remove some of the detail it has picked up.
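The decision-tree behaviour above can be sketched with scikit-learn (assuming it is installed). The dataset, label-noise rate, and the `max_depth=3` pruning limit are hypothetical choices for illustration; the point is that the unpruned tree memorises the training set while the depth-limited tree generalises better.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = rng.rand(400, 5)
# Label depends only on the first feature, plus 20% label noise.
y = (X[:, 0] > 0.5).astype(int)
flip = rng.rand(400) < 0.2
y[flip] = 1 - y[flip]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Unpruned tree: flexible enough to memorise every noisy training label.
deep = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
# Depth-limited ("pruned") tree: forced to keep only the dominant split.
pruned = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)

print("unpruned: train", deep.score(X_tr, y_tr), "test", deep.score(X_te, y_te))
print("pruned:   train", pruned.score(X_tr, y_tr), "test", pruned.score(X_te, y_te))
```

The unpruned tree reaches perfect training accuracy but a much lower test score, while the pruned tree shows a far smaller gap: the classic overfitting signature.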
Examples of Overfitting
Overfitting typically occurs for the following reasons:
- The model is too complex for the amount of training data available.
- The training data is noisy, and the model learns the noise as if it were signal.
- The model is trained for too long on the same data.
The user can use the following tricks to tackle the problem of overfitting:
- Train with more data, or use data augmentation.
- Simplify the model or add regularisation.
- Use cross-validation and early stopping.
- Use ensembling methods such as bagging and boosting.
Overfitting cannot be detected until the model is tested on data it has not seen, because its defining symptom is an inability to generalise beyond the training set. Therefore you separate the data into subsets so that training and testing become easy: the data is divided into two main parts, a training set and a test set. The training set holds the majority of the data, typically 80%, and is used to train the model; the remaining 20% forms the test set, used to measure accuracy on data the model has not interacted with before. Segmenting the data in this way lets you examine the model's performance during training and stop when overfitting occurs. Comparing the accuracy observed on the two sets then gives a concrete signal: if the model performs noticeably better on the training set than on the test set, it is overfitted.
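A minimal sketch of this detection recipe, using only NumPy: the data, the noise level, and the degree-15 polynomial are hypothetical choices made to force overfitting, so that the train/test error gap becomes visible after an 80/20 split.

```python
import numpy as np

rng = np.random.RandomState(1)
x = rng.uniform(-1, 1, 50)
y = np.sin(3 * x) + rng.normal(0, 0.3, 50)   # noisy nonlinear target

# 80/20 split into training and test sets.
n_train = int(0.8 * len(x))
x_tr, x_te = x[:n_train], x[n_train:]
y_tr, y_te = y[:n_train], y[n_train:]

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

# Degree-15 polynomial: flexible enough to fit the noise in 40 points.
coeffs = np.polyfit(x_tr, y_tr, deg=15)
train_err = mse(coeffs, x_tr, y_tr)
test_err = mse(coeffs, x_te, y_te)
print("train MSE:", train_err, " test MSE:", test_err)
```

A test error substantially above the training error is exactly the overfitting signal described above.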
Overfitting can also be identified by keeping validation metrics such as loss and accuracy in check. Validation accuracy generally improves up to a point and then stagnates or starts declining once overfitting sets in, even while training accuracy continues to rise. Similarly, validation loss falls during the early stages of training and then begins to climb again when the model starts memorising the training data.
Underfitting refers to a scenario in which a model is unable to capture the relationship between inputs and outputs accurately, generating a high error on both the training set and unseen data. It happens when the model is too simple, which can mean it needs more training time, less regularisation, or more input features. An underfitted model fails to establish the dominant trend within the data, which results in poor performance and high training error. A model that cannot generalise to new data cannot be used for prediction or classification tasks. Low variance and high bias are the two classic indicators of underfitting, and this behaviour is already visible on the training data, which makes underfitted models easier to identify than overfitted ones.
Thus, underfitting corresponds to high bias and low variance in the model.
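The high-bias behaviour can be sketched with NumPy by fitting a straight line to clearly nonlinear data; the quadratic target and noise level here are hypothetical. The too-simple line shows high error on both the training and the test portion, while a model matching the trend does not.

```python
import numpy as np

rng = np.random.RandomState(2)
x = rng.uniform(-2, 2, 100)
y = x ** 2 + rng.normal(0, 0.1, 100)   # clearly nonlinear target

x_tr, x_te = x[:80], x[80:]
y_tr, y_te = y[:80], y[80:]

def mse(coeffs, x, y):
    return float(np.mean((np.polyval(coeffs, x) - y) ** 2))

line = np.polyfit(x_tr, y_tr, deg=1)   # too simple: underfits
quad = np.polyfit(x_tr, y_tr, deg=2)   # matches the dominant trend

print("line:", mse(line, x_tr, y_tr), mse(line, x_te, y_te))
print("quad:", mse(quad, x_tr, y_tr), mse(quad, x_te, y_te))
```

Note that the linear model's error is high on the training set itself, which is why underfitting can be spotted without a held-out set.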
Example of Underfitting
Underfitting is typically due to the following reasons:
- The model is too simple to capture the underlying trend in the data.
- Too few input features are used.
- The model is regularised too much or trained for too little time.
Underfitting can be avoided using the following techniques:
- Increase the complexity of the model or try a more flexible algorithm.
- Add more input features.
- Reduce the amount of regularisation.
- Increase the training time.
Detecting overfitting and underfitting is useful, but detection alone does not solve the problem. All you can do is try several remedies. Here are a few for your reference.
For underfitting, the main remedy is to keep trying alternative, more flexible machine learning algorithms. For overfitting, there are several options. Cross-validation is a powerful measure: you use the initial training data to generate multiple small train-test splits and tune the model on those splits. Training with more data can help the algorithm pick up the true signal, and if more data is not available, data augmentation can make the existing data appear more diverse. You can also simplify the model and add regularisation. Bagging and boosting are two popular ensembling methods that reduce variance. Early stopping halts the training process before the model passes the point where it begins to memorise the data. In a decision tree model, the maximum depth can be reduced; in neural networks, a dropout layer can be introduced; and regularisation terms can be added to linear and SVM models.
"Goodness of fit" is a term from statistics, and achieving a good fit is the goal of a machine learning model. In a nutshell, it describes how closely the predicted values (results) match the true values in the dataset.
To be considered a good fit, a machine learning model should sit between the underfitted and the overfitted model. Ideally it would make predictions with zero error, but in reality that is not achievable.
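One common numeric measure of how closely predictions match the true values is the coefficient of determination, R-squared, sketched here with NumPy on hypothetical values:

```python
import numpy as np

# Hypothetical true values and model predictions.
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.7])

# R^2 = 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(round(r2, 3))  # close to 1.0 -> predictions track the true values
```

An R-squared near 1 indicates a close fit on this data; the same measure computed on held-out data drops when the model overfits.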
As we train the model, the error on the training data goes down, and initially the same happens on the test data. But when we train the model for too long, its performance can decrease due to overfitting, since the model also starts learning the noise present in the dataset.
Therefore, to obtain a good fit, the user has to stop training at the point just before the test error starts increasing. At this point the model performs well on the training data as well as on the test data.
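The stopping rule above can be sketched as an early-stopping loop (assuming scikit-learn is installed). The synthetic linear data, the learning rate, and the patience of 5 epochs are hypothetical choices; the loop simply halts once the held-out validation error stops improving.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 20))
w = rng.normal(size=20)
y = X @ w + rng.normal(0, 1.0, 200)

X_tr, X_val = X[:160], X[160:]
y_tr, y_val = y[:160], y[160:]

model = SGDRegressor(learning_rate="constant", eta0=0.01, random_state=0)
best_val, patience, bad_epochs = np.inf, 5, 0

for epoch in range(200):
    model.partial_fit(X_tr, y_tr)          # one pass over the training set
    val_err = float(np.mean((model.predict(X_val) - y_val) ** 2))
    if val_err < best_val:
        best_val, bad_epochs = val_err, 0  # still improving: keep going
    else:
        bad_epochs += 1
        if bad_epochs >= patience:         # validation error stopped improving
            print("stopped early at epoch", epoch)
            break
```

In practice the best model weights are also saved at the epoch with the lowest validation error, so that the final model is the one from just before the error started climbing.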
A few common goodness of fit tests are:
- The chi-square test
- The Kolmogorov-Smirnov test
Hope you have now understood the concepts of overfitting and underfitting in machine learning. It is vital to know the techniques for avoiding both problems, and a user has to strike the right balance between the two to produce accurate results. To learn more about the fundamentals of overfitting and underfitting in practice, you can join us at HKR training, a one-stop solution for all your career growth requirements. Do share your comments in the section below for any queries.
Overfitting is a modelling error that occurs when a function is fitted too closely to a limited set of data points. Underfitting refers to a model that neither models the training data well nor generalises to new data.
Underfitting occurs when a model is too simple and is informed by too few features or regularised too much, making it inflexible in learning from the dataset.
Overfitting is a problem because a machine learning algorithm's performance on its training data differs from the evaluation that actually matters: how well the algorithm performs on unseen data.
Machine learning practitioners prevent overfitting in several ways:
- Cross-validation
- Training with more data, or data augmentation
- Simplifying the model and adding regularisation
- Early stopping
- Ensembling methods such as bagging and boosting