Overfitting and Underfitting in Machine Learning

Overfitting and underfitting are two essential concepts in machine learning and two of the most common reasons a model performs poorly. The main objective of every machine learning model is to generalise: after training on a data set, it should produce accurate and reliable outputs for inputs it has not seen before. If a model performs well on the training data but noticeably worse on the test data, it is overfitting. If it performs poorly on both the training set and the test set, it is underfitting.

Both concepts are tied to the bias-variance trade-off. This article gives a brief overview of the two problems and how to avoid them. Before exploring each in detail, let us define bias and variance.

  • Bias: The error introduced by the simplifying assumptions a model makes so that the target function is easier to learn.
  • Variance: The amount by which the learned model changes when it is trained on a different portion of the data set.

What is Overfitting?

Overfitting describes a model that performs well on the training data but poorly on test data. It happens when the model learns the detail and noise of the training data so thoroughly that its performance on new data suffers. In other words, the model picks up random fluctuations in the training data and treats them as real concepts. Because these concepts do not apply to new data, the model's ability to generalise is harmed.

Overfitting is more likely with nonparametric and nonlinear models, which have more flexibility when learning a target function. For this reason, many nonparametric machine learning algorithms include parameters or techniques that limit how much detail the model can learn. An overfitted model has high variance and low bias.

For instance, decision trees are very flexible and therefore susceptible to overfitting. The problem is addressed by pruning the tree after it has been grown, which removes some of the detail it has picked up.

Examples of Overfitting

[Figure: Overfitting]

In the above diagram, the overfitting is due to the following reasons:

  • The model is too complex and has high variance.
  • The training data contains noise (i.e. garbage values) and has not been cleaned.
  • The training data set is too small.

The following techniques help to tackle overfitting:

  • Use ensembling techniques.
  • Train the model on a sufficient amount of data.
  • Use regularisation techniques, e.g. Lasso and Ridge.
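
As a rough illustration of how Ridge and Lasso penalties restrain a model, the sketch below fits a single slope with no intercept. The function names, the toy data, and the penalty values are assumptions made for this example and do not come from any particular library.

```python
def ridge_slope(xs, ys, penalty=0.0):
    """Slope w minimising sum((y - w*x)^2) + penalty * w^2 (Ridge, L2)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # The penalty inflates the denominator, shrinking the slope towards zero.
    return sxy / (sxx + penalty)


def lasso_slope(xs, ys, penalty=0.0):
    """Slope w minimising sum((y - w*x)^2) + penalty * |w| (Lasso, L1)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # Soft-thresholding: a large enough penalty sets the slope exactly to zero.
    shrunk = max(abs(sxy) - penalty / 2.0, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx


# Hypothetical toy data following y = 2x.
xs = [1, 2, 3]
ys = [2, 4, 6]

print(ridge_slope(xs, ys, penalty=0.0))   # unpenalised least squares slope
print(ridge_slope(xs, ys, penalty=14.0))  # shrunk slope
print(lasso_slope(xs, ys, penalty=56.0))  # slope removed entirely
```

Ridge shrinks the slope smoothly, while Lasso can remove it completely, which is why Lasso is often used to discard unhelpful features.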
        

How to Detect Overfitting in Machine Learning?

Overfitting cannot be detected until the model is tested on data it has not seen, because the training data alone only shows how well the model fits, not how well it generalises. For this reason the available data is separated into subsets. In the simplest case it is divided into two parts: a training set and a test set. The training set holds the majority of the data, commonly about 80%, and is used to train the model. The remaining 20% forms the test set, which is used to measure accuracy on data the model has not interacted with before. Splitting the data this way makes it possible to monitor the model's performance during training and to stop when overfitting begins. Comparing the accuracy observed on the two sets then reveals overfitting: if the model performs much better on the training set than on the test set, it is likely overfitted.
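
The 80/20 comparison can be sketched in a few lines of plain Python. The example below is only an illustration: the noisy data set is made up, and the 1-nearest-neighbour classifier is chosen here because it memorises its training points, not because the article prescribes it.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def predict_1nn(train, x):
    """Return the label of the training point closest to x."""
    nearest = min(train, key=lambda row: abs(row[0] - x))
    return nearest[1]


def accuracy(train, rows):
    """Fraction of rows whose label the 1-NN model predicts correctly."""
    hits = sum(1 for x, label in rows if predict_1nn(train, x) == label)
    return hits / len(rows)


# Hypothetical noisy data: the label is 1 when x > 0.5,
# but roughly 20% of the labels are flipped at random.
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.random()
    label = int(x > 0.5)
    if rng.random() < 0.2:
        label = 1 - label
    data.append((x, label))

train, test = train_test_split(data, test_fraction=0.2)
train_acc = accuracy(train, train)  # the model has memorised these points
test_acc = accuracy(train, test)    # unseen points expose the memorised noise
gap = train_acc - test_acc
print(f"train={train_acc:.2f} test={test_acc:.2f} gap={gap:.2f}")
```

A large positive gap between training and test accuracy is the signal to look for. How large a gap counts as overfitting depends on the problem.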

Overfitting can also be identified by tracking validation metrics such as loss and accuracy during training. Validation accuracy generally improves up to a point and then stagnates or starts to decline once the model begins to overfit (validation loss does the reverse: it falls, then flattens or rises). While the trend is still improving, the model appears to be a good fit. Past that point, further training only fits the training data more closely.
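
One simple way to act on these metrics is to stop once the validation loss has not improved for a few epochs. The sketch below assumes a list of per-epoch validation losses is already available. The function name, the `patience` parameter, and the loss values are illustrative assumptions.

```python
def best_epoch(val_losses, patience=2):
    """Return the index of the epoch with the lowest validation loss.

    Scans the losses in order and gives up once the loss has failed to
    improve for `patience` consecutive epochs.
    """
    best_idx = 0
    waited = 0
    for i in range(1, len(val_losses)):
        if val_losses[i] < val_losses[best_idx]:
            best_idx = i
            waited = 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_idx


# Hypothetical per-epoch losses: training loss keeps falling,
# validation loss bottoms out and then rises again.
train_losses = [0.90, 0.60, 0.45, 0.35, 0.28, 0.22, 0.18]
val_losses = [0.95, 0.70, 0.58, 0.52, 0.55, 0.60, 0.66]

stop_at = best_epoch(val_losses, patience=2)
print(stop_at)  # index 3: the epoch after which validation loss starts rising
```

A larger `patience` tolerates more noise in the validation curve at the cost of a few extra training epochs.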

What is Underfitting?

Underfitting refers to a scenario in which a model cannot accurately capture the relationship between inputs and outputs. It produces a high error on both the training set and unseen data. It happens when the model is too simple, which may mean that it needs more training time, less regularisation, or more input features. An underfitted model fails to establish the dominant trend within the data, which results in training errors and poor performance. Because it also fails to generalise to new data, it cannot be used for prediction or classification tasks. High bias and low variance are good indicators of underfitting. Since this behaviour is already visible on the training data set, underfitted models are usually easier to identify than overfitted ones.

Thus, underfitting corresponds to high bias and low variance in the model.

Example of Underfitting

[Figure: Underfitting]

In the above image, the underfitting is due to the following reasons:

  • The model is too simple and has a high bias.
  • The training data contains noise (i.e. garbage values) and has not been cleaned.
  • The training data set is too small.

How to Avoid Underfitting?

Underfitting can be avoided using the following techniques:

  • Increase the training duration: Stopping the training too early can result in underfitting, so extending the duration of training can help. The key is to find the right balance, because training for too long leads to overfitting.
  • Select more features: Features are what a machine learning model uses to determine a specific outcome. If a model does not have enough predictive features, more features, or features of greater importance, should be introduced. This makes the model more complex and delivers better training results.
  • Decrease regularisation: Regularisation reduces variance by penalising large parameter values. If it is applied too strongly, the features become too uniform and the model can no longer identify the dominant trend, which leads to underfitting. Reducing the amount of regularisation introduces more variation and complexity into the model, which allows it to be trained successfully.
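
To make the effect of added complexity concrete, the sketch below compares a model that ignores its input (it always predicts the mean) with a straight-line model on data that has a clear linear trend. The data and helper names are assumptions made for illustration.

```python
def mse(predict, rows):
    """Mean squared error of a prediction function over (x, y) rows."""
    return sum((predict(x) - y) ** 2 for x, y in rows) / len(rows)


def fit_constant(rows):
    """Too-simple model: always predicts the mean of the training targets."""
    mean_y = sum(y for _, y in rows) / len(rows)
    return lambda x: mean_y


def fit_line(rows):
    """Slightly richer model: ordinary least squares line y = a*x + b."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in rows)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


# Hypothetical data with a clear linear trend: y = 2x + 1.
train = [(x, 2 * x + 1) for x in range(0, 10)]
test = [(x, 2 * x + 1) for x in range(10, 15)]

constant_model = fit_constant(train)
line_model = fit_line(train)

# The constant model has a high error on BOTH sets, the mark of underfitting.
print(mse(constant_model, train), mse(constant_model, test))
# The line captures the trend, so both errors are close to zero.
print(mse(line_model, train), mse(line_model, test))
```

The constant model cannot improve with more training because it lacks the capacity to represent the trend. Giving it access to the input feature is what fixes the problem.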

How to Avoid Overfitting and Underfitting In Machine Learning?

Detecting overfitting and underfitting is useful, but detection alone does not solve the problem. What you can do is try several remedies. Here are a few for your reference.

For underfitting, the main remedy beyond the techniques in the previous section is to move on and try alternative machine learning algorithms.

There are several ways to prevent overfitting:

  • Cross-validation: A powerful preventive measure. The initial training data is used to generate multiple small train-test splits, and these splits are used to tune the model.
  • Train with more data: More training data can help the algorithm to detect the signal rather than the noise.
  • Data augmentation: If more data is not available, augmentation can make the available data appear more diverse.
  • Simplify and regularise: Simplify the model, for example by removing features, and regularise it. For linear and SVM models, add a regularisation term.
  • Ensembling: Bagging and boosting are the two most popular ensembling methods.
  • Early stopping: Stop the training process before the learner passes the point at which it starts to overfit.
  • Reduce tree depth: In a decision tree model, the maximum depth can be reduced.
  • Dropout: In neural networks, a dropout layer can be introduced.
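
The train-test splits used in cross-validation can be generated as in the sketch below. It shows plain k-fold splitting over sample indices. The function name is an assumption, and in practice the indices would usually be shuffled first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds.

    Every sample lands in exactly one test fold. When n_samples is not
    divisible by k, the first folds receive one extra sample each.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

The model is trained and evaluated once per fold, and the scores are averaged. This gives a steadier estimate of generalisation than a single split.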


What is Goodness of Fit?

"Goodness of fit" is a term from statistics. It describes the goal a machine learning model is trying to achieve: how closely the predicted values (results) match the true values in the dataset.
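
One common way to put a number on this closeness for regression models is the coefficient of determination, R². The article does not name a specific measure, so the sketch below is only one possible choice, and the function name is an assumption.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS).

    Assumes y_true is not constant, otherwise the total sum of squares
    is zero and the ratio is undefined.
    """
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


print(r_squared([1, 2, 3], [1, 2, 3]))  # perfect predictions give 1.0
print(r_squared([1, 2, 3], [2, 2, 2]))  # always predicting the mean gives 0.0
```

Values near 1 indicate predictions that track the true values closely, while values near 0 mean the model does no better than predicting the average.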

A model that is a good fit lies between the underfitted and the overfitted model. Ideally it would make predictions with zero error, but in practice this is not achievable.

As we train the model for longer, the error on the training data goes down, and at first the error on the test data goes down as well. If training continues for too long, however, performance on the test data may get worse due to overfitting, since the model also learns the noise present in the dataset.

Therefore, to obtain a good fit, training should stop at the point just before the error on the test data starts to increase. At this point the model performs well on both the training data and the test data.

A few common Goodness of fit tests are:

  • Anderson-Darling
  • Chi-square
  • Kolmogorov-Smirnov
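
As a small illustration of one of these tests, the sketch below computes the Pearson chi-square statistic from observed and expected counts. The counts are made up, and turning the statistic into a p-value (which needs the chi-square distribution) is left out.

```python
def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over categories."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical counts for three categories that are expected to be equal.
observed = [18, 22, 20]
expected = [20, 20, 20]

statistic = chi_square_statistic(observed, expected)
print(round(statistic, 3))
```

The smaller the statistic, the closer the observed counts are to the expected ones, which indicates a better fit.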
     
Conclusion

We hope you now understand the concepts of overfitting and underfitting in machine learning. It is vital to know the techniques for avoiding both problems. The goal is to strike the right balance between a model that is too simple and one that is too complex, so that it produces accurate results on new data.

Gayathri
Research Analyst
As a senior Technical Content Writer for HKR Trainings, Gayathri has a good understanding of current technical innovations, including areas such as Business Intelligence and Analytics. She conveys advanced technical ideas as precisely and vividly as possible to the target audience, ensuring that the content is accessible to readers. She writes quality content in the fields of Data Warehousing & ETL, Big Data Analytics, and ERP Tools.

Overfitting and Underfitting in Machine Learning FAQs

  • Overfitting is a modelling error which occurs when a function is too closely fitted to a limited set of data points.
  • Underfitting refers to a model that can neither model the training data nor generalise to new data.
  • Overfitting occurs when a model is too complex, for example when it has too many parameters relative to the number of observations.
  • Underfitting occurs when a machine learning algorithm or a statistical model cannot recognise the underlying trend of the data.

Underfitting occurs when a model is too simple and is informed by too few features or regularised too much, making it inflexible in learning from the dataset.

Overfitting is a problem because performance on the training data is a misleading measure of what we actually care about, namely how well the algorithm performs on unseen data.

Overfitting can be prevented in several ways:

  • Cross-validation
  • Removing features
  • Early stopping
  • Ensembling