Thursday, February 29, 2024

[Machine Learning 2022] What is overfitting in machine learning?


Have you ever run into overfitting in machine learning?

Overfitting can seriously degrade the performance of the model you want to use for prediction. So how can we prevent it?

In this article, we will explain what overfitting is, what causes it, and what countermeasures exist.

Table of contents

  • What is overfitting in machine learning?
  • A specific example of overfitting that occurs in machine learning
  • 3 main causes of overfitting in machine learning
    • (1) Lack of training data
    • (2) Learning from biased data
    • (3) The purpose of the machine learning model you want to create is unclear
  • What to do to notice overfitting
  • 3 methods to prevent overfitting in machine learning
    • (1) Hold-out method
    • (2) Cross-validation
    • (3) Regularization
  • Summary

What is overfitting (overfitting) in machine learning?

Overfitting, also known as overlearning, is one of the most common pitfalls in data analysis.

Overfitting occurs when the computer learns the prepared training data too closely during machine learning. It refers to a state where the model fits that data almost perfectly but has lost its generality (versatility), so it performs poorly on unseen data.

This definition may be a little hard to picture, so let’s illustrate it with a familiar example.

Imagine a school test. Before the exam, students usually solve past papers as preparation. Suppose Mr. A prepares only by solving similar past papers, memorizing the pairings of questions and answers rather than the underlying material. When he sits the actual test, it contains only new problems that never appeared in the past papers, and he cannot solve any of them. This is the overfitting state: he has memorized the training data instead of learning what generalizes.


A specific example of overfitting that occurs in machine learning

So, what kind of model will be created when overfitting occurs?

Let’s compare (1) underfitting, (2) an appropriate fit, and (3) overfitting.

(From left to right, the graph shows (1) underfitting, (2) an appropriate fit, and (3) overfitting.)

At first glance, the rightmost graph seems to fit the given data best, but this is an overfitted model.

If this happens, the machine learning model is essentially unusable, so it must be addressed immediately.

  • Underfitting: the model is too simple to capture the features of the training data
  • Appropriate: the model captures the characteristics of the training data sufficiently
  • Overfitting: the model fits the training data too closely and does not generalize to unknown data
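The three situations above can be reproduced with a small experiment. The sketch below is a minimal illustration using NumPy and made-up data (not from the article): it fits polynomials of degree 1 (underfitting), 3 (appropriate), and 10 (overfitting) to noisy samples of a sine curve, then compares the error on the training points with the error on unseen points.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 15 noisy samples of a smooth underlying function
x_train = np.linspace(0, 1, 15)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 100)          # unseen points
y_test = np.sin(2 * np.pi * x_test)      # true values at those points

def poly_errors(degree):
    """Fit a polynomial of the given degree; return (train MSE, test MSE)."""
    coefs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    return train_mse, test_mse

for degree in (1, 3, 10):  # underfit, appropriate, overfit
    train_mse, test_mse = poly_errors(degree)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The training error always shrinks as the degree grows, but the error on unseen points typically bottoms out at a moderate degree and then rises again: the telltale signature of overfitting.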

3 main causes of overfitting in machine learning

Overfitting is one of the most common pitfalls in data analysis and a constant headache for data scientists. Why does it still occur even when you know about it and try to avoid it? Here are three main reasons.

  1. Lack of training data
  2. Learning from biased data
  3. The purpose of the machine learning model you want to create is unclear

(1) Lack of training data

One of the major causes of overfitting is insufficient training data.

Overfitting is often misunderstood as learning from too much data ("over"-learning), but in reality it is usually a lack of data that prevents normal learning.

When humans take on a new task, they can draw on past experience and common sense in addition to the data in front of them, so they can learn efficiently. For a machine learning model, however, the given data is everything.

In other words, if the amount of data is small, the model can only analyze that small sample and ends up fitting its quirks rather than general patterns.

Therefore, to analyze data correctly, you need to secure a sufficient amount of data and train the model on data suited to your purpose.

(2) Learning from biased data

Another important pitfall to keep in mind is training on biased data.

As mentioned in (1), a machine learning model can only analyze the data it is given. To obtain unbiased analysis, you therefore need to train on plentiful data with as little bias as possible.

If the model learns only from biased data, for example because the dataset is too small or only convenient data was collected, it lacks objectivity: its analysis and predictions will be skewed, which undermines the model itself.

To analyze data correctly, prepare accurate and abundant data.

(3) The purpose of the machine learning model you want to create is unclear

Before building a model, it is essential to be clear about what kind of model you want to build.

Machine learning models, like most AI systems, are practical only for a single, specialized task: one model cannot serve arbitrary purposes from one dataset, so humans must direct what it is built for.

If you build a model without knowing whether you want to predict store sales, forecast population growth, or something else entirely, you will end up training on unnecessary or biased data.

If the purpose is clear, for example "build a model to predict store sales," you can collect and train on the full range of data the prediction requires.

In short, clarifying what kind of model you want to build, and training it on data suited to that goal, helps prevent overfitting.

What to do to notice overfitting

If overfitting does occur, you need to notice it quickly and improve the model.
If you keep training and predicting without noticing, you will only produce a meaningless model and meaningless data.

It is important to follow the cycle of "build a model -> validate -> improve -> upgrade the model" so that you never fall into "build a meaningless model -> compute meaningless predictions."

And, as discussed in the causes above, prepare sufficient, unbiased data in advance. Also, splitting the data into training data, validation data, and test data beforehand makes it easier to evaluate the model's accuracy.
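As a concrete sketch of that last point, the snippet below (pure Python, with a hypothetical dataset of 100 samples and a made-up 60/20/20 ratio) separates the data into the three roles in advance: training data to fit the model, validation data to tune it and watch for overfitting, and test data for the final evaluation.

```python
import random

random.seed(0)
dataset = list(range(100))   # hypothetical dataset of 100 samples
random.shuffle(dataset)      # shuffle so each split is representative

# Hypothetical 60/20/20 split into the three roles
n = len(dataset)
train_data = dataset[: int(n * 0.6)]              # fit the model
val_data = dataset[int(n * 0.6) : int(n * 0.8)]   # tune / detect overfitting
test_data = dataset[int(n * 0.8) :]               # final, untouched evaluation
print(len(train_data), len(val_data), len(test_data))  # → 60 20 20
```

Keeping the test set untouched until the very end is what makes the final accuracy figure trustworthy.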

Overfitting is closely related to the concepts of "bias" and "variance," which are important to know.

  • Bias: the difference between the predicted result and the actual value
  • Variance: the variation (spread) of the prediction results

Basically, bias and variance are in a trade-off relationship, so it is very important to find a balance.

A low-bias model is desirable for prediction, but if the bias is pushed too low, the model also fits the noise (irrelevant data that should be ignored), so its predictions vary widely and the variance increases. Likewise, fitting an overly complex model to diverse data makes the predicted values vary and increases the variance.

As a rule of thumb, low bias combined with high variance is a strong sign of overfitting, so watch out for that combination.
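The trade-off can be seen numerically. The sketch below (an illustration with NumPy and made-up data, not from the article) repeatedly fits a too-simple model (degree 1) and a very flexible model (degree 9) to fresh noisy samples of the same curve, then estimates the bias and variance of each model's prediction at one point.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
true_f = np.sin(2 * np.pi * x)                  # hypothetical true function
x0, true_y0 = 0.25, np.sin(2 * np.pi * 0.25)    # point where we measure

def bias_variance(degree, n_trials=200):
    """Estimate bias and variance of the prediction at x0 over many noisy datasets."""
    preds = []
    for _ in range(n_trials):
        y = true_f + rng.normal(0, 0.3, x.size)  # fresh noisy training set
        coefs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coefs, x0))
    preds = np.array(preds)
    return preds.mean() - true_y0, preds.var()   # (bias, variance)

for degree in (1, 9):
    bias, var = bias_variance(degree)
    print(f"degree {degree}: bias {bias:+.3f}, variance {var:.4f}")
```

The simple model shows large bias but small variance; the flexible model shows almost no bias but much larger variance, matching the rule of thumb above.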

3 Methods to prevent overfitting in machine learning

Even if you know the causes of overfitting and are very careful, there is a good chance that overfitting will occur.

So how can we avoid overfitting?

Here are three methods for preventing overfitting.

  1. Hold-out method
  2. Cross-validation
  3. Regularization

(1) Hold-out method

The hold-out method is one of the data-splitting techniques in machine learning, and the simplest way to evaluate a model.

In the hold-out method, the full dataset is divided into training data (x_train, y_train) used to build the model and test data (x_test, y_test) used to evaluate it; the trained model's accuracy is then measured on the test data.

If you have 100 samples, split them randomly at a ratio such as 6:4 or 7:3, e.g. 60 training samples and 40 test samples. (Most people allocate the smaller share to the test data.)

By splitting the data before building the model in this way, you can improve performance on unknown data (generalization performance) and create a model that is less prone to overfitting. The disadvantage is that when the number of samples is small, the evaluation score varies a lot depending on how the split happens to fall.
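The hold-out split described above takes only a few lines of plain Python. This is a minimal sketch using a hypothetical dataset of 100 samples and the 7:3 ratio mentioned in the text; in practice, libraries such as scikit-learn provide `train_test_split` for this.

```python
import random

random.seed(42)
# Hypothetical dataset: 100 (feature, label) pairs
dataset = [(x, 2 * x + 1) for x in range(100)]

random.shuffle(dataset)              # randomize before splitting
split = int(len(dataset) * 0.7)      # 7:3 split, as in the text
train_data = dataset[:split]         # 70 samples: build the model
test_data = dataset[split:]          # 30 samples: evaluate the model
print(len(train_data), len(test_data))  # → 70 30
```

Because the shuffle is random, each run of a real experiment would produce a different split; fixing the seed makes the split reproducible.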

(2) Cross-validation

Cross-validation, like the hold-out method, divides the data into training data and test (validation) data, but it builds the model using a somewhat different splitting scheme.

Here, we introduce K-fold cross-validation, the most commonly used variant.

K-fold cross-validation randomly splits the original data into k subsets. A model is built using one of the k subsets as test data and the rest as training data.

Then every subset takes a turn as the test data. Because the test sets do not overlap, this is a more reliable evaluation than simply repeating the hold-out method k times. The disadvantage is that with a huge amount of data, the k rounds of training put a heavy computational load on the CPU and take a long time.
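The splitting scheme can be sketched in plain Python as follows; this is a minimal illustration of k non-overlapping folds (real projects would typically shuffle the data first and use scikit-learn's `KFold`).

```python
def k_fold_splits(n_samples, k):
    """Split indices 0..n_samples-1 into k folds; each fold serves once as test data."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    splits = []
    for i in range(k):
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n_samples  # last fold takes the remainder
        test_idx = indices[start:end]                        # this fold is the test set
        train_idx = indices[:start] + indices[end:]          # everything else trains
        splits.append((train_idx, test_idx))
    return splits

# With 10 samples and k=5, each round trains on 8 samples and tests on 2
for train_idx, test_idx in k_fold_splits(10, 5):
    print(len(train_idx), len(test_idx))  # → 8 2 (five times)
```

A model is then trained and evaluated once per split, and the k scores are averaged to give the final evaluation.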

(3) Regularization

Regularization is a method that simplifies a model that has become overly complex and overfitted.

It simplifies the model by penalizing complexity: coefficients that make the model complicated or unsmooth are penalized, their weights are shrunk, and isolated data points are effectively ignored.

(In machine learning, techniques of this kind for mitigating overfitting are generally called regularization.)

There are two main regularization methods, each used for different purposes.

  • L1 regularization: identifies the necessary explanatory variables (drives the weights of unnecessary explanatory variables to 0)
  • L2 regularization: smooths the predictive model (suppresses excess model complexity)

They are typically used as follows:

  • When both the number of samples and the number of explanatory variables are large: reduce the explanatory variables with L1 regularization
  • When the number of samples and explanatory variables is not large: optimize the partial regression coefficients with L2 regularization
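L2 regularization can be illustrated with ridge regression in closed form. The sketch below (NumPy, with made-up data and a hypothetical penalty strength `alpha`) adds the penalty term alpha·I to the normal equations; the stronger the penalty, the smaller the coefficients are forced to be.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 5))                     # 30 samples, 5 explanatory variables
true_w = np.array([1.5, 0.0, -2.0, 0.0, 0.5])    # hypothetical true coefficients
y = X @ true_w + rng.normal(0, 0.1, 30)

def ridge(X, y, alpha):
    """L2-regularized least squares: w = (X^T X + alpha * I)^-1 X^T y."""
    return np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

w_plain = ridge(X, y, alpha=0.0)    # ordinary least squares (no penalty)
w_l2 = ridge(X, y, alpha=10.0)      # L2 penalty shrinks the weights
print(np.round(w_plain, 3))
print(np.round(w_l2, 3))
```

L1 regularization (lasso) has no such closed form, but it has the extra property of driving unnecessary coefficients exactly to 0, which is why it is used for variable selection.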

Regularization is explained in detail on Qiita, so please check it out if you want to know more. (Qiita article: "Why is regularization necessary in machine learning?")


Summary

What did you think?

In this article, I introduced the term “overfitting” in machine learning.

When you analyze data with machine learning, overfitting is always a risk, so proper countermeasures are necessary.

Besides the hold-out method, cross-validation, and regularization, there are other reliable measures against overfitting not covered here. When handling data in machine learning, choose a method suited to your purpose and build a model with genuinely high accuracy!


