Statistics is inseparable from the study of machine learning, and at the same time the two are often said to cover similar content, with no clear boundary between them.
So is multiple regression analysis really a machine learning technique?
By clarifying the purposes of machine learning methods and statistical methods, let's learn to use each of them appropriately.
This article introduces the difference between machine learning and statistics, the necessity of statistics, and recommended books.
Table of Contents
- The difference between machine learning and statistics (statistical analysis)
- What is statistics
- Statistics fall into three categories
- Why Machine Learning Needs Statistics
- The relationship between machine learning and statistics
- Differences between machine learning and statistical methods
- How to properly use machine learning models and statistical models
- 3 Recommended Books to Learn Machine Learning with Statistics
- Complete self-study Introduction to Statistics
- Introduction to Statistics for Data Science
- Introduction to Machine Learning with Bayesian Inference
The difference between machine learning and statistics (statistical analysis)
There is a distinct difference between machine learning and statistics.
Taken literally, "machine learning" is machines learning automatically, while "statistics" is the statistical identification of rules and patterns in data.
"Statistically" here means judging whether something is probabilistically correct.
What is statistics
We are surrounded by data and can obtain it from many sources.
However, raw data is difficult to interpret on its own, and useful information cannot be extracted from it directly.
Statistics is the study of methods for processing and analyzing data: examining the properties of data, and estimating the properties of a population from samples drawn from it.
Statistics fall into three categories
Statistics fall into three categories.
- Descriptive statistics
- Inferential statistics
- Bayesian statistics
We will discuss each of the above three.
(1) Descriptive statistics
First, let’s talk about descriptive statistics.
Descriptive statistics is a set of techniques for understanding the trends and characteristics of data. Specifically, it calculates quantities such as the mean and variance of the data to grasp its trends.
But how do you actually grasp a trend?
Simply calculating the mean and variance of your data is not enough.
So, to make the trend visible, the calculated values are plotted as graphs.
Examples of descriptive statistics include average test scores, average age, and average height.
As mentioned above, some criterion is needed to grasp the trend in such data. Simply listing height or age values is hard to read and impossible to compare; the criterion that makes comparison possible is the average.
Once you have an average, data becomes easier to compare. Then, by graphing the comparison, you can make it visually understandable.
Descriptive statistics is about making complex data easy to understand.
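The mean-and-variance step above can be sketched in a few lines of Python with the standard library. The test scores here are hypothetical, purely illustrative data:

```python
import statistics

# Hypothetical test scores for a class (illustrative data)
scores = [72, 85, 64, 90, 78, 69, 88, 75]

mean = statistics.mean(scores)           # central tendency
variance = statistics.pvariance(scores)  # spread around the mean
stdev = statistics.pstdev(scores)        # standard deviation

print(f"mean={mean:.1f}, variance={variance:.1f}, stdev={stdev:.1f}")
```

In practice you would then plot these summaries (for example as a histogram) to make the trend visually clear, as described above.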
(2) Inferential statistics
Next, I will explain inferential statistics.
Inferential statistics is the study of inferring the characteristics of the population as a whole from a sample drawn from the population.
If the population is large, surveying the entire population can be expensive and time consuming.
Therefore, based on the sample extracted from the population, the characteristics of the population are inferred.
Typical applications of inferential statistics include TV ratings and defect-rate prediction.
Collecting viewing data from every household in Japan to calculate TV ratings would be difficult and inefficient.
Instead, inferential statistics extracts a sample and estimates the whole (the population) from it.
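The TV-rating idea can be sketched as estimating a population proportion from a small sample. The ten households below are hypothetical, and the 1.96 factor gives a standard normal-approximation 95% confidence interval:

```python
import math
import statistics

# Hypothetical sample of 10 households out of a large population
sample = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]  # 1 = watched the program

p_hat = statistics.mean(sample)  # sample proportion estimates the population rating
se = math.sqrt(p_hat * (1 - p_hat) / len(sample))  # standard error
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)  # approximate 95% confidence interval

print(f"estimated rating: {p_hat:.0%}, 95% CI: ({ci[0]:.2f}, {ci[1]:.2f})")
```

With only ten households the interval is very wide; real ratings surveys use far larger samples precisely to narrow it.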
(3) Bayesian statistics
Finally, we will discuss Bayesian statistics.
Bayesian statistics is a branch of inference that, unlike descriptive statistics and classical inferential statistics, does not necessarily require a sample. Instead, Bayesian statistics works with subjective probabilities.
There are two types of probability: objective probability and subjective probability.
Objective probability is the probability that the answer will not change from person to person.
The probability of rolling a die and getting a 1 is 1/6. This is an objective probability because it does not change no matter who states it.
On the other hand, subjective probability is the probability that the answer will vary from person to person.
For example, what is the probability that person A owns a dog?
Such a probability cannot be judged objectively; one person might say 1/3 and another 1/10.
This is subjective probability, and it is what Bayesian statistics deals with.
A probability set before observing data, even when data is scarce, is called the prior probability.
The probability obtained by revising it each time new information arrives is called the posterior probability.
This process of updating probabilities based on new information is called Bayesian updating.
In this way, the estimate is gradually worked toward the probability of what actually happens.
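The prior-to-posterior step can be sketched with the standard Beta-Binomial conjugate update. The "3 dog owners out of 10" numbers are hypothetical, echoing the dog-ownership example above:

```python
# Beta-Binomial Bayesian updating (conjugate prior), a minimal sketch.
# Prior belief: Beta(a, b); each batch of observations updates it.

def bayes_update(a, b, successes, failures):
    """Update a Beta(a, b) prior with observed successes and failures."""
    return a + successes, b + failures

# Start with a vague prior: Beta(1, 1), i.e. "no strong opinion" (mean 0.5)
a, b = 1, 1
prior_mean = a / (a + b)

# New information arrives: 3 dog owners among 10 people asked
a, b = bayes_update(a, b, successes=3, failures=7)
posterior_mean = a / (a + b)  # belief updated by the evidence

print(f"prior mean={prior_mean:.2f}, posterior mean={posterior_mean:.2f}")
```

Each further batch of observations repeats the same update, which is exactly the Bayesian updating described above: the posterior after one step becomes the prior for the next.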
Why Machine Learning Needs Statistics
Machine learning does not always require statistical knowledge to get started.
Libraries, trained models, and datasets are freely available online, so you can try simple machine learning while looking things up.
However, to truly understand machine learning, you need to know how to process and analyze data.
Machine learning is built on mathematics and statistics, so learning them is necessary.
The relationship between machine learning and statistics
Machine learning and statistics are closely related.
Terms used in machine learning such as “coefficient of determination” and “standard deviation” are also statistical terms.
Statistics is an academic field with a long history, and it has been used for national censuses since early times.
Statistics and machine learning resemble each other closely because both deal with data and aim to estimate or predict values.
It is therefore fair to say that progress in statistics will make learning machine learning smoother.
Differences between machine learning and statistical methods
There is one crucial difference between machine learning methods and statistical methods: the purpose for which the data is handled.
In machine learning, predictions are made from accumulated data. A model that has learned the features of that data is called a trained model, and the trained model makes predictions on new, unseen data. Those predictions are required to have high accuracy.
In statistics, on the other hand, the goal is to discover regularities and irregularities in variable data and to provide evidence about the data.
Statistics is also used to analyze data and predict unseen values, but there the analyst is required to explain whether the predictions are reasonable.
In other words, machine learning aims to improve prediction accuracy, while statistics aims to verify hypotheses and explain them with evidence.
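The contrast can be seen in a single fitted line. The data below is hypothetical, and the regression is computed by hand in pure Python: the machine-learning view cares about the prediction for a new input, while the statistical view cares about the coefficients and how well they explain the data (R²):

```python
# Simple linear regression fit by least squares (pure Python sketch,
# illustrative data). The same fitted line serves both purposes.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

# Machine-learning view: predict an unseen value
y_pred = slope * 6 + intercept

# Statistical view: explain the fit (coefficient of determination R^2)
ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
ss_tot = sum((y - my) ** 2 for y in ys)
r_squared = 1 - ss_res / ss_tot

print(f"slope={slope:.2f}, prediction at x=6: {y_pred:.2f}, R^2={r_squared:.3f}")
```

This is also why multiple regression, mentioned at the start, sits on both sides of the boundary: the same model is "machine learning" when judged by prediction accuracy and "statistics" when judged by how well its coefficients are explained.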
How to properly use machine learning models and statistical models
Machine learning and statistics have different purposes.
So, how should we differentiate between machine learning and statistical methods?
Machine learning is used in the quest for predictive accuracy.
The higher the accuracy, the better, which is why machine learning is used for user recommendations.
A user recommendation suggests items a customer is likely to be interested in.
For example, it recommends similar products based on the customer's purchase and browsing history. The closer the recommendations are to the customer's preferences, the more accurate the system is considered to be.
In this way, machine learning is used in situations where high accuracy is sufficient, even if the internal logic is not fully understood.
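A minimal sketch of the recommendation idea, assuming made-up purchase-count vectors: represent each user's history as a vector and recommend based on the most similar user. Real recommender systems are far more elaborate, but the core similarity computation looks like this:

```python
import math

# Hypothetical purchase counts per product for each user (illustrative data)
def cosine_similarity(u, v):
    """Cosine of the angle between two purchase-history vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

target = [3, 0, 1, 0]  # the customer we want recommendations for
others = {
    "user_a": [2, 0, 1, 1],
    "user_b": [0, 3, 0, 2],
}

# The most similar user's purchases suggest what to recommend next
best = max(others, key=lambda name: cosine_similarity(target, others[name]))
print(best)  # user_a: their history points in almost the same direction
```

Note that the output is just "the most similar user"; nothing in the computation explains *why* the customer likes those products, which is exactly the accuracy-over-interpretability trade-off described above.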
Statistical models, on the other hand, are used to analyze data and derive new insights, and they have the advantage of being easier to interpret than machine learning models.
Because the process, that is, the algorithm, is simple and easy to follow, statistical models can capture trends even from uncertain or small data sets, and they suit cases where you want to understand the model as a whole.
3 Recommended Books to Learn Machine Learning with Statistics
Here are three books I recommend for those who want to learn machine learning with statistics.
- Complete self-study Introduction to Statistics
- Introduction to Statistics for Data Science
- Introduction to Machine Learning with Bayesian Inference
I will introduce each of them.
Complete self-study Introduction to Statistics
Complete Self-Study Introduction to Statistics is a book that explains statistics in a very accessible way. It works through fill-in-the-blank and practice problems, which are important for studying statistics, with explanations from the very beginning.
Introduction to Statistics for Data Science
Introduction to Statistics for Data Science covers the basic concepts of statistics and machine learning along with R and Python code. By working through the full data science workflow, such as data classification, analysis, modeling, and prediction, you can efficiently learn both the basics of statistics and practical data science techniques.
Introduction to Machine Learning with Bayesian Inference
Introduction to Machine Learning with Bayesian Inference is a book that lets you learn Bayesian inference by the shortest path. It is written carefully from scratch for beginners who are starting to study Bayesian inference.
Both machine learning and statistics are used by many companies and institutions, and both can fairly be called important fields.