As attention to AI has grown, so has the importance of big data, and how to collect huge amounts of data has become a key question in the field. However, creating large amounts of high-quality data is costly: building training data, for example, requires annotating (tagging) vast numbers of examples.
Against this backdrop, the idea of data-centric AI development, which emphasizes how to collect high-quality data even in relatively small amounts, is gaining ground over approaches that overemphasize algorithms. Going forward, the difference between success and failure will be how efficiently and cheaply a small amount of high-quality data can be created.
This time, we interviewed Kenji Suzuki of FastLabel Inc., which provides an annotation platform service that accelerates AI development, about the growing importance of data quality.
ーーPlease introduce yourself.
Mr. Suzuki: My name is Kenji Suzuki.
I researched machine learning algorithms in graduate school at Waseda University. After graduating, I joined a systems company and gained experience as a software engineer. I then participated in a project to incorporate AI into an accounting system, where I was in charge of everything from planning to development.
When creating the training data, I spent about a month tagging accounting slips, a hardship I had never experienced in the laboratory. Suspecting that the tagging phase could become a bottleneck in AI development, I founded FastLabel.
Currently, FastLabel develops a data platform equipped with annotation tools for AI development companies and provides training-data creation support services.
Table of contents
- The difficulty of quality control of data that I felt keenly during AI development
- The trend toward reducing dependence on AI vendors
- Breaking away from AI vendors as algorithms become commoditized
- From Algorithm Era to Data Era
- Looking at the data challenges that precede PoC
- Top challenges in building MLOps
- Key Points of AI Projects in the Data-Centric Era
- Focus on Data Accuracy, Not Just Algorithms
- Schemes required in the area of data quality
- From big data to good data
The difficulty of quality control of data that I felt keenly during AI development
ーーCould you tell us more about the tagging issues you had at your previous job?
Suzuki: At my previous job, I was the only person with knowledge of both accounting and AI, so I did most of the tagging alone.
At the time, when I spoke with people at other companies developing AI, they also said that creating data was hard work. There were also failures where tagging was entrusted to part-time or temporary workers, and the inconsistent quality made the resulting data unusable.
That made me think that ensuring data quality would only get harder, and that many projects would end in failure.
ーーData annotation work is sometimes outsourced to part-time or temporary workers.
Suzuki: When deep learning first appeared, my impression was that most annotation work was relatively simple and not that difficult. Compared to now, most data creation seems to have been straightforward, such as crowdsourcing or processing with Amazon SageMaker Ground Truth*1.
So it was assumed that anyone could do it, but in recent AI projects, the domain knowledge and data specifications of existing industries have grown more complex, making data creation harder.
The trend toward reducing dependence on AI vendors
Breaking away from AI vendors as algorithms become commoditized
ーーIn what industries do you currently receive many orders from clients?
Mr. Suzuki: We often receive work from customers in the manufacturing industry. Defective products on production lines are still removed by human eyes, so I think there is strong demand for AI in defect detection.
There are technologies and services that can detect defective products through image recognition even with small amounts of data, but there are cases where they cannot catch defects on site, and FastLabel supports those cases.
ーーAre there also many AI development requests from companies outside the manufacturing industry?
Mr. Suzuki: Recently, we have been receiving more requests from operating companies.
Until now, they have relied on AI vendors, but asking a vendor can cost 10 million yen for a one-month PoC. On top of that, many projects end at the PoC stage, so my impression is that more and more operating companies are bringing development in-house.
Also, deep learning has become easier to handle, and algorithms for practical business use are becoming commoditized, so I think dependence on AI vendors is gradually disappearing.
From Algorithm Era to Data Era
ーーWith the age of algorithms coming to an end and data quality being assured with GUI and no-code tools, the “democratization of AI” may be approaching.
Mr. Suzuki: As no-code tools mature, dependence on AI vendors fades, and development moves in-house, companies try to produce data themselves. But the hiring volume required is large, and it is hard to guarantee data quality with annotators who are not used to the work, such as part-time or temporary staff. So more and more companies turn to specialists in annotation, such as FastLabel.
Demand for annotation services is strong, and services other than FastLabel that specialize in annotation are also increasing.
There are also cases where BPO companies that have long handled data entry on behalf of clients develop annotation services as a business.
ーーIt has been five to six years since AI began to be used in Japan, but I feel that the term “data-centric” has started to spread recently.
Suzuki: It’s undeniable that algorithms were once the sole focus of attention. However, many engineers who have developed AI in the field for years are keenly aware of the importance of data.
On the data side, there has been an interesting move in competitions that measure AI performance. On Kaggle, participants have traditionally competed by improving their algorithms on the same dataset. From July 2021, however, a data-centric competition started in which only the dataset is changed and the algorithm is not touched at all.
I found it a very interesting move because the competition took a completely different direction from Kaggle.
Looking at the data challenges that precede PoC
ーーAI vendors have focused only on PoC issues, but I think it will become important to focus on data issues, the stage before PoC, as you do.
Suzuki: I think many AI engineers acquired their skills in laboratories. In laboratories, datasets are often fixed so that improvements in algorithms can be compared against other studies.
For this reason, when people who have spent years researching how to improve algorithms move to real-world projects, they sometimes feel a gap between that experience and what the field demands.
So I think they can’t help but focus on improving the algorithm.
ーーWhile there is demand for researchers who can create new algorithms, I think conventional engineers will struggle after the algorithm era ends unless they become PMs who oversee projects. What do you think?
Mr. Suzuki: I think you are right. There is still demand for engineers who are good at algorithms, but I think relative demand will gradually decrease.
A similar trend can be seen overseas, where there is an overall shortage of data engineers: the need for them is increasing while the need for algorithm engineers is decreasing.
Therefore, in the future, data engineers who can build data infrastructure and lifecycles (MLOps) will be valuable. Also, since planning work cannot progress without people with domain knowledge, I think more data engineers who can also handle planning will be hired.
Top challenges in building MLOps
Key Points of AI Projects in the Data-Centric Era
ーーHow do you feel about the importance of MLOps?
Mr. Suzuki: Andrew Ng of Stanford University says that as data-centric AI spreads, it will be necessary to iterate the development cycle around data.
I agree, and I believe the most important issue in building MLOps is how to obtain high-quality data in every phase: data creation, correction, training, and operation.
ーーWhen combining AI projects with a data-centric approach, are there any points we should insist on?
Mr. Suzuki: Instead of focusing only on the algorithm and repeating improvements there, we recommend using a basic algorithm, freezing the code as early as possible, and then focusing on improving the data. At the moment there are not many convenient tools to support this, but at FastLabel we are developing a platform that enables data-centric development.
Also, in the development phase, instead of suddenly creating tens of thousands of examples, I think it will be important to first create a small amount of data, say 1,000 to 3,000 examples, and then review it over and over, building high-quality data through iteration.
Training data creation should be done iteratively; it is the equivalent of coding in AI development. In normal software development, the cycle of coding, deploying, and testing is repeated, and when a bug is found in testing, the code is rewritten.
In AI development, by contrast, the cycle is training data creation, training, and evaluation, and when the model makes a mistake during evaluation, the response is to add or correct training data.
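The cycle Mr. Suzuki describes can be sketched as a simple loop. This is a hypothetical illustration only: `train`, `evaluate`, and `fix_labels` are stand-ins for whatever pipeline and tooling a team actually uses; the point is that the model code stays fixed while the dataset changes each round.

```python
# A minimal sketch of the data-centric iteration cycle:
# create training data -> train -> evaluate -> fix the data, not the model.

def train(dataset):
    """Stand-in for a fixed training pipeline (hypothetical)."""
    # In practice: run the unchanged model/code on the current dataset.
    return {"examples_seen": len(dataset)}

def evaluate(model, dataset):
    """Stand-in for evaluation: return indices of problem examples."""
    # Here we simply flag examples whose label is missing/inconsistent.
    return [i for i, ex in enumerate(dataset) if ex["label"] is None]

def fix_labels(dataset, error_indices):
    """Re-annotate the examples the model failed on."""
    for i in error_indices:
        dataset[i]["label"] = "corrected"
    return dataset

# Start small (e.g. a few thousand examples in practice), then iterate.
dataset = [{"text": f"slip {i}", "label": None if i % 3 == 0 else "ok"}
           for i in range(9)]

for round_no in range(3):
    model = train(dataset)
    errors = evaluate(model, dataset)
    if not errors:
        break
    dataset = fix_labels(dataset, errors)

print(len([ex for ex in dataset if ex["label"] is not None]))  # all 9 labeled
```

The contrast with ordinary software development is that the "bug fix" lands in the dataset, not the code.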
Focus on Data Accuracy, Not Just Algorithms
ーーSo to develop in a data-centric way, in addition to the algorithm accuracy considered in a conventional PoC, we should also focus on the accuracy of the data used for verification?
Mr. Suzuki: In PoCs, which often fail, the common flow was to preprocess the data first, train the algorithm, evaluate it, and then change or modify the algorithm. In other words, create all the data up front, fix the data, and see whether accuracy changes by tuning the code or the algorithm.
In data-centric AI development, by contrast, the point is to develop the data itself, running a PDCA cycle on it. If you fix the code and algorithm once and proceed with the PoC by revising and improving the data many times, I think accuracy will improve more readily over a given period.
Andrew Ng once ran a three-month project comparing a team that improved the algorithm against a team that improved the data; the algorithm team did not improve accuracy at all.
The data team, in comparison, improved accuracy by 16.9% over those three months. Even in a PoC, I think correcting the data over and over leads to greater results.
Schemes required in the area of data quality
ーーWith the focus shifting to improving data quality, what kind of schemes will be required in the future?
Suzuki: Data creation is currently hard to automate, so creating consistent data is extremely important. In practice, inconsistent data will not give you the accuracy you want. A consistent, high-quality dataset can achieve the same performance with half the data of a noisy, inconsistent one.
We also lack tools that let us improve data over and over. Teams often start with a free annotation tool, but it breaks down because there is no mechanism for iteratively improving the data or for working with a large number of people. So I think we will need platforms that let many people improve data repeatedly.
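One common way to quantify the label consistency discussed here is inter-annotator agreement. The sketch below computes Cohen's kappa for two hypothetical annotators in pure Python; the labels and the "ok"/"ng" scheme are invented for illustration, not taken from FastLabel's platform.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labeled identically.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Two annotators tagging the same ten items (hypothetical labels).
ann_a = ["ok", "ng", "ok", "ok", "ng", "ok", "ng", "ok", "ok", "ng"]
ann_b = ["ok", "ng", "ok", "ng", "ng", "ok", "ng", "ok", "ok", "ok"]

kappa = cohens_kappa(ann_a, ann_b)
print(round(kappa, 2))  # -> 0.58
```

Tracking a score like this per annotator pair is one way a platform could surface inconsistency before it reaches the training set.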
ーーThere are tools that can learn from small amounts of data. What are the key points when using such tools?
Mr. Suzuki: When creating data for training, there is little point in training on examples that are similar to ones you already have. I think data selection will become important here.
It has attracted attention before, but it was not easy. Collecting data at random and annotating many similar examples will not improve accuracy much.
If we can select data deliberately rather than at random, creating examples for the areas AI is weak at, I think we will be able to reach target accuracy with, say, about 3,000 examples instead of 10,000.
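The idea of annotating dissimilar examples rather than picking at random can be sketched with greedy farthest-point sampling over feature vectors. This is a simplified illustration under assumed inputs: real pipelines would typically use model embeddings rather than the 2-D toy vectors below.

```python
import math

def farthest_point_sample(points, k):
    """Greedily pick k points that are maximally spread out,
    so near-duplicate examples are skipped instead of annotated."""
    selected = [0]  # start from the first candidate
    while len(selected) < k:
        # Pick the point whose nearest selected neighbor is farthest away.
        best_i = max(
            (i for i in range(len(points)) if i not in selected),
            key=lambda i: min(math.dist(points[i], points[j]) for j in selected),
        )
        selected.append(best_i)
    return selected

# Feature vectors for six candidate examples; indices 0-2 are near-duplicates.
candidates = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),   # cluster of similar examples
    (5.0, 5.0), (9.0, 0.0), (0.0, 9.0),   # dissimilar examples
]

picked = farthest_point_sample(candidates, 4)
print(sorted(picked))  # -> [0, 3, 4, 5]: one of the near-duplicates, then the spread-out ones
```

Only one example from the duplicate cluster is selected, so the annotation budget goes to examples that actually add information.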
From big data to good data
Mr. Suzuki: Until now, the trend was for major IT companies to collect large amounts of data and develop AI with it.
However, existing industries such as agriculture and medicine cannot produce millions of data points per day, so training inevitably happens with relatively small amounts of data. With a huge dataset, a little noise barely affects accuracy, but with a relatively small dataset of 10,000 examples or fewer, noise has a significant impact.
In the future, I think it will be necessary to create relatively small amounts of high-quality data, so-called “good data.”
We interviewed Mr. Suzuki on the theme of the shift from the algorithm era to the data era.
Because data annotation looks like a simple task, quality assurance tends to be neglected. In addition, the growing complexity of recent data specifications has made quality even harder to ensure.
That is why improving annotation accuracy with small amounts of data is important.
Going forward, I believe that focusing on creating good data in AI development will further accelerate the implementation of AI in Japan.