
AI “CLIP” that expresses the relationship between language and images, specialized for Japanese [2022]


rinna Co., Ltd. (Head office: Shibuya-ku, Tokyo / CEO: Jean “Cliff” Chen; hereinafter “rinna”) has developed CLIP (Contrastive Language-Image Pre-training), a pre-trained language-image model specialized for Japanese, together with its improved variant CLOOB, and has released both under the commercially usable Apache-2.0 license.


The release of these models is expected to stimulate research on language and image understanding in Japanese.

■ Overview

rinna has so far released pre-trained language models specialized for Japanese natural language processing (NLP), namely GPT (1.3 billion parameters) and BERT (110 million parameters), which have been used by many researchers and developers. The range of applications for general-purpose language models is expanding, and CLIP, developed by OpenAI, models the relationship between language and images.

This time, in addition to training a CLIP model specialized for Japanese, we also trained a Japanese-specialized version of CLOOB (Contrastive Leave One Out Boost), an improved model based on CLIP. We are giving back to the language and image community by releasing these models on Hugging Face, an AI model library, under the commercially usable Apache-2.0 license.

■ Hugging Face URL

CLIP: https://huggingface.co/rinna/japanese-clip-vit-b-16

CLOOB: https://huggingface.co/rinna/japanese-cloob-vit-b-16

 

■ Explanation of CLIP

CLIP is a pre-trained language-image model that can express the relationship between language and images. For example, given an image of a cat, it can determine that the text “picture of a cute cat” is a closer match than the text “picture of a dog walking”.

CLIP is trained on a large number of pairs of images and text describing those images (for example, an image of a cute cat paired with the caption “picture of a cute cat”). During training, the model learns that the image of a cute cat is close to the text “picture of a cute cat” but far from the text “picture of a dog walking”.

At the same time, it learns that the text “picture of a cute cat” is close to the image of a cute cat but far from an image of a dog walking. This training makes it possible to express the relationship between language and images. CLOOB, released at the same time as CLIP, is an improved model based on CLIP and has been reported to outperform it.
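The two-way training described above can be illustrated with a small numeric sketch. The symmetric InfoNCE-style objective below is the standard CLIP-family formulation; the toy embeddings, dimensions, and temperature value are illustrative assumptions, not rinna's actual training code:

```python
import numpy as np

def normalize(x):
    # L2-normalize each row so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss of CLIP-style training (toy sketch).
    Matching image/text pairs sit on the diagonal of the similarity
    matrix; the loss pulls them together and pushes mismatches apart."""
    logits = normalize(image_emb) @ normalize(text_emb).T / temperature
    labels = np.arange(len(logits))  # diagonal entries are the true pairs

    def xent(l):
        # cross-entropy of each row against its diagonal label
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2

# Toy embeddings: pair 0 is a "cat" image/caption, pair 1 a "dog" pair
rng = np.random.default_rng(0)
cat, dog = rng.normal(size=8), rng.normal(size=8)
images = np.stack([cat + 0.1 * rng.normal(size=8),
                   dog + 0.1 * rng.normal(size=8)])
texts = np.stack([cat, dog])
print(contrastive_loss(images, texts))  # low, since the pairs are aligned
```

Shuffling the text rows so the diagonal no longer matches makes the loss jump, which is exactly the signal that pushes mismatched pairs apart during training.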

CLIP's ability to express the relationship between language and images can be applied to various tasks. For example, it can be used for image classification, which assigns images to classes such as “cat” and “dog”, and for image retrieval, which returns images close to a given text. It can also be combined with an image generation model to generate images from text: CLIP outputs the similarity between an image and a text, and the generation model is guided to produce images that increase that similarity.
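The zero-shot classification task just mentioned reduces to a nearest-prompt search in embedding space. In the minimal sketch below, seeded random vectors stand in for the outputs of real CLIP image and text encoders (the labels and embeddings are illustrative assumptions, not model outputs):

```python
import numpy as np

def cosine(a, b):
    # cosine similarity between two embedding vectors
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_classify(image_emb, prompt_embs):
    """Return the label whose prompt embedding is closest to the image.
    `prompt_embs` maps class label -> text embedding; with a real CLIP
    these would come from the text encoder, not random vectors."""
    return max(prompt_embs, key=lambda lbl: cosine(image_emb, prompt_embs[lbl]))

# Toy embeddings standing in for real encoder outputs
rng = np.random.default_rng(42)
prompt_embs = {"cat": rng.normal(size=16), "dog": rng.normal(size=16)}
# An image embedding near the "cat" prompt, as contrastive training encourages
cat_image = prompt_embs["cat"] + 0.1 * rng.normal(size=16)
print(zero_shot_classify(cat_image, prompt_embs))  # → cat
```

No class-specific training happens here: adding a new class only means embedding one more text prompt, which is what makes the approach “zero-shot”.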

■ Features of rinna’s Japanese CLIP

Rinna’s CLIP has the following features.

  • As training data, the open-source CC12M dataset <https://github.com/google-research-datasets/conceptual-12m> of 12 million language/image pairs was used, with its captions translated into Japanese.
  • Training CLIP/CLOOB requires a large batch size; rinna’s models were trained with ample computing resources, using 8 NVIDIA Tesla A100 GPUs (80 GB memory).
  • For training CLIP/CLOOB, we used the Japanese-specialized BERT (110 million parameters) previously released by rinna.
  • The trained CLIP/CLOOB models are released on Hugging Face under the commercially usable Apache-2.0 license.
  • CLIP/CLOOB can also handle image classification tasks. In zero-shot image classification without any additional training, CLOOB achieved a top-1 accuracy of 48.36% over the 50,000 images of the 1,000-class ImageNet validation set with Japanese class labels. Table 1 shows that the model understands the relationship between language and images.

Table 1: Zero-shot image classification results for 1000 classes in the ImageNet validation set
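For reference, the top-1 accuracy in Table 1 counts a prediction as correct when the class prompt most similar to the image matches the ground-truth label. A minimal sketch of that computation (the similarity scores below are made-up illustrative values, not model outputs):

```python
import numpy as np

def top1_accuracy(similarity, labels):
    """similarity: (n_images, n_classes) image-to-prompt similarity matrix;
    labels: ground-truth class index for each image."""
    preds = similarity.argmax(axis=1)        # most similar class per image
    return float((preds == labels).mean())   # fraction predicted correctly

# 4 images over 3 classes; each row holds similarity to each class prompt
sims = np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.8, 0.1],
                 [0.3, 0.4, 0.2],   # misclassified: true class is 2
                 [0.1, 0.0, 0.7]])
labels = np.array([0, 1, 2, 2])
print(top1_accuracy(sims, labels))  # → 0.75
```

In the reported evaluation the similarity matrix would be 50,000 × 1,000, built from CLOOB's image and (Japanese) text embeddings.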

  • By combining CLIP with an image generation model, it is possible to generate images from text (Figures 1 and 2).

 

Figure 1: Output result when inputting “Kyoto, Japan at the North Pole”

Figure 2: Output result when inputting “Sunflower oil painting”

■ Future development

The large-scale pre-trained models developed by rinna’s research team are already widely used in rinna’s products. rinna plans to continue publishing its research results and giving back to the research and development community. We also aim to bring AI into society through collaboration with other companies.

[About rinna Co., Ltd.]

rinna is an AI character development company established in June 2020. Under the mission “Draw out your unique creativity with AI characters and make the world colorful,” we provide SNS applications such as “Tamashiru” and “Chararu”, in which users train and interact with AI characters.
