
What is DALL-E 2? From architecture to risks


Table of Contents

  • Foreword
  • Overview of DALL-E 2 Specifications
  • Architecture
  • Comparison with existing models
  • Weaknesses in image generation
    • Generating overlapping objects
    • Character generation
    • Generating high-resolution images
  • Restrictions intended to prevent abuse
    • Language restrictions
    • Suppressing input text
    • User restrictions
    • Limits on face generation
  • Anticipated harmful output
    • Disguised content
    • Visual synonyms
  • Malicious modification
    • Harassment or bullying
    • Fake images
    • Forgery of evidence
    • Trademark infringement
  • Bias
    • Racial bias
    • Cultural bias
    • Gender bias
  • Progress report on the evaluation period
  • Impact when released to the public
  • Toward the development of a Japanese version of the multimodal image generation model
    • Diversion by translation is impractical
    • Risk is inevitable
    • Human evaluation is valid
  • Summary


Foreword

On April 6, 2022, OpenAI announced DALL-E 2, an AI model that generates images from text. This article summarizes the model’s specifications and mechanism, reviews its anticipated risks, and considers the impact the model may have on society. Based on these considerations, the article closes with points to keep in mind when developing a Japanese version of a multimodal image generation model.

The following content is based on OpenAI’s blog post announcing DALL-E 2, the paper detailing the model, and a report summarizing the model’s risks.

Overview of DALL-E 2 Specifications

DALL-E 2 is an image generation model: given a text caption describing an image, such as “An astronaut riding a horse in a photorealistic style,” it generates an image matching that description. It is the successor to DALL-E, announced in January 2021.

Image source: From the OpenAI blog post announcing DALL-E 2

It is worth noting that a Google image search using the above text as a search term returns only the examples cited in OpenAI’s blog post announcing DALL-E 2. In other words, no images on the internet other than those generated by the model depict “An astronaut riding a horse in a photorealistic style,” which indicates that the generated images are not copies of existing ones.

Google image search results for “An astronaut riding a horse in a photorealistic style”. Search conducted in May 2022

DALL-E 2 also accepts an arbitrary image as input and can delete or add objects in the image or change its style. In other words, it reproduces some functions of existing image-editing apps. In the example cited in the OpenAI blog post, a duck that is absent from the original image on the left has been added in the processed image on the right.

Image source: From the OpenAI blog post announcing DALL-E 2

The following images were generated from the text “Vivid portrait of Salvador Dali with half of the face as a robot.” The variety of compositions and styles is astonishing; achieving similar results manually would take a great deal of effort. Incidentally, Salvador Dalí, the representative surrealist painter, is the origin of the DALL-E 2 model name.

Image source: DALL-E 2 published paper

Images generated by DALL-E 2 can be viewed from OpenAI’s Instagram account @openaidalle . These images were generated by limited users selected by OpenAI as described below.


Architecture

The DALL-E 2 architecture consists of two parts: one that generates a CLIP image embedding from the input text, and one that generates diverse images from that embedding. CLIP, an image recognition model announced by OpenAI in January 2021, matches captions to arbitrary images; generating a CLIP image embedding means producing an embedded representation of an image that reflects the meaning of the input text. This step determines only the basic content of the image, which has not yet been rendered in any particular composition or style.

Image source: DALL-E 2 published paper

The part that generates diverse images from the embedding employs a diffusion model, which renders an image by repeatedly refining noise while conditioning on the embedding.
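As a rough illustration of this two-stage design, the sketch below uses toy stand-ins for both stages: a “prior” that maps text to an embedding and a “decoder” that iteratively refines noise toward an image conditioned on that embedding. The dimensions, functions, and update rule are all illustrative assumptions, not the real model.

```python
import zlib
import numpy as np

EMBED_DIM = 4          # real CLIP embeddings have hundreds of dimensions
IMAGE_SHAPE = (8, 8)   # real outputs are 64x64, then upsampled

def prior(text: str) -> np.ndarray:
    """Toy prior: deterministically map text to a unit-norm 'embedding'."""
    rng = np.random.default_rng(zlib.crc32(text.encode()))
    z = rng.normal(size=EMBED_DIM)
    return z / np.linalg.norm(z)

def decoder(embedding: np.ndarray, steps: int = 10, seed: int = 0) -> np.ndarray:
    """Toy diffusion-style decoder: start from pure noise and repeatedly
    move the sample toward a target derived from the embedding, a crude
    stand-in for iterative denoising conditioned on the embedding."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=IMAGE_SHAPE)            # pure noise
    target = np.outer(embedding.repeat(2), embedding.repeat(2))
    for _ in range(steps):
        x = 0.8 * x + 0.2 * target              # "denoise" toward the target
    return x

# Different decoder seeds yield different images for the same embedding,
# mirroring how DALL-E 2 produces diverse images for one text input.
emb = prior("an astronaut riding a horse")
print(decoder(emb, seed=1).shape)  # (8, 8)
```

Varying only the decoder seed while keeping the embedding fixed is the toy analogue of the model’s ability to produce many compositions for the same caption.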

Comparison with existing models

OpenAI conducted a comparative test against GLIDE, a multimodal image generation model announced in December 2021, as a benchmark to confirm the performance of DALL-E 2. The comparison was based on human ratings of the images generated by the two models on three criteria: “photorealism,” “similarity with text,” and “diversity.”

In the “photorealism” evaluation, human testers were shown images generated by DALL-E 2 and GLIDE one pair at a time and asked to choose which was more photorealistic. For “similarity with text,” the testers were shown the images generated by the two models together with the input text and asked to rate the similarity. For “diversity,” they were asked which model’s set of four images, generated from the same input text, was more diverse. The results of these tests are shown in the table below.

Percentage of respondents preferring DALL-E 2

Photorealism      Similarity to text   Diversity
48.9% ± 3.1%      45.3% ± 3.0%         70.5% ± 2.8%

As the table shows, the two models are nearly even on photorealism, and GLIDE is marginally ahead on similarity to text, while DALL-E 2 greatly surpasses GLIDE in diversity.
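One way to read the table above is to check whether each win rate’s error interval contains the 50% tie line: photorealism is a statistical tie, similarity to text tips slightly toward GLIDE, and diversity clearly favors DALL-E 2. That reading can be sketched as:

```python
def prefers_dalle2(win_rate: float, margin: float) -> str:
    """Classify a pairwise win rate (percent, with its error margin)
    against the 50% tie line."""
    lo, hi = win_rate - margin, win_rate + margin
    if lo > 50.0:
        return "DALL-E 2 preferred"
    if hi < 50.0:
        return "GLIDE preferred"
    return "no clear preference"

results = {
    "photorealism": (48.9, 3.1),
    "similarity to text": (45.3, 3.0),
    "diversity": (70.5, 2.8),
}
for criterion, (p, m) in results.items():
    print(criterion, "->", prefers_dalle2(p, m))
```

This is only interval arithmetic on the reported numbers, not a re-analysis of the underlying study.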

In addition to GLIDE, the DALL-E 2 paper also reports comparisons with DALL-E, with Make-A-Scene (a multimodal image generation model announced by Meta AI Research in March 2022), and with a development version of DALL-E 2. In the image list below, the top row shows real photographs and the bottom row shows images produced by the release version of DALL-E 2.

Image source: DALL-E 2 published paper

The comparison between DALL-E 2 and existing multimodal image generation models also used FID (Fréchet Inception Distance) computed on the MS-COCO image dataset, an index often used to compare image quality; the lower the value, the higher the image quality. DALL-E 2 achieved the lowest value, 10.39.
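For reference, FID compares the mean and covariance of feature vectors extracted from real and generated images. A minimal sketch of the formula, using random vectors in place of actual Inception-network features, might look like this:

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet Inception Distance between two sets of feature vectors
    (rows = samples). In practice the features come from an Inception
    network applied to real and generated images."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    s1 = np.cov(feats_real, rowvar=False)
    s2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):      # numerical noise can yield tiny
        covmean = covmean.real        # imaginary parts; drop them
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(500, 8))   # stand-in "real" features
b = rng.normal(0.5, 1.0, size=(500, 8))   # stand-in "generated" features
print(abs(fid(a, a)) < 1e-6)  # identical sets -> distance ~0: True
print(fid(a, b) > 0)          # shifted distribution -> positive: True
```

Identical feature sets give a distance of (numerically) zero, and the distance grows as the generated distribution drifts from the real one, which is why lower FID means higher quality.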

As mentioned above, DALL-E 2 employs a diffusion model to generate diverse images. Diffusion models have suffered from a “diversity versus image quality” trade-off: the more diverse the images, the less realistic they become. DALL-E 2 improves on this problem, achieving both diversity and realism compared with existing models.

In the image list below, the images on the left were generated by DALL-E 2 and those on the right by GLIDE. The numbers from 1.0 to 4.0 on the left are a parameter controlling the realism of the image; the higher the number, the more realistic the output. At a realism setting of 4.0, GLIDE’s images converge to similar compositions, whereas DALL-E 2 still shows a variety of compositions.

Image source: DALL-E 2 published paper

Weaknesses in image generation

While DALL-E 2 has innovative image generation capabilities, image generation has been found to fail in three contexts:

Generating overlapping objects

It may fail to generate images in which similar objects are stacked, such as an image of dice. The left side of the image below was generated by DALL-E 2, the right side by GLIDE. Some of the dice generated by DALL-E 2 are unnatural and lack shadows.

Image source: DALL-E 2 published paper

Character generation

Text in generated images, such as writing on a signboard, is often misspelled. The image below is the output for the input text “Deep Learning sign.”

Image source: DALL-E 2 published paper

Generating high-resolution images

When generating high-resolution images, parts of the image may be corrupted. In the dog image below the legs are drawn incorrectly, and in the image of New York’s Times Square the building walls are partially broken.

Image source: DALL-E 2 published paper

Restrictions intended to prevent abuse

As described above, DALL-E 2’s innovative image generation ability brings a risk of abuse. Concerned about this risk, OpenAI has not released the model to the public; instead it has selected users for an evaluation period. Restrictions established for this evaluation stage include the following.

Language restrictions

DALL-E 2 supports English only. The significance of this restriction is not merely that input must be in English, but that the output images are rooted in the culture of the English-speaking world.

Suppressing input text

To prevent the output of sexual or violent imagery, words that could generate such images are excluded from the input text.
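OpenAI has not published how this filtering works, but conceptually it amounts to rejecting prompts that contain blocked terms. The sketch below is purely illustrative; the blocklist contents and the matching logic are assumptions:

```python
import re

# Illustrative placeholders only; the real blocklist is not public.
BLOCKED_TERMS = {"gore", "nude"}

def is_allowed(prompt: str) -> bool:
    """Reject prompts containing any blocked term as a whole word."""
    words = set(re.findall(r"[a-z]+", prompt.lower()))
    return words.isdisjoint(BLOCKED_TERMS)

print(is_allowed("an astronaut riding a horse"))  # True
print(is_allowed("a scene full of gore"))         # False
```

Whole-word matching avoids false positives such as “gorgeous,” though, as the “disguised content” section below shows, word filtering alone cannot catch every harmful prompt.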

User restrictions

As of May 2022, DALL-E 2 evaluation users were limited to 400 people selected by OpenAI: 200 OpenAI employees, 25 researchers, 10 creators, and 165 people connected to OpenAI employees. This last group includes OpenAI executives, Microsoft employees, and employees’ family and friends; Microsoft employees are presumably included because Microsoft has invested in OpenAI.

The choice of 400 evaluation users is based on OpenAI’s experience evaluating GPT-3.

Limits on face generation

The ability to generate photorealistic images of human faces carries a variety of misuse risks. DALL-E 2 therefore has two limitations on facial image generation.

The faces of public figures known to many people, such as dignitaries and celebrities, are intentionally restricted so that they cannot be generated. However, OpenAI has not specified which public figures are covered by this restriction.

In addition, DALL-E 2 has a technical safeguard: it does not output images that exactly reproduce any input image, including facial images. Simply put, it will not output a copy of an input face. However, it may still output a close likeness of an input face.

Anticipated harmful output

Even with these restrictions, cleverly malicious text input can still yield images judged harmful. Two such cases of harmful output are anticipated.

Disguised content

Even input that contains nothing directly harmful can produce harmful images depending on its context. For example, “a person eating an eggplant” yields a safe image, but “a person eating a whole eggplant” may yield a sexual image. In such cases, the input is disguised as safe in order to elicit a harmful image.

The possibility of disguised content was noticed during the development of DALL-E 2, where such harmful images proved especially likely when generating low-resolution images. Since the released version of the model generates high-quality images, the risk of disguised content is believed to be somewhat mitigated.

Visual synonyms

In other cases, input text that is itself safe can still yield a visually harmful image, because some safe objects look like harmful ones. For example, the word “blood” is likely to generate violent images and can be blocked, but “ketchup” can be safe or harmful depending on the context of the generated image, because ketchup and gore are visually similar.

Malicious modification

DALL-E 2 can not only generate images but also add and remove objects in them, so the same model can be used to modify a safe image into a malicious one. Such modifications can be produced without any harmful words in the input, making them difficult to prevent by restricting input text. Four typical examples of malicious image modification follow.

Harassment or bullying

Images may be generated or modified to attack the dignity of a specific individual. Specifically, the following contexts are possible:

  • Clothing modification: adding or removing religious clothing (such as a hijab).
  • Adding specific foods: adding meat to an image of a vegetarian.
  • Replacing people in images: for example, altering a photo of a couple holding hands so that a person appears to be holding hands with someone other than their spouse.

Fake images

Even a safe image can become a fake image that misleads viewers, depending on the context in which it is presented. For example, adding smoke to the sky in a cityscape image could be used to claim that war has broken out.

Forgery of evidence

A related case is the forgery of evidence, and fake news can then be produced on the basis of the fabricated evidence.

Trademark infringement

The current DALL-E 2 can generate trademarked logos, copyrighted characters, and the like, and abuse of this ability is of course conceivable. However, OpenAI does not plan to completely restrict the generation of trademarked logos; it appears to permit appropriate, quotation-like uses.


Bias

As noted under the language restriction, DALL-E 2 generates images rooted in English-speaking culture, which means the generated images reflect the biases of the English-speaking world. Three types of bias appear in generated images:

Racial bias

DALL-E 2 tends to preferentially depict white people from the English-speaking world. For example, for the text “a builder,” it outputs images of white male construction workers.

Image source: Quoted from the system card document summarizing the risks of DALL-E 2

Cultural bias

The model preferentially outputs images based on Western culture. For example, for the text “a wedding,” it outputs images of Western-style weddings but not, say, a Japanese Shinto ceremony.

Image source: Quoted from the system card document summarizing the risks of DALL-E 2

Gender bias

The model outputs images reflecting the gender biases of the English-speaking world. For example, the input “lawyer” outputs images of white men in legal attire, while “nurse” outputs only women, albeit racially diverse ones.

Image source: Quoted from the system card document summarizing the risks of DALL-E 2

Image source: Quoted from the system card document summarizing the risks of DALL-E 2

As you can see from the “lawyer” example, the above three biases often appear in combination.

Progress report on the evaluation period

As mentioned above, DALL-E 2 is in an evaluation stage open only to selected users. Against this background, OpenAI published an official blog post on May 18, 2022, reporting findings from the evaluation. The report can be summarized in the following five points.

  • The initial limited users alone have already generated 3 million images with DALL-E 2.
  • Text filters that exclude risky word inputs have been improved, and the safety system has been strengthened, including automatic detection of generated images that violate the content policy and adjustments to the response process triggered on detection.
  • Fewer than 0.05% of generated images that were downloaded or shared were flagged as potential content-policy violations; human reviewers confirmed 30% of the flagged images as violations, leading to account deactivation.
  • Limited users were asked not to generate photorealistic images containing human faces and to flag any problematic generations they encounter; this request will continue for the time being.
  • DALL-E 2 accounts will be issued to up to 1,000 new users every week.
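Treating the reported rates as if they applied to all 3 million generated images (the report actually scopes the 0.05% to downloaded or shared images) gives a rough upper bound on the scale of violations:

```python
# Back-of-the-envelope arithmetic from the progress report figures.
total_images = 3_000_000
flag_rate = 0.0005      # "fewer than 0.05%" flagged
confirm_rate = 0.30     # 30% of flags confirmed by human review

flagged = total_images * flag_rate
confirmed = flagged * confirm_rate
print(round(flagged))    # 1500 (upper bound on flagged images)
print(round(confirmed))  # 450  (upper bound on confirmed violations)
```

Even as an upper bound, that is on the order of hundreds of confirmed violations out of millions of images, which helps explain OpenAI’s emphasis on automated detection plus human review.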

From this progress report, it can be inferred that the public version of DALL-E 2 will ship with a robust safety system to prevent the generation of problematic images.

Impact when released to the public

Currently DALL-E 2 is available only to limited users, but like GPT-3 it is expected eventually to become commercially available to general users. Below we summarize the expected economic impact of a public release.

Because DALL-E 2 can automate or streamline image production tasks, its general release is expected either to displace occupations related to image production or to improve their productivity. Affected occupations include photographers, graphic designers, artists, and models. Adjacent services such as stock photo services and illustration services may also be replaced.

With the public release of DALL-E 2, new skills and services also become conceivable. For example, as an alternative to stock photo services, services that generate images matching the content of a text may emerge. Such services would be invaluable to web media, which could obtain customized featured and inset images for individual articles in a short time.

A fusion of DALL-E 2 with existing apps is also expected. Such an app would be a further evolution of NVIDIA’s GauGAN AI art demo, which is built on a GAN model. Creators who master these image-generating apps will be well positioned professionally.

Toward the development of a Japanese version of the multimodal image generation model

The emergence of multimodal image generation models like DALL-E 2 is an inevitable trend in AI development, so a Japanese version of such a model will likely be developed in the near future (it may already be under development). Below we summarize what the development of DALL-E 2 suggests about developing such a model.

Diversion by translation is impractical

One conceivable approach to a Japanese multimodal image generation model is to reuse the public version of DALL-E 2: translate the Japanese input text into English, pass it to DALL-E 2, and return the output. However, this method is not very practical.
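The translation-based approach can be sketched as a simple pipeline. Both functions below are hypothetical stand-ins, since neither a specific translation service nor a public DALL-E 2 API is named in this article:

```python
def translate_ja_to_en(text_ja: str) -> str:
    # Stand-in for a machine translation call; a tiny glossary replaces
    # a real translation service here.
    glossary = {
        "写実的な様式で馬に乗る宇宙飛行士":
            "an astronaut riding a horse in a photorealistic style",
    }
    return glossary.get(text_ja, text_ja)

def generate_image(prompt_en: str) -> str:
    # Stand-in for a call to an image generation model; returns a label
    # instead of pixels so the pipeline stays self-contained.
    return f"<image for: {prompt_en}>"

def generate_from_japanese(text_ja: str) -> str:
    """Translate Japanese input to English, then generate an image."""
    return generate_image(translate_ja_to_en(text_ja))

print(generate_from_japanese("写実的な様式で馬に乗る宇宙飛行士"))
```

The mechanical plumbing is trivial; the problem discussed next is that translation does not change what cultural context the downstream model draws on.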

As mentioned above, DALL-E 2 produces output rooted in English-speaking culture, so even via translation, Japanese users would not get the output they want. A model that outputs the images Japanese users want must be trained on text and images from the Japanese-speaking world.

Risk is inevitable

If a Japanese multimodal image generation model rooted in the Japanese-speaking world is developed, it will carry the same potential risks as DALL-E 2 described above: harmful images and fake news produced via disguised content, as well as biases, in this case aligned with the Japanese-speaking world.

A bias specific to the Japanese-speaking world might be, for example, that the text “otaku” tends to output an image of a man wearing distinctive clothing.

Human evaluation is valid

To mitigate the potential risks of a Japanese multimodal image generation model, human evaluation, as OpenAI is conducting for DALL-E 2, is considered essential. In such evaluations, the evaluation group should be composed fairly, that is, without bias in gender, age, and so on.


Summary

Multimodal image generation models such as DALL-E 2 are expected to become commercially available worldwide in the near future. When that happens, as discussed above, photographers and graphic designers may find their work displaced or transformed.

When multimodal image generation models take off in earnest, will those whose livelihoods are at risk campaign against AI, as the Luddites once campaigned against machinery? Even such a backlash would probably not stop the rise of AI.

The best way to adapt to the full rise of multimodal image generation models is to become familiar with these AIs. And the best way to survive is to use AI to streamline operations and invent new image generation skills. In a nutshell, future creators should form a co-creative relationship with state-of-the-art image-generating AI .

As colleagues, multimodal image generation models are highly capable. Creators who have AI on their side will revitalize the creative industry of the future.


