In the future, we plan to analyze how models like DALL♾ relate to societal issues like economic impact on certain work processes and professions, the potential for bias in the model outputs, and the longer term ethical challenges implied by this technology. We recognize that work involving generative models has the potential for significant, broad societal impacts. This training procedure allows DALL♾ to not only generate an image from scratch, but also to regenerate any rectangular region of an existing image that extends to the bottom-right corner, in a way that is consistent with the text prompt. It receives both the text and the image as a single stream of data containing up to 1280 tokens, and is trained using maximum likelihood to generate all of the tokens, one after another. Like GPT-3, DALL♾ is a transformer language model. We extend these findings to show that manipulating visual concepts through language is now within reach.
Image GPT showed that the same type of neural network can also be used to generate images with high fidelity.
GPT-3 showed that language can be used to instruct a large neural network to perform a variety of text generation tasks.