DeepMind Trains Flamingo, an 80B Parameter Vision-Language AI Model

Flamingo, an 80B parameter vision-language model (VLM) AI, was recently trained by DeepMind. By combining separately pre-trained vision and language models, Flamingo beats all other few-shot learning models on 16 vision-language benchmarks. Additionally, Flamingo can chat with users and answer questions about uploaded images and videos.

Flamingo was created by DeepMind and builds on two earlier models: Perceiver, a general-purpose multimodal architecture, and Chinchilla, a 70B parameter language generation model. Flamingo combines these two models into a single neural network, which is then trained on interleaved image and text data. The result is a vision-language AI that can learn new tasks with little to no additional training data.


Multimodal VLMs, like CLIP, have demonstrated success at zero-shot learning, but their range of tasks is constrained because these models only produce a score showing the similarity between an image and a textual description. Other VLMs, like DALL-E, can create photorealistic images from a description but cannot complete tasks like visual question answering (VQA) or image captioning since they cannot produce language.
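To see why a similarity-scoring model cannot generate language, it helps to look at what CLIP-style zero-shot classification actually computes: a cosine similarity between an image embedding and each candidate text embedding, followed by a softmax. The sketch below uses random toy embeddings in place of real encoder outputs; the dimensions and caption set are illustrative, not CLIP's actual configuration.

```python
import numpy as np

def cosine_similarity(image_vec, text_mat):
    """Cosine similarity between one image vector and each row of a text matrix."""
    image_vec = image_vec / np.linalg.norm(image_vec)
    text_mat = text_mat / np.linalg.norm(text_mat, axis=-1, keepdims=True)
    return text_mat @ image_vec

# Toy embeddings standing in for a CLIP-style dual encoder's outputs
# (in the real model these come from separate image and text towers).
rng = np.random.default_rng(0)
image_embedding = rng.normal(size=64)
text_embeddings = rng.normal(size=(3, 64))  # e.g. "a dog", "a cat", "a car"

scores = cosine_similarity(image_embedding, text_embeddings)
probs = np.exp(scores) / np.exp(scores).sum()  # softmax over candidate captions
best = int(np.argmax(probs))  # index of the best-matching description
```

The output is only a distribution over pre-written candidate descriptions, which is why tasks like captioning or VQA, where the answer must be generated token by token, are out of reach for this model family.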

To seamlessly connect the pretrained vision and language models, the researchers add two learnable architecture elements: a Perceiver Resampler and cross-attention layers. The vision encoder passes its spatiotemporal features to the Perceiver Resampler, which outputs a fixed set of visual tokens. These visual tokens condition the frozen LM via newly initialized cross-attention layers interleaved between the pretrained LM layers, letting the model incorporate visual information into the next-token prediction task.
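The key idea of the Perceiver Resampler is that a small, fixed set of learned latent queries cross-attends to a variable-length sequence of visual features, so the frozen LM always receives the same number of visual tokens regardless of how many frames or patches come in. Below is a minimal single-head NumPy sketch of that cross-attention step, not DeepMind's implementation; the latent count, feature length, and dimensions are illustrative, and the real model adds learned projections, feed-forward layers, and training of the latents.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def perceiver_resampler(visual_features, latents):
    """Single-head cross-attention from fixed learned latent queries to a
    variable-length sequence of visual features. The output length equals
    the number of latents, no matter how many features come in."""
    d = latents.shape[-1]
    attn = softmax(latents @ visual_features.T / np.sqrt(d))  # (num_latents, seq_len)
    return attn @ visual_features  # (num_latents, d) visual tokens

rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 64))            # 8 latent queries (trainable in the real model)
spatiotemporal = rng.normal(size=(197, 64))   # variable-length vision-encoder features
tokens = perceiver_resampler(spatiotemporal, latents)
# tokens has shape (8, 64): a fixed budget of visual tokens for the frozen LM
```

The same fixed-size token output is what makes it cheap to splice new cross-attention layers between the frozen LM blocks: the language model's compute cost no longer depends on the input's visual length.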

DeepMind evaluated Flamingo on 16 multimodal benchmarks, including visual dialogue, VQA, captioning, and image classification. In few-shot learning scenarios, Flamingo surpassed prior best results "by a substantial margin." On six of the benchmarks, Flamingo outperformed state-of-the-art fine-tuned models without any fine-tuning of its own: it was used purely few-shot, given only 32 task-specific examples, "approximately 1000 times less" than the fine-tuned models were trained on.

Due to technology limitations, some AI-generated images are of low quality. Since the resolution of images generated by DALL-E may not satisfy everyone who values image quality, a tool like VanceAI Image Upscaler is useful in this situation.

Note: VanceAI Image Upscaler only offers an AI upscaling service and does not offer any creative generation like DALL-E 2.

About Mia Woods

Meet Mia Woods, an esteemed author with a wealth of experience in writing AI news and creating AI product tutorials. With a deep understanding of the tech industry, Mia is your go-to expert for all things AI. When she's not busy exploring the latest advancements, you'll find her indulging in her passions for hiking, photography, and exploring new cuisines. Join Mia on her journey of unraveling the fascinating world of artificial intelligence.
