DeepMind recently trained Flamingo, an 80B-parameter vision-language model (VLM). Flamingo combines separately pre-trained vision and language models and outperforms all other few-shot learning models on 16 vision-language benchmarks. It can also chat with users and answer questions about uploaded images and videos.
Flamingo builds on two earlier DeepMind models: Chinchilla, a 70B-parameter language generation model, and Perceiver, a multimodal classifier model. Flamingo combines these two models into a single neural network, which is then trained on datasets of interleaved images and text. The result is a vision-language model that can learn new tasks with little or no additional training data.
Multimodal VLMs such as CLIP have demonstrated success at zero-shot learning, but the range of tasks they can handle is limited because they only produce a score indicating the similarity between an image and a textual description. Other VLMs, such as DALL-E, can generate photorealistic images from a description but cannot produce language, so they cannot perform tasks such as visual question answering (VQA) or image captioning.
To connect the pretrained vision and language models, the researchers add two learnable architecture components: a Perceiver Resampler and cross-attention layers. The vision encoder passes its spatiotemporal features to the Perceiver Resampler, which outputs a set of visual tokens. These visual tokens condition the frozen LM through newly initialized cross-attention layers inserted between the pretrained LM layers, which lets the model incorporate visual information into the next-token prediction task.
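The data flow through these two components can be illustrated with a toy sketch in plain Python. This is not DeepMind's implementation. It assumes single-head attention with no learned projections, tiny dimensions, and random values in place of trained weights. All function names are hypothetical. The tanh gate that starts at zero follows the description in the Flamingo paper, which this article does not cover, so treat it as an assumption here.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    """Single-head scaled dot-product attention (no learned projections)."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    scores = matmul(queries, transpose(context))
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, context)

def perceiver_resampler(visual_features, latents):
    """Latent queries attend to the visual features.

    The output has one visual token per latent, regardless of how many
    visual features the vision encoder produced.
    """
    return cross_attention(latents, visual_features)

def gated_cross_attention(text_hidden, visual_tokens, gate):
    """Add attended visual information to the text states.

    Assumption: a tanh gate scales the visual contribution, so gate=0.0
    leaves the language model's hidden states untouched.
    """
    attended = cross_attention(text_hidden, visual_tokens)
    g = math.tanh(gate)
    return [[h + g * a for h, a in zip(h_row, a_row)]
            for h_row, a_row in zip(text_hidden, attended)]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
DIM = 8                                      # toy feature width
latents = random_matrix(4, DIM, rng)         # 4 latent queries (toy size)
features_a = random_matrix(10, DIM, rng)     # e.g. features from one image
features_b = random_matrix(50, DIM, rng)     # e.g. features from a video clip
text_hidden = random_matrix(6, DIM, rng)     # hidden states at 6 text positions

tokens_a = perceiver_resampler(features_a, latents)
tokens_b = perceiver_resampler(features_b, latents)
unchanged = gated_cross_attention(text_hidden, tokens_a, gate=0.0)
conditioned = gated_cross_attention(text_hidden, tokens_a, gate=1.0)
```

In this sketch both inputs, 10 features and 50 features, are reduced to the same number of visual tokens, so the cost of the downstream cross-attention layers does not depend on input size. With the gate at zero, the text states pass through unchanged, which is one way a frozen LM can keep its original behavior at the start of training.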
DeepMind evaluated Flamingo on 16 multimodal benchmarks covering visual dialogue, VQA, captioning, and image classification. In few-shot learning scenarios, Flamingo surpassed prior best results "by a substantial margin." On six of the benchmarks, Flamingo outperformed state-of-the-art fine-tuned models without any fine-tuning of its own: it was used only in a few-shot setting and given just 32 examples, which is "approximately 1000 times less" data than the fine-tuned models used.
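As a rough illustration of what a 32-shot setting involves, the sketch below assembles a few-shot captioning prompt by interleaving support images with their text and ending with an unanswered query image. The `<image>` placeholder, the `Output:` prefix, the `Example` class, and the file names are all hypothetical choices made for this sketch. They are not a documented Flamingo input format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Example:
    image_path: str   # hypothetical path to a support image
    text: str         # the expected answer or caption for that image

def build_few_shot_prompt(
    support: List[Example],
    query_image: str,
    image_tag: str = "<image>",   # assumed placeholder marking an image slot
) -> Tuple[List[str], str]:
    """Interleave support examples, then append the query with no answer."""
    images = [ex.image_path for ex in support]
    parts = [f"{image_tag}Output: {ex.text}" for ex in support]
    # The query image goes last; the model is expected to continue the text.
    images.append(query_image)
    parts.append(f"{image_tag}Output:")
    return images, "\n".join(parts)

# 32 support examples, matching the shot count mentioned in the article.
support = [Example(f"img_{i}.jpg", f"caption {i}") for i in range(32)]
images, prompt = build_few_shot_prompt(support, "query.jpg")
```

No model weights change in this setup. The 32 examples are supplied only as context, and the model's continuation after the final `Output:` is taken as its prediction.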