With that, Flamingo is a state-of-the-art visual language model that can perform various joint vision-language tasks such as visual question answering, visual dialogue, and image captioning using few-shot learning. That is, given just two images and their captions, Flamingo can learn from them to predict the caption of an unseen image. Two shots!
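To make the two-shot idea concrete, here is a minimal sketch of how such a few-shot captioning prompt could be assembled. The `<image>` placeholder, the `Output:` marker, and the `build_few_shot_prompt` helper are illustrative assumptions for this sketch, not Flamingo's actual API:

```python
def build_few_shot_prompt(example_captions, query_marker="Output:"):
    # Interleave one <image> placeholder per example with its caption,
    # then end with the query image whose caption the model must predict.
    # (<image> and "Output:" are assumed formats for illustration only.)
    parts = [f"<image> Output: {caption}" for caption in example_captions]
    parts.append(f"<image> {query_marker}")
    return " ".join(parts)

# Two-shot prompt: two image-caption pairs, then the unseen query image.
prompt = build_few_shot_prompt([
    "A dog chasing a frisbee in the park.",
    "A cat sleeping on a windowsill.",
])
print(prompt)
```

The model would then continue the text after the final `Output:`, producing a caption for the query image conditioned on the two in-context examples.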
One of the intriguing things about Flamingo is that it can handle arbitrarily interleaved inputs of images, videos, and text.
In summary, the main contribution of Flamingo is using few-shot learning to perform numerous visual language tasks such as visual question answering. Flamingo achieved state-of-the-art results on almost all vision-language benchmarks it was evaluated on. We will take a look at the results in a later section.