
Neural Networks Should Be Able to Read Objects in an Image - Issue #2

Jean de Dieu Nyandwi
In our previous issue, we revisited the history of deep learning. It was important to first have a foundation before diving deep into the latest trends.
This week, we will talk about a recent paper titled Pix2seq: A Language Modeling Framework for Object Detection from Google AI scientists (published at ICLR 2022). At a high level, the paper reveals a simple and elegant way of performing object detection using a language modeling approach. Keeping things simple, we will talk about the motivation behind the paper, the architecture, and the results on benchmark datasets.

Pix2seq Model
The Motivation
Modern R-CNN object detectors typically have two-stage networks that compute object locations and object information (object class and bounding box) respectively. Take Faster R-CNN as an example: its first stage is the Region Proposal Network (RPN), which generates regions likely to contain objects, and its second stage is Fast R-CNN, which computes the object class labels and bounding boxes. That's easy to say, but object detection is a very complicated task, and so are the current detectors. Modern detectors are designed carefully, and the choice of architectures and training techniques is very specific to the (detection) task. As the authors also noted, "the specialization and complexity of existing detection systems make them difficult to integrate into a larger system or generalize to a much broader array of tasks associated with general intelligence."
So, rather than detecting objects in an image using hand-engineered techniques selected prior to the task, is there a simple and straightforward way to detect objects in an image? Can we teach the neural network to read out the objects as long as it knows where they are located? That is the main contribution of the paper: object detection is cast as a language modeling task conditioned on the input image pixels. Unlike modern R-CNN detectors and other detectors (like DETR and single-stage detectors), the architecture and loss function are quite generic, and as a result, this new framework can be applied to other visual recognition tasks.
Given an input image, Pix2seq produces a sequence of discrete tokens that correspond to the object classes and bounding boxes.
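To make that concrete, here is a minimal sketch of how such a tokenization could look. The helper names, the 640-pixel image size, and the exact vocabulary layout are assumptions for illustration; the paper's actual scheme similarly quantizes each coordinate into one of a fixed number of bins and reserves extra tokens for class labels.

```python
def quantize(coord, image_size, n_bins=2000):
    """Map a continuous pixel coordinate to a discrete bin index."""
    return int(round(coord / image_size * (n_bins - 1)))

def box_to_tokens(box, class_id, image_size, n_bins=2000):
    """Turn one (ymin, xmin, ymax, xmax) box and its class into 5 tokens.
    Coordinate tokens live in [0, n_bins); class tokens are offset past them.
    """
    coord_tokens = [quantize(c, image_size, n_bins) for c in box]
    return coord_tokens + [n_bins + class_id]

# Two hypothetical objects in a 640x640 image -> one flat target sequence.
objects = [((120.0, 50.0, 400.0, 300.0), 3),
           ((10.0, 500.0, 200.0, 630.0), 17)]
sequence = [t for box, cls in objects for t in box_to_tokens(box, cls, 640)]
print(sequence)  # 10 integers: 5 tokens per object
```

The decoder is then trained to emit exactly such sequences, one token at a time.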
Now that we understand the high-level details of the paper, let’s take a look at the framework of Pix2seq.
Pix2seq Framework
Pix2seq is a simple and elegant network (considering how complicated the existing object detectors are). It is made of four main components:
  • Image augmentation, something that is a norm in computer vision today.
  • Box & class label sequence construction & augmentation: object class labels and bounding boxes are converted into discrete tokens (as sketched above) and augmented as well.
  • Main architecture: an encoder-decoder network. The encoder is a ConvNet or Vision Transformer (ViT) that learns representations from the input image. The decoder is a Transformer that generates one token at a time; it essentially removes the need for separate bounding-box proposal and regression networks.
  • Loss function: a softmax cross-entropy loss over tokens (see the training sketch after this list). This is a welcome simplification, since existing detectors have many loss functions that need to be tracked and merged during training.
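Putting the architecture and the loss together, here is a rough, non-authoritative sketch of a single training step in PyTorch. The toy patch-embedding encoder (standing in for the real ResNet/ViT backbone), the vocabulary split, and all sizes are assumptions made for illustration only:

```python
import torch
import torch.nn as nn

# Assumed vocabulary split: coordinate bins + COCO classes + start/end tokens.
vocab_size = 2000 + 80 + 2
d_model = 256

encoder = nn.Sequential(                 # toy stand-in for a ResNet/ViT backbone
    nn.Conv2d(3, d_model, kernel_size=16, stride=16),
    nn.Flatten(2))                       # -> (batch, d_model, num_patches)
decoder_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=6)
embed = nn.Embedding(vocab_size, d_model)
to_logits = nn.Linear(d_model, vocab_size)

images = torch.randn(2, 3, 640, 640)                 # dummy batch
targets = torch.randint(0, vocab_size, (2, 11))      # token sequences (step 2)

memory = encoder(images).transpose(1, 2)             # (batch, patches, d_model)
tgt_in, tgt_out = targets[:, :-1], targets[:, 1:]    # teacher-forcing shift
mask = nn.Transformer.generate_square_subsequent_mask(tgt_in.size(1))
hidden = decoder(embed(tgt_in), memory, tgt_mask=mask)
loss = nn.functional.cross_entropy(to_logits(hidden).transpose(1, 2), tgt_out)
loss.backward()
```

A single next-token cross-entropy replaces the box-regression, classification, and matching losses that detectors like Faster R-CNN or DETR have to balance.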
The main components of Pix2seq
Pix2seq on Benchmark Datasets
Pix2seq achieved competitive results on benchmark datasets compared to current state-of-the-art object detectors such as Faster R-CNN & DETR.
On the COCO validation set, for example, Pix2seq achieved an AP (Average Precision) of 43.0 with a ResNet50 backbone (the encoder network), while DETR gets 42.0 AP with the same backbone. With bigger backbone networks such as ResNet101, it works even better (44.5 AP). The highest AP, i.e. the best model, was achieved with a ViT backbone on large image sizes (1333×1333), with pre-training on the Objects365 dataset and fine-tuning on the COCO dataset. Without pre-training, the best model achieved 45.0 AP on large image sizes (1333×1333). As the authors noted, pre-training and fine-tuning are faster than training from scratch and generalize better; transfer learning is by now a norm in computer vision.
Comparison of Pix2seq with Faster R-CNN and DETR
Another interesting thing about Pix2seq: when visualizing the decoder's cross-attention, the decoder appears to pay the most attention to the object region when predicting its class token.
Visualization of the decoder's cross-attention
Below are detection results produced by Pix2seq. Some of them are from the paper and from a provided demo that you can play with. The best detections are made with the ViT backbone with pre-training, and with ResNet50 (at a large image size) without pre-training.
Pix2seq detection results
Key Takeaways
Pix2seq is a really elegant object detection architecture. One of the intriguing things about it is that it doesn't require a separate box classifier and bounding box regressor. Just feed the image into an encoder (such as a ResNet or ViT) to learn representations, then let the Transformer decoder generate the object class labels and box coordinates as a sequence of tokens, which can be read back into detections as in the sketch below.
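For completeness, here is how a generated token sequence could be decoded back into boxes and class labels, mirroring the hypothetical tokenization sketched earlier (again, an illustration rather than the paper's exact procedure):

```python
def tokens_to_boxes(tokens, image_size, n_bins=2000):
    """Invert the earlier tokenization: every 5 tokens become one detection.
    Assumes the (ymin, xmin, ymax, xmax, class) layout from the sketch above.
    """
    detections = []
    for i in range(0, len(tokens) - len(tokens) % 5, 5):
        *coords, class_token = tokens[i:i + 5]
        box = [c / (n_bins - 1) * image_size for c in coords]
        detections.append((box, class_token - n_bins))
    return detections
```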
The downside of the introduced framework is that inference is slow, so it can't make reliable detections in real-time applications. The authors concluded on a positive note that this will be improved in future work. Also, just like other detectors (and supervised learning algorithms in general), Pix2seq requires annotated data, which is a big bottleneck in a world full of massive unlabelled data.
Thanks for reading. For more about Pix2seq, you can check the paper, Pix2seq: A Language Modeling Framework for Object Detection, and the demo mentioned above.
Until next week, stay safe!
——-
If you enjoyed the issue, you can share it with your friends or anyone you think might enjoy reading it! You can also follow us on Twitter.
——-
P.S. Can you count them with Pix2seq?
Jean de Dieu Nyandwi @jeande_d

Trends, ideas, and the latest news in deep learning and computer vision.
