Modern R-CNN object detectors are typically two-stage networks that compute object locations and object information (object class and bounding box) respectively. Let’s take the example of Faster R-CNN. Its first stage is the Region Proposal Network (RPN), which generates regions likely to contain objects, and its second stage is Fast R-CNN, which computes the object class labels and bounding boxes. That’s easy to say, but object detection is a very complicated task, and so are the current detectors. Modern detectors are designed carefully, and the choice of architectures and training techniques is very specific to the (detection) tasks. As the authors also noted, “the specialization and complexity of existing detection systems make them difficult to integrate into a larger system or generalize to a much broader array of tasks associated with general intelligence.”
So, rather than detecting objects in an image using hand-engineered techniques selected prior to the task, is there a simple and straightforward way to detect objects in an image? Can we teach the neural network to read out the objects as long as it knows where they are located? That is the main contribution of the paper: object detection is viewed as a language modeling task conditioned on the input image pixels. Unlike modern R-CNN detectors and other detectors (like DETR and single-stage detectors), the architecture and loss functions are quite generic, and as a result this new framework can be applied to other visual recognition tasks.
Given an input image, Pix2seq produces a sequence of discrete tokens that correspond to the object classes and bounding boxes.
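To make this concrete, here is a minimal sketch of how an object might be turned into such a token sequence, following the quantization scheme described in the paper: each continuous coordinate is binned into one of `num_bins` discrete values, and the class label gets a token in a separate range. The function name and parameters here are illustrative, not the authors' actual code.

```python
def box_to_tokens(box, class_id, image_size, num_bins=1000):
    """Convert one object into a token sequence [y_min, x_min, y_max, x_max, class].

    box: (y_min, x_min, y_max, x_max) in pixel coordinates.
    image_size: (height, width) of the input image.
    """
    height, width = image_size
    tokens = []
    for i, coord in enumerate(box):
        # y coordinates are normalized by height, x coordinates by width.
        extent = height if i % 2 == 0 else width
        # Quantize the normalized coordinate into an integer bin in [0, num_bins - 1].
        bin_id = min(int(coord / extent * num_bins), num_bins - 1)
        tokens.append(bin_id)
    # Class tokens occupy a separate range after the coordinate bins.
    tokens.append(num_bins + class_id)
    return tokens
```

For example, a box `(64, 128, 192, 256)` with class id 3 in a 512×512 image becomes `[125, 250, 375, 500, 1003]`. Sequences for all objects in an image are then concatenated and predicted token by token, just like words in a sentence.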
Now that we understand the high-level details of the paper, let’s take a look at the framework of Pix2seq.