Exploring Verona Through the Lens of Semantic Segmentation

In this video, we can see the OneFormer semantic segmentation model in action among the streets and squares of Verona. This represents the ideal scenario for benchmarking semantic segmentation algorithms, as the scenes are cluttered with many agents belonging to many different classes. There are tourists strolling around the ancient buildings, cars, bikes, and buses traveling around the streets that are full of billboards, signs, and bustling activities. Furthermore, the lighting of outdoor scenes is heavily dependent on the current weather conditions, making the task even more complex to solve with satisfactory precision. However, as you will see in this video, OneFormer is able to tackle all these problems, reaching a very satisfactory level of precision in all the scenarios. 

What is Semantic Segmentation? 

Recent advancements in artificial intelligence algorithms for computer vision are allowing more tasks to be tackled with increasing levels of precision. As an example, the latest semantic segmentation methods based on transformer models are both able to detect a lot of different classes of objects and to locate them with remarkable levels of detail. We could see these families of algorithms as the natural evolution of object detection methods. While detecting an object simply means finding the rectangle that contains it, segmentation requires identifying each single pixel belonging to it. As a result, the spatial information obtained by the state-of-the-art segmentation algorithms is much more fine-grained, meaning that it can be used to tackle more complex tasks. 

What is Semantic Segmentation Used for? 

The fine-grain localization obtained with state-of-the-art segmentation algorithms deeply influences varied domains. In autonomous vehicles, it precisely identifies lanes, obstacles, and traffic signs at a pixel level, enhancing safety and decision-making. In medical imaging, it meticulously classifies pixels to delineate organs and abnormalities, elevating diagnostic accuracy and treatment planning. Furthermore, many computer vision methods can benefit from leveraging fine-grained spatial information. As an example, tracking algorithms usually require describing each object in the most accurate and informative way possible, as they might have to distinguish between many similar objects. For this reason, it can be easily seen that using even a small number of pixels belonging to the wrong object (e.g., background) can degrade the overall tracking performance. 

What techniques were used for the video? 

To compute the semantic segmentation masks for this video, we employed the OneFormer model due to its remarkable level of precision. Primarily, this is a result of intensive optimization applied to its already potent encoder-decoder backbone transformer model.

Firstly, OneFormer utilizes a task token (Qtask) to initialize the queries that enter the transformer backbone. This approach ensures that the model’s focus is inherently tuned towards semantic segmentation. Additionally, the incorporation of a learnable text context (Qctx) enables the model to adaptively adjust its feature extraction and processing capabilities; this is essential for complex image interpretations. For instance, in a cityscape, Qctx could help the model understand the interplay between urban elements like buildings, roads, and parks, providing a cohesive understanding of the urban environment.

The dual aspect of Qtask and Qctx initialization is instrumental in enhancing the model’s ability to differentiate between various semantic classes with a higher degree of precision.

Project Page