Efficient Hybrid Transformer: Learning Global-local Context for Urban Scene Segmentation
Semantic segmentation of fine-resolution urban scene images plays a vital
role in extensive practical applications, such as land cover mapping, urban
change detection, environmental protection and economic assessment. Driven by
rapid developments in deep learning technologies, the convolutional neural
network (CNN) has dominated the semantic segmentation task for many years.
Convolutional neural networks adopt hierarchical feature representation,
demonstrating strong local information extraction. However, the local property
of the convolution layer limits the network from capturing global context that
is crucial for precise segmentation. Recently, the Transformer has become a hot
topic in the computer vision domain. The Transformer demonstrates a great capability for
global information modelling, boosting many vision tasks, such as image
classification, object detection and especially semantic segmentation. In this
paper, we propose an efficient hybrid Transformer (EHT) for real-time urban
scene segmentation. The EHT adopts a hybrid structure with a CNN-based
encoder and a transformer-based decoder, learning global-local context with
lower computation. Extensive experiments demonstrate that our EHT has faster
inference speed with competitive accuracy compared with state-of-the-art
lightweight models. Specifically, the proposed EHT achieves a 66.9% mIoU on the
UAVid test set and outperforms other benchmark networks significantly. The code
will be available soon.
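The hybrid structure described above (a CNN encoder for local features feeding a Transformer-based stage for global context, followed by per-pixel classification) could be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' EHT implementation: the class name, layer sizes, depth, and upsampling strategy are all assumptions.

```python
# Minimal sketch of a hybrid CNN/Transformer segmentation network.
# All names and hyperparameters are illustrative assumptions, not the EHT paper's.
import torch
import torch.nn as nn

class HybridSegNet(nn.Module):
    def __init__(self, num_classes=8, dim=64, num_heads=4, depth=2):
        super().__init__()
        # CNN encoder: hierarchical local feature extraction (downsamples 4x)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer stage: global context over the flattened feature map
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads,
            dim_feedforward=dim * 2, batch_first=True)
        self.context = nn.TransformerEncoder(layer, num_layers=depth)
        # Per-pixel classifier, then upsample back to the input resolution
        self.head = nn.Conv2d(dim, num_classes, 1)
        self.up = nn.Upsample(scale_factor=4, mode="bilinear",
                              align_corners=False)

    def forward(self, x):
        f = self.encoder(x)                    # (B, C, H/4, W/4): local features
        b, c, h, w = f.shape
        tokens = f.flatten(2).transpose(1, 2)  # (B, H*W/16, C) token sequence
        tokens = self.context(tokens)          # global context mixing
        f = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.up(self.head(f))           # (B, num_classes, H, W)

model = HybridSegNet()
out = model(torch.randn(1, 3, 64, 64))
print(out.shape)  # torch.Size([1, 8, 64, 64])
```

Keeping the Transformer on the downsampled feature map (rather than on full-resolution pixels) is what keeps the attention cost low, which is consistent with the paper's emphasis on real-time inference.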
Extracting structured information from 2D images
Convolutional neural networks can handle an impressive array of supervised learning tasks while relying on a single backbone architecture, suggesting that one solution fits all vision problems. But for many tasks, we can directly make use of the problem structure within neural networks to deliver more accurate predictions. In this thesis, we propose novel deep learning components that exploit the structured output space of an increasingly complex set of problems.

We start from Optical Character Recognition (OCR) in natural scenes and leverage the constraints imposed by the spatial outline of letters and by language requirements. Conventional OCR systems do not work well in natural scenes due to distortions, blur, or letter variability. We introduce a new attention-based model, equipped with extra information about neuron positions to guide its focus across characters sequentially. It beats the previous state of the art by a significant margin.

We then turn to dense labeling tasks employing encoder-decoder architectures. We start with an experimental study that documents the drastic impact that decoder design can have on task performance. Rather than optimizing one decoder per task separately, we propose new robust layers for upsampling high-dimensional encodings, and we show that these better suit the structured per-pixel output across all tasks.

Finally, we turn to the problem of urban scene understanding. There is an elaborate structure in both the input space (multi-view recordings, aerial and street-view scenes) and the output space (multiple fine-grained attributes for holistic building understanding). We design new models that benefit from the relatively simple cuboid-like geometry of buildings to create a single unified representation from multiple views. To benchmark our model, we build a new large-scale multi-view dataset of building images and fine-grained attributes, and we show systematic improvements over a broad range of strong CNN-based baselines.
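The position-guided attention idea described earlier in this abstract (attention that carries explicit position information so its focus can move across characters sequentially) could be sketched as follows. This is a hedged illustration only: the class name, the learned position embedding, and the additive scoring scheme are assumptions, not the thesis implementation.

```python
# Illustrative sketch of attention augmented with position information.
# Names and design choices are assumptions, not the thesis's actual model.
import torch
import torch.nn as nn

class PositionAwareAttention(nn.Module):
    def __init__(self, dim=32, max_len=64):
        super().__init__()
        self.pos = nn.Embedding(max_len, dim)  # learned position information
        self.query = nn.Linear(dim, dim)       # projects the decoder state
        self.score = nn.Linear(dim, 1)         # scalar attention energy

    def forward(self, feats, state):
        # feats: (B, T, dim) encoder features; state: (B, dim) decoder state
        t = feats.size(1)
        keys = feats + self.pos(torch.arange(t))              # inject positions
        e = self.score(torch.tanh(keys + self.query(state).unsqueeze(1)))
        alpha = torch.softmax(e, dim=1)                       # (B, T, 1) weights
        context = (alpha * feats).sum(dim=1)                  # (B, dim) summary
        return context, alpha

att = PositionAwareAttention(dim=16, max_len=10)
ctx, w = att(torch.randn(3, 7, 16), torch.randn(3, 16))
print(ctx.shape, w.shape)  # torch.Size([3, 16]) torch.Size([3, 7, 1])
```

Because the keys carry position embeddings, the scoring can favor locations adjacent to the previously attended character, which is the intuition behind sequential, left-to-right focus in scene-text recognition.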