Amodal Instance Segmentation and Multi-Object Tracking with Deep Pixel Embedding
This thesis extends the representational output of semantic instance segmentation to explicitly include both visible and occluded parts. A fully convolutional network is trained to produce consistent pixel-level embeddings across two layers such that, when clustered, the results convey the full spatial extent and depth ordering of each instance. Results demonstrate that the network can accurately estimate complete masks in the presence of occlusion and outperform leading top-down bounding-box approaches.
The model is further extended to produce consistent pixel-level embeddings across two consecutive video frames, simultaneously performing amodal instance segmentation and multi-object tracking. No post-processing tracker or Hungarian-algorithm matching step is needed to perform multi-object tracking. The advantages and disadvantages of such a bounding-box-free approach are studied thoroughly. Experiments show that the proposed method outperforms a state-of-the-art bounding-box-based approach on tracking animated moving objects.
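The core idea of recovering instances by clustering pixel embeddings can be illustrated with a minimal sketch. This is not the thesis's method (which uses a learned fully convolutional network); it is a hypothetical greedy clustering of already-computed per-pixel embedding vectors, where pixels whose embeddings fall within a distance threshold of an existing cluster centre are assigned to the same instance:

```python
import numpy as np

def cluster_embeddings(embeddings, threshold=0.5):
    """Greedily group per-pixel embeddings into instance labels.

    embeddings: (N, D) array, one D-dimensional embedding per pixel.
    A pixel joins the first cluster whose initial centre lies within
    `threshold` (Euclidean distance); otherwise it starts a new cluster.
    This is an illustrative stand-in for the clustering step, not the
    thesis's actual algorithm.
    """
    labels = np.full(len(embeddings), -1, dtype=int)
    centres = []
    for i, e in enumerate(embeddings):
        for k, c in enumerate(centres):
            if np.linalg.norm(e - c) < threshold:
                labels[i] = k
                break
        else:
            centres.append(e.copy())
            labels[i] = len(centres) - 1
    return labels

# Two well-separated groups of 2-D embeddings -> two instances.
emb = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
print(cluster_embeddings(emb))  # -> [0 0 1 1]
```

A network trained so that pixels of the same instance (visible or occluded) receive nearby embeddings makes such a simple grouping step sufficient to recover full amodal masks.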
Advisors: Eric T. Psota and Lance C. Pérez
A Survey of Visual Transformers
Transformer, an attention-based encoder-decoder model, has already revolutionized the field of natural language processing (NLP). Inspired by these significant achievements, pioneering works have recently applied Transformer-like architectures to the computer vision (CV) field, demonstrating their effectiveness on three fundamental CV tasks (classification, detection, and segmentation) as well as multiple sensory data streams (images, point clouds, and vision-language data). Owing to their competitive modeling capabilities, visual Transformers have achieved impressive performance improvements on multiple benchmarks compared with modern Convolutional Neural Networks (CNNs). In this survey, we comprehensively review over one hundred different visual Transformers according to three fundamental CV tasks and different data stream types, proposing a taxonomy that organizes the representative methods by their motivations, structures, and application scenarios. Because of their differences in training settings and dedicated vision tasks, we also evaluate and compare these existing visual Transformers under different configurations. Furthermore, we reveal a series of essential but unexploited aspects that may empower visual Transformers to stand out from numerous architectures, e.g., slack high-level semantic embeddings to bridge the gap between visual and sequential Transformers. Finally, three promising research directions are suggested for future investigation. We will continue to update the latest articles and their released source code at https://github.com/liuyang-ict/awesome-visual-transformers.
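The attention mechanism at the heart of all the surveyed architectures is the scaled dot-product attention softmax(QK^T / sqrt(d)) V; in a vision Transformer, the tokens are typically image-patch embeddings. A minimal NumPy sketch (not taken from the survey, just the standard formulation):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Standard Transformer attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # pairwise token similarities
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V                              # weighted mix of values

# 3 tokens (e.g. image patches in a ViT), embedding dimension 4.
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(x, x, x)  # self-attention: Q = K = V = x
print(out.shape)  # (3, 4)
```

Each output token is a convex combination of all input tokens, which is what gives Transformers the global receptive field that distinguishes them from the local convolutions of CNNs.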