84 research outputs found

    k-means Mask Transformer

    The rise of transformers in vision tasks not only advances network backbone designs, but also starts a brand-new page to achieve end-to-end image recognition (e.g., object detection and panoptic segmentation). Originated from Natural Language Processing (NLP), transformer architectures, consisting of self-attention and cross-attention, effectively learn long-range interactions between elements in a sequence. However, we observe that most existing transformer-based vision models simply borrow the idea from NLP, neglecting the crucial difference between languages and images, particularly the extremely large sequence length of spatially flattened pixel features. This subsequently impedes the learning in cross-attention between pixel features and object queries. In this paper, we rethink the relationship between pixels and object queries and propose to reformulate the cross-attention learning as a clustering process. Inspired by the traditional k-means clustering algorithm, we develop a k-means Mask Xformer (kMaX-DeepLab) for segmentation tasks, which not only improves the state-of-the-art, but also enjoys a simple and elegant design. As a result, our kMaX-DeepLab achieves a new state-of-the-art performance on COCO val set with 58.0% PQ, Cityscapes val set with 68.4% PQ, 44.0% AP, and 83.5% mIoU, and ADE20K val set with 50.9% PQ and 55.2% mIoU without test-time augmentation or external dataset. We hope our work can shed some light on designing transformers tailored for vision tasks. Code and models are available at https://github.com/google-research/deeplab2
    Comment: ECCV 2022. arXiv v2: add results on ADE20K. arXiv v3: fix appendix.
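The abstract above describes recasting cross-attention between object queries and pixel features as a clustering process, but it does not spell out the update rule. The following is a minimal sketch of one way to read that idea, not the authors' implementation: each pixel is hard-assigned to its most similar query (an argmax over queries, as in the k-means assignment step), and each query is then replaced by the mean of its assigned pixel features. The function names, the dot-product affinity, and the use of a plain mean are all assumptions made for illustration.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def kmeans_cross_attention_step(queries, pixels):
    """One clustering-style update of object queries (hypothetical helper).

    queries: list of N feature vectors, treated as cluster centers.
    pixels:  list of M flattened pixel feature vectors.
    Returns (new_queries, assignment), where assignment[i] is the index
    of the query that pixel i was assigned to.
    """
    # Assignment step: every pixel picks the query with the highest
    # affinity (argmax over queries, ties go to the lowest index).
    assignment = [
        max(range(len(queries)), key=lambda n: dot(queries[n], p))
        for p in pixels
    ]

    # Update step: each query becomes the mean of its assigned pixels.
    dim = len(queries[0])
    new_queries = []
    for n, q in enumerate(queries):
        members = [p for p, a in zip(pixels, assignment) if a == n]
        if not members:
            # Assumption: an empty cluster keeps its previous center.
            new_queries.append(list(q))
            continue
        new_queries.append(
            [sum(m[d] for m in members) / len(members) for d in range(dim)]
        )
    return new_queries, assignment


queries = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[2.0, 0.0], [4.0, 1.0], [0.0, 3.0]]
new_queries, assignment = kmeans_cross_attention_step(queries, pixels)
print(assignment)   # [0, 0, 1]
print(new_queries)  # [[3.0, 0.5], [0.0, 3.0]]
```

Because the assignment is taken per pixel over a small number of queries, its cost does not depend on normalising over the very long flattened pixel sequence that the abstract identifies as the difficulty for standard cross-attention.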

    Milestones in Autonomous Driving and Intelligent Vehicles Part II: Perception and Planning

    Growing interest in autonomous driving (AD) and intelligent vehicles (IVs) is fueled by their promise for enhanced safety, efficiency, and economic benefits. While previous surveys have captured progress in this field, a comprehensive and forward-looking summary is needed. Our work fills this gap through three distinct articles. The first part, a "Survey of Surveys" (SoS), outlines the history, surveys, ethics, and future directions of AD and IV technologies. The second part, "Milestones in Autonomous Driving and Intelligent Vehicles Part I: Control, Computing System Design, Communication, HD Map, Testing, and Human Behaviors", delves into the development of control, computing system, communication, HD map, testing, and human behaviors in IVs. This part, the third part, reviews perception and planning in the context of IVs. Aiming to provide a comprehensive overview of the latest advancements in AD and IVs, this work caters to both newcomers and seasoned researchers. By integrating the SoS and Part I, we offer unique insights and strive to serve as a bridge between past achievements and future possibilities in this dynamic field.
    Comment: 17 pages, 6 figures. IEEE Transactions on Systems, Man, and Cybernetics: System

    Classification and Segmentation of Galactic Structures in Large Multi-spectral Images

    Extensive and exhaustive cataloguing of astronomical objects is imperative for studies seeking to understand mechanisms which drive the universe. Such cataloguing tasks can be tedious, time consuming and demand a high level of domain specific knowledge. Past astronomical imaging surveys have been catalogued through mostly manual effort. Imminent imaging surveys, however, will produce a magnitude of data that cannot be feasibly processed through manual cataloguing. Furthermore, these surveys will capture objects fainter than the night sky, termed low surface brightness objects, and at unprecedented spatial resolution owing to advancements in astronomical imaging. In this thesis, we investigate the use of deep learning to automate cataloguing processes, such as detection, classification and segmentation of objects. A common theme throughout this work is the adaptation of machine learning methods to challenges specific to the domain of low surface brightness imaging.
    We begin with creating an annotated dataset of structures in low surface brightness images. To facilitate supervised learning in neural networks, a dataset comprised of input and corresponding ground truth target labels is required. An online tool is presented, allowing astronomers to classify and draw over objects in large multi-spectral images. A dataset produced using the tool is then detailed, containing 227 low surface brightness images from the MATLAS survey and labels made by four annotators. We then present a method for synthesising images of galactic cirrus which appear similar to MATLAS images, allowing pretraining of neural networks.
    A method for integrating sensitivity to orientation in convolutional neural networks is then presented. Objects in astronomical images can present in any given orientation, and thus the ability for neural networks to handle rotations is desirable. We modify convolutional filters with sets of Gabor filters with different orientations. These orientations are learned alongside network parameters during backpropagation, allowing exact optimal orientations to be captured. The method is validated extensively on multiple datasets and use cases.
    We propose an attention based neural network architecture to process global contaminants in large images. Performing analysis of low surface brightness images requires plenty of contextual information and local textual patterns. As a result, a network for processing low surface brightness images should ideally be able to accommodate large high resolution images without compromising on either local or global features. We utilise attention to capture long range dependencies, and propose an efficient attention operator which significantly reduces computational cost, allowing the input of large images. We also use Gabor filters to build an attention mechanism to better capture long range orientational patterns. These techniques are validated on the task of cirrus segmentation in MATLAS images, and cloud segmentation on the SWIMSEG database, where state of the art performance is achieved.
    Following, cirrus segmentation in MATLAS images is further investigated, and a comprehensive study is performed on the task. We discuss challenges associated with cirrus segmentation and low surface brightness images in general, and present several techniques to accommodate them. A novel loss function is proposed to facilitate training of the segmentation model on probabilistic targets. Results are presented on the annotated MATLAS images, with extensive ablation studies and a final benchmark to test the limits of the detailed segmentation pipeline.
    Finally, we develop a pipeline for multi-class segmentation of galactic structures and surrounding contaminants. Techniques of previous chapters are combined with a popular instance segmentation architecture to create a neural network capable of segmenting localised objects and extended amorphous regions. The process of data preparation for training instance segmentation models is thoroughly detailed. The method is tested on segmentation of five object classes in MATLAS images. We find that unifying the tasks of galactic structure segmentation and contaminant segmentation improves model performance in comparison to isolating each task.
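The thesis abstract above mentions modulating convolutional filters with Gabor filters whose orientations are learned. The abstract gives no formulas, so the following is only an illustrative sketch under stated assumptions: it uses the standard real-valued Gabor function (a Gaussian envelope times a cosine carrier in coordinates rotated by `theta`) and multiplies a filter elementwise by it. The function names and the default values of `sigma`, `lambd`, `gamma` and `psi` are hypothetical. Here `theta` is a plain float; in a trained network it would be a parameter updated by backpropagation.

```python
import math


def gabor_kernel(size, theta, sigma=2.0, lambd=4.0, gamma=0.5, psi=0.0):
    """Real Gabor kernel of odd side length `size`, oriented at `theta` radians.

    sigma: width of the Gaussian envelope, lambd: carrier wavelength,
    gamma: spatial aspect ratio, psi: phase offset (all illustrative defaults).
    """
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the pixel coordinates by theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / lambd + psi)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def modulate(weights, theta):
    """Elementwise product of a square filter with a Gabor kernel at `theta`."""
    gabor = gabor_kernel(len(weights), theta)
    return [
        [w * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, gabor)
    ]


# Modulate a 5x5 all-ones filter at two orientations.
ones = [[1.0] * 5 for _ in range(5)]
horizontal = modulate(ones, 0.0)
vertical = modulate(ones, math.pi / 2)
print(horizontal[2][2], vertical[2][2])  # both centre values are 1.0
```

With these defaults, rotating by a quarter turn yields the transpose of the unrotated kernel (up to floating-point error), which is one quick way to check that the orientation parameter behaves as intended.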

    UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase

    Point-, voxel-, and range-views are three representative forms of point clouds. All of them have accurate 3D measurements but lack color and texture information. RGB images are a natural complement to these point cloud views and fully utilizing the comprehensive information of them benefits more robust perceptions. In this paper, we present a unified multi-modal LiDAR segmentation network, termed UniSeg, which leverages the information of RGB images and three views of the point cloud, and accomplishes semantic segmentation and panoptic segmentation simultaneously. Specifically, we first design the Learnable cross-Modal Association (LMA) module to automatically fuse voxel-view and range-view features with image features, which fully utilize the rich semantic information of images and are robust to calibration errors. Then, the enhanced voxel-view and range-view features are transformed to the point space, where three views of point cloud features are further fused adaptively by the Learnable cross-View Association module (LVA). Notably, UniSeg achieves promising results in three public benchmarks, i.e., SemanticKITTI, nuScenes, and Waymo Open Dataset (WOD); it ranks 1st on two challenges of two benchmarks, including the LiDAR semantic segmentation challenge of nuScenes and panoptic segmentation challenges of SemanticKITTI. Besides, we construct the OpenPCSeg codebase, which is the largest and most comprehensive outdoor LiDAR segmentation codebase. It contains most of the popular outdoor LiDAR segmentation algorithms and provides reproducible implementations. The OpenPCSeg codebase will be made publicly available at https://github.com/PJLab-ADG/PCSeg.
    Comment: ICCV 2023; 21 pages; 9 figures; 18 tables