    Dynamic Unary Convolution in Transformers

    It is uncertain whether the power of transformer architectures can complement existing convolutional neural networks. A few recent attempts have combined convolution with transformer design through a range of structures in series; the main contribution of this paper is to explore a parallel design approach instead. While previous transformer-based approaches need to segment the image into patch-wise tokens, we observe that multi-head self-attention conducted on convolutional features is mainly sensitive to global correlations and that performance degrades when these correlations are not exhibited. We propose two parallel modules alongside multi-head self-attention to enhance the transformer. For local information, a dynamic local enhancement module leverages convolution to dynamically and explicitly enhance positive local patches and suppress the response to less informative ones. For mid-level structure, a novel unary co-occurrence excitation module utilizes convolution to actively search for local co-occurrence between patches. The parallel-designed Dynamic Unary Convolution in Transformer (DUCT) blocks are aggregated into a deep architecture, which is comprehensively evaluated across essential computer vision tasks in image-based classification, segmentation, retrieval and density estimation. Both qualitative and quantitative results show that our parallel convolutional-transformer approach with dynamic and unary convolution outperforms existing series-designed structures.

    Efficient Methods for the Design and Training of Neural Networks

    The field of artificial intelligence has seen significant advancements with the development of neural networks, which have numerous applications in computer vision, natural language processing, and speech processing. Despite these advancements, designing and training these networks still pose numerous challenges. This thesis addresses two critical aspects of neural network development, design and training, within the context of computer vision tasks, and focuses on three main challenges. The first challenge is finding an efficient way to perform architecture search in an extremely large or even unlimited search space. To address this challenge, the thesis proposes a Neural Search-space Evolution (NSE) scheme that enables efficient and effective architecture search in large-scale search spaces. The second challenge is to improve the efficiency of self-supervised learning for model pretraining. To address this challenge, the thesis proposes a combinatorial patches approach that significantly improves the efficiency of self-supervised learning. The third challenge is to develop an efficient and versatile multitask model that can leverage the benefits of large-scale multitask training. To address this challenge, the thesis proposes a Unified model for Human-Centric Perceptions (UniHCP) as a simple and scalable solution for a human-centric perception system that unifies multiple human-centric tasks into a neat, efficient, and scalable model. The results of this thesis demonstrate the effectiveness of the proposed methods in improving the practicality and performance of neural network design and training. The NSE scheme, combinatorial patches approach, and UniHCP have been tested on a broad range of datasets, tasks, and settings, yielding impressive results. These findings affirm the efficacy of the proposed methods in enhancing the efficiency of the design and training process of neural networks.

    Transformer Network for Multi-Person Tracking and Re-Identification in Unconstrained Environment

    Multi-object tracking (MOT) has profound applications in a variety of fields, including surveillance, sports analytics, self-driving, and cooperative robotics. Despite considerable advancements, existing MOT methodologies tend to falter when faced with non-uniform movements, occlusions, and appearance-reappearance scenarios of the objects. Recognizing this inadequacy, we put forward an integrated MOT method that not only combines object detection and identity linkage within a single, end-to-end trainable framework but also equips the model with the ability to maintain object identity links over long periods of time. Our proposed model, named STMMOT, is built around four key modules: 1) candidate proposal generation, which generates object proposals via a vision-transformer encoder-decoder architecture that detects the object from each frame in the video; 2) scale variant pyramid, a progressive pyramid structure to learn the self-scale and cross-scale similarities in multi-scale feature maps; 3) spatio-temporal memory encoder, extracting the essential information from the memory associated with each object under tracking; and 4) spatio-temporal memory decoder, simultaneously resolving the tasks of object detection and identity association for MOT. Our system leverages a robust spatio-temporal memory module that retains extensive historical observations and effectively encodes them using an attention-based aggregator. The uniqueness of STMMOT lies in representing objects as dynamic query embeddings that are updated continuously, which enables the prediction of object states with attention mechanisms and eliminates the need for post-processing.
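
    As a rough illustration of what an attention-based aggregator over a per-object memory could look like, the sketch below applies generic scaled dot-product attention from one query embedding to a list of stored observation vectors. It is not the STMMOT implementation: the function names, the toy 2-D vectors, and the `momentum` blending rule used to update the query are all assumptions made for this example.

```python
import math


def attention_aggregate(query, memory):
    """Generic scaled dot-product attention of one query over memory vectors.

    Returns the attention-weighted sum of the memory and the weights used.
    """
    d = len(query)
    # Similarity of the query to each stored observation, scaled by sqrt(d).
    scores = [sum(q * m for q, m in zip(query, mem)) / math.sqrt(d)
              for mem in memory]
    # Numerically stable softmax over the scores.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    aggregated = [sum(w * mem[i] for w, mem in zip(weights, memory))
                  for i in range(d)]
    return aggregated, weights


def update_query(query, memory, momentum=0.5):
    """Hypothetical update rule: blend the query with its memory summary."""
    aggregated, _ = attention_aggregate(query, memory)
    return [momentum * q + (1.0 - momentum) * a
            for q, a in zip(query, aggregated)]


if __name__ == "__main__":
    # Toy memory: three past observations of one tracked object.
    memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
    query = [1.0, 0.0]
    print(attention_aggregate(query, memory))
    print(update_query(query, memory))
```

    Observations that resemble the current query receive larger weights, so the summary is dominated by consistent history rather than by a single outlying frame.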

    Domain Adaptive Person Search via GAN-based Scene Synthesis for Cross-scene Videos

    Person search, which aims to find specific pedestrians in footage from real cameras, has recently been a challenging task in the computer vision domain. Nevertheless, most surveillance videos comprise only a handful of images of each pedestrian, which often feature identical backgrounds and clothing. Hence, it is difficult to learn more discriminative features for person search in real scenes. To tackle this challenge, we draw on Generative Adversarial Networks (GAN) to synthesize data from surveillance videos. GAN has thrived in computer vision problems because it produces high-quality images efficiently. We only slightly alter the popular Fast R-CNN model, which is capable of processing videos and yielding accurate detection outcomes. To relieve the pressure brought by the two-stage model, we design an Assisted-Identity Query Module (AIDQ) to provide positive images for the later stage. In addition, we propose a novel GAN-based Scene Synthesis model that can synthesize high-quality cross-id person images for person search tasks. To facilitate the feature learning of the GAN-based Scene Synthesis model, we adopt an online learning strategy that collaboratively learns from the synthesized images and original images. Extensive experiments on two widely used person search benchmarks, CUHK-SYSU and PRW, show that our method achieves strong performance, and the extensive ablation study further confirms that our GAN-synthesized data effectively increase the variability of the datasets and are more realistic.