21,508 research outputs found
Dynamic Unary Convolution in Transformers
It is uncertain whether the power of transformer architectures can complement existing convolutional neural networks. A few recent attempts have combined convolution with transformer design through a range of structures in series, where the main contribution of this paper is to explore a parallel design approach. While previous transformed-based approaches need to segment the image into patch-wise tokens, we observe that the multi-head self-attention conducted on convolutional features is mainly sensitive to global correlations and that the performance degrades when these correlations are not exhibited. We propose two parallel modules along with multi-head self-attention to enhance the transformer. For local information, a dynamic local enhancement module leverages convolution to dynamically and explicitly enhance positive local patches and suppress the response to less informative ones. For mid-level structure, a novel unary co-occurrence excitation module utilizes convolution to actively search the local co-occurrence between patches. The parallel-designed Dynamic Unary Convolution in Transformer (DUCT) blocks are aggregated into a deep architecture, which is comprehensively evaluated across essential computer vision tasks in image-based classification, segmentation, retrieval and density estimation. Both qualitative and quantitative results show our parallel convolutional-transformer approach with dynamic and unary convolution outperforms existing series-designed structures
Efficient Methods for the Design and Training of Neural Networks
The field of artificial intelligence has seen significant advancements with the development of neural networks, which have numerous applications in computer vision, natural language processing, and speech processing. Despite these advancements, designing and training these networks still pose numerous challenges. This thesis aims to address two critical aspects of neural network development, design and training, within the context of computer vision tasks.
The thesis focuses on three main challenges in the development of neural networks. The first challenge is finding an efficient way to perform architecture search in an extremely large or even unlimited search space. To address this challenge, the thesis proposes a Neural Search-space Evolution (NSE) scheme that enables efficient and effective architecture search in large-scale search spaces. The second challenge is to improve the efficiency of self-supervised learning for model pretraining. To address this challenge, the thesis proposes a combinatorial patches approach that significantly improves the efficiency of self-supervised learning. The third challenge is to develop an efficient and versatile multitask model that can leverage the benefits of large-scale multitask training. To address this challenge, the thesis proposes a Unified model for Human-Centric Perceptions (UniHCP) as a simple and scalable solution for a human-centric perception system that unifies multiple human-centric tasks into a neat, efficient, and scalable model.
The results of this thesis demonstrate the effectiveness of the proposed methods in improving the practicality and performance of neural network design and training. The NSE scheme, combinatorial patches approach, and UniHCP have been tested on a broad range of datasets, tasks, and settings, yielding impressive results. These findings affirm the efficacy of the proposed methods in enhancing the efficiency of the design and training process of neural networks
Transformer Network for Multi-Person Tracking and Re-Identification in Unconstrained Environment
Multi-object tracking (MOT) has profound applications in a variety of fields,
including surveillance, sports analytics, self-driving, and cooperative
robotics. Despite considerable advancements, existing MOT methodologies tend to
falter when faced with non-uniform movements, occlusions, and
appearance-reappearance scenarios of the objects. Recognizing this inadequacy,
we put forward an integrated MOT method that not only marries object detection
and identity linkage within a singular, end-to-end trainable framework but also
equips the model with the ability to maintain object identity links over long
periods of time. Our proposed model, named STMMOT, is built around four key
modules: 1) candidate proposal generation, which generates object proposals via
a vision-transformer encoder-decoder architecture that detects the object from
each frame in the video; 2) scale variant pyramid, a progressive pyramid
structure to learn the self-scale and cross-scale similarities in multi-scale
feature maps; 3) spatio-temporal memory encoder, extracting the essential
information from the memory associated with each object under tracking; and 4)
spatio-temporal memory decoder, simultaneously resolving the tasks of object
detection and identity association for MOT. Our system leverages a robust
spatio-temporal memory module that retains extensive historical observations
and effectively encodes them using an attention-based aggregator. The
uniqueness of STMMOT lies in representing objects as dynamic query embeddings
that are updated continuously, which enables the prediction of object states
with attention mechanisms and eradicates the need for post-processing
Domain Adaptive Person Search via GAN-based Scene Synthesis for Cross-scene Videos
Person search has recently been a challenging task in the computer vision
domain, which aims to search specific pedestrians from real
cameras.Nevertheless, most surveillance videos comprise only a handful of
images of each pedestrian, which often feature identical backgrounds and
clothing. Hence, it is difficult to learn more discriminative features for
person search in real scenes. To tackle this challenge, we draw on Generative
Adversarial Networks (GAN) to synthesize data from surveillance videos. GAN has
thrived in computer vision problems because it produces high-quality images
efficiently. We merely alter the popular Fast R-CNN model, which is capable of
processing videos and yielding accurate detection outcomes. In order to
appropriately relieve the pressure brought by the two-stage model, we design an
Assisted-Identity Query Module (AIDQ) to provide positive images for the behind
part. Besides, the proposed novel GAN-based Scene Synthesis model that can
synthesize high-quality cross-id person images for person search tasks. In
order to facilitate the feature learning of the GAN-based Scene Synthesis
model, we adopt an online learning strategy that collaboratively learns the
synthesized images and original images. Extensive experiments on two widely
used person search benchmarks, CUHK-SYSU and PRW, have shown that our method
has achieved great performance, and the extensive ablation study further
justifies our GAN-synthetic data can effectively increase the variability of
the datasets and be more realistic
- …