
    Improving Visual Embeddings using Attention and Geometry Constraints

    Learning a non-linear function that embeds raw data (i.e., images, videos, or language) into a discriminative feature embedding space is a fundamental problem in the learning community. In such embedding spaces, data with similar semantic meaning are clustered together, while data with dissimilar semantic meaning are separated. Many practical applications benefit from a good feature embedding, e.g., machine translation, classification/recognition, retrieval, any-shot learning, etc. In this Thesis, we aim to improve visual embeddings using attention and geometry constraints.

    In the first part of the Thesis, we develop two neural attention modules that automatically localize the informative regions within a feature map, thereby generating a discriminative feature representation for the image. An Attention in Attention (AiA) mechanism is first proposed to refine the feature map throughout the deep network by modeling the interaction of inner and outer attention modules. Intuitively, the AiA mechanism can be understood as one attention module nested inside another, with the inner one determining where the outer module should focus. Further, we employ explicit non-linear mappings in Reproducing Kernel Hilbert Spaces (RKHSs) to generate attention values, endowing the channel descriptor of the feature map with the representational power of second-order polynomial and Gaussian kernels. In addition, the Channel Recurrent Attention (CRA) module is proposed to build a global receptive field over the feature map. Existing attention mechanisms focus on either the channel pattern or the spatial pattern of the feature map and therefore cannot make full use of its information. The CRA module jointly learns the channel and spatial patterns of the feature map and produces an attention value for every element of the input feature map. This is achieved by feeding the spatial vectors to a recurrent neural network (RNN) sequentially, so that the RNN builds a global view of the feature map.

    In the second part, we investigate the benefit of geometry constraints for embedding learning. We first study the geometry of sets, using a set as the embedding of a video clip. Usually, a video embedding is optimized with a triplet loss in which the distance is computed between clip features, so the frame features cannot be optimized directly. To this end, we model a video clip as a set and employ a distance between sets in the triplet loss. Tailored to this set-aware triplet loss, a new set distance metric is also proposed to measure the hard frames in a triplet. Optimizing the set-aware triplet loss leads to a compact clip embedding and improves the discriminative power of the video representation. Beyond flat Euclidean embedding spaces, we further study curved spaces, i.e., hyperbolic spaces, as image embedding spaces. In contrast to a Euclidean embedding, a hyperbolic embedding can encode the hierarchical structure of the data, as the volume of hyperbolic space grows exponentially. However, performing the basic operations needed for comparison in hyperbolic spaces is complex and time-consuming; for example, the similarity measure is not well-defined there. To mitigate this issue, we introduce positive definite (pd) kernels for hyperbolic embeddings. Specifically, we propose four pd kernels in hyperbolic spaces, together with a theoretical analysis: the hyperbolic tangent kernel, the hyperbolic RBF kernel, the hyperbolic Laplace kernel, and the hyperbolic binomial kernel. We demonstrate the effectiveness of the proposed methods on image- and video-based person re-identification tasks, and we also evaluate the generalization of the hyperbolic kernels on few-shot learning, zero-shot learning, and knowledge distillation tasks.
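
    To make the hyperbolic-kernel idea concrete, the Python sketch below implements the Poincaré-ball geodesic distance together with a Laplace-type kernel exp(-λ·d_H(x, y)). This is only an illustration of the general construction: the thesis's precise kernel definitions, the hyperbolic model it adopts, and its positive-definiteness analysis may differ, and the function names are ours. Note that naively plugging a geodesic distance into a Gaussian kernel does not, in general, yield a positive definite kernel in curved spaces, which is what motivates a dedicated analysis.

```python
import numpy as np

def poincare_distance(x, y, eps=1e-9):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq = np.sum((x - y) ** 2)
    nx, ny = np.sum(x ** 2), np.sum(y ** 2)
    # Closed form: arcosh(1 + 2|x - y|^2 / ((1 - |x|^2)(1 - |y|^2)))
    arg = 1.0 + 2.0 * sq / max((1.0 - nx) * (1.0 - ny), eps)
    return np.arccosh(arg)

def hyperbolic_laplace_kernel(X, Y, lam=1.0):
    """Laplace-type kernel k(x, y) = exp(-lam * d_H(x, y)) on the ball."""
    K = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            K[i, j] = np.exp(-lam * poincare_distance(x, y))
    return K

# Toy usage: embeddings sampled strictly inside the unit ball.
rng = np.random.default_rng(0)
X = rng.uniform(-0.3, 0.3, size=(5, 8))
K = hyperbolic_laplace_kernel(X, X)
print(K.shape, np.allclose(K, K.T))  # (5, 5) True: a symmetric Gram matrix
```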

    Learning Deep Context-aware Features over Body and Latent Parts for Person Re-identification

    Person re-identification (ReID) is the task of identifying the same person across different cameras. It is challenging due to large variations in person pose, occlusion, background clutter, etc. How to extract powerful features is a fundamental problem in ReID and is still open today. In this paper, we design a Multi-Scale Context-Aware Network (MSCAN) to learn powerful features over the full body and body parts, which captures local context knowledge well by stacking multi-scale convolutions in each layer. Moreover, instead of using predefined rigid parts, we propose to learn and localize deformable pedestrian parts using Spatial Transformer Networks (STN) with novel spatial constraints. The learned body parts can alleviate some difficulties of part-based representation, e.g., pose variations and background clutter. Finally, we integrate the representation learning of the full body and body parts into a unified framework for person ReID through multi-class person identification tasks. Extensive evaluations on current challenging large-scale person ReID datasets, including the image-based Market1501 and CUHK03 and the sequence-based MARS datasets, show that the proposed method achieves state-of-the-art results. Comment: Accepted by CVPR 201
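
    The abstract's central architectural idea, stacking multi-scale convolutions within a layer, can be sketched in a few lines. The PyTorch snippet below is a minimal stand-in rather than the paper's actual MSCAN definition: it assumes parallel dilated 3x3 branches whose outputs are concatenated, so a single layer aggregates local context at several scales.

```python
import torch
import torch.nn as nn

class MultiScaleConvLayer(nn.Module):
    """Minimal sketch of one multi-scale convolution layer: parallel 3x3
    branches with different dilation rates see different context sizes,
    and their outputs are concatenated along the channel axis."""
    def __init__(self, in_ch, branch_ch, dilations=(1, 2, 3)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, kernel_size=3,
                          padding=d, dilation=d, bias=False),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        )

    def forward(self, x):
        # Each branch preserves the spatial size; concatenation fuses scales.
        return torch.cat([b(x) for b in self.branches], dim=1)

feat = torch.randn(2, 32, 64, 32)   # a batch of pedestrian feature maps
layer = MultiScaleConvLayer(32, 16)
print(layer(feat).shape)            # torch.Size([2, 48, 64, 32])
```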

    Divide and Fuse: A Re-ranking Approach for Person Re-identification

    As re-ranking is a necessary procedure for boosting person re-identification (re-ID) performance on large-scale datasets, feature diversity becomes crucial to person re-ID, both for designing pedestrian descriptors and for fusion-based re-ranking. However, in many circumstances, only one type of pedestrian feature is available. In this paper, we propose a "Divide and Fuse" re-ranking framework for person re-ID. It exploits the diversity across different parts of a high-dimensional feature vector for fusion-based re-ranking, even when no other features are accessible. Specifically, given an image, the extracted feature is divided into sub-features. Then the contextual information of each sub-feature is iteratively encoded into a new feature. Finally, the new features from the same image are fused into one vector for re-ranking. Experimental results on two person re-ID benchmarks demonstrate the effectiveness of the proposed framework. In particular, our method outperforms the state of the art on the Market-1501 dataset. Comment: Accepted by BMVC201
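
    A rough sketch of the pipeline the abstract describes: divide the feature vector into sub-features, re-encode each sub-feature with contextual information, and fuse the parts back into a single vector for re-ranking. The context encoder below (plain k-nearest-neighbor averaging) is our own stand-in, since the abstract does not spell out the paper's iterative encoding.

```python
import numpy as np

def divide_and_fuse(feats, n_parts=4, k=5):
    """Hedged sketch: split features into sub-features, encode neighborhood
    context per part with k-NN averaging (a stand-in for the paper's
    iterative encoding), then fuse the parts back into one vector."""
    n, d = feats.shape
    parts = np.split(feats, n_parts, axis=1)    # requires d % n_parts == 0
    encoded = []
    for p in parts:
        # Pairwise Euclidean distances within this sub-feature space.
        dist = np.linalg.norm(p[:, None, :] - p[None, :, :], axis=2)
        nbrs = np.argsort(dist, axis=1)[:, :k]  # k nearest per sample
        encoded.append(p[nbrs].mean(axis=1))    # average neighbor context
    fused = np.concatenate(encoded, axis=1)     # back to shape (n, d)
    return fused / np.linalg.norm(fused, axis=1, keepdims=True)

feats = np.random.default_rng(1).normal(size=(100, 128)).astype(np.float32)
print(divide_and_fuse(feats).shape)  # (100, 128), ready for re-ranking
```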

    Temporal Model Adaptation for Person Re-Identification

    Person re-identification is an open and challenging problem in computer vision. The majority of efforts have been spent either on designing the best feature representation or on learning the optimal matching metric, and most approaches have neglected the problem of adapting the selected features or the learned model over time. To address this problem, we propose a temporal model adaptation scheme with a human in the loop. We first introduce a similarity-dissimilarity learning method that can be trained incrementally by means of a stochastic alternating direction method of multipliers (ADMM) optimization procedure. Then, to achieve temporal adaptation with limited human effort, we exploit a graph-based approach to present the user with only the most informative probe-gallery matches that should be used to update the model. Results on three datasets show that our approach performs on par with or better than state-of-the-art approaches while reducing the manual pairwise labeling effort by about 80%.
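
    As a concrete illustration of the human-in-the-loop step, the sketch below picks the probe-gallery pairs the current model is least certain about, which would then be shown to the annotator. The margin-based uncertainty used here is a hypothetical stand-in for the paper's graph-based selection criterion, and the stochastic ADMM learner itself is not reproduced.

```python
import numpy as np

def most_informative_pairs(scores, budget=10):
    """Pick probe-gallery pairs whose similarity scores sit closest to the
    score midpoint, i.e. where the model is least certain. This margin
    heuristic is a stand-in for the paper's graph-based criterion."""
    n_probe, n_gallery = scores.shape
    mid = (scores.max() + scores.min()) / 2.0
    uncertainty = -np.abs(scores - mid)               # higher = less certain
    flat = np.argsort(uncertainty, axis=None)[::-1][:budget]
    return [divmod(int(i), n_gallery) for i in flat]  # (probe, gallery) pairs

rng = np.random.default_rng(2)
scores = rng.uniform(0.0, 1.0, size=(20, 50))         # toy similarity matrix
print(most_informative_pairs(scores, budget=5))       # pairs to label first
```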