    Learning a Complete Image Indexing Pipeline

    To work at scale, a complete image indexing system comprises two components: an inverted file index that restricts the actual search to only a subset of lists likely to contain most of the items relevant to the query, and an approximate distance computation mechanism to rapidly scan these lists. While supervised deep learning has recently enabled improvements to the latter, the former continues to be based on unsupervised clustering in the literature. In this work, we propose a first system that learns both components within a unifying neural framework of structured binary encoding.
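
    For context, the classical two-stage design the abstract refers to can be sketched in a few lines: an unsupervised coarse quantizer builds the inverted lists, and product-quantization codes support fast asymmetric distance computation (ADC) within the probed lists. The dimensions and the plain k-means training below are illustrative stand-ins, not the learned system proposed in the paper.

    ```python
    # Toy IVF-ADC search: coarse inverted file + product-quantization distance scan.
    # All shapes/parameters are illustrative; this mirrors the classical unsupervised
    # pipeline, not the learned system proposed in the paper.
    import numpy as np

    rng = np.random.default_rng(0)

    def kmeans(x, k, iters=20):
        """Plain Lloyd's k-means; returns centroids."""
        c = x[rng.choice(len(x), k, replace=False)]
        for _ in range(iters):
            assign = ((x[:, None, :] - c[None]) ** 2).sum(-1).argmin(1)
            for j in range(k):
                if (assign == j).any():
                    c[j] = x[assign == j].mean(0)
        return c

    d, n, n_lists, m, ks = 32, 2000, 16, 4, 16  # dim, db size, cells, PQ subspaces, codewords
    db = rng.standard_normal((n, d)).astype(np.float32)

    coarse = kmeans(db, n_lists)                           # 1) coarse quantizer -> inverted lists
    cell = ((db[:, None] - coarse[None]) ** 2).sum(-1).argmin(1)
    resid = db - coarse[cell]                              # PQ encodes the residuals
    sub = resid.reshape(n, m, d // m)
    codebooks = np.stack([kmeans(sub[:, j], ks) for j in range(m)])
    codes = np.stack([((sub[:, j, None] - codebooks[j][None]) ** 2).sum(-1).argmin(1)
                      for j in range(m)], axis=1)          # 2) compact PQ codes

    def search(q, nprobe=4, topk=5):
        order = ((coarse - q) ** 2).sum(-1).argsort()[:nprobe]  # probe nearest cells only
        best = []
        for c in order:
            ids = np.where(cell == c)[0]
            r = q - coarse[c]
            # ADC: precompute query-to-codeword distance tables per subspace
            tables = ((r.reshape(m, 1, d // m) - codebooks) ** 2).sum(-1)
            best += list(zip(tables[np.arange(m), codes[ids]].sum(1), ids))
        return sorted(best)[:topk]

    for dist, idx in search(rng.standard_normal(d).astype(np.float32)):
        print(f"id {idx}: dist {dist:.3f}")
    ```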

    Coding local and global binary visual features extracted from video sequences

    Binary local features represent an effective alternative to real-valued descriptors, leading to comparable results for many visual analysis tasks while being characterized by significantly lower computational complexity and memory requirements. When dealing with large collections, a more compact representation based on global features is often preferred; this can be obtained from local features by means of, e.g., the Bag-of-Visual-Words (BoVW) model. Several applications, including visual sensor networks and mobile augmented reality, require visual features to be transmitted over a bandwidth-limited network, thus calling for coding techniques that reduce the required bit budget while attaining a target level of efficiency. In this paper we investigate a coding scheme tailored to both local and global binary features, which aims at exploiting both spatial and temporal redundancy by means of intra- and inter-frame coding. In this respect, the proposed coding scheme can be conveniently adopted to support the Analyze-Then-Compress (ATC) paradigm: visual features are extracted from the acquired content, encoded at remote nodes, and finally transmitted to a central controller that performs visual analysis. This is in contrast with the traditional approach, in which visual content is acquired at a node, compressed, and then sent to a central unit for further processing, according to the Compress-Then-Analyze (CTA) paradigm. In this paper we experimentally compare ATC and CTA by means of rate-efficiency curves in the context of two different visual analysis tasks: homography estimation and content-based retrieval. Our results show that the novel ATC paradigm based on the proposed coding primitives can be competitive with CTA, especially in bandwidth-limited scenarios. (Submitted to IEEE Transactions on Image Processing.)
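
    As a rough illustration of the inter-frame coding idea, the sketch below predicts each binary descriptor from its nearest (Hamming) match in the previous frame, XORs to obtain a sparse residual, and estimates the coded rate from the residual's empirical bit entropy. The matching rule and the zeroth-order entropy rate estimate are simplifying assumptions, not the paper's actual coding modes.

    ```python
    # Toy inter-frame coding of binary descriptors: XOR each descriptor against its
    # nearest (Hamming) match in the previous frame and estimate the residual's rate.
    import numpy as np

    rng = np.random.default_rng(1)

    def hamming(a, b):
        """Pairwise Hamming distances between two sets of packed binary descriptors."""
        return np.unpackbits(a[:, None, :] ^ b[None, :, :], axis=-1).sum(-1)

    def rate_bits(bits):
        """Empirical zeroth-order entropy (bits per bit) times residual length."""
        p = bits.mean()
        if p in (0.0, 1.0):
            return 0.0
        return (-(p * np.log2(p) + (1 - p) * np.log2(1 - p))) * bits.size

    prev = rng.integers(0, 256, size=(100, 64), dtype=np.uint8)  # 512-bit descriptors
    # simulate the next frame: mostly the same descriptors with a few flipped bits
    noise = rng.random((100, 64, 8)) < 0.02
    curr = prev ^ np.packbits(noise, axis=-1).reshape(100, 64)

    match = hamming(curr, prev).argmin(1)          # inter-frame prediction: best match
    residual = np.unpackbits(curr ^ prev[match], axis=-1)

    raw = curr.size * 8                            # intra-only: send every bit as-is
    coded = rate_bits(residual.ravel())            # inter: sparse residual is cheap
    print(f"intra (raw): {raw} bits, inter (predicted): {coded:.0f} bits")
    ```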

    Composite Correlation Quantization for Efficient Multimodal Retrieval

    Efficient similarity retrieval from large-scale multimodal databases is pervasive in modern search engines and social networks. To support queries across content modalities, the system should enable cross-modal correlation and computation-efficient indexing. While hashing methods have shown great potential in achieving this goal, current attempts generally fail to learn isomorphic hash codes in a seamless scheme; that is, they embed multiple modalities into a continuous isomorphic space and separately threshold the embeddings into binary codes, which incurs a substantial loss of retrieval accuracy. In this paper, we approach seamless multimodal hashing by proposing a novel Composite Correlation Quantization (CCQ) model. Specifically, CCQ jointly finds correlation-maximal mappings that transform different modalities into an isomorphic latent space, and learns composite quantizers that convert the isomorphic latent features into compact binary codes. An optimization framework is devised to preserve both intra-modal similarity and inter-modal correlation by minimizing both reconstruction and quantization errors, and it can be trained from both paired and partially paired data in linear time. A comprehensive set of experiments clearly shows the superior effectiveness and efficiency of CCQ against state-of-the-art hashing methods for both unimodal and cross-modal retrieval.
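
    CCQ learns the mappings and the quantizers jointly; the sketch below performs the two steps sequentially (a CCA-style correlation-maximal projection, then a single k-means quantizer over the shared space) purely to make the pipeline concrete. The synthetic data, sizes, and sequential training are all illustrative assumptions, not the paper's optimization.

    ```python
    # Two-stage stand-in for the CCQ pipeline: (1) correlation-maximal projections of
    # paired modalities into a shared latent space, (2) quantization of that space
    # into compact codes. CCQ optimizes both jointly; here CCA + k-means are used
    # sequentially purely for illustration.
    import numpy as np

    rng = np.random.default_rng(2)
    n, dx, dy, latent, k = 1000, 50, 30, 8, 256

    shared = rng.standard_normal((n, latent))      # synthetic paired modalities
    X = shared @ rng.standard_normal((latent, dx)) + 0.1 * rng.standard_normal((n, dx))
    Y = shared @ rng.standard_normal((latent, dy)) + 0.1 * rng.standard_normal((n, dy))

    # 1) CCA-style mappings via SVD of the whitened cross-covariance
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    def whiten(Z):
        u, s, vt = np.linalg.svd(Z, full_matrices=False)
        return vt.T / s * np.sqrt(len(Z))
    Wx, Wy = whiten(Xc), whiten(Yc)
    u, s, vt = np.linalg.svd((Xc @ Wx).T @ (Yc @ Wy) / n)
    Zx = Xc @ (Wx @ u[:, :latent])                 # isomorphic latent embeddings
    Zy = Yc @ (Wy @ vt.T[:, :latent])

    # 2) One shared quantizer over the latent space (CCQ uses composite codebooks)
    def kmeans(z, k, iters=15):
        c = z[rng.choice(len(z), k, replace=False)]
        for _ in range(iters):
            a = ((z[:, None] - c[None]) ** 2).sum(-1).argmin(1)
            for j in range(k):
                if (a == j).any():
                    c[j] = z[a == j].mean(0)
        return c
    C = kmeans(np.vstack([Zx, Zy]), k)
    codes_x = ((Zx[:, None] - C[None]) ** 2).sum(-1).argmin(1)  # one byte per item
    codes_y = ((Zy[:, None] - C[None]) ** 2).sum(-1).argmin(1)
    print("cross-modal code agreement:", (codes_x == codes_y).mean())
    ```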

    Off the Beaten Path: Let's Replace Term-Based Retrieval with k-NN Search

    Retrieval pipelines commonly rely on a term-based search to obtain candidate records, which are subsequently re-ranked. Some candidates are missed by this approach, e.g., due to a vocabulary mismatch. We address this issue by replacing the term-based search with a generic k-NN retrieval algorithm, where the similarity function can take into account subtle term associations. While an exact brute-force k-NN search using this similarity function is slow, we demonstrate that an approximate algorithm can be nearly two orders of magnitude faster at the expense of only a small loss in accuracy. A retrieval pipeline using an approximate k-NN search can be more effective and efficient than the term-based pipeline. This opens up new possibilities for designing effective retrieval pipelines. Our software (including data-generating code) and derivative data based on the Stack Overflow collection are available online.
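
    The paper builds on a production-grade approximate k-NN index; the self-contained sketch below only illustrates the underlying trade-off, pruning the search to a few cells of a crude partition and reporting recall against an exact brute-force scan. The random-pivot partition and dense document vectors are toy stand-ins for a real index and similarity function.

    ```python
    # Exact brute-force k-NN vs. a simple cluster-pruned approximation, in the
    # spirit of swapping term-based candidate generation for k-NN search.
    import numpy as np

    rng = np.random.default_rng(3)
    n, d, k, n_cells, nprobe = 50_000, 64, 10, 128, 8
    docs = rng.standard_normal((n, d)).astype(np.float32)
    docs /= np.linalg.norm(docs, axis=1, keepdims=True)

    # crude partition: assign docs to random "pivot" documents (stand-in for k-means)
    pivots = docs[rng.choice(n, n_cells, replace=False)]
    cell = (docs @ pivots.T).argmax(1)

    q = docs[rng.integers(n)] + 0.05 * rng.standard_normal(d)  # query near a doc
    q /= np.linalg.norm(q)

    exact = (docs @ q).argsort()[-k:][::-1]        # brute force over all docs
    probe = (pivots @ q).argsort()[-nprobe:]       # scan only the closest cells
    ids = np.where(np.isin(cell, probe))[0]
    approx = ids[(docs[ids] @ q).argsort()[-k:][::-1]]

    recall = len(set(exact) & set(approx)) / k     # modest here: pivots are random
    print(f"recall@{k}: {recall:.2f}, scanned {len(ids) / n:.1%} of the collection")
    ```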

    A study into annotation ranking metrics in geo-tagged image corpora

    Community-contributed datasets are becoming increasingly common in automated image annotation systems. One important issue with community image data is that there is no guarantee that the associated metadata is relevant. A method is required that can accurately rank the semantic relevance of community annotations, enabling the extraction of relevant subsets from potentially noisy collections of annotations. Having relevant, homogeneous tags assigned to images should improve community image retrieval systems, such as Flickr, which are based on text retrieval methods. In the literature, the current state-of-the-art approach to ranking the semantic relevance of Flickr tags is based on the widely used tf-idf metric. In the case of datasets containing landmark images, however, this metric is ineffective, because common landmark tags occur with high frequency throughout the dataset, and it can be improved upon. In this paper, we present a landmark recognition framework that provides end-to-end automated recognition and annotation. In our study into automated annotation, we evaluate five alternative approaches to tf-idf for ranking tag relevance in community-contributed landmark image corpora. We carry out a thorough evaluation of each of these ranking metrics, and the results demonstrate that four of the proposed techniques outperform the commonly used tf-idf approach for this task.
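
    The tf-idf baseline the paper evaluates against can be stated compactly, and a toy example makes the reported weakness visible: ubiquitous landmark tags receive near-zero idf while idiosyncratic noise tags are rewarded. The tag lists below are invented for illustration.

    ```python
    # Baseline tf-idf ranking of community tags. With landmark corpora, tags like
    # "paris" appear in nearly every document, so idf drives their scores to zero
    # even when they are the most relevant annotation -- the weakness the paper's
    # alternative metrics address.
    import math
    from collections import Counter

    corpus = [  # toy tag lists for geo-tagged photos of the same landmark
        ["eiffel", "paris", "trip2009", "me", "sunset"],
        ["eiffel", "paris", "tower", "night"],
        ["paris", "eiffel", "holiday", "nikon"],
        ["louvre", "paris", "art"],
    ]

    df = Counter(t for tags in corpus for t in set(tags))  # document frequency
    N = len(corpus)

    def tfidf_rank(tags):
        tf = Counter(tags)
        scores = {t: (tf[t] / len(tags)) * math.log(N / df[t]) for t in tf}
        return sorted(scores, key=scores.get, reverse=True)

    print(tfidf_rank(corpus[0]))
    # "paris" scores 0 (idf = log(4/4)) while one-off noise tags like "trip2009"
    # rank highest: frequent landmark tags are suppressed, noise is promoted.
    ```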
