    Histogram of gradients of Time-Frequency Representations for Audio scene detection

    This paper addresses the problem of audio scene classification and contributes to the state of the art by proposing a novel feature. We build this feature by considering the histogram of gradients (HOG) of a time-frequency representation of an audio scene. In contrast to classical audio features like MFCC, we hypothesize that histograms of gradients are able to encode relevant information in a time-frequency representation: namely, the local direction of variation (in time and frequency) of the signal spectral power. In addition, in order to gain more invariance and robustness, the histograms of gradients are locally pooled. We have evaluated the relevance of the novel feature by comparing its performance with state-of-the-art competitors on several datasets, including a novel one that we provide as part of our contribution. This dataset, which we make publicly available, involves 19 classes and contains about 900 minutes of audio scene recordings. We thus believe that it may be the next standard dataset for evaluating audio scene classification algorithms. Our comparison results clearly show that our HOG-based features outperform their competitors.
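    As a rough illustration of the idea in this abstract, the sketch below computes orientation histograms of the gradients of a spectrogram-like array and then pools them. It is not the authors' implementation: the function name `hog_of_spectrogram`, the cell size, the number of orientation bins, the average pooling over time, and the L2 normalization are all assumptions made for this example.

```python
import numpy as np

def hog_of_spectrogram(spec, n_bins=8, cell=(8, 8)):
    """Illustrative HOG of a time-frequency array (rows: frequency, cols: time).

    All parameter choices here are assumptions, not values from the paper.
    """
    # Local variation of spectral power along frequency (axis 0) and time (axis 1).
    g_freq, g_time = np.gradient(spec.astype(float))
    magnitude = np.hypot(g_freq, g_time)
    # Unsigned gradient orientation mapped to [0, pi).
    orientation = np.mod(np.arctan2(g_freq, g_time), np.pi)
    # Clamp guards against a floating-point orientation equal to pi.
    bin_idx = np.minimum((orientation / np.pi * n_bins).astype(int), n_bins - 1)

    n_f, n_t = spec.shape[0] // cell[0], spec.shape[1] // cell[1]
    hist = np.zeros((n_f, n_t, n_bins))
    for i in range(n_f):
        for j in range(n_t):
            rows = slice(i * cell[0], (i + 1) * cell[0])
            cols = slice(j * cell[1], (j + 1) * cell[1])
            # Magnitude-weighted orientation histogram for this cell.
            hist[i, j] = np.bincount(
                bin_idx[rows, cols].ravel(),
                weights=magnitude[rows, cols].ravel(),
                minlength=n_bins,
            )

    # Pool the cell histograms over time (assumed: simple averaging).
    pooled = hist.mean(axis=1)
    feature = pooled.ravel()
    norm = np.linalg.norm(feature)
    return feature / norm if norm > 0 else feature
```

    With a 64 x 40 input and the default settings, this yields a 64-dimensional vector (8 frequency bands times 8 orientation bins) that could be fed to any standard classifier.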

    Accurate Text Localization in Natural Image with Cascaded Convolutional Text Network

    We introduce a new top-down pipeline for scene text detection. We propose a novel Cascaded Convolutional Text Network (CCTN) that joins two customized convolutional networks for coarse-to-fine text localization. The CCTN quickly detects rough text regions in a low-resolution image, and then accurately localizes text lines within each enlarged region. We cast previous character-based detection into direct text region estimation, avoiding multiple bottom-up post-processing steps. It exhibits surprising robustness and discriminative power by considering the whole text region as the detection object, which provides strong semantic information. We customize the convolutional network by developing rectangle convolutions and multiple in-network fusions. This enables it to handle multi-shape and multi-scale text efficiently. Furthermore, the CCTN is computationally efficient by sharing convolutional computations, and its high-level property allows it to be invariant to various languages and multiple orientations. It achieves 0.84 and 0.86 F-measures on the ICDAR 2011 and ICDAR 2013, delivering substantial improvements over state-of-the-art results [23, 1].

    DeepText: A Unified Framework for Text Proposal Generation and Text Detection in Natural Images

    In this paper, we develop a novel unified framework called DeepText for text region proposal generation and text detection in natural images via a fully convolutional neural network (CNN). First, we propose the inception region proposal network (Inception-RPN) and design a set of text characteristic prior bounding boxes to achieve high word recall with only hundred-level candidate proposals. Next, we present a powerful text detection network that embeds ambiguous text category (ATC) information and multilevel region-of-interest pooling (MLRP) for text and non-text classification and accurate localization. Finally, we apply an iterative bounding box voting scheme to pursue high recall in a complementary manner and introduce a filtering algorithm to retain the most suitable bounding box, while removing redundant inner and outer boxes for each text instance. Our approach achieves an F-measure of 0.83 and 0.85 on the ICDAR 2011 and 2013 robust text detection benchmarks, outperforming previous state-of-the-art results. Comment: 12 pages, 4 figures, 3 tables

    Mid-level Elements for Object Detection

    Building on the success of recent discriminative mid-level elements, we propose a surprisingly simple approach for object detection which performs comparably to the current state-of-the-art approaches on the PASCAL VOC comp-3 detection challenge (no external data). Through extensive experiments and ablation analysis, we show how our approach effectively improves upon the HOG-based pipelines by adding an intermediate mid-level representation for the task of object detection. This representation is easily interpretable and allows us to visualize what our object detector "sees". We also discuss the insights our approach shares with CNN-based methods, such as the observation that sharing representations between categories helps.

    Integrating Scene Text and Visual Appearance for Fine-Grained Image Classification

    Text in natural images contains rich semantics that are often highly relevant to the objects or scene. In this paper, we focus on the problem of fully exploiting scene text for visual understanding. The main idea is to combine word representations and deep visual features in a globally trainable deep convolutional neural network. First, the recognized words are obtained by a scene text reading system. Then, we combine the word embeddings of the recognized words and the deep visual features into a single representation, which is optimized by a convolutional neural network for fine-grained image classification. In our framework, an attention mechanism is adopted to reveal the relevance between each recognized word and the given image, which further enhances the recognition performance. We have performed experiments on two datasets, the Con-Text dataset and the Drink Bottle dataset, which were proposed for fine-grained classification of business places and drink bottles, respectively. The experimental results consistently demonstrate that the proposed method combining textual and visual cues significantly outperforms classification with only visual representations. Moreover, we have shown that the learned representation improves the retrieval performance on the drink bottle images by a large margin, making it potentially useful in product search.
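    The fusion step described in this abstract (weighting each recognized word by its relevance to the image, then combining text and visual features) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's architecture: the bilinear scoring matrix `W`, the softmax weighting, the concatenation, and the name `fuse_text_and_visual` are all hypothetical choices for this example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def fuse_text_and_visual(visual, word_embs, W):
    """Attention-weighted fusion of word embeddings with a visual feature.

    visual:    (d_vis,) visual feature of the image
    word_embs: (n_words, d_word) embeddings of the recognized words
    W:         (d_word, d_vis) assumed bilinear relevance matrix (hypothetical)
    """
    # Relevance score of each recognized word with respect to the image.
    scores = word_embs @ W @ visual
    weights = softmax(scores)
    # Attention-weighted sum of the word embeddings.
    text_feat = weights @ word_embs
    # Single joint representation: visual feature followed by text feature.
    return np.concatenate([visual, text_feat]), weights
```

    In a trainable model, `W` and the downstream classifier would be learned jointly; here they are plain arrays so that the data flow is easy to follow.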

    Cross-Modal Attentional Context Learning for RGB-D Object Detection

    Recognizing objects from simultaneously sensed photometric (RGB) and depth channels is a fundamental yet practical problem in many machine vision applications such as robot grasping and autonomous driving. In this paper, we address this problem by developing a Cross-Modal Attentional Context (CMAC) learning framework, which enables the full exploitation of the context information from both RGB and depth data. Compared to existing RGB-D object detection frameworks, our approach has several appealing properties. First, it consists of an attention-based global context model for exploiting adaptive contextual information and incorporating this information into a region-based CNN (e.g., Fast RCNN) framework to achieve improved object detection performance. Second, our CMAC framework further contains a fine-grained object part attention module to harness multiple discriminative object parts inside each possible object region for superior local feature representation. While greatly improving the accuracy of RGB-D object detection, the effective cross-modal information fusion as well as attentional context modeling in our proposed model provide an interpretable visualization scheme. Experimental results demonstrate that the proposed method significantly improves upon the state of the art on all public benchmarks. Comment: Accepted as a regular paper in IEEE Transactions on Image Processing

    Reading Scene Text with Attention Convolutional Sequence Modeling

    Reading text in the wild is a challenging task in the field of computer vision. Existing approaches mainly adopt Connectionist Temporal Classification (CTC) or attention models based on Recurrent Neural Networks (RNNs), which are computationally expensive and hard to train. In this paper, we present an end-to-end Attention Convolutional Network for scene text recognition. Firstly, instead of an RNN, we adopt stacked convolutional layers to effectively capture the contextual dependencies of the input sequence, which is characterized by lower computational complexity and easier parallel computation. Compared to the chain structure of recurrent networks, the Convolutional Neural Network (CNN) provides a natural way to capture long-term dependencies between elements, and is 9 times faster than Bidirectional Long Short-Term Memory (BLSTM). Furthermore, in order to enhance the representation of foreground text and suppress the background noise, we incorporate residual attention modules into a small densely connected network to improve the discriminability of CNN features. We validate the performance of our approach on the standard benchmarks, including the Street View Text, IIIT5K and ICDAR datasets. As a result, state-of-the-art or highly competitive performance and efficiency show the superiority of the proposed approach.

    Learning Contextual Dependencies with Convolutional Hierarchical Recurrent Neural Networks

    Existing deep convolutional neural networks (CNNs) have shown great success on image classification. CNNs mainly consist of convolutional and pooling layers, both of which are performed on local image areas without considering the dependencies among different image regions. However, such dependencies are very important for generating explicit image representations. In contrast, recurrent neural networks (RNNs) are well known for their ability to encode contextual information among sequential data, and they only require a limited number of network parameters. General RNNs can hardly be directly applied to non-sequential data. Thus, we propose the hierarchical RNNs (HRNNs). In HRNNs, each RNN layer focuses on modeling spatial dependencies among image regions from the same scale but different locations, while the cross-scale RNN connections target modeling scale dependencies among regions from the same location but different scales. Specifically, we propose two recurrent neural network models: 1) the hierarchical simple recurrent network (HSRN), which is fast and has low computational cost; and 2) the hierarchical long-short term memory recurrent network (HLSTM), which performs better than HSRN at the price of more computational cost. In this manuscript, we integrate CNNs with HRNNs, and develop end-to-end convolutional hierarchical recurrent neural networks (C-HRNNs). C-HRNNs not only make use of the representation power of CNNs, but also efficiently encode spatial and scale dependencies among different image regions. On four of the most challenging object/scene image classification benchmarks, our C-HRNNs achieve state-of-the-art results on Places 205, SUN 397, MIT indoor, and competitive results on ILSVRC 2012.

    Reading Scene Text in Deep Convolutional Sequences

    We develop a Deep-Text Recurrent Network (DTRN) that regards scene text reading as a sequence labelling problem. We leverage recent advances of deep convolutional neural networks to generate an ordered high-level sequence from a whole word image, avoiding the difficult character segmentation problem. Then a deep recurrent model, building on long short-term memory (LSTM), is developed to robustly recognize the generated CNN sequences, departing from most existing approaches, which recognise each character independently. Our model has a number of appealing properties in comparison to existing scene text recognition methods: (i) it can recognise highly ambiguous words by leveraging meaningful context information, allowing it to work reliably without either pre- or post-processing; (ii) the deep CNN feature is robust to various image distortions; (iii) it retains the explicit order information in the word image, which is essential to discriminate word strings; (iv) the model does not depend on a pre-defined dictionary, and it can process unknown words and arbitrary strings. Code for the DTRN will be available. Comment: To appear in the 13th AAAI Conference on Artificial Intelligence (AAAI-16), 2016

    Learning Context Graph for Person Search

    Person re-identification has achieved great progress with deep convolutional neural networks. However, most previous methods focus on learning individual appearance feature embeddings, and it is hard for the models to handle difficult situations with different illumination, large pose variance and occlusion. In this work, we take a step further and consider employing context information for person search. For a probe-gallery pair, we first propose a contextual instance expansion module, which employs a relative attention module to search and filter useful context information in the scene. We also build a graph learning framework to effectively employ context pairs to update target similarity. These two modules are built on top of a joint detection and instance feature learning framework, which improves the discriminativeness of the learned features. The proposed framework achieves state-of-the-art performance on two widely used person search datasets. Comment: To appear in CVPR 201