Histogram of gradients of Time-Frequency Representations for Audio scene detection
This paper addresses the problem of audio scene classification and
contributes to the state of the art by proposing a novel feature. We build this
feature by considering histograms of gradients (HOG) of the time-frequency
representation of an audio scene. Contrary to classical audio features like
MFCC, we hypothesize that histograms of gradients are able to encode
relevant information in a time-frequency representation: namely, the
local direction of variation (in time and frequency) of the signal's spectral
power. In addition, in order to gain more invariance and robustness, the
histograms of gradients are locally pooled. We have evaluated the relevance of
the novel feature by comparing its performance with state-of-the-art
competitors on several datasets, including a novel one that we provide as part
of our contribution. This dataset, which we make publicly available, involves
[number] classes and contains about [number] minutes of audio scene recordings.
We thus believe that it may become the next standard dataset for evaluating
audio scene classification algorithms. Our comparison results clearly show that
our HOG-based features outperform their competitors.
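A minimal sketch of the idea, assuming the open-source librosa and scikit-image libraries are available; the file name, mel resolution and HOG cell sizes below are illustrative choices, not the paper's exact settings.

```python
# Sketch: HOG descriptor of a time-frequency representation.
import librosa
import numpy as np
from skimage.feature import hog

# Load an audio scene and build a time-frequency representation.
y, sr = librosa.load("scene.wav", sr=22050)        # hypothetical file
S = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
S_db = librosa.power_to_db(S, ref=np.max)          # log spectral power

# Histogram of oriented gradients over the spectrogram "image":
# each cell summarizes local directions of spectral-power variation,
# and block normalization adds the invariance the abstract mentions.
feature = hog(
    S_db,
    orientations=9,
    pixels_per_cell=(8, 8),     # local pooling region in (freq, time) bins
    cells_per_block=(2, 2),
    feature_vector=True,
)
print(feature.shape)            # fixed-length descriptor for a classifier
```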
Accurate Text Localization in Natural Image with Cascaded Convolutional Text Network
We introduce a new top-down pipeline for scene text detection. We propose a
novel Cascaded Convolutional Text Network (CCTN) that joins two customized
convolutional networks for coarse-to-fine text localization. The CCTN quickly
detects rough text regions from a low-resolution image, and then accurately
localizes text lines within each enlarged region. We recast previous
character-based detection as direct text region estimation, avoiding multiple
bottom-up post-processing steps. The network exhibits surprising robustness and
discriminative power by treating the whole text region as the detection object,
which provides strong semantic information. We customize the convolutional
network by developing rectangle convolutions and multiple in-network fusions.
This enables it to handle multi-shape and multi-scale text efficiently.
Furthermore, the CCTN is computationally efficient because it shares
convolutional computations, and its high-level representation allows it to be
invariant to various languages and multiple orientations. It achieves 0.84 and
0.86 F-measures on the ICDAR 2011 and ICDAR 2013 benchmarks, delivering
substantial improvements over state-of-the-art results [23, 1].
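As a rough illustration of the coarse-to-fine pattern, the PyTorch sketch below runs a small network on a downsampled image to obtain a text-region heatmap, then refines a candidate region with a second network using an elongated kernel. All layer sizes, the 3x7 "rectangle" kernel and the 0.5 threshold are assumptions, not the CCTN's published architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseNet(nn.Module):
    """Predicts a low-resolution text-region heatmap for the full image."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        self.heatmap = nn.Conv2d(64, 1, 1)    # text-region confidence

    def forward(self, x):
        return torch.sigmoid(self.heatmap(self.features(x)))

class FineNet(nn.Module):
    """Refines text-line locations inside an enlarged candidate region."""
    def __init__(self):
        super().__init__()
        # A wide kernel (3x7) matches the elongated shape of text lines,
        # echoing the rectangle-convolution idea.
        self.conv = nn.Conv2d(3, 32, kernel_size=(3, 7), padding=(1, 3))
        self.line_map = nn.Conv2d(32, 1, 1)

    def forward(self, region):
        return torch.sigmoid(self.line_map(F.relu(self.conv(region))))

coarse, fine = CoarseNet(), FineNet()
image = torch.randn(1, 3, 512, 512)
small = F.interpolate(image, size=(128, 128))  # coarse pass on low-res input
regions = coarse(small) > 0.5                  # rough text-region mask
# In a full pipeline each connected region would be enlarged, cropped
# from the original image, and refined; here we just refine one crop.
lines = fine(image[:, :, 100:228, 50:306])     # (1, 1, 128, 256) line heatmap
```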
DeepText: A Unified Framework for Text Proposal Generation and Text Detection in Natural Images
In this paper, we develop a novel unified framework called DeepText for text
region proposal generation and text detection in natural images via a fully
convolutional neural network (CNN). First, we propose the inception region
proposal network (Inception-RPN) and design a set of text-characteristic prior
bounding boxes to achieve high word recall with only a few hundred candidate
proposals. Next, we present a powerful text detection network that embeds
ambiguous text category (ATC) information and multi-level region-of-interest
pooling (MLRP) for text and non-text classification and accurate localization.
Finally, we apply an iterative bounding box voting scheme to pursue high recall
in a complementary manner and introduce a filtering algorithm to retain the
most suitable bounding box, while removing redundant inner and outer boxes for
each text instance. Our approach achieves F-measures of 0.83 and 0.85 on the
ICDAR 2011 and 2013 robust text detection benchmarks, outperforming previous
state-of-the-art results.
Comment: 12 pages, 4 figures, 3 tables
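The idea of text-characteristic prior bounding boxes can be illustrated in a few lines of Python; the scales and width/height ratios below are hypothetical, chosen only to reflect that word regions tend to be wide and short.

```python
import itertools

def text_prior_boxes(cx, cy, scales=(16, 32, 64), ratios=(2, 3, 5, 7)):
    """Return (x1, y1, x2, y2) prior boxes centered at (cx, cy).

    ratios are width/height, so all boxes are elongated horizontally.
    """
    boxes = []
    for s, r in itertools.product(scales, ratios):
        w, h = s * r ** 0.5, s / r ** 0.5   # area ~ s^2, aspect ratio r
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

# One set of priors per sliding position of the region proposal network.
print(len(text_prior_boxes(64, 64)))        # 12 priors at this location
```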
Mid-level Elements for Object Detection
Building on the success of recent discriminative mid-level elements, we
propose a surprisingly simple approach to object detection that performs
comparably to current state-of-the-art approaches on the PASCAL VOC comp3
detection challenge (no external data). Through extensive experiments and
ablation analysis, we show how our approach effectively improves upon
HOG-based pipelines by adding an intermediate mid-level representation for the
task of object detection. This representation is easily interpretable and
allows us to visualize what our object detector "sees". We also discuss the
insights our approach shares with CNN-based methods, such as how sharing
representations between categories helps.
Integrating Scene Text and Visual Appearance for Fine-Grained Image Classification
Text in natural images contains rich semantics that are often highly relevant
to objects or the scene. In this paper, we focus on the problem of fully
exploiting scene text for visual understanding. The main idea is to combine
word representations and deep visual features in a globally trainable deep
convolutional neural network. First, the recognized words are obtained by a
scene text reading system. Then, we combine the word embeddings of the
recognized words and the deep visual features into a single representation,
which is optimized by a convolutional neural network for fine-grained image
classification. In our framework, an attention mechanism is adopted to reveal
the relevance between each recognized word and the given image, which further
enhances recognition performance. We have performed experiments on two
datasets, the Con-Text dataset and the Drink Bottle dataset, proposed for
fine-grained classification of business places and drink bottles, respectively.
The experimental results consistently demonstrate that the proposed method,
combining textual and visual cues, significantly outperforms classification
with only visual representations. Moreover, we have shown that the learned
representation improves retrieval performance on the drink bottle images by
a large margin, making it potentially useful in product search.
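A hedged sketch of the fusion step, assuming a simple dot-product attention between the image feature and each recognized word; the feature dimensions and the final concatenation are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def fuse(visual_feat, word_embs):
    """visual_feat: (D,); word_embs: (N, D) for N recognized words."""
    # Relevance of each word to the image, via scaled dot-product.
    scores = word_embs @ visual_feat / visual_feat.shape[0] ** 0.5
    alpha = F.softmax(scores, dim=0)             # attention weights
    text_feat = (alpha.unsqueeze(1) * word_embs).sum(dim=0)
    return torch.cat([visual_feat, text_feat])   # joint representation

fused = fuse(torch.randn(256), torch.randn(5, 256))
print(fused.shape)   # torch.Size([512]) -> fed to the classifier
```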
Cross-Modal Attentional Context Learning for RGB-D Object Detection
Recognizing objects from simultaneously sensed photometric (RGB) and depth
channels is a fundamental yet practical problem in many machine vision
applications, such as robot grasping and autonomous driving. In this paper, we
address this problem by developing a Cross-Modal Attentional Context (CMAC)
learning framework, which enables full exploitation of the context
information in both RGB and depth data. Compared to existing RGB-D object
detection frameworks, our approach has several appealing properties. First, it
consists of an attention-based global context model that exploits adaptive
contextual information and incorporates it into a region-based CNN framework
(e.g., Fast R-CNN) to achieve improved object detection performance. Second,
our CMAC framework contains a fine-grained object part attention module that
harnesses multiple discriminative object parts inside each candidate object
region for a superior local feature representation. In addition to greatly
improving the accuracy of RGB-D object detection, the effective cross-modal
information fusion and attentional context modeling in our proposed model
provide an interpretable visualization scheme. Experimental results
demonstrate that the proposed method significantly improves upon the
state of the art on all public benchmarks.
Comment: accepted as a regular paper to IEEE Transactions on Image Processing
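One way to picture the attention-based global context model is to attention-pool a fused RGB-D feature map into a single vector that augments each region feature. The PyTorch sketch below uses an additive combination and made-up dimensions; it is a simplification, not the exact CMAC design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalContext(nn.Module):
    """Pools an RGB-D feature map into one context vector by attention."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Conv2d(dim, 1, 1)   # per-location relevance

    def forward(self, fmap):                # fmap: (B, C, H, W)
        b, c, h, w = fmap.shape
        a = F.softmax(self.score(fmap).view(b, -1), dim=1)   # (B, H*W)
        return (fmap.view(b, c, -1) * a.unsqueeze(1)).sum(-1)  # (B, C)

ctx_mod = GlobalContext()
rgbd_fmap = torch.cat([torch.randn(1, 128, 32, 32),          # RGB stream
                       torch.randn(1, 128, 32, 32)], dim=1)  # depth stream
region_feat = torch.randn(1, 256)           # one RoI feature from the detector
enhanced = region_feat + ctx_mod(rgbd_fmap) # context-augmented region feature
```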
Reading Scene Text with Attention Convolutional Sequence Modeling
Reading text in the wild is a challenging task in the field of computer
vision. Existing approaches have mainly adopted Connectionist Temporal
Classification (CTC) or attention models based on recurrent neural networks
(RNNs), which are computationally expensive and hard to train. In this paper,
we present an end-to-end attention convolutional network for scene text
recognition. First, instead of an RNN, we adopt stacked convolutional layers
to effectively capture the contextual dependencies of the input sequence,
which brings lower computational complexity and easier parallel computation.
Compared to the chain structure of recurrent networks, the convolutional
neural network (CNN) provides a natural way to capture long-term dependencies
between elements, and is 9 times faster than a Bidirectional Long Short-Term
Memory (BLSTM) network. Furthermore, in order to enhance the representation
of foreground text and suppress background noise, we incorporate residual
attention modules into a small densely connected network to improve the
discriminability of the CNN features. We validate the performance of our
approach on standard benchmarks, including the Street View Text, IIIT5K and
ICDAR datasets. The resulting state-of-the-art or highly competitive
performance and efficiency demonstrate the superiority of the proposed approach.
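The RNN-free context modeling can be sketched as a stack of 1-D convolutions whose receptive field grows with depth, so every sequence position sees its neighbours without recurrence; the depth and width below are arbitrary, and the residual attention modules are omitted.

```python
import torch
import torch.nn as nn

class ConvSequenceEncoder(nn.Module):
    """Stacked convolutions whose receptive field grows with depth,
    capturing context along the text line without recurrence."""
    def __init__(self, dim=256, layers=4):
        super().__init__()
        self.blocks = nn.Sequential(*[
            nn.Sequential(nn.Conv1d(dim, dim, 3, padding=1), nn.ReLU())
            for _ in range(layers)
        ])

    def forward(self, x):        # x: (B, dim, T) CNN feature sequence
        return self.blocks(x)    # every conv runs in parallel over T

enc = ConvSequenceEncoder()
feats = torch.randn(2, 256, 40)  # 40 time steps from the CNN backbone
out = enc(feats)                 # contextualized sequence, same shape
print(out.shape)
```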
Learning Contextual Dependencies with Convolutional Hierarchical Recurrent Neural Networks
Existing deep convolutional neural networks (CNNs) have shown great
success in image classification. CNNs mainly consist of convolutional and
pooling layers, both of which operate on local image areas without
considering the dependencies among different image regions. However, such
dependencies are very important for generating explicit image representations.
In contrast, recurrent neural networks (RNNs) are well known for their ability
to encode contextual information in sequential data, and they require only
a limited number of network parameters. Because general RNNs can hardly be
applied directly to non-sequential data, we propose hierarchical RNNs
(HRNNs). In HRNNs, each RNN layer focuses on modeling spatial dependencies
among image regions at the same scale but different locations, while the
cross-scale RNN connections model scale dependencies among regions
at the same location but different scales. Specifically, we propose two
recurrent neural network models: 1) the hierarchical simple recurrent network
(HSRN), which is fast and has low computational cost; and 2) the hierarchical
long short-term memory recurrent network (HLSTM), which performs better than
HSRN at the price of higher computational cost.
In this manuscript, we integrate CNNs with HRNNs and develop end-to-end
convolutional hierarchical recurrent neural networks (C-HRNNs). C-HRNNs not
only make use of the representational power of CNNs, but also efficiently
encode spatial and scale dependencies among different image regions. On four
of the most challenging object/scene image classification benchmarks, our
C-HRNNs achieve state-of-the-art results on Places 205, SUN 397 and MIT
Indoor, and competitive results on ILSVRC 2012.
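A much-simplified sketch of the two dependency types HRNNs model: one RNN scans the regions of a single scale (spatial dependencies), and a second recurrent cell carries state from coarse to fine scales (scale dependencies). The GRU cells, raster ordering and mean pooling are simplifying assumptions; the actual HSRN/HLSTM formulations differ.

```python
import torch
import torch.nn as nn

class SpatialScaleRNN(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.spatial = nn.GRU(dim, dim, batch_first=True)  # within-scale
        self.scale = nn.GRUCell(dim, dim)                   # across-scale

    def forward(self, regions_per_scale):
        """regions_per_scale: list of (B, N_s, dim) tensors, coarse to fine."""
        h = None
        for regions in regions_per_scale:
            out, _ = self.spatial(regions)   # spatial RNN over N_s regions
            pooled = out.mean(dim=1)         # summary of this scale
            h = self.scale(pooled, h)        # cross-scale connection
        return h                             # scale-aware image code

model = SpatialScaleRNN()
scales = [torch.randn(2, n, 128) for n in (4, 16, 64)]
print(model(scales).shape)                   # torch.Size([2, 128])
```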
Reading Scene Text in Deep Convolutional Sequences
We develop a Deep-Text Recurrent Network (DTRN) that regards scene text
reading as a sequence labelling problem. We leverage recent advances in deep
convolutional neural networks to generate an ordered high-level sequence from a
whole word image, avoiding the difficult character segmentation problem. Then a
deep recurrent model, built on long short-term memory (LSTM), is developed
to robustly recognize the generated CNN sequences, departing from most existing
approaches, which recognize each character independently. Our model has a
number of appealing properties in comparison to existing scene text recognition
methods: (i) it can recognize highly ambiguous words by leveraging meaningful
context information, allowing it to work reliably without either pre- or
post-processing; (ii) the deep CNN features are robust to various image
distortions; (iii) it retains the explicit order information of the word image,
which is essential for discriminating word strings; (iv) the model does not
depend on a pre-defined dictionary, so it can process unknown words and
arbitrary strings. Code for the DTRN will be made available.
Comment: to appear in the 13th AAAI Conference on Artificial Intelligence (AAAI-16), 2016
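The CNN-sequence-to-LSTM pattern the DTRN builds on can be sketched as follows; the backbone layers, the height-collapsing pooling and the 37-class head for CTC-style training are illustrative assumptions, not the published DTRN configuration.

```python
import torch
import torch.nn as nn

class CnnLstmRecognizer(nn.Module):
    def __init__(self, num_classes=37):      # 26 letters + 10 digits + blank
        super().__init__()
        self.cnn = nn.Sequential(             # whole word image -> feature map
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height, keep width
        )
        self.lstm = nn.LSTM(128, 128, bidirectional=True, batch_first=True)
        self.head = nn.Linear(256, num_classes)

    def forward(self, img):                   # img: (B, 1, H, W)
        f = self.cnn(img).squeeze(2)          # (B, 128, T) ordered columns
        f = f.transpose(1, 2)                 # (B, T, 128) left-to-right
        out, _ = self.lstm(f)                 # context across the word
        return self.head(out)                 # per-step scores (for CTC)

logits = CnnLstmRecognizer()(torch.randn(2, 1, 32, 100))
print(logits.shape)                           # (2, 50, 37): 50 steps after pooling
```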
Learning Context Graph for Person Search
Person re-identification has achieved great progress with deep convolutional
neural networks. However, most previous methods focus on learning an individual
appearance feature embedding, and it is hard for such models to handle
difficult situations with varying illumination, large pose variance and
occlusion. In this work, we take a step further and consider employing context
information for person search. For a probe-gallery pair, we first propose a
contextual instance expansion module, which employs a relative attention module
to search for and filter useful context information in the scene. We also build
a graph learning framework to effectively employ context pairs to update the
target similarity. These two modules are built on top of a joint detection and
instance feature learning framework, which improves the discriminativeness of
the learned features. The proposed framework achieves state-of-the-art
performance on two widely used person search datasets.
Comment: to appear in CVPR 2019
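One way to picture the context-pair update of target similarity is a single hand-written refinement step, shown below; the cosine matching, max-pooling over context pairs and the 0.8/0.2 mixing weights are illustrative stand-ins for the paper's learned graph model.

```python
import torch
import torch.nn.functional as F

def refined_similarity(target_sim, probe_ctx, gallery_ctx):
    """target_sim: scalar appearance similarity of the probe-gallery pair.
    probe_ctx, gallery_ctx: (N, D), (M, D) features of co-occurring people."""
    # Match context instances across the two scenes.
    ctx_sim = F.cosine_similarity(
        probe_ctx.unsqueeze(1), gallery_ctx.unsqueeze(0), dim=-1)  # (N, M)
    # Keep only informative context pairs (relative-attention stand-in).
    support = ctx_sim.max(dim=1).values.clamp(min=0).mean()
    return 0.8 * target_sim + 0.2 * support   # context-updated similarity

s = refined_similarity(torch.tensor(0.6), torch.randn(3, 64), torch.randn(4, 64))
print(float(s))
```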