Search CORE

92,177 research outputs found

Hybrid multi-layer Deep CNN/Aggregator feature for image classification

Author: Chevallier Louis
Jurie Frederic
Kulkarni Praveen
Perez Patrick
Zepeda Joaquin
Publication venue
Publication date: 13/03/2015
Field of study

Deep Convolutional Neural Networks (DCNN) have established a remarkable performance benchmark in the field of image classification, displacing classical approaches based on hand-tailored aggregations of local descriptors. Yet DCNNs impose high computational burdens both at training and at testing time, and training them requires collecting and annotating large amounts of training data. Supervised adaptation methods have been proposed in the literature that partially re-learn a transferred DCNN structure from a new target dataset. Yet these require expensive bounding-box annotations and are still computationally expensive to learn. In this paper, we address these shortcomings of DCNN adaptation schemes by proposing a hybrid approach that combines conventional, unsupervised aggregators such as Bag-of-Words (BoW), with the DCNN pipeline by treating the output of intermediate layers as densely extracted local descriptors. We test a variant of our approach that uses only intermediate DCNN layers on the standard PASCAL VOC 2007 dataset and show performance significantly higher than the standard BoW model and comparable to Fisher vector aggregation but with a feature that is 150 times smaller. A second variant of our approach that includes the fully connected DCNN layers significantly outperforms Fisher vector schemes and performs comparably to DCNN approaches adapted to Pascal VOC 2007, yet at only a small fraction of the training and testing cost.Comment: Accepted in ICASSP 2015 conference, 5 pages including reference, 4 figures and 2 table

arXiv.org e-Print Archive

Crossref

Deep Multimodal Learning for Audio-Visual Speech Recognition

Author: Goel Vaibhava
Marcheret Etienne
Mroueh Youssef
Publication venue
Publication date: 22/01/2015
Field of study

In this paper, we present methods in deep multimodal learning for fusing speech and visual modalities for Audio-Visual Automatic Speech Recognition (AV-ASR). First, we study an approach where uni-modal deep networks are trained separately and their final hidden layers fused to obtain a joint feature space in which another deep network is built. While the audio network alone achieves a phone error rate (PER) of

41\%

under clean condition on the IBM large vocabulary audio-visual studio dataset, this fusion model achieves a PER of

35.83\%

demonstrating the tremendous value of the visual channel in phone classification even in audio with high signal to noise ratio. Second, we present a new deep network architecture that uses a bilinear softmax layer to account for class specific correlations between modalities. We show that combining the posteriors from the bilinear networks with those from the fused model mentioned above results in a further significant phone error rate reduction, yielding a final PER of

34.03\%

.Comment: ICASSP 201

arXiv.org e-Print Archive

Crossref