Low-rank and Sparse Soft Targets to Learn Better DNN Acoustic Models
Conventional deep neural networks (DNN) for speech acoustic modeling rely on
Gaussian mixture models (GMM) and hidden Markov models (HMM) to obtain binary
class labels as the targets for DNN training. Subword classes in speech
recognition systems correspond to context-dependent tied states, or senones. The
present work addresses some limitations of GMM-HMM senone alignments for DNN
training. We hypothesize that the senone probabilities obtained from a DNN
trained with binary labels can provide more accurate targets to learn better
acoustic models. However, DNN outputs bear inaccuracies which are exhibited as
high-dimensional unstructured noise, whereas the informative components are
structured and low-dimensional. We exploit principal component analysis (PCA)
and sparse coding to characterize the senone subspaces. Enhanced probabilities
obtained from low-rank and sparse reconstructions are then used as soft targets
for DNN acoustic modeling, which also enables training with untranscribed data.
Experiments conducted on the AMI corpus show a 4.6% relative reduction in word
error rate.
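The low-rank reconstruction step can be sketched with plain PCA over a matrix of DNN posteriors. This is a minimal illustration, assuming the posteriors arrive as a frames × senones NumPy array; the function name, rank choice, and re-normalization are assumptions for illustration, and the paper's sparse-coding branch is omitted:

```python
import numpy as np

def low_rank_soft_targets(posteriors, rank):
    """Project DNN senone posteriors onto their top principal
    components and reconstruct, discarding the high-dimensional
    unstructured noise while keeping the structured components.

    posteriors : (n_frames, n_senones) matrix of DNN outputs.
    rank       : number of principal components to keep.
    """
    mean = posteriors.mean(axis=0)
    centered = posteriors - mean
    # PCA via SVD of the centered posterior matrix
    U, s, Vt = np.linalg.svd(centered, full_matrices=False)
    low_rank = U[:, :rank] * s[:rank] @ Vt[:rank] + mean
    # Re-normalize so each frame is again a valid distribution
    low_rank = np.clip(low_rank, 1e-8, None)
    return low_rank / low_rank.sum(axis=1, keepdims=True)

# toy usage: 100 frames, 50 senone classes, keep 10 components
P = np.random.dirichlet(np.ones(50), size=100)
soft = low_rank_soft_targets(P, rank=10)
```

The reconstructed rows remain probability distributions, so they can be used directly as soft targets in a cross-entropy objective.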
Fingerprint Verification Using Spectral Minutiae Representations
Most fingerprint recognition systems are based on a minutiae set: an unordered collection of minutiae locations and orientations that suffers from various deformations such as translation, rotation, and scaling. The spectral minutiae representation introduced in this paper is a novel method to represent a minutiae set as a fixed-length feature vector which is invariant to translation, and in which rotation and scaling become translations, so that they can be easily compensated for. These characteristics enable the combination of fingerprint recognition systems with template protection schemes that require a fixed-length feature vector. The paper introduces two representation methods: the location-based spectral minutiae representation and the orientation-based spectral minutiae representation. Both are evaluated using two correlation-based spectral minutiae matching algorithms. We present the performance of our algorithms on three fingerprint databases, and we show how the performance can be improved by using a fusion scheme and singular points.
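The translation invariance of the location-based representation comes from taking the magnitude of the minutiae set's 2-D Fourier transform, sampled on a polar grid so that rotation and scaling become shifts. A minimal sketch, where the grid sizes, the Gaussian attenuation parameter, and the function name are illustrative rather than the paper's exact formulation:

```python
import numpy as np

def location_spectrum(minutiae, freqs, thetas, sigma=1.0):
    """Magnitude of the 2-D Fourier transform of a minutiae point
    set, sampled on a polar (frequency, angle) grid.

    minutiae : (n, 2) array of (x, y) minutiae locations.
    freqs    : radial frequencies to sample.
    thetas   : angles (radians) to sample.
    Taking the magnitude discards the phase, and with it any
    global translation of the minutiae set.
    """
    wx = np.outer(freqs, np.cos(thetas))   # (F, T) frequency grid
    wy = np.outer(freqs, np.sin(thetas))
    # Sum of complex exponentials, one per minutia
    phase = wx[..., None] * minutiae[:, 0] + wy[..., None] * minutiae[:, 1]
    spectrum = np.exp(-1j * phase).sum(axis=-1)
    # Gaussian envelope softly attenuates high-frequency noise
    return np.abs(spectrum) * np.exp(-0.5 * (sigma * freqs)[:, None] ** 2)

pts = np.random.rand(30, 2) * 256
F, T = np.linspace(0.02, 0.6, 64), np.linspace(0, np.pi, 128)
M = location_spectrum(pts, F, T)
# a global translation of all minutiae leaves the magnitude unchanged
M2 = location_spectrum(pts + np.array([15.0, -7.0]), F, T)
```

Because the sampled grid is fixed, the result is a fixed-length vector regardless of how many minutiae the fingerprint contains.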
REMAP: Multi-layer entropy-guided pooling of dense CNN features for image retrieval
This paper addresses the problem of very large-scale image retrieval,
focusing on improving its accuracy and robustness. We target enhanced
robustness of search to factors such as variations in illumination, object
appearance and scale, partial occlusions, and cluttered backgrounds -
particularly important when search is performed across very large datasets with
significant variability. We propose a novel CNN-based global descriptor, called
REMAP, which learns and aggregates a hierarchy of deep features from multiple
CNN layers, and is trained end-to-end with a triplet loss. REMAP explicitly
learns discriminative features which are mutually-supportive and complementary
at various semantic levels of visual abstraction. These dense local features
are max-pooled spatially at each layer, within multi-scale overlapping regions,
before aggregation into a single image-level descriptor. To identify the
semantically useful regions and layers for retrieval, we propose to measure the
information gain of each region and layer using KL-divergence. Our system
effectively learns during training how useful various regions and layers are
and weights them accordingly. We show that such relative entropy-guided
aggregation outperforms classical CNN-based aggregation controlled by SGD. The
entire framework is trained in an end-to-end fashion, outperforming the latest
state-of-the-art results. On image retrieval datasets Holidays, Oxford and
MPEG, the REMAP descriptor achieves mAP of 95.5%, 91.5%, and 80.1%
respectively, outperforming any results published to date. REMAP also formed
the core of the winning submission to the Google Landmark Retrieval Challenge
on Kaggle.
Comment: Submitted to IEEE Trans. Image Processing on 24 May 2018, published
22 May 201
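As a rough illustration of entropy-guided pooling, the sketch below max-pools one layer's activations over overlapping regions and weights each region by the KL divergence between its pooled-activation distribution and the image-level one, as a proxy for information gain. In REMAP these weights are learned end-to-end with SGD; here they are computed once, and all names and region choices are hypothetical:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete distributions."""
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def entropy_weighted_descriptor(feature_maps, regions):
    """Max-pool CNN features over overlapping regions, then weight
    each region by its KL divergence from the global pooled
    distribution before aggregating into one descriptor.

    feature_maps : (C, H, W) activations from one CNN layer.
    regions      : list of (y0, y1, x0, x1) overlapping windows.
    """
    global_pool = np.abs(feature_maps).max(axis=(1, 2))   # (C,)
    pooled, weights = [], []
    for (y0, y1, x0, x1) in regions:
        r = np.abs(feature_maps[:, y0:y1, x0:x1]).max(axis=(1, 2))
        pooled.append(r)
        weights.append(kl_divergence(r, global_pool))
    weights = np.asarray(weights)
    weights = weights / (weights.sum() + 1e-8)
    desc = (np.asarray(pooled) * weights[:, None]).sum(axis=0)
    return desc / (np.linalg.norm(desc) + 1e-8)   # L2-normalized

rng = np.random.default_rng(0)
fmap = np.abs(rng.standard_normal((8, 16, 16)))
regions = [(0, 8, 0, 8), (0, 8, 8, 16), (8, 16, 0, 8),
           (8, 16, 8, 16), (4, 12, 4, 12)]
desc = entropy_weighted_descriptor(fmap, regions)
```

The same aggregation would be applied per layer, with the per-layer descriptors concatenated into the final image-level representation.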
Characterizing Adversarial Examples Based on Spatial Consistency Information for Semantic Segmentation
Deep Neural Networks (DNNs) have been widely applied in various recognition
tasks. However, recently DNNs have been shown to be vulnerable against
adversarial examples, which can mislead DNNs to make arbitrary incorrect
predictions. While adversarial examples are well studied in classification
tasks, other learning problems may have different properties. For instance,
semantic segmentation requires additional components such as dilated
convolutions and multiscale processing. In this paper, we aim to characterize
adversarial examples based on spatial context information in semantic
segmentation. We observe that spatial consistency information can be
potentially leveraged to detect adversarial examples robustly even when a
strong adaptive attacker has access to the model and detection strategies. We
also show that adversarial examples based on attacks considered within the
paper barely transfer among models, even though transferability is common in
classification. Our observations shed new light on developing adversarial
attacks and defenses to better understand the vulnerabilities of DNNs.
Comment: Accepted to ECCV 201
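The spatial-consistency idea can be illustrated by segmenting pairs of overlapping patches independently and measuring label agreement on the overlap: benign images tend to yield consistent predictions, adversarial ones often do not. A hedged sketch, where `segment_fn`, the patch size, and the 50% overlap are assumptions rather than the paper's exact recipe:

```python
import numpy as np

def spatial_consistency(image, segment_fn, patch=128, n_pairs=8, rng=None):
    """Estimate the spatial consistency of a segmentation model.

    Samples pairs of 50%-overlapping patches, segments each patch
    independently, and averages the pixel-wise label agreement on
    the shared region.  A low score flags a suspicious input.

    segment_fn : callable mapping an (h, w, 3) patch to (h, w)
                 integer labels (stand-in for the real model).
    """
    rng = rng or np.random.default_rng(0)
    H, W = image.shape[:2]
    dy = dx = patch // 2                     # 50% overlap offset
    scores = []
    for _ in range(n_pairs):
        y = rng.integers(0, H - patch - dy)
        x = rng.integers(0, W - patch - dx)
        a = segment_fn(image[y:y + patch, x:x + patch])
        b = segment_fn(image[y + dy:y + dy + patch, x + dx:x + dx + patch])
        # compare predictions on the shared region of the two patches
        scores.append(float((a[dy:, dx:] == b[:patch - dy, :patch - dx]).mean()))
    return float(np.mean(scores))            # in [0, 1]

img = np.zeros((512, 512, 3))
score = spatial_consistency(img, lambda p: np.zeros(p.shape[:2], dtype=int))
```

An adaptive attacker would have to fool the model consistently across many random overlapping views at once, which is what makes this signal hard to evade.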
The disappearing immigrants: hunger strike as invisible struggle
The paper explores a hunger strike undertaken by 300 immigrants in Greece in early 2011.
Effectiveness of computer-based auditory training in improving the perception of noise-vocoded speech
Five experiments were designed to evaluate the effectiveness of “high-variability” lexical training in improving the ability of normal-hearing subjects to perceive noise-vocoded speech that had been spectrally shifted to simulate tonotopic misalignment. Two approaches to training were implemented: one required subjects to recognize isolated words, while the other required subjects to recognize words in sentences. Both approaches improved the ability to identify words in sentences. Improvements following a single session (lasting 1–2 h) of auditory training ranged between 7 and 12 percentage points and were significantly larger than improvements following a visual control task that was matched with the auditory training task in terms of the response demands. An additional three sessions of word- and sentence-based training led to further improvements, with the average overall improvement ranging from 13 to 18 percentage points. When a tonotopic misalignment of 3 mm rather than 6 mm was simulated, training with several talkers led to greater generalization to new talkers than training with a single talker. The results confirm that computer-based lexical training can help overcome the effects of spectral distortions in speech, and they suggest that training materials are most effective when several talkers are included.
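For readers unfamiliar with the stimuli, a noise vocoder replaces each frequency band of the speech signal with noise modulated by that band's amplitude envelope; shifting the synthesis bands relative to the analysis bands simulates tonotopic misalignment. A minimal SciPy sketch, where the band edges, filter order, and envelope method are illustrative and not the study's exact stimulus pipeline:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(signal, fs, analysis_bands, synthesis_bands):
    """Minimal noise vocoder: extract the amplitude envelope in
    each analysis band and use it to modulate noise filtered into
    the corresponding synthesis band.  Synthesis bands shifted
    upward relative to the analysis bands approximate a tonotopic
    misalignment.
    """
    rng = np.random.default_rng(0)
    out = np.zeros(len(signal))
    for (lo_a, hi_a), (lo_s, hi_s) in zip(analysis_bands, synthesis_bands):
        sos_a = butter(4, [lo_a, hi_a], btype='band', fs=fs, output='sos')
        band = sosfiltfilt(sos_a, signal)
        env = np.abs(hilbert(band))              # amplitude envelope
        sos_s = butter(4, [lo_s, hi_s], btype='band', fs=fs, output='sos')
        carrier = sosfiltfilt(sos_s, rng.standard_normal(len(signal)))
        out += env * carrier
    return out

fs = 16000
t = np.arange(fs) / fs
speech = np.sin(2 * np.pi * 220 * t)             # stand-in for speech
vocoded = noise_vocode(
    speech, fs,
    analysis_bands=[(100, 300), (300, 700), (700, 1500), (1500, 3000)],
    synthesis_bands=[(200, 500), (500, 1000), (1000, 2000), (2000, 4000)])
```

With matched analysis and synthesis bands this reduces to ordinary noise vocoding; the shifted bands above illustrate the misalignment manipulation only.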
3D Face Tracking and Texture Fusion in the Wild
We present a fully automatic approach to real-time 3D face reconstruction
from monocular in-the-wild videos. With the use of a cascaded-regressor based
face tracking and a 3D Morphable Face Model shape fitting, we obtain a
semi-dense 3D face shape. We further use the texture information from multiple
frames to build a holistic 3D face representation from the video frames. Our
system is able to capture facial expressions and does not require any
person-specific training. We demonstrate the robustness of our approach on the
challenging 300 Videos in the Wild (300-VW) dataset. Our real-time fitting
framework is available as an open source library at http://4dface.org
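At its core, fitting a 3D Morphable Face Model means finding shape coefficients that best explain tracked landmark positions under a linear PCA shape model. A simplified ridge-regularized sketch; it ignores the camera model and the cascaded regression of the actual system, and all names are hypothetical:

```python
import numpy as np

def fit_morphable_model(landmarks, mean_shape, basis, lam=1.0):
    """Least-squares fit of 3D Morphable Model shape coefficients
    to target landmark positions.

    landmarks  : (3n,) stacked target landmark coordinates.
    mean_shape : (3n,) model mean for the same landmarks.
    basis      : (3n, k) PCA shape basis.
    lam        : ridge regularizer keeping coefficients plausible.
    """
    A = basis.T @ basis + lam * np.eye(basis.shape[1])
    b = basis.T @ (landmarks - mean_shape)
    coeffs = np.linalg.solve(A, b)
    return mean_shape + basis @ coeffs, coeffs

# toy usage: recover known coefficients from noiseless landmarks
rng = np.random.default_rng(1)
basis = rng.standard_normal((30, 5))
mean_shape = rng.standard_normal(30)
true_coeffs = rng.standard_normal(5)
target = mean_shape + basis @ true_coeffs
fitted, coeffs = fit_morphable_model(target, mean_shape, basis, lam=1e-6)
```

In a real pipeline the landmarks come from the cascaded-regressor tracker and the fit alternates with camera-pose estimation; the linear solve above is only the shape step.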