1,051 research outputs found

    Low-rank and Sparse Soft Targets to Learn Better DNN Acoustic Models

    Full text link
    Conventional deep neural networks (DNN) for speech acoustic modeling rely on Gaussian mixture models (GMM) and hidden Markov model (HMM) to obtain binary class labels as the targets for DNN training. Subword classes in speech recognition systems correspond to context-dependent tied states or senones. The present work addresses some limitations of GMM-HMM senone alignments for DNN training. We hypothesize that the senone probabilities obtained from a DNN trained with binary labels can provide more accurate targets to learn better acoustic models. However, DNN outputs bear inaccuracies which are exhibited as high dimensional unstructured noise, whereas the informative components are structured and low-dimensional. We exploit principle component analysis (PCA) and sparse coding to characterize the senone subspaces. Enhanced probabilities obtained from low-rank and sparse reconstructions are used as soft-targets for DNN acoustic modeling, that also enables training with untranscribed data. Experiments conducted on AMI corpus shows 4.6% relative reduction in word error rate

    Fingerprint Verification Using Spectral Minutiae Representations

    Get PDF
    Most fingerprint recognition systems are based on the use of a minutiae set, which is an unordered collection of minutiae locations and orientations suffering from various deformations such as translation, rotation, and scaling. The spectral minutiae representation introduced in this paper is a novel method to represent a minutiae set as a fixed-length feature vector, which is invariant to translation, and in which rotation and scaling become translations, so that they can be easily compensated for. These characteristics enable the combination of fingerprint recognition systems with template protection schemes that require a fixed-length feature vector. This paper introduces the concept of algorithms for two representation methods: the location-based spectral minutiae representation and the orientation-based spectral minutiae representation. Both algorithms are evaluated using two correlation-based spectral minutiae matching algorithms. We present the performance of our algorithms on three fingerprint databases. We also show how the performance can be improved by using a fusion scheme and singular points

    REMAP: Multi-layer entropy-guided pooling of dense CNN features for image retrieval

    Full text link
    This paper addresses the problem of very large-scale image retrieval, focusing on improving its accuracy and robustness. We target enhanced robustness of search to factors such as variations in illumination, object appearance and scale, partial occlusions, and cluttered backgrounds - particularly important when search is performed across very large datasets with significant variability. We propose a novel CNN-based global descriptor, called REMAP, which learns and aggregates a hierarchy of deep features from multiple CNN layers, and is trained end-to-end with a triplet loss. REMAP explicitly learns discriminative features which are mutually-supportive and complementary at various semantic levels of visual abstraction. These dense local features are max-pooled spatially at each layer, within multi-scale overlapping regions, before aggregation into a single image-level descriptor. To identify the semantically useful regions and layers for retrieval, we propose to measure the information gain of each region and layer using KL-divergence. Our system effectively learns during training how useful various regions and layers are and weights them accordingly. We show that such relative entropy-guided aggregation outperforms classical CNN-based aggregation controlled by SGD. The entire framework is trained in an end-to-end fashion, outperforming the latest state-of-the-art results. On image retrieval datasets Holidays, Oxford and MPEG, the REMAP descriptor achieves mAP of 95.5%, 91.5%, and 80.1% respectively, outperforming any results published to date. REMAP also formed the core of the winning submission to the Google Landmark Retrieval Challenge on Kaggle.Comment: Submitted to IEEE Trans. Image Processing on 24 May 2018, published 22 May 201

    Characterizing Adversarial Examples Based on Spatial Consistency Information for Semantic Segmentation

    Full text link
    Deep Neural Networks (DNNs) have been widely applied in various recognition tasks. However, recently DNNs have been shown to be vulnerable against adversarial examples, which can mislead DNNs to make arbitrary incorrect predictions. While adversarial examples are well studied in classification tasks, other learning problems may have different properties. For instance, semantic segmentation requires additional components such as dilated convolutions and multiscale processing. In this paper, we aim to characterize adversarial examples based on spatial context information in semantic segmentation. We observe that spatial consistency information can be potentially leveraged to detect adversarial examples robustly even when a strong adaptive attacker has access to the model and detection strategies. We also show that adversarial examples based on attacks considered within the paper barely transfer among models, even though transferability is common in classification. Our observations shed new light on developing adversarial attacks and defenses to better understand the vulnerabilities of DNNs.Comment: Accepted to ECCV 201

    The disappearing immigrants: hunger strike as invisible struggle

    Get PDF
    The paper explores a hunger strike undertaken by 300 immigrants in Greece in early 2011

    Effectiveness of computer-based auditory training in improving the perception of noise-vocoded speech

    Get PDF
    Five experiments were designed to evaluate the effectiveness of “high-variability” lexical training in improving the ability of normal-hearing subjects to perceive noise-vocoded speech that had been spectrally shifted to simulate tonotopic misalignment. Two approaches to training were implemented. One training approach required subjects to recognize isolated words, while the other training approach required subjects to recognize words in sentences. Both approaches to training improved the ability to identify words in sentences. Improvements following a single session (lasting 1–2 h) of auditory training ranged between 7 and 12 %pts and were significantly larger than improvements following a visual control task that was matched with the auditory training task in terms of the response demands. An additional three sessions of word- and sentence-based training led to further improvements, with the average overall improvement ranging from 13 to 18 %pts. When a tonotopic misalignment of 3 mm rather than 6 mm was simulated, training with several talkers led to greater generalization to new talkers than training with a single talker. The results confirm that computer-based lexical training can help overcome the effects of spectral distortions in speech, and they suggest that training materials are most effective when several talkers are included

    3D Face Tracking and Texture Fusion in the Wild

    Full text link
    We present a fully automatic approach to real-time 3D face reconstruction from monocular in-the-wild videos. With the use of a cascaded-regressor based face tracking and a 3D Morphable Face Model shape fitting, we obtain a semi-dense 3D face shape. We further use the texture information from multiple frames to build a holistic 3D face representation from the video frames. Our system is able to capture facial expressions and does not require any person-specific training. We demonstrate the robustness of our approach on the challenging 300 Videos in the Wild (300-VW) dataset. Our real-time fitting framework is available as an open source library at http://4dface.org
    corecore