Realizing Petabyte Scale Acoustic Modeling
Large scale machine learning (ML) systems such as the Alexa automatic speech
recognition (ASR) system continue to improve with increasing amounts of
manually transcribed training data. Instead of scaling manual transcription to
impractical levels, we utilize semi-supervised learning (SSL) to learn acoustic
models (AM) from the vast firehose of untranscribed audio data. Learning an AM
from 1 Million hours of audio presents unique ML and system design challenges.
We present the design and evaluation of a highly scalable and resource
efficient SSL system for AM. Employing the student/teacher learning paradigm,
we focus on the student learning subsystem: a scalable and robust data pipeline
that generates features and targets from raw audio, and an efficient model
pipeline, including the distributed trainer, that builds a student model. Our
evaluations show that, even without extensive hyper-parameter tuning, we obtain
relative accuracy improvements in the 10% to 20% range, with higher gains in
noisier conditions. The end-to-end processing time of this SSL system was 12
days, and several components in this system can trivially scale linearly with
more compute resources.
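To make the student/teacher paradigm concrete, here is a minimal sketch (not the paper's pipeline; the model sizes, feature dimension, and the teacher/student modules are illustrative placeholders) of how a trained teacher's soft posteriors can replace manual transcriptions as training targets:

    # Minimal student/teacher SSL sketch; shapes and models are placeholders,
    # not the paper's acoustic-model architecture.
    import torch
    import torch.nn.functional as F

    teacher = torch.nn.Linear(40, 500)   # stands in for a trained acoustic model
    student = torch.nn.Linear(40, 500)   # model trained on untranscribed audio
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

    def train_step(features):
        """features: (batch, 40) acoustic feature vectors, no transcriptions."""
        with torch.no_grad():
            # The teacher produces soft targets in place of manual labels.
            targets = F.softmax(teacher(features), dim=-1)
        log_probs = F.log_softmax(student(features), dim=-1)
        # Cross-entropy against the teacher's soft distribution.
        loss = -(targets * log_probs).sum(dim=-1).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    loss = train_step(torch.randn(8, 40))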
The NTNU System at the Interspeech 2020 Non-Native Children's Speech ASR Challenge
This paper describes the NTNU ASR system participating in the Interspeech
2020 Non-Native Children's Speech ASR Challenge supported by the SIG-CHILD
group of ISCA. This ASR shared task is made especially challenging by the
combined diversity of non-native and child speech characteristics. In the
setting of closed-track evaluation, all participants were restricted to
developing their systems based solely on the speech and text corpora provided
by the organizer. To work around this low-resource constraint, we built our ASR
system on top of CNN-TDNNF-based acoustic models, while harnessing the
synergistic power of various data augmentation strategies, including both
utterance- and word-level speed perturbation and spectrogram augmentation,
alongside a simple yet effective data-cleansing approach. All variants of our
ASR system employed an RNN-based language model to rescore the first-pass
recognition hypotheses, which was trained solely on the text dataset released
by the organizer. Our system with the best configuration came in second place
with a word error rate (WER) of 17.59%, while the top-performing, second
runner-up, and official baseline systems achieved 15.67%, 18.71%, and 35.09%,
respectively.
Comment: Submitted to Interspeech 2020 Special Session: Shared Task on
Automatic Speech Recognition for Non-Native Children's Speech
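As a rough illustration of the spectrogram-augmentation component (mask widths and the log-mel shape are assumptions, not the challenge configuration), time and frequency masking can be applied like so:

    # Illustrative time/frequency masking of a log-mel spectrogram, of the
    # general kind combined here with speed perturbation.
    import numpy as np

    def spec_augment(spec, max_f=8, max_t=20, rng=np.random.default_rng()):
        """spec: (num_freq_bins, num_frames) log-mel spectrogram."""
        spec = spec.copy()
        n_freq, n_time = spec.shape
        # Frequency mask: zero out a random band of consecutive bins.
        f = int(rng.integers(0, max_f + 1))
        f0 = int(rng.integers(0, n_freq - f + 1))
        spec[f0:f0 + f, :] = 0.0
        # Time mask: zero out a random run of consecutive frames.
        t = int(rng.integers(0, max_t + 1))
        t0 = int(rng.integers(0, n_time - t + 1))
        spec[:, t0:t0 + t] = 0.0
        return spec

    augmented = spec_augment(np.random.randn(80, 300))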
State of the Art, Evaluation and Recommendations regarding "Document Processing and Visualization Techniques"
Several Networks of Excellence have been set up in the framework of the
European FP5 research program. Among these Networks of Excellence, the NEMIS
project focuses on the field of Text Mining.
Within this field, document processing and visualization was identified as
one of the key topics, and the WG1 working group was created within the NEMIS
project to carry out a detailed survey of techniques associated with the text
mining process and to identify relevant research topics in related areas.
In this document we present the results of this comprehensive survey. The
report includes a description of the current state-of-the-art and practice, a
roadmap for follow-up research in the identified areas, and recommendations for
anticipated technological development in the domain of text mining.
Comment: 54 pages, Report of Working Group 1 for the European Network of
Excellence (NoE) in Text Mining and its Applications in Statistics (NEMIS)
von Mises-Fisher Mixture Model-based Deep learning: Application to Face Verification
A number of pattern recognition tasks, e.g., face verification, can be
boiled down to classification or clustering of unit-length directional
feature vectors whose distance can be computed simply from their angle. In this
paper, we propose the von Mises-Fisher (vMF) mixture model as the theoretical
foundation for an effective deep-learning of such directional features and
derive a novel vMF Mixture Loss and its corresponding vMF deep features. The
proposed vMF feature learning achieves the characteristics of discriminative
learning, i.e., compacting the instances of the same class while
increasing the distance of instances from different classes. Moreover, it
subsumes a number of popular loss functions as well as an effective method in
deep learning, namely normalization. We conduct extensive experiments on face
verification using four challenging face datasets, i.e., LFW, YouTube
Faces, CACD, and IJB-A. Results show the effectiveness and excellent
generalization ability of the proposed approach as it achieves state-of-the-art
results on the LFW, YouTube Faces, and CACD datasets and competitive results on
the IJB-A dataset.
Comment: Under review
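A minimal sketch of the core idea behind such a vMF-style loss, assuming a shared fixed concentration kappa (the paper derives the full loss from the vMF density and also learns the concentration; this only shows the scaled-cosine logit construction):

    # vMF-style loss sketch: L2-normalize features and class mean directions,
    # score by kappa * cosine similarity, train with cross-entropy.
    import torch
    import torch.nn.functional as F

    def vmf_mixture_loss(features, class_centers, labels, kappa=16.0):
        """features: (batch, dim); class_centers: (classes, dim); labels: (batch,)."""
        x = F.normalize(features, dim=-1)        # unit-length directional features
        mu = F.normalize(class_centers, dim=-1)  # unit-length mean directions
        logits = kappa * x @ mu.t()              # scaled cosine similarities
        return F.cross_entropy(logits, labels)

    loss = vmf_mixture_loss(torch.randn(8, 128), torch.randn(10, 128),
                            torch.randint(0, 10, (8,)))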
PhonSenticNet: A Cognitive Approach to Microtext Normalization for Concept-Level Sentiment Analysis
With the current upsurge in the usage of social media platforms, the trend of
using short text (microtext) in place of standard words has seen a significant
rise. The usage of microtext poses a considerable performance issue in
concept-level sentiment analysis, since models are trained on standard words.
This paper discusses the impact of coupling sub-symbolic (phonetics) with
symbolic (machine learning) Artificial Intelligence to transform the
out-of-vocabulary concepts into their standard in-vocabulary form. The phonetic
distance is calculated using the Sørensen similarity algorithm. The
phonetically similar in-vocabulary concepts thus obtained are then used to
compute the correct polarity value, which was previously being miscalculated
because of the presence of microtext. Our proposed framework increases the
accuracy of polarity detection by 6% compared to the earlier model. This
also validates that microtext normalization is a necessary prerequisite for
the sentiment analysis task.
Comment: This paper is submitted to INTERSPEECH 2019
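For illustration, the Sørensen (Dice) similarity computed over character bigrams looks like the following; in the paper it is applied to phonetic encodings of the microtext, and the phonetic encoder itself is omitted from this sketch:

    # Sørensen-Dice similarity over character bigrams: 2|A∩B| / (|A| + |B|).
    def bigrams(word):
        return {word[i:i + 2] for i in range(len(word) - 1)}

    def sorensen_similarity(a, b):
        """Similarity in [0, 1] between the bigram sets of two strings."""
        A, B = bigrams(a), bigrams(b)
        if not A and not B:
            return 1.0
        return 2 * len(A & B) / (len(A) + len(B))

    # In practice the two arguments would be phonetic codes, not raw spellings.
    print(sorensen_similarity("tumoro", "tomorrow"))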
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
We show that an end-to-end deep learning approach can be used to recognize
either English or Mandarin Chinese speech, two vastly different languages.
Because it replaces entire pipelines of hand-engineered components with neural
networks, end-to-end learning allows us to handle a wide variety of speech,
including noisy environments, accents, and different languages. Key to our
approach is our application of HPC techniques, resulting in a 7x speedup over
our previous system. Because of this efficiency, experiments that previously
took weeks now run in days. This enables us to iterate more quickly to identify
superior architectures and algorithms. As a result, in several cases, our
system is competitive with the transcription of human workers when benchmarked
on standard datasets. Finally, using a technique called Batch Dispatch with
GPUs in the data center, we show that our system can be inexpensively deployed
in an online setting, delivering low latency when serving users at scale.
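As a sketch of the decoding step such end-to-end CTC systems rely on, a greedy collapse of per-frame character posteriors can be written as follows (the paper itself uses beam search with a language model; the alphabet and random posteriors here are placeholders):

    # Greedy CTC decoding: take the argmax per frame, collapse repeats,
    # then drop blanks.
    import numpy as np

    BLANK = 0  # index of the CTC blank symbol

    def ctc_greedy_decode(log_probs, alphabet):
        """log_probs: (num_frames, num_symbols); alphabet[i] is symbol i."""
        best = log_probs.argmax(axis=-1)
        out, prev = [], BLANK
        for idx in best:
            if idx != prev and idx != BLANK:
                out.append(alphabet[idx])
            prev = idx
        return "".join(out)

    alphabet = ["<blank>"] + list("abcdefghijklmnopqrstuvwxyz ")
    frames = np.log(np.random.dirichlet(np.ones(len(alphabet)), size=50))
    print(ctc_greedy_decode(frames, alphabet))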
Phonocardiographic Sensing using Deep Learning for Abnormal Heartbeat Detection
Cardiac auscultation involves expert interpretation of abnormalities in heart
sounds using a stethoscope. Deep learning based cardiac auscultation is of
significant interest to the healthcare community, as it can help reduce the
burden of manual auscultation through automated detection of abnormal
heartbeats.
However, the problem of automatic cardiac auscultation is complicated due to
the requirement of reliability and high accuracy, and due to the presence of
background noise in the heartbeat sound. In this work, we propose a Recurrent
Neural Networks (RNNs) based automated cardiac auscultation solution. Our
choice of RNNs is motivated by the great success of deep learning in medical
applications and by the observation that RNNs represent the deep learning
configuration most suitable for dealing with sequential or temporal data even
in the presence of noise. We explore the use of various RNN models and
demonstrate that they deliver significantly improved abnormal heartbeat
classification scores. Our proposed approach using RNNs can potentially be
used for real-time abnormal heartbeat detection in the Internet of Medical
Things for remote monitoring applications.
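A minimal sketch of an RNN classifier of this general kind (the layer sizes, feature dimension, and MFCC-style front-end are assumptions, not the authors' configuration):

    # Toy recurrent classifier for heartbeat recordings: an LSTM over
    # per-frame features, classified from the final hidden state.
    import torch
    import torch.nn as nn

    class HeartbeatRNN(nn.Module):
        def __init__(self, n_features=13, hidden=64, n_classes=2):
            super().__init__()
            self.rnn = nn.LSTM(n_features, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_classes)  # normal vs. abnormal

        def forward(self, x):
            """x: (batch, frames, n_features) per-frame audio features."""
            _, (h_n, _) = self.rnn(x)
            return self.head(h_n[-1])

    model = HeartbeatRNN()
    logits = model(torch.randn(4, 200, 13))  # 4 recordings, 200 frames each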
Spoofing Detection Goes Noisy: An Analysis of Synthetic Speech Detection in the Presence of Additive Noise
Automatic speaker verification (ASV) technology is recently finding its way
to end-user applications for secure access to personal data, smart services or
physical facilities. Similar to other biometric technologies, speaker
verification is vulnerable to spoofing attacks where an attacker masquerades as
a particular target speaker via impersonation, replay, text-to-speech (TTS) or
voice conversion (VC) techniques to gain illegitimate access to the system. We
focus on TTS and VC that represent the most flexible, high-end spoofing
attacks. Most of the prior studies on synthesized or converted speech detection
report their findings using high-quality clean recordings. Meanwhile, the
performance of spoofing detectors in the presence of additive noise, an
important consideration in practical ASV implementations, remains largely
unknown. To this end, we analyze the suitability of state-of-the-art synthetic
speech detectors under additive noise with a special focus on front-end
features. Our comparison includes eight acoustic feature sets, five related to
spectral magnitude and three to spectral phase information. Our extensive
experiments on the ASVspoof 2015 corpus reveal several important findings.
Firstly,
all the countermeasures break down even at relatively high signal-to-noise
ratios (SNRs) and fail to generalize to noisy conditions. Secondly, speech
enhancement is not found to be helpful. Thirdly, the GMM back-end generally
outperforms the more involved i-vector back-end. Fourthly, among the compared
features, the Mel-frequency cepstral coefficients (MFCCs) and subband spectral
centroid magnitude coefficients (SCMCs) perform the best on average, though the
winning method depends on SNR and noise type. Finally, a study with two score
fusion strategies shows that combining different feature-based systems improves
recognition accuracy for known and unknown attacks in both clean and noisy
conditions.
Comment: 23 pages, 7 figures
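To illustrate the GMM back-end that fared well here, a toy sketch with synthetic features (the actual systems use MFCC/SCMC features from ASVspoof 2015; the llr_score helper is a hypothetical name):

    # GMM back-end sketch: fit one GMM to natural speech features and one to
    # spoofed speech, score a trial by its average log-likelihood ratio.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    natural = rng.normal(0.0, 1.0, size=(500, 20))   # stand-in feature frames
    spoofed = rng.normal(0.5, 1.2, size=(500, 20))

    gmm_nat = GaussianMixture(n_components=4, random_state=0).fit(natural)
    gmm_spf = GaussianMixture(n_components=4, random_state=0).fit(spoofed)

    def llr_score(frames):
        """Average per-frame log-likelihood ratio; > 0 favors natural speech."""
        return gmm_nat.score(frames) - gmm_spf.score(frames)

    print(llr_score(rng.normal(0.0, 1.0, size=(300, 20))))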
Polyphonic Sound Event Detection by using Capsule Neural Networks
Artificial sound event detection (SED) aims to mimic the human ability
to perceive and understand what is happening in the surroundings. Nowadays,
Deep Learning offers valuable techniques for this goal such as Convolutional
Neural Networks (CNNs). The Capsule Neural Network (CapsNet) architecture has
been recently introduced in the image processing field with the intent to
overcome some of the known limitations of CNNs, specifically their limited
robustness to affine transformations (i.e., perspective, size, orientation)
and their difficulty in detecting overlapped images. This motivated the authors
to employ CapsNets to deal with the polyphonic-SED task, in which multiple
sound events occur simultaneously. Specifically, we propose to exploit the
capsule units to represent a set of distinctive properties for each individual
sound event. Capsule units are connected through a so-called "dynamic routing"
that encourages learning part-whole relationships and improves the detection
performance in a polyphonic context. This paper reports extensive evaluations
carried out on three publicly available datasets, showing that the
CapsNet-based algorithm not only outperforms standard CNNs but also achieves
the best results compared to state-of-the-art algorithms.
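A numpy sketch of the dynamic-routing step between capsule layers, following the standard routing-by-agreement recipe (shapes and iteration count are illustrative; the paper applies this inside a CapsNet operating on spectrogram features):

    # Dynamic routing: lower-level capsule predictions are combined with
    # coupling coefficients that are iteratively refined by agreement.
    import numpy as np

    def squash(s, axis=-1, eps=1e-9):
        """Keep vector orientation, squash length into [0, 1)."""
        n2 = (s ** 2).sum(axis=axis, keepdims=True)
        return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + eps)

    def dynamic_routing(u_hat, num_iters=3):
        """u_hat: (num_in, num_out, dim) predictions from lower capsules."""
        num_in, num_out, _ = u_hat.shape
        b = np.zeros((num_in, num_out))                           # routing logits
        for _ in range(num_iters):
            c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # couplings
            s = (c[..., None] * u_hat).sum(axis=0)                # weighted sum
            v = squash(s)                                         # output capsules
            b += (u_hat * v[None]).sum(axis=-1)  # agreement updates the logits
        return v

    v = dynamic_routing(np.random.randn(32, 10, 8))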
Improving large vocabulary continuous speech recognition by combining GMM-based and reservoir-based acoustic modeling
In earlier work we have shown that good phoneme recognition is possible with a so-called reservoir, a special type of recurrent neural network. In this paper, different architectures based on Reservoir Computing (RC) for large vocabulary continuous speech recognition are investigated. Besides experiments with HMM hybrids, it is shown that an RC-HMM tandem can achieve the same recognition accuracy as a classical HMM, which is a promising result for such a fairly new paradigm. It is also demonstrated that a state-level combination of the scores of the tandem and the baseline HMM leads to a significant improvement over the baseline. A relative word error rate reduction on the order of 20% is possible.
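As a toy illustration of the reservoir idea (an echo state network with a fixed random recurrent layer and a trained linear readout; all sizes, the leak rate, and the ridge penalty are assumptions, not the paper's LVCSR setup):

    # Echo state network sketch: only the linear readout is trained; the
    # recurrent weights are random, scaled to spectral radius < 1.
    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_res = 13, 200
    W_in = rng.uniform(-0.1, 0.1, (n_res, n_in))
    W = rng.normal(0, 1, (n_res, n_res))
    W *= 0.9 / np.abs(np.linalg.eigvals(W)).max()   # enforce echo-state property

    def reservoir_states(inputs, leak=0.3):
        """inputs: (frames, n_in) feature frames -> (frames, n_res) states."""
        x = np.zeros(n_res)
        states = []
        for u in inputs:
            x = (1 - leak) * x + leak * np.tanh(W_in @ u + W @ x)
            states.append(x)
        return np.array(states)

    # Readout fitted by ridge regression against (here random) phoneme targets.
    S = reservoir_states(rng.normal(size=(500, n_in)))
    Y = rng.normal(size=(500, 40))
    W_out = np.linalg.solve(S.T @ S + 1e-2 * np.eye(n_res), S.T @ Y)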