706 research outputs found
Automatic Quality Estimation for ASR System Combination
Recognizer Output Voting Error Reduction (ROVER) has been widely used for
system combination in automatic speech recognition (ASR). In order to select
the most appropriate words to insert at each position in the output
transcriptions, some ROVER extensions rely on critical information such as
confidence scores and other ASR decoder features. This information, which is
not always available, highly depends on the decoding process and sometimes
tends to overestimate the real quality of the recognized words. In this paper
we propose a novel variant of ROVER that takes advantage of ASR quality
estimation (QE) for ranking the transcriptions at "segment level" instead of:
i) relying on confidence scores, or ii) feeding ROVER with randomly ordered
hypotheses. We first introduce an effective set of features to compensate for
the absence of ASR decoder information. Then, we apply QE techniques to perform
accurate hypothesis ranking at segment-level before starting the fusion
process. The evaluation is carried out on two different tasks, in which we
respectively combine hypotheses coming from independent ASR systems and
multi-microphone recordings. In both tasks, it is assumed that the ASR decoder
information is not available. The proposed approach significantly outperforms
standard ROVER and it is competitive with two strong oracles that exploit
prior knowledge about the real quality of the hypotheses to be combined.
Compared to standard ROVER, the absolute WER improvements in the two
evaluation scenarios range from 0.5% to 7.3%.
Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
A Comprehensive Survey of Deep Learning in Remote Sensing: Theories, Tools and Challenges for the Community
In recent years, deep learning (DL), a re-branding of neural networks (NNs),
has risen to the top in numerous areas, namely computer vision (CV), speech
recognition, natural language processing, etc. Whereas remote sensing (RS)
possesses a number of unique challenges, primarily related to sensors and
applications, inevitably RS draws from many of the same theories as CV; e.g.,
statistics, fusion, and machine learning, to name a few. This means that the RS
community should be aware of, if not at the leading edge of, advancements
like DL. Herein, we provide the most comprehensive survey of state-of-the-art
RS DL research. We also review recent new developments in the DL field that can
be used in DL for RS. Namely, we focus on theories, tools and challenges for
the RS community. Specifically, we focus on unsolved challenges and
opportunities as they relate to (i) inadequate data sets, (ii)
human-understandable solutions for modelling physical phenomena, (iii) Big
Data, (iv) non-traditional heterogeneous data sources, (v) DL architectures and
learning algorithms for spectral, spatial and temporal data, (vi) transfer
learning, (vii) an improved theoretical understanding of DL systems, (viii)
high barriers to entry, and (ix) training and optimizing the DL.
Comment: 64 pages, 411 references. To appear in Journal of Applied Remote
Sensing
Deep Spoken Keyword Spotting: An Overview
Spoken keyword spotting (KWS) deals with the identification of keywords in
audio streams and has become a fast-growing technology thanks to the paradigm
shift introduced by deep learning a few years ago. This has allowed the rapid
embedding of deep KWS in a myriad of small electronic devices with different
purposes like the activation of voice assistants. Prospects suggest a sustained
growth in terms of social use of this technology. Thus, it is not surprising
that deep KWS has become a hot research topic among speech scientists, who
constantly look for KWS performance improvement and computational complexity
reduction. This context motivates this paper, in which we conduct a literature
review into deep spoken KWS to assist practitioners and researchers who are
interested in this technology. Specifically, this overview has a comprehensive
nature by covering a thorough analysis of deep KWS systems (which includes
speech features, acoustic modeling and posterior handling), robustness methods,
applications, datasets, evaluation metrics, performance of deep KWS systems and
audio-visual KWS. The analysis performed in this paper allows us to identify a
number of directions for future research, including directions adopted from
automatic speech recognition research and directions that are unique to the
problem of spoken KWS.
Adversarial Attacks and Defenses in Machine Learning-Powered Networks: A Contemporary Survey
Adversarial attacks and defenses in machine learning and deep neural networks
have been gaining significant attention due to the rapidly growing applications
of deep learning in the Internet and relevant scenarios. This survey provides a
comprehensive overview of the recent advancements in the field of adversarial
attack and defense techniques, with a focus on deep neural network-based
classification models. Specifically, we conduct a comprehensive classification
of recent adversarial attack methods and state-of-the-art adversarial defense
techniques based on attack principles, and present them in visually appealing
tables and tree diagrams. This is based on a rigorous evaluation of the
existing works, including an analysis of their strengths and limitations. We
also categorize the methods into counter-attack detection and robustness
enhancement, with a specific focus on regularization-based methods for
enhancing robustness. New avenues of attack are also explored, including
search-based, decision-based, drop-based, and physical-world attacks, and a
hierarchical classification of the latest defense methods is provided,
highlighting the challenges of balancing training costs with performance,
maintaining clean accuracy, overcoming the effect of gradient masking, and
ensuring method transferability. Finally, the lessons learned and open
challenges are summarized with future research opportunities recommended.
Comment: 46 pages, 21 figures
Noise Robust Keyword Spotting Using Deep Neural Networks For Embedded Platforms
The recent development of embedded platforms, along with the spectacular growth of communication networking technologies, is driving the Internet of Things to thrive. More complex tasks, such as speech recognition and keyword spotting, which are in great demand, can now run on small devices. Traditional voice recognition approaches are already used in several embedded applications; some are hybrid (cloud-based and embedded), while others are fully embedded. However, the environment surrounding embedded devices is usually noisy. Conventional approaches to adding noise robustness to speech recognition are effective but costly in terms of memory consumption and hardware complexity, which limits their use on embedded platforms. The purpose of this thesis is to increase the robustness of keyword spotting to more than one type of noise at once, without increasing the memory footprint or requiring a denoiser, while maintaining recognition accuracy at an acceptable level. In this work, robustness is treated at the phoneme classification level, as phoneme-based keyword spotting is the best technique for embedded keyword spotting.
Deep neural networks have been successfully deployed in many applications, including noise-robust speech recognition. In this work, we use multi-condition utterance training of a deep neural network model to increase the noise robustness of keyword spotting. This technique is also used to train a Gaussian mixture model. The two approaches are compared, and the deep learning approach not only outperforms the Gaussian approach but also outperforms the use of a denoiser system. This results in a smaller, more accurate, and more noise-robust model for phoneme recognition.