Search CORE

43,541 research outputs found

Robust ASR using Support Vector Machines

Author: A. Gallardo-Antolín
Allwein
Bengio
Bourlard
Burges
C. Peláez-Moreno
Clarkson
Crammer
D. Martín-Iglesias
F. Díaz-de-María
Fürnkranz
Ganapathiraju
Glass
Hsu
Jiang
Joachims
Navia-Vázquez
R. Solera-Ureña
Rabiner
Schölkopf
Shimodaira
Thubthong
Trentin
Vapnik
Vapnik
Vicente-Peña
Weiss
Wu
Publication venue: 'Elsevier BV'
Publication date: 01/01/2007
Field of study

The improved theoretical properties of Support Vector Machines with respect to other machine learning alternatives due to their max-margin training paradigm have led us to suggest them as a good technique for robust speech recognition. However, important shortcomings have had to be circumvented, the most important being the normalisation of the time duration of different realisations of the acoustic speech units. In this paper, we have compared two approaches in noisy environments: first, a hybrid HMM–SVM solution where a fixed number of frames is selected by means of an HMM segmentation and second, a normalisation kernel called Dynamic Time Alignment Kernel (DTAK) first introduced in Shimodaira et al. [Shimodaira, H., Noma, K., Nakai, M., Sagayama, S., 2001. Support vector machine with dynamic time-alignment kernel for speech recognition. In: Proc. Eurospeech, Aalborg, Denmark, pp. 1841–1844] and based on DTW (Dynamic Time Warping). Special attention has been paid to the adaptation of both alternatives to noisy environments, comparing two types of parameterisations and performing suitable feature normalisation operations. The results show that the DTA Kernel provides important advantages over the baseline HMM system in medium to bad noise conditions, also outperforming the results of the hybrid system.Publicad

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Universidad Carlos III de Madrid e-Archivo

Deep Learning for Audio Signal Processing

Author: Chang Shuo-yiin
Li Bo
Purwins Hendrik
Sainath Tara
Schlüter Jan
Virtanen Tuomas
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/05/2019
Field of study

Given the recent surge in developments of deep learning, this article provides a review of the state-of-the-art deep learning techniques for audio signal processing. Speech, music, and environmental sound processing are considered side-by-side, in order to point out similarities and differences between the domains, highlighting general methods, problems, key references, and potential for cross-fertilization between areas. The dominant feature representations (in particular, log-mel spectra and raw waveform) and deep learning models are reviewed, including convolutional neural networks, variants of the long short-term memory architecture, as well as more audio-specific neural network models. Subsequently, prominent deep learning application areas are covered, i.e. audio recognition (automatic speech recognition, music information retrieval, environmental sound detection, localization and tracking) and synthesis and transformation (source separation, audio enhancement, generative models for speech, sound, and music synthesis). Finally, key issues and future questions regarding deep learning applied to audio signal processing are identified.Comment: 15 pages, 2 pdf figure

arXiv.org e-Print Archive

VBN

SVMs for Automatic Speech Recognition: a Survey

Author: Díaz de María Fernando
Gallardo Antolín Ascensión
Martín Iglesias D.
Padrell Sendra J.
Peláez Moreno Carmen
Solera Ureña R.
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2007
Field of study

Hidden Markov Models (HMMs) are, undoubtedly, the most employed core technique for Automatic Speech Recognition (ASR). Nevertheless, we are still far from achieving high-performance ASR systems. Some alternative approaches, most of them based on Artificial Neural Networks (ANNs), were proposed during the late eighties and early nineties. Some of them tackled the ASR problem using predictive ANNs, while others proposed hybrid HMM/ANN systems. However, despite some achievements, nowadays, the preponderance of Markov Models is a fact. During the last decade, however, a new tool appeared in the field of machine learning that has proved to be able to cope with hard classification problems in several fields of application: the Support Vector Machines (SVMs). The SVMs are effective discriminative classifiers with several outstanding characteristics, namely: their solution is that with maximum margin; they are capable to deal with samples of a very higher dimensionality; and their convergence to the minimum of the associated cost function is guaranteed. These characteristics have made SVMs very popular and successful. In this chapter we discuss their strengths and weakness in the ASR context and make a review of the current state-of-the-art techniques. We organize the contributions in two parts: isolated-word recognition and continuous speech recognition. Within the first part we review several techniques to produce the fixed-dimension vectors needed for original SVMs. Afterwards we explore more sophisticated techniques based on the use of kernels capable to deal with sequences of different length. Among them is the DTAK kernel, simple and effective, which rescues an old technique of speech recognition: Dynamic Time Warping (DTW). Within the second part, we describe some recent approaches to tackle more complex tasks like connected digit recognition or continuous speech recognition using SVMs. Finally we draw some conclusions and outline several ongoing lines of research

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Universidad Carlos III de Madrid e-Archivo

Are words easier to learn from infant- than adult-directed speech? A quantitative corpus-based investigation

Author: Cristia Alejandrina
Dupoux Emmanuel
Guevara-Rukoz Adriana
Ludusan Bogdan
Martin Andrew
Mazuka Reiko
Thiollière Roland
Publication venue
Publication date: 23/12/2017
Field of study

We investigate whether infant-directed speech (IDS) could facilitate word form learning when compared to adult-directed speech (ADS). To study this, we examine the distribution of word forms at two levels, acoustic and phonological, using a large database of spontaneous speech in Japanese. At the acoustic level we show that, as has been documented before for phonemes, the realizations of words are more variable and less discriminable in IDS than in ADS. At the phonological level, we find an effect in the opposite direction: the IDS lexicon contains more distinctive words (such as onomatopoeias) than the ADS counterpart. Combining the acoustic and phonological metrics together in a global discriminability score reveals that the bigger separation of lexical categories in the phonological space does not compensate for the opposite effect observed at the acoustic level. As a result, IDS word forms are still globally less discriminable than ADS word forms, even though the effect is numerically small. We discuss the implication of these findings for the view that the functional role of IDS is to improve language learnability.Comment: Draf

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Study to determine potential flight applications and human factors design guidelines for voice recognition and synthesis systems

Author: Parks D. L.
White R. W.
Publication venue
Publication date
Field of study

A study was conducted to determine potential commercial aircraft flight deck applications and implementation guidelines for voice recognition and synthesis. At first, a survey of voice recognition and synthesis technology was undertaken to develop a working knowledge base. Then, numerous potential aircraft and simulator flight deck voice applications were identified and each proposed application was rated on a number of criteria in order to achieve an overall payoff rating. The potential voice recognition applications fell into five general categories: programming, interrogation, data entry, switch and mode selection, and continuous/time-critical action control. The ratings of the first three categories showed the most promise of being beneficial to flight deck operations. Possible applications of voice synthesis systems were categorized as automatic or pilot selectable and many were rated as being potentially beneficial. In addition, voice system implementation guidelines and pertinent performance criteria are proposed. Finally, the findings of this study are compared with those made in a recent NASA study of a 1995 transport concept

NASA Technical Reports Server

Production and perception of speaker-specific phonetic detail at word boundaries

Author: Allen
Allen
Baayen
Bradlow
Bybee
Charles-Luce
Cho
Church
Coleman
Cooper
Cruttenden
Dahan
Davis
Davis
Eisner
Fougeron
Fougeron
Goldinger
Goldinger
Goldinger
Goldinger
Gow
Grossberg
Gårding
Hawkins
Hawkins
Hawkins
Hay
Heinrich
Hervais-Adelman
Hoard
Holm
Johnson
Johnson
Jones
Jurafsky
Kemps
Kemps
Klatt
Krakow
Kraljic
Kucera
Lachs
Lehiste
Lehiste
Local
Luce
Markham
Mattys
McLennan
McQueen
Miller
Newman
Nielsen
Norris
Norris
Nygaard
Nygaard
Nygaard
O'Connor
Ogden
Ogden
Ohala
Palmeri
Pickett
Pierrehumbert
Pierrehumbert
Pisoni
Quené
Quené
Rachel Smith
Ranbom
Rice
Rietveld
Roy
Saffran
Saltzman
Salverda
Sarah Hawkins
Shockley
Sidaras
Simko
Sommers
Sprague
Stevens
Stuart-Smith
Sumner
Traunmüller
Turk
Turk
Umeda
van Santen
Walsh
Wyld
Publication venue: 'Elsevier BV'
Publication date: 01/03/2012
Field of study

Experiments show that learning about familiar voices affects speech processing in many tasks. However, most studies focus on isolated phonemes or words and do not explore which phonetic properties are learned about or retained in memory. This work investigated inter-speaker phonetic variation involving word boundaries, and its perceptual consequences. A production experiment found significant variation in the extent to which speakers used a number of acoustic properties to distinguish junctural minimal pairs e.g. 'So he diced them'—'So he'd iced them'. A perception experiment then tested intelligibility in noise of the junctural minimal pairs before and after familiarisation with a particular voice. Subjects who heard the same voice during testing as during the familiarisation period showed significantly more improvement in identification of words and syllable constituents around word boundaries than those who heard different voices. These data support the view that perceptual learning about the particular pronunciations associated with individual speakers helps listeners to identify syllabic structure and the location of word boundaries

Crossref

Enlighten