Search CORE

49,670 research outputs found

Bio-inspired Dynamic Formant Tracking for Phonetic Labelling

Author: Fernández-Baillo Gallego de la Sacristana Roberto
Gómez Vilda Pedro
Martínez Olalla Rafael
Mazaira Fernández Luis Miguel
Muñoz Cristina
Rodellar Biarge M. Victoria
Álvarez Marquina Agustin
Publication venue: Facultad de Informática (UPM)
Publication date: 01/01/2008
Field of study

It is a known fact that phonetic labeling may be relevant in helping current Automatic Speech Recognition (ASR) when combined with classical parsing systems as HMM's by reducing the search space. Through the present paper a method for Phonetic Broad-Class Labeling (PCL) based on speech perception in the high auditory centers is described. The methodology is based in the operation of CF (Characteristic Frequency) and FM (Frequency Modulation) neurons in the cochlear nucleus and cortical complex of the human auditory apparatus in the automatic detection of formants and formant dynamics on speech. Results obtained informant detection and dynamic formant tracking are given and the applicability of the method to Speech Processing is discussed

Archivo Digital UPM

Exploring efficient neural architectures for linguistic-acoustic mapping in text-to-speech

Author: Bonafonte Cávez Antonio
Pascual de la Puente Santiago
Serra Joan
Publication venue: 'MDPI AG'
Publication date: 01/01/2019
Field of study

Conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models such as recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure with intermediate affine transformations tends to make them slow to train and to sample from. In this work, we explore two different mechanisms that enhance the operational efficiency of recurrent neural networks, and study their performance–speed trade-off. The first mechanism is based on the quasi-recurrent neural network, where expensive affine transformations are removed from temporal connections and placed only on feed-forward computational directions. The second mechanism includes a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positioning codes. Our results show that the proposed decoder networks are competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU and GPU inference time. The best performing model is the one based on the quasi-recurrent mechanism, reaching the same level of naturalness as the recurrent neural network based model with a speedup of 11.2 on CPU and 3.3 on GPU.Peer ReviewedPostprint (published version

Multidisciplinary Digital Publishing Institute

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Exploiting correlogram structure for robust speech recognition with multiple speech sources

Author: Barker J.
Coy A.
Green P.
Ma N.
Publication venue: 'Elsevier BV'
Publication date: 01/01/2007
Field of study

This paper addresses the problem of separating and recognising speech in a monaural acoustic mixture with the presence of competing speech sources. The proposed system treats sound source separation and speech recognition as tightly coupled processes. In the first stage sound source separation is performed in the correlogram domain. For periodic sounds, the correlogram exhibits symmetric tree-like structures whose stems are located on the delay that corresponds to multiple pitch periods. These pitch-related structures are exploited in the study to group spectral components at each time frame. Local pitch estimates are then computed for each spectral group and are used to form simultaneous pitch tracks for temporal integration. These processes segregate a spectral representation of the acoustic mixture into several time-frequency regions such that the energy in each region is likely to have originated from a single periodic sound source. The identified time-frequency regions, together with the spectral representation, are employed by a `speech fragment decoder' which employs `missing data' techniques with clean speech models to simultaneously search for the acoustic evidence that best matches model sequences. The paper presents evaluations based on artificially mixed simultaneous speech utterances. A coherence-measuring experiment is first reported which quantifies the consistency of the identified fragments with a single source. The system is then evaluated in a speech recognition task and compared to a conventional fragment generation approach. Results show that the proposed system produces more coherent fragments over different conditions, which results in significantly better recognition accuracy

CiteSeerX

White Rose Research Online

A COMPARISON OF EXTENDED SOURCE-FILTER MODELS FOR MUSICAL SIGNAL RECONSTRUCTION

Author: Cheng T
Dixon S
Erlangen IAL
Mauch M
Publication venue
Publication date: 01/01/2014
Field of study

China Scholarship Council (CSC)/ Queen Mary Joint PhD scholarship; Royal Academy of Engineering Research Fellowshi

Queen Mary Research Online

Chorusing, synchrony, and the evolutionary functions of rhythm

Author: Bowling D.
Fitch W.
Ravignani A.
Publication venue: 'Frontiers Media SA'
Publication date: 01/01/2014
Field of study

A central goal of biomusicology is to understand the biological basis of human musicality. One approach to this problem has been to compare core components of human musicality (relative pitch perception, entrainment, etc.) with similar capacities in other animal species. Here we extend and clarify this comparative approach with respect to rhythm. First, whereas most comparisons between human music and animal acoustic behavior have focused on spectral properties (melody and harmony), we argue for the central importance of temporal properties, and propose that this domain is ripe for further comparative research. Second, whereas most rhythm research in non-human animals has examined animal timing in isolation, we consider how chorusing dynamics can shape individual timing, as in human music and dance, arguing that group behavior is key to understanding the adaptive functions of rhythm. To illustrate the interdependence between individual and chorusing dynamics, we present a computational model of chorusing agents relating individual call timing with synchronous group behavior. Third, we distinguish and clarify mechanistic and functional explanations of rhythmic phenomena, often conflated in the literature, arguing that this distinction is key for understanding the evolution of musicality. Fourth, we expand biomusicological discussions beyond the species typically considered, providing an overview of chorusing and rhythmic behavior across a broad range of taxa (orthopterans, fireflies, frogs, birds, and primates). Finally, we propose an “Evolving Signal Timing” hypothesis, suggesting that similarities between timing abilities in biological species will be based on comparable chorusing behaviors. We conclude that the comparative study of chorusing species can provide important insights into the adaptive function(s) of rhythmic behavior in our “proto-musical” primate ancestors, and thus inform our understanding of the biology and evolution of rhythm in human music and language

PubMed Central

MPG.PuRe