8,023 research outputs found
BaNa: a noise resilient fundamental frequency detection algorithm for speech and music
Fundamental frequency (F0) is one of the essential features in many acoustics-related applications. Although numerous F0 detection algorithms have been developed, detection accuracy in noisy environments still needs improvement. We present a hybrid noise-resilient F0 detection algorithm named BaNa that combines the approaches of harmonic ratios and cepstrum analysis. A Viterbi algorithm with a cost function is used to identify the F0 value among several F0 candidates. Speech and music databases with eight different types of additive noise are used to evaluate the performance of the BaNa algorithm and of several classic and state-of-the-art F0 detection algorithms. Results show that for almost all types of noise and signal-to-noise ratio (SNR) values investigated, BaNa achieves the lowest Gross Pitch Error (GPE) rate among all the algorithms. Moreover, for the 0 dB SNR scenario, the BaNa algorithm achieves a 20% to 35% GPE rate for speech and a 12% to 39% GPE rate for music. We also describe implementation issues that must be addressed to run the BaNa algorithm as a real-time application on a smartphone platform.
Peer reviewed. Postprint (author's final draft).
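Cepstrum analysis, one of the two ingredients BaNa combines, can be sketched in a few lines: the real cepstrum of a voiced frame shows a peak at the quefrency (lag) equal to the pitch period. The sketch below is a generic cepstral pitch estimator run on a synthetic harmonic tone, not the BaNa implementation itself; the 80-400 Hz search range, sample rate, and test tone are illustrative assumptions.

```python
import numpy as np

def cepstral_f0(x, fs, fmin=80.0, fmax=400.0):
    """Estimate F0 from the peak of the real cepstrum."""
    spectrum = np.abs(np.fft.rfft(x))
    cepstrum = np.fft.irfft(np.log(spectrum + 1e-12))
    # F0 in [fmin, fmax] corresponds to quefrency (lag) in [fs/fmax, fs/fmin]
    lo, hi = int(fs / fmax), int(fs / fmin)
    peak = lo + np.argmax(cepstrum[lo:hi])
    return fs / peak

fs = 8000
t = np.arange(0, 0.5, 1.0 / fs)
# Harmonic-rich 200 Hz test tone with mild additive noise
x = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 6))
x = x + 0.05 * np.random.default_rng(0).standard_normal(t.size)
print(round(cepstral_f0(x, fs), 1))  # close to 200.0
```

BaNa's contribution is combining candidates from such an estimator with harmonic-ratio candidates and smoothing the track with a Viterbi search, which is what buys the noise resilience.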
New HOS-based parameter estimation methods for speech recognition in noisy environments
The problem of speech recognition in noisy environments is addressed. Often, a recognition system must operate in a noisy environment with no possibility of training it on noisy samples. Classical speech analysis techniques are based on second-order statistics, and their performance decreases dramatically when noise is present in the signal under analysis. New methods based on higher-order statistics (HOS) are applied in a recognition system and compared against the autocorrelation method. Cumulant-based methods show better performance than autocorrelation-based methods at low SNR.
Peer reviewed. Postprint (published version).
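The advantage of cumulants over autocorrelation can be illustrated on a toy model: third-order cumulants of Gaussian noise vanish, so a cumulant-based estimate of an AR coefficient survives additive Gaussian noise, while the autocorrelation-based estimate is pulled toward zero. This is a minimal sketch of the general HOS idea, not the paper's specific estimators; the AR(1) model, coefficient 0.8, and exponential (skewed) innovations are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
a_true, n = 0.8, 500_000

# AR(1) process driven by skewed (non-Gaussian) innovations
e = rng.exponential(1.0, n) - 1.0
x = np.empty(n)
x[0] = e[0]
for i in range(1, n):
    x[i] = a_true * x[i - 1] + e[i]

# Observe the process in additive white *Gaussian* noise
y = x + rng.standard_normal(n)

# Second-order (autocorrelation) estimate: biased toward zero by the noise
a_acf = np.mean(y[1:] * y[:-1]) / np.mean(y * y)

# Third-order cumulant estimate: Gaussian noise has zero third-order
# cumulants, so C3(0,1)/C3(0,0) = a is unaffected by it
a_hos = np.mean(y[:-1] ** 2 * y[1:]) / np.mean(y ** 3)

print(round(a_acf, 3), round(a_hos, 3))  # a_acf biased low; a_hos near 0.8
```

The same mechanism is what makes cumulant-based spectral envelopes more robust than autocorrelation-based ones for recognition front-ends at low SNR.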
Automatic Estimation of Intelligibility Measure for Consonants in Speech
In this article, we provide a model to estimate a real-valued measure of the intelligibility of individual speech segments. We trained regression models based on Convolutional Neural Networks (CNN) for the stop consonants /p, t, k, b, d, g/ associated with the vowel /ɑ/, to estimate the corresponding Signal-to-Noise Ratio (SNR) at which the Consonant-Vowel (CV) sound becomes intelligible for Normal Hearing (NH) ears. The intelligibility measure for each sound is called SNR90, and is defined to be the SNR level at which human participants are able to recognize the consonant at least 90% correctly, on average, as determined in prior experiments with NH subjects. Performance of the CNN is compared to a baseline prediction based on automatic speech recognition (ASR), specifically, a constant offset subtracted from the SNR at which the ASR becomes capable of correctly labeling the consonant. Compared to the baseline, our models were able to accurately estimate the SNR90 intelligibility measure with less than 2 dB Mean Squared Error (MSE) on average, while the baseline ASR-defined measure computes SNR90 with a variance of 5.2 to 26.6 dB, depending on the consonant.
Comment: 5 pages, 1 figure, 7 tables, submitted to the Interspeech 2020 Conference
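The structural idea of CNN-based regression to a scalar SNR value can be sketched as a forward pass through a small convolutional stack ending in a single linear output. The block below uses a tiny 1-D convolutional regressor with random, untrained weights on a hypothetical per-frame energy contour; the input length, filter count, and weights are all illustrative assumptions, and the paper's actual models are trained on CV sound inputs.

```python
import numpy as np

rng = np.random.default_rng(1)

def conv1d(x, w):
    """'Valid' cross-correlation of each filter in w over x."""
    return np.array([np.convolve(x, f[::-1], mode="valid") for f in w])

def cnn_regress(x, w_conv, w_out):
    """Conv -> ReLU -> global average pool -> linear scalar output."""
    h = np.maximum(conv1d(x, w_conv), 0.0)   # (filters, time)
    pooled = h.mean(axis=1)                  # (filters,)
    return float(pooled @ w_out)             # predicted SNR in dB

# Hypothetical shapes: 100-frame contour, 8 filters of width 5
x = rng.standard_normal(100)
w_conv = rng.standard_normal((8, 5)) * 0.1
w_out = rng.standard_normal(8) * 0.1
print(cnn_regress(x, w_conv, w_out))
```

Training such a model against human-derived SNR90 labels, and comparing its MSE to the ASR-threshold baseline, is the experiment the abstract describes.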
Exploiting Nonlinear Recurrence and Fractal Scaling Properties for Voice Disorder Detection
Background: Voice disorders affect patients profoundly, and acoustic tools can potentially measure voice function objectively. Disordered sustained vowels exhibit wide-ranging phenomena, from nearly periodic to highly complex, aperiodic vibrations, and increased "breathiness". Modelling and surrogate data studies have shown significant nonlinear and non-Gaussian random properties in these sounds. Nonetheless, existing tools are limited to analysing voices displaying near periodicity, and do not account for this inherent biophysical nonlinearity and non-Gaussian randomness, often using linear signal processing methods insensitive to these properties. They do not directly measure the two main biophysical symptoms of disorder: complex nonlinear aperiodicity, and turbulent, aeroacoustic, non-Gaussian randomness. Often these tools cannot be applied to more severe disordered voices, limiting their clinical usefulness.

Methods: This paper introduces two new tools to speech analysis: recurrence and fractal scaling, which overcome the range limitations of existing tools by addressing directly these two symptoms of disorder, together reproducing a "hoarseness" diagram. A simple bootstrapped classifier then uses these two features to distinguish normal from disordered voices.

Results: On a large database of subjects with a wide variety of voice disorders, these new techniques can distinguish normal from disordered cases, using quadratic discriminant analysis, with an overall correct classification performance of 91.8% ± 2.0%. The true positive classification performance is 95.4% ± 3.2%, and the true negative performance is 91.5% ± 2.3% (95% confidence). This is shown to outperform all combinations of the most popular classical tools.

Conclusions: Given the very large number of arbitrary parameters and computational complexity of existing techniques, these new techniques are far simpler and yet achieve clinically useful classification performance using only a basic classification technique. They do so by exploiting the inherent nonlinearity and turbulent randomness in disordered voice signals. They are widely applicable to the whole range of disordered voice phenomena by design. These new measures could therefore be used for a variety of practical clinical purposes.
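Fractal scaling of the kind invoked here is commonly quantified with detrended fluctuation analysis (DFA), which measures how RMS fluctuations of the integrated signal grow with window size; the paper's other tool is a recurrence measure. The sketch below is a generic DFA implementation, not necessarily the paper's exact variant; the window scales are illustrative choices.

```python
import numpy as np

def dfa_alpha(x, scales=(16, 32, 64, 128, 256)):
    """Detrended fluctuation analysis scaling exponent.

    Integrate the signal, split it into windows of each scale, remove a
    linear trend per window, and fit the slope of log RMS fluctuation
    against log scale.
    """
    y = np.cumsum(x - np.mean(x))
    flucts = []
    for n in scales:
        nwin = len(y) // n
        segs = y[: nwin * n].reshape(nwin, n)
        t = np.arange(n)
        f2 = 0.0
        for s in segs:
            coef = np.polyfit(t, s, 1)            # linear detrend per window
            f2 += np.mean((s - np.polyval(coef, t)) ** 2)
        flucts.append(np.sqrt(f2 / nwin))
    slope, _ = np.polyfit(np.log(scales), np.log(flucts), 1)
    return slope

rng = np.random.default_rng(0)
print(round(dfa_alpha(rng.standard_normal(8192)), 2))  # ~0.5 for white noise
```

For disordered voices, the attraction of such an exponent is that it remains well defined even when the signal is too aperiodic for jitter- or shimmer-style measures.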

Improvement of speech recognition by nonlinear noise reduction
The success of nonlinear noise reduction applied to a single-channel recording of human voice is measured in terms of the recognition rate of a commercial speech recognition program, in comparison to the optimal linear filter. The overall performance of the nonlinear method is shown to be superior. We hence demonstrate that an algorithm which has its roots in the theory of nonlinear deterministic dynamics possesses a large potential in a realistic application.
Comment: see urbanowicz.org.p
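A classical noise-reduction scheme rooted in nonlinear deterministic dynamics is Schreiber's "simple nonlinear noise reduction": embed the signal in a window of consecutive samples and replace each centre sample by the average of centre samples of all nearby windows in embedding space. The sketch below is that generic scheme, not necessarily the algorithm of this paper; the embedding width, neighbourhood radius, and sine test signal are illustrative assumptions.

```python
import numpy as np

def simple_nlnr(y, m=7, eps=1.2):
    """Simple nonlinear noise reduction via delay-embedding local averages.

    Each m-sample window is a point in embedding space; the centre
    sample of each window is replaced by the mean of the centre samples
    of all windows within Euclidean distance eps.
    """
    emb = np.lib.stride_tricks.sliding_window_view(y, m)  # (N-m+1, m)
    mid = m // 2
    out = y.copy()
    for i in range(len(emb)):
        d = np.linalg.norm(emb - emb[i], axis=1)
        nbrs = emb[d < eps]                   # always includes the point itself
        out[i + mid] = nbrs[:, mid].mean()
    return out

rng = np.random.default_rng(0)
t = np.arange(2000)
clean = np.sin(0.05 * t)                      # a simple deterministic signal
noisy = clean + 0.2 * rng.standard_normal(t.size)
den = simple_nlnr(noisy)
err_noisy = float(np.std(noisy - clean))
err_den = float(np.std(den - clean))
print(err_noisy, err_den)                     # the error shrinks
```

Because neighbours are selected by dynamical similarity rather than by frequency content, the averaging does not smear the signal the way a linear low-pass filter would, which is the intuition behind its advantage for recognition.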
Fundamental frequency height as a resource for the management of overlap in talk-in-interaction
Overlapping talk is common in talk-in-interaction. Much of the previous research on this topic agrees that speaker overlaps can be either turn competitive or noncompetitive. An investigation of the differences in prosodic design between these two classes of overlaps can offer insight into how speakers use and orient to prosody as a resource for turn competition.
In this paper, we investigate the role of fundamental frequency (F0) as a resource for turn competition in overlapping speech. Our methodological approach combines detailed conversation analysis of overlap instances with acoustic measurements of F0 in the overlapping sequence and in its local context. The analyses are based on a collection of overlap instances drawn from the ICSI Meeting corpus. We found that overlappers mark an overlapping incoming as competitive by raising F0 above their norm for turn beginnings, and retaining this higher F0 until the point of overlap resolution. Overlappees may respond to these competitive incomings by returning competition, in which case they raise their F0 too. Our results thus provide instrumental support for earlier claims made on impressionistic evidence, namely that participants in talk-in-interaction systematically manipulate F0 height when competing for the turn.
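Because raised F0 is defined relative to each speaker's own norm, such measurements are naturally expressed in semitones relative to a speaker-specific reference. A minimal sketch of that conversion (the 110 Hz median and 140 Hz incoming are hypothetical numbers, and the speaker median is one common choice of norm):

```python
import numpy as np

def semitones_above_norm(f0_hz, speaker_norm_hz):
    """F0 height relative to a speaker-specific reference, in semitones."""
    return 12.0 * np.log2(f0_hz / speaker_norm_hz)

# A speaker with a 110 Hz reference F0 coming in at 140 Hz
print(round(semitones_above_norm(140.0, 110.0), 2))  # 4.18
```

The logarithmic scale matters: the same Hz excursion is perceptually much larger for a low-pitched speaker than for a high-pitched one, so cross-speaker comparisons of competitive incomings only make sense in a relative unit.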
Prosody-Based Automatic Segmentation of Speech into Sentences and Topics
A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models -- for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration, and word-based cues dominate for natural conversation.
Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio, September 200
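A probabilistic combination of prosodic and lexical boundary evidence can be sketched as a log-linear interpolation of the two models' boundary posteriors. This is a generic combination scheme, not the paper's exact decision-tree/HMM coupling; the posteriors and the interpolation weight of 0.5 are illustrative assumptions.

```python
import math

def logit(p):
    """Log-odds of a probability."""
    return math.log(p / (1.0 - p))

def combine(p_prosody, p_lexical, lam=0.5):
    """Log-linear interpolation of two boundary posteriors.

    Interpolating in log-odds space keeps the result a valid
    probability and reduces to either model at lam = 1 or lam = 0.
    """
    log_odds = lam * logit(p_prosody) + (1.0 - lam) * logit(p_lexical)
    return 1.0 / (1.0 + math.exp(-log_odds))

# A confident prosodic cue tempered by a weaker lexical cue
print(round(combine(0.9, 0.6), 3))  # 0.786
```

Thresholding the combined posterior at each word boundary then yields the sentence or topic segmentation; the weight lam would normally be tuned on held-out data per task and corpus.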