12,360 research outputs found

    Investigation of Frame Alignments for GMM-based Digit-prompted Speaker Verification

    Frame alignments can be computed by different methods in GMM-based speaker verification. By incorporating a phonetic Gaussian mixture model (PGMM), we are able to compare the performance of alignments extracted from deep neural networks (DNNs) and the conventional hidden Markov model (HMM) in digit-prompted speaker verification. Based on the different characteristics of these two alignments, we present a novel content verification method that improves system security without much computational overhead. Our experiments on the RSR2015 Part-3 digit-prompted task show that the DNN-based alignment performs on par with the HMM alignment. The results also demonstrate the effectiveness of the proposed Kullback-Leibler (KL) divergence based scoring in rejecting speech with incorrect pass-phrases. Comment: accepted by APSIPA ASC 201
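    The KL-divergence content check described above can be sketched in a few lines. This is a minimal illustration only, assuming frame-level posteriors have already been averaged into one distribution per utterance; the function name and the numbers below are hypothetical, not the paper's exact formulation.

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """Kullback-Leibler divergence D(p || q) between two discrete
    distributions, e.g. utterance-level posteriors over phonetic units.
    A small epsilon guards against zero probabilities."""
    p = [pi + eps for pi in p]
    q = [qi + eps for qi in q]
    zp, zq = sum(p), sum(q)
    p = [pi / zp for pi in p]
    q = [qi / zq for qi in q]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Content-verification idea (sketch): compare the observed posterior
# distribution against the one expected for the prompted digit string;
# a large divergence suggests the wrong pass-phrase was spoken.
expected = [0.7, 0.2, 0.1]   # hypothetical expected posteriors
observed = [0.1, 0.2, 0.7]   # hypothetical observed posteriors
score = kl_divergence(expected, observed)
```

    A threshold on `score` would then accept or reject the utterance; KL divergence is zero only when the two distributions match.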

    Speaker diarization of multi-party conversations using participants role information: political debates and professional meetings

    Speaker Diarization aims at inferring who spoke when in an audio stream and involves two simultaneous unsupervised tasks: (1) the estimation of the number of speakers, and (2) the association of speech segments with each speaker. Most of the recent efforts in the domain have addressed the problem using machine learning techniques or statistical methods (for a review see [11]), ignoring the fact that the data consist of instances of human conversations.

    Multiple analytical perspectives of the Eleme Anterior-Perfective

    There is increasing recognition in typology that linguistic categories are language-specific and not universal, increasing the need for explicitness in language descriptions. In light of this development, I argue in this paper that preexisting labels and descriptions for a set of subject-marking TAM prefixes in Eleme do not adequately characterise the distribution and use of these forms, which is conditioned by the complex interaction of person and number features, Aktionsart, epistemic modality and information structure. In response to the challenges raised by these data, I argue that when multiple analytical perspectives are required to understand the function of a grammatical form, fine-grained quantitative analyses combined with description provide a complex but useful basis on which to compare languages.

    Integrating Prosodic and Lexical Cues for Automatic Topic Segmentation

    We present a probabilistic model that uses both prosodic and lexical cues for the automatic segmentation of speech into topically coherent units. We propose two methods for combining lexical and prosodic information using hidden Markov models and decision trees. Lexical information is obtained from a speech recognizer, and prosodic features are extracted automatically from speech waveforms. We evaluate our approach on the Broadcast News corpus, using the DARPA-TDT evaluation metrics. Results show that the prosodic model alone is competitive with word-based segmentation methods. Furthermore, we achieve a significant reduction in error by combining the prosodic and word-based knowledge sources. Comment: 27 pages, 8 figures

    Prosody-Based Automatic Segmentation of Speech into Sentences and Topics

    A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models -- for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation. Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio, September 200
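    The probabilistic combination of prosodic and lexical knowledge sources can be illustrated with a simple log-linear interpolation of boundary posteriors. This is a generic sketch under assumed inputs (a weight and two per-boundary posteriors), not the paper's exact model.

```python
import math

def combine_scores(p_lexical, p_prosodic, weight=0.5):
    """Log-linear interpolation of the boundary posteriors produced by
    a word-based language model and a prosodic decision tree.
    `weight` trades off the two knowledge sources."""
    log_p = weight * math.log(p_lexical) + (1 - weight) * math.log(p_prosodic)
    return math.exp(log_p)

# Hypothetical posteriors at one candidate boundary:
combined = combine_scores(p_lexical=0.8, p_prosodic=0.3, weight=0.6)
```

    A boundary is then hypothesized wherever the combined posterior exceeds a decision threshold; the interpolation weight would be tuned on held-out data.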

    How affects can perturbe the automatic speech recognition of domotic interactions

    In smart homes, voice-based home automation commands, whether issued for comfort or for assistive purposes, have been identified as the most relevant interaction modality for ambient assisted living. Even when the commands are strictly formulated, in daily use (directed at the smart home or at a robot mediator) they are often pronounced with various affects. In this paper we evaluate how some state-of-the-art ASR systems break down on expressive commands, whether acted or spontaneous, and how training the ASR on corpora of neutral and/or acted and/or spontaneous expressive commands can greatly modify ASR performance.

    Rhythmic performance in hypokinetic dysarthria : relationship between reading, spontaneous speech and diadochokinetic tasks

    Purpose: This study aimed to investigate whether rhythm metrics are sensitive to change in speakers with mild hypokinetic dysarthria, whether such changes can be detected in reading and spontaneous speech, and whether diadochokinetic (DDK) performance relates to rhythmic properties of speech tasks. Method: Ten people with Parkinson's disease (PwPD) with mild hypokinetic dysarthria and ten healthy control speakers produced DDK repetitions, a reading passage and a spontaneous monologue. Articulation rate, as well as ten rhythm metrics, was computed for the speech data. DDK performance was captured by the mean, standard deviation (SD) and coefficient of variation (CoV) of syllable duration. Results: Group differences were apparent across both speech tasks, but mainly in spontaneous speech. The control speakers changed their rhythm performance between the two tasks, whereas the PwPD displayed a more constant behaviour. The correlation analysis of speech and DDK tasks resulted in few meaningful relationships. Conclusions: Rhythm metrics appeared to be sensitive to mild levels of impairment in PwPD. They are thus suitable for use as diagnostic or outcome measures. In addition, we demonstrated that conversational data can be used in the investigation of rhythm. Finally, the value of DDK tasks in predicting the rhythm performance during speech could not be demonstrated successfully.
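    The DDK measures named above (mean, SD and CoV of syllable duration) are straightforward to compute; a minimal sketch, assuming a list of syllable durations in seconds has already been segmented from the recording:

```python
import statistics

def ddk_measures(durations):
    """Mean, standard deviation (SD) and coefficient of variation (CoV)
    of syllable durations from a DDK repetition task.
    CoV = SD / mean, i.e. a scale-free measure of timing variability."""
    mean = statistics.mean(durations)
    sd = statistics.stdev(durations)
    return mean, sd, sd / mean

# Hypothetical syllable durations (s) from one /pataka/ repetition run:
m, sd, cov = ddk_measures([0.1, 0.2, 0.3])
```

    Because CoV normalizes by the mean, it allows variability to be compared across speakers who repeat at different overall rates.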

    Euclidean distances as measures of speaker similarity including identical twin pairs: a forensic investigation using source and filter voice characteristics

    There is a growing consensus that hybrid approaches are necessary for successful speaker characterization in Forensic Speaker Comparison (FSC); hence this study explores the forensic potential of voice features combining source and filter characteristics. The former relate to the action of the vocal folds while the latter reflect the geometry of the speaker's vocal tract. This set of features has been extracted from pause fillers, which are long enough for robust feature estimation while spontaneous enough to be extracted from voice samples in real forensic casework. Speaker similarity was measured using standardized Euclidean Distances (ED) between pairs of speakers: 54 different-speaker (DS) comparisons, 54 same-speaker (SS) comparisons and 12 comparisons between monozygotic twins (MZ). Results revealed that the differences between DS and SS comparisons were significant in both high quality and telephone-filtered recordings, with no false rejections and limited false acceptances; this finding suggests that this set of voice features is highly speaker-dependent and therefore forensically useful. Mean ED for MZ pairs lies between the average ED for SS comparisons and DS comparisons, as expected according to the literature on twin voices. Specific cases of MZ speakers with very high ED (i.e. strong dissimilarity) are discussed in the context of sociophonetic and twin studies. A preliminary simplification of the Vocal Profile Analysis (VPA) Scheme is proposed, which enables the quantification of voice quality features in the perceptual assessment of speaker similarity, and allows for the calculation of perceptual-acoustic correlations. The adequacy of z-score normalization for this study is also discussed, as well as the relevance of heat maps for detecting the so-called phantoms in recent approaches to the biometric menagerie.
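    The standardized Euclidean distance underlying these comparisons can be sketched as follows; this is an illustrative implementation in which the per-dimension standard deviations would come from a reference population, and the feature values shown are hypothetical.

```python
import math

def standardized_euclidean(x, y, sd):
    """Standardized Euclidean distance between two speakers' feature
    vectors: each dimension is divided by its population standard
    deviation before the usual Euclidean distance is taken, so that
    features on different scales contribute comparably."""
    return math.sqrt(sum(((xi - yi) / si) ** 2
                         for xi, yi, si in zip(x, y, sd)))

# Hypothetical source/filter features for two speakers, with
# hypothetical population SDs per dimension:
speaker_a = [120.0, 500.0, 1500.0]   # e.g. f0, F1, F2 (Hz)
speaker_b = [118.0, 520.0, 1450.0]
pop_sd    = [20.0, 50.0, 100.0]
dist = standardized_euclidean(speaker_a, speaker_b, pop_sd)
```

    Smaller distances indicate greater similarity; same-speaker pairs are expected to cluster at low ED and different-speaker pairs at high ED.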

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.