12,360 research outputs found

    Investigation of Frame Alignments for GMM-based Digit-prompted Speaker Verification

    Frame alignments can be computed by different methods in GMM-based speaker verification. By incorporating a phonetic Gaussian mixture model (PGMM), we are able to compare the performance of alignments extracted from deep neural networks (DNNs) and the conventional hidden Markov model (HMM) in digit-prompted speaker verification. Based on the different characteristics of these two alignments, we present a novel content verification method that improves system security without much computational overhead. Our experiments on the RSR2015 Part-3 digit-prompted task show that the DNN-based alignment performs on par with the HMM alignment. The results also demonstrate the effectiveness of the proposed Kullback-Leibler (KL) divergence based scoring in rejecting speech with incorrect pass-phrases. Comment: accepted by APSIPA ASC 201
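    The KL-divergence content check described above can be sketched in a few lines. This is a minimal illustration only, assuming frame-level posteriors have already been averaged into one distribution per utterance; the function name and the numbers below are hypothetical, not the paper's exact formulation.

```python
import math

def kl_divergence(p, q, eps=1e-10):
    """Kullback-Leibler divergence D(p || q) between two discrete
    distributions, e.g. utterance-level posteriors over phonetic units.
    A small epsilon guards against zero probabilities."""
    p = [pi + eps for pi in p]
    q = [qi + eps for qi in q]
    zp, zq = sum(p), sum(q)
    p = [pi / zp for pi in p]
    q = [qi / zq for qi in q]
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Content-verification idea (sketch): compare the observed posterior
# distribution against the one expected for the prompted digit string;
# a large divergence suggests the wrong pass-phrase was spoken.
expected = [0.7, 0.2, 0.1]   # hypothetical expected posteriors
observed = [0.1, 0.2, 0.7]   # hypothetical observed posteriors
score = kl_divergence(expected, observed)
```

    A threshold on `score` would then accept or reject the utterance; KL divergence is zero only when the two distributions match.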

    Speaker diarization of multi-party conversations using participants role information: political debates and professional meetings

    Speaker Diarization aims at inferring who spoke when in an audio stream and involves two simultaneous unsupervised tasks: (1) the estimation of the number of speakers, and (2) the association of speech segments with each speaker. Most of the recent efforts in the domain have addressed the problem using machine learning techniques or statistical methods (for a review see [11]), ignoring the fact that the data consist of instances of human conversations.

    Multiple analytical perspectives of the Eleme Anterior-Perfective

    There is increasing recognition in typology that linguistic categories are language-specific and not universal, increasing the need for explicitness in language descriptions. In light of this development, I argue in this paper that preexisting labels and descriptions for a set of subject-marking TAM prefixes in Eleme do not adequately characterise the distribution and use of these forms, which is conditioned by the complex interaction of person and number features, Aktionsart, epistemic modality and information structure. In response to the challenges raised by these data, I argue that when multiple analytical perspectives are required to understand the function of a grammatical form, fine-grained quantitative analyses combined with description provide a complex but useful basis on which to compare languages.

    Integrating Prosodic and Lexical Cues for Automatic Topic Segmentation

    We present a probabilistic model that uses both prosodic and lexical cues for the automatic segmentation of speech into topically coherent units. We propose two methods for combining lexical and prosodic information using hidden Markov models and decision trees. Lexical information is obtained from a speech recognizer, and prosodic features are extracted automatically from speech waveforms. We evaluate our approach on the Broadcast News corpus, using the DARPA-TDT evaluation metrics. Results show that the prosodic model alone is competitive with word-based segmentation methods. Furthermore, we achieve a significant reduction in error by combining the prosodic and word-based knowledge sources. Comment: 27 pages, 8 figures

    Prosody-Based Automatic Segmentation of Speech into Sentences and Topics

    A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models -- for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation. Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio, September 200
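    The probabilistic combination of prosodic and lexical knowledge sources can be illustrated with a simple log-linear interpolation of boundary posteriors. This is a generic sketch under assumed inputs (a weight and two per-boundary posteriors), not the paper's exact model.

```python
import math

def combine_scores(p_lexical, p_prosodic, weight=0.5):
    """Log-linear interpolation of the boundary posteriors produced by
    a word-based language model and a prosodic decision tree.
    `weight` trades off the two knowledge sources."""
    log_p = weight * math.log(p_lexical) + (1 - weight) * math.log(p_prosodic)
    return math.exp(log_p)

# Hypothetical posteriors at one candidate boundary:
combined = combine_scores(p_lexical=0.8, p_prosodic=0.3, weight=0.6)
```

    A boundary is then hypothesized wherever the combined posterior exceeds a decision threshold; the interpolation weight would be tuned on held-out data.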

    How affects can perturbe the automatic speech recognition of domotic interactions

    In smart homes, voice-based home automation commands, whether issued for comfort or for assistive purposes, have been identified as the most relevant interaction modality for ambient assisted living. Even when the commands are strictly formulated, in daily use (directed at the smart home or at a robot mediator) they are often pronounced with various affects. In this paper we evaluate how some state-of-the-art ASR systems break down on expressive commands, whether acted or spontaneous, and how training the ASR on corpora of neutral and/or acted and/or spontaneous expressive commands can greatly modify ASR performance.

    Rhythmic performance in hypokinetic dysarthria : relationship between reading, spontaneous speech and diadochokinetic tasks

    Purpose: This study aimed to investigate whether rhythm metrics are sensitive to change in speakers with mild hypokinetic dysarthria, whether such changes can be detected in reading and spontaneous speech, and whether diadochokinetic (DDK) performance relates to rhythmic properties of speech tasks. Method: Ten people with Parkinson's disease (PwPD) with mild hypokinetic dysarthria and ten healthy control speakers produced DDK repetitions, a reading passage and a spontaneous monologue. Articulation rate, as well as ten rhythm metrics, was computed for the speech data. DDK performance was captured by the mean, standard deviation (SD) and coefficient of variation (CoV) of syllable duration. Results: Group differences were apparent across both speech tasks, but mainly in spontaneous speech. The control speakers changed their rhythm performance between the two tasks, whereas the PwPD displayed a more constant behaviour. The correlation analysis of speech and DDK tasks resulted in few meaningful relationships. Conclusions: Rhythm metrics appeared to be sensitive to mild levels of impairment in PwPD. They are thus suitable for use as diagnostic or outcome measures. In addition, we demonstrated that conversational data can be used in the investigation of rhythm. Finally, the value of DDK tasks in predicting the rhythm performance during speech could not be demonstrated successfully.
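    The DDK measures named above (mean, SD and CoV of syllable duration) are straightforward to compute; a minimal sketch, assuming a list of syllable durations in seconds has already been segmented from the recording:

```python
import statistics

def ddk_measures(durations):
    """Mean, standard deviation (SD) and coefficient of variation (CoV)
    of syllable durations from a DDK repetition task.
    CoV = SD / mean, i.e. a scale-free measure of timing variability."""
    mean = statistics.mean(durations)
    sd = statistics.stdev(durations)
    return mean, sd, sd / mean

# Hypothetical syllable durations (s) from one /pataka/ repetition run:
m, sd, cov = ddk_measures([0.1, 0.2, 0.3])
```

    Because CoV normalizes by the mean, it allows variability to be compared across speakers who repeat at different overall rates.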

    Euclidean distances as measures of speaker similarity including identical twin pairs: a forensic investigation using source and filter voice characteristics

    There is a growing consensus that hybrid approaches are necessary for successful speaker characterization in Forensic Speaker Comparison (FSC); hence this study explores the forensic potential of voice features combining source and filter characteristics. The former relate to the action of the vocal folds while the latter reflect the geometry of the speaker's vocal tract. This set of features has been extracted from pause fillers, which are long enough for robust feature estimation while spontaneous enough to be extracted from voice samples in real forensic casework. Speaker similarity was measured using standardized Euclidean Distances (ED) between pairs of speakers: 54 different-speaker (DS) comparisons, 54 same-speaker (SS) comparisons and 12 comparisons between monozygotic twins (MZ). Results revealed that the differences between DS and SS comparisons were significant in both high quality and telephone-filtered recordings, with no false rejections and limited false acceptances; this finding suggests that this set of voice features is highly speaker-dependent and therefore forensically useful. Mean ED for MZ pairs lies between the average ED for SS comparisons and DS comparisons, as expected according to the literature on twin voices. Specific cases of MZ speakers with very high ED (i.e. strong dissimilarity) are discussed in the context of sociophonetic and twin studies. A preliminary simplification of the Vocal Profile Analysis (VPA) Scheme is proposed, which enables the quantification of voice quality features in the perceptual assessment of speaker similarity, and allows for the calculation of perceptual-acoustic correlations. The adequacy of z-score normalization for this study is also discussed, as well as the relevance of heat maps for detecting the so-called phantoms in recent approaches to the biometric menagerie.
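    The standardized Euclidean distance underlying these comparisons can be sketched as follows; this is an illustrative implementation in which the per-dimension standard deviations would come from a reference population, and the feature values shown are hypothetical.

```python
import math

def standardized_euclidean(x, y, sd):
    """Standardized Euclidean distance between two speakers' feature
    vectors: each dimension is divided by its population standard
    deviation before the usual Euclidean distance is taken, so that
    features on different scales contribute comparably."""
    return math.sqrt(sum(((xi - yi) / si) ** 2
                         for xi, yi, si in zip(x, y, sd)))

# Hypothetical source/filter features for two speakers, with
# hypothetical population SDs per dimension:
speaker_a = [120.0, 500.0, 1500.0]   # e.g. f0, F1, F2 (Hz)
speaker_b = [118.0, 520.0, 1450.0]
pop_sd    = [20.0, 50.0, 100.0]
dist = standardized_euclidean(speaker_a, speaker_b, pop_sd)
```

    Smaller distances indicate greater similarity; same-speaker pairs are expected to cluster at low ED and different-speaker pairs at high ED.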

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight into how these fields are integrated to support research and development, thus addressing the core challenges of SCR.