A study of lip movements during spontaneous dialog and its application to voice activity detection
This paper presents a quantitative and comprehensive study of the lip movements of a given speaker in different speech/nonspeech contexts, with a particular focus on silences, i.e., when no sound is produced by the speaker. The aim is to characterize the relationship between "lip activity" and "speech activity" and then to use visual speech information as a voice activity detector (VAD). To this aim, an original audiovisual corpus was recorded with two speakers involved in a face-to-face spontaneous dialog while sitting in separate rooms. Each speaker communicated with the other using a microphone, a camera, a screen, and headphones. This setup was used to capture separate audio stimuli for each speaker and to synchronously monitor the speaker's lip movements. A comprehensive analysis was carried out on the lip shapes and lip movements in either silence or nonsilence (i.e., speech plus nonspeech audible events). A single visual parameter, defined to characterize the lip movements, was shown to be efficient for the detection of silence sections. This results in a visual VAD that can be used in any kind of noise environment, including intricate and highly nonstationary noises, e.g., multiple and/or moving noise sources or competing speech signals.
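The single-parameter silence detector described above can be sketched as a threshold on smoothed lip-movement velocity. This is a minimal illustration of the idea, not the paper's actual method: the parameter choice, window size, and threshold are all assumed values.

```python
import numpy as np

def visual_vad(lip_aperture, win=15, thresh=0.05):
    """Label each video frame as speech (True) or silence (False) from lip motion.

    lip_aperture: 1-D array of lip-opening measurements per frame.
    win: smoothing window in frames (assumed value, not from the paper).
    thresh: activity threshold on smoothed absolute velocity (assumed value).
    """
    velocity = np.abs(np.diff(lip_aperture, prepend=lip_aperture[0]))
    # Smooth with a moving average so brief lip closures inside speech
    # are not flagged as silence.
    kernel = np.ones(win) / win
    activity = np.convolve(velocity, kernel, mode="same")
    return activity > thresh

# Toy signal: closed lips, then oscillating lips ("speech"), then closed again.
frames = np.concatenate([np.zeros(50),
                         2 + np.sin(np.linspace(0, 20, 100)),
                         np.zeros(50)])
labels = visual_vad(frames)
```

Because the decision uses only video, a detector of this shape is unaffected by acoustic noise, which is the property the abstract emphasizes.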
Exploring pause fillers in conversational speech for forensic phonetics: findings in a Spanish cohort including twins
Pause fillers occur naturally during conversational speech, and have recently generated interest for forensic applications. We extracted pause fillers from conversational speech from 54 speakers, including twins, whose voices are often perceptually similar. Overall, 872 tokens of the sound [e:] were extracted (7-33 tokens per speaker) and objectively characterised using 315 acoustic measures. We used a Random Forest (RF) classifier and tested its performance using a leave-one-sample-out scheme to obtain probabilistic estimates of binary class membership denoting whether a query token belongs to a speaker. We report results using the Receiver Operating Characteristic (ROC) curve and computing the Area Under the Curve (AUC). When the RF was presented with at least 20 tokens in the training phase for each of the two classes, we observed AUC in the range 0.71-0.98. These findings have important implications for the potential of pause fillers as an additional objective tool in forensic speaker verification.
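The evaluation pipeline described above (per-token scores from a leave-one-sample-out scheme, summarized by the AUC) can be sketched as follows. To keep the sketch NumPy-only, a nearest-centroid scorer stands in for the Random Forest, and the synthetic features are an invented stand-in for the 315 acoustic measures.

```python
import numpy as np

def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic."""
    pos, neg = scores[labels == 1], scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return greater + 0.5 * ties

rng = np.random.default_rng(0)

# Synthetic stand-in for acoustic feature vectors of [e:] tokens from two
# classes (same speaker vs. different speaker); dimensions are illustrative.
n, d = 30, 10
X = np.vstack([rng.normal(0.0, 1.0, (n, d)), rng.normal(1.0, 1.0, (n, d))])
y = np.repeat([0, 1], n)

# Leave-one-sample-out: score each held-out token against centroids computed
# from the remaining tokens (margin between distances to the two centroids).
scores = np.empty(len(y))
for i in range(len(y)):
    mask = np.ones(len(y), bool)
    mask[i] = False
    c0 = X[mask & (y == 0)].mean(axis=0)
    c1 = X[mask & (y == 1)].mean(axis=0)
    scores[i] = np.linalg.norm(X[i] - c0) - np.linalg.norm(X[i] - c1)

result = auc(y, scores)
```

The leave-one-sample-out loop matters here because tokens per speaker are few (7-33); refitting on all remaining tokens uses the data as efficiently as possible without letting the query token influence its own score.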
Comparing Different Methods for Disfluency Structure Detection
This paper presents a number of experiments focusing on assessing the performance of different machine learning methods on the identification of disfluencies and their distinct structural regions over speech data. Several machine learning methods have been applied, namely Naive Bayes, Logistic Regression, Classification and Regression Trees (CARTs), J48, and Multilayer Perceptron. Our experiments show that CARTs outperform the other methods on the identification of the distinct structural disfluent regions. Reported experiments are based on audio segmentation and prosodic features, calculated from a corpus of university lectures in European Portuguese, containing about 32h of speech and about 7.7% of disfluencies. The set of features automatically extracted from the forced-alignment corpus proved to be discriminant of the regions contained in the production of a disfluency. This work shows that using fully automatic prosodic features, disfluency structural regions can be reliably identified using CARTs, where the best results achieved correspond to 81.5% precision, 27.6% recall, and 41.2% F-measure. The best results concern the detection of the interregnum, followed by the detection of the interruption point.
Comparing different machine learning approaches for disfluency structure detection in a corpus of university lectures
This paper presents a number of experiments focusing on assessing the performance of different machine learning methods on the identification of disfluencies and their distinct structural regions over speech data. Several machine learning methods have been applied, namely Naive Bayes, Logistic Regression, Classification and Regression Trees (CARTs), J48 and Multilayer Perceptron.
Our experiments show that CARTs outperform the other methods on the identification of the distinct structural disfluent regions. Reported experiments are based on audio segmentation and prosodic features, calculated from a corpus of university lectures in European Portuguese, containing about 32h of speech and about 7.7% of disfluencies. The set of features automatically extracted from the forced-alignment corpus proved to be discriminant of the regions contained in the production of a disfluency. This work shows that using fully automatic prosodic features, disfluency structural regions can be reliably identified using CARTs, where the best results achieved correspond to 81.5% precision, 27.6% recall, and 41.2% F-measure. The best results concern the detection of the interregnum, followed by the detection of the interruption point.
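As a quick sanity check on the reported scores: the F-measure is the harmonic mean of precision and recall, and the stated 81.5% precision and 27.6% recall do combine to the stated 41.2%.

```python
def f_measure(precision, recall):
    """F1 score: harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported in the abstract for CART-based disfluency-region detection.
f = f_measure(0.815, 0.276)  # about 0.412, i.e., 41.2%
```

The low recall relative to precision suggests the classifier is conservative: the regions it marks as disfluent are usually correct, but it misses many of them.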
Remote Inference of Cognitive Scores in ALS Patients Using a Picture Description
Amyotrophic lateral sclerosis (ALS) is a fatal disease that affects not only movement, speech, and breathing but also cognition. Recent studies have focused on the use of language analysis techniques to detect ALS and infer scales for monitoring functional progression. In this paper, we focused on another important aspect, cognitive impairment, which affects 35-50% of the ALS population. In an effort to reach the ALS population, which frequently exhibits mobility limitations, we implemented the digital version of the Edinburgh Cognitive and Behavioral ALS Screen (ECAS) test for the first time. This test, which is designed to measure cognitive impairment, was remotely performed by 56 participants from the EverythingALS Speech Study. As part of the study, participants (ALS and non-ALS) were asked to describe weekly one picture from a pool of many pictures with complex scenes displayed on their computer at home. We analyzed the descriptions performed within +/- 60 days from the day the ECAS test was administered and extracted different types of linguistic and acoustic features. We input those features into linear regression models to infer 5 ECAS sub-scores and the total score. Speech samples from the picture description are reliable enough to predict the ECAS sub-scores, achieving statistically significant Spearman correlation values between 0.32 and 0.51 for the model's performance using 10-fold cross-validation.
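The evaluation described above (linear regression assessed by Spearman correlation under 10-fold cross-validation) can be sketched as follows. The data, feature count, and noise level are synthetic assumptions, not the study's; the point is the shape of the pipeline: pool out-of-fold predictions, then rank-correlate them with the true scores.

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation as the Pearson correlation of ranks
    (tie handling omitted; fine for continuous data)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

rng = np.random.default_rng(1)

# Synthetic stand-in: 56 participants, 8 speech features, one ECAS-like score
# that depends noisily on the features (all values illustrative).
n, d = 56, 8
X = rng.normal(size=(n, d))
w = rng.normal(size=d)
score = X @ w + rng.normal(scale=2.0, size=n)

# 10-fold cross-validation: fit ordinary least squares on 9 folds and
# predict the held-out fold, pooling the out-of-fold predictions.
folds = np.array_split(rng.permutation(n), 10)
pred = np.empty(n)
for test_idx in folds:
    train = np.setdiff1d(np.arange(n), test_idx)
    A = np.column_stack([X[train], np.ones(len(train))])  # add intercept
    coef, *_ = np.linalg.lstsq(A, score[train], rcond=None)
    pred[test_idx] = np.column_stack([X[test_idx], np.ones(len(test_idx))]) @ coef

rho = spearman(score, pred)
```

Spearman correlation is a natural choice here because ECAS sub-scores are bounded ordinal scales, so a rank-based measure is more robust than Pearson correlation to nonlinearity in the score range.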
Combining automatic speech recognition with semantic natural language processing in schizophrenia
Natural language processing (NLP) tools are increasingly used to quantify semantic anomalies in schizophrenia. Automatic speech recognition (ASR) technology, if robust enough, could significantly speed up the NLP research process. In this study, we assessed the performance of a state-of-the-art ASR tool and its impact on diagnostic classification accuracy based on an NLP model. We compared ASR to human transcripts quantitatively (Word Error Rate (WER)) and qualitatively by analyzing error type and position. Subsequently, we evaluated the impact of ASR on classification accuracy using semantic similarity measures. Two random forest classifiers were trained with similarity measures derived from automatic and manual transcriptions, and their performance was compared. The ASR tool had a mean WER of 30.4%. Pronouns and words in sentence-final position had the highest WERs. The classification accuracy was 76.7% (sensitivity 70%; specificity 86%) using automated transcriptions and 79.8% (sensitivity 75%; specificity 86%) for manual transcriptions. The difference in performance between the models was not significant. These findings demonstrate that using ASR for semantic analysis is associated with only a small decrease in accuracy in classifying schizophrenia, compared to manual transcripts. Thus, combining ASR technology with semantic NLP models qualifies as a robust and efficient method for diagnosing schizophrenia.
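Word Error Rate, the metric used above to compare ASR output against human transcripts, is the word-level edit distance (substitutions, insertions, deletions) divided by the reference length. A minimal sketch, with invented example sentences:

```python
def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words; substitutions,
    # insertions, and deletions all cost 1.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

# 1 substitution + 1 deletion + 1 substitution over a 5-word reference = 0.6.
rate = wer("the patient described the scene", "a patient described seen")
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is an error rate rather than an accuracy.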
Prosody, focus, and focal structure: some remarks on methodology
Prosody falls between several established fields such as phonetics, phonology, syntax, and dialogue structure. It is therefore prone to misconceptions: often its relevance is overestimated, and often it is underestimated. The traditional method in linguistics in general, and in phonology in particular, is the construction and evaluation of sometimes rather complex examples based on the intuition of the linguist. In experimental phonetics, this intuition is replaced by more or less naive, and thus non-expert, subjects and inferential statistics, but the examples, i.e., the experimental material, are often rather complex as well. It is a truism that in both cases, conclusions are drawn on an "as if" basis: as if a final proof had been found that phenomenon A really exists regularly in language B. In fact, it can only be proven that phenomenon A can sometimes be detected in the production of some speakers of a variety of language B. This dilemma matters if prosody is to be put into practice, e.g., in automatic speech and language processing. In this field, large speech databases are already available for English and will be available for other languages such as German in the near future. At least in the beginning, the problems that can - hopefully - be solved with the help of such databases might look trivial and thus uninteresting: a step backwards rather than forwards. "As if" statements (concerning, e.g., narrow vs. broad focus) and problems that are trivial at face value (concerning, e.g., the relationship between phrasing units and accentuation, and the ontology of sentence accent) will be illustrated with our own material. I will argue that such trivial problems have to be dealt with in the beginning, and that they can constitute the very basis for the proper treatment of more far-reaching and complex problems.