Extending AuToBI to prominence detection in European Portuguese
This paper describes our exploratory work in applying the Automatic ToBI annotation system (AuToBI), originally developed for Standard American English, to European Portuguese. This work is motivated by the current availability of large amounts of (highly spontaneous) transcribed data and the need to further enrich those transcripts with prosodic information. Manual prosodic annotation, however, is impractical for such extensive data sets. For that reason, automatic systems such as AuToBI stand as an alternative. We started by applying the AuToBI prosodic event detection system with the existing English models to the prediction of prominent prosodic events (accents) in European Portuguese. This approach achieved an overall accuracy of 74% for prominence detection, similar to state-of-the-art results for other languages. We then trained new models using prepared and spontaneous Portuguese data, achieving a considerable improvement of about 6% accuracy (absolute) over the existing English models. These results are quite encouraging and provide a starting point for automatically predicting prominent events in European Portuguese.
A Cognitive Science Reasoning in Recognition of Emotions in Audio-Visual Speech
In this report we summarize the state of the art in speech emotion recognition from the signal-processing point of view. On the basis of multi-corpus experiments with machine-learning classifiers, we observe that existing supervised machine-learning approaches lead to database-dependent classifiers that cannot be applied to multi-language speech emotion recognition without additional training, because they discriminate the emotion classes according to the training language used. As experimental results show that humans can perform language-independent categorisation, we drew a parallel between machine recognition and the cognitive process and tried to discover the sources of these divergent results. The analysis suggests that the main difference is that speech perception allows the extraction of language-independent features, even though language-dependent features are incorporated at all levels of the speech signal and play a strong discriminative role in human perception. Based on several results in related domains, we further suggest that the cognitive process of emotion recognition relies on categorisation, assisted by a hierarchical structure of emotional categories that exists in the cognitive space of all humans. We propose a strategy for developing language-independent machine emotion recognition, based on the identification of language-independent speech features and the use of additional information from visual (expression) features.
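The hierarchical-category idea above can be sketched as a mapping from fine-grained, possibly corpus-specific emotion labels onto a small set of language-independent super-categories before scoring. The hierarchy and labels below are purely illustrative assumptions, not taken from the report:

```python
# Hypothetical two-level emotion hierarchy: fine labels map to
# coarse, language-independent super-categories (illustrative only).
SUPER_CATEGORY = {
    "anger": "negative-active",
    "fear": "negative-active",
    "sadness": "negative-passive",
    "boredom": "negative-passive",
    "joy": "positive-active",
    "neutral": "neutral",
}

def to_super(label):
    """Back off from a fine emotion label to its super-category."""
    return SUPER_CATEGORY.get(label, "unknown")

# A classifier trained on one corpus's fine labels can then be
# evaluated on another corpus at the super-category level.
predictions = ["anger", "joy", "boredom"]
coarse = [to_super(p) for p in predictions]
```

Evaluating at the coarse level sidesteps label sets that differ across corpora and languages while keeping the categorisation structure intact.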
Joint morphological-lexical language modeling for processing morphologically rich languages with application to dialectal Arabic
Language modeling for an inflected language such as Arabic poses new challenges for speech recognition and machine translation due to its rich morphology. Rich morphology results in large increases in the out-of-vocabulary (OOV) rate and poor language-model parameter estimation in the absence of large quantities of data. In this study, we present a joint morphological-lexical language model (JMLLM) that takes advantage of Arabic morphology. JMLLM combines morphological segments with the underlying lexical items, together with additional available information sources about both, in a single joint model. Joint representation and modeling of morphological and lexical items reduces the OOV rate and provides smooth probability estimates while keeping the predictive power of whole words. Speech recognition and machine translation experiments in dialectal Arabic show improvements over word- and morpheme-based trigram language models. We also show that as the tightness of integration between the different information sources increases, both speech recognition and machine translation performance improve.
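The OOV reduction from morphological segmentation can be illustrated with a toy sketch. The segmentation rule and the pseudo-Arabic word lists below are hypothetical, standing in for a real morphological analyzer; this is not the paper's JMLLM:

```python
# Toy illustration of why segmenting morphologically rich words
# lowers the OOV rate (all words and rules here are made up).

def segment(word):
    """Split one assumed prefix off the stem, if present."""
    for prefix in ("al", "wa", "bi"):
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            return [prefix + "+", word[len(prefix):]]
    return [word]

def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens absent from the training vocabulary."""
    vocab = set(train_tokens)
    return sum(1 for t in test_tokens if t not in vocab) / len(test_tokens)

train = ["kitab", "bayt", "kabir", "alwalad"]   # hypothetical corpus
test = ["alkitab", "albayt", "kitab"]

# Word level: "alkitab" and "albayt" are unseen as whole words.
word_oov = oov_rate(train, test)

# Morpheme level: segmentation exposes the known stems and the
# "al+" prefix seen in training, so the OOV rate drops.
train_m = [m for w in train for m in segment(w)]
test_m = [m for w in test for m in segment(w)]
morph_oov = oov_rate(train_m, test_m)
```

A joint model goes further than this sketch by tying the morpheme stream back to whole-word predictions, which is where the smoothing-versus-predictive-power trade-off described above comes in.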
Development of a Speaker Diarization System for Speaker Tracking in Audio Broadcast News: a Case Study
A system for speaker tracking in broadcast-news audio data is presented, and the impact of the system's main components on the overall speaker-tracking performance is evaluated. Speaker tracking in continuous audio streams involves several processing tasks and is therefore treated as a multistage process. The main building blocks of such a system are the components for audio segmentation, speech detection, speaker clustering and speaker identification. The aim of the first three is to find homogeneous regions in continuous audio streams that belong to one speaker and to join the regions of each speaker together. Organizing the audio data in this way is known as speaker diarization and plays an important role in various speech-processing applications.
In our case, the impact of speaker diarization was assessed in a speaker-tracking system through a comparative study of how each component influenced the overall speaker-detection results. The evaluation experiments were performed on broadcast-news audio data with a speaker-tracking system capable of detecting 41 target speakers. We implemented several different approaches in each component and compared their performance by inspecting the final speaker-tracking results. The evaluation results indicate the importance of the audio-segmentation and speech-detection components, while additionally including a speaker-clustering component brought no significant improvement in the overall speaker-tracking results.
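The multistage flow described in this abstract can be sketched as a chain of four stages. Every stage body below is a placeholder stand-in operating on toy frame labels, not the system's actual algorithms:

```python
# Placeholder sketch of the segmentation -> speech detection ->
# clustering -> identification pipeline (toy data, not real audio).

def segment_audio(stream):
    """Cut the continuous stream into short segments (fixed-length here)."""
    return [stream[i:i + 2] for i in range(0, len(stream), 2)]

def detect_speech(segments):
    """Keep only segments containing speech (here: any non-silence frame)."""
    return [s for s in segments if any(f != "sil" for f in s)]

def cluster_speakers(segments):
    """Group segments so each cluster holds one (assumed) speaker."""
    clusters = {}
    for s in segments:
        clusters.setdefault(s[0], []).append(s)
    return clusters

def identify_speakers(clusters, targets):
    """Label each cluster with a known target speaker, or 'unknown'."""
    return {cid: (cid if cid in targets else "unknown") for cid in clusters}

stream = ["spkA", "spkA", "sil", "sil", "spkB", "spkB"]
speech = detect_speech(segment_audio(stream))
labels = identify_speakers(cluster_speakers(speech), targets={"spkA"})
```

The first three stages together perform the diarization step; only the final stage consults the set of known target speakers, which is why diarization quality bounds the tracking result.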