Speaker segmentation and clustering
This survey focuses on two challenging speech processing topics: speaker segmentation and speaker clustering. Speaker segmentation aims at finding speaker change points in an audio stream, whereas speaker clustering aims at grouping speech segments based on speaker characteristics. Model-based, metric-based, and hybrid speaker segmentation algorithms are reviewed. Concerning speaker clustering, deterministic and probabilistic algorithms are examined. A comparative assessment of the reviewed algorithms is undertaken, their advantages and disadvantages are indicated, insight into the algorithms is offered, and deductions as well as recommendations are given. Rich transcription and movie analysis are candidate applications that benefit from combined speaker segmentation and clustering.
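As an illustration of the metric-based family of segmentation algorithms surveyed above, the following minimal Python sketch flags candidate speaker change points with the Bayesian Information Criterion (BIC) over a sliding window. The window length, step size, and penalty weight are illustrative assumptions, not values taken from the survey.

```python
# Hedged sketch of metric-based speaker change detection via delta-BIC.
# `features` is assumed to be a (frames x dims) array of e.g. MFCCs.
import numpy as np

def delta_bic(X, Y, lam=1.0):
    """Delta-BIC between modeling X and Y jointly (one Gaussian) vs. separately.
    Positive values suggest a speaker change at the boundary between X and Y."""
    Z = np.vstack([X, Y])
    n, d = Z.shape

    def logdet_cov(A):
        cov = np.cov(A, rowvar=False) + 1e-6 * np.eye(d)  # regularized covariance
        return np.linalg.slogdet(cov)[1]

    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    return 0.5 * (n * logdet_cov(Z)
                  - len(X) * logdet_cov(X)
                  - len(Y) * logdet_cov(Y)) - penalty

def detect_changes(features, win=100, step=10):
    """Slide a boundary through `features` and return frame indices
    where the BIC criterion favors two different speakers."""
    changes = []
    for t in range(win, len(features) - win, step):
        if delta_bic(features[t - win:t], features[t:t + win]) > 0:
            changes.append(t)
    return changes
```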
Unsupervised decoding of long-term, naturalistic human neural recordings with automated video and audio annotations
Fully automated decoding of human activities and intentions from direct neural recordings is a tantalizing challenge in brain-computer interfacing. Most ongoing efforts have focused on training decoders on specific, stereotyped tasks in laboratory settings. Implementing brain-computer interfaces (BCIs) in natural settings requires adaptive strategies and scalable algorithms that need minimal supervision. Here we propose an unsupervised approach to decoding neural states from human brain recordings acquired in a naturalistic context. We demonstrate our approach on continuous long-term electrocorticographic (ECoG) data recorded over many days from the brain surface of subjects in a hospital room, with simultaneous audio and video recordings. We first discovered clusters in high-dimensional ECoG recordings and then annotated coherent clusters using speech and movement labels extracted automatically from the audio and video recordings. To our knowledge, this represents the first time techniques from computer vision and speech processing have been used for natural ECoG decoding. Our results show that our unsupervised approach can discover distinct behaviors from ECoG data, including moving, speaking, and resting. We verify the accuracy of our approach by comparing to manual annotations. By projecting the discovered cluster centers back onto the brain, this technique opens the door to automated functional brain mapping in natural settings.
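A minimal sketch of the general recipe described in this abstract: cluster high-dimensional neural features without supervision, then name each cluster by the behavior label that automatic audio/video annotation assigns most often to its members. The choice of k-means, the number of clusters, and the label set are assumptions for illustration, not the authors' exact pipeline.

```python
# Hedged sketch: unsupervised clustering of neural features, followed by
# annotation of clusters from automatically extracted behavior labels.
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def cluster_and_annotate(ecog_features, auto_labels, n_clusters=5, seed=0):
    """ecog_features: (n_windows, n_dims) array of per-window ECoG features.
    auto_labels: list of n_windows behavior strings (e.g. "speaking", "moving",
    "rest") produced by automatic audio/video annotation.
    Returns per-window cluster ids and a cluster-id -> behavior-name mapping."""
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    cluster_ids = km.fit_predict(ecog_features)

    cluster_names = {}
    for c in range(n_clusters):
        members = [auto_labels[i] for i in np.flatnonzero(cluster_ids == c)]
        # Name the cluster after the majority label among its member windows.
        cluster_names[c] = Counter(members).most_common(1)[0][0] if members else "unlabeled"
    return cluster_ids, cluster_names
```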
Computational modeling of turn-taking dynamics in spoken conversations
The study of human interaction dynamics has been at the center of multiple research disciplines, including computer and social sciences, conversation analysis, and psychology, for decades. Recent interest has focused on designing computational models to improve human-machine interaction systems as well as to support humans in their decision-making processes. Turn-taking is one of the key aspects of conversational dynamics in dyadic conversations and is an integral part of human-human and human-machine interaction systems. It is used for the discourse organization of a conversation by means of explicit phrasing, intonation, and pausing, and it involves intricate timing. In verbal (e.g., telephone) conversation, turn transitions are facilitated by inter- and intra-speaker silences and overlaps. Early turn-taking research in the speech community studied the durational aspects of turns, cues for turn-yielding intention, and turn-transition modeling for spoken dialog agents. Compared to the study of turn transitions, very little work has been done on classifying overlap discourse, especially competitive overlaps and the function of silences.
Given the limitations of the current state of the art, this dissertation focuses on two aspects of conversational dynamics: 1) designing automated computational models for analyzing turn-taking behavior in a dyadic conversation, and 2) predicting the outcome of the conversation, i.e., observed user satisfaction, using turn-taking descriptors. These two aspects are then combined to design a conversational profile for each speaker based on turn-taking behavior and the outcome of the conversation. The analysis, experiments, and evaluation were carried out on a large dataset of Italian call-center spoken conversations in which customers and agents are engaged in real problem-solving tasks.
Towards this research goal, the challenges include automatically segmenting and aligning the speakers' channels from the speech signal, and identifying and labeling the turn types and their functional aspects. The task becomes more challenging due to the presence of overlapping speech: to model turn-taking behavior, the intention behind these overlapping turns needs to be considered. The most critical question, however, is how to model observed user satisfaction in a dyadic conversation and which properties of turn-taking behavior can be used to represent and predict the outcome.
Thus, the computational models for analyzing turn-taking dynamics in this dissertation include automatic segmentation and labeling of turn types, categorization of competitive vs. non-competitive overlaps and of silences (e.g., lapses, pauses), and the functions of turns in terms of dialog acts.
The novel contributions of the work presented here are:
1. the design of a fully automated turn segmentation and labeling system (e.g., agent vs. customer turns, lapses within a speaker, and overlaps).
2. the design of annotation guidelines for segmenting and annotating speech overlaps with competitive and non-competitive labels.
3. a demonstration of how different channels of information, such as acoustic, linguistic, and psycholinguistic feature sets, perform in the classification of competitive vs. non-competitive overlaps (see the sketch after this list).
4. a study of the role of speakers and context (i.e., agents' and customers' speech) in conveying competitiveness, for each individual feature set and their combinations.
5. an investigation of the function of long silences in the information flow of a dyadic conversation.
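As referenced in item 3 above, the following is a hedged sketch of how classifying overlaps as competitive vs. non-competitive from fused feature sets might look. The early-fusion scheme and the SVM classifier are illustrative stand-ins; the dissertation's actual feature sets and learners may differ.

```python
# Hedged sketch: competitive vs. non-competitive overlap classification
# from per-overlap feature vectors (acoustic, linguistic, psycholinguistic).
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def evaluate_overlap_classifier(acoustic, linguistic, psycholinguistic, labels):
    """Each feature block is (n_overlaps, n_features); labels are
    1 = competitive, 0 = non-competitive. Returns mean cross-validated
    accuracy for the combined (early-fused) feature set."""
    X = np.hstack([acoustic, linguistic, psycholinguistic])  # early fusion
    clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
    return cross_val_score(clf, X, labels, cv=5, scoring="accuracy").mean()
```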
The extracted turn-taking cues are then used to automatically predict the outcome of the conversation, which is modeled from continuous manifestations of emotion. The contributions include:
1. modeling the state of observed user satisfaction in terms of the final emotional manifestation of the customer (i.e., the user).
2. analyzing and modeling turn-taking properties to show how each turn type influences user satisfaction.
3. studying how turn-taking behavior changes within each emotional state.
Based on the studies conducted in this work, it is demonstrated that turn-taking behavior, especially the competitiveness of overlaps, is more than just an organizational tool in everyday human interaction: it carries information that can predict the outcome of the conversation in terms of satisfaction vs. non-satisfaction. Combining turn-taking behavior and the outcome of the conversation, the final goal is to design a conversational profile for each speaker. Such profile information would not only assist domain experts but would also be useful to call-center agents in real time.
These systems are fully automated, and no human intervention is required. The findings are potentially relevant to research on overlapping speech and the automatic analysis of human-human and human-machine interactions.
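A hypothetical sketch of the conversational profile described above, pairing per-speaker turn-taking descriptors with the predicted conversation outcome. The field names are illustrative assumptions, not the dissertation's actual schema.

```python
# Hypothetical per-speaker conversational profile; field names are assumed.
from dataclasses import dataclass

@dataclass
class ConversationalProfile:
    speaker_id: str
    turn_count: int
    mean_turn_duration_s: float
    competitive_overlap_rate: float       # competitive overlaps per minute
    non_competitive_overlap_rate: float   # non-competitive overlaps per minute
    lapse_rate: float                     # long silences per minute
    predicted_satisfaction: bool          # outcome: satisfied vs. not satisfied
```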
The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR
English is the most widely spoken language in the world, used daily by millions of people as a first or second language in many different contexts. As a result, there are many varieties of English. Despite the many advances in English automatic speech recognition (ASR) over the past decades, results are usually reported on test datasets that fail to represent the diversity of English as spoken today around the globe. We present the first release of the Edinburgh International Accents of English Corpus (EdAcc). This dataset attempts to better represent the wide diversity of English, encompassing almost 40 hours of dyadic video-call conversations between friends. Unlike other datasets, EdAcc includes a wide range of first- and second-language varieties of English and a linguistic background profile of each speaker. Results on the latest public and commercial models show that EdAcc highlights shortcomings of current English ASR models. The best performing model, trained on 680 thousand hours of transcribed data, obtains an average word error rate (WER) of 19.7%, in contrast to the 2.7% WER obtained when evaluated on US English clean read speech. Across all models, we observe a drop in performance on Indian, Jamaican, and Nigerian English speakers. Recordings, linguistic backgrounds, a data statement, and evaluation scripts are released on our website (https://groups.inf.ed.ac.uk/edacc/) under a CC-BY-SA license. (Accepted to IEEE ICASSP 2023.)
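For reference, the word error rate (WER) figures quoted above are edit-distance based: the number of word substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length. A minimal self-contained implementation is sketched below; the example sentences are made up, not EdAcc data.

```python
# Minimal WER implementation via word-level Levenshtein distance.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("she had your dark suit", "she had a dark suit"))  # 0.2
```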