Computational modeling of turn-taking dynamics in spoken conversations
The study of human interaction dynamics has been at the center of multiple research disciplines, including computer and social sciences, conversational analysis, and psychology, for decades. Recent interest has focused on designing computational models to improve human-machine interaction systems and to support humans in their decision-making process. Turn-taking is one of the key aspects of conversational dynamics in dyadic conversations and is an integral part of human-human and human-machine interaction systems. It is used for discourse organization of a conversation by means of explicit phrasing, intonation, and pausing, and it involves intricate timing. In verbal (e.g., telephone) conversation, turn transitions are facilitated by inter- and intra-speaker silences and overlaps. Early research on turn-taking in the speech community studied durational aspects of turns and cues for turn-yielding intention, and later designed turn-transition models for spoken dialog agents. Compared to studies of turn transitions, very little work has addressed the classification of overlap discourse, especially the competitive act of overlaps and the function of silences.
Given the limitations of the current state of the art, this dissertation focuses on two aspects of conversational dynamics: 1) designing automated computational models for analyzing turn-taking behavior in a dyadic conversation, and 2) predicting the outcome of the conversations, i.e., observed user satisfaction, using turn-taking descriptors. These two aspects are then used to design a conversational profile for each speaker based on turn-taking behavior and the outcome of the conversations. The analysis, experiments, and evaluation have been done on a large dataset of Italian call-center spoken conversations where customers and agents are engaged in real problem-solving tasks.
Towards our research goal, the challenges include automatically segmenting and aligning the speakers’ channels from the speech signal, and identifying and labeling the turn types and their functional aspects. The task becomes more challenging due to the presence of overlapping speech. To model turn-taking behavior, the intention behind these overlapping turns needs to be considered. Above all, the most critical question is how to model observed user satisfaction in a dyadic conversation and what properties of turn-taking behavior can be used to represent and predict the outcome.
Thus, the computational models for analyzing turn-taking dynamics in this dissertation include automatic segmentation and labeling of turn types, categorization of competitive vs non-competitive overlaps, silences (e.g., lapses, pauses), and functions of turns in terms of dialog acts.
The novel contributions of the work presented here are to
1. design a fully automated turn segmentation and labeling (e.g., agent vs customer’s turn, lapse within the speaker, and overlap) system.
2. design annotation guidelines for segmenting and annotating the speech overlaps with competitive and non-competitive labels.
3. demonstrate how different channels of information, such as acoustic, linguistic, and psycholinguistic feature sets, perform in the classification of competitive vs non-competitive overlaps.
4. study the role of speakers and context (i.e., agents’ and customers’ speech) in conveying the information of competitiveness for each individual feature set and their combinations.
5. investigate the function of long silences in the information flow of a dyadic conversation.
The extracted turn-taking cues are then used to automatically predict the outcome of the conversation, which is modeled from continuous manifestations of emotion. The contributions include
1. modeling the state of the observed user satisfaction in terms of the final emotional manifestation of the customer (i.e., user).
2. analyzing and modeling turn-taking properties to show how each turn type influences user satisfaction.
3. studying how turn-taking behavior changes within each emotional state.
Based on the studies conducted in this work, it is demonstrated that turn-taking behavior, especially the competitiveness of overlaps, is more than just an organizational tool in daily human interactions. It carries useful information and has the power to predict the outcome of the conversation in terms of satisfaction vs not-satisfaction. Combining the turn-taking behavior and the outcome of the conversation, the final goal is to design a conversational profile for each speaker. Such profiled information would not only help domain experts but also be useful to the call center agent in real time.
These systems are fully automated and no human intervention is required. The findings are potentially relevant to research on overlapping speech and the automatic analysis of human-human and human-machine interactions.
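As a rough illustration of the turn labeling described in this abstract, the sketch below labels a two-speaker timeline with speech, overlap, and silence intervals. It is not the dissertation's system: it assumes speech segments are already available as (start, end) times per speaker, each with positive duration, and the function name `label_timeline` and the labels `intra_silence` and `inter_silence` are hypothetical names chosen for this example.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds, assumed end > start


def label_timeline(agent: List[Segment],
                   customer: List[Segment]) -> List[Tuple[float, float, str]]:
    """Label every interval between segment boundaries (hypothetical helper)."""
    # Between two neighboring boundaries, each speaker is either on or off.
    points = sorted({t for seg in agent + customer for t in seg})

    def active(segs: List[Segment], lo: float, hi: float) -> bool:
        return any(s <= lo and hi <= e for s, e in segs)

    coarse = []
    for lo, hi in zip(points, points[1:]):
        a, c = active(agent, lo, hi), active(customer, lo, hi)
        if a and c:
            label = "overlap"
        elif a:
            label = "agent"
        elif c:
            label = "customer"
        else:
            label = "silence"
        coarse.append((lo, hi, label))

    # A silence flanked by the same single speaker is intra-speaker;
    # any other silence is treated as inter-speaker.
    result = []
    for i, (lo, hi, label) in enumerate(coarse):
        if label == "silence":
            before = coarse[i - 1][2] if i > 0 else None
            after = coarse[i + 1][2] if i + 1 < len(coarse) else None
            same = before == after and before in ("agent", "customer")
            label = "intra_silence" if same else "inter_silence"
        result.append((lo, hi, label))
    return result


if __name__ == "__main__":
    agent_segments = [(0.0, 2.0), (2.5, 4.0), (5.5, 6.0)]
    customer_segments = [(3.5, 5.0)]
    for start, end, label in label_timeline(agent_segments, customer_segments):
        print(f"{start:4.1f} {end:4.1f} {label}")
```

Deciding whether an overlap is competitive, which the abstract treats as a separate classification task over acoustic, linguistic, and psycholinguistic features, is outside the scope of this sketch.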
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in speech
recognition, text-to-speech synthesis, automatic speech recognition, and
emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer
Deep neural network-based systems have significantly improved the performance
of speaker diarization tasks. However, end-to-end neural diarization (EEND)
systems often struggle to generalize to scenarios with an unseen number of
speakers, while target speaker voice activity detection (TS-VAD) systems tend
to be overly complex. In this paper, we propose a simple attention-based
encoder-decoder network for end-to-end neural diarization (AED-EEND). In our
training process, we introduce a teacher-forcing strategy to address the
speaker permutation problem, leading to faster model convergence. For
evaluation, we propose an iterative decoding method that outputs diarization
results for each speaker sequentially. Additionally, we propose an Enhancer
module to enhance the frame-level speaker embeddings, enabling the model to
handle scenarios with an unseen number of speakers. We also explore replacing
the transformer encoder with a Conformer architecture, which better models
local information. Furthermore, we discovered that commonly used simulation
datasets for speaker diarization have a much higher overlap ratio compared to
real data. We found that using simulated training data that is more consistent
with real data can achieve an improvement in consistency. Extensive
experimental validation demonstrates the effectiveness of our proposed
methodologies. Our best system achieved a new state-of-the-art diarization
error rate (DER) performance on all the CALLHOME (10.08%), DIHARD II (24.64%),
and AMI (13.00%) evaluation benchmarks, when no oracle voice activity detection
(VAD) is used. Beyond speaker diarization, our AED-EEND system also shows
remarkable competitiveness as a speech type detection model.
Comment: IEEE/ACM Transactions on Audio, Speech, and Language Processing; under review.
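For readers unfamiliar with the diarization error rate (DER) quoted in this abstract, the following is a minimal frame-level sketch. DER is the sum of missed speech, false alarm, and speaker confusion divided by the total reference speech, scored under the best mapping between reference and hypothesis speakers. The sketch assumes binary frame-by-speaker activity matrices with the same number of speaker columns, applies no forgiveness collar, and uses a hypothetical function name `frame_der`; it is not the scoring tool used in the paper.

```python
from itertools import permutations
from typing import Sequence


def frame_der(ref: Sequence[Sequence[int]],
              hyp: Sequence[Sequence[int]]) -> float:
    """Frame-level DER under the best speaker permutation (illustrative)."""
    n_spk = len(ref[0])
    total_ref = sum(sum(frame) for frame in ref)
    if total_ref == 0:
        raise ValueError("reference contains no speech")

    best_errors = None
    # Brute-force search over speaker mappings; fine for a few speakers.
    for perm in permutations(range(n_spk)):
        errors = 0
        for r, h in zip(ref, hyp):
            n_ref, n_hyp = sum(r), sum(h)
            # Reference speaker s is matched to hypothesis column perm[s].
            correct = sum(1 for s in range(n_spk) if r[s] and h[perm[s]])
            miss = max(0, n_ref - n_hyp)
            false_alarm = max(0, n_hyp - n_ref)
            confusion = min(n_ref, n_hyp) - correct
            errors += miss + false_alarm + confusion
        if best_errors is None or errors < best_errors:
            best_errors = errors
    return best_errors / total_ref


if __name__ == "__main__":
    reference = [[1, 0], [1, 0], [1, 1], [0, 1]]
    hypothesis = [[0, 1], [0, 1], [0, 1], [1, 0]]  # columns swapped, one miss
    print(frame_der(reference, hypothesis))
```

The permutation search reflects the fact that hypothesis speaker labels are arbitrary; this is the same speaker permutation ambiguity that the abstract's teacher-forcing strategy is said to address during training.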
Transitions Without Cure: A Journey Through The Neutral Zone
This capstone recounts a story of two merging paths. The first path is a story of navigating a life-changing medical diagnosis, self-realization, and managing transitions. The second path is a story of personal development via experiences and key learnings from the University of Pennsylvania Master of Science Organizational Dynamics program. The two paths are purposefully intertwined in a story format to reveal the zigs and zags of personal growth, as well as the literature and teachings that shaped me while navigating a new normal. This capstone focuses on topics of introspection, managing transitions, dealing with ambiguity, and organizational leadership. I hope this paper serves as an inspiration for those undergoing a personal transformation or navigating ambiguity as the result of a life-changing event. The paper shows how organizational change and leadership frameworks have tremendous value when used as reflection tools at an individual level.
Automated Optimization Deep Learning Model for Assessment and Guidance System Through Natural Language Processing with Reduction of Anxiety Among Students
The Assisted Assessment and Guidance System serves as a valuable tool in supporting individuals' learning, growth, and development. The Assisted Assessment and Guidance System with Natural Language Processing (NLP) is a software application designed to provide personalized and intelligent support for assessment and guidance processes in various domains. NLP techniques are employed to analyze and understand human language, allowing the system to extract insights from text-based data and provide tailored feedback and guidance. This paper proposes an Integrated Optimization Directional Clustering Classification (IODCc) approach for assessing foreign language anxiety. This approach incorporates two optimization models, namely Black Widow Optimization (BWO) and Seahorse Optimization (SHO). BWO and SHO are metaheuristic optimization algorithms that simulate the behaviors of black widow spiders and seahorses, respectively, and their integration within the IODCc approach aims to improve the accuracy and effectiveness of the foreign language anxiety assessment. Simulation analysis is performed on data collected from 1000 foreign language students. The experimental analysis showed that the proposed IODCc model achieves an accuracy of 99% for the classification. The findings suggested that pre-training in languages reduces the anxiety of the students.
Freedom from control
This thesis examines my directing and leadership tactics during my time as an MFA in Directing candidate at the University of Texas at Austin and while leading three productions: The Bigot (William Glick), Dry Land (Ruby Rae Spiegel) and ENRON (Lucy Prebble). I will chart my evolving leadership practices and discuss how tactics that fit the masculine/feminine, powerful/powerless and control/freedom binaries are from a dated and male-coded hierarchical directing practice. This document envisions a new way of directing that moves from power to authentic authority and from being in control to being in charge.
Theatre and Dance
A Path to Alignment: Connecting K-12 and Higher Education via the Common Core and the Degree Qualifications Profile
The Common Core State Standards (CCSS), which aim to assure competency in English/language arts and mathematics through the K-12 curriculum, define necessary but not sufficient preparedness for success in college. The Degree Qualifications Profile (DQP), which describes what a college degree should signify, regardless of major, offers useful but not sufficient guidance to high school students preparing for college study. A coordinated strategy to prepare students to succeed in college would align these two undertakings and thus bridge an unfortunate and harmful cultural chasm between the K-12 world and that of higher education. Chasms call for bridges, and the bridge proposed by this white paper could create a vital thoroughfare. The white paper begins with a description of the CCSS and an assessment of their significance. A following analysis then explains why the CCSS, while necessary, are not sufficient as a platform for college success. A corresponding explanation of the DQP clarifies the prompts that led to its development, describes its structure, and offers some guidance for interpreting the outcomes that it defines. Again, a following analysis considers the potential of the DQP and the limitations that must be addressed if that potential is to be more fully realized. The heart of the white paper lies in sections 5 and 6, which provide a crosswalk between the CCSS and the DQP. These sections show how alignments and differences between the two may point to a comprehensive preparedness strategy. They also offer a proposal for a multifaceted strategy to realize the potential synergy of the CCSS and the DQP for the benefit of high school and college educators and their students -- and the nation.