    Computational modeling of turn-taking dynamics in spoken conversations

    The study of human interaction dynamics has been central to multiple research disciplines, including computer and social sciences, conversational analysis, and psychology, for decades. Recent interest has focused on designing computational models that improve human-machine interaction systems and support humans in their decision-making. Turn-taking is one of the key aspects of conversational dynamics in dyadic conversations and is an integral part of human-human and human-machine interaction systems. It organizes the discourse of a conversation by means of explicit phrasing, intonation, and pausing, and it involves intricate timing. In verbal (e.g., telephone) conversation, turn transitions are facilitated by inter- and intra-speaker silences and overlaps. Early research on turn-taking in the speech community studied the durational aspects of turns, the cues that signal an intention to yield the turn, and, lastly, turn-transition models for spoken dialog agents. Compared to the studies of turn transitions, very little work has addressed the classification of overlap discourse, especially the competitive act of overlaps and the function of silences. Given these limitations of the current state of the art, this dissertation focuses on two aspects of conversational dynamics: 1) designing automated computational models for analyzing turn-taking behavior in a dyadic conversation, and 2) predicting the outcome of the conversations, i.e., observed user satisfaction, using turn-taking descriptors. These two aspects are later combined to design a conversational profile for each speaker from turn-taking behavior and the outcome of the conversations. The analysis, experiments, and evaluation were carried out on a large dataset of Italian call-center spoken conversations in which customers and agents are engaged in real problem-solving tasks.
    Towards this research goal, the challenges include automatically segmenting and aligning the speakers' channels from the speech signal, and identifying and labeling the turn types and their functional aspects. The task becomes more challenging due to the presence of overlapping speech. To model turn-taking behavior, the intention behind these overlapping turns needs to be considered. Above all, the most critical question is how to model observed user satisfaction in a dyadic conversation and which properties of turn-taking behavior can be used to represent and predict the outcome. Thus, the computational models for analyzing turn-taking dynamics in this dissertation include automatic segmentation and labeling of turn types, and categorization of competitive vs non-competitive overlaps, silences (e.g., lapses, pauses), and functions of turns in terms of dialog acts. The novel contributions of the work presented here are:
    1. the design of a fully automated turn segmentation and labeling system (e.g., agent's vs customer's turn, lapse within the speaker, and overlap);
    2. the design of annotation guidelines for segmenting and annotating speech overlaps with competitive and non-competitive labels;
    3. a demonstration of how different channels of information, such as acoustic, linguistic, and psycholinguistic feature sets, perform in the classification of competitive vs non-competitive overlaps;
    4. a study of the role of speakers and context (i.e., agents' and customers' speech) in conveying information about competitiveness, for each individual feature set and their combinations;
    5. an investigation of the function of long silences in the information flow of a dyadic conversation.
    The extracted turn-taking cues are then used to automatically predict the outcome of the conversation, which is modeled from continuous manifestations of emotion. The contributions include:
    1. modeling the state of the observed user satisfaction in terms of the final emotional manifestation of the customer (i.e., user);
    2. analyzing and modeling turn-taking properties to show how each turn type influences user satisfaction;
    3. studying how turn-taking behavior changes within each emotional state.
    Based on the studies conducted in this work, it is demonstrated that turn-taking behavior, especially the competitiveness of overlaps, is more than just an organizational tool in daily human interactions. It carries useful information and has the power to predict the outcome of the conversation in terms of satisfaction vs not-satisfaction. Combining the turn-taking behavior and the outcome of the conversation, the final goal is to design a conversational profile for each speaker. Such profile information not only assists domain experts but would also be useful to call-center agents in real time. These systems are fully automated and require no human intervention. The findings are potentially relevant to research on overlapping speech and to the automatic analysis of human-human and human-machine interactions.
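
    The turn labeling described in this abstract (agent's vs customer's turn, overlap, silence) can be illustrated with a minimal sketch. It is not the dissertation's system: it assumes each speaker's channel has already been reduced to a per-frame binary voice-activity sequence, and the function names `label_frames` and `merge_runs`, the label strings, and the toy data are all hypothetical.

```python
from itertools import groupby


def label_frames(agent, customer):
    """Label each frame from two binary voice-activity sequences (assumed input)."""
    labels = []
    for a, c in zip(agent, customer):
        if a and c:
            labels.append("overlap")    # both speakers active
        elif a:
            labels.append("agent")      # only the agent is speaking
        elif c:
            labels.append("customer")   # only the customer is speaking
        else:
            labels.append("silence")    # nobody is speaking
    return labels


def merge_runs(labels):
    """Collapse consecutive identical labels into (label, start, end) segments.

    Frame indices are 0-based and `end` is exclusive.
    """
    segments = []
    start = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((label, start, start + length))
        start += length
    return segments


# Toy example: 8 frames of hypothetical voice activity per channel.
agent = [1, 1, 1, 0, 0, 0, 1, 1]
customer = [0, 0, 1, 1, 0, 0, 0, 0]
segments = merge_runs(label_frames(agent, customer))
for segment in segments:
    print(segment)
```

    Deciding whether an overlap is competitive, or whether a silence is a pause or a lapse, needs the acoustic, linguistic, and contextual features the abstract mentions; this sketch only produces the segment boundaries such classifiers would operate on.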

    A Review of Deep Learning Techniques for Speech Processing

    The field of speech processing has undergone a transformative shift with the advent of deep learning. The use of multiple processing layers has enabled the creation of models capable of extracting intricate features from speech data. This development has paved the way for unparalleled advancements in speech recognition, text-to-speech synthesis, automatic speech recognition, and emotion recognition, propelling the performance of these tasks to unprecedented heights. The power of deep learning techniques has opened up new avenues for research and innovation in the field of speech processing, with far-reaching implications for a range of industries and applications. This review paper provides a comprehensive overview of the key deep learning models and their applications in speech-processing tasks. We begin by tracing the evolution of speech processing research, from early approaches, such as MFCC and HMM, to more recent advances in deep learning architectures, such as CNNs, RNNs, transformers, conformers, and diffusion models. We categorize the approaches and compare their strengths and weaknesses for solving speech-processing tasks. Furthermore, we extensively cover various speech-processing tasks, datasets, and benchmarks used in the literature and describe how different deep-learning networks have been utilized to tackle these tasks. Additionally, we discuss the challenges and future directions of deep learning in speech processing, including the need for more parameter-efficient, interpretable models and the potential of deep learning for multimodal speech processing. By examining the field's evolution, comparing and contrasting different approaches, and highlighting future directions and challenges, we hope to inspire further research in this exciting and rapidly advancing field

    Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer

    Deep neural network-based systems have significantly improved the performance of speaker diarization tasks. However, end-to-end neural diarization (EEND) systems often struggle to generalize to scenarios with an unseen number of speakers, while target speaker voice activity detection (TS-VAD) systems tend to be overly complex. In this paper, we propose a simple attention-based encoder-decoder network for end-to-end neural diarization (AED-EEND). In our training process, we introduce a teacher-forcing strategy to address the speaker permutation problem, leading to faster model convergence. For evaluation, we propose an iterative decoding method that outputs diarization results for each speaker sequentially. Additionally, we propose an Enhancer module to enhance the frame-level speaker embeddings, enabling the model to handle scenarios with an unseen number of speakers. We also explore replacing the transformer encoder with a Conformer architecture, which better models local information. Furthermore, we discovered that commonly used simulation datasets for speaker diarization have a much higher overlap ratio compared to real data. We found that using simulated training data that is more consistent with real data can achieve an improvement in consistency. Extensive experimental validation demonstrates the effectiveness of our proposed methodologies. Our best system achieved a new state-of-the-art diarization error rate (DER) performance on the CALLHOME (10.08%), DIHARD II (24.64%), and AMI (13.00%) evaluation benchmarks, when no oracle voice activity detection (VAD) is used. Beyond speaker diarization, our AED-EEND system also shows remarkable competitiveness as a speech type detection model. Comment: IEEE/ACM Transactions on Audio Speech and Language Processing Under Review
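
    For readers unfamiliar with the diarization error rate (DER) quoted in this abstract, the sketch below shows a simplified frame-level version of the metric: missed speech, false-alarm speech, and speaker confusion, summed and divided by the total reference speech. It is an illustration only and not the paper's scoring setup. It assumes hypothesis speaker labels have already been mapped to reference labels and applies no forgiveness collar; the function name `frame_der` and the toy data are hypothetical.

```python
def frame_der(reference, hypothesis):
    """Simplified frame-level DER.

    `reference` and `hypothesis` are equal-length lists in which each element
    is the set of speakers active in that frame. Assumes (for illustration)
    that hypothesis labels are already mapped to reference labels.
    """
    missed = false_alarm = confusion = total = 0
    for ref, hyp in zip(reference, hypothesis):
        n_ref, n_hyp = len(ref), len(hyp)
        n_correct = len(ref & hyp)                  # speakers found correctly
        missed += max(0, n_ref - n_hyp)             # reference speech not detected
        false_alarm += max(0, n_hyp - n_ref)        # speech detected where none exists
        confusion += min(n_ref, n_hyp) - n_correct  # speech assigned to the wrong speaker
        total += n_ref                              # total reference speaker-frames
    if total == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / total


# Toy example with one overlapped frame and one non-speech frame.
reference = [{"A"}, {"A"}, {"A", "B"}, {"B"}, set()]
hypothesis = [{"A"}, {"B"}, {"A"}, {"B"}, {"B"}]
der = frame_der(reference, hypothesis)
print(f"DER = {der:.2%}")
```

    In the toy data there is one confused frame, one missed speaker in the overlapped frame, and one false alarm, over five reference speaker-frames. Established scoring tools additionally search for the best speaker mapping and work on time intervals, not fixed frames.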

    Transitions Without Cure: A Journey Through The Neutral Zone

    This capstone recounts a story of two merging paths. The first path is a story of navigating a life-changing medical diagnosis, self-realization, and managing transitions. The second path is a story of personal development via experiences and key learnings from the University of Pennsylvania Master of Science Organizational Dynamics program. The two paths are purposefully intertwined in a story format to reveal the zigs and zags of personal growth, as well as the literature and teachings that shaped me while navigating a new normal. This capstone focuses on topics of introspection, managing transitions, dealing with ambiguity, and organizational leadership. I hope this paper serves as an inspiration for those undergoing a personal transformation or navigating ambiguity as the result of a life-changing event. The paper shows how organizational change and leadership frameworks have tremendous value when used as reflection tools at an individual level

    Automated Optimization Deep Learning Model for Assessment and Guidance System Through Natural Language Processing with Reduction of Anxiety Among Students

    The Assisted Assessment and Guidance System serves as a valuable tool in supporting individuals' learning, growth, and development. The Assisted Assessment and Guidance System with Natural Language Processing (NLP) is an innovative software application designed to provide personalized and intelligent support for assessment and guidance processes in various domains. NLP techniques are employed to analyze and understand human language, allowing the system to extract valuable insights from text-based data and provide tailored feedback and guidance. This paper proposes an Integrated Optimization Directional Clustering Classification (IODCc) approach for assessing foreign language anxiety. The approach incorporates two optimization models, namely Black Widow Optimization (BWO) and Seahorse Optimization (SHO). BWO and SHO are metaheuristic optimization algorithms that simulate the behaviors of black widow spiders and seahorses, respectively, to improve the accuracy of the assessment process. The integration of these optimization models within the IODCc approach aims to enhance the accuracy and effectiveness of the foreign language anxiety assessment. A simulation analysis is performed on data collected from 1000 foreign language students. The experimental analysis showed that the proposed IODCc model achieves a classification accuracy of 99%. The findings suggested that pre-training in languages would reduce the students' anxiety.

    A Path to Alignment: Connecting K-12 and Higher Education via the Common Core and the Degree Qualifications Profile

    The Common Core State Standards (CCSS), which aim to assure competency in English/language arts and mathematics through the K-12 curriculum, define necessary but not sufficient preparedness for success in college. The Degree Qualifications Profile (DQP), which describes what a college degree should signify, regardless of major, offers useful but not sufficient guidance to high school students preparing for college study. A coordinated strategy to prepare students to succeed in college would align these two undertakings and thus bridge an unfortunate and harmful cultural chasm between the K-12 world and that of higher education. Chasms call for bridges, and the bridge proposed by this white paper could create a vital thoroughfare. The white paper begins with a description of the CCSS and an assessment of their significance. A following analysis then explains why the CCSS, while necessary, are not sufficient as a platform for college success. A corresponding explanation of the DQP clarifies the prompts that led to its development, describes its structure, and offers some guidance for interpreting the outcomes that it defines. Again, a following analysis considers the potential of the DQP and the limitations that must be addressed if that potential is to be more fully realized. The heart of the white paper lies in sections 5 and 6, which provide a crosswalk between the CCSS and the DQP. These sections show how alignments and differences between the two may point to a comprehensive preparedness strategy. They also offer a proposal for a multifaceted strategy to realize the potential synergy of the CCSS and the DQP for the benefit of high school and college educators and their students -- and the nation