Toward summarization of communicative activities in spoken conversation
This thesis is an inquiry into the nature and structure of face-to-face conversation, with a
special focus on group meetings in the workplace. I argue that conversations are composed
of episodes, each of which corresponds to an identifiable communicative activity such as
giving instructions or telling a story. These activities are important because they are part
of participants’ commonsense understanding of what happens in a conversation. They
appear in natural summaries of conversations such as meeting minutes, and participants
talk about them within the conversation itself. Episodic communicative activities therefore
represent an essential component of practical, commonsense descriptions of conversations.
The thesis objective is to provide a deeper understanding of how such activities may be
recognized and differentiated from one another, and to develop a computational method
for doing so automatically. The experiments are thus intended as initial steps toward future
applications that will require analysis of such activities, such as an automatic minute-taker
for workplace meetings, a browser for broadcast news archives, or an automatic decision
mapper for planning interactions.
My main theoretical contribution is to propose a novel analytical framework called participant
relational analysis. The proposal argues that communicative activities are principally
indicated through participant-relational features, i.e., expressions of relationships between
participants and the dialogue. Participant-relational features, such as subjective language,
verbal reference to the participants, and the distribution of speech activity amongst
the participants, are therefore argued to be a principal means for analyzing the nature and
structure of communicative activities.
I then apply the proposed framework to two computational problems: automatic discourse
segmentation and automatic discourse segment labeling. The first set of experiments tests whether participant-relational features can serve as a basis for automatically
segmenting conversations into discourse segments, e.g., activity episodes. Results show
that they are effective across different levels of segmentation and different corpora, and indeed sometimes more effective than the commonly used approach based on semantic links between content words, i.e., lexical cohesion. They also show that feature performance is
highly dependent on segment type, suggesting that human-annotated “topic segments” are
in fact a multi-dimensional, heterogeneous collection of topic and activity-oriented units.
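Since the thesis repeatedly benchmarks against lexical cohesion, a toy TextTiling-style sketch may help fix the idea: compare word distributions in adjacent windows and posit a boundary where their similarity collapses. This is an illustrative sketch only, not the thesis's implementation; the function names, window size, and threshold are assumptions.

```python
from collections import Counter
import math

def cosine(a, b):
    """Cosine similarity between two bags of words (Counters)."""
    num = sum(a[w] * b[w] for w in a if w in b)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def cohesion_boundaries(tokens, window=10, threshold=0.1):
    """Place a boundary wherever lexical similarity between adjacent
    windows of content words drops below a threshold (TextTiling-style)."""
    boundaries = []
    for i in range(window, len(tokens) - window, window):
        left = Counter(tokens[i - window:i])
        right = Counter(tokens[i:i + window])
        if cosine(left, right) < threshold:
            boundaries.append(i)
    return boundaries
```

A run of "cat" tokens followed by a run of "dog" tokens yields a boundary at the vocabulary shift, while homogeneous text yields none.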
Analysis of commonly used evaluation measures, performed in conjunction with the
segmentation experiments, reveals that they fail to penalize substantially defective results
due to inherent biases in the measures. I therefore preface the experiments with a comprehensive
analysis of these biases and a proposal for a novel evaluation measure. A reevaluation
of state-of-the-art segmentation algorithms using the novel measure produces
substantially different results from previous studies. This raises serious questions about the
effectiveness of some state-of-the-art algorithms and helps to identify the most appropriate
ones to employ in the subsequent experiments.
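The bias described here can be made concrete with Pk, a widely used window-based segmentation measure (shown as a minimal sketch; this is the standard measure being critiqued, not the thesis's proposed replacement):

```python
def pk(ref, hyp, k=None):
    """Pk segmentation error: slide a window of width k and count positions
    where reference and hypothesis disagree on whether a boundary falls
    inside the window. ref/hyp are 0/1 boundary indicator sequences."""
    if k is None:  # conventional default: half the mean segment length
        k = max(2, len(ref) // (ref.count(1) + 1) // 2)
    errors = sum(
        (sum(ref[i:i + k]) > 0) != (sum(hyp[i:i + k]) > 0)
        for i in range(len(ref) - k)
    )
    return errors / (len(ref) - k)
```

On a 12-unit reference with two boundaries, a degenerate hypothesis that proposes no boundaries at all scores Pk = 0.4 rather than anything near 1.0, illustrating the kind of under-penalization the analysis targets.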
I also preface the experiments with an investigation of participant reference, an important
type of participant-relational feature. I propose an annotation scheme with novel distinctions
for vagueness, discourse function, and addressing-based referent inclusion, each
of which is assessed for inter-coder reliability. The resulting dataset includes annotations
of 11,000 occasions of person-referring.
The second set of experiments concerns the use of participant-relational features to
automatically identify labels for discourse segments. In contrast to assigning semantic topic
labels, such as topical headlines, the proposed algorithm automatically labels segments
according to activity type, e.g., presentation, discussion, and evaluation. The method is
unsupervised and does not learn from annotated ground truth labels. Rather, it induces the
labels through correlations between discourse segment boundaries and the occurrence of
bracketing meta-discourse, i.e., occasions when the participants talk explicitly about what
has just occurred or what is about to occur. Results show that bracketing meta-discourse
is an effective basis for identifying some labels automatically, but that its use is limited if
global correlations to segment features are not employed.
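The labeling-by-bracketing idea can be sketched as matching explicit meta-discourse near segment openings. The cue patterns and function below are hypothetical illustrations; the thesis induces its cues from data rather than hand-writing them.

```python
import re

# Hypothetical cue patterns for bracketing meta-discourse, i.e., talk
# about what is about to occur ("let's discuss...", "I'll present...").
CUES = {
    "presentation": re.compile(r"\b(i'll|let me) (present|show you)\b"),
    "discussion":   re.compile(r"\blet's (discuss|talk about)\b"),
    "evaluation":   re.compile(r"\b(what do (you|we) think|any comments)\b"),
}

def label_segments(segments):
    """Label each discourse segment by the activity-type cue found in its
    opening utterances; segments with no cue stay unlabeled."""
    labels = []
    for seg in segments:
        opening = " ".join(seg[:3]).lower()  # first few utterances
        label = next(
            (name for name, pat in CUES.items() if pat.search(opening)),
            "unknown",
        )
        labels.append(label)
    return labels
```

As the abstract notes, such local matching only covers segments whose openings happen to contain meta-discourse; the remainder need the global correlations mentioned above.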
This thesis addresses important prerequisites to the automatic summarization of conversation.
I provide a novel activity-oriented perspective on how summarization
should be approached, and a novel participant-relational approach to conversational analysis.
The experimental results show that analysis of participant-relational features is
Temporal Information in Data Science: An Integrated Framework and its Applications
Data science is a well-known buzzword that is in fact composed of two distinct keywords: data and science. Data itself is of great importance, since every analysis task begins from a set of examples. On this basis, the present work starts from a real-world scenario: the development of a data-warehouse-based decision support system for an Italian contact center company. Relying on the information collected in that system, a set of machine-learning-based analysis tasks was then developed to answer specific business questions, such as employee work anomaly detection and automatic call classification. Although these initial applications rely on existing algorithms, some non-trivial analysis workflows also had to be developed. Afterwards, still driven by real data and real-world applications, we turned to the question of how to handle temporal information within classical decision tree models. This research led to the development of J48SS, a decision tree induction algorithm based on Quinlan's C4.5 learner that can handle temporal data (e.g., sequences and time series) as well as atemporal data (e.g., numerical and categorical attributes) within the same execution cycle. The tree has been applied to several real-world analysis tasks, demonstrating its usefulness. A key characteristic of J48SS is its interpretability, an aspect we specifically addressed through the study of an evolutionary decision tree pruning technique. Finally, since much work on the management of temporal information has already been done in the automated reasoning and formal verification fields, a natural next step was to investigate how such solutions may be combined with machine learning, following two main tracks.
First, through the development of an enriched decision tree capable of encoding temporal information by means of interval temporal logic formulas, we show how a machine learning algorithm can successfully exploit temporal logic to perform data analysis. Then we focus on the opposite direction, employing machine learning techniques to generate temporal logic formulas in a natural language processing scenario. Finally, the architecture of a system is proposed in which formal methods and machine learning techniques are seamlessly combined to perform anomaly detection and predictive maintenance tasks. Such an integration represents an original research direction that may open up new ways of dealing with complex, real-world problems.
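J48SS's internals are not reproduced here, but the general trick of letting one tree consume temporal and atemporal attributes together can be sketched by reducing a time series to a scalar, shapelet-style distance feature. Everything below (function names, field names, the flat feature vector) is a hypothetical illustration, not the actual J48SS algorithm.

```python
import math

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between the shapelet and any equal-length
    subsequence of the series: a scalar summary of temporal shape."""
    m = len(shapelet)
    return min(
        math.sqrt(sum((series[i + j] - shapelet[j]) ** 2 for j in range(m)))
        for i in range(len(series) - m + 1)
    )

def featurize(record, shapelet):
    """Mix atemporal fields with a sequence-derived feature, so a standard
    (atemporal) decision tree learner can split on both at once."""
    return [
        record["n_calls"],                                    # atemporal, as-is
        record["avg_duration"],                               # atemporal, as-is
        shapelet_distance(record["load_series"], shapelet),   # temporal, reduced
    ]
```

A tree induced over such vectors can then branch on "how closely does the series match this shape" just as it branches on a numeric attribute.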
Towards Better Understanding of Spoken Conversations: Assessment of Emotion and Sentiment
Emotions play a vital role in our daily life, as they help us convey information that cannot be expressed verbally. While humans can easily perceive emotions, they are notoriously difficult for machines to define and recognize. Nevertheless, automatically detecting the emotion of a spoken conversation can be useful for a diverse range of applications, such as human-machine interaction and conversation analysis. In this thesis, we present several machine-learning-based approaches to recognizing emotion from isolated utterances and long recordings.
Isolated utterances are usually shorter than 10 s in duration and are assumed to contain only one major emotion. One of the main obstacles to achieving high emotion recognition accuracy is the lack of large annotated datasets. We propose to mitigate this problem by using transfer learning and data augmentation techniques. We show that x-vector representations extracted from speaker recognition models (x-vector models) contain emotion-predictive information, and that adapting those models provides significant improvements in emotion recognition performance. To further improve performance, we propose Copy-Paste, a novel perceptually motivated data augmentation method for isolated utterances. This method is based on the assumption that the presence of emotions other than neutral dictates a speaker's overall perceived emotion in a recording.
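The Copy-Paste idea can be illustrated with a toy sketch. The function name, signature, and the waveform-as-list simplification are assumptions for illustration, not the thesis's code.

```python
import random

def copy_paste(emotional_wav, neutral_wav, emotion_label, rng=None):
    """Toy sketch of Copy-Paste augmentation: splice a neutral recording
    before or after an emotional one. Under the stated assumption -- any
    non-neutral content dictates the perceived emotion of the whole
    recording -- the spliced sample keeps the emotional label."""
    rng = rng or random.Random(0)
    parts = [list(emotional_wav), list(neutral_wav)]
    rng.shuffle(parts)  # emotional segment may land first or second
    return parts[0] + parts[1], emotion_label
```

Each spliced sample is a new, longer training example that carries the original emotional label, enlarging the annotated set without new annotation.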
As isolated utterances are assumed to contain only one emotion, the proposed models make predictions at the utterance level. However, these models cannot be applied directly to conversations, which can contain multiple emotions, unless the locations of emotion boundaries are known. In this work, we propose to recognize emotions in conversations by frame-level classification, where predictions are made at regular intervals. We compare models trained on isolated utterances with models trained on conversations. We propose DiverseCatAugment, a data augmentation method based on the attention operation, to improve transformer models. To further improve performance, we incorporate the turn-taking structure of the conversations into our models.
Annotating utterances with emotions is not a simple task, and its difficulty depends on the number of emotion categories used. However, annotation schemes can be adapted to reduce annotation effort depending on the application. We consider one such application: predicting customer satisfaction (CSAT) in call center conversations, where the goal is to predict the overall sentiment of the customer. We conduct a comprehensive search for adequate acoustic and lexical representations at different granularities of the conversation. We show that methods using transfer learning (x-vectors and CSAT Tracker) perform best. Our error analysis shows that calls in which customers accomplished their goal but were still dissatisfied are the most difficult to predict correctly, and that the customer's speech is more emotional than the agent's.
SpokenWOZ: A Large-Scale Speech-Text Benchmark for Spoken Task-Oriented Dialogue Agents
Task-oriented dialogue (TOD) models have made significant progress in recent
years. However, previous studies primarily focus on datasets written by
annotators, which has resulted in a gap between academic research and
real-world spoken conversation scenarios. While several small-scale spoken TOD
datasets have been proposed to address robustness issues such as ASR errors, they
ignore the unique challenges of spoken conversation. To tackle these limitations,
we introduce SpokenWOZ, a large-scale speech-text dataset for spoken TOD
containing 8 domains, 203k turns, 5.7k dialogues, and 249 hours of audio from
human-to-human spoken conversations. SpokenWOZ further incorporates common
spoken characteristics such as word-by-word processing and reasoning in spoken
language. Based on these characteristics, we present cross-turn slot and
reasoning slot detection as new challenges. We conduct experiments on various
baselines, including text-modal models, newly proposed dual-modal models, and
LLMs, e.g., ChatGPT. The results show that the current models still have
substantial room for improvement in spoken conversation, where the most
advanced dialogue state tracker only achieves 25.65% in joint goal accuracy and
the SOTA end-to-end model only correctly completes the user request in 52.1% of
dialogues. The dataset, code, and leaderboard are available:
https://spokenwoz.github.io/SpokenWOZ-github.io/
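The joint goal accuracy figure quoted above is an exact-match metric over per-turn dialogue states; a minimal sketch of how it is computed (the slot-dict state representation is an assumption):

```python
def joint_goal_accuracy(gold_states, pred_states):
    """Fraction of turns whose predicted dialogue state matches the gold
    state exactly on every slot; a single wrong or missing slot value
    makes the whole turn count as incorrect."""
    correct = sum(g == p for g, p in zip(gold_states, pred_states))
    return correct / len(gold_states)
```

The all-or-nothing matching per turn is what makes 25.65% a plausibly hard ceiling for current trackers on spoken input.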