8,244 research outputs found

    When to Say What and How: Adapting the Elaborateness and Indirectness of Spoken Dialogue Systems

    Get PDF
    With the aim of designing a spoken dialogue system which has the ability to adapt to the user's communication idiosyncrasies, we investigate whether it is possible to carry over insights from the usage of communication styles in human-human interaction to human-computer interaction. In an extensive literature review, it is demonstrated that communication styles play an important role in human communication. Using a multi-lingual data set, we show that there is a significant correlation between the communication style of the system and the preceding communication style of the user. This is why two components that extend the standard architecture of spoken dialogue systems are presented: 1) a communication style classifier that automatically identifies the user communication style and 2) a communication style selection module that selects an appropriate system communication style. We consider the communication styles elaborateness and indirectness as it has been shown that they influence the user's satisfaction and the user's perception of a dialogue. We present a neural classification approach based on supervised learning for each task. Neural networks are trained and evaluated with features that can be automatically derived during an ongoing interaction in every spoken dialogue system. It is shown that both components yield solid results and outperform the baseline in form of a majority-class classifier

    Contextual Language Model Adaptation for Conversational Agents

    Full text link
    Statistical language models (LM) play a key role in Automatic Speech Recognition (ASR) systems used by conversational agents. These ASR systems should provide a high accuracy under a variety of speaking styles, domains, vocabulary and argots. In this paper, we present a DNN-based method to adapt the LM to each user-agent interaction based on generalized contextual information, by predicting an optimal, context-dependent set of LM interpolation weights. We show that this framework for contextual adaptation provides accuracy improvements under different possible mixture LM partitions that are relevant for both (1) Goal-oriented conversational agents where it's natural to partition the data by the requested application and for (2) Non-goal oriented conversational agents where the data can be partitioned using topic labels that come from predictions of a topic classifier. We obtain a relative WER improvement of 3% with a 1-pass decoding strategy and 6% in a 2-pass decoding framework, over an unadapted model. We also show up to a 15% relative improvement in recognizing named entities which is of significant value for conversational ASR systems.Comment: Interspeech 2018 (accepted

    Word Importance Modeling to Enhance Captions Generated by Automatic Speech Recognition for Deaf and Hard of Hearing Users

    Get PDF
    People who are deaf or hard-of-hearing (DHH) benefit from sign-language interpreting or live-captioning (with a human transcriptionist), to access spoken information. However, such services are not legally required, affordable, nor available in many settings, e.g., impromptu small-group meetings in the workplace or online video content that has not been professionally captioned. As Automatic Speech Recognition (ASR) systems improve in accuracy and speed, it is natural to investigate the use of these systems to assist DHH users in a variety of tasks. But, ASR systems are still not perfect, especially in realistic conversational settings, leading to the issue of trust and acceptance of these systems from the DHH community. To overcome these challenges, our work focuses on: (1) building metrics for accurately evaluating the quality of automatic captioning systems, and (2) designing interventions for improving the usability of captions for DHH users. The first part of this dissertation describes our research on methods for identifying words that are important for understanding the meaning of a conversational turn within transcripts of spoken dialogue. Such knowledge about the relative importance of words in spoken messages can be used in evaluating ASR systems (in part 2 of this dissertation) or creating new applications for DHH users of captioned video (in part 3 of this dissertation). We found that models which consider both the acoustic properties of spoken words as well as text-based features (e.g., pre-trained word embeddings) are more effective at predicting the semantic importance of a word than models that utilize only one of these types of features. The second part of this dissertation describes studies to understand DHH users\u27 perception of the quality of ASR-generated captions; the goal of this work was to validate the design of automatic metrics for evaluating captions in real-time applications for these users. Such a metric could facilitate comparison of various ASR systems, for determining the suitability of specific ASR systems for supporting communication for DHH users. We designed experimental studies to elicit feedback on the quality of captions from DHH users, and we developed and evaluated automatic metrics for predicting the usability of automatically generated captions for these users. We found that metrics that consider the importance of each word in a text are more effective at predicting the usability of imperfect text captions than the traditional Word Error Rate (WER) metric. The final part of this dissertation describes research on importance-based highlighting of words in captions, as a way to enhance the usability of captions for DHH users. Similar to highlighting in static texts (e.g., textbooks or electronic documents), highlighting in captions involves changing the appearance of some texts in caption to enable readers to attend to the most important bits of information quickly. Despite the known benefits of highlighting in static texts, research on the usefulness of highlighting in captions for DHH users is largely unexplored. For this reason, we conducted experimental studies with DHH participants to understand the benefits of importance-based highlighting in captions, and their preference on different design configurations for highlighting in captions. We found that DHH users subjectively preferred highlighting in captions, and they reported higher readability and understandability scores and lower task-load scores when viewing videos with captions containing highlighting compared to the videos without highlighting. Further, in partial contrast to recommendations in prior research on highlighting in static texts (which had not been based on experimental studies with DHH users), we found that DHH participants preferred boldface, word-level, non-repeating highlighting in captions

    Understanding confounding effects in linguistic coordination: an information-theoretic approach

    Full text link
    We suggest an information-theoretic approach for measuring stylistic coordination in dialogues. The proposed measure has a simple predictive interpretation and can account for various confounding factors through proper conditioning. We revisit some of the previous studies that reported strong signatures of stylistic accommodation, and find that a significant part of the observed coordination can be attributed to a simple confounding effect - length coordination. Specifically, longer utterances tend to be followed by longer responses, which gives rise to spurious correlations in the other stylistic features. We propose a test to distinguish correlations in length due to contextual factors (topic of conversation, user verbosity, etc.) and turn-by-turn coordination. We also suggest a test to identify whether stylistic coordination persists even after accounting for length coordination and contextual factors

    Estimating Adaptacion of Dialogue Partners with Different Verbal Intelligence

    Get PDF
    This work investigates to what degree speakers with different verbal intelligence may adapt to each other. The work is based on a corpus consisting of 100 descriptions of a short film (monologues), 56 discussions about the same topic (dialogues), and verbal intelligence scores of the test participants. Adaptation between two dialogue partners was measured using cross-referencing, proportion of "I", "You" and "We" words, between-subject correlation and similarity of texts. It was shown that lower verbal intelligence speakers repeated more nouns and adjectives from the other and used the same linguistic categories more often than higher verbal intelligence speakers. In dialogues between strangers, participants with higher verbal intelligence showed a greater level of adaptation

    Personalized Emphasis Framing for Persuasive Message Generation

    Full text link
    In this paper, we present a study on personalized emphasis framing which can be used to tailor the content of a message to enhance its appeal to different individuals. With this framework, we directly model content selection decisions based on a set of psychologically-motivated domain-independent personal traits including personality (e.g., extraversion and conscientiousness) and basic human values (e.g., self-transcendence and hedonism). We also demonstrate how the analysis results can be used in automated personalized content selection for persuasive message generation
    • …
    corecore