9 research outputs found

    Improving conversational dynamics with reactive speech synthesis

    The active exchange of ideas and/or information is a crucial feature of human-human conversation. Yet it is a skill that present-day ‘conversational’ interfaces lack, which effectively hampers the dynamics of interaction and makes it feel artificial. In this paper, we present a reactive speech synthesis system that can handle user interruptions. Initial results from the evaluation of our interactive experiment indicate that participants prefer a reactive system to a non-reactive one. Based on participants’ feedback, we suggest potential applications for reactive speech synthesis systems (i.e. an interactive tutor and an adventure game) and propose further interactive user experiments to evaluate them. We anticipate that the reactive system can offer more engaging and dynamic interaction and improve user experience by making it feel more like a natural human-human conversation.
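
    To make the reactivity concrete, here is a minimal sketch of an interruption-aware synthesis loop. This is a hypothetical design for illustration only; the paper does not publish its implementation, and the ReactiveSynthesizer class and the tts_engine interface are assumptions.

        import queue

        class ReactiveSynthesizer:
            def __init__(self, tts_engine):
                # tts_engine is assumed to expose synth(text); barge-in events
                # would be pushed onto the queue by a voice-activity detector.
                self.tts = tts_engine
                self.barge_in = queue.Queue()

            def speak(self, text, chunk_size=5):
                # Synthesize in small increments so an interruption can take
                # effect between chunks rather than after the full utterance.
                words = text.split()
                for i in range(0, len(words), chunk_size):
                    if not self.barge_in.empty():
                        self.barge_in.get()
                        return False  # turn yielded; caller may rephrase or stop
                    self.tts.synth(" ".join(words[i:i + chunk_size]))
                return True  # utterance completed without interruption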

    Cross Modal Evaluation of High Quality Emotional Speech Synthesis with the Virtual Human Toolkit

    Emotional expression is a key requirement for intelligent virtual agents. In order for an agent to produce dynamic spoken content, speech synthesis is required. However, despite substantial work with pre-recorded prompts, very little work has explored the combined effect of high-quality emotional speech synthesis and facial expression. In this paper we offer a baseline evaluation of the naturalness and emotional range available by combining the freely available SmartBody component of the Virtual Human Toolkit (VHTK) with the CereVoice text-to-speech (TTS) system. Results echo previous work using pre-recorded prompts: the visual modality is dominant and the modalities do not interact. This allows the speech synthesis to add gradual changes to the perceived emotion in terms of both valence and activation. The reported naturalness is good: 3.54 on a 5-point MOS scale.

    An investigation of eyes-free spatial auditory interfaces for mobile devices: supporting multitasking and location-based information

    Auditory interfaces offer a solution to the problem of effective eyes-free mobile interactions. However, a problem with audio, as opposed to visual displays, is dealing with multiple simultaneous information streams. Spatial audio can be used to differentiate between different streams by locating them in separate spatial auditory streams. In this thesis, we consider which spatial audio designs might be the most effective for supporting multiple auditory streams and the impact such spatialisation might have on the users' cognitive load. An investigation is carried out to explore the extent to which 3D audio can be effectively incorporated into mobile auditory interfaces to offer users eyes-free interaction for both multitasking and accessing location-based information. Following a successful calibration of the 3D audio controls on the mobile device of choice for this work (the Nokia N95 8GB), a systematic evaluation of 3D audio techniques is reported in the experimental chapters of this thesis, which considered the effects of multitasking and multi-level displays, as well as differences between egocentric and exocentric designs. One experiment investigates the implementation and evaluation of a number of different spatial (egocentric) and non-spatial audio techniques, including spatial minimisation, for supporting eyes-free mobile multitasking. The efficiency and usability of these techniques were evaluated under varying cognitive load. This evaluation showed an important interaction between cognitive load and the method used to present multiple auditory streams. The spatial minimisation technique offered an effective means of presenting and interacting with multiple auditory streams simultaneously in a selective-attention task (low cognitive load), but it was not as effective in a divided-attention task (high cognitive load), in which the interaction benefited significantly from the interruption of one of the streams. Two further experiments examine a location-based approach to supporting multiple information streams in a realistic eyes-free mobile environment. An initial case study was conducted in an outdoor mobile audio-augmented exploratory environment, allowing for the analysis and description of user behaviour in a purely exploratory environment. 3D audio was found to be an effective technique for disambiguating multiple sound sources in a mobile exploratory environment and for providing a more engaging and immersive experience, as well as encouraging exploratory behaviour. A second study extended this work by evaluating a number of complex multi-level spatial auditory displays that enabled interaction with multiple sources of location-based information in an indoor mobile audio-augmented exploratory environment. It was found that a consistent exocentric design across levels failed to reduce workload or increase user satisfaction, and this design was widely rejected by users. However, the remaining spatial auditory displays tested in this study encouraged exploratory behaviour similar to that described in the previous case study, here further characterised by increased user satisfaction and low perceived workload.
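
    As a simplified illustration of the spatialisation idea, the sketch below separates simultaneous streams by azimuth using equal-power stereo panning. This is only an approximation: the thesis used full 3D audio on the device, not plain panning, and the stream names are invented.

        import math

        def pan_gains(azimuth_deg):
            """Equal-power left/right gains for a source at -90..+90 degrees."""
            theta = (azimuth_deg + 90) / 180 * (math.pi / 2)  # map to 0..pi/2
            return math.cos(theta), math.sin(theta)

        # Give each simultaneous stream its own position on the azimuth plane.
        streams = {"music": -60, "voicemail": 0, "navigation": 60}
        for name, azimuth in streams.items():
            left, right = pan_gains(azimuth)
            print(f"{name:>10}: L={left:.2f} R={right:.2f}")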

    Persuasive interactive non-verbal behaviour in embodied conversational agents

    Realism for embodied conversational agents (ECAs) requires both visual and behavioural fidelity. One significant area of ECA behaviour that has to date received little attention is non-verbal behaviour. Non-verbal behaviour occurs continually in all human-human interactions and has been shown to be highly important in those interactions. Previous research has demonstrated that people treat media (and therefore ECAs) as real people, and so non-verbal behaviour is also important in the development of ECAs. ECAs that use non-verbal behaviour when interacting with humans or other ECAs will be more realistic, more engaging, and have higher social influence. This thesis gives an in-depth view of non-verbal behaviour in humans, followed by an exploration of the potential social influence of ECAs using a novel Wizard of Oz style approach with synthetic ECAs. It is shown that ECAs have the potential to have no less social influence (as measured using a direct measure of behaviour change) than real people, and also that it is important for ECAs to have visual feedback on their interactants for this social influence to be maximised. Throughout this thesis there is a focus on empirical evaluation of ECAs, both as a validation tool and to provide directions for future research and development. Present ECAs frequently incorporate some form of non-verbal behaviour, but it is quite limited and, more importantly, not strongly connected to the behaviour of a human interactant. This interactional aspect of non-verbal behaviour is important in human-human interactions, and results from the study of the persuasive potential of ECAs indicate that this carries over to human-ECA interactions. The challenges in creating non-verbally interactive ECAs are introduced and, by drawing parallels with robotics control systems development, behaviour-based architectures are presented as a solution to these challenges and implemented in a prototypical ECA. Evaluation of this ECA, using the methodology applied earlier in this thesis, demonstrates that an ECA with non-verbal behaviour that responds to its interactant is rated more positively than an ECA that does not, indicating that directly measurable social influences will be possible with further development.
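
    A behaviour-based control loop of the kind referenced above might look like the following sketch, with prioritised behaviours subsuming one another in the style of robotics control architectures. The behaviours and trigger rules are invented for illustration and are not taken from the thesis's prototype.

        def gaze_follow(percepts):
            # Highest priority: track the interactant when they move.
            if percepts.get("interactant_moved"):
                return "shift gaze towards interactant"

        def backchannel_nod(percepts):
            # Respond to pauses in the interactant's speech.
            if percepts.get("interactant_paused"):
                return "nod"

        def idle_blink(percepts):
            # Default behaviour, always applicable.
            return "blink"

        BEHAVIOURS = [gaze_follow, backchannel_nod, idle_blink]  # high to low priority

        def select_action(percepts):
            # Higher-priority behaviours subsume lower ones, as in
            # subsumption-style robot control.
            for behaviour in BEHAVIOURS:
                action = behaviour(percepts)
                if action:
                    return action

        print(select_action({"interactant_paused": True}))  # -> nod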

    Conveying expressivity and vocal effort transformation in synthetic speech with Harmonic plus Noise Models

    This thesis was conducted in the Grup en Tecnologies Mèdia (GTM) at Escola d'Enginyeria i Arquitectura la Salle. The group has a long trajectory in the speech synthesis field and has developed its own Unit-Selection Text-To-Speech (US-TTS) system, which is able to convey multiple expressive styles using multiple expressive corpora, one for each expressive style. Thus, in order to convey aggressive speech, the US-TTS uses an aggressive corpus, whereas for a sensual speech style, the system uses a sensual corpus. Unlike that approach, this dissertation presents a new schema for enhancing the flexibility of the US-TTS system so that it can perform multiple expressive styles using a single neutral corpus. The approach followed in this dissertation is based on applying Digital Signal Processing (DSP) techniques to carry out speech modifications in order to synthesize the desired expressive style. The Harmonics plus Noise Model (HNM) was chosen for conducting these speech modifications because of its flexibility for signal modification. Voice Quality (VoQ) has been shown to play an important role in different expressive styles, so low-level VoQ acoustic parameters were explored for conveying multiple emotions. This raised several problems that set new objectives for the rest of the thesis, among them finding a single parameter with a strong impact on the expressive style conveyed. Vocal Effort (VE) was selected for conducting expressive speech style modifications due to its salient role in expressive speech. The first approach to working with VE was based on transferring VE between two parallel utterances using the Adaptive Pre-emphasis Linear Prediction (APLP) technique. This approach allowed VE to be transferred, but the model presented certain restrictions regarding its flexibility for generating new intermediate VE levels. Aiming to improve the flexibility and control of the conveyed VE, a new approach using a linear polynomial model of VE was presented. This model not only allowed VE levels to be transferred between two different utterances, but also made it possible to generate VE levels other than those present in the speech corpus. This is aligned with the general goal of this thesis: allowing US-TTS systems to convey multiple expressive styles with a single neutral corpus. Moreover, the proposed methodology introduces a parameter for controlling the degree of VE in the synthesized speech signal. This opens new possibilities for controlling the synthesis process through simple and intuitive graphical interfaces, as was done in the CreaVeu project, also conducted in the GTM group. The dissertation concludes with a review of the conducted work and a proposal for schema modifications within a US-TTS system to introduce the VE modification blocks designed in this dissertation.
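
    As a rough illustration of the underlying intuition, the sketch below alters spectral tilt, an acoustic correlate of vocal effort, using a first-order pre-emphasis filter whose coefficient is linearly interpolated by an effort control. The actual APLP/HNM processing in the thesis is considerably more involved, and all constants here are illustrative assumptions.

        import numpy as np

        def apply_effort(signal, effort, a_low=0.2, a_high=0.95):
            """Shift spectral tilt; effort in [0, 1] (0 = soft, 1 = loud)."""
            a = a_low + effort * (a_high - a_low)  # interpolated coefficient
            out = np.empty_like(signal)
            out[0] = signal[0]
            out[1:] = signal[1:] - a * signal[:-1]  # y[n] = x[n] - a * x[n-1]
            return out

        # One second of a 220 Hz tone at 16 kHz, rendered at two effort levels.
        x = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
        soft, loud = apply_effort(x, 0.1), apply_effort(x, 0.9)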

    A generic approach to the evolution of interaction in ubiquitous systems

    This dissertation addresses the challenge of configuring modern (ubiquitous, context-sensitive, mobile, etc.) interactive systems where it is difficult or impossible to predict (i) the resources available for evolution, (ii) the criteria for judging the success of the evolution, and (iii) the degree to which human judgements must be involved in the evaluation process used to determine the configuration. In this thesis a conceptual model of interactive system configuration over time (known as interaction evolution) is presented which relies upon the following steps: (i) identification of opportunities for change in a system, (ii) reflection on the available configuration alternatives, (iii) decision-making, and (iv) implementation, followed by iteration of the process. This conceptual model underpins the development of a dynamic evolution environment based on a notion of configuration evaluation functions (hereafter referred to as evaluation functions) that provides greater flexibility than current solutions and, when supported by appropriate tools, can provide a richer set of evaluation techniques and features that are difficult or impossible to implement in current systems. Specifically, this approach supports changes to the approach, style, or mode of use employed for configuration; these features may result in more effective systems, less effort to configure them, and a greater degree of control offered to the user. The contributions of this work include: (i) establishing the need for configuration evolution through a literature review and a motivating case study experiment, (ii) development of a conceptual process model supporting interaction evolution, (iii) development of a model based on the notion of evaluation functions which is shown to support a wide range of interaction configuration approaches, (iv) a characterisation of the configuration evaluation space, followed by (v) an implementation of these ideas used in (vi) a series of longitudinal technology probes and investigations into the approaches.
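
    The evaluation-function idea can be sketched as a pluggable scoring interface over candidate configurations. The interface and the criteria below are invented for illustration; the tool support described in the thesis is richer.

        def battery_aware(config):
            # Automatic criterion: prefer configurations that use less energy.
            return -config["energy_cost"]

        def ask_user(config):
            # Human-in-the-loop criterion: defer to an explicit user rating.
            return config.get("user_rating", 0)

        def evolve(candidates, evaluation_fn):
            """Select the configuration the current evaluation function prefers."""
            return max(candidates, key=evaluation_fn)

        candidates = [
            {"name": "wifi-sync", "energy_cost": 5, "user_rating": 4},
            {"name": "local-only", "energy_cost": 1, "user_rating": 2},
        ]
        print(evolve(candidates, battery_aware)["name"])  # -> local-only
        print(evolve(candidates, ask_user)["name"])       # -> wifi-sync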

    Proceedings of the 12th European Workshop on Natural Language Generation (ENLG 2009)

    Attentive Speaking. From Listener Feedback to Interactive Adaptation

    Buschmeier H. Attentive Speaking. From Listener Feedback to Interactive Adaptation. Bielefeld: Universität Bielefeld; 2018.
    Dialogue is an interactive endeavour in which participants jointly pursue the goal of reaching understanding. Since participants enter the interaction with their individual conceptualisation of the world and their idiosyncratic way of using language, understanding cannot, in general, be reached by exchanging messages that are encoded when speaking and decoded when listening. Instead, speakers need to design their communicative acts in such a way that listeners are likely able to infer what is meant. Listeners, in turn, need to provide evidence of their understanding in such a way that speakers can infer whether their communicative acts were successful. This is often an interactive and iterative process in which speakers and listeners work towards understanding by jointly coordinating their communicative acts through feedback and adaptation. Taking part in this interactive process requires dialogue participants to have ‘interactional intelligence’. This conceptualisation of dialogue is rather uncommon in formal or technical approaches to dialogue modelling. This thesis argues that it may, nevertheless, be a promising research direction for these fields, because it de-emphasises raw language processing performance and focusses on fundamental interaction skills. Interactionally intelligent artificial conversational agents may thus be able to reach understanding with their interlocutors by drawing upon such competences. This will likely make them more robust, more understandable, more helpful, more effective, and more human-like. This thesis develops conceptual and computational models of interactional intelligence for artificial conversational agents that are limited to (1) the speaking role, and (2) evidence of understanding in the form of communicative listener feedback (short but expressive verbal/vocal signals, such as ‘okay’, ‘mhm’ and ‘huh’, head gestures, and gaze). This thesis argues that such ‘attentive speaker agents’ need to be able (1) to probabilistically reason about, infer, and represent their interlocutors’ listening-related mental states (e.g., their degree of understanding), based on their interlocutors’ feedback behaviour; (2) to interactively adapt their language and behaviour such that their interlocutors’ needs, derived from the attributed mental states, are taken into account; and (3) to decide when they need feedback from their interlocutors and how they can elicit it using behavioural cues. This thesis describes computational models for these three processes, their integration in an incremental behaviour generation architecture for embodied conversational agents, and a semi-autonomous interaction study in which the resulting attentive speaker agent is evaluated. The evaluation finds that the computational models of attentive speaking developed in this thesis enable conversational agents to interactively reach understanding with their human interlocutors (through feedback and adaptation) and that these interlocutors are willing to provide natural communicative listener feedback to such an attentive speaker agent. The thesis shows that computationally modelling interactional intelligence is generally feasible, and thereby raises many new research questions and engineering problems in the interdisciplinary fields of dialogue and artificial conversational agents.
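
    The probabilistic reasoning in step (1) can be illustrated with a simple Bayesian belief update over a binary understanding state given a feedback signal. The likelihood values below are invented for illustration; the models in the thesis are richer.

        LIKELIHOOD = {  # P(feedback signal | listener state)
            "okay": {"understood": 0.70, "confused": 0.10},
            "mhm":  {"understood": 0.25, "confused": 0.30},
            "huh":  {"understood": 0.05, "confused": 0.60},
        }

        def update_belief(prior_understood, feedback):
            """Posterior P(understood | feedback) from a prior belief."""
            p_u = LIKELIHOOD[feedback]["understood"] * prior_understood
            p_c = LIKELIHOOD[feedback]["confused"] * (1 - prior_understood)
            return p_u / (p_u + p_c)

        belief = 0.5  # uninformed prior over the listener's understanding
        for signal in ["mhm", "huh"]:  # accumulate evidence over the dialogue
            belief = update_belief(belief, signal)
        print(f"P(understood) = {belief:.2f}")  # low belief -> adapt the utterance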