    Toward a model of computational attention based on expressive behavior: applications to cultural heritage scenarios

    The goals of our project were to develop an attention-based analysis of human expressive behavior and to implement real-time algorithms in EyesWeb XMI, in order to improve the naturalness of human-computer interaction and the context-based monitoring of human behavior. To this end, a perceptual model that mimics human attentional processes was developed for expressivity analysis and modeled by entropy. Museum scenarios were selected as an ecological test bed for three experiments that focus on visitor profiling and visitor flow regulation.
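
    The abstract states that expressivity is modeled by entropy but does not say how. As a minimal sketch only, assuming a one-dimensional movement feature (for example a normalized quantity of motion in [0, 1]) that is binned into a histogram, Shannon entropy could be computed as follows. The function name, the bin count, and the value range are illustrative assumptions, not the project's actual EyesWeb XMI implementation.

```python
import math
from collections import Counter


def shannon_entropy(samples, num_bins=8, lo=0.0, hi=1.0):
    """Shannon entropy (in bits) of samples binned into equal-width bins on [lo, hi].

    Hypothetical helper: the binning scheme is an assumption for illustration.
    """
    if not samples:
        return 0.0
    width = (hi - lo) / num_bins
    # Map each sample to a bin index, clamping values that fall outside [lo, hi].
    counts = Counter(
        min(num_bins - 1, max(0, int((x - lo) / width))) for x in samples
    )
    total = len(samples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# A constant signal carries no uncertainty; a signal split evenly between
# two bins carries one bit.
steady = shannon_entropy([0.1] * 10)
varied = shannon_entropy([0.1, 0.9])
```

    Under this reading, a visitor whose movement feature stays in one bin yields low entropy, while widely varying movement yields high entropy.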

    An End-to-End Conversational Style Matching Agent

    We present an end-to-end voice-based conversational agent that is able to engage in naturalistic multi-turn dialogue and align with the interlocutor's conversational style. The system uses a series of deep neural network components for speech recognition, dialogue generation, prosodic analysis and speech synthesis to generate language and prosodic expression with qualities that match those of the user. We conducted a user study (N=30) in which participants talked with the agent for 15 to 20 minutes, resulting in over 8 hours of natural interaction data. Users with high-consideration conversational styles reported the agent to be more trustworthy when it matched their conversational style, whereas users with high-involvement conversational styles were indifferent. Finally, we provide design guidelines for multi-turn dialogue interactions using conversational style adaptation.

    Current Challenges and Visions in Music Recommender Systems Research

    Music recommender systems (MRS) have experienced a boom in recent years, thanks to the emergence and success of online streaming services, which nowadays make almost all music in the world available at the user's fingertips. While today's MRS considerably help users find interesting music in these huge catalogs, MRS research still faces substantial challenges. In particular, when it comes to building, incorporating, and evaluating recommendation strategies that go beyond simple user-item interactions or content-based descriptors and dig deep into the very essence of listener needs, preferences, and intentions, MRS research becomes a big endeavor and related publications are quite sparse. The purpose of this trends and survey article is twofold. First, we identify and shed light on what we believe are the most pressing challenges MRS research is facing, from both academic and industry perspectives. We review the state of the art towards solving these challenges and discuss its limitations. Second, we detail possible future directions and visions we contemplate for the further evolution of the field. The article should therefore serve two purposes: giving the interested reader an overview of current challenges in MRS research and providing guidance for young researchers by identifying interesting, yet under-researched, directions in the field.

    OpenAdaptxt: an open source enabling technology for high quality text entry

    Modern text entry systems, especially for touch screen phones and novel devices, rely on complex underlying technologies such as error correction and word suggestion. Furthermore, global deployment requires support for a vast number of languages. Together, these requirements have raised the entry bar for new text entry techniques, which makes development and testing a longer process and thus stifles innovation. For example, testing a new feedback mechanism against a stock keyboard now requires the researchers to support at least slip correction and probably word suggestion. This paper introduces OpenAdaptxt: an open source, community-driven text input platform that enables the development of higher quality text input solutions. It is the first commercial-grade open source enabling technology for modern text entry that supports both multiple platforms and dictionaries for over 50 spoken languages.

    Contours of Inclusion: Inclusive Arts Teaching and Learning

    The purpose of this publication is to share models and case examples of the process of inclusive arts curriculum design and evaluation. The first section explains the conceptual and curriculum frameworks that were used in the analysis and generation of the featured case studies (i.e. Understanding by Design, Differentiated Instruction, and Universal Design for Learning). Data for the case studies was collected from three urban sites (i.e. Los Angeles, San Francisco, and Boston) and included participant observations, student and teacher interviews, curriculum documentation, digital documentation of student learning, and transcripts from discussion forum and teleconference discussions from a professional learning community. The initial case studies by Glass and Barnum use the curricular frameworks to analyze and understand what inclusive practices look like in two case studies of arts-in-education programs that included students with disabilities. The second set of precedent case studies by Kronenberg and Blair, and Jenkins and Agois Hurel uses the frameworks to explain their process of including students by providing flexible arts learning options to support student learning of content standards. Both sets of case studies illuminate curricular design decisions and instructional strategies that supported the active engagement and learning of students with disabilities in educational settings shared with their peers. The second set of cases also illustrates the reflective process of using frameworks like Universal Design for Learning (UDL) to guide curricular design, responsive instructional differentiation, and the use of the arts as a rich, meaningful, and engaging option to support learning. Appended are curriculum design and evaluation tools. (Individual chapters contain references.)

    Multichannel Attention Network for Analyzing Visual Behavior in Public Speaking

    Public speaking is an important aspect of human communication and interaction. The majority of computational work on public speaking concentrates on analyzing the spoken content and the verbal behavior of the speakers. While the success of public speaking largely depends on the content of the talk and the verbal behavior, non-verbal (visual) cues, such as gestures and physical appearance, also play a significant role. This paper investigates the importance of visual cues by estimating their contribution towards predicting the popularity of a public lecture. For this purpose, we constructed a large database of more than 1800 TED talk videos. As a measure of popularity of the TED talks, we leverage the corresponding (online) viewers' ratings from YouTube. Visual cues related to facial and physical appearance, facial expressions, and pose variations are extracted from the video frames using convolutional neural network (CNN) models. Thereafter, an attention-based long short-term memory (LSTM) network is proposed to predict the video popularity from the sequence of visual features. The proposed network achieves state-of-the-art prediction accuracy, indicating that visual cues alone contain highly predictive information about the popularity of a talk. Furthermore, our network learns a human-like attention mechanism, which is particularly useful for interpretability: it shows how attention varies with time and across different visual cues by indicating their relative importance.
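
    To make the attention step concrete, the following is a minimal sketch of dot-product attention pooling over a sequence of per-frame feature vectors. It covers only the pooling idea, not the CNN feature extraction or the LSTM, and it is not the paper's architecture: the function names and the `query` vector (which would be learned in a real network) are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features, query):
    """Weight each time step by its similarity to `query`, then average.

    features: list of T vectors (one per video frame), all of equal length.
    query:    vector of the same length; assumed here, learned in practice.
    Returns (weights, pooled), where weights sum to 1 over the T time steps.
    """
    scores = [sum(f_d * q_d for f_d, q_d in zip(frame, query)) for frame in features]
    weights = softmax(scores)
    dim = len(features[0])
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, features)) for d in range(dim)
    ]
    return weights, pooled


# Two frames with two features each; the query favours the first feature,
# so the first frame receives almost all of the attention weight.
weights, pooled = attention_pool([[1.0, 0.0], [0.0, 1.0]], [10.0, 0.0])
```

    The per-step weights are what makes such a model inspectable: plotting them over time indicates which frames contributed most to the prediction.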

    Speaker-adaptive multimodal prediction model for listener responses

    The goal of this paper is to analyze and model the variability in speaking styles in dyadic interactions and to build a predictive algorithm for listener responses that is able to adapt to these different styles. The end result of this research will be a virtual human able to automatically respond to a human speaker with proper listener responses (e.g., head nods). Our novel speaker-adaptive prediction model is created from a corpus of dyadic interactions where speaker variability is analyzed to identify a subset of prototypical speaker styles. During a live interaction, our prediction model automatically identifies the closest prototypical speaker style and predicts listener responses based on this "communicative style". Central to our approach is the idea of a "speaker profile", which uniquely identifies each speaker and enables the matching between prototypical speakers and new speakers. The paper shows the merits of our speaker-adaptive listener response prediction model by showing improvement over a state-of-the-art approach which does not adapt to the speaker. Besides the merits of speaker adaptation, our experiments highlight the importance of using multimodal features when comparing speakers to select the closest prototypical speaker style.
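
    The matching of a new speaker to the closest prototypical style can be pictured as a nearest-neighbor lookup. The sketch below assumes that a speaker profile is a fixed-length numeric vector and that closeness is Euclidean distance; the abstract specifies neither, and the style names and feature values are hypothetical.

```python
import math


def closest_style(profile, prototypes):
    """Return the name of the prototype whose profile is nearest to `profile`.

    profile:    list of floats describing the new speaker (assumed representation).
    prototypes: dict mapping a style name to a profile vector of the same length.
    """
    return min(prototypes, key=lambda name: math.dist(profile, prototypes[name]))


# Hypothetical two-feature profiles, e.g. (gesture rate, pitch variability).
prototypes = {
    "animated": [0.9, 0.8],
    "reserved": [0.1, 0.2],
}
style = closest_style([0.8, 0.7], prototypes)
```

    Once a style is selected, a listener response predictor trained for that prototype would be used for the live interaction; that predictor is outside the scope of this sketch.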

    TranSTYLer: Multimodal Behavioral Style Transfer for Facial and Body Gestures Generation

    This paper addresses the challenge of transferring the behavior expressivity style of one virtual agent to another while preserving the shape of behaviors, as they carry communicative meaning. Behavior expressivity style is viewed here as the qualitative properties of behaviors. We propose TranSTYLer, a multimodal transformer-based model that synthesizes the multimodal behaviors of a source speaker with the style of a target speaker. We assume that behavior expressivity style is encoded across various modalities of communication, including text, speech, body gestures, and facial expressions. The model employs a style and content disentanglement schema to ensure that the transferred style does not interfere with the meaning conveyed by the source behaviors. Our approach eliminates the need for style labels and allows generalization to styles that have not been seen during the training phase. We train our model on the PATS corpus, which we extended to include dialog acts and 2D facial landmarks. Objective and subjective evaluations show that our model outperforms state-of-the-art models in style transfer for styles both seen and unseen during training. To tackle the issues of style and content leakage that may arise, we propose a methodology to assess the degree to which behavior and gestures associated with the target style are successfully transferred, while ensuring the preservation of the ones related to the source content.
