
    Speech-based recognition of self-reported and observed emotion in a dimensional space

    The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and examining how differences between these two types of ratings affect the development and performance of automatic emotion recognizers trained on them. A dimensional approach to emotion modeling is adopted: the ratings are based on continuous arousal and valence scales. We describe the TNO-Gaming Corpus, which contains spontaneous vocal and facial expressions elicited via a multiplayer videogame and includes emotion annotations obtained both via self-report and via outside observers. Comparisons show discrepancies between self-reported and observed emotion ratings, which are also reflected in the performance of the emotion recognizers developed. Using Support Vector Regression in combination with acoustic and textual features, recognizers of arousal and valence are developed that can predict points in a 2-dimensional arousal-valence space. The results show that self-reported emotion is much harder to recognize than observed emotion, and that averaging ratings from multiple observers improves performance.
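    As a rough illustration of the regression setup this abstract describes (one Support Vector Regression model per emotion dimension, jointly predicting a point in the arousal-valence plane), here is a minimal sketch using scikit-learn. The feature matrix, rating vectors, and feature dimensionality are placeholder assumptions, not the TNO-Gaming data or the paper's feature set.

```python
# Minimal sketch: one SVR per emotion dimension over per-utterance features.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Placeholder data: one row of acoustic/textual features per utterance.
rng = np.random.default_rng(0)
X_train = rng.random((200, 40))
y_arousal = rng.uniform(-1, 1, 200)  # continuous arousal ratings
y_valence = rng.uniform(-1, 1, 200)  # continuous valence ratings
# (averaging several observers' ratings into y is the variant the
# abstract reports as improving performance)

arousal_model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
valence_model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
arousal_model.fit(X_train, y_arousal)
valence_model.fit(X_train, y_valence)

# Predict a point in the 2-D arousal-valence space for a new utterance.
x_new = rng.random((1, 40))
point = (arousal_model.predict(x_new)[0], valence_model.predict(x_new)[0])
print(f"predicted (arousal, valence) = {point}")
```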

    Improving elderly access to audiovisual and social media, using a multimodal human-computer interface

    With the growth of the Internet and, especially, the proliferation of social media services, an opportunity has emerged for greater social and technological integration of the elderly. However, the adoption of new technologies by this segment of the population is not always straightforward, mainly due to the physical and cognitive difficulties typically associated with ageing. Thus, for the elderly to take advantage of new technologies and services that can help improve their quality of life, barriers must be broken by designing solutions with those needs in mind from the start. The aim of this work is to verify whether Multimodal Human-Computer Interaction (MHCI) systems designed according to Universal Accessibility principles, taking into account elderly-specific requirements, facilitate the adoption of and access to popular Social Media Services (SMSs) and Audiovisual Communication Services, thus potentially contributing to the social and technological integration of the elderly. A user study was initially conducted to learn about the limitations and requirements of elderly people with existing HCI, particularly concerning SMSs and Audiovisual Communication Services such as Facebook or Windows Live Messenger (WLM). The results of the study, essentially a set of new MHCI requirements, were used to inform the further development and enhancement of a multimodal prototype previously proposed for mobility-impaired individuals, now targeting the elderly. The prototype connects users with their social networks through a text, audio, and video communication service and integrates with SMSs, using natural interaction modalities such as speech, touch, and gesture. After the development stage, a usability evaluation study was conducted. The study reveals that such a multimodal solution can simplify access to the supported services by providing simpler-to-use interfaces, by adopting natural interaction modalities, and by being more satisfying for the elderly population to use than most current graphical user interfaces for those same services, such as Facebook.

    Affective social anthropomorphic intelligent system

    Human conversational style is characterized by sense of humor, personality, and tone of voice. These characteristics have become essential for conversational intelligent virtual assistants. However, most state-of-the-art intelligent virtual assistants (IVAs) fail to interpret the affective semantics of human voices. This research proposes an anthropomorphic intelligent system that can hold a proper human-like conversation with emotion and personality. A voice style transfer method is also proposed to map the attributes of a specific emotion. First, the temporal audio waveform is converted into frequency-domain data (a Mel-spectrogram), which captures discrete patterns of audio features such as notes, pitch, rhythm, and melody. A parallel CNN-Transformer encoder is used to predict seven affective states from the voice. The voice is also fed in parallel to DeepSpeech, an RNN model that generates a text transcription from the spectrogram. The transcribed text is then passed to a multi-domain conversational agent that uses blended skill talk, a transformer-based retrieve-and-generate strategy, and beam-search decoding to produce an appropriate textual response. For voice synthesis and style transfer, the system learns an invertible mapping of data to a manipulable latent space and generates each Mel-spectrogram frame conditioned on the previous frames. Finally, the waveform is generated from the spectrogram using WaveGlow. The outcomes of the studies we conducted on the individual models were promising. Furthermore, users who interacted with the system provided positive feedback, demonstrating the system's effectiveness. Comment: Multimedia Tools and Applications (2023).
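    As an illustration of the front half of the pipeline this abstract outlines (waveform to Mel-spectrogram, a CNN feature extractor followed by a Transformer encoder, then a seven-way affective-state classifier), here is a minimal sketch using PyTorch and torchaudio. The layer sizes, number of layers, and 16 kHz sample rate are illustrative assumptions, not the paper's architecture.

```python
# Sketch: Mel-spectrogram -> CNN -> Transformer encoder -> 7 affective states.
import torch
import torch.nn as nn
import torchaudio


class EmotionClassifier(nn.Module):
    def __init__(self, n_mels=80, n_classes=7, d_model=128):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_mels=n_mels)
        # CNN over the (frequency, time) plane to extract local patterns.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)))  # collapse the frequency axis
        self.proj = nn.Linear(32, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, wav):                        # wav: (batch, samples)
        spec = self.melspec(wav).unsqueeze(1)      # (batch, 1, n_mels, time)
        feats = self.cnn(spec).squeeze(2)          # (batch, 32, time)
        feats = self.proj(feats.transpose(1, 2))   # (batch, time, d_model)
        enc = self.encoder(feats)                  # contextualized frames
        return self.head(enc.mean(dim=1))          # logits over 7 states


logits = EmotionClassifier()(torch.randn(2, 16000))  # 1 s of 16 kHz audio
print(logits.shape)  # torch.Size([2, 7])
```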

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective. The technical perspective includes an up-to-date view on content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives that measure the performance of multimedia search engines. From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions of research.