1,836 research outputs found

    Audio-Visual Embedding for Cross-Modal Music Video Retrieval through Supervised Deep CCA

    Deep learning has shown excellent performance in learning joint representations across data modalities. However, little research has focused on cross-modal correlation learning in which the temporal structure of modalities such as audio and video must be taken into account. Retrieving a music video from a given audio query is a natural way to search for and interact with music content. In this work, we study cross-modal music video retrieval in terms of emotion similarity; in particular, audio of arbitrary length is used to retrieve a longer or full-length music video. To this end, we propose a novel audio-visual embedding algorithm based on Supervised Deep Canonical Correlation Analysis (S-DCCA) that projects audio and video into a shared space, bridging the semantic gap between the two modalities while preserving both the temporal structure and the similarity between audio and visual content from different videos with the same class label. Our contribution is twofold: i) we select the top-k audio chunks with an attention-based Long Short-Term Memory (LSTM) model, yielding a compact audio summary that retains local properties; ii) we propose an end-to-end deep model for cross-modal audio-visual learning in which S-DCCA learns the semantic correlation between the audio and visual modalities. Owing to the lack of a suitable music video dataset, we construct a 10K music video dataset from the YouTube-8M dataset. Promising results in terms of MAP and precision-recall show that the proposed model can be applied to music video retrieval. Comment: 8 pages, 9 figures. Accepted by ISM 201
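    The top-k chunk selection in contribution i) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's model: the attention scores here come from a plain dot product with a given vector rather than an attention-based LSTM, and the names (`top_k_chunks`, `attn_w`) are hypothetical.

```python
import numpy as np

def top_k_chunks(chunk_embeddings, attn_w, k=3):
    """Score each audio chunk with a dot-product attention vector and
    keep the k highest-scoring chunks as a compact audio summary.

    chunk_embeddings: (n_chunks, dim) array of per-chunk features.
    attn_w: (dim,) attention vector (learned in the paper; given here).
    """
    scores = chunk_embeddings @ attn_w        # one raw score per chunk
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    top = np.argsort(weights)[::-1][:k]       # indices of the k best chunks
    return np.sort(top), weights              # keep chunks in temporal order

# toy example: 5 chunks of 4-dim features
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 4))
idx, w = top_k_chunks(emb, attn_w=np.ones(4), k=2)
print(idx)
```

    Sorting the selected indices preserves the chunks' temporal order, which matters when the summary feeds a sequence model downstream.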

    A Connotative Space for Supporting Movie Affective Recommendation

    The problem of relating media content to users’ affective responses is addressed here. Previous work suggests that directly mapping audio-visual properties onto the emotion categories elicited by films is difficult, owing to the high variability of individual reactions. To reduce the gap between the objective level of video features and the subjective sphere of emotions, we propose shifting the representation towards the connotative properties of movies, in a space inter-subjectively shared among users. The connotative space thus makes it possible to define, relate, and compare affective descriptions of film videos on an equal footing. An extensive test, in which a significant number of users watched famous movie scenes, suggests that the connotative space can be related to the affective categories of a single user. We apply this finding to achieve high performance in meeting users’ emotional preferences

    The Repetition of Video Game Music, its Impact on Video Game Enjoyment, and How Best to Manage it.

    Video game music (VGM) plays a functional role in video games, which can cause it to be looped and repeated as it accompanies the player around the game world. This affects players’ enjoyment and engagement when they become overfamiliar with repeating VGM. Numerous approaches and techniques have been implemented in video games to conceal, reduce, or remove the repetition of VGM. However, familiarity gained through repeated exposure to VGM also plays a positive functional role in providing player feedback. This constructivist study focuses on the phenomenon of VGM repetition and its impact on the complex concept of video game enjoyment, and gauges how best to manage the phenomenon using the various approaches and techniques for concealing, reducing, and removing VGM repetition. Qualitative interviews were conducted with players who believed that VGM was important to their enjoyment of video games. A codebook was developed from these interviews and used to interpret the data through heuristic inquiry. The findings show that players understand the reasons for VGM repetition and believe that their enjoyment depends contextually on whether repetition improves their engagement. Players are generally tolerant of VGM repetition but can become overfamiliar with repeating VGM, which affects their enjoyment. At the same time, players grow more appreciative of the feedback role of repeating VGM as they become more familiar with it. Ultimately, the author holds a pragmatic worldview and believes that this study could benefit other VGM research and the video game industry because it focuses on the perspectives of the players themselves

    Content Discovery in Online Services: A Case Study on a Video on Demand System

    Video-on-demand services have gained popularity in recent years thanks to the large catalogues of content they offer and the ability to watch at any desired time. Having many options to choose from can overwhelm users and negatively affect the overall experience. Recommender systems have been proven to help users discover relevant content faster. However, content discovery is affected not only by the number of choices but also by the way content is displayed to the user. Moreover, recommender-system development has commonly focused on increasing prediction accuracy rather than on usefulness and user experience. This work takes a user-centric approach to designing an efficient content discovery experience. The main contribution of this research is a set of guidelines, formulated from a user study and existing research, for designing the user interface and recommender system to that end. The guidelines were additionally translated into interface designs, which were then evaluated with users. The results showed that users were satisfied with the proposed design and that the goal of providing a better content discovery experience was achieved. Moreover, the company in which the research was conducted found the guidelines feasible, so they have a high potential to work in a real product. With this research, I aim to highlight the importance of improving the content discovery process from the perspectives of both the user interface and the recommender system, and to encourage researchers to consider user experience in these aspects

    Emotion-aware voice interfaces based on speech signal processing

    Voice interfaces (VIs) will become increasingly widespread in daily life as AI techniques progress. VIs can be incorporated into smart devices such as smartphones, and integrated into cars, home automation systems, computer operating systems, and home appliances, among other things. Current speech interfaces, however, are unaware of users’ emotional states and hence cannot support genuine communication. To overcome these limitations, emotional awareness must be implemented in future VIs. This thesis focuses on how speech signal processing (SSP) and speech emotion recognition (SER) can give VIs emotional awareness. After explaining what emotion is and how neural networks are implemented, the thesis presents the results of several user studies and surveys. Emotions are complicated; they are typically characterized using categorical and dimensional models, and can be expressed verbally or nonverbally. Although existing voice interfaces are unaware of users’ emotional states and cannot support natural conversations, future VIs could perceive users’ emotions from speech via SSP. One section of this thesis, based on SSP, investigates mental restorative effects on humans and how they can be measured from speech signals. SSP is less intrusive and more accessible than traditional measures such as attention scales or response tests, and can provide a reliable assessment of attention and mental restoration; it can be implemented in future VIs and used in future HCI user research. The thesis then presents a novel attention neural network based on sparse correlation features. Its accuracy in detecting emotions in continuous speech was demonstrated in a user study using recordings from a real classroom, with promising results. In SER research, it is unknown whether existing emotion detection methods detect acted emotions or the genuine emotion of the speaker.
    Another section of this thesis is therefore concerned with humans’ ability to act out emotions. In a user study, participants were instructed to imitate five fundamental emotions. The results revealed that they struggled with this task, although certain emotions were easier to imitate than others. A further research concern is how VIs should respond to users’ emotions once SER techniques are implemented in VIs and can recognize them. The thesis therefore includes research on ways of dealing with users’ emotions. In a user study, participants were instructed to make sad, angry, and terrified VI avatars happy, and were asked whether they would like to be treated the same way if the situation were reversed. The majority of participants tended to respond to these unpleasant emotions with a neutral emotion, though emotion selection differed between genders. For a human-centered design approach, it is important to understand users’ preferences for future VIs. A questionnaire-based survey on users’ attitudes towards and preferences for emotion-aware VIs was conducted in three distinct cultures. Almost no gender differences were found, and cluster analysis identified three fundamental user types present in all cultures: Enthusiasts, Pragmatists, and Sceptics. Future VI development should therefore consider diverse sorts of users. In conclusion, future VI systems should be designed for various sorts of users and should be able to detect users’ disguised or actual emotions using SER and SSP technologies. Furthermore, many other applications, such as restorative-effect assessments, can be included in a VI system
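    A common building block in SER pipelines of the kind discussed above is attention pooling: frame-level speech features are collapsed into one utterance vector by a softmax-weighted average. The sketch below illustrates only that generic mechanism, not the thesis's sparse-correlation network; the function name and the fixed attention vector `w` are assumptions.

```python
import numpy as np

def attention_pool(frames, w):
    """Collapse frame-level speech features (n_frames, dim) into a single
    utterance vector using softmax attention over the frames."""
    scores = frames @ w                 # one relevance score per frame
    a = np.exp(scores - scores.max())   # numerically stable softmax
    a /= a.sum()
    return a @ frames                   # (dim,) attention-weighted average

rng = np.random.default_rng(1)
utterance = rng.normal(size=(20, 8))    # 20 frames, 8 features each
vec = attention_pool(utterance, w=rng.normal(size=8))
print(vec.shape)
```

    With a zero attention vector every frame gets equal weight and the pooling reduces to a plain mean; a trained `w` instead emphasizes the emotionally salient frames.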

    Video Recommendations Based on Visual Features Extracted with Deep Learning

    Postponed access: the file will be accessible after 2022-06-01. When a movie is uploaded to a movie Recommender System (e.g., YouTube), the system can exploit various forms of descriptive features (e.g., tags and genre) to generate personalized recommendations for users. However, when descriptive features are missing or very limited, the system may fail to include such a movie in the recommendation list; this is known as the cold-start problem. This thesis investigates recommendation based on a novel form of content features extracted from the movies themselves. These features represent the visual aspects of movies, are produced by Deep Learning models, and hence require no human annotation. The proposed technique has been evaluated in both offline and online evaluations using a large dataset of movies; the online evaluation was carried out in an evaluation framework developed for this thesis. Results from the offline and online evaluations (N=150) show that automatically extracted visual features can mitigate the cold-start problem by generating recommendations of superior quality compared to different baselines, including recommendation based on human-annotated features. The results also point to subtitles as a high-quality future source of automatically extracted features. The visual feature dataset, named DeepCineProp13K, the subtitle dataset, CineSub3K, and the proposed evaluation framework are all made openly available in a designated GitHub repository. Master's thesis in information science, INFO390MASV-INF
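    To illustrate why annotation-free visual features help with cold start, here is a minimal content-based ranking sketch: items are compared by cosine similarity of their feature vectors, so a freshly uploaded movie with no tags can still be ranked. The function name and the toy two-dimensional vectors are hypothetical; the thesis's actual pipeline (deep visual features, offline and online evaluation) is far richer.

```python
import numpy as np

def recommend(query_vec, catalogue, k=2):
    """Rank catalogue items by cosine similarity to a query movie's
    visual feature vector. No tags or genres are required, so newly
    uploaded (cold-start) items can be recommended immediately."""
    q = query_vec / np.linalg.norm(query_vec)
    C = catalogue / np.linalg.norm(catalogue, axis=1, keepdims=True)
    sims = C @ q                        # cosine similarity per item
    return np.argsort(sims)[::-1][:k]   # indices of the k most similar

# toy catalogue: three movies with 2-dim "visual" features
catalogue = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
ranked = recommend(np.array([1.0, 0.05]), catalogue, k=2)
print(ranked)  # → [0 1]
```

    In practice the feature vectors would come from a pretrained deep network applied to movie frames, but the ranking step stays this simple.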

    CGAMES'2009
