3,362 research outputs found

    Selected Topics in Audio-based Recommendation of TV Content

    CHORUS Deliverable 2.2: Second report - identification of multi-disciplinary key issues for gap analysis toward EU multimedia search engines roadmap

    After addressing the state of the art during the first year of Chorus and establishing the existing landscape in multimedia search engines, we identified and analyzed gaps within the European research effort during our second year. In this period we focused on three directions: technological issues, user-centred issues and use cases, and socio-economic and legal aspects. These were assessed through two central studies: first, a concerted vision of the functional breakdown of a generic multimedia search engine, and second, a set of representative use-case descriptions with a related discussion of the requirements for technological challenges. Both studies were carried out in cooperation and consultation with the community at large through EC concertation meetings (multimedia search engines cluster), several meetings with our Think-Tank, presentations at international conferences, and surveys addressed to EU project coordinators as well as national initiative coordinators. Based on the feedback obtained, we identified two types of gaps: core technological gaps that involve research challenges, and “enablers”, which are not necessarily technical research challenges but have an impact on innovation progress. New socio-economic trends are presented, as well as emerging legal challenges.

    Multimodal sentiment analysis in real-life videos

    This thesis extends the emerging field of multimodal sentiment analysis of real-life videos, taking two components into consideration: the emotion and the emotion's target. The emotion component of media is traditionally represented as a segment-based intensity model of emotion classes. This representation is replaced here by a value- and time-continuous view. Adjacent research fields, such as affective computing, have largely neglected the linguistic information available from automatic transcripts of audio-video material. As is demonstrated here, this text modality is well-suited for time- and value-continuous prediction. Moreover, source-specific problems, such as trustworthiness, have been largely unexplored so far. This work examines perceived trustworthiness of the source, and its quantification, in user-generated video data and presents a possible modelling path. Furthermore, the transfer between the continuous and discrete emotion representations is explored in order to summarise the emotional context at a segment level. The other component deals with the target of the emotion, for example, the topic the speaker is addressing. Emotion targets in a video dataset can, as is shown here, be coherently extracted based on automatic transcripts without limiting a priori parameters, such as the expected number of targets. Furthermore, alternatives to purely linguistic investigation in predicting targets, such as knowledge-bases and multimodal systems, are investigated. A new dataset is designed for this investigation, and, in conjunction with proposed novel deep neural networks, extensive experiments are conducted to explore the components described above. The developed systems show robust prediction results and demonstrate strengths of the respective modalities, feature sets, and modelling techniques. 
Finally, foundations are laid for cross-modal information prediction systems with applications to the correction of corrupted in-the-wild signals from real-life videos.

    Leveraging contextual-cognitive relationships into mobile commerce systems

    A thesis submitted to the University of Bedfordshire in partial fulfilment of the requirements for the degree of Doctor of Philosophy. Mobile smart devices are becoming increasingly important within the on-line purchasing cycle. Thus the requirement for mobile commerce systems to become truly context-aware remains paramount if they are to be effective within the varied situations that mobile users encounter. Where a recommender system traditionally focuses upon the user-item relationship, i.e. what to recommend, this thesis proposes that, due to the complexity of mobile user situational profiles, the how and when must also be considered for recommendations to be effective. Though non-trivial, it should be possible, through the understanding of a user’s ability to complete certain cognitive processes, to determine the likelihood of engagement and therefore the success of the recommendation. This research undertakes an investigation into physical and modal contexts and presents findings as to their relationships with cognitive processes. Through the introduction of the novel concept of disruptive contexts, situational contexts, including noise, distractions and user activity, are identified as having significant effects upon the relationship between user affective state and cognitive capability. Experimental results demonstrate that by understanding specific cognitive capabilities, e.g. a user’s perception of advert content and user levels of purchase-decision involvement, a system can determine potential user engagement and therefore improve the effectiveness of recommender systems’ performance. A quantitative approach is followed with a reliance upon statistical measures to inform the development, and subsequent validation, of a contextual-cognitive model that was implemented as part of a context-aware system. 
The development of SiDISense (Situational Decision Involvement Sensing system) demonstrated, through the use of smart-phone sensors and machine learning, that it was viable to classify subjectively rated contexts and then infer levels of cognitive capability and therefore the likelihood of positive user engagement. Through this success in furthering the understanding of contextual-cognitive relationships, novel and significant advances are now viable within the area of m-commerce.

    The Importance of Context When Recommending TV Content: Dataset and Algorithms

    Home entertainment systems feature in a variety of usage scenarios with one or more simultaneous users, for whom the complexity of choosing media to consume has increased rapidly over the last decade. Users' decision processes are complex and highly influenced by contextual settings, but data supporting the development and evaluation of context-aware recommender systems are scarce. In this paper we present a dataset of self-reported TV consumption enriched with contextual information of viewing situations. We show how choice of genre is associated with, among other factors, the number of users present and users' attention levels. Furthermore, we evaluate the performance of predicting chosen genres given different configurations of contextual information, and compare the results to contextless predictions. The results suggest that including contextual features in the prediction causes notable improvements, and both temporal and social context show significant contributions.

    Automatic Emotion Recognition: Quantifying Dynamics and Structure in Human Behavior.

    Emotion is a central part of human interaction, one that has a huge influence on its overall tone and outcome. Today's human-centered interactive technology can greatly benefit from automatic emotion recognition, as the extracted affective information can be used to measure, transmit, and respond to user needs. However, developing such systems is challenging due to the complexity of emotional expressions and their dynamics in terms of the inherent multimodality between audio and visual expressions, as well as the mixed factors of modulation that arise when a person speaks. To overcome these challenges, this thesis presents data-driven approaches that can quantify the underlying dynamics in audio-visual affective behavior. The first set of studies lays the foundation and central motivation of this thesis. We discover that it is crucial to model complex non-linear interactions between audio and visual emotion expressions, and that dynamic emotion patterns can be used in emotion recognition. Next, the understanding of the complex characteristics of emotion from the first set of studies leads us to examine multiple sources of modulation in audio-visual affective behavior. Specifically, we focus on how speech modulates facial displays of emotion. We develop a framework that uses speech signals, which alter the temporal dynamics of individual facial regions, to temporally segment and classify facial displays of emotion. Finally, we present methods to discover regions of emotionally salient events in given audio-visual data. We demonstrate that different modalities, such as the upper face, lower face, and speech, express emotion with different timings and time scales, varying for each emotion type. We further extend this idea into another aspect of human behavior: human action events in videos. We show how transition patterns between events can be used for automatically segmenting and classifying action events. 
Our experimental results on audio-visual datasets show that the proposed systems not only improve performance, but also provide descriptions of how affective behaviors change over time. We conclude this dissertation with the future directions that will innovate three main research topics: machine adaptation for personalized technology, human-human interaction assistant systems, and human-centered multimedia content analysis.
PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/133459/1/yelinkim_1.pd

    Affective and Implicit Tagging using Facial Expressions and Electroencephalography.

    PhD. Recent years have seen an explosion of user-generated, untagged multimedia data, generating a need for efficient search and retrieval of this data. The predominant method for content-based tagging is through manual annotation. Consequently, automatic tagging is currently the subject of intensive research. However, it is clear that the process will not be fully automated in the foreseeable future. We propose to involve the user and investigate methods for implicit tagging, wherein users' responses to the multimedia content are analysed in order to generate descriptive tags. We approach this problem through the modalities of facial expressions and EEG signals. We investigate tag validation and affective tagging using EEG signals. The former relies on the detection of event-related potentials triggered in response to the presentation of invalid tags alongside multimedia material. We demonstrate significant differences in users' EEG responses for valid versus invalid tags, and present results towards single-trial classification. For affective tagging, we propose methodologies to map EEG signals onto the valence-arousal space and perform both binary classification and regression in this space. We apply these methods in a real-time affective recommendation system. We also investigate the analysis of facial expressions for implicit tagging. This relies on a dynamic texture representation using non-rigid registration that we first evaluate on the problem of facial action unit recognition. We present results on well-known datasets (with both posed and spontaneous expressions) comparable to the state of the art in the field. Finally, we present a multi-modal approach that fuses both modalities for affective tagging. We perform classification in the valence-arousal space based on these modalities and present results for both feature-level and decision-level fusion. 
We demonstrate improvement in the results when using both modalities, suggesting the modalities contain complementary information.