
    Just-in-Time Multimodal Association and Fusion from Home Entertainment

    In this paper, we describe a real-time multimodal analysis system with just-in-time multimodal association and fusion for a living room environment, where multiple people may enter, interact and leave the observable world with no constraints. It comprises detection and tracking of up to 4 faces, detection and localisation of verbal and paralinguistic events, and their association and fusion. The system is designed to be used in open, unconstrained environments such as next-generation video conferencing systems that automatically “orchestrate” the transmitted video streams to improve the overall experience of interaction between spatially separated families and friends. Performance levels achieved to date on a hand-labelled dataset have shown sufficient reliability while fulfilling real-time processing requirements.

    Multimodal Content Analysis for Effective Advertisements on YouTube

    The rapid advances in e-commerce and Web 2.0 technologies have greatly increased the impact of commercial advertisements on the general public. As a key enabling technology, a multitude of recommender systems exist which analyze user features and browsing patterns to recommend appealing advertisements to users. In this work, we seek to study the attributes that characterize an effective advertisement and recommend a useful set of features to aid the design and production of commercial advertisements. We analyze the temporal patterns in the multimedia content of advertisement videos, including auditory, visual and textual components, and study their individual roles and synergies in the success of an advertisement. The objective of this work is then to measure the effectiveness of an advertisement, and to recommend a useful set of features to advertisement designers to make it more successful and approachable to users. Our proposed framework employs the signal processing technique of cross-modality feature learning, where data streams from the different components are used to train separate neural network models and are then fused to learn a shared representation. Subsequently, a neural network model trained on this joint feature embedding is used as a classifier to predict advertisement effectiveness. We validate our approach using subjective ratings from a dedicated user study, the sentiment strength of online viewer comments, and a viewer opinion metric: the ratio of the Likes and Views received by each advertisement on an online platform.
    Comment: 11 pages, 5 figures, ICDM 201
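    The fusion pipeline described above can be sketched as a minimal toy example. All names, dimensions and weights below are invented for illustration, not the paper's architecture: each modality gets its own (here, linear) encoder, the outputs are concatenated into a shared representation, and a classifier on that joint embedding scores effectiveness.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def encode(x, w):
        """One hypothetical per-modality encoder: a linear map followed by ReLU."""
        return np.maximum(x @ w, 0.0)

    # Toy features for one advertisement: auditory, visual and textual streams
    # (the feature dimensions 20/64/30 are made up for this sketch).
    audio = rng.normal(size=(1, 20))
    visual = rng.normal(size=(1, 64))
    text = rng.normal(size=(1, 30))

    # Separate encoders per modality, then fusion into a shared representation.
    w_a, w_v, w_t = (rng.normal(size=(d, 8)) for d in (20, 64, 30))
    shared = np.concatenate(
        [encode(audio, w_a), encode(visual, w_v), encode(text, w_t)], axis=1
    )

    # A classifier on the joint embedding predicts effectiveness (sigmoid score).
    w_clf = rng.normal(size=(shared.shape[1], 1))
    score = 1.0 / (1.0 + np.exp(-(shared @ w_clf)))
    print(shared.shape)  # (1, 24): three fused 8-dim modality embeddings
    ```

    In a real system each `encode` would be a trained deep network and the fusion would itself be learned, but the data flow — per-modality models, joint embedding, downstream classifier — is the same.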

    The discursive construction of nostalgic Thai identity through Thai television programmes

    PhD Thesis
    Global pop culture and media consumption have influenced contemporary Thai society and its traditionally perceived national culture and heritage. Such cultural impacts come not only from the global trends of Western cultures, but also from more local, influential Asian cultures, such as those of South Korea and Japan. In response to this, Thai television has been utilised as a principal vehicle for promoting a projected contemporary Thai cultural identity. This study examines the discursive construction of nostalgic Thai identity in popular television variety shows, whereby a form of integrative cultural identity for the country is envisaged. The research covers processes of production and consumption of certain television texts to provide an understanding of the ways in which this Thai identity is constructed, represented and perceived. It intends: a. to analyse the textual features of the shows (including visuals, language, design, narrative etc.); b. to examine the perspectives of key figures in the television industry; and c. to analyse and compare viewers' perceptions of Thai identity representations in the shows in relation to their own lifeworlds. The study adopts qualitative research methods: multimodal discourse analysis, in-depth interviews with key figures in the television industry, and focus groups with various viewer groups. The study finds that the construction of nostalgia around the perception of Thainess has been at the centre of these identity discourses, which promote the country's perceived traditional culture and heritage as core identity markers. In this way, the Thai representation in the shows is constructed in line with reflective nostalgia, which focuses on reflecting the nation's past and culture, rather than restorative nostalgia, which relates to a total restoration of national past/symbols, i.e. an uncompromising return to “origin”.
    In addition to the cultural concerns of the text producers, the television shows have actively been substantiating this notion of Thai identity in line with the government's cultural policy, which envisages contemporary Thai identity as a form of inclusive collective identity incorporating modernism as well as traditionalism, rather than a form of openly nationalistic and exclusionary identity. This representation of Thai identity both arises from and is compatible with the country's socio-cultural and historical circumstances as a way to underpin the maintenance of the traditionality/distinctiveness of Thai culture, while simultaneously integrating a form of multiculturalism. The inclusive vision of Thai identity construction/representation can also be supportive of the international policies of cooperation and relationships between Thailand and other (neighbouring) countries.
    Bangkok University

    Multimodal Cue Detection Engine for Orchestrated Entertainment

    In this paper, we describe a low-delay real-time multimodal cue detection engine for a living room environment. The system is designed to be used in open, unconstrained environments, allowing multiple people to enter, interact and leave the observable world with no constraints. It comprises detection and tracking of up to 4 faces, estimation of head poses and visual focus of attention, detection and localisation of verbal and paralinguistic events, and their association and fusion. The system is designed as a flexible component to be used in conjunction with an orchestrated video conferencing system to improve the overall experience of interaction between spatially separated families and friends. Reduced latency levels achieved to date have shown improved responsiveness of the system.
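    The association step mentioned above — linking a localised audio event to one of the tracked faces — can be illustrated with a toy sketch. The face names, azimuth angles and nearest-angle rule below are hypothetical, not the engine's actual method:

    ```python
    # Hypothetical azimuths (degrees) of tracked faces relative to the camera/mic array.
    faces = {"face_1": -30.0, "face_2": 5.0, "face_3": 40.0}

    # Direction of a detected verbal event, as estimated by audio source localisation.
    event_azimuth = 8.0

    # Associate the event with the face whose estimated direction is closest.
    speaker = min(faces, key=lambda f: abs(faces[f] - event_azimuth))
    print(speaker)  # face_2
    ```

    A production engine would additionally weigh temporal overlap, lip activity and tracking confidence before committing to an association, but nearest-direction matching conveys the core idea.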

    Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed

    Speechreading, or lipreading, is the technique of understanding speech and extracting phonetic features from a speaker's visual cues, such as the movement of the lips, face, teeth and tongue. It has a wide range of multimedia applications, such as surveillance, Internet telephony, and aids for people with hearing impairments. However, most work in speechreading has been limited to generating text from silent videos. Recently, research has started venturing into generating (audio) speech from silent video sequences, but there have been no developments thus far in dealing with divergent views and poses of a speaker. Thus, although multiple camera feeds of a speaker may be available, they have not been exploited to handle different poses. To this end, this paper presents the world's first multi-view speechreading and reconstruction system. This work pushes the boundaries of multimedia research by putting forth a model which leverages silent video feeds from multiple cameras recording the same subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of exploiting multiple camera views in building an efficient speechreading and reconstruction system. The work further identifies the optimal placement of cameras for maximum speech intelligibility, and lays out various innovative applications for the proposed system, focusing on its potential impact not just in the security arena but in many other multimedia analytics problems.
    Comment: 2018 ACM Multimedia Conference (MM '18), October 22--26, 2018, Seoul, Republic of Korea
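    A minimal sketch of the multi-view idea, under stated assumptions: the per-view embeddings, confidence scores and decoder weights below are invented for illustration and do not reproduce the paper's model. Per-camera features are combined by a confidence-weighted sum (e.g. trusting more frontal views) before being decoded to audio features.

    ```python
    import numpy as np

    rng = np.random.default_rng(1)

    # Hypothetical lip-region embeddings from 3 camera views of the same speaker.
    views = rng.normal(size=(3, 16))          # 3 feeds, 16-dim features each

    # Softmax over per-view confidences (e.g. how frontal the detected pose is),
    # then a confidence-weighted sum of the view embeddings.
    confidence = np.array([0.9, 0.4, 0.1])    # frontal view trusted most
    weights = np.exp(confidence) / np.exp(confidence).sum()
    fused = weights @ views                   # (16,) fused representation

    # A speech decoder would map `fused` to audio features; placeholder map here.
    w_dec = rng.normal(size=(16, 5))
    audio_feats = fused @ w_dec
    print(fused.shape, audio_feats.shape)     # (16,) (5,)
    ```

    In a trained system the fusion weights would themselves be learned (e.g. by attention over views), which is one natural way to exploit divergent poses rather than discarding non-frontal feeds.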

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS think-tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective. The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives that measure the performance of multimedia search engines. From a socio-economic perspective, we take stock of the impact and legal consequences of these technical advances and point out future directions of research.

    Smartphone picture organization: a hierarchical approach

    We live in a society where the large majority of the population has a camera-equipped smartphone. In addition, hard drives and cloud storage are getting cheaper and cheaper, leading to tremendous growth in stored personal photos. Unlike photo collections captured by a digital camera, which are typically pre-processed by a user who organizes them into event-related folders, smartphone pictures are automatically stored in the cloud. As a consequence, photo collections captured by a smartphone are highly unstructured and, because smartphones are ubiquitous, they present larger variability than pictures captured by a digital camera. To address the need to organize large smartphone photo collections automatically, we propose a new methodology for hierarchical photo organization into topics and topic-related categories. Our approach estimates latent topics in the pictures by applying probabilistic Latent Semantic Analysis, and automatically assigns a name to each topic by relying on a lexical database. Topic-related categories are then estimated using a set of topic-specific Convolutional Neural Networks. To validate our approach, we assemble and make public a large dataset of more than 8,000 smartphone pictures from 40 people. Experimental results demonstrate major user satisfaction with respect to state-of-the-art solutions in terms of organization.
    Peer Reviewed. Preprint.
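    The latent-topic step can be sketched with a minimal probabilistic Latent Semantic Analysis (pLSA) implementation via EM. The "photo-tag" count matrix below is invented for illustration (4 pictures described by 6 visual tags, with two obvious underlying topics); the paper's actual features and dataset are not reproduced here.

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    def plsa(counts, k, iters=50):
        """Minimal pLSA via EM on a document-term count matrix (toy sketch)."""
        n_docs, n_words = counts.shape
        p_z_d = rng.random((n_docs, k)); p_z_d /= p_z_d.sum(1, keepdims=True)
        p_w_z = rng.random((k, n_words)); p_w_z /= p_w_z.sum(1, keepdims=True)
        for _ in range(iters):
            # E-step: responsibilities P(z | d, w), shape (docs, words, topics)
            joint = p_z_d[:, None, :] * p_w_z.T[None, :, :]
            resp = joint / joint.sum(2, keepdims=True)
            # M-step: re-estimate P(z|d) and P(w|z) from expected counts
            weighted = counts[:, :, None] * resp
            p_z_d = weighted.sum(1); p_z_d /= p_z_d.sum(1, keepdims=True)
            p_w_z = weighted.sum(0).T; p_w_z /= p_w_z.sum(1, keepdims=True)
        return p_z_d, p_w_z

    # Toy photo-tag counts: pictures 0-1 share one tag group, 2-3 another.
    counts = np.array([[5, 4, 3, 0, 0, 0],
                       [4, 5, 2, 0, 1, 0],
                       [0, 0, 1, 5, 4, 3],
                       [0, 1, 0, 4, 5, 4]], float)
    p_z_d, p_w_z = plsa(counts, k=2)
    print(p_z_d.round(2))  # each picture dominated by one latent topic
    ```

    In the full pipeline, each recovered topic would then be named via a lexical database and routed to a topic-specific CNN for finer-grained categorisation.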