4,068 research outputs found
Just-in-Time Multimodal Association and Fusion from Home Entertainment
In this paper, we describe a real-time multimodal analysis system with just-in-time multimodal association and fusion for a living room environment, where multiple people may enter, interact and leave the observable world with no constraints. It comprises detection and tracking of up to 4 faces, detection and localisation of verbal and paralinguistic events, and their association and fusion. The system is designed to be used in open, unconstrained environments, such as next-generation video conferencing systems that automatically "orchestrate" the transmitted video streams to improve the overall experience of interaction between spatially separated families and friends. Performance levels achieved to date on a hand-labelled dataset show sufficient reliability while fulfilling real-time processing requirements.
Multimodal Content Analysis for Effective Advertisements on YouTube
The rapid advances in e-commerce and Web 2.0 technologies have greatly
increased the impact of commercial advertisements on the general public. As a
key enabling technology, a multitude of recommender systems exist that
analyze user features and browsing patterns to recommend appealing
advertisements to users. In this work, we seek to study the attributes
that characterize an effective advertisement and recommend a useful
set of features to aid the designing and production processes of commercial
advertisements. We analyze the temporal patterns from multimedia content of
advertisement videos including auditory, visual and textual components, and
study their individual roles and synergies in the success of an advertisement.
The objective of this work is then to measure the effectiveness of an
advertisement, and to recommend a useful set of features to advertisement
designers to make it more successful and approachable to users. Our proposed
framework employs the signal processing technique of cross modality feature
learning where data streams from different components are employed to train
separate neural network models and are then fused together to learn a shared
representation. Subsequently, a neural network model trained on this joint
feature embedding representation is utilized as a classifier to predict
advertisement effectiveness. We validate our approach using subjective ratings
from a dedicated user study, the sentiment strength of online viewer comments,
and a viewer opinion metric of the ratio of the Likes and Views received by
each advertisement from an online platform.
Comment: 11 pages, 5 figures, ICDM 201
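The cross-modality feature learning pipeline described above (separate per-modality networks whose outputs are fused into a shared representation, followed by a classifier) can be sketched as follows. This is a minimal NumPy illustration under assumed conditions: the feature dimensions, the single tanh layer per modality, and the untrained random weights are all illustrative, not the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-modality feature vectors for one advertisement (dimensions arbitrary).
audio  = rng.normal(size=20)   # auditory features
visual = rng.normal(size=50)   # visual features
text   = rng.normal(size=30)   # textual features

# Separate "networks" per modality: here a single random linear layer + tanh,
# each projecting its modality into an 8-dimensional embedding.
W_a, W_v, W_t = (rng.normal(size=(d, 8)) for d in (20, 50, 30))
z_a = np.tanh(audio @ W_a)
z_v = np.tanh(visual @ W_v)
z_t = np.tanh(text @ W_t)

# Fusion: concatenate the modality embeddings into a joint shared representation.
joint = np.concatenate([z_a, z_v, z_t])   # shape (24,)

# A classifier on the joint embedding predicts effectiveness (untrained here;
# in the paper this would be a neural network trained on the fused features).
w_cls = rng.normal(size=joint.shape[0])
p_effective = 1.0 / (1.0 + np.exp(-(joint @ w_cls)))
print(joint.shape, float(p_effective))
```

In practice each encoder would be a trained deep network and the fusion layer itself would be learned, but the data flow (encode each stream separately, concatenate, classify) is the same.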
The discursive construction of nostalgic Thai identity through Thai television programmes
PhD Thesis
Global pop culture and media consumption have influenced contemporary Thai
society and its traditionally perceived national culture and heritage. Such cultural
impacts are not only from the global trends of Western cultures, but also from the
more local influential Asian cultures, such as South Korean and Japanese. In
response to this, within a general strategy, Thai television has been utilised as a
principal source for promoting a projected contemporary Thai cultural identity. This
study examines the discursive construction of nostalgic Thai identity in popular
television variety shows whereby a form of integrative cultural identity for the country
is envisaged. The research covers processes of production and consumption of
certain television texts to provide an understanding of the ways in which this Thai
identity is constructed, represented and perceived. It intends: a. to analyse the
textual features of the shows (including visuals, language, design, narrative etc.); b.
to examine the perspectives of the key figures in the television industry; and c. to
analyse and compare the viewership's perceptions of Thai identity representations in
the shows in relation to their own lifeworlds. The study adopts qualitative research
methods: multimodal discourse analysis, in-depth interviews with key figures in the
television industry, as well as focus groups with various viewerships. The study finds
that the construction of nostalgia around the perception of Thainess has been at the
centre of these particular identity discourses, which promote the country's perceived
traditional culture and heritage as core identity markers. In this way, the Thai
representation in the shows is constructed in line with the reflective nostalgia which
focuses on reflecting the nation's past and culture, rather than a restorative nostalgia
which relates to total restoration of national past/symbols i.e. an uncompromising
return to 'origin'. In addition to the cultural concerns of the text producers, the
television shows have actively been substantiating this notion of Thai identity in line
with the government's cultural policy, which envisages contemporary Thai identity as
a form of inclusive collective identity incorporating modernism as well as
traditionalism, rather than a form of openly nationalistic and exclusionary identity.
This representation of Thai identity both arises from and is compatible with the
country's socio-cultural and historical circumstances as a way to underpin the
maintenance of the traditionality/distinctiveness of Thai culture, while simultaneously
integrating a form of multiculturalism. The inclusive vision of Thai identity
construction/representation can also be supportive of the international policies of
cooperation and relationships between Thailand and other (neighbouring) countries.
Bangkok University
Multimodal Cue Detection Engine for Orchestrated Entertainment
In this paper, we describe a low-delay real-time multimodal cue detection engine for a living room environment. The system is designed to be used in open, unconstrained environments, allowing multiple people to enter, interact and leave the observable world with no constraints. It comprises detection and tracking of up to 4 faces, estimation of head poses and visual focus of attention, detection and localisation of verbal and paralinguistic events, and their association and fusion. The system is designed as a flexible component to be used in conjunction with an orchestrated video conferencing system to improve the overall experience of interaction between spatially separated families and friends. Reduced latency achieved to date has shown improved responsiveness of the system.
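A core step in engines like the one above is associating a localised audio event with one of the tracked faces. A minimal sketch of such an association, assuming both cues are reduced to an azimuth angle relative to the camera (the face IDs, angles and the 15° gating threshold are hypothetical, not values from the paper):

```python
# Tracked faces: id -> azimuth in degrees relative to the camera (hypothetical).
faces = {"face_1": -30.0, "face_2": 5.0, "face_3": 40.0}

def associate(audio_azimuth, faces, max_gap=15.0):
    """Assign a localised verbal event to the angularly closest tracked face.

    Returns None when no face is within max_gap degrees, i.e. the event
    cannot be reliably associated with any visible person.
    """
    face_id, az = min(faces.items(), key=lambda kv: abs(kv[1] - audio_azimuth))
    return face_id if abs(az - audio_azimuth) <= max_gap else None

print(associate(8.0, faces))    # -> face_2 (nearest face, within the gate)
print(associate(-70.0, faces))  # -> None (too far from any tracked face)
```

A real system would fuse this spatial gating with temporal cues (lip activity, voice activity) and track association confidence over time, but nearest-angle gating is the basic building block.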
Harnessing AI for Speech Reconstruction using Multi-view Silent Video Feed
Speechreading or lipreading is the technique of understanding and getting
phonetic features from a speaker's visual features such as movement of lips,
face, teeth and tongue. It has a wide range of multimedia applications such as
in surveillance, Internet telephony, and as an aid to a person with hearing
impairments. However, most of the work in speechreading has been limited to
text generation from silent videos. Recently, research has started venturing
into generating (audio) speech from silent video sequences but there have been
no developments thus far in dealing with the divergent views and poses of a
speaker. Thus, although multiple camera feeds of a speaker's speech may be
available, they have not been used to handle these different poses. To this
end, this paper presents the world's first ever
multi-view speech reading and reconstruction system. This work pushes the
boundaries of multimedia research by putting forth a model which leverages
silent video feeds from multiple cameras recording the same subject to generate
intelligible speech for a speaker. Initial results confirm the usefulness of
exploiting multiple camera views in building an efficient speech reading and
reconstruction system. It further shows the optimal placement of cameras which
would lead to the maximum intelligibility of speech. Next, it lays out various
innovative applications for the proposed system focusing on its potential
prodigious impact not just in the security arena but in many other multimedia
analytics problems.
Comment: 2018 ACM Multimedia Conference (MM '18), October 22--26, 2018, Seoul,
Republic of Korea
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS think-tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective.
The technical perspective includes an up-to-date view on content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines.
From a socio-economic perspective, we take stock of the impact and legal consequences of these technical advances and point out future directions of research.
Smartphone picture organization: a hierarchical approach
We live in a society where the large majority of the population has a camera-equipped smartphone. In addition, hard drives and cloud storage are getting cheaper and cheaper, leading to a tremendous growth in stored personal photos. Unlike photo collections captured by a digital camera, which are typically pre-processed by the user, who organizes them into event-related folders, smartphone pictures are automatically stored in the cloud. As a consequence, photo collections captured by a smartphone are highly unstructured, and because smartphones are ubiquitous, they present a larger variability compared to pictures captured by a digital camera. To address the need to organize large smartphone photo collections automatically, we propose here a new methodology for hierarchical photo organization into topics and topic-related categories. Our approach successfully estimates latent topics in the pictures by applying probabilistic Latent Semantic Analysis, and automatically assigns a name to each topic by relying on a lexical database. Topic-related categories are then estimated by using a set of topic-specific Convolutional Neural Networks. To validate our approach, we assemble and make public a large dataset of more than 8,000 smartphone pictures from 40 persons. Experimental results demonstrate major user satisfaction with respect to state-of-the-art solutions in terms of organization.
Peer Reviewed Preprint
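The topic-estimation step named above, probabilistic Latent Semantic Analysis (pLSA), can be sketched with a small EM loop. This is a generic pLSA implementation on an assumed toy photo-by-visual-word count matrix (the data, the number of topics K=3, and the iteration count are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy photo-by-visual-word count matrix: 6 photos x 5 visual words (hypothetical).
X = np.array([[4, 3, 0, 0, 1],
              [5, 2, 1, 0, 0],
              [0, 1, 4, 3, 0],
              [0, 0, 5, 4, 1],
              [1, 0, 0, 1, 6],
              [0, 1, 1, 0, 5]], dtype=float)
n_docs, n_words = X.shape
K = 3  # number of latent topics

# Random initialisation of P(w|z) and P(z|d), each row a probability vector.
p_wz = rng.random((K, n_words)); p_wz /= p_wz.sum(axis=1, keepdims=True)
p_zd = rng.random((n_docs, K));  p_zd /= p_zd.sum(axis=1, keepdims=True)

for _ in range(100):
    # E-step: responsibilities P(z|d,w) ∝ P(z|d) * P(w|z), shape (docs, words, K).
    joint = p_zd[:, None, :] * p_wz.T[None, :, :]
    resp = joint / joint.sum(axis=2, keepdims=True)
    # M-step: re-estimate P(w|z) and P(z|d) from count-weighted responsibilities.
    weighted = X[:, :, None] * resp
    p_wz = weighted.sum(axis=0).T
    p_wz /= p_wz.sum(axis=1, keepdims=True)
    p_zd = weighted.sum(axis=1)
    p_zd /= p_zd.sum(axis=1, keepdims=True)

# Hard topic assignment per photo; in the paper each topic then gets a name
# from a lexical database and its own CNN for category-level organization.
topics = p_zd.argmax(axis=1)
print(topics)
```

On real photo collections, X would come from visual-word or tag histograms per image, and K would be chosen by model selection rather than fixed in advance.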
- …