Learning Aligned Cross-Modal Representations from Weakly Aligned Data
People can recognize scenes across many different modalities beyond natural
images. In this paper, we investigate how to learn cross-modal scene
representations that transfer across modalities. To study this problem, we
introduce a new cross-modal scene dataset. While convolutional neural networks
can categorize cross-modal scenes well, they also learn an intermediate
representation not aligned across modalities, which is undesirable for
cross-modal transfer applications. We present methods to regularize cross-modal
convolutional neural networks so that they have a shared representation that is
agnostic of the modality. Our experiments suggest that our scene representation
can help transfer representations across modalities for retrieval. Moreover,
our visualizations suggest that units emerge in the shared representation that
tend to activate on consistent concepts independently of the modality.
Comment: Conference paper at CVPR 201
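As a rough sketch of the alignment idea in this abstract: one simple way to push a network toward a modality-agnostic shared representation is to penalize the statistical gap between embeddings of the two modalities. The function below is an illustrative proxy (a squared distance between per-modality mean embeddings), not the paper's actual regularizer; all names and the toy data are assumptions.

```python
import numpy as np

def modality_alignment_penalty(emb_a, emb_b):
    """Squared distance between per-modality mean embeddings.

    A minimal statistical-alignment proxy for the kind of
    modality-agnostic regularizer the abstract describes; the
    paper's actual losses are not reproduced here.
    """
    gap = emb_a.mean(axis=0) - emb_b.mean(axis=0)
    return float(np.dot(gap, gap))

# Toy embeddings: two modalities drawn from the same feature
# distribution, versus one modality systematically shifted.
rng = np.random.default_rng(0)
aligned = modality_alignment_penalty(rng.normal(size=(64, 8)),
                                     rng.normal(size=(64, 8)))
shifted = modality_alignment_penalty(rng.normal(size=(64, 8)),
                                     rng.normal(size=(64, 8)) + 3.0)
```

Minimizing such a penalty alongside the classification loss would encourage the intermediate layers to ignore which modality an input came from.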
Relieving Triplet Ambiguity: Consensus Network for Language-Guided Image Retrieval
Language-guided image retrieval enables users to search for images and
interact with the retrieval system more naturally and expressively by using a
reference image and a relative caption as a query. Most existing studies mainly
focus on designing image-text composition architecture to extract
discriminative visual-linguistic relations. Despite great success, we identify
an inherent problem that obstructs the extraction of discriminative features
and considerably compromises model training: \textbf{triplet ambiguity}. This
problem stems from the annotation process wherein annotators view only one
triplet at a time. As a result, they often describe simple attributes, such as
color, while neglecting fine-grained details like location and style. This
leads to multiple false-negative candidates matching the same modification
text. We propose a novel Consensus Network (Css-Net) that self-adaptively
learns from noisy triplets to minimize the negative effects of triplet
ambiguity. Inspired by the psychological finding that groups perform better
than individuals, Css-Net comprises 1) a consensus module featuring four
distinct compositors that generate diverse fused image-text embeddings and 2) a
Kullback-Leibler divergence loss, which fosters learning among the compositors,
enabling them to reduce biases learned from noisy triplets and reach a
consensus. The decisions from four compositors are weighted during evaluation
to further achieve consensus. Comprehensive experiments on three datasets
demonstrate that Css-Net can alleviate triplet ambiguity, achieving competitive
performance on benchmarks, such as R@10 and R@50 on
FashionIQ.
Comment: 11 page
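The consensus mechanism described above can be sketched numerically: several compositors each produce a distribution, and a KL-divergence term pulls them toward their mean. The exact direction of the KL term and the weighting in Css-Net are not given in the abstract, so both are assumptions here.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def consensus_kl(logits_list):
    """Mean KL divergence of each compositor's distribution to the
    consensus (average) distribution. A sketch of the consensus idea
    only; Css-Net's exact loss direction and weights are assumptions.
    """
    probs = [softmax(z) for z in logits_list]
    consensus = np.mean(probs, axis=0)
    kls = [np.sum(p * np.log(p / consensus)) for p in probs]
    return float(np.mean(kls))

# Four compositors that agree incur no loss; disagreement is penalized.
agree = consensus_kl([np.array([2.0, 0.5, -1.0])] * 4)
disagree = consensus_kl([np.array([2.0, 0.0, 0.0]),
                         np.array([0.0, 2.0, 0.0]),
                         np.array([0.0, 0.0, 2.0]),
                         np.array([1.0, 1.0, -2.0])])
```

Driving this term down is one way to let compositors that latched onto different noisy triplets correct one another.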
Affective Music Information Retrieval
Much of the appeal of music lies in its power to convey emotions/moods and to
evoke them in listeners. In consequence, the past decade witnessed a growing
interest in modeling emotions from musical signals in the music information
retrieval (MIR) community. In this article, we present a novel generative
approach to music emotion modeling, with a specific focus on the
valence-arousal (VA) dimension model of emotion. The presented generative
model, called \emph{acoustic emotion Gaussians} (AEG), better accounts for the
subjectivity of emotion perception by the use of probability distributions.
Specifically, it learns from the emotion annotations of multiple subjects a
Gaussian mixture model in the VA space with prior constraints on the
corresponding acoustic features of the training music pieces. Such a
computational framework is technically sound, capable of learning in an online
fashion, and thus applicable to a variety of applications, including
user-independent (general) and user-dependent (personalized) emotion
recognition and emotion-based music retrieval. We report evaluations of the
aforementioned applications of AEG on a large-scale emotion-annotated corpus,
AMG1608, to demonstrate the effectiveness of AEG and to showcase how
evaluations are conducted for research on emotion-based MIR. Directions of
future work are also discussed.
Comment: 40 pages, 18 figures, 5 tables, author versio
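To make the valence-arousal mixture idea concrete, the snippet below evaluates a 2-D Gaussian mixture density over the VA plane. The component weights, means, and covariances are invented for illustration; AEG's actual training procedure and learned parameters are not reproduced.

```python
import numpy as np

def va_density(point, weights, means, covs):
    """Density of a 2-D Gaussian mixture over valence-arousal space.

    Parameters here are illustrative assumptions standing in for the
    GMM that AEG would learn from multiple subjects' annotations.
    """
    total = 0.0
    for w, mu, cov in zip(weights, means, covs):
        d = np.asarray(point, float) - np.asarray(mu, float)
        inv = np.linalg.inv(cov)
        norm = 1.0 / (2.0 * np.pi * np.sqrt(np.linalg.det(cov)))
        total += w * norm * np.exp(-0.5 * d @ inv @ d)
    return total

weights = [0.6, 0.4]
means = [(0.5, 0.5), (-0.4, -0.2)]           # e.g. high-VA vs low-VA regions
covs = [np.eye(2) * 0.05, np.eye(2) * 0.08]  # per-component spread
p_near = va_density((0.5, 0.5), weights, means, covs)   # at a mode
p_far = va_density((3.0, -3.0), weights, means, covs)   # far from both
```

A distribution over VA points, rather than a single predicted point, is what lets the model express the subjectivity of emotion perception.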
A Framework for Personalized Utility-Aware IP-Based Multimedia Consumption
Providing transparent and augmented use of multimedia content across a wide range of networks and devices is still a challenging task within the multimedia research community. Multimedia adaptation has emerged as a core concept to overcome this issue. Most multimedia adaptation engines for providing Universal Multimedia Access (UMA) scale the content under consideration of terminal capabilities and resource constraints but do not really consider individual user preferences. This paper introduces an adaptive multimedia framework which offers the user a personalized content variation satisfying his/her individual utility preferences.
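A minimal sketch of utility-aware variation selection, assuming an additive preference model and made-up field names (the paper's actual framework and data model are not described in the abstract): among the variations that fit the terminal's resource constraint, pick the one maximizing the user's weighted utility.

```python
def pick_variation(variations, bandwidth_kbps, prefs):
    """Choose the content variation with the highest user utility
    that fits within the available bandwidth.

    The additive utility model and all field names are illustrative
    assumptions, not the paper's API.
    """
    feasible = [v for v in variations if v["bitrate"] <= bandwidth_kbps]
    if not feasible:
        return None
    def utility(v):
        return sum(prefs.get(k, 0.0) * v["features"].get(k, 0.0)
                   for k in prefs)
    return max(feasible, key=utility)

variations = [
    {"name": "sd_video",   "bitrate": 800,  "features": {"visual": 0.6, "audio": 0.5}},
    {"name": "hd_video",   "bitrate": 3000, "features": {"visual": 1.0, "audio": 0.7}},
    {"name": "audio_only", "bitrate": 128,  "features": {"visual": 0.0, "audio": 0.9}},
]
prefs = {"visual": 0.2, "audio": 0.8}  # this user cares mostly about audio
best = pick_variation(variations, bandwidth_kbps=1000, prefs=prefs)
```

Here the 1000 kbps constraint rules out the HD variant, and the audio-weighted preference makes the audio-only variation win over SD video, which is exactly the personalization the abstract argues plain capability-based scaling misses.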
Affective games: a multimodal classification system
Affective gaming is a relatively new field of research that exploits human emotions to influence gameplay for an enhanced player experience. Changes in a player's psychology are reflected in their behaviour and physiology, hence recognition of such variation is a core element in affective games. Complementary sources of affect offer more reliable recognition, especially in contexts where one modality is partial or unavailable. As a multimodal recognition system, affect-aware games are subject to the practical difficulties met by traditional trained classifiers. In addition, inherent game-related challenges in terms of data collection and performance arise while attempting to sustain an acceptable level of immersion. Most existing scenarios employ sensors that offer limited freedom of movement, resulting in less realistic experiences. Recent advances now offer technology that allows players to communicate more freely and naturally with the game and, furthermore, control it without the use of input devices. However, the affective game industry is still in its infancy and definitely needs to catch up with the current life-like level of adaptation provided by graphics and animation.
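One common way to combine complementary affect sources so that a missing modality degrades rather than breaks recognition is reliability-weighted late fusion. The sketch below assumes per-modality class scores and hand-picked reliability weights; it illustrates the multimodal-fusion idea generically, not the specific classifier in this work.

```python
def fuse_affect_scores(modality_scores, reliabilities):
    """Late fusion over whichever modalities are available:
    a reliability-weighted average of per-modality class scores.

    Weights, modality names, and class labels are illustrative
    assumptions for the general technique.
    """
    fused, total_w = {}, 0.0
    for modality, scores in modality_scores.items():
        if scores is None:                 # modality unavailable right now
            continue
        w = reliabilities.get(modality, 1.0)
        total_w += w
        for label, s in scores.items():
            fused[label] = fused.get(label, 0.0) + w * s
    return {k: v / total_w for k, v in fused.items()} if total_w else {}

frame = {
    "facial": {"frustrated": 0.7, "engaged": 0.3},
    "physio": {"frustrated": 0.6, "engaged": 0.4},
    "voice":  None,                        # player is silent: modality missing
}
fused = fuse_affect_scores(frame, {"facial": 0.5, "physio": 1.0})
```

Because the silent voice channel is simply skipped and the remaining weights renormalized, the fused estimate stays a valid distribution even with partial input.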
Automatic semantic video annotation in wide domain videos based on similarity and commonsense knowledgebases
In this paper, we introduce a novel framework for automatic Semantic Video Annotation. Because this framework detects possible events occurring in video clips, it can form the annotation base of a video search engine. To achieve this purpose, the system has to be able to operate on uncontrolled wide-domain videos, so all layers have to be based on generic features.
This framework aims to bridge the "semantic gap", which is the difference between low-level visual features and human perception, by finding videos with similar visual events, then analyzing their free-text annotations to find a common area, and finally deciding the best description for the new video using commonsense knowledgebases.
Experiments were performed on wide-domain video clips from the TRECVID 2005 BBC rush standard database. Results from these experiments show promising integration between these two layers in finding expressive annotations for the input video. These results were evaluated based on retrieval performance.
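The annotation-by-consensus step can be sketched as follows: given the free-text annotations of the most visually similar videos, extract the content words they agree on. This is a simplified stand-in using word counts; the paper's commonsense-knowledgebase reasoning is not reproduced, and the example texts and stopword list are assumptions.

```python
from collections import Counter

STOPWORDS = frozenset({"a", "the", "in", "is", "of", "on", "with", "and"})

def consensus_annotation(neighbor_texts, top_k=3):
    """Return the most frequent content words across the free-text
    annotations of visually similar videos -- a simplified stand-in
    for the commonsense-based description step described above.
    """
    counts = Counter()
    for text in neighbor_texts:
        counts.update(w for w in text.lower().split() if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_k)]

# Hypothetical annotations of the three nearest-neighbour clips.
neighbors = [
    "a reporter walking in the street",
    "reporter interviews people on a street",
    "street scene with a reporter and crowd",
]
top = consensus_annotation(neighbors)
```

Words shared by all neighbours ("reporter", "street") dominate the consensus, which is the "common area" the framework then refines with commonsense knowledge.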