Multimodal Content Analysis for Effective Advertisements on YouTube
The rapid advances in e-commerce and Web 2.0 technologies have greatly
increased the impact of commercial advertisements on the general public. As a
key enabling technology, a multitude of recommender systems exist that
analyze user features and browsing patterns to recommend appealing
advertisements to users. In this work, we seek to identify the attributes that
characterize an effective advertisement and to recommend a useful set of
features to aid the design and production of commercial
advertisements. We analyze the temporal patterns from multimedia content of
advertisement videos including auditory, visual and textual components, and
study their individual roles and synergies in the success of an advertisement.
The objective of this work is then to measure the effectiveness of an
advertisement, and to recommend a useful set of features to advertisement
designers to make it more successful and approachable to users. Our proposed
framework employs the signal processing technique of cross modality feature
learning where data streams from different components are employed to train
separate neural network models and are then fused together to learn a shared
representation. Subsequently, a neural network model trained on this joint
feature embedding representation is utilized as a classifier to predict
advertisement effectiveness. We validate our approach using subjective ratings
from a dedicated user study, the sentiment strength of online viewer comments,
and a viewer opinion metric of the ratio of the Likes and Views received by
each advertisement from an online platform.
Comment: 11 pages, 5 figures, ICDM 201
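The cross-modality pipeline described above (separate networks per modality stream, fused into a shared joint embedding, then a classifier on top) can be sketched as follows. The dimensions, single-layer encoders, and random weights are illustrative stand-ins, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(x, w):
    """Per-modality encoder: one linear layer + ReLU (stand-in for a trained network)."""
    return np.maximum(x @ w, 0.0)

# Hypothetical feature dimensions for the three modality streams.
n_samples, d_audio, d_visual, d_text, d_embed = 4, 8, 16, 6, 5
audio  = rng.normal(size=(n_samples, d_audio))
visual = rng.normal(size=(n_samples, d_visual))
text   = rng.normal(size=(n_samples, d_text))

# Separate models per modality, then fuse into a shared joint representation.
z_audio  = encode(audio,  rng.normal(size=(d_audio,  d_embed)))
z_visual = encode(visual, rng.normal(size=(d_visual, d_embed)))
z_text   = encode(text,   rng.normal(size=(d_text,   d_embed)))
joint = np.concatenate([z_audio, z_visual, z_text], axis=1)  # shared embedding

# A classifier trained on the joint embedding predicts effectiveness (binary here).
w_clf = rng.normal(size=(joint.shape[1],))
scores = 1.0 / (1.0 + np.exp(-(joint @ w_clf)))  # sigmoid
effective = scores > 0.5
print(joint.shape, effective.shape)
```

In practice each encoder would be a trained neural network and the fusion layer would itself be learned, but the data flow is the same.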
MicroExpNet: An Extremely Small and Fast Model For Expression Recognition From Face Images
This paper is aimed at creating extremely small and fast convolutional neural
networks (CNN) for the problem of facial expression recognition (FER) from
frontal face images. To this end, we employed the popular knowledge
distillation (KD) method and identified two major shortcomings with its use: 1)
a fine-grained grid search is needed for tuning the temperature hyperparameter
and 2) to find the optimal size-accuracy balance, one needs to search for the
final network size (or the compression rate). On the other hand, KD proved
useful for model compression on the FER problem, and we discovered that its
effect becomes more significant as the model size decreases. In
addition, we hypothesized that translation invariance achieved using
max-pooling layers would not be useful for the FER problem as the expressions
are sensitive to small, pixel-wise changes around the eye and the mouth.
However, we have found an intriguing improvement on generalization when
max-pooling is used. We conducted experiments on two widely-used FER datasets,
CK+ and Oulu-CASIA. Our smallest model (MicroExpNet), obtained using knowledge
distillation, is less than 1MB in size and works at 1851 frames per second on
an Intel i7 CPU. Despite being less accurate than the state-of-the-art,
MicroExpNet still provides significant insights for designing a
microarchitecture for the FER problem.
Comment: International Conference on Image Processing Theory, Tools and
Applications (IPTA) 2019 camera-ready version. Code is available at:
https://github.com/cuguilke/microexpne
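The temperature hyperparameter that the abstract says requires a fine-grained grid search enters through the softened softmax of the standard knowledge-distillation loss. A minimal numpy sketch, with hypothetical logits rather than the paper's networks:

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; higher T produces softer class distributions."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T):
    """KL divergence between softened teacher and student distributions,
    scaled by T^2 as in the standard KD formulation."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (T ** 2) * np.sum(p * (np.log(p) - np.log(q)), axis=-1)

teacher = np.array([4.0, 1.0, -2.0])   # hypothetical logits over 3 expression classes
student = np.array([2.5, 0.5, -1.0])

# The grid search mentioned above would sweep T over values like these:
for T in (1.0, 2.0, 4.0, 8.0):
    print(T, float(distillation_loss(student, teacher, T)))
```

The loss is zero when student and teacher agree and grows with their disagreement; the sweep over T is exactly the tuning burden the paper identifies as a shortcoming of KD.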
ADVISE: Symbolism and External Knowledge for Decoding Advertisements
In order to convey the most content in their limited space, advertisements
embed references to outside knowledge via symbolism. For example, a motorcycle
stands for adventure (a positive property the ad wants associated with the
product being sold), and a gun stands for danger (a negative property to
dissuade viewers from undesirable behaviors). We show how to use symbolic
references to better understand the meaning of an ad. We further show how
anchoring ad understanding in general-purpose object recognition and image
captioning improves results. We formulate the ad understanding task as matching
the ad image to human-generated statements that describe the action that the ad
prompts, and the rationale it provides for taking this action. Our proposed
method outperforms the state of the art on this task, and on an alternative
formulation of question-answering on ads. We show additional applications of
our learned representations for matching ads to slogans, and clustering ads
according to their topic, without extra training.
Comment: To appear, Proceedings of the European Conference on Computer Vision
(ECCV)
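One common way to implement the matching formulation above (an ad image matched against human-generated action/rationale statements) is nearest-neighbor ranking in a shared embedding space trained with a triplet ranking loss. The sketch below uses random unit vectors as stand-ins for learned embeddings; it is not the authors' exact model:

```python
import numpy as np

rng = np.random.default_rng(1)

def normalize(v):
    """Project vectors onto the unit sphere so dot products are cosine similarities."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical embeddings: one ad image and three candidate statements
# ("action + rationale" texts), already projected into a shared space.
img = normalize(rng.normal(size=(8,)))
statements = normalize(rng.normal(size=(3, 8)))

# Matching = ranking candidate statements by cosine similarity to the image.
sims = statements @ img
ranking = np.argsort(-sims)
print("best statement index:", int(ranking[0]))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Training objective: push the matching statement closer to the image
    than a mismatched one, by at least `margin`."""
    return max(0.0, margin - anchor @ positive + anchor @ negative)
```

Anchoring in object recognition and captioning, as the abstract describes, would enrich the image side of this space with externally derived features before the matching step.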
Managing heterogeneous cues in social contexts. A holistic approach for social interactions analysis
Social interaction refers to any interaction between two or more individuals, in which information sharing is carried out without any mediating technology. This interaction is a significant part of an individual's socialization and of the experience gained throughout one's lifetime. It is of interest to different disciplines (sociology, psychology, medicine, etc.).
In the context of testing and observational studies, multiple mechanisms are used to study these interactions, such as questionnaires, direct observation and analysis of events by human operators, or a posteriori observation and analysis of recorded events by specialists (psychologists, sociologists, doctors, etc.). However, such mechanisms are expensive in terms of processing time. They require a high level of attention to analyze several cues simultaneously. They depend on the operator (subjectivity of the analysis) and can each target only one facet of the interaction. To address these issues, the social interaction analysis process needs to be automated, bridging the gap between human-based and machine-based social interaction analysis. We therefore propose a holistic approach that integrates multimodal heterogeneous cues and contextual information (complementary "exogenous" data) dynamically and optionally, according to their availability. Such an approach allows many "signals" to be analyzed in parallel (where humans can focus on only one). This analysis can be further enriched with data related to the context of the scene (location, date, type of music, event description, etc.) or to the individuals (name, age, gender, data extracted from their social networks, etc.). The contextual information enriches the modeling of the extracted metadata and gives it a more "semantic" dimension. Managing this heterogeneity is an essential step in implementing a holistic approach. Automating "in vivo" capture and observation using non-intrusive devices, without predefined scenarios, raises issues related to data (i) privacy and security; (ii) heterogeneity; and (iii) volume.
Hence, within the holistic approach, we propose (1) a privacy-preserving comprehensive data model that ensures decoupling between metadata extraction and social interaction analysis methods; (2) a geometric, non-intrusive eye-contact detection method; and (3) a deep model for French meal classification that extracts information from video content. The proposed approach manages heterogeneous cues coming from different modalities as multi-layer sources (visual signals, voice signals, contextual information) at different time scales, with different combinations between layers (the cues are represented as time series). The approach is designed to operate without intrusive devices, in order to capture real behaviors and achieve naturalistic observation. We have deployed the proposed approach on the OVALIE platform, which studies eating behaviors in different real-life contexts and is located at Université Toulouse-Jean Jaurès, France.
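The geometric eye-contact detection mentioned in (2) can be illustrated with a purely angular test: each person's estimated gaze ray must point at the other's head position within some tolerance. The positions, gaze vectors, and 10-degree threshold below are illustrative assumptions, not the thesis's actual parameters:

```python
import numpy as np

def looking_at(gaze_origin, gaze_dir, target, max_angle_deg=10.0):
    """True if the gaze ray from `gaze_origin` along `gaze_dir` points at
    `target` within an angular tolerance (a purely geometric test)."""
    to_target = np.asarray(target, float) - np.asarray(gaze_origin, float)
    to_target /= np.linalg.norm(to_target)
    gaze_dir = np.asarray(gaze_dir, float) / np.linalg.norm(gaze_dir)
    angle = np.degrees(np.arccos(np.clip(gaze_dir @ to_target, -1.0, 1.0)))
    return angle <= max_angle_deg

# Two hypothetical people: 3D head positions and estimated gaze directions.
a_pos, b_pos = np.array([0.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])
a_gaze = np.array([1.0, 0.02, 0.0])    # A looks roughly toward B
b_gaze = np.array([-1.0, 0.0, 0.0])    # B looks straight at A

# Eye contact = mutual gaze: each ray points at the other person.
eye_contact = looking_at(a_pos, a_gaze, b_pos) and looking_at(b_pos, b_gaze, a_pos)
print(eye_contact)  # prints True
```

Because the test needs only head positions and gaze directions, it can run on estimates recovered from ordinary video, which is what makes the method non-intrusive.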
Towards A Robust Group-level Emotion Recognition via Uncertainty-Aware Learning
Group-level emotion recognition (GER) is an inseparable part of human
behavior analysis, aiming to recognize an overall emotion in a multi-person
scene. However, existing methods are devoted to combining diverse emotion
cues while ignoring the inherent uncertainties of unconstrained
environments, such as congestion and occlusion occurring within a group.
Additionally, since only group-level labels are available, inconsistent emotion
predictions among individuals in one group can confuse the network. In this
paper, we propose an uncertainty-aware learning (UAL) method to extract more
robust representations for GER. By explicitly modeling the uncertainty of each
individual, we utilize stochastic embedding drawn from a Gaussian distribution
instead of deterministic point embedding. This representation captures the
probabilities of different emotions and generates diverse predictions through
this stochasticity during the inference stage. Furthermore,
uncertainty-sensitive scores are adaptively assigned as the fusion weights of
individuals' faces within each group. Moreover, we develop an image enhancement
module to enhance the model's robustness against severe noise. The overall
three-branch model, encompassing face, object, and scene components, is guided
by a proportional-weighted fusion strategy and integrates the proposed
uncertainty-aware method to produce the final group-level output. Experimental
results demonstrate the effectiveness and generalization ability of our method
across three widely used databases.
Comment: 11 pages, 3 figures
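The stochastic Gaussian embeddings and uncertainty-sensitive fusion weights described above can be sketched with the usual reparameterization trick. The dimensions, the variance-based weighting rule, and all parameter values are illustrative assumptions rather than the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(2)

def stochastic_embedding(mu, log_var, n_samples=5):
    """Reparameterization: draw embeddings from N(mu, sigma^2) instead of
    using the deterministic point embedding mu."""
    sigma = np.exp(0.5 * log_var)
    eps = rng.normal(size=(n_samples,) + mu.shape)
    return mu + sigma * eps

# Hypothetical per-face predicted Gaussian parameters (3 faces, 4-dim embedding).
mu      = rng.normal(size=(3, 4))
log_var = rng.normal(size=(3, 4)) - 2.0

# Sampling at inference yields diverse predictions from the same inputs.
samples = stochastic_embedding(mu, log_var)          # shape (5, 3, 4)

# Uncertainty-sensitive fusion: faces with lower predicted variance get
# larger weights when aggregating to the group-level representation.
uncertainty = np.exp(log_var).mean(axis=1)           # per-face scalar
weights = (1.0 / uncertainty) / (1.0 / uncertainty).sum()
group_embedding = (weights[:, None] * mu).sum(axis=0)
print(samples.shape, group_embedding.shape)
```

A noisy or occluded face would be predicted with high variance, so its contribution to the group-level emotion is automatically down-weighted, which is the intuition behind the uncertainty-aware fusion.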