On the use of Vision-Language models for Visual Sentiment Analysis: a study on CLIP
This work presents a study on how to exploit the CLIP embedding space to
perform Visual Sentiment Analysis. We experiment with two architectures built
on top of the CLIP embedding space, which we denote by CLIP-E. We train the
CLIP-E models with WEBEmo, the largest publicly available and manually labeled
benchmark for Visual Sentiment Analysis, and perform two sets of experiments.
First, we test on WEBEmo and compare the CLIP-E architectures with
state-of-the-art (SOTA) models and with CLIP Zero-Shot. Second, we perform
cross-dataset evaluation, and test the CLIP-E architectures trained with WEBEmo
on other Visual Sentiment Analysis benchmarks. Our results show that the CLIP-E
approaches outperform SOTA models in WEBEmo fine-grained categorization, and
they also generalize better when tested on datasets that have not been seen
during training. Interestingly, we observed that for the FI dataset, CLIP
Zero-Shot produces better accuracies than SOTA models and CLIP-E trained on
WEBEmo. These results motivate several questions that we discuss in this paper,
such as how we should design new benchmarks and evaluate Visual Sentiment
Analysis, and whether we should keep designing tailored Deep Learning models
for Visual Sentiment Analysis or focus our efforts on better using the
knowledge encoded in large vision-language models such as CLIP for this task.
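The abstract does not give implementation details for the CLIP-E architectures. Purely as an illustrative sketch of one of the two ingredients being compared, CLIP zero-shot classification reduces to cosine similarity between an image embedding and class-prompt text embeddings; the vectors, dimensions, and prompts below are mock values, not real CLIP outputs (a CLIP-E-style model would instead train a small head on the same embedding space).

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Zero-shot decision: pick the class prompt whose L2-normalized
    embedding has the highest cosine similarity with the image embedding."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return int(np.argmax(txt @ img))

# Toy stand-ins for CLIP embeddings (dimension 4 instead of CLIP's 512/768).
text_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],   # e.g. "a photo evoking a positive emotion"
    [0.0, 1.0, 0.0, 0.0],   # e.g. "a photo evoking a negative emotion"
])
image_emb = np.array([0.9, 0.1, 0.0, 0.0])
print(zero_shot_classify(image_emb, text_embs))  # 0 (closest to first prompt)
```

With real CLIP, the same decision rule is applied to the encoder outputs; no training data is needed, which is what makes the zero-shot comparison on unseen datasets meaningful.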
Saliency prediction in 360° architectural scenes: Performance and impact of daylight variations
Saliency models are image-based prediction models that estimate human visual attention. Such models, when applied to architectural spaces, could pave the way for design decisions where visual attention is taken into account. In this study, we tested the performance of eleven commonly used saliency models that combine traditional and deep learning methods on 126 rendered interior scenes with associated head tracking data. The data was extracted from three experiments conducted in virtual reality between 2016 and 2018. Two of these datasets pertain to the perceptual effects of daylight and include variations of daylighting conditions for a limited set of interior spaces, thereby allowing us to test the influence of light conditions on human head movement. Ground truth maps were extracted from the collected head tracking logs, and the prediction accuracy of the models was tested via the correlation coefficient between ground truth and prediction maps. To address the possible inflation of results due to the equator bias, we conducted complementary analyses by restricting the area of investigation to the equatorial image regions. Although limited to immersive virtual environments, the promising performance of some traditional models such as GBVS360eq and BMS360eq for colored and textured architectural rendered spaces offers us the prospect of their possible integration into design tools. We also observed a strong correlation in head movements for the same space lit by different types of sky, a finding whose generalization requires further investigations based on datasets more specifically developed to address this question.
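The evaluation described above can be sketched in a few lines: a Pearson correlation coefficient between prediction and ground-truth maps, and a restriction to the equatorial band of an equirectangular image to control for the equator bias. The helper names, band fraction, and synthetic maps below are invented for the sketch; real 360° metrics typically also weight pixels by solid angle, which is omitted here.

```python
import numpy as np

def saliency_cc(pred, gt):
    """Pearson correlation coefficient between a predicted saliency map
    and a ground-truth attention map (both 2-D arrays)."""
    p = (pred - pred.mean()) / pred.std()
    g = (gt - gt.mean()) / gt.std()
    return float((p * g).mean())

def equatorial_band(sal_map, frac=0.5):
    """Keep only the central latitude band of an equirectangular map,
    a crude way to exclude pole regions inflated by the equator bias."""
    h = sal_map.shape[0]
    lo = int(h * (1 - frac) / 2)
    return sal_map[lo:h - lo, :]

# Synthetic example: a "prediction" that tracks the ground truth closely.
rng = np.random.default_rng(0)
gt = rng.random((64, 128))
pred = gt + 0.1 * rng.random((64, 128))
print(saliency_cc(pred, gt) > 0.9)                                      # True
print(saliency_cc(equatorial_band(pred), equatorial_band(gt)) > 0.9)    # True
```

Comparing the full-map and equatorial-band scores for the same model is what reveals whether its apparent accuracy comes from genuinely matching attention or from the shared central tendency of 360° viewing.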
Affective Image Content Analysis: Two Decades Review and New Perspectives
Images can convey rich semantics and induce various emotions in viewers.
Recently, with the rapid advancement of emotional intelligence and the
explosive growth of visual data, extensive research efforts have been dedicated
to affective image content analysis (AICA). In this survey, we will
comprehensively review the development of AICA in the recent two decades,
especially focusing on the state-of-the-art methods with respect to three main
challenges -- the affective gap, perception subjectivity, and label noise and
absence. We begin with an introduction to the key emotion representation models
that have been widely employed in AICA and a description of available datasets
for performing evaluation with quantitative comparison of label noise and
dataset bias. We then summarize and compare the representative approaches on
(1) emotion feature extraction, including both handcrafted and deep features,
(2) learning methods on dominant emotion recognition, personalized emotion
prediction, emotion distribution learning, and learning from noisy data or few
labels, and (3) AICA based applications. Finally, we discuss some challenges
and promising research directions in the future, such as image content and
context understanding, group emotion clustering, and viewer-image interaction.
Comment: Accepted by IEEE TPAMI
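Among the learning settings the survey lists, emotion distribution learning treats an image's label as a distribution over emotion categories rather than a single class. As a minimal, hypothetical illustration (the numbers are invented; Mikels' eight emotion categories are one representation commonly used in AICA), a predicted distribution can be scored against the annotators' label distribution with KL divergence:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete emotion distributions,
    with a small epsilon to avoid log(0)."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical annotator label distribution over eight emotion categories
# (e.g. the fraction of viewers choosing each emotion for one image).
labels = [0.40, 0.25, 0.10, 0.05, 0.05, 0.05, 0.05, 0.05]
good   = [0.35, 0.30, 0.10, 0.05, 0.05, 0.05, 0.05, 0.05]  # close prediction
bad    = [0.05, 0.05, 0.05, 0.05, 0.10, 0.10, 0.25, 0.35]  # mass inverted
print(kl_divergence(labels, good) < kl_divergence(labels, bad))  # True
```

Scoring distributions rather than single labels is one way the field addresses the perception-subjectivity challenge noted above: disagreement among viewers becomes part of the target instead of label noise.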
WSCNet: Weakly Supervised Coupled Networks for Visual Sentiment Classification and Detection
Automatic assessment of sentiment from visual content has gained considerable attention with the increasing tendency of expressing opinions online. In this paper, we solve the problem of visual sentiment analysis, which is challenging due to the high-level abstraction in the recognition process. Existing methods based on convolutional neural networks learn sentiment representations from the holistic image, despite the fact that different image regions can have different influence on the evoked sentiment. In this paper, we introduce a weakly supervised coupled convolutional network (WSCNet). Our method is dedicated to automatically selecting relevant soft proposals from weak annotations (e.g., global image labels), thereby significantly reducing the annotation burden, and encompasses the following contributions. First, WSCNet detects a sentiment-specific soft map by training a fully convolutional network with the cross-spatial pooling strategy in the detection branch. Second, both the holistic and localized information are utilized by coupling the sentiment map with deep features for robust representation in the classification branch. We integrate the sentiment detection and classification branches into a unified deep framework, and optimize the network in an end-to-end way. Through this joint learning strategy, weakly supervised sentiment classification and detection benefit each other. Extensive experiments demonstrate that the proposed WSCNet outperforms the state-of-the-art results on seven benchmark datasets.
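The exact WSCNet layers and its cross-spatial pooling strategy are defined in the paper; the following is only a structural sketch of the coupling idea, with plain average pooling standing in for cross-spatial pooling and all shapes, names, and weights invented for illustration: class-wise response maps are pooled into class scores, combined into a single class-weighted sentiment map, and that map re-weights the backbone features before classification.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def coupled_forward(feat, det_w):
    """Structural sketch of coupling a class response map with features.

    feat  : (H, W, D) feature maps from a convolutional backbone
    det_w : (D, C) 1x1-conv weights producing C class response maps
    """
    resp = feat @ det_w                     # (H, W, C) class response maps
    scores = resp.mean(axis=(0, 1))         # spatial pooling -> class scores
    probs = softmax(scores)                 # detection-branch class posteriors
    sent_map = (resp * probs).sum(axis=-1)  # class-weighted sentiment map (H, W)
    coupled = feat * sent_map[..., None]    # re-weight features by the map
    return probs, sent_map, coupled

rng = np.random.default_rng(1)
feat = rng.random((7, 7, 16))
det_w = rng.random((16, 2))
probs, sent_map, coupled = coupled_forward(feat, det_w)
print(probs.shape, sent_map.shape, coupled.shape)  # (2,) (7, 7) (7, 7, 16)
```

The point of the structure is that only image-level labels supervise `probs`, yet the intermediate `sent_map` localizes sentiment-relevant regions as a by-product, which is what "weakly supervised" buys here.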
Machine Learning in Resource-constrained Devices: Algorithms, Strategies, and Applications
The ever-increasing growth of technologies is changing people's everyday life. As a major consequence: 1) the amount of available data is growing and 2) several applications rely on battery-supplied devices that are required to process data in real time. In this scenario, the need for ad-hoc strategies for the development of low-power and low-latency intelligent systems capable of learning inductive rules from data using a modest amount of computational resources is becoming vital. At the same time, one needs to develop specific methodologies to manage complex patterns such as text and images.
This Thesis presents different approaches and techniques for the development of fast learning models explicitly designed to be hosted on embedded systems. The proposed methods proved able to achieve state-of-the-art performance in terms of the trade-off between generalization capabilities and area requirements when implemented in low-cost digital devices. In addition, advanced strategies for efficient sentiment analysis in text and images are proposed.