
    On the use of Vision-Language models for Visual Sentiment Analysis: a study on CLIP

    This work presents a study on how to exploit the CLIP embedding space to perform Visual Sentiment Analysis. We experiment with two architectures built on top of the CLIP embedding space, which we denote by CLIP-E. We train the CLIP-E models with WEBEmo, the largest publicly available and manually labeled benchmark for Visual Sentiment Analysis, and perform two sets of experiments. First, we test on WEBEmo and compare the CLIP-E architectures with state-of-the-art (SOTA) models and with CLIP Zero-Shot. Second, we perform cross-dataset evaluation and test the CLIP-E architectures trained with WEBEmo on other Visual Sentiment Analysis benchmarks. Our results show that the CLIP-E approaches outperform SOTA models in WEBEmo fine-grained categorization, and they also generalize better when tested on datasets not seen during training. Interestingly, we observed that for the FI dataset, CLIP Zero-Shot produces better accuracy than SOTA models and CLIP-E trained on WEBEmo. These results motivate several questions that we discuss in this paper, such as how we should design new benchmarks and evaluate Visual Sentiment Analysis, and whether we should keep designing tailored Deep Learning models for Visual Sentiment Analysis or focus our efforts on better using the knowledge encoded in large vision-language models such as CLIP for this task.
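
    As a rough illustration of the general idea (and not the authors' CLIP-E architectures), the sketch below trains a linear head on frozen CLIP image embeddings and contrasts it with a zero-shot baseline built from text prompts. It assumes the openai/CLIP package; the class name, embedding dimension, and example prompts are placeholders introduced for illustration only.

    import torch
    import torch.nn as nn
    import clip  # openai/CLIP: pip install git+https://github.com/openai/CLIP.git

    device = "cuda" if torch.cuda.is_available() else "cpu"
    clip_model, preprocess = clip.load("ViT-B/32", device=device)
    clip_model.eval()  # the CLIP backbone stays frozen throughout

    class CLIPLinearProbe(nn.Module):
        """Classifier head trained on top of the frozen CLIP embedding space."""
        def __init__(self, embed_dim=512, num_classes=25):
            super().__init__()
            self.head = nn.Linear(embed_dim, num_classes)

        def forward(self, images):
            with torch.no_grad():  # embeddings come from the frozen encoder
                feats = clip_model.encode_image(images).float()
            return self.head(feats)

    probe = CLIPLinearProbe().to(device)  # trained with a standard cross-entropy loss

    # Zero-shot baseline: score images against emotion-category text prompts.
    emotion_prompts = ["a photo evoking joy", "a photo evoking sadness"]  # placeholders
    with torch.no_grad():
        text_feats = clip_model.encode_text(clip.tokenize(emotion_prompts).to(device)).float()
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

    def zero_shot_scores(images):
        with torch.no_grad():
            img_feats = clip_model.encode_image(images).float()
            img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
        return img_feats @ text_feats.T  # cosine similarity per emotion category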

    Saliency prediction in 360° architectural scenes: Performance and impact of daylight variations

    Saliency models are image-based prediction models that estimate human visual attention. Such models, when applied to architectural spaces, could pave the way for design decisions in which visual attention is taken into account. In this study, we tested the performance of eleven commonly used saliency models, spanning traditional and deep learning methods, on 126 rendered interior scenes with associated head-tracking data. The data were extracted from three experiments conducted in virtual reality between 2016 and 2018. Two of these datasets pertain to the perceptual effects of daylight and include variations of daylighting conditions for a limited set of interior spaces, thereby allowing us to test the influence of light conditions on human head movement. Ground-truth maps were extracted from the collected head-tracking logs, and the prediction accuracy of the models was tested via the correlation coefficient between ground-truth and prediction maps. To address the possible inflation of results due to the equator bias, we conducted complementary analyses by restricting the area of investigation to the equatorial image regions. Although limited to immersive virtual environments, the promising performance of some traditional models such as GBVS360eq and BMS360eq for colored and textured architectural rendered spaces offers the prospect of their integration into design tools. We also observed a strong correlation in head movements for the same space lit by different types of sky, a finding whose generalization requires further investigation based on datasets developed specifically to address this question.
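
    A minimal sketch of the evaluation protocol described above: the Pearson correlation coefficient (CC) between a predicted and a ground-truth saliency map of an equirectangular render, with an optional restriction to an equatorial band to counter the equator bias. The function name, band-width parameter, and random example maps are illustrative assumptions, not the study's code.

    import numpy as np

    def correlation_coefficient(pred, gt, equatorial_band=None):
        """CC between two saliency maps of shape (H, W); CC is scale-invariant."""
        assert pred.shape == gt.shape
        if equatorial_band is not None:
            # Keep only the central band covering `equatorial_band` of the image
            # height, centred on the equator of the equirectangular image.
            h = pred.shape[0]
            half = int(h * equatorial_band / 2)
            mid = h // 2
            pred = pred[mid - half: mid + half, :]
            gt = gt[mid - half: mid + half, :]
        p = (pred - pred.mean()) / (pred.std() + 1e-8)
        g = (gt - gt.mean()) / (gt.std() + 1e-8)
        return float((p * g).mean())  # Pearson correlation of the two maps

    # Example: compare a prediction with the head-tracking ground truth, once over
    # the full panorama and once restricted to the central equatorial band.
    pred_map = np.random.rand(256, 512)  # placeholder prediction map
    gt_map = np.random.rand(256, 512)    # placeholder ground-truth map
    print(correlation_coefficient(pred_map, gt_map))
    print(correlation_coefficient(pred_map, gt_map, equatorial_band=0.5))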

    Affective Image Content Analysis: Two Decades Review and New Perspectives

    Images can convey rich semantics and induce various emotions in viewers. Recently, with the rapid advancement of emotional intelligence and the explosive growth of visual data, extensive research efforts have been dedicated to affective image content analysis (AICA). In this survey, we comprehensively review the development of AICA over the past two decades, focusing especially on state-of-the-art methods with respect to three main challenges: the affective gap, perception subjectivity, and label noise and absence. We begin with an introduction to the key emotion representation models that have been widely employed in AICA and a description of the available datasets for evaluation, with a quantitative comparison of label noise and dataset bias. We then summarize and compare representative approaches to (1) emotion feature extraction, including both handcrafted and deep features, (2) learning methods for dominant emotion recognition, personalized emotion prediction, emotion distribution learning, and learning from noisy data or few labels, and (3) AICA-based applications. Finally, we discuss remaining challenges and promising future research directions, such as image content and context understanding, group emotion clustering, and viewer-image interaction. (Accepted by IEEE TPAMI.)
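
    One of the learning paradigms the survey reviews, emotion distribution learning, can be illustrated with a short sketch: the model predicts a full distribution over emotion categories and is trained against annotator-derived label distributions with a KL-divergence loss. The ResNet-18 backbone, eight-category setup, and random placeholder data below are assumptions chosen for illustration, not a method from the survey.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F
    from torchvision import models

    num_emotions = 8  # e.g., an eight-category emotion model; assumption for illustration
    backbone = models.resnet18(weights=None)
    backbone.fc = nn.Linear(backbone.fc.in_features, num_emotions)

    def distribution_loss(logits, target_dist):
        """KL(target || predicted) over emotion categories."""
        log_pred = F.log_softmax(logits, dim=1)
        return F.kl_div(log_pred, target_dist, reduction="batchmean")

    # Example: a batch of 4 images with their annotated emotion distributions.
    images = torch.randn(4, 3, 224, 224)                          # placeholder images
    target = torch.softmax(torch.randn(4, num_emotions), dim=1)   # placeholder annotations
    loss = distribution_loss(backbone(images), target)
    loss.backward()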

    WSCNet: Weakly Supervised Coupled Networks for Visual Sentiment Classification and Detection

    Automatic assessment of sentiment from visual content has gained considerable attention with the increasing tendency to express opinions online. In this paper, we address the problem of visual sentiment analysis, which is challenging due to the high-level abstraction involved in the recognition process. Existing methods based on convolutional neural networks learn sentiment representations from the holistic image, despite the fact that different image regions can have a different influence on the evoked sentiment. To this end, we introduce a weakly supervised coupled convolutional network (WSCNet). Our method automatically selects relevant soft proposals from weak annotations (e.g., global image labels), thereby significantly reducing the annotation burden, and makes the following contributions. First, WSCNet detects a sentiment-specific soft map by training a fully convolutional network with a cross-spatial pooling strategy in the detection branch. Second, both holistic and localized information are utilized by coupling the sentiment map with deep features for a robust representation in the classification branch. We integrate the sentiment detection and classification branches into a unified deep framework and optimize the network end to end. Through this joint learning strategy, weakly supervised sentiment classification and detection benefit each other. Extensive experiments demonstrate that the proposed WSCNet outperforms state-of-the-art results on seven benchmark datasets.
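
    A simplified sketch of the coupled two-branch idea (not the authors' exact WSCNet): a fully convolutional backbone feeds a detection branch that produces one response map per sentiment class and pools it spatially into detection logits, a soft sentiment map is formed from the class responses, and a classification branch couples that map with the holistic features; both branches are trained jointly. Plain global-average pooling stands in here for the paper's cross-spatial pooling strategy, and the ResNet-50 backbone, class count, and loss weighting are assumptions.

    import torch
    import torch.nn as nn
    from torchvision import models

    class CoupledSentimentNet(nn.Module):
        def __init__(self, num_classes=8):
            super().__init__()
            resnet = models.resnet50(weights=None)
            self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # (N, 2048, h, w)
            # Detection branch: one spatial response map per sentiment class.
            self.detect = nn.Conv2d(2048, num_classes, kernel_size=1)
            # Classification branch: couples the sentiment map with holistic features.
            self.classify = nn.Linear(2048 * 2, num_classes)

        def forward(self, x):
            feats = self.backbone(x)                      # (N, 2048, h, w)
            resp = self.detect(feats)                     # (N, C, h, w) class response maps
            det_logits = resp.mean(dim=(2, 3))            # spatial pooling -> detection logits
            # Soft sentiment map: class responses weighted by predicted class probabilities.
            weights = det_logits.softmax(dim=1)[:, :, None, None]
            sent_map = (weights * resp).sum(dim=1, keepdim=True).sigmoid()   # (N, 1, h, w)
            holistic = feats.mean(dim=(2, 3))                                # global features
            localized = (feats * sent_map).mean(dim=(2, 3))                  # map-weighted features
            cls_logits = self.classify(torch.cat([holistic, localized], dim=1))
            return det_logits, cls_logits                 # both branches trained end to end

    # Joint loss over the detection and classification branches (illustrative weighting).
    model = CoupledSentimentNet(num_classes=8)
    det_logits, cls_logits = model(torch.randn(2, 3, 448, 448))
    labels = torch.tensor([0, 3])
    loss = nn.functional.cross_entropy(cls_logits, labels) \
         + nn.functional.cross_entropy(det_logits, labels)
    loss.backward()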

    Machine Learning in Resource-constrained Devices: Algorithms, Strategies, and Applications

    The ever-increasing growth of technologies is changing people's everyday life. Two major consequences are that 1) the amount of available data is growing and 2) many applications rely on battery-supplied devices that are required to process data in real time. In this scenario, the need for ad-hoc strategies for the development of low-power and low-latency intelligent systems, capable of learning inductive rules from data using a modest amount of computational resources, is becoming vital. At the same time, specific methodologies are needed to manage complex patterns such as text and images. This thesis presents different approaches and techniques for the development of fast learning models explicitly designed to be hosted on embedded systems. The proposed methods proved able to achieve state-of-the-art performance in terms of the trade-off between generalization capabilities and area requirements when implemented in low-cost digital devices. In addition, advanced strategies for efficient sentiment analysis in text and images are proposed.