ABAW: Valence-Arousal Estimation, Expression Recognition, Action Unit Detection & Multi-Task Learning Challenges
This paper describes the third Affective Behavior Analysis in-the-wild (ABAW)
Competition, held in conjunction with the IEEE International Conference on Computer
Vision and Pattern Recognition (CVPR), 2022. The 3rd ABAW Competition is a
continuation of the Competitions held at ICCV 2021, IEEE FG 2020 and IEEE CVPR
2017 Conferences, and aims at automatically analyzing affect. This year the
Competition encompasses four Challenges: i) uni-task Valence-Arousal
Estimation, ii) uni-task Expression Classification, iii) uni-task Action Unit
Detection, and iv) Multi-Task-Learning. All the Challenges are based on a
common benchmark database, Aff-Wild2, which is a large-scale in-the-wild
database and the first one to be annotated in terms of valence-arousal,
expressions and action units. In this paper, we present the four Challenges
together with the utilized Competition corpora, outline the evaluation metrics,
and present the baseline systems along with their obtained results.
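The abstract leaves the evaluation metrics unspecified; for continuous valence-arousal estimation, a commonly used metric is the Concordance Correlation Coefficient (CCC). Below is a minimal sketch of that metric, offered as an assumption rather than a quotation of the Challenge protocol.

```python
import numpy as np

def concordance_correlation_coefficient(y_true, y_pred):
    """Concordance Correlation Coefficient (CCC), a common choice for
    continuous valence/arousal evaluation (assumed metric, not quoted
    from the paper)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    covariance = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2 * covariance / (var_t + var_p + (mean_t - mean_p) ** 2)

# Perfect agreement yields CCC = 1.0; uncorrelated predictions approach 0.
print(concordance_correlation_coefficient([0.1, 0.5, -0.3], [0.1, 0.5, -0.3]))
```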
Image Search with Text Feedback by Visiolinguistic Attention Learning
Image search with text feedback has promising impact in various real-world applications, such as e-commerce and internet search. Given a reference image and text feedback from the user, the goal is to retrieve images that not only resemble the input image, but also change certain aspects in accordance with the given text. This is a challenging task as it requires a synergistic understanding of both image and text. In this work, we tackle this task with a novel Visiolinguistic Attention Learning (VAL) framework. Specifically, we propose a composite transformer that can be seamlessly plugged into a CNN to selectively preserve and transform the visual features conditioned on language semantics. By inserting multiple composite transformers at varying depths, VAL is encouraged to encapsulate multi-granular visiolinguistic information, thus yielding an expressive representation for effective image search. We conduct a comprehensive evaluation on three datasets: Fashion200k, Shoes and FashionIQ. Extensive experiments show that our model exceeds existing approaches on all datasets, demonstrating consistent superiority in coping with various forms of text feedback, including attribute-like and natural language descriptions.
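As a rough illustration of conditioning CNN features on text at multiple depths, the hypothetical module below fuses a text embedding into a convolutional feature map with a simple learned gate; the class name, dimensions, and gating scheme are assumptions, not the authors' VAL implementation.

```python
import torch
import torch.nn as nn

class TextConditionedAttention(nn.Module):
    """Hypothetical sketch of fusing a text embedding into a CNN feature
    map with a learned spatial gate; not the authors' VAL code."""
    def __init__(self, channels, text_dim):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, channels)
        self.gate = nn.Conv2d(channels * 2, 1, kernel_size=1)

    def forward(self, feat, text_emb):
        # feat: (B, C, H, W) visual features; text_emb: (B, text_dim)
        t = self.text_proj(text_emb)              # (B, C)
        t = t[:, :, None, None].expand_as(feat)   # broadcast text over space
        g = torch.sigmoid(self.gate(torch.cat([feat, t], dim=1)))  # (B, 1, H, W)
        # Preserve visual content where the gate is high, inject text semantics elsewhere.
        return feat * g + t * (1 - g)

# Toy usage: a mid-level feature map and a 300-d text embedding.
feat = torch.randn(2, 64, 14, 14)
text = torch.randn(2, 300)
out = TextConditionedAttention(64, 300)(feat, text)
print(out.shape)  # torch.Size([2, 64, 14, 14])
```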
Computer Analysis of Architecture Using Automatic Image Understanding
In the past few years, computer vision and pattern recognition systems have
become increasingly powerful, expanding the range of automatic tasks enabled
by machine vision. Here we show that computer analysis of building images can
quantitatively analyze architecture and measure similarities between the
architectural styles of different cities. Images of buildings from 18 cities
and three countries were acquired using Google StreetView, and were used to
train a machine vision system to automatically identify the location of the
imaged building based on the image's visual content. Experimental results show
that the system can automatically identify the geographical location of a
StreetView image. More importantly, the algorithm was able to group the cities
and countries and provide a phylogeny of the similarities between
architectural styles as captured by StreetView images. These results
demonstrate that computer vision and pattern recognition algorithms can
perform the complex cognitive task of analyzing images of buildings, and can
be used to measure and quantify visual similarities and differences between
different architectural styles. This experiment provides a new paradigm for
studying architecture, based on a quantitative approach that can enhance
traditional manual observation and analysis. The source code used for the
analysis is open and publicly available.
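One plausible reading of the phylogeny step: a classifier's between-city confusion can serve as a similarity measure, which hierarchical clustering turns into a dendrogram. The sketch below is illustrative only; the toy confusion matrix, city names, and distance definition are assumptions, not the paper's data or code.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# Assumed toy confusion matrix from a city classifier:
# rows = true city, columns = predicted city (not the paper's results).
cities = ["CityA", "CityB", "CityC", "CityD"]
confusion = np.array([
    [80, 10,  5,  5],
    [12, 75,  8,  5],
    [ 6,  9, 70, 15],
    [ 4,  6, 14, 76],
], dtype=float)

# Symmetric similarity: cities confused with each other look alike.
sim = (confusion + confusion.T) / 2.0
np.fill_diagonal(sim, 0.0)
dist = sim.max() - sim                      # turn similarity into a distance

# Condensed distance vector for hierarchical clustering.
iu = np.triu_indices(len(cities), k=1)
tree = linkage(dist[iu], method="average")  # phylogeny-like dendrogram
print(dendrogram(tree, labels=cities, no_plot=True)["ivl"])
```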
A feasibility study of cachaça type recognition using computer vision and pattern recognition
Brazilian rum (also known as cachaça) is the third most commonly consumed distilled alcoholic drink in the world, with approximately 2.5 billion liters produced each year. It is a traditional drink with refined features and a delicate aroma that is produced mainly in Brazil but consumed in many countries. It can be aged in various types of wood for 1-3 years, which adds aroma and a distinctive flavor with different characteristics that affect the price. A research challenge is to develop a cheap automatic recognition system that inspects the finished product for the wood type and the aging time of its production. Some classical methods use chemical analysis, but this approach requires relatively expensive laboratory equipment. By contrast, the system proposed in this paper captures image signals from samples and uses an intelligent classification technique to recognize the wood type and the aging time. The classification system uses an ensemble of classifiers obtained from different wavelet decompositions, where each classifier is obtained with different wavelet transform settings. We compared the proposed approach with classical methods based on chemical features. We analyzed 105 samples that had been aged for 3 years, and we showed that the proposed solution could automatically recognize the wood type and the aging time with accuracies of up to 100.00% and 85.71%, respectively, while also being cheaper.
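A hedged sketch of the general recipe the abstract describes (per-wavelet features feeding an ensemble of classifiers combined by majority vote); the wavelet families, sub-band energy features, toy data, and SVM classifier are assumptions, not the paper's exact settings.

```python
import numpy as np
import pywt
from sklearn.svm import SVC

def wavelet_features(image, wavelet="db2", level=2):
    """Energy of each 2-D wavelet sub-band as a feature vector
    (assumed feature; the paper's exact settings may differ)."""
    coeffs = pywt.wavedec2(image, wavelet=wavelet, level=level)
    feats = [np.mean(np.square(coeffs[0]))]
    for detail_bands in coeffs[1:]:
        feats.extend(np.mean(np.square(band)) for band in detail_bands)
    return np.array(feats)

# Toy stand-ins for sample images and wood-type labels.
rng = np.random.default_rng(0)
images = rng.random((40, 64, 64))
labels = rng.integers(0, 3, size=40)

# One classifier per wavelet setting; combine by majority vote.
members = []
for w in ["db2", "haar", "sym4"]:
    X = np.stack([wavelet_features(img, wavelet=w) for img in images])
    members.append((w, SVC().fit(X, labels)))

votes = np.stack([
    clf.predict(np.stack([wavelet_features(img, wavelet=w) for img in images]))
    for w, clf in members
])
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), axis=0, arr=votes)
print("ensemble training accuracy:", (majority == labels).mean())
```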
What Will I Do Next? The Intention from Motion Experiment
In computer vision, video-based approaches have been widely explored for the
early classification and the prediction of actions or activities. However, it
remains unclear whether this modality (as compared to 3D kinematics) can still
be reliable for the prediction of human intentions, defined as the overarching
goal embedded in an action sequence. Since the same action can be performed
with different intentions, this problem is more challenging, yet tractable, as
shown by quantitative cognitive studies which exploit the 3D kinematics
acquired through motion capture systems. In this paper, we bridge cognitive and
computer vision studies, by demonstrating the effectiveness of video-based
approaches for the prediction of human intentions. Precisely, we propose
Intention from Motion, a new paradigm where, without using any contextual
information, we consider instantaneous grasping motor acts involving a bottle
in order to forecast why the bottle itself has been reached (to pass it, to
place it in a box, or to pour or drink the liquid inside). We process only the
grasping onsets, casting intention prediction as a classification problem.
Leveraging our multimodal acquisition (3D motion capture data and 2D optical
videos), we compare the most commonly used 3D descriptors from cognitive
studies with state-of-the-art video-based techniques. Since the two analyses
achieve an equivalent performance, we demonstrate that computer vision tools
are effective in capturing the kinematics and addressing the cognitive problem
of human intention prediction.
Comment: 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshop
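For the classification framing over grasping onsets, a minimal hedged comparison of a 3D-kinematics classifier against a video-feature classifier might look like the sketch below; the feature dimensions, labels, and random toy data are placeholders, not the authors' descriptors or results.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_trials = 120                           # grasping onsets (toy data)
intents = rng.integers(0, 4, n_trials)   # pass / place / pour / drink

# Placeholder features: 3D motion-capture descriptors vs. video descriptors.
kinematic_feats = rng.random((n_trials, 30))   # e.g. wrist velocity, grip aperture
video_feats = rng.random((n_trials, 512))      # e.g. pooled video features

for name, X in [("3D kinematics", kinematic_feats), ("video", video_feats)]:
    scores = cross_val_score(SVC(), X, intents, cv=5)
    print(f"{name}: {scores.mean():.2f} accuracy (toy data)")
```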
