223 research outputs found
Object Detection in 20 Years: A Survey
Object detection, as of one the most fundamental and challenging problems in
computer vision, has received great attention in recent years. Its development
in the past two decades can be regarded as an epitome of computer vision
history. If we think of today's object detection as a technical aesthetics
under the power of deep learning, then turning back the clock 20 years we would
witness the wisdom of cold weapon era. This paper extensively reviews 400+
papers of object detection in the light of its technical evolution, spanning
over a quarter-century's time (from the 1990s to 2019). A number of topics have
been covered in this paper, including the milestone detectors in history,
detection datasets, metrics, fundamental building blocks of the detection
system, speed up techniques, and the recent state of the art detection methods.
This paper also reviews some important detection applications, such as
pedestrian detection, face detection, text detection, etc, and makes an in-deep
analysis of their challenges as well as technical improvements in recent years.Comment: This work has been submitted to the IEEE TPAMI for possible
publicatio
Pedestrian Attribute Recognition: A Survey
Recognizing pedestrian attributes is an important task in computer vision
community due to it plays an important role in video surveillance. Many
algorithms has been proposed to handle this task. The goal of this paper is to
review existing works using traditional methods or based on deep learning
networks. Firstly, we introduce the background of pedestrian attributes
recognition (PAR, for short), including the fundamental concepts of pedestrian
attributes and corresponding challenges. Secondly, we introduce existing
benchmarks, including popular datasets and evaluation criterion. Thirdly, we
analyse the concept of multi-task learning and multi-label learning, and also
explain the relations between these two learning algorithms and pedestrian
attribute recognition. We also review some popular network architectures which
have widely applied in the deep learning community. Fourthly, we analyse
popular solutions for this task, such as attributes group, part-based,
\emph{etc}. Fifthly, we shown some applications which takes pedestrian
attributes into consideration and achieve better performance. Finally, we
summarized this paper and give several possible research directions for
pedestrian attributes recognition. The project page of this paper can be found
from the following website:
\url{https://sites.google.com/view/ahu-pedestrianattributes/}.Comment: Check our project page for High Resolution version of this survey:
https://sites.google.com/view/ahu-pedestrianattributes
The Emotional Impact of Audio - Visual Stimuli
Induced affect is the emotional effect of an object on an individual. It can be quantified through two metrics: valence and arousal. Valance quantifies how positive or negative something is, while arousal quantifies the intensity from calm to exciting. These metrics enable researchers to study how people opine on various topics. Affective content analysis of visual media is a challenging problem due to differences in perceived reactions. Industry standard machine learning classifiers such as Support Vector Machines can be used to help determine user affect. The best affect-annotated video datasets are often analyzed by feeding large amounts of visual and audio features through machine-learning algorithms. The goal is to maximize accuracy, with the hope that each feature will bring useful information to the table.
We depart from this approach to quantify how different modalities such as visual, audio, and text description information can aid in the understanding affect. To that end, we train independent models for visual, audio and text description. Each are convolutional neural networks paired with support vector machines to classify valence and arousal. We also train various ensemble models that combine multi-modal information with the hope that the information from independent modalities benefits each other.
We find that our visual network alone achieves state-of-the-art valence classification accuracy and that our audio network, when paired with our visual, achieves competitive results on arousal classification. Each network is much stronger on one metric than the other. This may lead to more sophisticated multimodal approaches to accurately identifying affect in video data. This work also contributes to induced emotion classification by augmenting existing sizable media datasets and providing a robust framework for classifying the same
High-level and Low-level Feature Set for Image Caption Generation with Optimized Convolutional Neural Network, Journal of Telecommunications and Information Technology, 2022, nr 4
Automatic creation of image descriptions, i.e. captioning of images, is an important topic in artificial intelligence (AI) that bridges the gap between computer vision (CV) and natural language processing (NLP). Currently, neural networks are becoming increasingly popular in captioning images and researchers are looking for more efficient models for CV and sequence-sequence systems. This study focuses on a new image caption generation model that is divided into two stages. Initially, low-level features, such as contrast, sharpness, color and their high-level counterparts, such as motion and facial impact score, are extracted. Then, an optimized convolutional neural network (CNN) is harnessed to generate the captions from images. To enhance the accuracy of the process, the weights of CNN are optimally tuned via spider monkey optimization with sine chaotic map evaluation (SMO-SCME). The development of the proposed method is evaluated with a diversity of metrics
- …