Describing Images by Semantic Modeling using Attributes and Tags
This dissertation addresses the problem of describing images using visual attributes and textual tags, a fundamental task that narrows the semantic gap between the visual reasoning of humans and machines. Automatic image annotation assigns relevant textual tags to images. In this dissertation, we propose a query-specific formulation based on Weighted Multi-view Non-negative Matrix Factorization to perform automatic image annotation. Our proposed technique seamlessly adapts to changes in the training data, naturally solves the problem of feature fusion, and handles the challenge of rare tags. Unlike tags, attributes are category-agnostic, so their combinations can model an exponential number of semantic labels. Motivated by the fact that most attributes describe local properties, we propose exploiting localization cues, through semantic parsing of the human face and body, to improve person-related attribute prediction. We also demonstrate that image-level attribute labels can be effectively used as weak supervision for the task of semantic segmentation. Next, we analyze selfie images by utilizing tags and attributes. We collect the first large-scale selfie dataset and annotate it with attributes covering characteristics such as gender, age, race, facial gestures, and hairstyle. We then study the popularity and sentiments of selfies given an estimated appearance of various semantic concepts. In brief, we automatically infer what makes a good selfie. Despite its extensive usage, the deep learning literature falls short in understanding the characteristics and behavior of Batch Normalization. We conclude this dissertation by providing a fresh view, in light of information geometry and Fisher kernels, on why batch normalization works.
We propose Mixture Normalization, which disentangles modes of variation in the underlying distribution of the layer outputs, and confirm that it effectively accelerates training of different batch-normalized architectures, including Inception-V3, Densely Connected Networks, and Deep Convolutional Generative Adversarial Networks, while achieving lower generalization error.
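The core idea admits a compact sketch: instead of whitening all activations with a single mean and variance, normalize under each mixture component and recombine using the soft assignments. The NumPy illustration below assumes the component responsibilities are given (the paper estimates them from the data); it is a sketch of the idea, not the authors' implementation.

```python
import numpy as np

def mixture_normalize(x, r, eps=1e-5):
    """Normalize samples x (N, D) under a K-component mixture.

    r (N, K) holds soft assignments (responsibilities) of each sample to
    each component. Statistics are responsibility-weighted, and the
    per-component normalized views are recombined with the same weights.
    Illustrative sketch only; responsibilities are assumed precomputed.
    """
    out = np.zeros_like(x, dtype=float)
    for j in range(r.shape[1]):
        w = r[:, j:j + 1]                     # (N, 1) soft weights
        mass = w.sum() + eps
        mean = (w * x).sum(axis=0) / mass     # weighted component mean
        var = (w * (x - mean) ** 2).sum(axis=0) / mass
        out += w * (x - mean) / np.sqrt(var + eps)
    return out
```

With a single component whose responsibilities are all one, this reduces to ordinary batch normalization (zero mean, unit variance per feature), which is a useful sanity check.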
A Diaspora of Humans to Technology: VEDA Net for Sentiments and their Technical Analysis
Background: Human sentiments are the representation of one's soul. Visual media has emerged as one of the most potent instruments for communicating thoughts and feelings in today's world. The area of visual emotion analysis remains abstract due to the considerable bias in the human cognitive process. Machines need to apprehend and segment these emotions better for future AI advancements. A broad range of prior research has investigated only the emotion-class-identifier part of the whole process. In this work, we focus on proposing a better architecture to assess an emotion identifier and on finding a better strategy to extract and process an input image for that architecture.
Objective: We investigate the subject of visual emotion detection and analysis using a connected Dense Blocked Network to propose an architecture, VEDANet. We show that the proposed architecture performs highly effectively across different datasets.
Method: Using CNN-based pre-trained architectures, we highlight the spatial hierarchies of visual features. Because an image's spatial regions communicate substantial feelings, we utilize a dense block-based model, VEDANet, that focuses on the image's relevant sentiment-rich regions for effective emotion extraction. This work makes a substantial contribution by providing an in-depth investigation of the proposed architecture, carrying out extensive trials on popular benchmark datasets to assess accuracy gains over the comparable state of the art. In terms of emotion detection, the outcomes of the study show that the proposed VED system outperforms existing ones in accuracy. Further, we explore an Over-The-Top optimization (OTO) layer to achieve higher efficiency.
Results: Compared to recent research, the proposed model performs admirably, obtaining accuracies of 87.30% on the AffectNet dataset, 92.76% on Google FEC, 95.23% on the Yale dataset, and 97.63% on the FER2013 dataset. We successfully merged the model with a face detector to obtain 98.34% accuracy on real-time live frames, further encouraging real-time applications. In comparison to existing approaches, we achieve real-time performance with a minimal TAT (turn-around time) trade-off by using an appropriate network size and fewer parameters.
Bag-Level Aggregation for Multiple Instance Active Learning in Instance Classification Problems
A growing number of applications, e.g. video surveillance and medical image
analysis, require training recognition systems from large amounts of weakly
annotated data while some targeted interactions with a domain expert are
allowed to improve the training process. In such cases, active learning (AL)
can reduce labeling costs for training a classifier by querying the expert to
provide the labels of most informative instances. This paper focuses on AL
methods for instance classification problems in multiple instance learning
(MIL), where data is arranged into sets, called bags, that are weakly labeled.
Most AL methods focus on single instance learning problems. These methods are
not suitable for MIL problems because they cannot account for the bag structure
of data. In this paper, new methods for bag-level aggregation of instance
informativeness are proposed for multiple instance active learning (MIAL). The
aggregated informativeness method identifies the most informative
instances based on classifier uncertainty, and queries bags incorporating the
most information. The other proposed method, called cluster-based
aggregative sampling, clusters data hierarchically in the instance space. The
informativeness of instances is assessed by considering bag labels, inferred
instance labels, and the proportion of labels that remain to be discovered in
clusters. Both proposed methods significantly outperform reference methods in
extensive experiments using benchmark data from several application domains.
Results indicate that using an appropriate strategy to address MIAL problems
yields a significant reduction in the number of queries needed to achieve the
same level of performance as single instance AL methods.
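As a rough illustration of the bag-level aggregation idea, the sketch below scores instances by classifier uncertainty (binary entropy of predicted probabilities) and queries the bag with the highest aggregated score. The entropy measure and mean aggregation are illustrative assumptions, not the paper's exact scoring.

```python
import numpy as np

def instance_uncertainty(p, eps=1e-12):
    """Binary-entropy uncertainty for instance positive-class probs p."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

def query_most_informative_bag(bags):
    """Pick the bag whose instances carry the most aggregated uncertainty.

    `bags` maps a bag id to an array of instance positive-class
    probabilities from the current classifier. Mean aggregation is one
    simple choice of bag-level score; other aggregations (sum, max) fit
    the same template.
    """
    scores = {b: instance_uncertainty(p).mean() for b, p in bags.items()}
    return max(scores, key=scores.get)
```

A bag full of near-0.5 predictions (maximally uncertain) is queried before a bag the classifier already labels confidently.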
An Efficient End-to-End Transformer with Progressive Tri-modal Attention for Multi-modal Emotion Recognition
Recent works on multi-modal emotion recognition move towards end-to-end
models, which can extract the task-specific features supervised by the target
task compared with the two-phase pipeline. However, previous methods model
only the feature interactions between the textual modality and either the
acoustic or the visual modality, ignoring the feature interactions between the
acoustic and visual modalities. In this paper, we propose the multi-modal end-to-end
transformer (ME2ET), which can effectively model the tri-modal feature
interactions among the textual, acoustic, and visual modalities at the
low-level and high-level. At the low-level, we propose the progressive tri-modal
attention, which can model the tri-modal feature interactions by adopting a
two-pass strategy and can further leverage such interactions to significantly
reduce the computation and memory complexity through reducing the input token
length. At the high-level, we introduce the tri-modal feature fusion layer to
explicitly aggregate the semantic representations of three modalities. The
experimental results on the CMU-MOSEI and IEMOCAP datasets show that ME2ET
achieves the state-of-the-art performance. The further in-depth analysis
demonstrates the effectiveness, efficiency, and interpretability of the
proposed progressive tri-modal attention, which can help our model to achieve
better performance while significantly reducing the computation and memory
cost. Our code will be publicly available.
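The high-level fusion step admits a very simple sketch: pool each modality's token sequence, concatenate the pooled vectors, and project them. The mean pooling and the single linear projection below are assumptions for illustration, not ME2ET's actual fusion layer.

```python
import numpy as np

def trimodal_fuse(text, audio, visual, w, b):
    """Explicitly aggregate three modality representations.

    Each modality is a token sequence of shape (T_i, D); sequences are
    mean-pooled, concatenated into a (3*D,) vector, and passed through
    one linear projection w (3*D, H) with bias b (H,). Shapes and the
    single linear layer are illustrative choices.
    """
    pooled = [m.mean(axis=0) for m in (text, audio, visual)]  # 3 x (D,)
    z = np.concatenate(pooled)                                # (3*D,)
    return z @ w + b                                          # fused (H,)
```

Because pooling happens before fusion, the fused representation's size is independent of the (possibly very different) sequence lengths of the three modalities.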
Unleashing the Power of VGG16: Advancements in Facial Emotion Recognition
In facial emotion detection, researchers are actively exploring effective methods to identify and understand facial expressions. This study introduces a novel mechanism for emotion identification using diverse facial photos captured under varying lighting conditions. A meticulously pre-processed dataset ensures data consistency and quality. Leveraging deep learning architectures, the study utilizes feature extraction techniques to capture subtle emotive cues and build an emotion classification model using convolutional neural networks (CNNs). The proposed methodology achieves an impressive 97% accuracy on the validation set, outperforming previous methods in terms of accuracy and robustness. Challenges such as lighting variations, head posture, and occlusions are acknowledged, and multimodal approaches incorporating additional modalities like auditory or physiological data are suggested for further improvement. The outcomes of this research have wide-ranging implications for affective computing, human-computer interaction, and mental health diagnosis, advancing the field of facial emotion identification and paving the way for sophisticated technology capable of understanding and responding to human emotions across diverse domains.
MOON: A Mixed Objective Optimization Network for the Recognition of Facial Attributes
Attribute recognition, particularly facial, extracts many labels for each
image. While some multi-task vision problems can be decomposed into separate
tasks and stages, e.g., training independent models for each task, for a
growing set of problems joint optimization across all tasks has been shown to
improve performance. We show that for deep convolutional neural network (DCNN)
facial attribute extraction, multi-task optimization is better. Unfortunately,
it can be difficult to apply joint optimization to DCNNs when training data is
imbalanced, and re-balancing multi-label data directly is structurally
infeasible, since adding/removing data to balance one label will change the
sampling of the other labels. This paper addresses the multi-label imbalance
problem by introducing a novel mixed objective optimization network (MOON) with
a loss function that mixes multiple task objectives with domain adaptive
re-weighting of propagated loss. Experiments demonstrate that not only does
MOON advance the state of the art in facial attribute recognition, but it also
outperforms independently trained DCNNs using the same data. When using facial
attributes for the LFW face recognition task, we show that our balanced (domain
adapted) network outperforms the unbalanced trained network.
Comment: Post-print of manuscript accepted to the European Conference on
Computer Vision (ECCV) 2016
http://link.springer.com/chapter/10.1007%2F978-3-319-46454-1_
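A simplified stand-in for the idea behind MOON's loss can be written as a re-weighted multi-label binary cross-entropy, where each attribute's positive and negative terms are scaled toward a target label distribution instead of re-balancing the data itself. The `target_rate` parameter and this particular weighting scheme are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def balanced_multilabel_bce(p, y, target_rate=0.5, eps=1e-12):
    """Multi-label BCE with per-label re-weighting toward a target rate.

    p, y: (N, L) predicted probabilities and 0/1 labels. Each label's
    positive/negative loss terms are scaled so rare positives (or rare
    negatives) are up-weighted, sidestepping the structural infeasibility
    of re-sampling multi-label data directly.
    """
    p = np.clip(p, eps, 1 - eps)
    pos_rate = np.clip(y.mean(axis=0), eps, 1 - eps)  # observed frequency
    w_pos = target_rate / pos_rate                    # up-weight rare positives
    w_neg = (1 - target_rate) / (1 - pos_rate)        # up-weight rare negatives
    loss = -(w_pos * y * np.log(p) + w_neg * (1 - y) * np.log(1 - p))
    return loss.mean()
```

When every label is already balanced at the target rate, the weights collapse to one and the loss reduces to plain binary cross-entropy.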
Affective Image Content Analysis: Two Decades Review and New Perspectives
Images can convey rich semantics and induce various emotions in viewers.
Recently, with the rapid advancement of emotional intelligence and the
explosive growth of visual data, extensive research efforts have been dedicated
to affective image content analysis (AICA). In this survey, we will
comprehensively review the development of AICA in the recent two decades,
especially focusing on the state-of-the-art methods with respect to three main
challenges -- the affective gap, perception subjectivity, and label noise and
absence. We begin with an introduction to the key emotion representation
models that have been widely employed in AICA and a description of the
available datasets for evaluation, with a quantitative comparison of label
noise and dataset bias. We then summarize and compare the representative approaches on
(1) emotion feature extraction, including both handcrafted and deep features,
(2) learning methods on dominant emotion recognition, personalized emotion
prediction, emotion distribution learning, and learning from noisy data or few
labels, and (3) AICA based applications. Finally, we discuss some challenges
and promising research directions in the future, such as image content and
context understanding, group emotion clustering, and viewer-image interaction.
Comment: Accepted by IEEE TPAM