Differential Recurrent Neural Networks for Human Activity Recognition
Human activity recognition has been an active research area in recent years. The difficulty of this problem lies in the complex dynamical motion patterns embedded in the sequential frames. The Long Short-Term Memory (LSTM) recurrent neural network is capable of processing complex sequential information since it utilizes special gating schemes for learning representations from long input sequences. It has the potential to model various time-series data, where the current hidden state has to be considered in the context of the past hidden states. Unfortunately, conventional LSTMs do not consider the impact of the spatio-temporal dynamics corresponding to salient motion patterns when they gate the information that ought to be memorized through time. To address this problem, we propose a differential gating scheme for the LSTM neural network, which emphasizes the change in information gain caused by the salient motions between successive video frames. This change in information gain is quantified by the Derivative of States (DoS), and thus the proposed LSTM model is termed the differential Recurrent Neural Network (dRNN). Based on the energy profiling of DoS, we further propose to employ the State Energy Profile (SEP) to search for salient dRNN states and construct more informative representations. To better capture scene and human appearance information, the dRNN model is extended by connecting Convolutional Neural Networks (CNNs) and stacked dRNNs into an end-to-end model. Lastly, the dissertation discusses and compares the combined and individual orders of DoS used within the dRNN. We propose to control the LSTM gates via individual orders of DoS and to stack multiple levels of LSTM cells in increasing orders of state derivatives. To this end, we have introduced a new family of LSTMs, expanding the applications of LSTMs and advancing the performance of state-of-the-art methods.
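The abstract does not give the dRNN equations, but the core idea — letting the gates see the Derivative of States (DoS), i.e. the discrete change in the internal state — can be sketched as below. This is a minimal illustration, not the dissertation's exact formulation; all parameter names (`W`, `U`, `D`, `b`) are assumptions, and the DoS is approximated by a first-order difference of consecutive internal states.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def drnn_step(x_t, h_prev, s_prev, s_prev2, W, U, D, b):
    """One step of an LSTM-like cell whose gates are additionally driven
    by the Derivative of States (DoS): the first-order difference of the
    internal state, d = s_{t-1} - s_{t-2}.

    W, U, D, b: dicts of per-gate parameters, keys 'i', 'f', 'o', 'g'.
    (Hypothetical parameterization for illustration only.)
    """
    dos = s_prev - s_prev2  # first-order DoS approximation

    def gate(name, act):
        return act(W[name] @ x_t + U[name] @ h_prev + D[name] @ dos + b[name])

    i = gate('i', sigmoid)   # input gate, modulated by DoS
    f = gate('f', sigmoid)   # forget gate, modulated by DoS
    o = gate('o', sigmoid)   # output gate, modulated by DoS
    g = gate('g', np.tanh)   # candidate state
    s_t = f * s_prev + i * g        # new internal (cell) state
    h_t = o * np.tanh(s_t)          # new hidden state
    return h_t, s_t
```

The intuition is that a large DoS signals salient motion between frames, so the gates can learn to memorize exactly those moments; stacking cells driven by higher-order differences would correspond to the family of models the abstract describes.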
The Evolution of First Person Vision Methods: A Survey
The emergence of new wearable technologies such as action cameras and
smart-glasses has increased the interest of computer vision scientists in the
First Person perspective. Nowadays, this field is attracting attention and
investments of companies aiming to develop commercial devices with First Person
Vision recording capabilities. Due to this interest, an increasing demand for
methods to process these videos, possibly in real-time, is expected. Current
approaches present particular combinations of different image features and
quantitative methods to accomplish specific objectives like object detection,
activity recognition, user-machine interaction and so on. This paper summarizes
the evolution of the state of the art in First Person Vision video analysis
between 1997 and 2014, highlighting, among others, most commonly used features,
methods, challenges and opportunities within the field.Comment: First Person Vision, Egocentric Vision, Wearable Devices, Smart
Glasses, Computer Vision, Video Analytics, Human-machine Interactio
Introducing and assessing the explainable AI (XAI)method: SIDU
Explainable Artificial Intelligence (XAI) has in recent years become a
well-suited framework to generate human understandable explanations of black
box models. In this paper, we present a novel XAI visual explanation algorithm
denoted SIDU that can effectively localize entire object regions responsible
for the prediction to their full extent. We analyze its robustness and effectiveness
through various computational and human subject experiments. In particular, we
assess the SIDU algorithm using three different types of evaluations
(Application, Human and Functionally-Grounded) to demonstrate its superior
performance. The robustness of SIDU is further studied in the presence of
adversarial attacks on black box models to better understand its behavior.
Comment: Preprint submitted to Journal of Pattern Recognition (Elsevier)
ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer
Deep learning (DL) has advanced the field of dense prediction, while
gradually dissolving the inherent barriers between different tasks. However,
most existing works focus on designing architectures and constructing visual
cues only for the specific task, which ignores the potential uniformity
introduced by the DL paradigm. In this paper, we attempt to construct a novel
\underline{ComP}lementary \underline{tr}ansformer, \textbf{ComPtr}, for diverse
bi-source dense prediction tasks. Specifically, unlike existing methods that
over-specialize in a single task or a subset of tasks, ComPtr starts from the
more general concept of bi-source dense prediction. Based on the basic
dependence on information complementarity, we propose consistency enhancement
and difference awareness components with which ComPtr can extract and collect
important visual semantic cues from different image sources for diverse tasks,
respectively. ComPtr treats different inputs equally and builds an efficient
dense interaction model in the form of sequence-to-sequence on top of the
transformer. This task-generic design provides a smooth foundation for
constructing the unified model that can simultaneously deal with various
bi-source information. In extensive experiments across several representative
vision tasks, i.e. remote sensing change detection, RGB-T crowd counting,
RGB-D/T salient object detection, and RGB-D semantic segmentation, the proposed
method consistently obtains favorable performance. The code will be available
at \url{https://github.com/lartpang/ComPtr}.
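The abstract does not spell out ComPtr's internals. Purely as an illustration of what a sequence-to-sequence interaction between two image sources could look like, the following numpy sketch (all names hypothetical, not taken from the paper or repository) lets the tokens of one source attend to the tokens of the other and fuses the result residually:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(tokens_a, tokens_b, Wq, Wk, Wv):
    """Tokens from source A (e.g. RGB patches) attend to tokens from
    source B (e.g. depth or thermal patches), collecting complementary
    cues and fusing them back into A via a residual connection."""
    q = tokens_a @ Wq                                  # queries from A
    k = tokens_b @ Wk                                  # keys from B
    v = tokens_b @ Wv                                  # values from B
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # A-to-B attention
    return tokens_a + attn @ v                         # residual fusion
```

Treating both inputs equally, as the abstract describes, would amount to running this exchange symmetrically (A attends to B and B attends to A) before a shared decoding head.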
Entropy in Image Analysis II
Image analysis is a fundamental task for any application where extracting information from images is required. The analysis requires highly sophisticated numerical and analytical methods, particularly for those applications in medicine, security, and other fields where the results of the processing consist of data of vital importance. This fact is evident from all the articles composing the Special Issue "Entropy in Image Analysis II", in which the authors used widely tested methods to verify their results. In the process of reading the present volume, the reader will appreciate the richness of the methods and applications, in particular for medical imaging and image security, and a remarkable cross-fertilization among the proposed research areas.
Object Counting with Deep Learning
This thesis explores various empirical aspects of deep learning or convolutional network based models for efficient object counting. First, we train moderately large convolutional networks from scratch on comparatively smaller datasets containing a few hundred samples, using conventional image processing based data augmentation. Then, we extend this approach to unconstrained, outdoor images using more advanced architectural concepts. Additionally, we propose an efficient, randomized data augmentation strategy based on sub-regional pixel distribution for low-resolution images.
Next, the effectiveness of depth-to-space shuffling of feature elements for efficient segmentation is investigated for simpler problems like binary segmentation -- often required in the counting framework. This depth-to-space operation violates the basic assumption of encoder-decoder-type segmentation architectures. Consequently, it allows the encoder model to be trained as a sparsely connected graph. Nonetheless, we have found accuracy comparable to that of standard encoder-decoder architectures with our depth-to-space models.
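The depth-to-space operation mentioned above (often called pixel shuffle) is well defined independently of the thesis: it trades channel depth for spatial resolution by rearranging each group of r*r channels into an r-by-r spatial block. A minimal numpy sketch of that rearrangement, following the common (channels-first) convention:

```python
import numpy as np

def depth_to_space(x, r):
    """Rearrange a (C*r*r, H, W) feature map into (C, H*r, W*r),
    trading channel depth for spatial resolution (pixel shuffle)."""
    c, h, w = x.shape
    assert c % (r * r) == 0, "channel count must be divisible by r*r"
    x = x.reshape(c // (r * r), r, r, h, w)   # split depth into r x r blocks
    x = x.transpose(0, 3, 1, 4, 2)            # interleave to (C, H, r, W, r)
    return x.reshape(c // (r * r), h * r, w * r)
```

For example, a 4-channel 1x1 map with r=2 becomes a 1-channel 2x2 map whose pixels are the four input channels laid out row by row.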
After that, the subtleties regarding the lack of localization information in the conventional scalar count loss for one-look models are illustrated. At this point, without using additional annotations, a possible solution is proposed based on the regulation of a network-generated heatmap in the form of a weak, subsidiary loss. The models trained with this auxiliary loss alongside the conventional loss perform much better than their baseline counterparts, both qualitatively and quantitatively. Lastly, the intricacies of tiled prediction for high-resolution images are studied in detail, and a simple and effective trick of eliminating the normalization factor in an existing computational block is demonstrated. All of the approaches employed here are thoroughly benchmarked across multiple heterogeneous datasets for object counting against previous, state-of-the-art approaches.
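The abstract does not state the exact form of the subsidiary heatmap loss, but the general pattern — a scalar count loss plus a weak regularizer on a network-generated heatmap, using no extra annotations — can be sketched as follows. The specific regularizer here (penalizing negative activations) is a hypothetical stand-in, not the thesis's actual term.

```python
import numpy as np

def count_with_heatmap_loss(heatmap, true_count, lam=0.1):
    """Scalar count loss plus a weak subsidiary term on the predicted
    heatmap. The count is read out as the sum of the heatmap; the
    auxiliary term (hypothetical) discourages negative activations,
    nudging the map toward a plausible localization without any
    location labels."""
    pred_count = heatmap.sum()
    count_loss = (pred_count - true_count) ** 2          # conventional loss
    reg = np.square(np.minimum(heatmap, 0.0)).mean()     # weak regularizer
    return count_loss + lam * reg
```

Because the auxiliary term is weighted down by `lam`, it regulates the heatmap without competing with the count objective — matching the "weak, subsidiary loss" role the abstract describes.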