17 research outputs found
Cross View Action Recognition
Cross View Action Recognition (CVAR) appraises a system's ability to recognise actions from viewpoints that are unfamiliar to it. State-of-the-art methods that train on large amounts of data rely on variation within the training data itself to increase their ability to tackle viewpoint changes. Therefore, these methods not only require a large-scale dataset of appropriate classes for the application every time they train, but also a correspondingly large amount of computational power for the training process, leading to high costs in terms of time, effort, funds and electrical energy. In this thesis, we propose a methodological pipeline that tackles changes in viewpoint while training on small datasets and employing sustainable amounts of resources. Our method uses the optical-flow input with a stream of a pre-trained model as-is to obtain a feature. Thereafter, this feature is used to train a custom-designed classifier that promotes view-invariant properties. Our method only uses video information as input, in contrast to another set of methods that approach CVAR using depth or pose input at the expense of increased sensor costs. We present a number of comparative analyses that aided the design of the pipeline, further assessing the power of each component in it. The technique can also be adapted to existing, trained classifiers with minimal fine-tuning, as this work demonstrates by comparing classifiers including shallow classifiers, deep pre-trained classifiers and our proposed classifier trained from scratch. Additionally, we present a set of qualitative results that promote our understanding of the relationship between viewpoints in the feature space.
XXXII CICLO - INFORMATICA E INGEGNERIA DEI SISTEMI / COMPUTER SCIENCE AND SYSTEMS ENGINEERING - Informatica. Goyal, Gaurv
Deep learning for texture and dynamic texture analysis
Texture is a fundamental visual cue in computer vision which provides useful information about image regions. Dynamic Texture (DT) extends the analysis of texture to sequences of moving scenes. Classic approaches to texture and DT analysis are based on shallow hand-crafted descriptors, including local binary patterns and filter banks. Deep learning, and in particular Convolutional Neural Networks (CNNs), has significantly contributed to the field of computer vision in the last decade. These biologically inspired networks, trained with powerful algorithms, have largely improved the state of the art in various tasks such as digit, object and face recognition. This thesis explores the use of CNNs in texture and DT analysis, replacing classic hand-crafted filters with deep trainable filters. An introduction to deep learning is provided in the thesis, as well as a thorough review of texture and DT analysis methods. While CNNs present interesting features for the analysis of textures, such as a dense extraction of filter responses trained end to end, the deepest layers used in the decision rules commonly learn to detect large shapes and image layout instead of local texture patterns. A CNN architecture is therefore adapted to textures by using an orderless pooling of intermediate layers to discard the overall shape analysis, resulting in a reduced computational cost and improved accuracy. An application to biomedical texture images is proposed in which large tissue images are tiled and combined in a recognition scheme. An approach is also proposed for DT recognition using the developed CNNs on three orthogonal planes to combine spatial and temporal analysis. Finally, a fully convolutional network is adapted to texture segmentation based on the same idea of discarding the overall shape, combining local shallow features with larger and deeper features.
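The orderless-pooling idea can be made concrete in a few lines: averaging an intermediate layer's filter responses over the spatial dimensions keeps texture statistics while discarding layout. The sketch below uses random arrays as an illustrative stand-in for real CNN activations, not the thesis's exact architecture; the shuffled copy shows why the descriptor is "orderless".

```python
import numpy as np

def orderless_pool(feature_maps):
    """Global average pooling over the spatial axes of an intermediate
    CNN layer: an orderless statistic of local filter responses."""
    # feature_maps: (channels, height, width) activations
    return feature_maps.mean(axis=(1, 2))

rng = np.random.default_rng(0)
fm = rng.random((8, 5, 5))      # stand-in for an 8-channel feature map
desc = orderless_pool(fm)       # 8-dimensional texture descriptor

# permuting pixel positions (destroying the layout) leaves it unchanged
perm = rng.permutation(25)
shuffled = fm.reshape(8, -1)[:, perm].reshape(8, 5, 5)
```

Because the descriptor ignores where responses occur, the decision layers no longer favour large shapes and image layout over local texture patterns.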
Shape Dynamical Models for Activity Recognition and Coded Aperture Imaging for Light-Field Capture
Classical applications of pattern recognition in image processing and computer vision have typically dealt with modeling, learning and recognizing static patterns in images and videos.
In nature, of course, there is a whole class of patterns that dynamically evolve over time.
Human activities, behaviors of insects and animals, facial expression changes, lip reading, genetic expression profiles are some examples of patterns that are dynamic.
Models and algorithms to study these patterns must take into account the dynamics of these patterns while exploiting the classical pattern recognition techniques.
The first part of this dissertation is an attempt to model and recognize such dynamically evolving patterns.
We will look at specific instances of such dynamic patterns, such as human activities and the behaviors of insects, and develop algorithms to learn models of these patterns and to classify them.
The models and algorithms proposed are validated by extensive experiments on gait-based person identification, activity recognition and simultaneous tracking and behavior analysis of insects.
The problem of comparing dynamically deforming shape sequences arises repeatedly in problems like activity recognition and lip reading.
We describe and evaluate parametric and non-parametric models for shape sequences.
In particular, we emphasize the need to model activity execution rate variations and propose a non-parametric model that is insensitive to such variations.
These models and the resulting algorithms are shown to be extremely effective for a wide range of applications from gait-based person identification to human action recognition.
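A classic non-parametric device for such execution-rate invariance is dynamic time warping, which aligns two sequences while allowing the time axis to stretch locally. The thesis's shape-sequence model is more elaborate; the sketch below uses 1-D scalar features purely for illustration.

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping: the alignment cost is insensitive to
    local stretching/compression of the time axis."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])    # use a shape distance
            D[i, j] = cost + min(D[i - 1, j],  # for real sequences
                                 D[i, j - 1],
                                 D[i - 1, j - 1])
    return D[n, m]

fast = [1.0, 2.0, 3.0]
slow = [1.0, 1.0, 2.0, 2.0, 3.0]   # same trajectory, executed slower
```

Here `dtw_distance(fast, slow)` is 0: the two executions differ only in rate, which is exactly the variation a recognition model should ignore.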
We further show that the shape dynamical models are not only effective for the problem of recognition, but also can be used as effective priors for the problem of simultaneous tracking and behavior analysis.
We validate the proposed algorithm for performing simultaneous behavior analysis and tracking on videos of bees dancing in a hive.
In the last part of this dissertation, we investigate computational imaging, an emerging field in which the process of image formation involves the use of a computer.
The current trend in computational imaging is to capture as much information about the scene as possible during capture time so that appropriate images with varying focus, aperture, blur and colorimetric settings may be rendered as required.
In this regard, capturing the 4D light-field as opposed to a 2D image allows us to freely vary viewpoint and focus at the time of rendering an image.
In this dissertation, we describe a theoretical framework for reversibly modulating 4D light fields using an attenuating mask in the optical path of a lens-based camera. Based on this framework, we present a novel design to reconstruct the 4D light field from a 2D camera image without any additional refractive elements as required by previous light-field cameras. The patterned mask attenuates light rays inside the camera instead of bending them, and the attenuation recoverably encodes the rays on the 2D sensor. Our mask-equipped camera focuses just as a traditional camera does to capture conventional 2D photos at full sensor resolution, but the raw pixel values also hold a modulated 4D light field. The light field can be recovered by rearranging the tiles of the 2D Fourier transform of the sensor values into 4D planes and computing the inverse Fourier transform. In addition, one can also recover the full-resolution image information for the in-focus parts of the scene.
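The recovery step described above is, at its core, a tiling operation in the frequency domain. The NumPy sketch below illustrates only that rearrangement for a toy a×a angular resolution; real decoding additionally depends on the calibrated mask modulation, so the sizes and normalisation here are illustrative assumptions.

```python
import numpy as np

def recover_light_field(sensor, a):
    """Heterodyne-decoding sketch: cut the 2D Fourier transform of the
    sensor image into a*a tiles (one per angular sample), stack them as
    4D planes, and inverse-transform each plane."""
    H, W = sensor.shape
    h, w = H // a, W // a
    F = np.fft.fftshift(np.fft.fft2(sensor))
    # tiles[u, v] holds the (u, v) spectral copy of the scene
    tiles = F.reshape(a, h, a, w).transpose(0, 2, 1, 3)
    return np.fft.ifft2(np.fft.ifftshift(tiles, axes=(-2, -1)),
                        axes=(-2, -1))

sensor = np.random.default_rng(0).random((12, 12))  # toy sensor image
lf = recover_light_field(sensor, 3)  # (3, 3, 4, 4) angular views
```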
Feature Learning for RGB-D Data
RGB-D data has turned out to be a very useful representation for solving fundamental computer vision problems. It combines the advantages of color images, which provide appearance information about an object, with those of depth images, which are immune to variations in color, illumination, rotation angle and scale. With the invention of the low-cost Microsoft Kinect sensor, initially used for gaming and later a popular device for computer vision, high-quality RGB-D data can be acquired easily. RGB-D images and videos can facilitate a wide range of application areas, such as computer vision, robotics, construction and medical imaging. However, how to fuse RGB information and depth information is still an open problem in computer vision: simply concatenating RGB data and depth data is not enough, and more powerful fusion algorithms are needed. In this thesis, to explore more of the advantages of RGB-D data, we use popular RGB-D datasets for evaluating deep feature learning algorithms, hyper-parameter optimization, local multi-modal feature learning, RGB-D data fusion and recognizing RGB information from RGB-D images.

i) With the success of deep neural networks in computer vision, deep features from fused RGB-D data can be shown to give better results than RGB data alone. However, different deep learning algorithms perform differently on different RGB-D datasets. Through large-scale experiments that comprehensively evaluate the performance of deep feature learning models for RGB-D image/video classification, we conclude that RGB-D fusion methods using CNNs always outperform the other selected methods (DBNs, SDAE and LSTM). On the other hand, since an LSTM can learn from experience to classify, process and predict time series, it achieved better performance than DBN and SDAE on video classification tasks.

ii) Hyper-parameter optimization can help researchers quickly choose an initial set of hyper-parameters for a new classification task, thus reducing the number of trials over the hyper-parameter space. We present a simple and efficient framework for improving the efficiency and accuracy of hyper-parameter optimization by considering the classification complexity of a particular dataset, and we verify this framework on three real-world RGB-D datasets. The analysis of the experiments confirms that our framework provides deeper insights into the relationship between dataset classification tasks and hyper-parameter optimization, and thus allows an accurate initial set of hyper-parameters to be chosen quickly for a new classification task.

iii) We propose a new Convolutional Neural Network (CNN)-based local multi-modal feature learning framework for RGB-D scene classification. This method effectively captures much of the local structure in RGB-D scene images and automatically learns a fusion strategy for the object-level recognition step, instead of simply training a classifier on top of features extracted from both modalities. Experiments conducted on two popular datasets to thoroughly test the performance of our method show that local multi-modal CNNs greatly outperform state-of-the-art approaches, and that our method has the potential to improve RGB-D scene understanding. An extended evaluation shows that CNNs trained on a scene-centric dataset achieve an improvement on scene benchmarks compared to a network trained on an object-centric dataset.

iv) We propose a novel method for RGB-D data fusion: raw RGB-D data are projected into a complex space, and features are then extracted jointly from the fused RGB-D images. Besides three observations about the fusion methods, the experimental results show that our method achieves competitive performance against the classical SIFT.

v) We propose a novel method called adaptive Visual-Depth Embedding (aVDE), which first learns a compact shared latent space between the representations of the labeled RGB and depth modalities in the source domain. This shared latent space then helps transfer the depth information to the unlabeled target dataset. Finally, aVDE matches features and reweights instances jointly across the shared latent space and the projected target domain to obtain an adaptive classifier. This method can exploit the additional depth information in the source domain while simultaneously reducing the domain mismatch between the source and target domains. On two real-world image datasets, the experimental results show that the proposed method significantly outperforms the state-of-the-art methods.
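The abstract does not spell out the complex-space projection used for RGB-D fusion. As a purely hypothetical illustration of the idea, one can place intensity in the real part and normalised depth in the imaginary part of a single complex image, so that any subsequent filtering sees both modalities jointly; the function name and normalisation below are assumptions, not the thesis's method.

```python
import numpy as np

def fuse_rgbd_complex(rgb, depth):
    """Hypothetical sketch of complex-space fusion: intensity becomes the
    real part and normalised depth the imaginary part, so a filter
    applied to z processes both modalities in one joint signal."""
    gray = rgb.mean(axis=2)                             # (H, W) intensity
    d = (depth - depth.min()) / (np.ptp(depth) + 1e-8)  # depth in [0, 1]
    z = gray + 1j * d
    # magnitude couples the modalities; phase encodes their balance
    return np.abs(z), np.angle(z)

rng = np.random.default_rng(0)
rgb = rng.random((4, 4, 3))    # toy RGB image in [0, 1]
depth = rng.random((4, 4))     # toy depth map
mag, phase = fuse_rgbd_complex(rgb, depth)
```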
Entropy in Image Analysis II
Image analysis is a fundamental task for any application where extracting information from images is required. The analysis requires highly sophisticated numerical and analytical methods, particularly for those applications in medicine, security, and other fields where the results of the processing consist of data of vital importance. This fact is evident from all the articles composing the Special Issue "Entropy in Image Analysis II", in which the authors used widely tested methods to verify their results. In reading the present volume, the reader will appreciate the richness of the methods and applications, in particular for medical imaging and image security, and a remarkable cross-fertilization among the proposed research areas.
Methods, Models, and Datasets for Visual Servoing and Vehicle Localisation
Machine autonomy has become a vibrant part of industrial and commercial aspirations. A growing demand exists for dexterous and intelligent machines that can work in unstructured environments without any human assistance. An autonomously operating machine should sense its surroundings, classify different kinds of observed objects, and interpret sensory information to perform necessary operations.
This thesis summarizes original methods aimed at enhancing a machine's autonomous operation capability. These methods and the corresponding results are grouped into two main categories. The first category consists of research works that focus on improving visual servoing systems for robotic manipulators to accurately position workpieces. We start our investigation with the hand-eye calibration problem, which focuses on calibrating visual sensors with a robotic manipulator. We thoroughly investigate the problem from various perspectives and provide alternative formulations of the problem and error objectives. The experimental results demonstrate that the proposed methods are robust and yield accurate solutions when tested on real and simulated data. The work package is bundled as a toolkit and available online for public use. In an extension, we propose a constrained multiview pose estimation approach for robotic manipulators. The approach exploits the available geometric constraints on the robotic system and infuses them directly into the pose estimation method. The empirical results demonstrate higher accuracy and significantly higher precision compared to other studies.
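The hand-eye calibration problem mentioned above is conventionally written as AX = XB, where A and B are corresponding camera and gripper motions and X is the unknown camera-to-gripper transform. The toolkit itself explores several alternative formulations; the sketch below is only the textbook linear least-squares solution on synthetic data, not the authors' method.

```python
import numpy as np

def rot(axis, angle):
    """Rotation matrix from an axis-angle pair (Rodrigues' formula)."""
    k = np.asarray(axis, float)
    k = k / np.linalg.norm(k)
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def solve_ax_xb(motions):
    """Solve AX = XB for X = (Rx, tx) from motion pairs
    ((Ra, ta), (Rb, tb)) via the Kronecker least-squares formulation:
    vec(Ra Rx - Rx Rb) = (I (x) Ra - Rb^T (x) I) vec(Rx) = 0."""
    I = np.eye(3)
    M = np.vstack([np.kron(I, Ra) - np.kron(Rb.T, I)
                   for (Ra, _), (Rb, _) in motions])
    _, _, Vt = np.linalg.svd(M)
    Rx = Vt[-1].reshape(3, 3, order='F')     # un-vec (column-major)
    U, _, Wt = np.linalg.svd(Rx)             # project onto SO(3)
    Rx = U @ Wt
    if np.linalg.det(Rx) < 0:                # fix the sign ambiguity
        Rx = -Rx
    # translation: (Ra - I) tx = Rx tb - ta, stacked over all pairs
    C = np.vstack([Ra - I for (Ra, _), _ in motions])
    d = np.concatenate([Rx @ tb - ta for (Ra, ta), (Rb, tb) in motions])
    tx, *_ = np.linalg.lstsq(C, d, rcond=None)
    return Rx, tx

# synthetic check: build consistent (A, B) pairs from a known X
Rx_true = rot([1, 2, 3], 0.7)
tx_true = np.array([0.5, -0.2, 1.0])
motions = []
for axis, ang, tb in [([1, 0, 0], 0.5, [0.1, 0.2, 0.3]),
                      ([0, 1, 0], 1.1, [-0.3, 0.4, 0.0]),
                      ([1, 1, 0], 0.3, [0.2, -0.1, 0.5])]:
    Rb, tb = rot(axis, ang), np.array(tb)
    Ra = Rx_true @ Rb @ Rx_true.T
    ta = Rx_true @ tb + tx_true - Ra @ tx_true
    motions.append(((Ra, ta), (Rb, tb)))

Rx_est, tx_est = solve_ax_xb(motions)
```

With noise-free synthetic motions the estimate matches the ground truth exactly; with real data, more motion pairs and a robust error objective, as studied in the thesis, become important.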
In the second part of this research, we tackle problems pertaining to the field of autonomous vehicles and its related applications. First, we introduce a pose estimation and mapping scheme to extend the application of visual Simultaneous Localization and Mapping to unstructured dynamic environments. We identify, extract, and discard dynamic entities from the pose estimation step. Moreover, we track the dynamic entities and actively update the map based on changes in the environment. Upon observing the limitations of the existing datasets during our earlier work, we introduce FinnForest, a novel dataset for testing and validating the performance of visual odometry and Simultaneous Localization and Mapping methods in an unstructured environment. We explored an environment with a forest landscape and recorded data with multiple stereo cameras, an IMU, and a GNSS receiver. The dataset offers unique challenges owing to the nature of the environment, variety of trajectories, and changes in season, weather, and daylight conditions. Building upon the future work proposed with the FinnForest dataset, we introduce a novel scheme that can localize an observer under extreme perspective changes. More specifically, we tailor the problem for autonomous vehicles such that they can recognize a previously visited place irrespective of the direction in which they previously traveled the route. To the best of our knowledge, this is the first study that accomplishes bi-directional loop closure on monocular images with a nominal field of view. To solve the localisation problem, we segregate place identification from pose regression by using deep learning in two steps. We demonstrate that bi-directional loop closure on monocular images is indeed possible when the problem is posed correctly and the training data is adequately leveraged.
All methodological contributions of this thesis are accompanied by extensive empirical analysis and discussions demonstrating the need, novelty, and improvement in performance over existing methods for pose estimation, odometry, mapping, and place recognition.
Earth Observation Open Science and Innovation
geospatial analytics; social observatory; big earth data; open data; citizen science; open innovation; earth system science; crowdsourced geospatial data; science in society; data science