Advances in Image Processing, Analysis and Recognition Technology
For many decades, researchers have been trying to make computers analyse images as effectively as the human visual system does. Many algorithms and systems have been created for this purpose. The whole process covers various stages, including image processing, representation and recognition. The results of this work can be applied to many computer-assisted areas of everyday life. They improve particular activities and provide handy tools, some of which serve only entertainment, but quite often they significantly increase our safety. Indeed, the range of practical applications of image processing algorithms is particularly wide. Moreover, the rapid growth of computing power has allowed for the development of more sophisticated and effective algorithms and tools. Although significant progress has been made so far, many issues remain, creating the need for novel approaches.
Advanced Biometrics with Deep Learning
Biometrics, such as fingerprint, iris, face, hand print, hand vein, speech and gait recognition, have become a commonplace means of identity management for various applications. Biometric systems follow a typical pipeline composed of separate preprocessing, feature extraction and classification stages. Deep learning, as a data-driven representation learning approach, has been shown to be a promising alternative to conventional data-agnostic, handcrafted preprocessing and feature extraction for biometric systems. Furthermore, deep learning offers an end-to-end learning paradigm that unifies preprocessing, feature extraction, and recognition, based solely on biometric data. This Special Issue has collected 12 high-quality, state-of-the-art research papers that deal with challenging issues in advanced biometric systems based on deep learning. The 12 papers can be divided into 4 categories according to biometric modality: face biometrics, medical electronic signals (EEG and ECG), voice print, and others.
Semantic Spaces for Video Analysis of Behaviour
There is ever-growing interest from the computer vision community in human behaviour analysis based on visual sensors. This interest generally covers: (1) behaviour recognition: given a video clip or a specific spatio-temporal volume of interest, discriminate it into one or more of a set of pre-defined categories; (2) behaviour retrieval: given a video or a textual description as a query, search for video clips with related behaviour; (3) behaviour summarisation: given a number of video clips, summarise representative and distinct behaviours. Although countless efforts have been dedicated to the problems mentioned above, few works have attempted to analyse human behaviours in a semantic space. In this thesis, we define semantic spaces as a collection of high-dimensional Euclidean spaces in which semantically meaningful events, e.g. individual words, phrases and visual events, can be represented as vectors or distributions, referred to as semantic representations. Within a semantic space, texts and visual events can be quantitatively compared by inner product, distance or divergence (a toy illustration is sketched below). The introduction of semantic spaces brings many benefits for visual analysis. For example, discovering semantic representations for visual data can facilitate semantically meaningful video summarisation, retrieval and anomaly detection. A semantic space can also seamlessly bridge categories and datasets that are conventionally treated as independent, encouraging the sharing of data and knowledge across categories and even datasets to improve recognition performance and reduce labelling effort. Moreover, a semantic space makes it possible to generalise a learned model beyond known classes, which is usually referred to as zero-shot learning. Nevertheless, discovering such a semantic space is non-trivial because: (1) a semantic space is hard to define manually; humans have a good sense of the semantic relatedness between visual and textual instances, but a measurable and finite semantic space is difficult to construct with limited manual supervision, so we adopt constructing the semantic space from data in an unsupervised manner; (2) it is hard to build a universal semantic space, i.e. the space is always context-dependent, so it is important to build the semantic space upon selected data such that it remains meaningful within the context. Even with a well-constructed semantic space, challenges remain, including: (3) how to represent visual instances in the semantic space; and (4) how to mitigate the misalignment of visual feature and semantic spaces across categories and even datasets when knowledge or data are generalised. This thesis tackles the above challenges by exploiting data from different sources and building contextual semantic spaces with which data and knowledge can be transferred and shared to facilitate general video behaviour analysis.
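As a toy illustration of the comparisons mentioned above, the sketch below measures similarity between semantic representations by inner product (cosine similarity) and by Euclidean distance. The vectors are made-up stand-ins for embeddings that would, in practice, be learned from data (e.g. word vectors).

```python
# Minimal sketch: comparing events in a semantic (embedding) space.
# The vectors below are hypothetical; real ones would come from a model
# such as word2vec trained on a large text corpus.
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Inner-product comparison of two semantic representations."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-dimensional semantic representations of three events.
walk = np.array([0.8, 0.1, 0.3, 0.0])
run  = np.array([0.7, 0.2, 0.4, 0.1])
eat  = np.array([0.0, 0.9, 0.1, 0.6])

print(cosine_similarity(walk, run))  # semantically close actions score high
print(cosine_similarity(walk, eat))  # unrelated actions score lower
print(np.linalg.norm(walk - run))    # Euclidean distance, an alternative measure
```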
To demonstrate the efficacy of semantic spaces for behaviour analysis, we focus on studying real-world problems including surveillance behaviour analysis, zero-shot human action recognition and zero-shot crowd behaviour recognition, with techniques specifically tailored to the nature of each problem.
Firstly, for video surveillance scenes, we propose to discover semantic representations from the visual data in an unsupervised manner, motivated by the wide availability of unlabelled visual data in surveillance systems. By representing visual instances in the semantic space, data and annotations can be generalised to new events and even new surveillance scenes. Specifically, to detect abnormal events, this thesis studies a geometrical alignment between the semantic representations of events across scenes; semantic actions can thus be transferred to new scenes, and abnormal events can be detected in an unsupervised way. To model multiple surveillance scenes simultaneously, we show how to learn a shared semantic representation across a group of semantically related scenes through a multi-layer clustering of scenes. With multi-scene modelling we show how to improve surveillance tasks including scene activity profiling/understanding, cross-scene query-by-example, behaviour classification, and video summarisation.
Secondly, to avoid extremely costly and ambiguous video annotation, we investigate how to generalise recognition models learned from known categories to novel ones, which is often termed zero-shot learning. To exploit the limited human supervision available, e.g. category names, we construct the semantic space via a word-vector representation trained on a large textual corpus in an unsupervised manner. The representation of a visual instance in the semantic space is obtained by learning a visual-to-semantic mapping. We observe that blindly applying the mapping learned from known categories to novel categories introduces a bias that deteriorates performance, a problem termed domain shift. To address it, we employ techniques including semi-supervised learning, self-training, hubness correction, multi-task learning and domain adaptation. In combination, these methods achieve state-of-the-art performance on the zero-shot human action recognition task.
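The following is a hedged sketch of the general visual-to-semantic pipeline described above, not the thesis' exact models: a ridge regression (one common choice) maps visual features to the word-vector embeddings of known classes, and a novel-class sample is then labelled by its nearest novel-class embedding. All features and embeddings here are synthetic placeholders.

```python
# Hedged sketch of zero-shot recognition via a visual-to-semantic mapping.
# All data is synthetic; a real system would use video features and
# word2vec embeddings of the class names.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
d_vis, d_sem = 64, 16                    # feature / embedding dimensions

# Hypothetical class embeddings: rows are word vectors for class names.
known_emb = rng.normal(size=(5, d_sem))  # 5 known (training) classes
novel_emb = rng.normal(size=(3, d_sem))  # 3 novel (zero-shot) classes

# Synthetic training set: visual features with labels from known classes.
y_train = rng.integers(0, 5, size=200)
X_train = rng.normal(size=(200, d_vis))

# Learn the visual-to-semantic mapping on known classes only.
reg = Ridge(alpha=1.0).fit(X_train, known_emb[y_train])

# A test sample is projected into the semantic space and assigned to the
# nearest novel-class embedding: no novel-class training data is used.
x_test = rng.normal(size=(1, d_vis))
z = reg.predict(x_test)
dists = np.linalg.norm(novel_emb - z, axis=1)
print("predicted novel class:", int(dists.argmin()))
```

Domain shift appears here as a systematic error in `reg.predict` on novel classes; the corrective techniques named above (self-training, hubness correction, etc.) adjust either the mapping or the nearest-neighbour step.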
Lastly, we study the possibility of re-using known, manually labelled semantic crowd attributes to recognise rare and unknown crowd behaviours, a task termed zero-shot crowd behaviour recognition. Crucially, we point out that, given the multi-labelled nature of semantic crowd attributes, zero-shot recognition can be improved by exploiting the co-occurrence between attributes.
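One simple way such co-occurrence could be exploited (an assumption for illustration, not necessarily the thesis' scheme) is to blend raw attribute scores with scores propagated through an attribute co-occurrence matrix estimated from the multi-labelled training data:

```python
# Hedged sketch: refining attribute scores with a co-occurrence prior.
# The propagation rule here is one simple choice among many.
import numpy as np

# Hypothetical matrix from multi-labelled training data:
# C[i, j] ~ P(attribute j present | attribute i present).
C = np.array([[1.0, 0.8, 0.1],
              [0.6, 1.0, 0.2],
              [0.1, 0.3, 1.0]])

scores = np.array([0.9, 0.4, 0.2])  # raw per-attribute classifier scores
lam = 0.5                           # strength of the co-occurrence prior

# Blend each raw score with a co-occurrence-weighted average of the others.
refined = (1 - lam) * scores + lam * (C.T @ scores) / C.sum(axis=0)
print(refined)  # attributes that co-occur with confident ones are boosted
```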
To summarise, this thesis studies methods for analysing video behaviours and demonstrates that exploiting semantic spaces for video analysis is advantageous and, more importantly, enables multi-scene analysis and zero-shot learning beyond conventional learning strategies.
Carried baggage detection and recognition in video surveillance with foreground segmentation
Security cameras installed in public spaces or in private organizations continuously record video data with the aim of detecting and preventing crime. For that reason, video content analysis applications, whether for real-time (i.e. analytic) or post-event (i.e. forensic) analysis, have gained great interest in recent years. In this thesis, the primary focus is on two key aspects of video analysis: reliable moving object segmentation, and carried object detection and identification.
A novel moving object segmentation scheme based on background subtraction is presented in this thesis. The scheme relies on background modelling based on multi-directional gradient and phase congruency. As a post-processing step, the detected foreground contours are refined by classifying their edge segments as belonging to either the foreground or the background. A further contour completion technique based on anisotropic diffusion is introduced to this area for the first time. The proposed method targets cast shadow removal, invariance to gradual illumination change, and closed contour extraction.
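For context only, the sketch below shows the generic running-average background subtraction that such schemes build upon; the thesis itself replaces this intensity-based model with multi-directional gradient and phase congruency features, which are not shown here. The input file name is hypothetical.

```python
# Minimal running-average background subtraction (generic baseline).
import cv2
import numpy as np

cap = cv2.VideoCapture("surveillance.avi")  # hypothetical input clip
ok, frame = cap.read()
background = np.float32(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY))

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Foreground mask: pixels far from the slowly adapting background model.
    diff = cv2.absdiff(gray, cv2.convertScaleAbs(background))
    _, mask = cv2.threshold(diff, 25, 255, cv2.THRESH_BINARY)
    # A slow update keeps the model robust to gradual illumination change.
    cv2.accumulateWeighted(gray, background, 0.01)
cap.release()
```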
A state-of-the-art carried object detection method is employed as a benchmark algorithm. This method performs silhouette analysis by comparing human temporal templates with unencumbered human models. The implementation of the algorithm is improved by automatically estimating the viewing direction of the pedestrian, and it is extended by a carried luggage identification module. Because the temporal template is a frequency template and the information it provides is not sufficient, a colour temporal template is introduced. The standard steps of the state-of-the-art algorithm are then approached from this extended, colour-informed perspective, resulting in more accurate carried object segmentation.
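A minimal sketch of the two templates discussed above, under the assumption that silhouettes have already been segmented and aligned: the frequency (temporal) template records how often each pixel is foreground across the sequence, and the colour temporal template additionally averages the pixel colour over the frames in which it is foreground.

```python
# Sketch of frequency and colour temporal templates from aligned silhouettes.
import numpy as np

def temporal_templates(frames: np.ndarray, masks: np.ndarray):
    """frames: (T, H, W, 3) uint8 images; masks: (T, H, W) boolean silhouettes."""
    freq = masks.mean(axis=0)                  # frequency template, values in [0, 1]
    fg = masks[..., None]                      # broadcast mask over colour channels
    counts = np.maximum(masks.sum(axis=0), 1)  # avoid division by zero
    colour = (frames * fg).sum(axis=0) / counts[..., None]
    return freq, colour
```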
The experiments conducted in this research show that the proposed closed foreground segmentation technique attains all the aforementioned goals. The incremental improvements applied to the state-of-the-art carried object detection algorithm reveal the full potential of the scheme, and the experiments demonstrate the ability of the proposed carried object detection algorithm to surpass the state-of-the-art method.
On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator
Deployed image classification pipelines typically depend on images captured in real-world environments, which may be affected by different sources of perturbation (e.g. sensor noise in low-light conditions). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification; this challenge has, hence, attracted wide interest within the computer vision community. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are computed using the CORF push-pull inhibition operator; this operation transforms an input image into a representation that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classification model without CORF delineation maps, while consistently achieving significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.
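As a rough illustration of the pipeline shape only: the CORF push-pull operator has, to our knowledge, no standard Python implementation, so a plain Canny edge map stands in for the delineation map below. The point is merely that the classifier consumes a contour-like image rather than the raw, possibly noisy, pixels.

```python
# Sketch: preprocess an image into an edge/contour map before the CNN.
# Canny is a stand-in here, NOT the CORF push-pull operator from the paper.
import cv2
import numpy as np

def delineation_map(image: np.ndarray) -> np.ndarray:
    """Stand-in for a CORF delineation map (Canny used for illustration)."""
    blurred = cv2.GaussianBlur(image, (5, 5), 1.0)  # mild denoising first
    return cv2.Canny(blurred, 50, 150)

# A Fashion-MNIST-sized grayscale image with synthetic noise.
img = (np.random.rand(28, 28) * 255).astype(np.uint8)
edge_input = delineation_map(img)  # this map, not img, is fed to the CNN
```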
Speech data analysis for semantic indexing of video of simulated medical crises.
The Simulation for Pediatric Assessment, Resuscitation, and Communication (SPARC) group within the Department of Pediatrics at the University of Louisville was established to enhance the care of children by using simulation-based educational methodologies to improve patient safety and strengthen clinician-patient interactions. After each simulation session, the physician must manually review and annotate the recordings and then debrief the trainees. The physician responsible for the simulation has recorded hundreds of videos and is seeking solutions that can automate the process. This dissertation introduces our system for efficient segmentation and semantic indexing of videos of medical simulations using machine learning methods. It provides the physician with automated tools to review important sections of the simulation by identifying who spoke, when, and with what emotion. Only the audio information is extracted and analyzed, because the quality of the image recording is low and the visual environment is static for most parts. Our proposed system includes four main components: preprocessing, speaker segmentation, speaker identification, and emotion recognition. The preprocessing consists of first extracting the audio component from the video recording, then extracting various low-level audio features to detect and remove silence segments. We investigate and compare two different approaches for this task: the first is threshold-based and the second is classification-based. The second main component of the proposed system detects speaker change points for the purpose of segmenting the audio stream; we propose two fusion methods for this task. The speaker identification and emotion recognition components of our system are designed to give users the capability to browse the video and retrieve shots that identify "who spoke, when, and with what emotion" for further analysis. For these components, we propose two feature representation methods that map audio segments of arbitrary length to a feature vector with fixed dimensions. The first is based on soft bag-of-words (BoW) feature representations; in particular, we define three types of BoW based on crisp, fuzzy, and possibilistic voting. The second feature representation is a generalization of the BoW based on the Fisher Vector (FV), which uses the Fisher Kernel principle and combines the benefits of generative and discriminative approaches. The proposed feature representations are used within two learning frameworks. The first is supervised learning, which assumes that a large collection of labeled training data is available; within this framework, we use standard classifiers including K-nearest neighbor (K-NN), support vector machine (SVM), and Naive Bayes. The second framework is based on semi-supervised learning, where only a limited amount of labeled training samples is available; here we use an approach based on label propagation. Our proposed algorithms were evaluated using 15 medical simulation sessions. The results were analyzed and compared to those obtained using state-of-the-art algorithms. We show that our proposed speech segmentation fusion algorithms and feature mappings outperform existing methods. We also integrated all proposed algorithms and developed a GUI prototype system for subjective evaluation. This prototype processes medical simulation videos and provides the user with a visual summary of the different speech segments.
It also allows the user to browse videos and retrieve scenes that answer semantic queries such as: who spoke and when? who interrupted whom? and what was the emotion of the speaker? The GUI prototype can also provide summary statistics for each simulation video, for example: for how long did each person speak? What is the longest uninterrupted speech segment? Is there an unusually large number of pauses within the speech segments of a given speaker?
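A minimal sketch of the soft (fuzzy-voting) bag-of-words mapping described above, from a variable-length audio segment to a fixed-length vector. The frame features are synthetic stand-ins for MFCCs, and the softmax-over-negative-distances weighting is one possible fuzzy-voting choice, not necessarily the dissertation's exact formulation.

```python
# Hedged sketch: soft bag-of-words over per-frame audio features.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Codebook learned from (synthetic) training frames of 13-dim features.
codebook = KMeans(n_clusters=8, n_init=10, random_state=0).fit(
    rng.normal(size=(500, 13)))

def soft_bow(frames: np.ndarray, beta: float = 1.0) -> np.ndarray:
    """Map (T, 13) frame features to one fixed 8-dimensional histogram."""
    d = np.linalg.norm(frames[:, None, :] - codebook.cluster_centers_, axis=2)
    w = np.exp(-beta * d)
    w /= w.sum(axis=1, keepdims=True)  # fuzzy memberships per frame
    return w.mean(axis=0)              # pooled histogram, fixed length

segment = rng.normal(size=(137, 13))   # arbitrary-length segment
print(soft_bow(segment))               # always 8 numbers, whatever T was
```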
A Survey of Computer Vision Methods for 2D Object Detection from Unmanned Aerial Vehicles
The spread of Unmanned Aerial Vehicles (UAVs) in the last decade has revolutionized many application fields. The most investigated research topics focus on increasing autonomy during operational campaigns, environmental monitoring, surveillance, mapping, and labeling. To achieve such complex goals, a high-level module builds semantic knowledge by leveraging the outputs of a low-level module that takes data acquired from multiple sensors and extracts information about what is sensed. All in all, object detection is undoubtedly the most important low-level task, and the sensors most often employed to accomplish it are by far RGB cameras, owing to their cost, dimensions, and the wide literature on RGB-based object detection. This survey presents recent advancements in 2D object detection for UAVs, focusing on the differences, strategies, and trade-offs between the generic object detection problem and the adaptation of such solutions to UAV operations. Moreover, a new taxonomy is proposed that considers different height intervals and is driven by the methodological approaches introduced by works in the state of the art, rather than by hardware, physical, and/or technological constraints.
- …