128 research outputs found
VISUAL SALIENCY ANALYSIS, PREDICTION, AND VISUALIZATION: A DEEP LEARNING PERSPECTIVE
In recent years, great progress has been made in predicting human eye fixations. Several studies have employed deep learning to achieve high prediction accuracy. These studies rely on deep networks pre-trained for object classification: they either treat saliency prediction as a transfer-learning problem or use the weights of the pre-trained network as the initialization for learning a saliency model. Such pre-trained networks are used because the datasets of human fixations available for training a deep learning model are relatively small. Another, less prioritized problem is that the amount of computation such deep learning models require calls for expensive hardware. In this dissertation, two approaches are proposed to tackle the abovementioned problems. The first approach, codenamed DeepFeat, incorporates the deep features of convolutional neural networks pre-trained for object and scene classification. It is the first approach that uses deep features without further learning. The performance of the DeepFeat model is extensively evaluated over a variety of datasets using a variety of implementations. The second approach is a deep learning saliency model, codenamed ClassNet. Two main differences separate ClassNet from other deep learning saliency models. ClassNet is the only deep learning saliency model that learns its weights from scratch. In addition, ClassNet treats the prediction of human fixations as a classification problem, while other deep learning saliency models treat it as a regression problem or as a classification of a regression problem.
Matching and Segmentation for Multimedia Data
With the development of society, both industry and academia are paying increasing attention to multimedia systems, which handle image/video data, audio data, and text data comprehensively and simultaneously. In this thesis, we mainly focus on multi-modality data understanding, combining the two subjects of Computer Vision (CV) and Natural Language Processing (NLP). Such a task is widely used in many real-world scenarios, including criminal search with language descriptions by the witness, robotic navigation with language instruction in the smart industry, terrorist tracking, missing person identification, and so on. However, such a multi-modality system still faces many challenges that limit its performance and ability in real-life situations, including the domain gap between the modalities of vision and language, the need for high-quality datasets, and so on. Therefore, to better analyze and handle these challenges, this thesis focuses on two fundamental tasks: matching and segmentation.
Image-Text Matching (ITM) aims to retrieve the texts (images) that describe the most relevant contents for a given image (text) query. Due to the semantic gap between the linguistic and visual domains, aligning and comparing feature representations for languages and images remain challenging. To overcome this limitation, we propose a new framework for the image-text matching task, which uses an auxiliary captioning step to enhance the image feature: the image feature is fused with the text feature of the captioning output. As a downstream application of ITM, language-person search is a specific case where language descriptions are provided to retrieve person images, and it also suffers from the domain gap between linguistic and visual data. To handle this problem, we propose a transformer-based language-person search matching framework in which matching is conducted between words and image regions for better image-text interaction. However, collecting a large amount of training data with human annotations is neither cheap nor reliable. We therefore further study the one-shot person Re-ID (re-identification) task, which aims to match people given one labeled reference image for each person, whereas previous methods require a large number of ground-truth labels. We propose progressive sample mining and representation learning to better fit the limited labels in the one-shot Re-ID task.
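The retrieval step of image-text matching described above can be illustrated with a minimal sketch. It assumes that images and texts have already been mapped into a shared embedding space and that relevance is scored by cosine similarity; both are common choices but are assumptions here, not details given in the abstract. The function names `cosine_similarity` and `rank_candidates` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_candidates(query_embedding, candidate_embeddings):
    """Return candidate indices ordered from most to least similar to the query.

    Assumption: the query (e.g. a text) and the candidates (e.g. images)
    live in the same embedding space, produced by some encoders not shown here.
    """
    scored = [
        (cosine_similarity(query_embedding, emb), idx)
        for idx, emb in enumerate(candidate_embeddings)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored]


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    text_query = [1.0, 0.0]
    image_candidates = [[0.0, 1.0], [1.0, 0.1], [-1.0, 0.0]]
    print(rank_candidates(text_query, image_candidates))
```

The same ranking function serves both directions (text-to-image and image-to-text) because it only depends on the embeddings, not on which modality produced them.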
Referring Expression Segmentation (RES) aims to localize and segment the target according to the given language expression. Existing methods jointly consider the localization and segmentation steps and rely on the fused visual and linguistic features for both steps. We argue that the conflict between the purposes of finding the object and generating the mask limits RES performance. To solve this problem, we propose a parallel position-kernel-segmentation pipeline that first isolates the localization and segmentation steps and then lets them interact. In our pipeline, linguistic information does not directly contaminate the visual feature used for segmentation. Specifically, the localization step localizes the target object in the image based on the referring expression, and the visual kernel obtained from the localization step then guides the segmentation step. This pipeline also enables us to train RES in a weakly-supervised way, where the pixel-level segmentation labels are replaced by click annotations on center and corner points. The position head is trained in a fully-supervised manner with the click annotations as supervision, and the segmentation head is trained with weakly-supervised segmentation losses.
This thesis focuses on the key limitations of multimedia systems, and the experiments show that the proposed frameworks are effective for the specific tasks. The experiments are easy to reproduce with clear details, and source code is provided for future work on these tasks.
Advances in Deep Learning Towards Fire Emergency Application: Novel Architectures, Techniques and Applications of Neural Networks
Paper IV is not published yet. With respect to copyright, papers IV and VI were excluded from the dissertation. Deep learning has been successfully used in various applications, and recently there has been increasing interest in applying deep learning to emergency management. However, there are still many significant challenges that limit the use of deep learning in the latter application domain. In this thesis, we address some of these challenges and propose novel deep learning methods and architectures.
The challenges we address fall into three areas of emergency management: detection of the emergency (fire), analysis of the situation without human intervention, and evacuation planning. In this thesis, we have used the computer vision tasks of image classification and semantic segmentation, as well as sound recognition, for detection and analysis. For evacuation planning, we have used deep reinforcement learning.
Tumour grading and discrimination based on class assignment and quantitative texture analysis techniques
Medical imaging represents the utilisation of technology in biology for the purpose of noninvasively revealing the internal structure of the organs of the human body. It is a way to improve the quality of the patient's life through a more precise and rapid diagnosis, and with limited side-effects, leading to an effective overall treatment procedure. The main objective of this thesis is to propose novel tumour discrimination techniques that cover both micro and macro-scale textures encountered in computed tomography (CT) and digital microscopy (DM) modalities, respectively. Image texture can provide significant information on the (ab)normality of tissue, and this thesis expands this idea to tumour texture grading and classification. The fractal dimension (FD) as a texture measure was applied to contrast-enhanced CT lung tumour images with the aim of improving tumour grading accuracy from the conventional CT modality, and quantitative performance analysis showed an accuracy of 83.30% in distinguishing between advanced (aggressive) and early-stage (non-aggressive) malignant tumours. A different approach was adopted for subtype discrimination of brain tumour DM images via a set of statistical and model-based texture analysis algorithms. The combined Gaussian Markov random field and run-length matrix texture measures outperformed all other combinations, achieving an overall class assignment classification accuracy of 92.50%. In addition, two new histopathological multiresolution approaches, one based on applying the FD for best-basis selection in the discrete wavelet packet transform and one that fuses it with the Gabor filters' energy output, improved the accuracy to 91.25% and 95.00%, respectively. Since noise is common in all medical imaging modalities, the impact of noise on the applied texture measures was assessed as well.
The developed lung and brain texture analysis techniques can improve the physician's ability to detect and analyse pathologies, leading to a more reliable diagnosis and treatment of disease.
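To make the idea of the fractal dimension as a texture measure concrete, the following is a minimal sketch of one common estimator, box counting on a binary mask. The abstract does not state which FD estimator the thesis uses, so the choice of box counting, the binary input, and the names `box_count` and `box_counting_dimension` are all assumptions made for illustration; grey-level CT images would typically need a grey-scale variant.

```python
import math


def box_count(image, size):
    """Count size-by-size boxes that contain at least one foreground pixel."""
    rows, cols = len(image), len(image[0])
    count = 0
    for r in range(0, rows, size):
        for c in range(0, cols, size):
            if any(
                image[i][j]
                for i in range(r, min(r + size, rows))
                for j in range(c, min(c + size, cols))
            ):
                count += 1
    return count


def box_counting_dimension(image, sizes=(1, 2, 4, 8)):
    """Estimate the fractal dimension as the slope of log N(s) against log(1/s).

    Assumption: `image` is a non-empty list of equal-length rows whose truthy
    entries mark the structure of interest.
    """
    xs, ys = [], []
    for s in sizes:
        n = box_count(image, s)
        if n > 0:
            xs.append(math.log(1.0 / s))
            ys.append(math.log(n))
    if len(xs) < 2:
        raise ValueError("need at least two non-empty scales to fit a slope")
    # Ordinary least-squares slope of ys against xs.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


if __name__ == "__main__":
    filled = [[1] * 16 for _ in range(16)]                # a solid square
    line = [[1] * 16] + [[0] * 16 for _ in range(15)]     # a single straight row
    print(box_counting_dimension(filled), box_counting_dimension(line))
```

A solid region and a straight line are useful sanity checks because their dimensions (two and one) are known; irregular, rougher structures fall in between, which is what makes the measure informative about texture.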
View-invariant gait person re-identification with spatial and temporal attention
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London. Person re-identification at a distance across multiple non-overlapping cameras has been an active research area for years. In the past ten years, short-term person Re-Id techniques have made great strides in terms of accuracy using only appearance features in limited environments. However, massive intra-class variations and inter-class confusion limit their ability to be used in practical applications. Moreover, appearance consistency can only be assumed in a short time span from one camera to the other. Since the holistic appearance will change drastically over days and weeks, the techniques mentioned above will be ineffective. Practical applications usually require a long-term solution in which the subject's appearance and clothing might have changed after a significant period has elapsed. Facing these problems, soft biometric features such as gait have been proposed in the past. Nevertheless, even gait can vary with illness, ageing and changes in the emotional state, changes in walking surfaces, shoe type, clothes type, objects carried by the subject and even clutter in the scene. Therefore, gait is considered a temporal cue that could provide biometric motion information. On the other hand, the shape of the human body could be viewed as a spatial signal which can produce valuable information. So, extracting discriminative features from both spatial and temporal domains would be very beneficial to this research. Therefore, this thesis focuses on finding the best and most robust method to tackle the gait human re-identification problem and solve it for practical applications. In real-world surveillance scenarios, the human gait cycle is primarily abnormal. These abnormalities include, but are not limited to, changes in temporal and spatial characteristics such as walking speed, broken gait phase and, most importantly, varied camera angles. Our work performed an extensive literature study on spatial and temporal gait feature extraction methods with a focus on deep learning. Next, we conducted a comparative study and proposed a spatial-temporal approach for gait feature extraction using the fusion of multiple modalities, including optical flow, raw silhouettes and RGB images. This approach was tested on two of the most challenging publicly available datasets for gait recognition, TUM-GAID and CASIA-B, with excellent results presented in Chapter 3.
Furthermore, a modern spatial-temporal attention mechanism, which learns salient features independent of the gait cycle and view variations, was proposed and tested on the CASIA-B and OULP datasets. The spatial attention layer in the proposed method extracts the spatial feature maps using a two-layered architecture whose outputs are combined by late fusion. It can discriminatively pay attention to the identity-related salient regions in silhouette sequences using the spatial feature maps. The temporal attention layer consists of an LSTM that encodes the temporal motion for silhouette sequences. It uses the encoded output vectors in the temporal attention architecture to focus on the most critical timesteps in the gait cycle and discard the rest. Furthermore, we improved the performance of our method by mapping our extracted spatial-temporal gait features to a discriminative null space for use in our Siamese architecture for cross-matching. We also conducted an element removal experiment on each segment of our spatial-temporal attentional network to gain insight into each component’s contribution to the performance. Our method showed outstanding robustness against abnormal gait cycles as well as viewpoint variations on both benchmark datasets.
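The temporal attention step described above, which weights the encoded per-timestep vectors so that the most informative timesteps dominate, can be sketched as softmax-weighted pooling. This is a generic illustration and not the thesis architecture: the dot-product scoring, the `query` vector standing in for learned parameters, and the function names are all assumptions, and the per-timestep vectors are taken as given rather than produced by an LSTM.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def temporal_attention_pool(hidden_states, query):
    """Pool per-timestep vectors into one vector using attention weights.

    hidden_states: list of equal-length vectors, one per timestep
                   (assumed to come from a sequence encoder such as an LSTM).
    query:         hypothetical learned scoring vector of the same length.
    Returns (pooled_vector, attention_weights).
    """
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    pooled = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return pooled, weights


if __name__ == "__main__":
    # Three timesteps with 2-D encodings, purely illustrative values.
    states = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
    pooled, weights = temporal_attention_pool(states, query=[1.0, 0.0])
    print(weights)  # the third timestep receives the largest weight
    print(pooled)
```

Soft weighting of this kind down-weights uninformative timesteps rather than removing them outright; a hard selection of timesteps would be a different design choice.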
Word Knowledge and Word Usage
Word storage and processing define a multi-factorial domain of scientific inquiry whose thorough investigation goes well beyond the boundaries of traditional disciplinary taxonomies and requires the synergic integration of a wide range of methods, techniques and empirical and experimental findings. The present book intends to approach a few central issues concerning the organization, structure and functioning of the Mental Lexicon, by asking domain experts to look at common, central topics from complementary standpoints, and discuss the advantages of developing converging perspectives. The book will explore the connections between computational and algorithmic models of the mental lexicon, word frequency distributions and information theoretical measures of word families, statistical correlations across psycho-linguistic and cognitive evidence, principles of machine learning and integrative brain models of word storage and processing. The main goal of the book will be to map out the landscape of future research in this area, to foster the development of interdisciplinary curricula and to help single-domain specialists understand and address issues and questions as they are raised in other disciplines.
Object Recognition
Vision-based object recognition tasks are very familiar in our everyday activities, such as keeping our car in the correct lane while driving. We do these tasks effortlessly and in real time. In recent decades, with the advancement of computer technology, researchers and application developers have been trying to mimic the human capability for visual recognition. Such a capability will allow machines to free humans from boring or dangerous jobs.