Saliency-based approaches for multidimensional explainability of deep networks
In deep learning, visualization techniques extract the salient patterns that deep networks exploit to perform a task (e.g. image classification), focusing on single images. These methods allow a better understanding of these complex models, enabling the identification of the most informative parts of the input data. Beyond understanding the deep network, visual saliency is useful for many quantitative purposes and applications in both the 2D and 3D domains, such as analyzing the generalization capabilities of a classifier and autonomous navigation. In this thesis, we describe an approach to cope with the interpretability problem of a convolutional neural network and propose our ideas on how to exploit visualization for applications like image classification and active object recognition. After a brief overview of common visualization methods producing attention/saliency maps, we address two separate points. First, we describe how visual saliency can be used effectively in the 2D domain (e.g. RGB images) to boost image classification performance: visual summaries, i.e. compact representations of an ensemble of saliency maps, can improve the classification accuracy of a network through summary-driven specializations. Then, we present a 3D active recognition system that considers different views of a target object, overcoming the single-view hypothesis of classical object recognition and making the classification problem much easier in principle. Here we adopt such attention maps in a quantitative fashion by building a 3D dense saliency volume that fuses saliency maps obtained from different viewpoints, yielding a continuous proxy for which parts of an object are more discriminative for a given classifier. Finally, we show how to inject this representation into a real-world application, so that an agent (e.g. a robot) can move knowing the capabilities of its classifier.
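The idea of fusing per-view saliency maps into a dense 3D volume can be illustrated with a toy sketch. The crude orthographic back-projection and simple averaging below are illustrative assumptions, not the thesis's actual fusion pipeline: each 2D map is replicated along its viewing axis, and the resulting volumes are averaged so voxels salient from several viewpoints score highest.

```python
import numpy as np

def fuse_saliency_volume(view_maps, axes):
    """Fuse per-view 2D saliency maps into a dense 3D saliency volume.

    Each 2D map is back-projected by replicating it along its viewing
    axis (a crude orthographic back-projection); the per-view volumes
    are then averaged, so voxels supported by many views score highest.
    """
    n = view_maps[0].shape[0]
    volume = np.zeros((n, n, n))
    for sal_map, axis in zip(view_maps, axes):
        volume += np.expand_dims(sal_map, axis=axis).repeat(n, axis=axis)
    return volume / len(view_maps)

# Two toy 4x4 saliency maps, seen along different axes, that agree on one voxel
front = np.zeros((4, 4)); front[1, 2] = 1.0  # salient pixel in the front view
top   = np.zeros((4, 4)); top[2, 2]   = 1.0  # salient pixel in the top view
vol = fuse_saliency_volume([front, top], axes=[0, 1])
print(vol.shape)  # (4, 4, 4)
print(vol.max())  # 1.0 at the voxel both views mark as salient
```

A real system would use the camera geometry for back-projection; the averaging step, however, conveys how the volume becomes a continuous proxy for discriminative object parts.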
Real-time appearance-based gaze tracking.
Gaze tracking technology is widely used in Human Computer Interaction applications,
such as interfaces for assisting people with disabilities and driver attention monitoring.
However, commercially available gaze trackers are expensive, and their performance
deteriorates if the user is not positioned in front of the camera and facing it. Head
motion or being far from the device also degrades their accuracy.
This thesis focuses on the development of real-time appearance-based gaze
tracking algorithms using low-cost devices, such as a webcam or Kinect. The proposed
algorithms are developed with accuracy, robustness to head pose variation, and
the ability to generalise to different persons in mind. In order to deal with head pose variation, we
propose to estimate the head pose and then compensate for the appearance change and
the bias it introduces to a gaze estimator. Head pose is estimated by a novel method
that utilizes tensor-based regressors at the leaf nodes of a random forest. As a baseline
gaze estimator we use an SVM-based appearance regressor. To compensate for
the appearance variation introduced by the head pose we use a geometric model, and
to compensate for the bias we use a regression function trained on a
training set. Our methods are evaluated on publicly available datasets.
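The pose-compensation idea can be sketched in a few lines. The linear stand-ins below are hypothetical placeholders for the SVM-based appearance regressor and the learned bias regressor, chosen only to show how the two predictions combine:

```python
import numpy as np

def baseline_gaze(appearance_feat, W):
    """Toy linear stand-in for the SVM-based appearance regressor."""
    return W @ appearance_feat

def pose_bias(head_pose, B):
    """Toy linear stand-in for the trained bias-compensation regressor."""
    return B @ head_pose

def compensated_gaze(appearance_feat, head_pose, W, B):
    # Final estimate = baseline prediction minus the predicted pose-induced bias.
    return baseline_gaze(appearance_feat, W) - pose_bias(head_pose, B)

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 8))        # maps 8-D appearance features to 2-D gaze (yaw, pitch)
B = np.array([[0.1, 0.0, 0.0],     # maps 3-D head pose to a 2-D gaze bias
              [0.0, 0.1, 0.0]])
feat = rng.normal(size=8)
pose = np.array([5.0, -3.0, 0.0])  # head pose angles in degrees
print(compensated_gaze(feat, pose, W, B))
```

In the thesis the two regressors are trained models rather than fixed matrices, but the correction is applied in the same subtractive fashion.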
Artificial Intelligence Tools for Facial Expression Analysis.
Inner emotions show visibly upon the human face and are understood as a basic guide to an individual’s inner world. It is, therefore, possible to determine a person’s attitudes, and the effects of others’ behaviour on their deeper feelings, by examining facial expressions. In real-world applications, machines that interact with people need strong facial expression recognition. This recognition holds advantages for varied applications in affective computing, advanced human-computer interaction, security, stress and depression analysis, robotic systems, and machine learning. This thesis starts by proposing a benchmark of dynamic versus static methods for facial Action Unit (AU) detection. An AU is the activation of a set of local individual facial muscle parts that occur in unison, constituting a natural facial expression event. Detecting AUs automatically can provide explicit benefits since it considers both static and dynamic facial features. For this research, AU occurrence detection was conducted by extracting static and dynamic features of both nominal hand-crafted and deep learning representations from each static image of a video. This confirmed the superior ability of a pretrained model, which yields a leap in performance. Next, temporal modelling was investigated to detect the underlying temporal variation phases using supervised and unsupervised methods on dynamic sequences. During these processes, the importance of stacking dynamic features on top of static ones was discovered when encoding deep features for learning temporal information, combining the spatial and temporal schemes simultaneously. This study also found that fusing both spatial and temporal features gives more long-term temporal pattern information. Moreover, we hypothesised that using an unsupervised method would enable the learning of invariant information from dynamic textures.
Recently, cutting-edge developments have come from approaches based on Generative Adversarial Networks (GANs). In the second section of this thesis, we propose a model based on the adoption of an unsupervised DCGAN for facial feature extraction and classification, to achieve the following: the creation of facial expression images under different arbitrary poses (frontal, multi-view, and in the wild), and the recognition of emotion categories and AUs, in an attempt to resolve the problem of recognising the seven static emotion classes in the wild. Thorough cross-database experimentation demonstrates that this approach can improve generalization results. Additionally, we showed that the features learnt by the DCGAN process are poorly suited to encoding facial expressions when observed under multiple views, or when trained from a limited number of positive examples. Finally, this research focuses on disentangling identity from expression for facial expression recognition. A novel technique was implemented for emotion recognition from a single monocular image. A large-scale dataset (Face vid) was created from facial image videos rich in variations and distribution of facial dynamics, appearance, identities, expressions, and 3D poses. This dataset was used to train a DCNN (ResNet) to regress the expression parameters of a 3D Morphable Model jointly with a back-end classifier.
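The "stacking dynamic on top of static" scheme can be sketched simply. The sketch below is an illustrative assumption, not the thesis's exact encoding: per-frame static features are concatenated with their frame-to-frame differences as a crude dynamic component.

```python
import numpy as np

def stack_static_dynamic(frame_feats):
    """Stack dynamic (temporal-difference) features on top of static
    per-frame features, combining the spatial and temporal schemes.

    frame_feats: (T, D) array of per-frame feature vectors.
    Returns a (T, 2*D) array: [static | frame-to-frame change].
    """
    static = frame_feats
    # Prepend the first frame so the first difference is zero and T is preserved.
    dynamic = np.diff(frame_feats, axis=0, prepend=frame_feats[:1])
    return np.concatenate([static, dynamic], axis=1)

video = np.random.default_rng(1).normal(size=(10, 64))  # 10 frames, 64-D features each
print(stack_static_dynamic(video).shape)  # (10, 128)
```

In practice the static features would come from a pretrained deep model rather than random vectors, but the stacking operation is the same.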
Deep Learning Approaches for Seizure Video Analysis: A Review
Seizure events can manifest as transient disruptions in the control of
movements which may be organized in distinct behavioral sequences, accompanied
or not by other observable features such as altered facial expressions. The
analysis of these clinical signs, referred to as semiology, is subject to
observer variations when specialists evaluate video-recorded events in the
clinical setting. To enhance the accuracy and consistency of evaluations,
computer-aided video analysis of seizures has emerged as a natural avenue. In
the field of medical applications, deep learning and computer vision approaches
have driven substantial advancements. Historically, these approaches have been
used for disease detection, classification, and prediction using diagnostic
data; however, there has been limited exploration of their application in
evaluating video-based motion detection in the clinical epileptology setting.
While vision-based technologies do not aim to replace clinical expertise, they
can significantly contribute to medical decision-making and patient care by
providing quantitative evidence and decision support. Behavior monitoring tools
offer several advantages such as providing objective information, detecting
challenging-to-observe events, reducing documentation efforts, and extending
assessment capabilities to areas with limited expertise. The main applications
of these tools could be (1) improved seizure detection methods and (2) refined
semiology analysis for predicting seizure type and cerebral localization. In this paper,
we detail the foundation technologies used in vision-based systems in the
analysis of seizure videos, highlighting their success in semiology detection
and analysis, focusing on work published in the last 7 years. Additionally, we
illustrate how existing technologies can be interconnected through an
integrated system for video-based semiology analysis.
Comment: Accepted in Epilepsy & Behavior.
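As a minimal illustration of the kind of quantitative motion evidence such tools can supply (a generic sketch, not a method from the surveyed literature), movement can be scored per frame by simple frame differencing:

```python
import numpy as np

def motion_signal(frames, threshold=25):
    """Per-frame movement score via frame differencing: the fraction of
    pixels whose intensity changed by more than `threshold` between
    consecutive frames. frames: (T, H, W) uint8 array."""
    diffs = np.abs(np.diff(frames.astype(np.int16), axis=0))  # avoid uint8 wraparound
    return (diffs > threshold).mean(axis=(1, 2))

video = np.zeros((5, 32, 32), dtype=np.uint8)
video[3, 8:16, 8:16] = 200       # a sudden movement appears in frame 3
print(motion_signal(video))      # peaks at the transitions into and out of frame 3
```

Real systems use far richer spatiotemporal features, but even this signal shows how an objective, reproducible motion measure differs from observer-dependent visual review.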
Video metadata extraction in a videoMail system
Currently the world is swiftly adapting to visual communication. Online services like
YouTube and Vine show that video is no longer the domain of broadcast television only.
Video is used for different purposes such as entertainment, information, education or communication.
The rapid growth of today’s video archives, with editorial data only sparsely available,
makes retrieval a big problem. Humans see a video as a complex interplay of
cognitive concepts. As a result, there is a need to build a bridge between numeric values and semantic concepts, a connection that will facilitate video retrieval by humans.
The critical aspect of this bridge is video annotation. The process can be done manually or automatically. Manual annotation is very tedious, subjective and expensive;
therefore automatic annotation is being actively studied.
In this thesis we focus on the automatic annotation of multimedia content, namely
the use of analysis techniques for information retrieval that allow metadata to be
extracted automatically from video in a videomail system, including the identification of text, people, actions, spaces, and objects, including animals and plants.
It will thus be possible to align multimedia content with the text presented in the
email message and to create applications for semantic video database indexing and retrieval.
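A typical low-level building block of such automatic annotation is shot-boundary detection. The sketch below (parameter values and the histogram-distance criterion are illustrative assumptions, not the system described here) flags a boundary when the grey-level histograms of consecutive frames differ strongly:

```python
import numpy as np

def shot_boundaries(frames, bins=16, cutoff=0.5):
    """Flag candidate shot boundaries by comparing grey-level histograms
    of consecutive frames; a boundary is reported when the L1 histogram
    distance exceeds `cutoff`. frames: list of 2D uint8 arrays."""
    hists = [np.histogram(f, bins=bins, range=(0, 256))[0] / f.size
             for f in frames]
    dists = [np.abs(hists[i + 1] - hists[i]).sum() for i in range(len(hists) - 1)]
    return [i + 1 for i, d in enumerate(dists) if d > cutoff]

dark  = np.full((24, 24), 10,  dtype=np.uint8)
light = np.full((24, 24), 240, dtype=np.uint8)
print(shot_boundaries([dark, dark, light, light]))  # [2]
```

Segmenting the video into shots like this gives the higher-level annotators (text, face, object detection) coherent units to work on.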
Kinematics and control of precision grip grasping
This thesis is about the kind of signals used in our central nervous system for guiding
skilled motor behavior.
In the first two projects a currently very influential theory on the flow of visual
information inside our brain was tested. According to A. D. Milner and Goodale
(1995) there exist two largely independent visual streams. The dorsal stream is
supposed to transmit visual information for the guidance of action. The ventral
stream is thought to generate a conscious percept of the environment. The streams
are said to use different parts of the visual information and to differ in temporal
characteristics. Namely, the dorsal stream is proposed to have a lower sensitivity
for color and a more rapid decay of information than the ventral stream.
In the first project, the role of chromatic information in action guidance was
probed. We let participants grasp colored stimuli which varied in luminance.
Critically, some of these stimuli were completely isoluminant with the background.
These stimuli could thus only be discriminated from their surroundings by means of
chromatic contrast, a poor input signal for the dorsal stream. Nevertheless, our
participants were perfectly able to guide their grip to these targets as well.
In the second project, the temporal characteristics of the two streams were
probed. It has been argued that a certain group of neurological patients is able
to switch from dorsal to ventral control when visual information is removed.
These optic ataxia patients are normally quite bad at executing visually guided
movements such as pointing or grasping. Different researchers, however,
demonstrated that their accuracy does improve when there is a delay between
target presentation and movement execution. Using different delay times and
pointing movements, Himmelbach and Karnath (2005) had shown that this improvement
increases linearly with longer delays. We aimed at a replication of this result
and a generalization to precision grip movements. Our results from two patients,
however, did not show any improvement in grasping due to longer delay times. In
pointing, an effect was found in only one of the patients and only in one of
several measures of pointing accuracy.
Taken together, the results of the first two projects do not support the idea of
two independent visual streams and are more in line with the idea of a single
visual representation of target objects.
The third project aimed at closing a gap in existing model approaches on
precision grip kinematics. The available models need the target points of a
movement as an input on which they can operate. From the literature on human and
robotic grasping we extracted the most plausible set of rules for grasp point
selection. We created objects suitable to put these rules into conflict with
each other, and thereby estimated the individual contribution of each rule. We
validated the model by predicting grasp points on a completely novel set of
objects. Our straightforward approach showed very good performance in predicting the preferred contact points
of human actors.
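The rule-combination idea of the third project can be sketched abstractly. The two toy rules and the weights below are illustrative assumptions (the actual rule set and weights were estimated from the conflict experiments): each candidate contact-point pair is scored by a weighted sum of rule costs and the cheapest pair is selected.

```python
import numpy as np

def select_grasp(candidates, com, w_com=1.0, w_aperture=0.2):
    """Choose a contact-point pair by minimising a weighted sum of
    grasp-selection rule costs. Toy rules: keep the grasp axis close to
    the centre of mass (com) and prefer a small grip aperture."""
    def cost(p1, p2):
        midpoint = (p1 + p2) / 2
        return (w_com * np.linalg.norm(midpoint - com)        # rule 1: near the COM
                + w_aperture * np.linalg.norm(p1 - p2))       # rule 2: small aperture
    return min(candidates, key=lambda c: cost(*c))

com = np.array([0.0, 0.0])
candidates = [
    (np.array([-1.0, 0.0]), np.array([1.0, 0.0])),  # grasp axis through the COM
    (np.array([-1.0, 1.0]), np.array([1.0, 1.0])),  # offset grasp axis
]
best = select_grasp(candidates, com)
print(best[0], best[1])  # the pair through the COM wins
```

Fitting the weights to the observed conflicts and then predicting contact points on novel objects mirrors the estimate-then-validate procedure described above.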
Convolutional neural networks for the segmentation of small rodent brain MRI
Image segmentation is a common step in the analysis of preclinical brain MRI, often performed manually. This is a time-consuming procedure subject to inter- and intra-rater variability. A possible alternative is automated, registration-based segmentation, which suffers from a bias owed to the limited capacity of registration to adapt to pathological conditions such as Traumatic Brain Injury (TBI). In this work a novel method is developed for the segmentation of small rodent brain MRI based on Convolutional Neural Networks (CNNs). The experiments presented here show that CNNs provide a fast, robust and accurate alternative to both manual and registration-based methods. This is demonstrated by accurately segmenting three large datasets of MRI scans of healthy and Huntington disease model mice, as well as TBI rats. MU-Net and MU-Net-R,
the CNNs presented here, achieve human-level accuracy while eliminating intra-rater variability, alleviating the biases of registration-based segmentation, and with an inference time of less than one second per scan. Using these segmentation masks I designed a geometric construction to extract 39 parameters describing the position and orientation of the hippocampus, and later used them to classify epileptic vs. non-epileptic rats with a balanced accuracy of 0.80, five months after TBI. This clinically transferable geometric
approach detects subjects at high risk of post-traumatic epilepsy, paving the way towards subject stratification for antiepileptogenesis studies.
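The balanced accuracy quoted above is the mean of per-class recalls, a metric robust to the class imbalance typical of such cohorts. A minimal implementation (the toy labels are illustrative, not the study's data):

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Balanced accuracy: the mean of per-class recalls, so each class
    contributes equally regardless of how many samples it has."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))

# Toy imbalanced example: 6 non-epileptic (0) vs. 2 epileptic (1) subjects
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 0]
print(balanced_accuracy(y_true, y_pred))  # (5/6 + 1/2) / 2 ~ 0.667
```

Plain accuracy on this example would be 6/8 = 0.75, inflated by the majority class; balanced accuracy weights the rare epileptic class equally.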