Multimodal Explainable Artificial Intelligence: A Comprehensive Review of Methodological Advances and Future Research Directions
The current study focuses on systematically analyzing the recent advances in
the field of Multimodal eXplainable Artificial Intelligence (MXAI). In
particular, the relevant primary prediction tasks and publicly available
datasets are initially described. Subsequently, a structured presentation of
the MXAI methods in the literature is provided, taking into account the
following criteria: a) the number of involved modalities, b) the stage at
which explanations are produced, and c) the type of methodology adopted
(i.e., mathematical formalism). Then, the metrics used for MXAI evaluation are
discussed. Finally, a comprehensive analysis of current challenges and future
research directions is provided.
Comment: 26 pages, 11 figures
ImageNet Large Scale Visual Recognition Challenge
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in
object category classification and detection on hundreds of object categories
and millions of images. The challenge has been run annually from 2010 to
present, attracting participation from more than fifty institutions.
This paper describes the creation of this benchmark dataset and the advances
in object recognition that have been possible as a result. We discuss the
challenges of collecting large-scale ground truth annotation, highlight key
breakthroughs in categorical object recognition, provide a detailed analysis of
the current state of the field of large-scale image classification and object
detection, and compare the state-of-the-art computer vision accuracy with human
accuracy. We conclude with lessons learned in the five years of the challenge,
and propose future directions and improvements.
Comment: 43 pages, 16 figures. v3 includes additional comparisons with PASCAL
VOC (per-category comparisons in Table 3, distribution of localization
difficulty in Fig 16), a list of queries used for obtaining object detection
images (Appendix C), and some additional references
Human pose and action recognition
This thesis focuses on detection of persons and pose recognition using neural networks.
The goal is to detect human body poses in a visual scene with multiple
persons and to use this information in order to recognize human activity. This is
achieved by first detecting persons in a scene and then by estimating their body
joints in order to infer articulated poses.
The work developed in this thesis explored neural networks and deep learning
methods. Deep learning makes it possible to employ computational models composed
of multiple processing layers that learn representations of data with multiple levels
of abstraction. These methods have greatly improved the state of the art in many
domains such as speech recognition and visual object detection and classification.
Deep learning discovers intricate structure in data by using the backpropagation
algorithm to indicate how a machine should change its internal parameters that are
used to compute the representation in each layer from the representation provided
by the previous one.
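The layer-by-layer parameter update described above can be sketched in a few lines. The network sizes, learning rate, and data below are illustrative choices for this sketch, not values from the thesis:

```python
import numpy as np

# Minimal sketch of backpropagation: each layer computes its representation
# from the previous layer's output, and the loss gradient flows backwards
# to tell every weight how to change. All numbers here are illustrative.
W1 = np.array([[0.1, 0.2, 0.3],
               [0.2, 0.1, 0.0],
               [0.3, 0.0, 0.1],
               [0.1, 0.1, 0.1]])      # layer 1: 3 inputs -> 4 hidden units
W2 = np.array([[0.1, 0.2, 0.1, 0.3]])  # layer 2: 4 hidden -> 1 output

x = np.array([[1.0], [0.5], [-0.5]])   # a single training input
y = np.array([[1.0]])                  # its target output

for _ in range(200):
    # forward pass: representation of each layer from the previous one
    h = np.maximum(W1 @ x, 0.0)        # ReLU hidden representation
    y_hat = W2 @ h
    loss = float((y_hat - y) ** 2)

    # backward pass: gradients of the loss w.r.t. each weight matrix
    d_out = 2.0 * (y_hat - y)
    dW2 = d_out @ h.T
    dh = W2.T @ d_out
    dW1 = (dh * (h > 0)) @ x.T         # gradient gated by the ReLU

    # update step: the "internal parameters" the text refers to
    W2 -= 0.05 * dW2
    W1 -= 0.05 * dW1

print(round(loss, 6))
```

After a couple of hundred steps the squared error is driven close to zero, illustrating how the backward-flowing gradient shapes each layer's representation.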
Person detection, in general, is a difficult task due to the large variability in
appearance caused by factors such as scale, viewpoint and occlusion. An object
detection framework based on multi-stage convolutional features for pedestrian detection
is proposed in this thesis. This framework extends Fast R-CNN by combining
convolutional features from different stages of a CNN (Convolutional Neural Network)
to improve the detector's accuracy. This provides high-quality detections of persons
in a visual scene, which are then used as input, in conjunction with a human pose
estimation model, to estimate the body joint locations of multiple persons in an image.
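One common way to realise the "combination of features from different stages" is to RoI-pool each stage's feature map over the same image region, normalise, and concatenate. The sketch below illustrates that idea; the function names, shapes, and the L2-normalisation step are my own assumptions, not the thesis implementation:

```python
import numpy as np

def roi_pool(fmap, roi, out=4):
    """Max-pool the region roi = (x0, y0, x1, y1) of a C x H x W
    feature map onto a fixed out x out grid (RoI pooling)."""
    x0, y0, x1, y1 = roi
    region = fmap[:, y0:y1, x0:x1]
    C, H, W = region.shape
    ys = np.linspace(0, H, out + 1).astype(int)
    xs = np.linspace(0, W, out + 1).astype(int)
    pooled = np.zeros((C, out, out))
    for i in range(out):
        for j in range(out):
            cell = region[:, ys[i]:max(ys[i + 1], ys[i] + 1),
                             xs[j]:max(xs[j + 1], xs[j] + 1)]
            pooled[:, i, j] = cell.max(axis=(1, 2))
    return pooled

def multistage_descriptor(stages, roi):
    """Combine RoI features from several conv stages: pool each stage
    over the same (rescaled) region, L2-normalise, and concatenate."""
    parts = []
    for fmap, scale in stages:
        r = tuple(int(v * scale) for v in roi)  # roi in that stage's grid
        f = roi_pool(fmap, r).ravel()
        parts.append(f / (np.linalg.norm(f) + 1e-8))
    return np.concatenate(parts)
```

Later stages have coarser grids, so the same image window maps to a smaller region there; the per-stage `scale` factor accounts for that downsampling.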
Human pose estimation is performed by a deep convolutional neural network composed
of a series of residual auto-encoders. These produce multiple predictions, which are
later combined into a heatmap prediction of human body joints. In this network
topology, features are processed across all scales, capturing the various spatial
relationships associated with the body. Repeated bottom-up and top-down processing
with intermediate supervision for each auto-encoder is applied. This
results in very accurate 2D heatmaps of body joint predictions.
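The final read-out from such heatmaps can be sketched as follows. Averaging the per-stage predictions and decoding each joint as the arg-max location are generic illustrative choices, not necessarily the exact combination rule of the thesis:

```python
import numpy as np

def joints_from_heatmaps(heatmap_stack):
    """Average the per-stage heatmap predictions (one J x H x W map per
    auto-encoder stage) and read off each joint as its arg-max location."""
    combined = np.mean(heatmap_stack, axis=0)       # J x H x W
    J, H, W = combined.shape
    flat = combined.reshape(J, -1).argmax(axis=1)
    return np.stack([flat // W, flat % W], axis=1)  # (row, col) per joint

# toy example: 2 stages, 3 joints, 8x8 maps, peak planted at (2, 5)
stack = np.zeros((2, 3, 8, 8))
stack[:, :, 2, 5] = 1.0
print(joints_from_heatmaps(stack).tolist())  # → [[2, 5], [2, 5], [2, 5]]
```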
The methods presented in this thesis were benchmarked against other top-performing
methods on popular datasets for pedestrian detection and human pose estimation,
achieving good results compared with other state-of-the-art algorithms.
Semantic Attributes for Transfer Learning in Visual Recognition
Driven by the success of deep learning methods, considerable progress has been made in artificial intelligence, in particular in machine perception. However, thousands of manually annotated training examples are strictly required to ensure the generalization ability of such models. Moreover, the model must be retrained from scratch every time it is applied to a new problem class. This in turn means that the very costly process of collecting and annotating training data must be repeated, which severely limits the scalability of such models. Humans, on the other hand, do not tackle new tasks in isolation, but have the remarkable ability to draw on previously acquired knowledge when solving new problems. This ability is called transfer learning. It enables us to learn new things faster, better, and from only very few examples. There is therefore great interest in imitating this ability algorithmically, especially in domains where training data is very scarce or even unavailable.
In this work we study transfer learning in the context of computer vision. In particular, we examine how visual recognition (e.g., object or action classification) can be performed when few or no training examples exist. A promising solution in this direction is the framework of semantic attributes. Here, visual categories are described in terms of attributes such as color, pattern, and shape. These attributes can be learned from a disjoint set of training examples. Since the attributes have a dual interpretation, both visual and semantic, language can be used effectively to steer the transfer process. This means that models for a new visual category can be built from its linguistic description alone, by selecting relevant attributes and transferring them to the new category. This process entirely removes the need for training images. In this work we present new solutions for modeling semantic attributes, transferring them, automatically associating them with visual categories, and recognizing them from linguistic descriptions. To this end, we examine attribute-based recognition from the following four viewpoints:
1) Unlike the common model, in which attributes must be learned globally, we present a hierarchical approach that allows attributes to be learned at different levels of abstraction. We also show how the structure between categories can be exploited effectively to steer the learning and transfer process and thereby build discriminative models for new categories. With a thorough experimental analysis we demonstrate a clear improvement of our model over the global approach, especially in recognizing fine-grained categories.
2) In prevailing attribute-based transfer approaches, the user supervises the association between attributes and categories. In this work we propose to establish the link between the two automatically and without user intervention. Our model captures the semantic relations that couple attributes with objects in order to predict their associations and to select, in an unsupervised manner, which attributes to transfer.
3) We circumvent the need for a predefined vocabulary of attributes. Instead, we propose using encyclopedia articles that describe object categories in free text to automatically discover a set of discriminative, salient, and diverse attributes. Removing the need for a user-defined vocabulary allows us to fully exploit the potential of attribute-based models at very large scale.
4) We present a novel real-world application of semantic attributes. We propose the first method that automatically learns fashion styles and predicts how their popularity will evolve in the near future. We show that semantic attributes yield interpretable fashion styles and lead to better predictions of the popularity of visual styles compared with other representations.
Image context for object detection, object context for part detection
Objects and parts are crucial elements for achieving automatic image understanding.
The goal of the object detection task is to recognize and localize all the objects in an
image. Similarly, semantic part detection attempts to recognize and localize the object
parts. This thesis proposes four contributions. The first two make object detection
more efficient by using active search strategies guided by image context. The last two
involve parts. One of them explores the emergence of parts in neural networks trained
for object detection, whereas the other improves on part detection by adding object
context.
First, we present an active search strategy for efficient object class detection. Modern
object detectors evaluate a large set of windows using a window classifier. Instead,
our search sequentially chooses which window to evaluate next based on all the information
gathered so far. This results in a significant reduction in the number of
window evaluations needed to detect the objects in the image. We guide our search strategy
using image context and the score of the classifier.
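The sequential selection step can be sketched as follows. The Gaussian spreading of observed scores to nearby windows is one simple stand-in for the context cue; all parameter names and values are illustrative, not the thesis model:

```python
import numpy as np

def active_search(centers, score_fn, n_evals=10, sigma=40.0, prior=0.5):
    """Sketch of context-guided active search: instead of scoring every
    window, repeatedly evaluate the window with the highest expected
    score, then propagate the observed score to spatially nearby windows."""
    centers = np.asarray(centers, float)
    expect = np.full(len(centers), float(prior))  # expected score per window
    results = []
    for _ in range(n_evals):
        idx = int(np.argmax(expect))              # most promising window
        s = float(score_fn(idx))                  # run the window classifier
        results.append((idx, s))
        # spread the observation to neighbours (image-context cue)
        d = np.linalg.norm(centers - centers[idx], axis=1)
        w = np.exp(-(d ** 2) / (2 * sigma ** 2))
        expect = (1 - w) * expect + w * s
        expect[idx] = -np.inf                     # never re-evaluate
    return results
```

Only `n_evals` windows are ever scored, rather than the full window set, which is the source of the reduction in evaluations described above.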
In our second contribution, we extend this active search to jointly detect pairs of
object classes that appear close in the image, exploiting the valuable information that
one class can provide about the location of the other. This leads to an even further
reduction in the number of evaluations needed for the smaller, more challenging
classes.
In the third contribution of this thesis, we study whether semantic parts emerge
in Convolutional Neural Networks trained for different visual recognition tasks, especially
object detection. We perform two quantitative analyses that provide a deeper
understanding of their internal representation by investigating the responses of the network
filters. Moreover, we explore several connections between discriminative power
and semantics, which provide further insight into the role of semantic parts in the
network.
Finally, the last contribution is a part detection approach that exploits object context.
We complement part appearance with the object appearance, its class, and the expected
relative location of the parts inside it. We significantly outperform approaches
that use part appearance alone on this challenging task.
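One simple way to combine part appearance with object context is to weight the part detector's score by how well the part sits at its expected relative location inside the object box. The sketch below illustrates this; the Gaussian location prior and all names (`rel_mu`, `rel_sigma`) are illustrative assumptions, not the thesis formulation:

```python
import numpy as np

def rescore_part(part_xy, part_score, obj_box, rel_mu, rel_sigma):
    """Toy object-context rescoring: multiply a part detector's appearance
    score by a Gaussian prior on the part's location expressed in
    object-normalised coordinates (0..1 inside the object box)."""
    ox, oy, ow, oh = obj_box
    px, py = part_xy
    rel = np.array([(px - ox) / ow, (py - oy) / oh])
    # how plausible is this relative location for this part class?
    prior = np.exp(-0.5 * np.sum(((rel - np.asarray(rel_mu))
                                  / np.asarray(rel_sigma)) ** 2))
    return part_score * prior
```

A part candidate with the same appearance score is thus ranked higher when it lies where the object model expects it, and suppressed when it does not.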
BNAIC 2008: Proceedings of BNAIC 2008, the twentieth Belgian-Dutch Artificial Intelligence Conference
Plant Seed Identification
Plant seed identification is routinely performed for seed certification in seed trade, phytosanitary certification for the import and export of agricultural commodities, and regulatory monitoring, surveillance, and enforcement. Current identification is performed manually by seed analysts with limited aiding tools. Extensive expertise and time are required, especially for small, morphologically similar seeds. Computers are, however, especially good at recognizing subtle differences that humans find difficult to perceive. In this thesis, a 2D, image-based computer-assisted approach is proposed.
The size of plant seeds is extremely small compared with everyday objects. Microscopic images of plant seeds are usually degraded by defocus blur due to the high magnification of the imaging equipment. It is necessary and beneficial to differentiate the in-focus and blurred regions, given that only sharp regions carry the distinctive information needed for identification. If the object of interest, the plant seed in this case, is in-focus in a single image frame, the amount of defocus blur can be employed as a cue to separate the object from the cluttered background. If the defocus blur is strong enough to obscure the object itself, sharp regions of multiple image frames acquired at different focal distances can be merged into an all-in-focus image. This thesis describes a novel no-reference sharpness metric which exploits the difference in the distribution of uniform LBP patterns between blurred and non-blurred image regions. It runs in real time on a single CPU core and responds much better on low-contrast sharp regions than competing metrics. Its benefits are shown both in defocus segmentation and focal stacking.
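The LBP machinery behind such a metric can be sketched as follows. This toy version scores a patch by how many pixels carry a non-flat LBP code; the actual thesis metric compares the full distribution of uniform LBP patterns, so this crude flat/textured split is for illustration only:

```python
import numpy as np

def lbp_codes(gray):
    """8-neighbour local binary pattern code for every interior pixel:
    bit b is set when neighbour b is >= the centre pixel."""
    g = np.asarray(gray, float)
    c = g[1:-1, 1:-1]
    neigh = [g[:-2, :-2], g[:-2, 1:-1], g[:-2, 2:], g[1:-1, 2:],
             g[2:, 2:],   g[2:, 1:-1],  g[2:, :-2], g[1:-1, :-2]]
    code = np.zeros_like(c, dtype=int)
    for bit, n in enumerate(neigh):
        code |= (n >= c).astype(int) << bit
    return code

def sharpness(gray):
    """Toy no-reference sharpness cue: perfectly flat neighbourhoods
    yield the all-ones code 255 (every neighbour >= centre), so score a
    patch by the fraction of pixels carrying any other pattern. This
    stands in for the thesis metric's uniform-pattern histogram
    comparison; it is a simplification, not the published metric."""
    code = lbp_codes(gray)
    return 1.0 - (code == 255).mean()
```

A blurred, low-contrast region produces mostly flat codes and a low score, while textured in-focus regions produce varied codes and a high score, which is the cue used for defocus segmentation and for choosing the sharpest frame per pixel in focal stacking.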
With the obtained all-in-focus seed image, a scale-wise pooling method is proposed to construct its feature representation. Since the imaging settings in lab testing are well constrained, the seed objects in the acquired image can be assumed to have measurable scale and controllable scale variance. The proposed method utilizes real pixel-scale information and allows for accurate comparison of seeds across scales. By cross-validation on our high-quality seed image dataset, a better identification rate (95%) was achieved compared with pre-trained convolutional-neural-network-based models (93.6%). It offers an alternative method for image-based identification with all-in-focus object images of limited scale variance.
The very first digital seed identification tool of its kind was built and deployed for testing in the seed laboratory of the Canadian Food Inspection Agency (CFIA). The proposed focal stacking algorithm was employed to create all-in-focus images, whereas the scale-wise pooling feature representation was used as the image signature. Throughput, workload, and identification rate were evaluated, and seed analysts reported significantly lower mental demand (p = 0.00245) when using the provided tool compared with manual identification. Although the identification rate in the practical test is only around 50%, I have demonstrated common mistakes made in the imaging process and possible ways to deploy the tool to improve the recognition rate.
Advances in detecting object classes and their semantic parts
Object classes are central to computer vision and have been the focus of substantial
research in the last fifteen years. This thesis addresses the tasks of localizing entire
objects in images (object class detection) and localizing their semantic parts (part detection).
We present four contributions, two for each task. The first two improve
existing object class detection techniques by using context and calibration. The other
two contributions explore semantic part detection in weakly-supervised settings.
First, the thesis presents a technique for predicting properties of objects in an image
based on the image's global appearance only. We demonstrate the method by predicting three
properties: aspect of appearance, location in the image and class membership. Overall,
the technique makes multi-component object detectors faster and improves their
performance.
The second contribution is a method for calibrating the popular Ensemble of Exemplar-
SVM object detector. Unlike the standard approach, which calibrates each Exemplar-
SVM independently, our technique optimizes their joint performance as an ensemble.
We devise an efficient optimization algorithm to find the global optimal solution of the
calibration problem. This leads to better object detection performance compared to
using independent calibration.
The third innovation is a technique to train part-based models of object classes using
data sourced from the web. We learn rich models incrementally. Our models encompass
the appearance of parts and their spatial arrangement on the object, specific to
each viewpoint. Importantly, it does not require any part location annotations, whose
collection is one of the main obstacles to training part detectors.
Finally, the last contribution is a study on whether semantic object parts emerge in
Convolutional Neural Networks trained for higher-level tasks, such as image classification.
While previous efforts studied this matter by visual inspection only, we perform
an extensive quantitative analysis based on ground-truth part location annotations. This
provides a more conclusive answer to the question.
Novel deep learning architectures for marine and aquaculture applications
Alzayat Saleh's research was in the area of artificial intelligence and machine learning to autonomously recognise fish and their morphological features from digital images. Here he created new deep learning architectures that solved various computer vision problems specific to the marine and aquaculture context. He found that these techniques can facilitate aquaculture management and environmental protection. Fisheries and conservation agencies can use his results for better monitoring strategies and sustainable fishing practices.