Multi-modal particle filtering tracking using appearance, motion and audio likelihoods
ABSTRACT We propose a multi-modal object tracking algorithm that combines appearance, motion and audio information in a particle filter. The proposed tracker fuses, at the likelihood level, the audio-visual observations captured with a video camera coupled with two microphones. Two video likelihoods are computed, based on a 3D color histogram appearance model and on color change detection, while an audio likelihood provides information about the direction of arrival of a target. The direction of arrival is computed with a multi-band generalized cross-correlation function, enhanced with noise suppression and reverberation filtering that exploits the precedence effect. We evaluate the tracker in single- and multi-modality settings and quantify the performance improvement gained by integrating audio and visual information in the tracking process.
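The likelihood-level fusion described above can be sketched in a few lines. This is a minimal 1-D illustration, not the authors' implementation: the Gaussian likelihood surrogates standing in for the appearance, motion and audio cues, the random-walk motion model, and the resampling threshold are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_likelihoods(l_appearance, l_motion, l_audio):
    # Likelihood-level fusion: the modalities are assumed conditionally
    # independent given the state, so the joint likelihood is their product.
    return l_appearance * l_motion * l_audio

def particle_filter_step(particles, weights, likelihood_fns, motion_std=1.0):
    # Predict: diffuse particles with a random-walk motion model.
    particles = particles + rng.normal(0.0, motion_std, particles.shape)
    # Update: re-weight each particle by the fused multi-modal likelihood.
    weights = weights * fuse_likelihoods(*[fn(particles) for fn in likelihood_fns])
    weights = weights / weights.sum()
    # Systematic resampling when the effective sample size collapses.
    n = len(weights)
    if 1.0 / np.sum(weights ** 2) < 0.5 * n:
        positions = (rng.random() + np.arange(n)) / n
        idx = np.minimum(np.searchsorted(np.cumsum(weights), positions), n - 1)
        particles, weights = particles[idx], np.full(n, 1.0 / n)
    return particles, weights

# Toy example: a target at x = 5 observed through three hypothetical
# Gaussian likelihood surrogates (appearance, motion, audio).
target = 5.0
gauss = lambda s: (lambda p: np.exp(-0.5 * ((p[:, 0] - target) / s) ** 2))
particles = rng.uniform(0.0, 10.0, (500, 1))
weights = np.full(500, 1.0 / 500)
for _ in range(10):
    particles, weights = particle_filter_step(
        particles, weights, [gauss(2.0), gauss(3.0), gauss(1.5)])
estimate = float(np.sum(weights * particles[:, 0]))
```

Because the fused likelihood is a product, a cue that is uninformative in a given frame (a flat likelihood) simply leaves the other cues' weighting unchanged, which is one reason product fusion is a common default.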
Tracking interacting targets in multi-modal sensors
PhD thesis. Object tracking is one of the fundamental tasks in various applications such as surveillance,
sports, video conferencing and activity recognition. Factors such as occlusions,
illumination changes and limited field of observance of the sensor make tracking a challenging
task. To overcome these challenges the focus of this thesis is on using multiple
modalities such as audio and video for multi-target, multi-modal tracking. Particularly,
this thesis presents contributions to four related research topics, namely, pre-processing of
input signals to reduce noise, multi-modal tracking, simultaneous detection and tracking,
and interaction recognition.
To improve the performance of detection algorithms, especially in the presence
of noise, this thesis investigates filtering of the input data through spatio-temporal feature
analysis as well as through frequency band analysis. The pre-processed data from multiple
modalities is then fused within Particle filtering (PF). To further minimise the discrepancy
between the real and the estimated positions, we propose a strategy that associates the
hypotheses and the measurements with a real target using Weighted Probabilistic Data
Association (WPDA). Since the filtering involved in the detection process reduces the
available information and is inapplicable to low signal-to-noise-ratio data, we investigate
simultaneous detection and tracking approaches and propose a multi-target track-before-detect
Particle filtering (MT-TBD-PF). The proposed MT-TBD-PF algorithm bypasses
the detection step and performs tracking on the raw signal. Finally, we apply the proposed
multi-modal tracking to recognise interactions between targets in regions within, as well
as outside the cameras’ fields of view.
The efficiency of the proposed approaches is demonstrated on large uni-modal,
multi-modal and multi-sensor scenarios from real-world detection, tracking and event
recognition datasets, and through participation in evaluation campaigns.
Audio-coupled video content understanding of unconstrained video sequences
Unconstrained video understanding is a difficult task. The main aim of this thesis is to
recognise the nature of objects, activities and environment in a given video clip using
both audio and video information. Traditionally, audio and video information has not
been applied together for solving such a complex task, and for the first time we propose,
develop, implement and test a new framework of multi-modal (audio and video) data
analysis for context understanding and labelling of unconstrained videos.
The framework relies on feature selection techniques and introduces a novel algorithm
(PCFS) that is faster than the well-established SFFS algorithm. We use the framework for
studying the benefits of combining audio and video information in a number of different
problems. We begin by developing two independent content recognition modules. The
first one is based on image sequence analysis alone, and uses a range of colour, shape,
texture and statistical features from image regions with a trained classifier to recognise
the identity of objects, activities and environment present. The second module uses audio
information only, and recognises activities and environment. Both of these approaches
are preceded by detailed pre-processing to ensure that correct video segments containing
both audio and video content are present, and that the developed system can be made
robust to changes in camera movement, illumination, random object behaviour etc. For
both audio and video analysis, we use a hierarchical approach of multi-stage
classification such that difficult classification tasks can be decomposed into simpler and
smaller tasks.
When combining both modalities, we compare fusion techniques at different levels of
integration and propose a novel algorithm that combines advantages of both feature and
decision-level fusion. The analysis is evaluated on a large amount of test data comprising
unconstrained videos collected for this work. Finally, we propose a decision correction
algorithm, which shows that further steps towards combining multi-modal classification
information effectively with semantic knowledge generate the best possible results.
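Since PCFS is specific to this thesis, a sketch of plain sequential forward selection (the non-floating ancestor of SFFS) illustrates the wrapper-style search that both algorithms accelerate. The correlation-based score and the toy data are assumptions made for the example.

```python
import numpy as np

def sequential_forward_selection(X, y, score_fn, k):
    # Greedy forward selection: at each step, add the single feature whose
    # inclusion most improves the score of the current subset.  SFFS adds
    # backward "floating" removal steps on top of this loop.
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k:
        best_f, best_s = None, -np.inf
        for f in remaining:
            s = score_fn(X[:, selected + [f]], y)
            if s > best_s:
                best_f, best_s = f, s
        selected.append(best_f)
        remaining.remove(best_f)
    return selected

# Toy data: feature 3 is informative by construction, the rest are noise.
rng = np.random.default_rng(2)
y = rng.integers(0, 2, 200).astype(float)
X = rng.normal(0.0, 1.0, (200, 6))
X[:, 3] += 2.0 * y
# Assumed subset score: |correlation| of the subset's column mean with y.
score = lambda Xs, y: abs(np.corrcoef(Xs.mean(axis=1), y)[0, 1])
chosen = sequential_forward_selection(X, y, score, k=2)
```

The cost of this search is dominated by repeated classifier evaluations, which is why faster subset-search strategies such as the PCFS algorithm proposed here matter in practice.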
Modellierung primärer multisensorischer Mechanismen der räumlichen Wahrnehmung (Modelling of primary multisensory mechanisms of spatial perception)
Abstract
The presented work concerns visual, aural, and multimodal aspects of spatial
perception as well as their relevance to the design of artificial systems. The
scientific approach chosen here has an interdisciplinary character, combining
the perspectives of neurobiology, psychology, and computer science. As a result,
new insights and interpretations of neurological findings are achieved, and
deficits of known models and applications are identified and overcome.
In chapter one, the discussion starts with a review of established models of
attention, which largely disregard early neural mechanisms. In the following
investigations and experiments, the basic idea can be expressed as a conceptual
differentiation between early spatial attention and higher cognitive functions.
All neural mechanisms that are modelled within the scope of this work can be
regarded as primary and object-independent sensory processing.
In chapters two and three, the visual and binaural spatial representations of the
brain and the specific concept of the computational topography in the central
auditory system are discussed. Given the restriction of early neural processes,
the aim of the actual multisensory integration, as it is described in chapter
four, is not object classification or tracking but primary spatial attention.
Without task- or object-related requirements, all specifications of the model are
derived from findings about certain multisensory structures of the midbrain. In
chapter five emphasis is placed on a novel method of evaluation and parameter
optimization based on biologically inspired
specifications and real-world experiments.
The importance of early perceptual processes to orienting behaviour and the consequences for technical applications are discussed.

In the present work, visual, auditory, and multimodal forms of spatial perception and their relevance to the design of technical systems are discussed. The scientific approach taken has an interdisciplinary character and, in the context of neuroinformatics and robotics, draws equally on the methods of neurobiology, perceptual psychology, and computer science. As a result, new and further-reaching interpretations of findings on natural perception become possible on the one hand; on the other, deficits of existing simulation models and technical applications are identified and overcome.

The starting point of the investigation, in chapter 1, is a discussion and critical appraisal of established attention models of perception, in which early multisensory brain functions remain largely unconsidered. As the guiding idea of the subsequent investigations, the thesis is put forward that a conceptual separation between primary attention and higher cognitive functions eases both the classification of sensory features and neurological mechanisms and their modelling and simulation.

Chapters 2 and 3 first present the primary spatial codings of the central auditory pathway and of the visual system, and describe the specifics of projected and computed sensory topographies. The subsequent modelling of auditory-visual integration mechanisms in chapter 4 expressly serves not the classification or tracking of objects but an early spatial steering of attention, which in the biological archetype takes place unconsciously and at a subcortical level. After a discussion of the few known model concepts, two multisensory simulation systems of the author's own are developed, based on artificial neural networks and probabilistic methods.

Chapter 5 is devoted to the systematic experimental investigation and optimization of the models, and shows how unconscious perceptual performances and their simulation can be evaluated with reference to qualitative and quantitative findings on multisensory effects in the midbrain. The discussion of the model behaviour in real audio-visual scenarios underlines that the early steering of attention, taking place before object recognition, makes an important contribution to spatial orientation.
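The early audio-visual steering of attention described above can be caricatured in a few lines: both cues are expressed over a shared azimuth axis and combined multiplicatively, loosely mirroring the enhancement reported for spatially coincident stimuli in the midbrain. The Gaussian cue profiles, their positions and the baseline terms are assumptions for illustration, not the models developed in the thesis.

```python
import numpy as np

def fuse_attention(visual_profile, audio_profile):
    # Multiplicative combination of normalised spatial cue profiles over a
    # common azimuth axis -- a toy stand-in for subcortical audio-visual
    # integration.
    v = visual_profile / visual_profile.sum()
    a = audio_profile / audio_profile.sum()
    joint = v * a
    return joint / joint.sum()

azimuth = np.linspace(-90.0, 90.0, 181)           # degrees, 1-degree grid
# Hypothetical cues: a sharp visual blob at +20 deg and a broad auditory
# localisation profile near +25 deg, each on a small uniform baseline.
visual = np.exp(-0.5 * ((azimuth - 20.0) / 10.0) ** 2) + 0.05
audio = np.exp(-0.5 * ((azimuth - 25.0) / 20.0) ** 2) + 0.05
attention = fuse_attention(visual, audio)
orient_to = float(azimuth[np.argmax(attention)])  # orienting direction
```

The fused peak falls between the two unimodal peaks, weighted toward the sharper (more reliable) visual cue, which is the qualitative behaviour such pre-attentive integration models aim to reproduce.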