Multi modality fusion
Dissertation supervisor: Dr. Ye Duan. Includes vita. 2D images and 3D LIDAR range scans provide very different but complementary information about a single subject and, when registered, can be used for a variety of applications. Video sets can be fused with a 3D model and played in a single multi-dimensional environment. Imagery with temporal changes can be visualized simultaneously, unveiling changes in architecture, foliage, and human activity. Depth information for 2D photos and videos can be computed. Real-world measurements can be provided to users through simple interactions with traditional photographs. However, fusing multi-modality data is a very challenging task given the repetition and ambiguity that often occur in man-made scenes, as well as the variety of properties that different renderings of the same subject can possess. Image sets collected over a period of time during which the lighting conditions and scene content may have changed, different artistic renderings, varying sensor types, focal lengths, and exposure values can all contribute to visual variations in data sets. This dissertation addresses these obstacles using the common theme of incorporating contextual information to visualize regional properties that intuitively exist in each imagery source. We combine hard features, which quantify the strong, stable edges that are often present in imagery along object boundaries and depth changes, with soft features, which capture distinctive texture information that can be unique to specific areas. We show that our detector and descriptor techniques can provide more accurate keypoint match sets between highly varying imagery than many traditional and state-of-the-art techniques, allowing us to fuse and align photographs, videos, and range scans containing both man-made and natural content. Includes bibliographical references (pages 227-250).
Bio-Inspired Modality Fusion for Active Speaker Detection
Human beings have developed fantastic abilities to integrate information from
various sensory sources, exploring their inherent complementarity. Perceptual
capabilities are therefore heightened, enabling, for instance, the well-known
"cocktail party" and McGurk effects, i.e. speech disambiguation from a panoply
of sound signals. This fusion ability is also key in refining the perception of
sound source location, as in distinguishing whose voice is being heard in a
group conversation. Furthermore, neuroscience has successfully identified the
superior colliculus region in the brain as the one responsible for this
modality fusion, with a handful of biological models having been proposed to
approach its underlying neurophysiological process. Deriving inspiration from
one of these models, this paper presents a methodology for effectively fusing
correlated auditory and visual information for active speaker detection. Such
an ability can have a wide range of applications, from teleconferencing systems
to social robotics. The detection approach initially routes auditory and visual
information through two specialized neural network structures. The resulting
embeddings are fused via a novel layer based on the superior colliculus, whose
topological structure emulates spatial neuron cross-mapping of unimodal
perceptual fields. The validation process employed two publicly available
datasets, with achieved results confirming and greatly surpassing initial
expectations. Comment: Submitted to IEEE RA-L with IROS option, 202
Evaluating Two-Stream CNN for Video Classification
Videos contain very rich semantic information. Traditional hand-crafted
features are known to be inadequate in analyzing complex video semantics.
Inspired by the huge success of the deep learning methods in analyzing image,
audio and text data, significant efforts are recently being devoted to the
design of deep nets for video analytics. Among the many practical needs,
classifying videos (or video clips) based on their major semantic categories
(e.g., "skiing") is useful in many applications. In this paper, we conduct an
in-depth study to investigate important implementation options that may affect
the performance of deep nets on video classification. Our evaluations are
conducted on top of a recent two-stream convolutional neural network (CNN)
pipeline, which uses both static frames and motion optical flows, and has
demonstrated competitive performance against the state-of-the-art methods. In
order to gain insights and to arrive at a practical guideline, many important
options are studied, including network architectures, model fusion, learning
parameters and the final prediction methods. Based on the evaluations, very
competitive results are attained on two popular video classification
benchmarks. We hope that the discussions and conclusions from this work can
help researchers in related fields to quickly set up a good basis for further
investigations along this very promising direction. Comment: ACM ICMR'1
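The model fusion option mentioned in the abstract above can be illustrated with a minimal late-fusion sketch: class scores from a spatial (static frame) stream and a temporal (optical flow) stream are converted to probabilities and combined by a weighted average. The function names, the fusion weight, and the example logits are hypothetical and are not taken from the paper, which may use a different fusion scheme.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(spatial_logits, temporal_logits, w_spatial=0.5):
    """Weighted average of per-class probabilities from the two streams.

    w_spatial is an assumed, hand-picked weight for the frame stream;
    the flow stream receives 1 - w_spatial.
    """
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    return [
        w_spatial * ps + (1.0 - w_spatial) * pt
        for ps, pt in zip(p_spatial, p_temporal)
    ]


# Made-up scores for three classes, for illustration only.
spatial = [2.0, 1.0, 0.1]    # frame stream favours class 0
temporal = [0.5, 2.5, 0.3]   # flow stream favours class 1

fused = late_fusion(spatial, temporal, w_spatial=0.4)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted, [round(p, 3) for p in fused])
```

With these made-up inputs the flow stream dominates because it is both more confident and more heavily weighted, so the fused prediction follows it.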
Multi-Modal Trip Hazard Affordance Detection On Construction Sites
Trip hazards are a significant contributor to accidents on construction and
manufacturing sites, where over a third of Australian workplace injuries occur
[1]. Current safety inspections are labour intensive and limited by human
fallibility, making automation of trip hazard detection appealing from both a
safety and economic perspective. Trip hazards present an interesting challenge
to modern learning techniques because they are defined as much by affordance as
by object type; for example, wires on a table are not a trip hazard, but can be
if lying on the ground. To address these challenges, we conduct a comprehensive
investigation into the performance characteristics of 11 different colour and
depth fusion approaches, including 4 fusion and one non-fusion approach, using
colour and two types of depth images. Trained and tested on over 600 labelled
trip hazards over 4 floors and 2000m in an active construction
site, this approach was able to differentiate between identical objects in
different physical configurations (see Figure 1). Outperforming a colour-only
detector, our multi-modal trip detector fuses colour and depth information to
achieve a 4% absolute improvement in F1-score. These investigative results and
the extensive publicly available dataset move us one step closer to assistive
or fully automated safety inspection systems on construction sites. Comment: 9 Pages, 12 Figures, 2 Tables, Accepted to Robotics and Automation
Letters (RA-L
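For readers unfamiliar with the metric, the "4% absolute improvement in F1-score" quoted above means a difference of 0.04 between two F1 values, not a 4% relative gain. The sketch below computes F1 as the harmonic mean of precision and recall. The detection counts are invented purely to produce a 0.04 gap and are not results from the paper.

```python
def f1_score(true_pos, false_pos, false_neg):
    """F1 = harmonic mean of precision and recall, from raw detection counts."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical counts, NOT from the paper: chosen only to show a 0.04 gap.
colour_only = f1_score(true_pos=80, false_pos=20, false_neg=20)
colour_depth = f1_score(true_pos=84, false_pos=16, false_neg=16)

absolute_gain = colour_depth - colour_only
relative_gain = absolute_gain / colour_only
print(round(colour_only, 2), round(colour_depth, 2), round(absolute_gain, 2))
```

In this invented case the absolute gain is 0.04 while the relative gain is 5%, which is why the two figures should not be conflated.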
Multi-Modal Human-Machine Communication for Instructing Robot Grasping Tasks
A major challenge for the realization of intelligent robots is to supply them
with cognitive abilities in order to allow ordinary users to program them
easily and intuitively. One way of such programming is teaching work tasks by
interactive demonstration. To make this effective and convenient for the user,
the machine must be capable of establishing a common focus of attention and be
able to use and integrate spoken instructions, visual perceptions, and
non-verbal clues like gestural commands. We report progress in building a
hybrid architecture that combines statistical methods, neural networks, and
finite state machines into an integrated system for instructing grasping tasks
by man-machine interaction. The system combines the GRAVIS-robot for visual
attention and gestural instruction with an intelligent interface for speech
recognition and linguistic interpretation, and a modality fusion module to
allow multi-modal task-oriented man-machine communication with respect to
dextrous robot manipulation of objects. Comment: 7 pages, 8 figure