5 research outputs found
Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds
Humans can robustly recognize and localize objects by integrating visual and
auditory cues. While machines are able to do the same now with images, less
work has been done with sounds. This work develops an approach for dense
semantic labelling of sound-making objects, purely based on binaural sounds. We
propose a novel sensor setup and record a new audio-visual dataset of street
scenes with eight professional binaural microphones and a 360 degree camera.
The co-existence of visual and audio cues is leveraged for supervision
transfer. In particular, we employ a cross-modal distillation framework that
consists of a vision `teacher' method and a sound `student' method -- the
student method is trained to generate the same results as the teacher method.
This way, the auditory system can be trained without using human annotations.
We also propose two auxiliary tasks namely, a) a novel task on Spatial Sound
Super-resolution to increase the spatial resolution of sounds, and b) dense
depth prediction of the scene. We then formulate the three tasks into one
end-to-end trainable multi-tasking network aiming to boost the overall
performance. Experimental results on the dataset show that 1) our method
achieves promising results for semantic prediction and the two auxiliary tasks;
and 2) the three tasks are mutually beneficial -- training them together
achieves the best performance and 3) the number and orientations of microphones
are both important. The data and code will be released to facilitate the
research in this new direction.Comment: Project page:
https://www.trace.ethz.ch/publications/2020/sound_perception/index.htm
Evaluation of depth-camera-systems for usage in semi-controlled assembly environments
With the availability of affordable depth-camera-systems like the Microsoft Kinect, Depth Imaging has seen a fast-growing number of applications in many different fields over the last years. Such systems can however be based on different measurement principles with widely differing parameters and hence are difficult to evaluate against a single benchmark. While accuracy and precision of depth-camera-systems inherently vary significantly with measuring distance and changing environments, and therefore impose heavy constraints on real world applications, they even allow for automated quality assurance in controlled environments. Context aware assistive systems in manual assembly environments push these boundaries by employing quality assurance in more open environments, where distracting influences by the worker or the work-space environment cannot be ruled out. The thesis concerns itself with the exploration and evaluation of different depth measuring approaches (e.g. Time of Flight, Structured Light, Stereo Vision) for usage in semi-controlled assembly environments. The still underexplored effects of material properties on measurements are experimentally evaluated and the resulting limitations of each approach for usage in assembly environments are discussed
Context-driven Object Detection and Segmentation with Auxiliary Information
One fundamental problem in computer vision and robotics is to
localize objects of interest in an image. The task can either be
formulated as an object detection problem if the objects are
described by a set of pose parameters, or an object segmentation
one if we recover object boundary precisely. A key issue in
object detection and segmentation concerns exploiting the spatial
context, as local evidence is often insufficient to determine
object pose in the presence of heavy occlusions or large object
appearance variations. This thesis addresses the object detection
and segmentation problem in such adverse conditions with
auxiliary depth data provided by RGBD cameras. We focus on four
main issues in context-aware object detection and segmentation:
1) what are the effective context representations? 2) how can we
work with limited and imperfect depth data? 3) how to design
depth-aware features and integrate depth cues into conventional
visual inference tasks? 4) how to make use of unlabeled data to
relax the labeling requirements for training data?
We discuss three object detection and segmentation scenarios
based on varying amounts of available auxiliary information. In
the first case, depth data are available for model training but
not available for testing. We propose a structured Hough voting
method for detecting objects with heavy occlusion in indoor
environments, in which we extend the Hough hypothesis space to
include both the object's location, and its visibility pattern.
We design a new score function that accumulates votes for object
detection and occlusion prediction. In addition, we explore the
correlation between objects and their environment, building a
depth-encoded object-context model based on RGBD data. In the
second case, we address the problem of localizing glass objects
with noisy and incomplete depth data. Our method integrates the
intensity and depth information from a single view point, and
builds a Markov Random Field that predicts glass boundary and
region jointly. In addition, we propose a nonparametric,
data-driven label transfer scheme for local glass boundary
estimation. A weighted voting scheme based on a joint feature
manifold is adopted to integrate depth and appearance cues, and
we learn a distance metric on the depth-encoded feature manifold.
In the third case, we make use of unlabeled data to relax the
annotation requirements for object detection and segmentation,
and propose a novel data-dependent margin distribution learning
criterion for boosting, which utilizes the intrinsic geometric
structure of datasets. One key aspect of this method is that it
can seamlessly incorporate unlabeled data by including a graph
Laplacian regularizer. We demonstrate the performance of our
models and compare with baseline methods on several real-world
object detection and segmentation tasks, including indoor object
detection, glass object segmentation and foreground segmentation
in video