20 research outputs found

Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

Author: A Geiger
A Owens
BC Russell
C Rascon
D Li
F Antonacci
H Wallach
H Zhao
I Dokmanic
J Delmerico
J Tiete
KI McAnally
L-C Chen
LC Chen
LD Rosenblum
R Fendrich
R Gao
S Argentieri
S Hecker
U Klee
W Huang
WR Thurlow
WW Gaver
Y Tian
Publication date: 09/03/2020

Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines can now do the same with images, less work has been done with sounds. This work develops an approach for dense semantic labelling of sound-making objects, based purely on binaural sounds. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360-degree camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of a vision 'teacher' method and a sound 'student' method; the student method is trained to generate the same results as the teacher method. This way, the auditory system can be trained without using human annotations. We also propose two auxiliary tasks, namely a) a novel task on Spatial Sound Super-resolution to increase the spatial resolution of sounds, and b) dense depth prediction of the scene. We then formulate the three tasks into one end-to-end trainable multi-tasking network aiming to boost the overall performance. Experimental results on the dataset show that 1) our method achieves promising results for semantic prediction and the two auxiliary tasks; 2) the three tasks are mutually beneficial, with joint training achieving the best performance; and 3) the number and orientations of microphones are both important. The data and code will be released to facilitate the research in this new direction.

Comment: Project page: https://www.trace.ethz.ch/publications/2020/sound_perception/index.htm

arXiv.org e-Print Archive

Repository for Publications and Research Data

Crossref

Convolution on the $n$-Sphere With Application to PDF Modeling

Author: D. Petrinovic
I. Dokmanic
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'

Crossref

Acoustic echoes reveal room shape

Author: A. Walther
I. Dokmanic
M. Vetterli
R. Parhizkar
Y. M. Lu
Publication venue: 'Proceedings of the National Academy of Sciences'

Crossref

Shape object selection using the chi-square method

Author: Dokmanic I
F Kurnia
Fitrilina
Hall M A
Mihu I Z
R Kurnia
Rastogi R
Sural S
Wang L
Publication venue: 'IOP Publishing'

Crossref

SSLB: Self-Similarity-Based Load Balancing for Large-Scale Fog Computing

Author: AV Dastjerdi
Flavio Bonomi
H Hawilo
H Kim
H Rivaz
I Dokmanic
M Mitzenmacher
M Verma
P Demestichas
Q Ye
RN Calheiros
S Ningning
Salim Bitam
V Stantchev
W Shi
Z Inayat
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds

Author: A Geiger
A Owens
BC Russell
C Rascon
D Li
F Antonacci
H Wallach
H Zhao
I Dokmanic
J Delmerico
J Tiete
KI McAnally
L-C Chen
LC Chen
LD Rosenblum
R Fendrich
R Gao
S Argentieri
S Hecker
U Klee
W Huang
WR Thurlow
WW Gaver
Y Tian
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/11/2020

Humans can robustly recognize and localize objects by integrating visual and auditory cues. While machines can now do the same with images, less work has been done with sounds. This work develops an approach for dense semantic labelling of sound-making objects, based purely on binaural sounds. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight professional binaural microphones and a 360-degree camera. The co-existence of visual and audio cues is leveraged for supervision transfer. In particular, we employ a cross-modal distillation framework that consists of a vision 'teacher' method and a sound 'student' method; the student method is trained to generate the same results as the teacher method. This way, the auditory system can be trained without using human annotations. We also propose two auxiliary tasks, namely a) a novel task on Spatial Sound Super-resolution to increase the spatial resolution of sounds, and b) dense depth prediction of the scene. We then formulate the three tasks into one end-to-end trainable multi-tasking network aiming to boost the overall performance. Experimental results on the dataset show that 1) our method achieves good results for all three tasks; 2) the three tasks are mutually beneficial, with joint training achieving the best performance; and 3) the number and orientations of microphones are both important.

ISSN: 0302-9743
ISSN: 1611-334

Repository for Publications and Research Data

Crossref