The Sound Manifesto
Computing practice today depends on visual output to drive almost all user
interaction. Other senses, such as audition, may be totally neglected, or used
tangentially, or used in highly restricted specialized ways. We have excellent
audio rendering through D-A conversion, but we lack rich general facilities for
modeling and manipulating sound comparable in quality and flexibility to
graphics. We need co-ordinated research in several disciplines to improve the
use of sound as an interactive information channel.
Incremental and separate improvements in synthesis, analysis, speech
processing, audiology, acoustics, music, etc. will not alone produce the
radical progress that we seek in sonic practice. We also need to create a new
central topic of study in digital audio research. The new topic will assimilate
the contributions of different disciplines on a common foundation. The key
central concept that we lack is sound as a general-purpose information channel.
We must investigate the structure of this information channel, which is driven
by the co-operative development of auditory perception and physical sound
production. Particular audible encodings, such as speech and music, illuminate
sonic information by example, but they are no more sufficient for a
characterization than typography is sufficient for a characterization of visual
information.
Comment: To appear in the conference on Critical Technologies for the Future of Computing, part of SPIE's International Symposium on Optical Science and Technology, 30 July to 4 August 2000, San Diego, CA.
Investigating the Design Space of Diffusion Models for Speech Enhancement
Diffusion models are a new class of generative models that have shown
outstanding performance in image generation literature. As a consequence,
studies have attempted to apply diffusion models to other tasks, such as speech
enhancement. A popular approach to adapting diffusion models to speech
enhancement consists in modelling a progressive transformation between the
clean and noisy speech signals. However, one popular diffusion model framework
previously proposed in the image generation literature does not account for
such a transformation towards the system input, which prevents the existing
diffusion-based speech enhancement systems from being related to that
framework. To address this, we extend the framework to account for the
progressive transformation between the clean and noisy speech signals.
This allows us to apply recent developments from image generation literature,
and to systematically investigate design aspects of diffusion models that
remain largely unexplored for speech enhancement, such as the neural network
preconditioning, the training loss weighting, the stochastic differential
equation (SDE), or the amount of stochasticity injected in the reverse process.
We show that the performance of previous diffusion-based speech enhancement
systems cannot be attributed to the progressive transformation between the
clean and noisy speech signals. Moreover, we show that a proper choice of
preconditioning, training loss weighting, SDE and sampler makes it possible to
outperform a popular diffusion-based speech enhancement system in terms of
perceptual metrics while using fewer sampling steps, thus reducing the
computational cost by a factor of four.
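As a rough illustration of the progressive transformation discussed above, the sketch below simulates a forward SDE whose mean drifts from a clean signal toward its noisy counterpart while Gaussian noise is injected. The Ornstein-Uhlenbeck-style drift, the exponential noise schedule, and all coefficients are illustrative assumptions, not the exact parameterization investigated in the paper.

```python
import numpy as np

def forward_sde(x0, y, n_steps=1000, gamma=1.5,
                sigma_min=0.05, sigma_max=0.5, rng=None):
    """Euler-Maruyama simulation of a forward SDE that pulls the state
    from the clean signal x0 toward the noisy signal y while adding noise."""
    rng = np.random.default_rng() if rng is None else rng
    dt = 1.0 / n_steps
    x = x0.copy()
    for k in range(n_steps):
        t = k * dt
        # diffusion coefficient with an exponential schedule (assumed form)
        g = sigma_min * (sigma_max / sigma_min) ** t \
            * np.sqrt(2.0 * np.log(sigma_max / sigma_min))
        drift = gamma * (y - x)  # mean reverts toward the noisy signal
        x = x + drift * dt + g * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x  # approximately: noisy speech plus residual Gaussian noise

# Usage with synthetic signals standing in for clean/noisy speech.
t_axis = np.linspace(0, 1, 16000)
x0 = np.sin(2 * np.pi * 440 * t_axis)
y = x0 + 0.1 * np.random.default_rng(0).standard_normal(16000)
xT = forward_sde(x0, y)
```

Reverse-time sampling would then start from such a state and integrate a learned score model back toward the clean signal.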
Self-Supervised Vision-Based Detection of the Active Speaker as Support for Socially-Aware Language Acquisition
This paper presents a self-supervised method for visual detection of the
active speaker in a multi-person spoken interaction scenario. Active speaker
detection is a fundamental prerequisite for any artificial cognitive system
attempting to acquire language in social settings. The proposed method is
intended to complement the acoustic detection of the active speaker, thus
improving the system robustness in noisy conditions. The method can detect an
arbitrary number of possibly overlapping active speakers based exclusively on
visual information about their faces. Furthermore, the method does not rely on
external annotations, consistent with the constraints of cognitive development.
Instead, the method uses information from the auditory modality to support
learning in the
visual domain. This paper reports an extensive evaluation of the proposed
method using a large multi-person face-to-face interaction dataset. The results
show good performance in a speaker-dependent setting. However, in a
speaker-independent setting the proposed method yields significantly lower
performance. We believe that the proposed method represents an essential
component of any artificial cognitive system or robotic platform engaging in
social interactions.
Comment: 10 pages, IEEE Transactions on Cognitive and Developmental Systems
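A minimal sketch of the cross-modal supervision idea: pseudo-labels derived from the audio channel (e.g., voice activity aligned with a face track) train a visual speaking/not-speaking classifier. The network shape, feature dimension, and labelling scheme below are hypothetical placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VisualSpeakerNet(nn.Module):
    """Per-face speaking/not-speaking classifier on precomputed visual features."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, face_feats):                 # (batch, faces, feat_dim)
        return self.head(face_feats).squeeze(-1)   # (batch, faces) logits

model = VisualSpeakerNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

# Random tensors stand in for real data: visual features per detected face,
# and 0/1 pseudo-labels obtained from acoustic voice-activity detection.
face_feats = torch.randn(8, 3, 512)
vad_labels = torch.randint(0, 2, (8, 3)).float()

opt.zero_grad()
loss = bce(model(face_feats), vad_labels)  # the audio modality supervises vision
loss.backward()
opt.step()
```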
Accelerating recurrent neural network training using sequence bucketing and multi-GPU data parallelization
An efficient algorithm for recurrent neural network training is presented.
The approach increases training speed for tasks where the length of the input
sequence may vary significantly. The proposed approach is based on optimal
batch bucketing by input sequence length and data parallelization across
multiple graphics processing units. The baseline training performance without
sequence bucketing is compared with the proposed solution for different
numbers of
buckets. An example is given for the online handwriting recognition task using
an LSTM recurrent neural network. The evaluation is performed in terms of the
wall-clock time, number of epochs, and validation loss value.
Comment: 4 pages, 5 figures. 2016 IEEE First International Conference on Data Stream Mining & Processing (DSMP), Lviv, 2016.
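A minimal sketch of the bucketing idea, assuming integer-token sequences and zero-padding; the batch size and shuffle policy are illustrative choices rather than the paper's exact algorithm.

```python
import random

def bucket_batches(sequences, batch_size=32):
    """Group sequences of similar length so each batch pads only to its own max."""
    order = sorted(range(len(sequences)), key=lambda i: len(sequences[i]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    random.shuffle(batches)  # keep the training order stochastic across batches
    for batch in batches:
        max_len = max(len(sequences[i]) for i in batch)
        # zero-pad within the batch only, minimizing wasted RNN computation
        yield [sequences[i] + [0] * (max_len - len(sequences[i])) for i in batch]

# Usage with variable-length dummy sequences.
data = [[1] * random.randint(5, 500) for _ in range(1000)]
for padded_batch in bucket_batches(data):
    pass  # feed padded_batch to the LSTM; each GPU could take a slice of it
```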
On Macroscopic Complexity and Perceptual Coding
The theoretical limits of 'lossy' data compression algorithms are considered.
The complexity of an object as seen by a macroscopic observer is the size of
the perceptual code which discards all information that can be lost without
altering the perception of the specified observer. The complexity of this
macroscopically observed state is the simplest description of any microstate
comprising that macrostate. Inference and pattern recognition based on
macrostate rather than microstate complexities will take advantage of the
complexity of the macroscopic observer to ignore irrelevant noise.
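A toy numerical illustration of this idea, using coarse quantization as a stand-in for perceptual equivalence and zlib output size as a proxy for description length; both are loose assumptions, not the paper's formalism.

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
signal = np.cumsum(rng.standard_normal(10000))  # one particular "microstate"

def perceptual_code_size(x, tolerance):
    # All microstates within +/- tolerance/2 collapse to the same macrostate.
    macrostate = np.round(x / tolerance).astype(np.int32)
    return len(zlib.compress(macrostate.tobytes()))

for tol in (0.01, 0.1, 1.0, 10.0):
    print(f"tolerance={tol}: {perceptual_code_size(signal, tol)} bytes")
# Coarser observers (larger tolerance) perceive a simpler macrostate,
# so the size of the perceptual code shrinks.
```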
The evolution of auditory contrast
This paper reconciles the standpoint that language users do not aim at improving their sound systems with the observation that languages seem to improve their sound systems. Computer simulations of inventories of sibilants show that Optimality-Theoretic learners who optimize their perception grammars automatically introduce a so-called prototype effect, i.e. the phenomenon that the learner’s preferred auditory realization of a certain phonological category is more peripheral than the average auditory realization of this category in her language environment. In production, however, this prototype effect is counteracted by an articulatory effect that limits the auditory form to something that is not too difficult to pronounce. If the prototype effect and the articulatory effect differ in size, the learner must end up with an auditorily different sound system from that of her language environment. The computer simulations show that, independently of the initial auditory sound system, a stable equilibrium is reached within a small number of generations. In this stable state, the dispersion of the sibilants of the language strikes an optimal balance between articulatory ease and auditory contrast. The important point is that this is derived within a model without any goal-oriented elements such as dispersion constraints.
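A toy iterated-learning sketch of this dynamic, assuming a one-dimensional auditory axis, a constant peripheral prototype push, and a linear articulatory pull toward a rest position. All coefficients are illustrative placeholders, not the paper's Optimality-Theoretic perception grammars.

```python
import numpy as np

rng = np.random.default_rng(1)
means = np.array([4.0, 6.0])  # initial auditory means of two sibilant categories
REST, PROTO_STEP, ARTIC_PULL = 5.0, 0.3, 0.2

for generation in range(40):
    # Each learner hears noisy tokens from the previous generation.
    tokens = means[:, None] + 0.3 * rng.standard_normal((2, 500))
    heard = tokens.mean(axis=1)
    # Prototype effect: preferred realizations are more peripheral than the means.
    prototypes = heard + PROTO_STEP * np.sign(heard - heard.mean())
    # Articulatory effect: production is pulled toward an easy rest position.
    means = prototypes + ARTIC_PULL * (REST - prototypes)

print(means)  # settles at a stable dispersion regardless of the starting system
```

With these (assumed) settings the two categories settle roughly 2.4 units apart, where the peripheral push and the articulatory pull cancel, echoing the stable equilibrium the simulations report.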