253 research outputs found
A Generative Product-of-Filters Model of Audio
We propose the product-of-filters (PoF) model, a generative model that
decomposes audio spectra as sparse linear combinations of "filters" in the
log-spectral domain. PoF makes similar assumptions to those used in the classic
homomorphic filtering approach to signal processing, but replaces hand-designed
decompositions built of basic signal processing operations with a learned
decomposition based on statistical inference. This paper formulates the PoF
model and derives a mean-field method for posterior inference and a variational
EM algorithm to estimate the model's free parameters. We demonstrate PoF's
potential for audio processing on a bandwidth expansion task, and show that PoF
can serve as an effective unsupervised feature extractor for a speaker
identification task.Comment: ICLR 2014 conference-track submission. Added link to the source cod
Optimization of data-driven filterbank for automatic speaker verification
Most of the speech processing applications use triangular filters spaced in
mel-scale for feature extraction. In this paper, we propose a new data-driven
filter design method which optimizes filter parameters from a given speech
data. First, we introduce a frame-selection based approach for developing
speech-signal-based frequency warping scale. Then, we propose a new method for
computing the filter frequency responses by using principal component analysis
(PCA). The main advantage of the proposed method over the recently introduced
deep learning based methods is that it requires very limited amount of
unlabeled speech-data. We demonstrate that the proposed filterbank has more
speaker discriminative power than commonly used mel filterbank as well as
existing data-driven filterbank. We conduct automatic speaker verification
(ASV) experiments with different corpora using various classifier back-ends. We
show that the acoustic features created with proposed filterbank are better
than existing mel-frequency cepstral coefficients (MFCCs) and
speech-signal-based frequency cepstral coefficients (SFCCs) in most cases. In
the experiments with VoxCeleb1 and popular i-vector back-end, we observe 9.75%
relative improvement in equal error rate (EER) over MFCCs. Similarly, the
relative improvement is 4.43% with recently introduced x-vector system. We
obtain further improvement using fusion of the proposed method with standard
MFCC-based approach.Comment: Published in Digital Signal Processing journal (Elsevier
Differential encoding techniques applied to speech signals
The increasing use of digital communication systems has
produced a continuous search for efficient methods of speech
encoding.
This thesis describes investigations of novel differential
encoding systems. Initially Linear First Order DPCM systems
employing a simple delayed encoding algorithm are examined.
The systems detect an overload condition in the encoder, and
through a simple algorithm reduce the overload noise at the
expense of some increase in the quantization (granular) noise.
The signal-to-noise ratio (snr) performance of such d codec has
1 to 2 dB's advantage compared to the First Order Linear DPCM
system.
In order to obtain a large improvement in snr the high
correlation between successive pitch periods as well as the
correlation between successive samples in the voiced speech
waveform is exploited. A system called "Pitch Synchronous
First Order DPCM" (PSFOD) has been developed. Here the difference
Sequence formed between the samples of the input sequence in the
current pitch period and the samples of the stored decoded
sequence from the previous pitch period are encoded. This
difference sequence has a smaller dynamic range than the original
input speech sequence enabling a quantizer with better resolution
to be used for the same transmission bit rate. The snr is increased
by 6 dB compared with the peak snr of a First Order DPCM codea.
A development of the PSFOD system called a Pitch Synchronous
Differential Predictive Encoding system (PSDPE) is next investigated.
The principle of its operation is to predict the next sample in
the voiced-speech waveform, and form the prediction error which
is then subtracted from the corresponding decoded prediction
error in the previous pitch period. The difference is then
encoded and transmitted. The improvement in snr is approximately
8 dB compared to an ADPCM codea, when the PSDPE system uses an
adaptive PCM encoder. The snr of the system increases further
when the efficiency of the predictors used improve. However,
the performance of a predictor in any differential system is
closely related to the quantizer used. The better the quantization
the more information is available to the predictor and the better
the prediction of the incoming speech samples. This leads
automatically to the investigation in techniques of efficient
quantization. A novel adaptive quantization technique called
Dynamic Ratio quantizer (DRQ) is then considered and its theory
presented. The quantizer uses an adaptive non-linear element
which transforms the input samples of any amplitude to samples
within a defined amplitude range. A fixed uniform quantizer
quantizes the transformed signal. The snr for this quantizer
is almost constant over a range of input power limited in practice
by the dynamia range of the adaptive non-linear element, and it
is 2 to 3 dB's better than the snr of a One Word Memory adaptive
quantizer.
Digital computer simulation techniques have been used widely
in the above investigations and provide the necessary experimental
flexibility. Their use is described in the text
Recommended from our members
Automatic sound synthesizer programming: techniques and applications
The aim of this thesis is to investigate techniques for, and applications of automatic sound synthesizer programming. An automatic sound synthesizer programmer is a system which removes the requirement to explicitly specify parameter settings for a sound synthesis algorithm from the user. Two forms of these systems are discussed in this thesis:
tone matching programmers and synthesis space explorers. A tone matching programmer takes at its input a sound synthesis algorithm and a desired target sound. At its output it produces a configuration for the sound synthesis algorithm which causes it to emit a
similar sound to the target. The techniques for achieving this that are investigated are
genetic algorithms, neural networks, hill climbers and data driven approaches. A synthesis
space explorer provides a user with a representation of the space of possible sounds
that a synthesizer can produce and allows them to interactively explore this space. The
applications of automatic sound synthesizer programming that are investigated include
studio tools, an autonomous musical agent and a self-reprogramming drum machine. The
research employs several methodologies: the development of novel software frameworks
and tools, the examination of existing software at the source code and performance levels
and user trials of the tools and software. The main contributions made are: a method
for visualisation of sound synthesis space and low dimensional control of sound synthesizers; a general purpose framework for the deployment and testing of sound synthesis and optimisation algorithms in the SuperCollider language sclang; a comparison of a variety of optimisation techniques for sound synthesizer programming; an analysis of sound synthesizer error surfaces; a general purpose sound synthesizer programmer compatible with industry standard tools; an automatic improviser which passes a loose equivalent of the Turing test for Jazz musicians, i.e. being half of a man-machine duet which was rated as one of the best sessions of 2009 on the BBC's 'Jazz on 3' programme
The development of speech coding and the first standard coder for public mobile telephony
This thesis describes in its core chapter (Chapter 4) the original algorithmic and design features of the ??rst coder for public mobile telephony, the GSM full-rate speech coder, as standardized in 1988. It has never been described in so much detail as presented here. The coder is put in a historical perspective by two preceding chapters on the history of speech production models and the development of speech coding techniques until the mid 1980s, respectively. In the epilogue a brief review is given of later developments in speech coding. The introductory Chapter 1 starts with some preliminaries. It is de- ??ned what speech coding is and the reader is introduced to speech coding standards and the standardization institutes which set them. Then, the attributes of a speech coder playing a role in standardization are explained. Subsequently, several applications of speech coders - including mobile telephony - will be discussed and the state of the art in speech coding will be illustrated on the basis of some worldwide recognized standards. Chapter 2 starts with a summary of the features of speech signals and their source, the human speech organ. Then, historical models of speech production which form the basis of di??erent kinds of modern speech coders are discussed. Starting with a review of ancient mechanical models, we will arrive at the electrical source-??lter model of the 1930s. Subsequently, the acoustic-tube models as they arose in the 1950s and 1960s are discussed. Finally the 1970s are reviewed which brought the discrete-time ??lter model on the basis of linear prediction. In a unique way the logical sequencing of these models is exposed, and the links are discussed. Whereas the historical models are discussed in a narrative style, the acoustic tube models and the linear prediction tech nique as applied to speech, are subject to more mathematical analysis in order to create a sound basis for the treatise of Chapter 4. This trend continues in Chapter 3, whenever instrumental in completing that basis. In Chapter 3 the reader is taken by the hand on a guided tour through time during which successive speech coding methods pass in review. In an original way special attention is paid to the evolutionary aspect. Speci??cally, for each newly proposed method it is discussed what it added to the known techniques of the time. After presenting the relevant predecessors starting with Pulse Code Modulation (PCM) and the early vocoders of the 1930s, we will arrive at Residual-Excited Linear Predictive (RELP) coders, Analysis-by-Synthesis systems and Regular- Pulse Excitation in 1984. The latter forms the basis of the GSM full-rate coder. In Chapter 4, which constitutes the core of this thesis, explicit forms of Multi-Pulse Excited (MPE) and Regular-Pulse Excited (RPE) analysis-by-synthesis coding systems are developed. Starting from current pulse-amplitude computation methods in 1984, which included solving sets of equations (typically of order 10-16) two hundred times a second, several explicit-form designs are considered by which solving sets of equations in real time is avoided. Then, the design of a speci??c explicitform RPE coder and an associated eÆcient architecture are described. The explicit forms and the resulting architectural features have never been published in so much detail as presented here. Implementation of such a codec enabled real-time operation on a state-of-the-art singlechip digital signal processor of the time. This coder, at a bit rate of 13 kbit/s, has been selected as the Full-Rate GSM standard in 1988. Its performance is recapitulated. Chapter 5 is an epilogue brie y reviewing the major developments in speech coding technology after 1988. Many speech coding standards have been set, for mobile telephony as well as for other applications, since then. The chapter is concluded by an outlook
Object-based audio for interactive football broadcast
An end-to-end AV broadcast system providing an immersive, interactive experience for live events is the development aim for the EU FP7 funded project, FascinatE. The project has developed real time audio object event detection and localisation, scene modelling and processing methods for multimedia data including 3D audio, which will allow users to navigate the event by creating their own unique user-defined scene. As part of the first implementation of the system a test shoot was carried out capturing a live Premier League football game and methods have been developed to detect, analyse, extract and localise salient audio events from a range of sensors and represent them within an audio scene in order to allow free navigation within the scene
- …