Frequency-warped autoregressive modeling and filtering
This thesis consists of an introduction and nine articles. The articles are related to the application of frequency-warping techniques to audio signal processing, and in particular, predictive coding of wideband audio signals. The introduction reviews the literature and summarizes the results of the articles.
Frequency-warping techniques, or simply warping techniques, are based on modifying a conventional signal processing system so that the inherent frequency representation in the system is changed. It is demonstrated that this may be done for basically all traditional signal processing algorithms. In audio applications it is beneficial to modify the system so that the new frequency representation is close to that of human hearing. One of the articles is a tutorial paper on the use of warping techniques in audio applications.
The majority of the articles study warped linear prediction (WLP) and its use in wideband audio coding. It is proposed that warped linear prediction would be a particularly attractive method for low-delay wideband audio coding. Warping techniques are also applied to various modifications of classical linear predictive coding techniques. This was made possible partly by the introduction, in one of the articles, of a class of new implementation techniques for recursive filters. The proposed implementation algorithm for recursive filters having delay-free loops is a generic technique. This inspired an article that introduces a generalized warped linear predictive coding scheme. One example of the generalized approach is a linear predictive algorithm using an almost logarithmic frequency representation.
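As an illustration of what "changing the frequency representation" can mean, the sketch below assumes the realisation commonly described in the warping literature, in which each unit delay is replaced by a first-order all-pass section with coefficient lam. The abstract itself does not state this formula; the function name and the value 0.7 used in the usage note are illustrative assumptions only.

```python
import math

def warp_frequency(omega, lam):
    """Map a normalised angular frequency omega (rad/sample, 0..pi) to the
    warped frequency produced by a first-order all-pass section.

    Assumption: the unit delay z^-1 is replaced by
    D(z) = (z^-1 - lam) / (1 - lam * z^-1), whose phase response is
    -(omega + 2 * atan(lam * sin(omega) / (1 - lam * cos(omega)))).
    """
    return omega + 2.0 * math.atan2(lam * math.sin(omega),
                                    1.0 - lam * math.cos(omega))

if __name__ == "__main__":
    # Print how a few frequencies move for an illustrative lam = 0.7.
    for k in range(5):
        w = k * math.pi / 4
        print(round(w, 3), "->", round(warp_frequency(w, 0.7), 3))
```

With a positive lam the mapping leaves 0 and pi fixed and stretches the low-frequency region, which is the qualitative behaviour wanted when approximating an auditory frequency scale; lam = 0 gives back the ordinary linear frequency axis.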
Coding noise shaping in the ITU-T G.722.1 standard
The project described in this thesis deals with coding noise shaping in the ITU-T G.722.1 standard. The study has two parts. The first incorporates three noise-shaping techniques into G.722.1, namely frequency warping, adaptive windowing (window switching), and temporal noise shaping, in order to study the effect of each modification on the coding quality at 16 kbit/s of signals sampled at 16 kHz. The second part replaces scalar quantization and Huffman coding with algebraic vector quantization. A spherical quantization based on the 8-dimensional Gosset lattice, E_8, is used for this purpose to quantize the spectral coefficients generated in G.722.1. This application attempts to code wideband audio signals (50 Hz-7 kHz) at a bit rate of 16 kbit/s.
Digital audio coding based on perceptual backward adaptation
Doctorate in Electronic Engineering. The problem of digital coding of high-quality audio signals is analysed, and the principle of perceptual coding is identified as the most satisfactory approach. We present a synthesis of the perceptual coding systems found in the literature, and we identify, compare and relate the techniques used in each one. Given their relevance for audio coding, transforms and multifrequency filter banks, as well as quantization, lossless coding, and mathematical models of auditory perception, are studied in more depth. We propose a coding system consisting of a multirate filter bank, adaptive logarithmic quantizers, arithmetic entropy coding, and an explicit psychoacoustic model that adapts the quantizers according to perceptual criteria. Unlike other perceptual coders, the proposed system is backward-adaptive, that is, adaptation depends exclusively on already quantized samples, not on the original signal. We discuss the advantages of backward adaptation and show that it can be successfully applied to perceptual coding.
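The backward-adaptive principle can be illustrated with a toy quantizer whose step size is updated only from codes that have already been emitted, so the decoder can repeat the same updates without any side information. This is a minimal sketch under stated assumptions: the simple grow/shrink rule stands in for the psychoacoustic model of the thesis, and all names and constants (adapt, max_code, grow, shrink) are hypothetical.

```python
def adapt(step, code, max_code, grow=1.6, shrink=0.9):
    """Update the step size from the previous code only (hypothetical rule):
    expand after an outermost level, otherwise contract slightly."""
    factor = grow if abs(code) == max_code else shrink
    return min(max(step * factor, 1e-6), 1e6)

def encode(samples, max_code=3, step=0.1):
    """Quantize samples; return the codes and the encoder's local reconstruction."""
    codes, recon = [], []
    for x in samples:
        code = max(-max_code, min(max_code, round(x / step)))
        codes.append(code)
        recon.append(code * step)
        # Adaptation uses the quantized code, never the original sample x.
        step = adapt(step, code, max_code)
    return codes, recon

def decode(codes, max_code=3, step=0.1):
    """Rebuild the signal from the codes alone, tracking the same step sizes."""
    out = []
    for code in codes:
        out.append(code * step)
        step = adapt(step, code, max_code)
    return out

if __name__ == "__main__":
    signal = [((n * 7) % 11 - 5) / 5.0 for n in range(20)]
    codes, local = encode(signal)
    print(decode(codes) == local)
```

Because encoder and decoder apply identical updates to identical inputs, the decoder output matches the encoder's local reconstruction exactly; a forward-adaptive scheme would instead have to transmit the step sizes.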
Wavelet Filter Banks in Perceptual Audio Coding
This thesis studies the application of the wavelet filter bank (WFB) in perceptual audio coding by providing brief overviews of perceptual coding, psychoacoustics, wavelet theory, and existing wavelet coding algorithms. Furthermore, it describes the poor frequency localization property of the WFB and explores one filter design method, in particular, for improving channel separation between the wavelet bands. A wavelet audio coder has also been developed by the author to test the new filters. Preliminary tests indicate that the new filters provide some improvement over other wavelet filters when coding audio signals that are stationary-like and contain only a few harmonic components, and similar results for other types of audio signals that contain many spectral and temporal components.
It has been found that the WFB provides a flexible decomposition scheme through the choice of the tree structure and basis filter, but at the cost of poor localization properties. This flexibility can be a benefit in the context of audio coding but the poor localization properties represent a drawback. Determining ways to fully utilize this flexibility, while minimizing the effects of poor time-frequency localization, is an area that is still very much open for research
Speaker normalisation for large vocabulary multiparty conversational speech recognition
One of the main problems faced by automatic speech recognition is the variability of
the testing conditions. This is due both to the acoustic conditions (different transmission
channels, recording devices, noises etc.) and to the variability of speech
across different speakers (i.e. due to different accents, coarticulation of phonemes
and different vocal tract characteristics). Vocal tract length normalisation (VTLN)
aims at normalising the acoustic signal, making it independent from the vocal tract
length. This is done by a speaker specific warping of the frequency axis parameterised
through a warping factor. In this thesis the application of VTLN to multiparty
conversational speech was investigated focusing on the meeting domain. This
is a challenging task showing a great variability of the speech acoustics both across
different speakers and across time for a given speaker. VTL, the distance between
the lips and the glottis, varies over time. We observed that the warping factors estimated
using Maximum Likelihood seem to be context dependent: appearing to be
influenced by the current conversational partner and being correlated with the behaviour
of formant positions and the pitch. This is because VTL also influences the
frequency of vibration of the vocal cords and thus the pitch. In this thesis we also
investigated pitch-adaptive acoustic features with the goal of further improving the
speaker normalisation provided by VTLN.
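The speaker-specific warping of the frequency axis can be sketched as follows. The abstract does not say which warping function was used; the piecewise-linear form below is one widely used choice and is an assumption here, as are the function name, the cut_ratio parameter, and the convention that alpha multiplies the frequency.

```python
def vtln_warp(freq_hz, alpha, f_max_hz, cut_ratio=0.85):
    """Piecewise-linear frequency warp controlled by a warping factor alpha.

    Below the cut-off the axis is scaled by alpha; above it a second linear
    segment is used so that f_max_hz maps onto itself and the bandwidth is
    preserved. Assumes alpha * cut_ratio < 1 so the mapping stays increasing.
    """
    f_cut = cut_ratio * f_max_hz
    if freq_hz <= f_cut:
        return alpha * freq_hz
    slope = (f_max_hz - alpha * f_cut) / (f_max_hz - f_cut)
    return alpha * f_cut + slope * (freq_hz - f_cut)

if __name__ == "__main__":
    # Illustrative values only: an 8 kHz bandwidth and a 10 % warp.
    for f in (500.0, 2000.0, 6800.0, 8000.0):
        print(f, "->", round(vtln_warp(f, 1.1, 8000.0), 1))
```

In a maximum-likelihood setup such as the one mentioned in the abstract, a function like this would typically be evaluated over a small grid of candidate factors per speaker, keeping the factor whose warped features score best under the acoustic model.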
We explored the use of acoustic features obtained using a pitch-adaptive analysis
in combination with conventional features such as Mel frequency cepstral coefficients.
These spectral representations were combined both at the acoustic feature
level using heteroscedastic linear discriminant analysis (HLDA), and at the system
level using ROVER. We evaluated this approach on a challenging large vocabulary
speech recognition task: multiparty meeting transcription. We found that VTLN
benefits the most from pitch-adaptive features. Our experiments also suggested that
combining conventional and pitch-adaptive acoustic features using HLDA results in
a consistent, significant decrease in the word error rate across all the tasks. Combining
at the system level using ROVER resulted in a further significant improvement.
Further experiments compared the use of pitch adaptive spectral representation with
the adoption of a smoothed spectrogram for the extraction of cepstral coefficients.
It was found that pitch adaptive spectral analysis, providing a representation which
is less affected by pitch artefacts (especially for high pitched speakers), delivers
features with an improved speaker independence. Furthermore this has also been shown to
be advantageous when HLDA is applied. The combination of a pitch adaptive spectral
representation and VTLN based speaker normalisation in the context of LVCSR
for multiparty conversational speech led to more speaker independent acoustic models
improving the overall recognition performance.
Recent Advances in Signal Processing
The signal processing task is a very critical issue in the majority of new technological inventions and challenges in a variety of applications in both science and engineering fields. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary, and Gaussian. They have always favored closed-form tractability over real-world accuracy. These constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics, and engineering. This book is targeted primarily toward both students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be categorized into five different areas depending on the application at hand. These five categories are ordered to address image processing, speech processing, communication systems, time-series analysis, and educational packages respectively. The book has the advantage of providing a collection of applications that are completely independent and self-contained; thus, the interested reader can choose any chapter and skip to another without losing continuity
Speech Recognition
Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes
Robust speech recognition with spectrogram factorisation
Communication by speech is intrinsic to humans. Since the breakthrough of mobile devices and wireless communication, digital transmission of speech has become ubiquitous. Similarly, the distribution and storage of audio and video data have increased rapidly. However, although they are technically capable of recording and processing audio signals, only a fraction of digital systems and services can actually work with spoken input, that is, operate on the lexical content of speech. One persistent obstacle to the practical deployment of automatic speech recognition systems is inadequate robustness against noise and other interference, which regularly corrupts signals recorded in real-world environments.
Speech and diverse noises are both complex signals, which are not trivially separable. Despite decades of research and a multitude of different approaches, the problem has not been solved to a sufficient extent. Especially the mathematically ill-posed problem of separating multiple sources from a single-channel input requires advanced models and algorithms to be solvable. One promising path is using a composite model of long-context atoms to represent a mixture of non-stationary sources based on their spectro-temporal behaviour. Algorithms derived from the family of non-negative matrix factorisations have been applied to such problems to separate and recognise individual sources like speech.
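For readers unfamiliar with non-negative matrix factorisation, the sketch below shows the basic idea on a tiny matrix: a non-negative "spectrogram" V is approximated as W times H with both factors kept non-negative. It uses the standard multiplicative updates for the Euclidean cost, which is an assumption made for illustration; the thesis may use other cost functions and long-context atoms that this toy does not model. All function names are hypothetical.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def frob_error(v, w, h):
    """Squared Euclidean distance between v and the product w * h."""
    wh = matmul(w, h)
    return sum((x - y) ** 2 for rv, rw in zip(v, wh) for x, y in zip(rv, rw))

def nmf(v, rank, iterations=200, seed=0, eps=1e-12):
    """Factorise non-negative v (n x m) into w (n x rank) and h (rank x m)
    using multiplicative updates, which keep every entry non-negative."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h

if __name__ == "__main__":
    # A rank-1 toy "spectrogram": one spectral shape with varying gain.
    spectrum, gains = [1.0, 2.0, 3.0], [1.0, 0.5, 2.0, 4.0]
    v = [[s * g for g in gains] for s in spectrum]
    w, h = nmf(v, rank=1)
    print(frob_error(v, w, h))
```

In a separation setting the columns of W play the role of spectral atoms for speech and noise and the rows of H their activations over time; a source estimate is then obtained from the atoms assigned to that source.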
This thesis describes a set of tools developed for non-negative modelling of audio spectrograms, especially those involving speech and real-world noise sources. An overview of the complete framework is provided, starting from model and feature definitions, advancing to factorisation algorithms, and finally describing different routes for separation, enhancement, and recognition tasks. Current issues and their potential solutions are discussed both theoretically and from a practical point of view. The included publications describe factorisation-based recognition systems, which have been evaluated on publicly available speech corpora in order to determine the efficiency of various separation and recognition algorithms. Several variants and system combinations that have been proposed in the literature are also discussed. The work covers a broad span of factorisation-based system components, which together aim at providing a practically viable solution to robust processing and recognition of speech in everyday situations.