134 research outputs found

    Frequency-warped autoregressive modeling and filtering

    Get PDF
    This thesis consists of an introduction and nine articles. The articles are related to the application of frequency-warping techniques to audio signal processing, and in particular, predictive coding of wideband audio signals. The introduction reviews the literature and summarizes the results of the articles. Frequency-warping, or simply warping techniques are based on a modification of a conventional signal processing system so that the inherent frequency representation in the system is changed. It is demonstrated that this may be done for basically all traditional signal processing algorithms. In audio applications it is beneficial to modify the system so that the new frequency representation is close to that of human hearing. One of the articles is a tutorial paper on the use of warping techniques in audio applications. Majority of the articles studies warped linear prediction, WLP, and its use in wideband audio coding. It is proposed that warped linear prediction would be particularly attractive method for low-delay wideband audio coding. Warping techniques are also applied to various modifications of classical linear predictive coding techniques. This was made possible partly by the introduction of a class of new implementation techniques for recursive filters in one of the articles. The proposed implementation algorithm for recursive filters having delay-free loops is a generic technique. This inspired to write an article which introduces a generalized warped linear predictive coding scheme. One example of the generalized approach is a linear predictive algorithm using almost logarithmic frequency representation.reviewe

    Wavenet based low rate speech coding

    Full text link
    Traditional parametric coding of speech facilitates low rate but provides poor reconstruction quality because of the inadequacy of the model used. We describe how a WaveNet generative speech model can be used to generate high quality speech from the bit stream of a standard parametric coder operating at 2.4 kb/s. We compare this parametric coder with a waveform coder based on the same generative model and show that approximating the signal waveform incurs a large rate penalty. Our experiments confirm the high performance of the WaveNet based coder and show that the speech produced by the system is able to additionally perform implicit bandwidth extension and does not significantly impair recognition of the original speaker for the human listener, even when that speaker has not been used during the training of the generative model.Comment: 5 pages, 2 figure

    A Comparison Between STRAIGHT, Glottal, an Sinusoidal Vocoding in Statistical Parametric Speech Synthesis

    Get PDF
    Speech is a fundamental method of human communication that allows conveying information between people. Even though the linguistic content is commonly regarded as the main information in speech, the signal contains a richness of other information, such as prosodic cues that shape the intended meaning of a sentence. This information is largely generated by quasi-periodic glottal excitation, which is the acoustic speech excitation airflow originating from the lungs that makes the vocal folds oscillate in the production of voiced speech. By regulating the sub-glottal pressure and the tension of the vocal folds, humans learn to affect the characteristics of the glottal excitation in order to signal the emotional state of the speaker for example. Glottal inverse filtering (GIF) is an estimation method for the glottal excitation of a recorded speech signal. Various cues about the speech signal, such as the mode of phonation, can be detected and analyzed from an estimate of the glottal flow, both instantaneously and as a function of time. Aside from its use in fundamental speech research, such as phonetics, the recent advances in GIF and machine learning enable a wider variety of GIF applications, such as emotional speech synthesis and the detection of paralinguistic information. However, GIF is a difficult inverse problem where the target algorithm output is generally unattainable with direct measurements. Thus the algorithms and their evaluation need to rely on some prior assumptions about the properties of the speech signal. A common thread utilized in most of the studies in this thesis is the estimation of the vocal tract transfer function (the key problem in GIF) by temporally weighting the optimization criterion in GIF so that the effect of the main excitation peak is attenuated. This thesis studies GIF from various perspectives---including the development of two new GIF methods that improve GIF performance over the state-of-the-art methods---and furthers basic research in the automated estimation of glottal excitation. The estimation of the GIF-based vocal tract transfer function for formant tracking and perceptually weighted speech envelope estimation is also studied. The central speech technology application of GIF addressed in the thesis is the use of GIF-based spectral envelope models and glottal excitation waveforms as target training data for the generative neural network models used in statistical parametric speech synthesis. The obtained results show that even though the presented studies provide improvements to the previous methodology for all voice types, GIF-based speech processing continues to mainly benefit male voices in speech synthesis applications.Puhe on olennainen osa ihmistenvälistä informaation siirtoa. Vaikka kielellistä sisältöä pidetään yleisesti puheen tärkeimpänä ominaisuutena, puhesignaali sisältää myös runsaasti muuta informaatiota kuten prosodisia vihjeitä, jotka muokkaavat siirrettävän informaation merkitystä. Tämä informaatio tuotetaan suurilta osin näennäisjaksollisella glottisherätteellä, joka on puheen herätteenä toimiva akustinen virtaussignaali. Säätämällä äänihuulten alapuolista painetta ja äänihuulten kireyttä ihmiset muuttavat glottisherätteen ominaisuuksia viestittääkseen esimerkiksi tunnetilaa. Glottaalinen käänteissuodatus (GKS) on laskennallinen menetelmä glottisherätteen estimointiin nauhoitetusta puhesignaalista. Glottisherätteen perusteella puheen laadusta voidaan tunnistaa useita piirteitä kuten ääntötapa, sekä hetkellisesti että ajan funktiona. Puheen perustutkimuksen, kuten fonetiikan, lisäksi viimeaikaiset edistykset GKS:ssä ja koneoppimisessa ovat avaamassa mahdollisuuksia laajempaan GKS:n soveltamiseen puheteknologiassa, kuten puhesynteesissä ja puheen biopiirteistämisessä paralingvistisiä sovelluksia varten. Haasteena on kuitenkin se, että GKS on vaikea käänteisongelma, jossa todellista puhetta vastaavan glottisherätteen suora mittaus on mahdotonta. Tästä johtuen GKS:ssä käytettävien algoritmien kehitystyö ja arviointi perustuu etukäteisoletuksiin puhesignaalin ominaisuuksista. Tässä väitöskirjassa esitetyissä menetelmissä on yhteisenä oletuksena se, että ääntöväylän siirtofunktio voidaan arvioida (joka on GKS:n pääongelma) aikapainottamalla GKS:n optimointikriteeriä niin, että glottisherätteen pääeksitaatiopiikkin vaikutus vaimenee. Tässä väitöskirjassa GKS:ta tutkitaan useasta eri näkökulmasta, jotka sisältävät kaksi uutta GKS-menetelmää, jotka parantavat arviointituloksia aikaisempiin menetelmiin verrattuna, sekä perustutkimusta käänteissuodatusprosessin automatisointiin liittyen. Lisäksi GKS-pohjaista ääntöväylän siirtofunktiota käytetään formanttiestimoinnissa sekä kuulohavaintopainotettuna versiona puheen spektrin verhokäyrän arvioinnissa. Tämän väitöskirjan keskeisin puheteknologiasovellus on GKS-pohjaisten puheen spektrin verhokäyrämallien sekä glottisheräteaaltomuotojen käyttö kohdedatana neuroverkkomalleille tilastollisessa parametrisessa puhesynteesissä. Saatujen tulosten perusteella kehitetyt menetelmät parantavat GKS-pohjaisten menetelmien laatua kaikilla äänityypeillä, mutta puhesynteesisovelluksissa GKS-pohjaiset ratkaisut hyödyttävät edelleen lähinnä matalia miesääniä

    Model-based analysis of noisy musical recordings with application to audio restoration

    Get PDF
    This thesis proposes digital signal processing algorithms for noise reduction and enhancement of audio signals. Approximately half of the work concerns signal modeling techniques for suppression of localized disturbances in audio signals, such as impulsive noise and low-frequency pulses. In this regard, novel algorithms and modifications to previous propositions are introduced with the aim of achieving a better balance between computational complexity and qualitative performance, in comparison with other schemes presented in the literature. The main contributions related to this set of articles are: an efficient algorithm for suppression of low-frequency pulses in audio signals; a scheme for impulsive noise detection that uses frequency-warped linear prediction; and two methods for reconstruction of audio signals within long gaps of missing samples. The remaining part of the work discusses applications of sound source modeling (SSM) techniques to audio restoration. It comprises application examples, such as a method for bandwidth extension of guitar tones, and discusses the challenge of model calibration based on noisy recorded sources. Regarding this matter, a frequency-selective spectral analysis technique called frequency-zooming ARMA (FZ-ARMA) modeling is proposed as an effective way to estimate the frequency and decay time of resonance modes associated with the partials of a given tone, despite the presence of corrupting noise in the observable signal.reviewe

    Stereo linear predictive coding of audio

    Get PDF

    Anthropomorphic Coding of Speech and Audio: A Model Inversion Approach

    Get PDF
    Auditory modeling is a well-established methodology that provides insight into human perception and that facilitates the extraction of signal features that are most relevant to the listener. The aim of this paper is to provide a tutorial on perceptual speech and audio coding using an invertible auditory model. In this approach, the audio signal is converted into an auditory representation using an invertible auditory model. The auditory representation is quantized and coded. Upon decoding, it is then transformed back into the acoustic domain. This transformation converts a complex distortion criterion into a simple one, thus facilitating quantization with low complexity. We briefly review past work on auditory models and describe in more detail the components of our invertible model and its inversion procedure, that is, the method to reconstruct the signal from the output of the auditory model. We summarize attempts to use the auditory representation for low-bit-rate coding. Our approach also allows the exploitation of the inherent redundancy of the human auditory system for the purpose of multiple description (joint source-channel) coding

    Recent Advances in Signal Processing

    Get PDF
    The signal processing task is a very critical issue in the majority of new technological inventions and challenges in a variety of applications in both science and engineering fields. Classical signal processing techniques have largely worked with mathematical models that are linear, local, stationary, and Gaussian. They have always favored closed-form tractability over real-world accuracy. These constraints were imposed by the lack of powerful computing tools. During the last few decades, signal processing theories, developments, and applications have matured rapidly and now include tools from many areas of mathematics, computer science, physics, and engineering. This book is targeted primarily toward both students and researchers who want to be exposed to a wide variety of signal processing techniques and algorithms. It includes 27 chapters that can be categorized into five different areas depending on the application at hand. These five categories are ordered to address image processing, speech processing, communication systems, time-series analysis, and educational packages respectively. The book has the advantage of providing a collection of applications that are completely independent and self-contained; thus, the interested reader can choose any chapter and skip to another without losing continuity
    corecore