    Time-Varying Quasi-Closed-Phase Analysis for Accurate Formant Tracking in Speech Signals

    In this paper, we propose a new method for the accurate estimation and tracking of formants in speech signals using time-varying quasi-closed-phase (TVQCP) analysis. Conventional formant tracking methods typically adopt a two-stage estimate-and-track strategy in which an initial set of formant candidates is estimated using short-time analysis (e.g., 10–50 ms), followed by a tracking stage based on dynamic programming or a linear state-space model. One of the main disadvantages of these approaches is that the tracking stage, however good it may be, cannot improve upon the formant estimation accuracy of the first stage. The proposed TVQCP method provides single-stage formant tracking that combines the estimation and tracking stages into one. TVQCP analysis combines three approaches to improve formant estimation and tracking: (1) it uses temporally weighted quasi-closed-phase analysis to derive closed-phase estimates of the vocal tract with reduced interference from the excitation source, (2) it increases the residual sparsity by using $L_1$ optimization, and (3) it uses time-varying linear prediction analysis over long time windows (e.g., 100–200 ms) to impose a continuity constraint on the vocal tract model and hence on the formant trajectories. Formant tracking experiments with a wide variety of synthetic and natural speech signals show that the proposed TVQCP method performs better than conventional and popular formant tracking tools, such as Wavesurfer and Praat (based on dynamic programming), the KARMA algorithm (based on Kalman filtering), and DeepFormants (based on deep neural networks trained in a supervised manner). Matlab scripts for the proposed method can be found at: https://github.com/njaygowda/ftrac
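
The third ingredient above, time-varying linear prediction over a long window, can be illustrated with a minimal sketch. It assumes a common formulation in which each predictor coefficient is a low-order polynomial in normalized time, a_k(n) = sum_i b[k][i] * (n/N)**i, and it solves an ordinary (optionally weighted) least-squares problem rather than the $L_1$ problem used by TVQCP. The function names, the `weights` argument, and the toy signal are hypothetical and are not taken from the authors' Matlab scripts.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def tvlp_fit(x, order, basis_order, weights=None):
    """Least-squares time-varying linear prediction (illustrative only).

    Predicts x[n] from x[n-1..n-order] with coefficients
    a_k(n) = sum_i b[k-1][i] * t**i, where t = n / len(x).
    `weights` (hypothetical) scales each squared residual, standing in
    for a temporal weighting function; None means uniform weights.
    """
    N = len(x)
    dim = order * (basis_order + 1)
    AtA = [[0.0] * dim for _ in range(dim)]
    Atb = [0.0] * dim
    for n in range(order, N):
        t = n / N
        w = 1.0 if weights is None else weights[n]
        row = [x[n - k] * t ** i
               for k in range(1, order + 1)
               for i in range(basis_order + 1)]
        # Accumulate the normal equations (A^T W A) b = A^T W x.
        for p in range(dim):
            Atb[p] += w * row[p] * x[n]
            for q in range(dim):
                AtA[p][q] += w * row[p] * row[q]
    flat = solve_linear(AtA, Atb)
    step = basis_order + 1
    return [flat[k * step:(k + 1) * step] for k in range(order)]


# Toy signal: a second-order resonator whose first coefficient drifts
# linearly over the window (assumed values, chosen for illustration).
N = 200
x = [1.0, 0.5]
for n in range(2, N):
    a1 = 1.6 - 0.6 * (n / N)
    x.append(a1 * x[n - 1] - 1.0 * x[n - 2])

b = tvlp_fit(x, order=2, basis_order=1)
# b[0] holds the polynomial for a_1(n), b[1] the polynomial for a_2(n).
```

Because the coefficient trajectories are constrained to be smooth polynomials over the whole window, the resonance frequencies derived from them vary smoothly as well, which is the continuity constraint the abstract refers to.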

    Computationally efficient music synthesis: methods and sound design

    In this thesis, the design of a music synthesizer for systems with limited computing power and memory capacity is presented. First, possible synthesis techniques are reviewed and their applicability to computationally efficient music synthesis is discussed. In practice, the applicable techniques are limited to additive and source-filter synthesis and, in special cases, to frequency modulation, wavetable, and sampling synthesis. Next, the design of the structures of the applicable techniques is presented in detail, and the properties and design issues of these structures are discussed. A major implementation problem arises in digital source-filter synthesis, where the use of classic waveforms, such as the sawtooth wave, as the source signal is challenging due to aliasing caused by waveform discontinuities. Existing methods for bandlimited waveform synthesis are reviewed, and a new approach using a polynomial bandlimited step function is presented in detail with design rules for the applicable polynomials. The approach is also tested with two different third-order polynomials. They reduce aliasing more than the first-order polynomial at high frequencies, but at low frequencies the first-order polynomial performs better. In addition, some commonly used sound effect algorithms are reviewed with respect to their applicability to computationally efficient music synthesis. In many cases, the sound synthesis system must be capable of producing music that uses many different sounds, ranging from real acoustic instruments to electronic instruments and sounds from nature. Therefore, the music synthesis system requires careful sound design. In this thesis, sound design rules for imitating various sounds with the computationally efficient synthesis techniques are presented. In addition, the effect of the synthesis parameters on the design of sound variants is presented.
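
To make the aliasing problem and the polynomial correction concrete, here is a minimal sketch of a sawtooth oscillator with a polynomial bandlimited step correction. It uses the widely circulated two-sample form, in which a piecewise quadratic residual is subtracted from the trivial sawtooth around each discontinuity. This is an assumption chosen for illustration and is not necessarily one of the polynomials evaluated in the thesis; the function names are hypothetical.

```python
def poly_blep(phase, dt):
    """Two-sample polynomial step correction around the phase wrap.

    phase is the normalized phase in [0, 1); dt is the phase increment
    per sample (frequency / sample rate). Returns 0 away from the wrap.
    """
    if phase < dt:                      # sample just after the discontinuity
        u = phase / dt
        return u + u - u * u - 1.0
    if phase > 1.0 - dt:                # sample just before the discontinuity
        u = (phase - 1.0) / dt
        return u * u + u + u + 1.0
    return 0.0


def polyblep_saw(freq, sample_rate, num_samples):
    """Rising sawtooth in [-1, 1] with the step correction applied."""
    dt = freq / sample_rate
    phase = 0.0
    out = []
    for _ in range(num_samples):
        trivial = 2.0 * phase - 1.0     # aliased, discontinuous sawtooth
        out.append(trivial - poly_blep(phase, dt))
        phase += dt
        if phase >= 1.0:
            phase -= 1.0
    return out


out = polyblep_saw(440.0, 44100.0, 2000)
```

The correction costs a comparison and a few multiplications per sample and needs no lookup table, which is why this family of methods suits systems with little computing power and memory.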

    Developing a flexible and expressive realtime polyphonic wave terrain synthesis instrument based on a visual and multidimensional methodology

    The Jitter extended library for Max/MSP is distributed with a gamut of tools for the generation, processing, storage, and visual display of multidimensional data structures. With additional support for a wide range of media types, and for the interaction between these media, the environment is well suited to Wave Terrain Synthesis. This research details the practical development of a realtime Wave Terrain Synthesis instrument within the Max/MSP programming environment using the Jitter extended library. Various graphical processing routines are explored in relation to their potential use for Wave Terrain Synthesis.

    A Parametric Sound Object Model for Sound Texture Synthesis

    This thesis deals with the analysis and synthesis of sound textures based on parametric sound objects. An overview of the acoustic and perceptual principles of textural acoustic scenes is provided, and technical challenges for analysis and synthesis are considered. Four essential processing steps for sound texture analysis are identified, and existing sound texture systems are reviewed, using the four-step model as a guideline. A theoretical framework for analysis and synthesis is proposed. A parametric sound object synthesis (PSOS) model is introduced, which is able to describe individual recorded sounds through a fixed set of parameters. The model, which applies to harmonic and noisy sounds, is an extension of spectral modeling and uses spline curves to approximate spectral envelopes, as well as the evolution of parameters over time. In contrast to standard spectral modeling techniques, this representation uses the concept of objects instead of concatenated frames, and it provides a direct mapping between sounds of different lengths. Methods for automatic and manual conversion are shown. An evaluation is presented in which the ability of the model to encode a wide range of different sounds is examined. Although there are aspects of sounds that the model cannot accurately capture, such as polyphony and certain types of fast modulation, the results indicate that high-quality synthesis can be achieved for many different acoustic phenomena, including instruments and animal vocalizations. In contrast to many other forms of sound encoding, the parametric model facilitates various techniques of machine learning and intelligent processing, including sound clustering and principal component analysis. Strengths and weaknesses of the proposed method are reviewed, and possibilities for future development are discussed.

    Astronomical component estimation (ACE v.1) by time-variant sinusoidal modeling

    Accurately deciphering periodic variations in paleoclimate proxy signals is essential for cyclostratigraphy. Classical spectral analysis often relies on methods based on the (fast) Fourier transform. This technique has no unique solution separating variations in amplitude and frequency. This characteristic can make it difficult to correctly interpret a proxy's power spectrum or to accurately evaluate simultaneous changes in amplitude and frequency in evolutionary analyses. This drawback is circumvented by using a polynomial approach to estimate instantaneous amplitude and frequency in orbital components. This approach has proven useful for characterizing audio signals (music and speech), which are non-stationary in nature. Paleoclimate proxy signals and audio signals share similar dynamics; the only difference is the frequency relationship between the different components. A harmonic frequency relationship exists in audio signals, whereas this relation is non-harmonic in paleoclimate signals. However, this difference is irrelevant for the problem of separating simultaneous changes in amplitude and frequency. Using an approach with overlapping analysis frames, the model (Astronomical Component Estimation, version 1: ACE v.1) captures time variations of an orbital component by modulating a stationary sinusoid, centered at the component's mean frequency, with a single polynomial. Hence, the parameters that determine the model are the mean frequency of the orbital component and the polynomial coefficients. The first parameter depends on geologic interpretations, whereas the latter are estimated by means of linear least squares. As output, the model provides the orbital component waveform, either in the depth or time domain. Uncertainty analyses of the model estimates are performed using Monte Carlo simulations. Furthermore, the model allows for a unique decomposition of the signal into its instantaneous amplitude and frequency. Frequency modulation patterns reconstruct changes in accumulation rate, whereas amplitude modulation identifies eccentricity-modulated precession. The functioning of the time-variant sinusoidal model is illustrated and validated using a synthetic insolation signal. The new modeling approach is tested on two case studies: (1) a Pliocene-Pleistocene benthic δ¹⁸O record from Ocean Drilling Program (ODP) Site 846 and (2) a Danian magnetic susceptibility record from the Contessa Highway section, Gubbio, Italy.
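
The core fitting step described above can be sketched for a single analysis frame. The sketch assumes the model x(t) = sum_k t**k * (alpha_k * cos(w0*t) + beta_k * sin(w0*t)), which is a stationary sinusoid at the mean frequency w0 multiplied by a complex polynomial and is linear in the coefficients alpha_k and beta_k. The function names, the normalized time axis, and the test signal are hypothetical; frame overlap and the Monte Carlo uncertainty analysis of ACE v.1 are not reproduced.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = M[i][n] - sum(M[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / M[i][i]
    return x


def fit_modulated_sinusoid(t, x, omega0, degree):
    """Linear least-squares fit of a polynomial-modulated sinusoid.

    Returns (alpha, beta): polynomial coefficients, lowest power first,
    multiplying cos(omega0 * t) and sin(omega0 * t) respectively.
    """
    dim = 2 * (degree + 1)
    AtA = [[0.0] * dim for _ in range(dim)]
    Atb = [0.0] * dim
    for ti, xi in zip(t, x):
        c, s = math.cos(omega0 * ti), math.sin(omega0 * ti)
        row = ([ti ** k * c for k in range(degree + 1)]
               + [ti ** k * s for k in range(degree + 1)])
        for p in range(dim):
            Atb[p] += row[p] * xi
            for q in range(dim):
                AtA[p][q] += row[p] * row[q]
    coef = solve_linear(AtA, Atb)
    return coef[:degree + 1], coef[degree + 1:]


def instantaneous_amplitude(alpha, beta, ti):
    """Magnitude of the modulating polynomial at time ti."""
    a = sum(c * ti ** k for k, c in enumerate(alpha))
    b = sum(c * ti ** k for k, c in enumerate(beta))
    return math.hypot(a, b)


# Synthetic frame (assumed values): five cycles with a linearly
# growing amplitude, sampled on a normalized time axis.
t = [i / 200 for i in range(200)]
omega0 = 2.0 * math.pi * 5.0
x = [(1.0 + 0.5 * ti) * math.cos(omega0 * ti) for ti in t]

alpha, beta = fit_modulated_sinusoid(t, x, omega0, degree=1)
amp_mid = instantaneous_amplitude(alpha, beta, 0.5)
```

With A(t) and B(t) denoting the two fitted polynomials, the instantaneous phase in this sketch is w0*t - atan2(B(t), A(t)), so any drift of the component away from the mean frequency shows up as a slope in the atan2 term.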

    Hidden Markov model based Finnish text-to-speech system utilizing glottal inverse filtering

    In this work, a new hidden Markov model (HMM) based Finnish text-to-speech (TTS) system utilizing glottal inverse filtering is described. The primary goal of the new TTS system is to enable the production of natural-sounding synthetic speech in different speaking styles with different speaker characteristics and emotions. To achieve these goals, the function of the real human voice production mechanism is modeled with the help of glottal inverse filtering embedded in a statistical HMM framework. The new TTS system uses a parametrization method based on glottal inverse filtering that enables the extraction of voice source characteristics separately from other speech parameters, and thus the individual modeling of these characteristics in the HMM system. In the synthesis stage, natural glottal flow pulses are used for creating the voice source, and the voice source characteristics are further modified according to the adaptive all-pole model generated by the HMM system in order to imitate the natural variation in the real voice source. Subjective listening tests show that the quality of the new TTS system is considerably better than that of a traditional HMM-based speech synthesizer. Moreover, the new system is clearly able to produce natural-sounding synthetic speech with specific speaker characteristics.

    A computational framework for sound segregation in music signals

    Doctoral thesis. Electrical and Computer Engineering. Faculdade de Engenharia, Universidade do Porto. 200

    Designing sound: procedural audio research based on the book by Andy Farnell

    In procedural media, data normally acquired by measuring something, commonly described as sampling, is replaced by a set of computational rules (a procedure) that defines the typical structure and/or behaviour of that thing. Here, a general approach to sound as a definable process, rather than a recording, is developed. Through analysis of their physical and perceptual qualities, natural objects or processes that produce sound are modelled as digital Sounding Objects for use in the arts and entertainment. This thesis discusses different aspects of procedural audio, introducing several new approaches and solutions to this emerging field of sound design.