Speech Enhancement Using Speech Synthesis Techniques
Traditional speech enhancement systems reduce noise by modifying the noisy signal to make it more like a clean signal, an approach that suffers from two problems: under-suppression of noise and over-suppression of speech. Both create distortions in the enhanced speech and hurt its quality. We propose to utilize speech synthesis techniques for a higher-quality speech enhancement system: synthesizing clean speech based on the noisy signal can produce outputs that are both noise-free and high quality. We first show that we can replace the noisy speech with a clean resynthesis drawn from a previously recorded clean-speech dictionary from the same speaker (concatenative resynthesis). Next, we show that a speech synthesizer (vocoder) can create a clean resynthesis of the noisy speech for more than one speaker. We term this parametric resynthesis (PR). PR can generate better prosody from noisy speech than a TTS system that uses only textual information. Additionally, we can use the high-quality speech generation capability of neural vocoders for better-quality speech enhancement. When trained on data from enough speakers, these vocoders can generate speech from unseen speakers, both male and female, with quality similar to that of speakers seen in training. Finally, we show that with neural vocoders we can achieve better objective signal and overall quality than state-of-the-art speech enhancement systems, and better subjective quality than an oracle mask-based system.
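As a rough illustration of the parametric resynthesis idea (not the paper's actual architecture), the sketch below shows the shape of the pipeline: a network predicts clean vocoder features from noisy ones, and a pretrained mel-conditioned neural vocoder, assumed available here as pretrained_vocoder, would render the waveform. All names and sizes are illustrative.

import torch
import torch.nn as nn

class PRNet(nn.Module):
    """Predict clean vocoder features (e.g. a mel spectrogram) from noisy ones."""
    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, noisy_mels):           # (batch, frames, n_mels)
        h, _ = self.rnn(noisy_mels)
        return self.proj(h)                  # predicted clean features

model = PRNet()
noisy = torch.randn(1, 200, 80)              # stand-in for noisy features
clean_mels = model(noisy)
# waveform = pretrained_vocoder(clean_mels)  # any mel-conditioned neural vocoder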
Observations on the dynamic control of an articulatory synthesizer using speech production data
This dissertation explores the automatic generation of gestural-score-based control structures for a three-dimensional articulatory speech synthesizer. The gestural scores are optimized in an articulatory resynthesis paradigm using a dynamic programming algorithm and a cost function that measures the deviation from a gold standard in the form of natural speech production data. This data had been recorded using electromagnetic articulography, from the same speaker to whom the synthesizer's vocal tract model had previously been adapted. Future work to create an English voice for the synthesizer and integrate it into a text-to-speech platform is outlined.
Controllable music performance synthesis via hierarchical modelling
Musical expression requires control of both what notes are played, and how they are performed.
Conventional audio synthesizers provide detailed expressive controls, but at the cost of realism.
Black-box neural audio synthesis and concatenative samplers can produce realistic audio, but have few mechanisms for control.
In this work, we introduce MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control.
Starting from interpretable Differentiable Digital Signal Processing (DDSP) synthesis parameters, we infer musical notes and high-level properties of their expressive performance (such as timbre, vibrato, dynamics, and articulation).
This creates a 3-level hierarchy (notes, performance, synthesis) that affords individuals the option to intervene at each level, or utilize trained priors (performance given notes, synthesis given performance) for creative assistance. Through quantitative experiments and listening tests, we demonstrate that this hierarchy can reconstruct high-fidelity audio, accurately predict performance attributes for a note sequence, independently manipulate the attributes of a given performance, and as a complete system, generate realistic audio from a novel note sequence.
By utilizing an interpretable hierarchy with multiple levels of granularity, MIDI-DDSP opens the door to assistive tools that empower individuals across a diverse range of musical experience.
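The three-level factorization can be pictured as two conditional stages that a user may override at any point. The following schematic is an assumption-laden toy, not the released MIDI-DDSP code; the attribute names and the ddsp_synthesize call are placeholders.

def performance_given_notes(notes):
    """Prior: expand note events into per-note expression attributes
    (vibrato depth, dynamics, articulation, ...)."""
    return [{"pitch": n["pitch"], "vibrato": 0.3, "volume": 0.7,
             "attack": 0.1} for n in notes]

def synthesis_given_performance(performance):
    """Prior: map expression attributes to frame-level DDSP
    parameters (f0, harmonic amplitudes, filtered noise)."""
    return [{"f0": 440.0 * 2 ** ((p["pitch"] - 69) / 12),
             "amp": p["volume"]} for p in performance]

notes = [{"pitch": 60, "start": 0.0, "end": 0.5}]
performance = performance_given_notes(notes)     # editable by the user
params = synthesis_given_performance(performance)  # also editable
# audio = ddsp_synthesize(params)  # harmonic-plus-noise DDSP synthesis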
An Investigation of nonlinear speech synthesis and pitch modification techniques
Speech synthesis technology plays an important role in many aspects of man-machine interaction, particularly in telephony applications. In order to be widely accepted, the synthesised speech quality should be as human-like as possible. This thesis investigates novel techniques for the speech signal generation stage in a speech synthesiser, based on concepts from nonlinear dynamical theory. It focuses on natural-sounding synthesis for voiced speech, coupled with the ability to generate the sound at the required pitch.
The one-dimensional voiced speech time-domain signals are embedded into an appropriate higher-dimensional space, using Takens' method of delays. These reconstructed state-space representations have approximately the same dynamical properties as the original speech-generating system and are thus effective models.
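A minimal sketch of Takens' method of delays, assuming a fixed embedding dimension m and lag tau (in practice both must be chosen to suit the signal at hand):

import numpy as np

def delay_embed(x, m=3, tau=10):
    """Return an (N - (m-1)*tau, m) array of delay vectors
    [x[t], x[t+tau], ..., x[t+(m-1)*tau]]."""
    n = len(x) - (m - 1) * tau
    return np.column_stack([x[i * tau : i * tau + n] for i in range(m)])

x = np.sin(np.linspace(0, 40 * np.pi, 4000))   # stand-in for voiced speech
states = delay_embed(x, m=3, tau=25)           # reconstructed state space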
A new technique for marking epoch points in voiced speech that operates in the state-space domain is proposed. Using the fact that one revolution of the state-space representation equals one pitch period, pitch-synchronous points can be found using a Poincaré map. The epoch pulses are pitch synchronous and can therefore be marked.
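One plausible realisation, sketched below under the assumption of a simple hyperplane section: each positive-going crossing of the section marks one revolution of the orbit, hence one epoch.

import numpy as np

def poincare_crossings(states, dim=0, level=0.0):
    """Indices where the orbit crosses the plane states[:, dim] == level
    in the positive direction (one crossing per revolution)."""
    s = states[:, dim] - level
    return np.where((s[:-1] < 0) & (s[1:] >= 0))[0] + 1

# with `states` from delay_embed above:
# epochs = poincare_crossings(states)   # pitch-synchronous epoch marks
# periods = np.diff(epochs)             # per-cycle pitch period estimates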
The same state-space representation is also used in a locally-linear speech synthesiser. This models the nonlinear dynamics of the speech signal by a series of local approximations, using the original signal as a template. The synthesised speech is natural-sounding because, rather than simply copying the original data, the technique makes use of the local dynamics to create a new, unique signal trajectory. Pitch modification within this synthesis structure is also investigated, with an attempt made to exploit the Šilnikov-type orbit of voiced speech state-space reconstructions. However, this technique is found to be incompatible with the locally-linear modelling technique, leaving the pitch modification issue unresolved.
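A hedged sketch of the locally-linear idea: at each step an affine map, fitted to the k nearest template states, predicts the next state, so the synthetic trajectory follows the local dynamics rather than copying the template. The neighbourhood size k and the brute-force neighbour search are illustrative choices, not the thesis's exact procedure.

import numpy as np

def local_linear_step(states, current, k=8):
    """Predict the next state from the k nearest template states."""
    d = np.linalg.norm(states[:-1] - current, axis=1)
    idx = np.argsort(d)[:k]
    X = np.column_stack([states[idx], np.ones(k)])        # affine fit
    coef, *_ = np.linalg.lstsq(X, states[idx + 1], rcond=None)
    return np.append(current, 1.0) @ coef

# starting from any state and iterating local_linear_step synthesises a
# new, unique trajectory; the first embedding coordinate is the signal.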
A different modelling strategy, using a radial basis function (RBF) neural network to model the state-space dynamics, is then considered. This produces a parametric model of the speech sound. Synthesised speech is obtained by connecting a delayed version of the network output back to the input via a global feedback loop; the network then synthesises speech in a free-running manner. Stability of the output is ensured by using regularisation theory when learning the weights. Complexity is also kept to a minimum because the network centres are fixed on a data-independent hyper-lattice, so only the linear-in-the-parameters weights need to be learnt for each vowel realisation. Pitch modification is again investigated, based around the idea of interpolating the weight vector between different realisations of the same vowel at differing pitch values. However, modelling the inter-pitch weight vector variations is very difficult, indicating that further study of pitch modification techniques is required before a complete nonlinear synthesiser can be implemented.
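The scheme might look roughly like this, with fixed centres, a ridge penalty standing in for regularisation theory, and the first embedding coordinate read out as the signal; the centre placement below is arbitrary rather than the thesis's data-independent hyper-lattice.

import numpy as np

def rbf_design(X, centres, width=1.0):
    d2 = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * width ** 2))

def fit_rbf(states, centres, lam=1e-3):
    """Regularised least squares for the linear-in-the-parameters weights."""
    Phi = rbf_design(states[:-1], centres)
    A = Phi.T @ Phi + lam * np.eye(len(centres))   # ridge term aids stability
    return np.linalg.solve(A, Phi.T @ states[1:])

def free_run(x0, centres, W, steps=1000, width=1.0):
    """Feed the output back to the input: free-running synthesis."""
    out, x = [], x0
    for _ in range(steps):
        x = rbf_design(x[None, :], centres, width)[0] @ W
        out.append(x[0])                           # first coordinate = signal
    return np.array(out)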
Making music through real-time voice timbre analysis: machine learning and timbral control
People can achieve rich musical expression through vocal sound: see, for example, human beatboxing, which achieves a wide timbral variety through a range of extended techniques. Yet the vocal modality is under-exploited as a controller for music systems. If we can analyse a vocal performance suitably in real time, then this information could be used to create voice-based interfaces with the potential for intuitive and fulfilling levels of expressive control.
Conversely, many modern techniques for music synthesis do not imply any
particular interface. Should a given parameter be controlled via a MIDI keyboard,
or a slider/fader, or a rotary dial? Automatic vocal analysis could provide
a fruitful basis for expressive interfaces to such electronic musical instruments.
The principal questions in applying vocal-based control are how to extract
musically meaningful information from the voice signal in real time, and how
to convert that information suitably into control data. In this thesis we address
these questions, with a focus on timbral control, and in particular we
develop approaches that can be used with a wide variety of musical instruments
by applying machine learning techniques to automatically derive the mappings
between expressive audio input and control output. The vocal audio signal is
construed to include a broad range of expression, in particular encompassing
the extended techniques used in human beatboxing.
The central contribution of this work is the application of supervised and
unsupervised machine learning techniques to automatically map vocal timbre
to synthesiser timbre and controls. Component contributions include a delayed decision-making strategy for low-latency sound classification, a regression-tree method to learn associations between regions of two unlabelled datasets, a fast estimator of multidimensional differential entropy, and a qualitative method for evaluating musical interfaces based on discourse analysis.
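As a toy illustration of the supervised mapping (random stand-in data, with a generic scikit-learn tree regressor in place of the thesis's specific methods):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
voice_feats = rng.normal(size=(500, 13))     # e.g. MFCC-like timbre frames
synth_ctrls = rng.uniform(size=(500, 4))     # paired synth control settings

mapper = DecisionTreeRegressor(max_depth=8).fit(voice_feats, synth_ctrls)

def control_frame(frame):
    """Map one real-time feature frame to synthesiser control values."""
    return mapper.predict(frame[None, :])[0]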
Corpus-based unit selection for natural-sounding speech synthesis
Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003. Includes bibliographical references (p. 179-196). By Jon Rong-Wei Yi.

Speech synthesis is an automatic encoding process carried out by machine through which symbols conveying linguistic information are converted into an acoustic waveform. Over the past decade or so, a trend toward a non-parametric, corpus-based approach has focused on using real human speech as source material for producing novel, natural-sounding speech. This work proposes a communication-theoretic formulation in which unit selection is a noisy channel through which an input sequence of symbols passes and from which an output sequence, possibly corrupted due to the coverage limits of the corpus, emerges. The penalty of approximation is quantified by substitution and concatenation costs, which grade which unit contexts are interchangeable and where concatenations are not perceivable. These costs are semi-automatically derived from data and are found to agree with acoustic-phonetic knowledge.

The implementation is based on a finite-state transducer (FST) representation that has been used successfully in speech and language processing applications, including speech recognition. A proposed constraint kernel topology connects all units in the corpus with associated substitution and concatenation costs and enables an efficient Viterbi search that operates with low latency and scales to large corpora. An A* search can be applied in a second, rescoring pass to incorporate finer acoustic modelling. Extensions to this FST-based search include hierarchical and paralinguistic modelling. The search can also be used in an iterative feedback loop to record new utterances that enhance corpus coverage.

This speech synthesis framework has been deployed across various domains and languages in many voices, a testament to its flexibility and rapid prototyping capability. Experimental subjects completing tasks in an air travel planning scenario by interacting in real time with a spoken dialogue system over the telephone found the system "easiest to understand" out of eight competing systems. In more detailed listening evaluations, subjective opinions garnered from human participants were found to be correlated with objective measures calculable by machine.
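The dynamic-programming core of unit selection can be sketched independently of the FST machinery. Below, substitution and concatenation costs are taken as given matrices; real systems derive them from data, as described above.

import numpy as np

def unit_select(sub_cost, join_cost):
    """sub_cost: (T, U) target costs; join_cost: (U, U) join costs.
    Returns the minimum-total-cost unit sequence of length T."""
    T, U = sub_cost.shape
    best = sub_cost[0].copy()
    back = np.zeros((T, U), dtype=int)
    for t in range(1, T):
        trans = best[:, None] + join_cost      # cost of every i -> j join
        back[t] = trans.argmin(axis=0)         # best predecessor per unit
        best = trans.min(axis=0) + sub_cost[t]
    path = [int(best.argmin())]                # trace back the optimum
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]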
Automatic sound synthesizer programming: techniques and applications
The aim of this thesis is to investigate techniques for, and applications of, automatic sound synthesizer programming. An automatic sound synthesizer programmer is a system that removes from the user the requirement to explicitly specify parameter settings for a sound synthesis algorithm. Two forms of these systems are discussed in this thesis: tone matching programmers and synthesis space explorers. A tone matching programmer takes as input a sound synthesis algorithm and a desired target sound; at its output it produces a configuration for the sound synthesis algorithm that causes it to emit a sound similar to the target. The techniques investigated for achieving this are genetic algorithms, neural networks, hill climbers, and data-driven approaches. A synthesis space explorer provides a user with a representation of the space of possible sounds that a synthesizer can produce and allows them to explore this space interactively. The applications of automatic sound synthesizer programming that are investigated include studio tools, an autonomous musical agent, and a self-reprogramming drum machine. The research employs several methodologies: the development of novel software frameworks and tools, the examination of existing software at the source code and performance levels, and user trials of the tools and software. The main contributions made are: a method for visualisation of sound synthesis space and low-dimensional control of sound synthesizers; a general-purpose framework for the deployment and testing of sound synthesis and optimisation algorithms in the SuperCollider language sclang; a comparison of a variety of optimisation techniques for sound synthesizer programming; an analysis of sound synthesizer error surfaces; a general-purpose sound synthesizer programmer compatible with industry-standard tools; and an automatic improviser which passes a loose equivalent of the Turing test for jazz musicians, i.e. being half of a man-machine duet that was rated as one of the best sessions of 2009 on the BBC's 'Jazz on 3' programme.
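To give a flavour of tone matching by genetic algorithm, here is a deliberately toy sketch: a two-oscillator "synth" stands in for a real synthesis algorithm, and fitness is negative spectral distance to the target. All constants are illustrative.

import numpy as np

rng = np.random.default_rng(1)

def synth(params, n=2048, sr=44100):
    """Toy synthesizer: mix of two sinusoids, returned as a magnitude spectrum."""
    t = np.arange(n) / sr
    f1, f2, mix = params
    x = (1 - mix) * np.sin(2 * np.pi * f1 * t) + mix * np.sin(2 * np.pi * f2 * t)
    return np.abs(np.fft.rfft(x))

target = synth(np.array([440.0, 880.0, 0.3]))   # the sound to be matched

def fitness(p):
    return -np.linalg.norm(synth(p) - target)

pop = rng.uniform([100, 100, 0], [2000, 2000, 1], size=(40, 3))
for _ in range(30):
    scores = np.array([fitness(p) for p in pop])
    parents = pop[np.argsort(scores)[-10:]]                  # keep the fittest
    children = parents[rng.integers(0, 10, 30)] \
               + rng.normal(0, [20, 20, 0.05], (30, 3))      # mutate offspring
    pop = np.vstack([parents, children])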