213 research outputs found
Mandarin Singing Voice Synthesis Based on Harmonic Plus Noise Model and Singing Expression Analysis
The purpose of this study is to investigate how humans interpret musical
scores expressively, and then design machines that sing like humans. We
consider six factors that have a strong influence on the expression of human
singing. The factors are related to the acoustic, phonetic, and musical
features of a real singing signal. Given real singing voices recorded following
the MIDI scores and lyrics, our analysis module can extract the expression
parameters from the real singing signals semi-automatically. The expression
parameters are used to control the singing voice synthesis (SVS) system for
Mandarin Chinese, which is based on the harmonic plus noise model (HNM). The
results of perceptual experiments show that integrating the expression factors
into the SVS system yields a notable improvement in perceptual naturalness,
clearness, and expressiveness. By one-to-one mapping of the real singing signal
and expression controls to the synthesizer, our SVS system can simulate the
interpretation of a real singer with the timbre of a speaker.Comment: 8 pages, technical repor
Normal-to-Lombard Adaptation of Speech Synthesis Using Long Short-Term Memory Recurrent Neural Networks
In this article, three adaptation methods are compared based on how well they change the speaking style of a neural network based text-to-speech (TTS) voice. The speaking style conversion adopted here is from normal to Lombard speech. The selected adaptation methods are: auxiliary features (AF), learning hidden unit contribution (LHUC), and fine-tuning (FT). Furthermore, four state-of-the-art TTS vocoders are compared in the same context. The evaluated vocoders are: GlottHMM, GlottDNN, STRAIGHT, and pulse model in log-domain (PML). Objective and subjective evaluations were conducted to study the performance of both the adaptation methods and the vocoders. In the subjective evaluations, speaking style similarity and speech intelligibility were assessed. In addition to acoustic model adaptation, phoneme durations were also adapted from normal to Lombard with the FT adaptation method. In objective evaluations and speaking style similarity tests, we found that the FT method outperformed the other two adaptation methods. In speech intelligibility tests, we found that there were no significant differences between vocoders although the PML vocoder showed slightly better performance compared to the three other vocoders.Peer reviewe
Description and performance of a digital mobile satellite terminal
A major goal of the Mobile Satellite Experiment (MSAT-X) program at the Jet Propulsion Lab (JPL) is the development of an advanced digital terminal for use in land mobile satellite communication. The terminal has been developed to minimize the risk of applying advanced technologies to future commercial mobile satellite systems (MSS). Testing with existing L band satellites was performed in fixed, land mobile and aeronautical mobile environments. JPL's development and tests of its mobile terminal have demonstrated the viability of narrowband digital voice communications in a land mobile environment through geostationary satellites. This paper provides a consolidated description of the terminal architecture and the performance of its individual elements
Differential encoding techniques applied to speech signals
The increasing use of digital communication systems has
produced a continuous search for efficient methods of speech
encoding.
This thesis describes investigations of novel differential
encoding systems. Initially Linear First Order DPCM systems
employing a simple delayed encoding algorithm are examined.
The systems detect an overload condition in the encoder, and
through a simple algorithm reduce the overload noise at the
expense of some increase in the quantization (granular) noise.
The signal-to-noise ratio (snr) performance of such d codec has
1 to 2 dB's advantage compared to the First Order Linear DPCM
system.
In order to obtain a large improvement in snr the high
correlation between successive pitch periods as well as the
correlation between successive samples in the voiced speech
waveform is exploited. A system called "Pitch Synchronous
First Order DPCM" (PSFOD) has been developed. Here the difference
Sequence formed between the samples of the input sequence in the
current pitch period and the samples of the stored decoded
sequence from the previous pitch period are encoded. This
difference sequence has a smaller dynamic range than the original
input speech sequence enabling a quantizer with better resolution
to be used for the same transmission bit rate. The snr is increased
by 6 dB compared with the peak snr of a First Order DPCM codea.
A development of the PSFOD system called a Pitch Synchronous
Differential Predictive Encoding system (PSDPE) is next investigated.
The principle of its operation is to predict the next sample in
the voiced-speech waveform, and form the prediction error which
is then subtracted from the corresponding decoded prediction
error in the previous pitch period. The difference is then
encoded and transmitted. The improvement in snr is approximately
8 dB compared to an ADPCM codea, when the PSDPE system uses an
adaptive PCM encoder. The snr of the system increases further
when the efficiency of the predictors used improve. However,
the performance of a predictor in any differential system is
closely related to the quantizer used. The better the quantization
the more information is available to the predictor and the better
the prediction of the incoming speech samples. This leads
automatically to the investigation in techniques of efficient
quantization. A novel adaptive quantization technique called
Dynamic Ratio quantizer (DRQ) is then considered and its theory
presented. The quantizer uses an adaptive non-linear element
which transforms the input samples of any amplitude to samples
within a defined amplitude range. A fixed uniform quantizer
quantizes the transformed signal. The snr for this quantizer
is almost constant over a range of input power limited in practice
by the dynamia range of the adaptive non-linear element, and it
is 2 to 3 dB's better than the snr of a One Word Memory adaptive
quantizer.
Digital computer simulation techniques have been used widely
in the above investigations and provide the necessary experimental
flexibility. Their use is described in the text
Statistical parametric speech synthesis based on sinusoidal models
This study focuses on improving the quality of statistical speech synthesis based on sinusoidal
models. Vocoders play a crucial role during the parametrisation and reconstruction process,
so we first lead an experimental comparison of a broad range of the leading vocoder types.
Although our study shows that for analysis / synthesis, sinusoidal models with complex amplitudes
can generate high quality of speech compared with source-filter ones, component
sinusoids are correlated with each other, and the number of parameters is also high and varies
in each frame, which constrains its application for statistical speech synthesis.
Therefore, we first propose a perceptually based dynamic sinusoidal model (PDM) to decrease
and fix the number of components typically used in the standard sinusoidal model.
Then, in order to apply the proposed vocoder with an HMM-based speech synthesis system
(HTS), two strategies for modelling sinusoidal parameters have been compared. In the first
method (DIR parameterisation), features extracted from the fixed- and low-dimensional PDM
are statistically modelled directly. In the second method (INT parameterisation), we convert
both static amplitude and dynamic slope from all the harmonics of a signal, which we term
the Harmonic Dynamic Model (HDM), to intermediate parameters (regularised cepstral coefficients
(RDC)) for modelling. Our results show that HDM with intermediate parameters can
generate comparable quality to STRAIGHT.
As correlations between features in the dynamic model cannot be modelled satisfactorily
by a typical HMM-based system with diagonal covariance, we have applied and tested a deep
neural network (DNN) for modelling features from these two methods. To fully exploit DNN
capabilities, we investigate ways to combine INT and DIR at the level of both DNN modelling
and waveform generation. For DNN training, we propose to use multi-task learning to
model cepstra (from INT) and log amplitudes (from DIR) as primary and secondary tasks. We
conclude from our results that sinusoidal models are indeed highly suited for statistical parametric
synthesis. The proposed method outperforms the state-of-the-art STRAIGHT-based
equivalent when used in conjunction with DNNs.
To further improve the voice quality, phase features generated from the proposed vocoder
also need to be parameterised and integrated into statistical modelling. Here, an alternative
statistical model referred to as the complex-valued neural network (CVNN), which treats complex coefficients as a whole, is proposed to model complex amplitude explicitly. A complex-valued
back-propagation algorithm using a logarithmic minimisation criterion which includes
both amplitude and phase errors is used as a learning rule. Three parameterisation methods
are studied for mapping text to acoustic features: RDC / real-valued log amplitude, complex-valued
amplitude with minimum phase and complex-valued amplitude with mixed phase. Our
results show the potential of using CVNNs for modelling both real and complex-valued acoustic
features. Overall, this thesis has established competitive alternative vocoders for speech
parametrisation and reconstruction. The utilisation of proposed vocoders on various acoustic
models (HMM / DNN / CVNN) clearly demonstrates that it is compelling to apply them for
the parametric statistical speech synthesis
Deep Neural Networks for Automatic Speech-To-Speech Translation of Open Educational Resources
[ES] En los últimos años, el aprendizaje profundo ha cambiado significativamente el panorama en diversas áreas del campo de la inteligencia artificial, entre las que se incluyen la visión por computador, el procesamiento del lenguaje natural, robótica o teoría de juegos. En particular, el sorprendente éxito del aprendizaje profundo en múltiples aplicaciones del campo del procesamiento del lenguaje natural tales como el reconocimiento automático del habla (ASR), la traducción automática (MT) o la síntesis de voz (TTS), ha supuesto una mejora drástica en la precisión de estos sistemas, extendiendo así su implantación a un mayor rango de aplicaciones en la vida real. En este momento, es evidente que las tecnologías de reconocimiento automático del habla y traducción automática pueden ser empleadas para producir, de forma efectiva, subtítulos multilingües de alta calidad de contenidos audiovisuales. Esto es particularmente cierto en el contexto de los vídeos educativos, donde las condiciones acústicas son normalmente favorables para los sistemas de ASR y el discurso está gramaticalmente bien formado. Sin embargo, en el caso de TTS, aunque los sistemas basados en redes neuronales han demostrado ser capaces de sintetizar voz de un realismo y calidad sin precedentes, todavía debe comprobarse si esta tecnología está lo suficientemente madura como para mejorar la accesibilidad y la participación en el aprendizaje en línea. Además, existen diversas tareas en el campo de la síntesis de voz que todavía suponen un reto, como la clonación de voz inter-lingüe, la síntesis incremental o la adaptación zero-shot a nuevos locutores. Esta tesis aborda la mejora de las prestaciones de los sistemas actuales de síntesis de voz basados en redes neuronales, así como la extensión de su aplicación en diversos escenarios, en el contexto de mejorar la accesibilidad en el aprendizaje en línea. En este sentido, este trabajo presta especial atención a la adaptación a nuevos locutores y a la clonación de voz inter-lingüe, ya que los textos a sintetizar se corresponden, en este caso, a traducciones de intervenciones originalmente en otro idioma.[CA] Durant aquests darrers anys, l'aprenentatge profund ha canviat significativament el panorama en diverses àrees del camp de la intel·ligència artificial, entre les quals s'inclouen la visió per computador, el processament del llenguatge natural, robòtica o la teoria de jocs. En particular, el sorprenent èxit de l'aprenentatge profund en múltiples aplicacions del camp del processament del llenguatge natural, com ara el reconeixement automàtic de la parla (ASR), la traducció automàtica (MT) o la síntesi de veu (TTS), ha suposat una millora dràstica en la precisió i qualitat d'aquests sistemes, estenent així la seva implantació a un ventall més ampli a la vida real. En aquest moment, és evident que les tecnologies de reconeixement automàtic de la parla i traducció automàtica poden ser emprades per a produir, de forma efectiva, subtítols multilingües d'alta qualitat de continguts audiovisuals. Això és particularment cert en el context dels vídeos educatius, on les condicions acústiques són normalment favorables per als sistemes d'ASR i el discurs està gramaticalment ben format. No obstant això, al cas de TTS, encara que els sistemes basats en xarxes neuronals han demostrat ser capaços de sintetitzar veu d'un realisme i qualitat sense precedents, encara s'ha de comprovar si aquesta tecnologia és ja prou madura com per millorar l'accessibilitat i la participació en l'aprenentatge en línia. A més, hi ha diverses tasques al camp de la síntesi de veu que encara suposen un repte, com ara la clonació de veu inter-lingüe, la síntesi incremental o l'adaptació zero-shot a nous locutors. Aquesta tesi aborda la millora de les prestacions dels sistemes actuals de síntesi de veu basats en xarxes neuronals, així com l'extensió de la seva aplicació en diversos escenaris, en el context de millorar l'accessibilitat en l'aprenentatge en línia. En aquest sentit, aquest treball presta especial atenció a l'adaptació a nous locutors i a la clonació de veu interlingüe, ja que els textos a sintetitzar es corresponen, en aquest cas, a traduccions d'intervencions originalment en un altre idioma.[EN] In recent years, deep learning has fundamentally changed the landscapes of a number of areas in artificial intelligence, including computer vision, natural language processing, robotics, and game theory. In particular, the striking success of deep learning in a large variety of natural language processing (NLP) applications, including automatic speech recognition (ASR), machine translation (MT), and text-to-speech (TTS), has resulted in major accuracy improvements, thus widening the applicability of these technologies in real-life settings. At this point, it is clear that ASR and MT technologies can be utilized to produce cost-effective, high-quality multilingual subtitles of video contents of different kinds. This is particularly true in the case of transcription and translation of video lectures and other kinds of educational materials, in which the audio recording conditions are usually favorable for the ASR task, and there is a grammatically well-formed speech. However, although state-of-the-art neural approaches to TTS have shown to drastically improve the naturalness and quality of synthetic speech over conventional concatenative and parametric systems, it is still unclear whether this technology is already mature enough to improve accessibility and engagement in online learning, and particularly in the context of higher education. Furthermore, advanced topics in TTS such as cross-lingual voice cloning, incremental TTS or zero-shot speaker adaptation remain an open challenge in the field. This thesis is about enhancing the performance and widening the applicability of modern neural TTS technologies in real-life settings, both in offline and streaming conditions, in the context of improving accessibility and engagement in online learning. Thus, particular emphasis is placed on speaker adaptation and cross-lingual voice cloning, as the input text corresponds to a translated utterance in this context.Pérez González De Martos, AM. (2022). Deep Neural Networks for Automatic Speech-To-Speech Translation of Open Educational Resources [Tesis doctoral]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/184019TESISPremios Extraordinarios de tesis doctorale
- …