278 research outputs found

GLOTTAL EXCITATION EXTRACTION OF VOICED SPEECH - JOINTLY PARAMETRIC AND NONPARAMETRIC APPROACHES

Author: Chen Yiqiao
Publication venue: Clemson University Libraries
Publication date: 01/05/2012

The goal of this dissertation is to develop methods to recover glottal flow pulses, which contain biometric information about the speaker. The excitation information estimated from an observed speech utterance is modeled as the source of an inverse problem. Windowed linear prediction analysis and inverse filtering are first used to deconvolve the speech signal to obtain a rough estimate of the glottal flow pulses. Linear prediction and its inverse filtering can largely eliminate the vocal-tract response, which is usually modeled as an infinite impulse response filter. Remaining vocal-tract components that reside in the estimate after inverse filtering are next removed by maximum-phase and minimum-phase decomposition, which is implemented by applying the complex cepstrum to the initial estimate of the glottal pulses. The additive and residual errors from inverse filtering can be suppressed by higher-order statistics, which is the method used to calculate cepstrum representations. Some features directly provided by the glottal source's cepstrum representation, as well as fitting parameters for the estimated pulses, are used to form feature patterns that were applied to a minimum-distance classifier to realize a speaker identification system with a very limited number of subjects.
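
The first stage described in this abstract, linear prediction followed by inverse filtering, can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names (`autocorrelation`, `levinson_durbin`, `lpc`, `inverse_filter`, `all_pole_filter`) and the toy second-order vocal-tract filter are assumptions made for illustration, analysis windowing is omitted, and the excitation is a single impulse rather than a realistic glottal pulse.

```python
def autocorrelation(x, max_lag):
    """Biased autocorrelation r[0..max_lag] of a finite sequence."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the Yule-Walker equations; returns a = [1, a1, ..., ap]."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)                # updated prediction error power
    return a


def lpc(x, order):
    """Autocorrelation-method linear prediction (no window applied here)."""
    return levinson_durbin(autocorrelation(x, order), order)


def inverse_filter(x, a):
    """Apply the FIR prediction-error filter A(z) to remove the all-pole part."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]


def all_pole_filter(excitation, a):
    """Toy 'vocal tract': s[n] = e[n] - a1*s[n-1] - ... - ap*s[n-p]."""
    out = []
    for n, e in enumerate(excitation):
        out.append(e - sum(a[k] * out[n - k]
                           for k in range(1, len(a)) if n - k >= 0))
    return out


if __name__ == "__main__":
    # Hypothetical stable two-pole filter excited by a single impulse.
    true_a = [1.0, -1.3, 0.6]
    speech = all_pole_filter([1.0] + [0.0] * 199, true_a)
    est_a = lpc(speech, order=2)
    residual = inverse_filter(speech, est_a)
    print("estimated coefficients:", est_a)
    print("first residual samples:", residual[:4])
```

In this toy setting the residual approximates the excitation; on real speech the residual would still contain vocal-tract components, which is the motivation the abstract gives for the subsequent complex-cepstrum decomposition.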

Clemson University: TigerPrints

Expressive visual text to speech and expression adaptation using deep neural networks

Author: Cipolla R
Maia R
Parker J
Stylianou Y
Publication venue: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Publication date: 16/06/2017

In this paper, we present an expressive visual text-to-speech system (VTTS) based on a deep neural network (DNN). Given an input text sentence and a set of expression tags, the VTTS is able to produce not only the audio speech, but also the accompanying facial movements. The expressions can either be one of the expressions in the training corpus or a blend of expressions from the training corpus. Furthermore, we present a method of adapting a previously trained DNN to include a new expression using a small amount of training data. Experiments show that the proposed DNN-based VTTS is preferred by 57.9% over the baseline hidden Markov model based VTTS, which uses cluster adaptive training.

Crossref

Apollo (Cambridge)

Pre-processing of Speech Signals for Robust Parameter Estimation

Author: Esquivel Jaramillo Alfredo
Publication venue: Aalborg Universitetsforlag
Publication date: 01/01/2021

VBN

Single Channel Music Sound Separation Based on Spectrogram Decomposition and Note Classification

Author: A. Webb
C. Fevotte
D.D. Lee
G.J. Brown
K. Fukunaga
M.R. Every
M.S. Pedersen
P. Smaragdis
P.A. Devijver
T. Virtanen
W. Wang
Y. Li
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2011

Separating multiple music sources from a single-channel mixture is a challenging problem. We present a new approach to this problem based on non-negative matrix factorization (NMF) and note classification, assuming that the instruments used to play the sound signals are known a priori. The spectrogram of the mixture signal is first decomposed into building components (musical notes) using an NMF algorithm. The Mel-frequency cepstral coefficients (MFCCs) of both the decomposed components and the signals in the training dataset are extracted. The mean squared errors (MSEs) between the MFCC features of each decomposed music component and those of the training signals are used as the similarity measures for the decomposed music notes. The notes are then assigned to the corresponding instrument type by the K-nearest neighbors (K-NN) classification algorithm based on the MSEs. Finally, the source signals are reconstructed from the classified notes and the weighting matrices obtained from the NMF algorithm. Simulations are provided to show the performance of the proposed system. © 2011 Springer-Verlag Berlin Heidelberg
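
The two main steps named in this abstract, NMF decomposition and K-NN labelling by mean squared error, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' system: it uses the standard multiplicative updates for the Euclidean cost, the feature vectors stand in for MFCCs (which are not computed here), and all function names, parameters and the toy data are hypothetical.

```python
import random
from collections import Counter


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-12):
    """Factor a non-negative matrix V (freq x time) into W (freq x rank)
    and H (rank x time) with multiplicative updates."""
    rng = random.Random(seed)
    n_rows, n_cols = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n_rows)]
    h = [[rng.random() + 0.1 for _ in range(n_cols)] for _ in range(rank)]
    for _ in range(iterations):
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_cols)]
             for k in range(rank)]
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n_rows)]
    return w, h


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def knn_label(feature, training, k=3):
    """Label a feature vector by majority vote among the k training
    examples with the smallest MSE; training is a list of (vector, label)."""
    nearest = sorted(training, key=lambda item: mse(feature, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


if __name__ == "__main__":
    # Toy 'spectrogram': two spectral templates active at different times.
    v = matmul([[1, 0], [0, 3], [2, 0], [0, 1]],
               [[1, 1, 0, 0, 1], [0, 1, 1, 1, 0]])
    w, h = nmf(v, rank=2)
    print("reconstruction error:", frobenius_error(v, w, h))
    training = [([0.0, 0.0], "piano"), ([0.0, 1.0], "piano"),
                ([5.0, 5.0], "violin"), ([5.0, 6.0], "violin")]
    print("label:", knn_label([0.2, 0.3], training, k=3))
```

In a full pipeline, each column of `W` (with its row of `H`) would be resynthesized, converted to MFCCs, labelled, and the components sharing a label summed to reconstruct each source.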

Crossref

University of Surrey

Surrey Research Insight

Context-dependent acoustic modeling based on hidden maximum entropy model for statistical parametric speech synthesis

Author: A Borthwick
A Ratnaparkhi
AL Berger
AW Black
B Picart
CJ Leggetter
Fahimeh Bahmaninezhad
H Kawahara
H Liang
H Zen
Hossein Sameti
J Ghomeshi
J Nocedal
J Yamagishi
JJ Odell
K Hashimoto
K Oura
K Shinoda
K Tokuda
K Yu
L Qin
M Bijankhan
M Gibson
MJ Gales
R Kubichek
S Sakai
S Takaki
Simon King
SJ Young
Soheil Khorram
T Drugman
T Koriyama
T Toda
T Yoshimura
Thomas Drugman
V Rangarajan
VV Digalakis
Y Qian
YJ Wu
Publication venue: Springer Nature
Publication date: 01/01/2014

Crossref

Springer - Publisher Connector

Edinburgh Research Explorer

Probabilistic generative modeling of speech

Author: Zhang Yang
Publication date: 01/12/2015

Speech processing refers to a set of tasks that involve speech analysis and synthesis. Most speech processing algorithms model a subset of speech parameters of interest and blur the rest using signal processing techniques and feature extraction. However, evidence shows that many speech parameters can be more accurately estimated if they are modeled jointly; speech synthesis also benefits from joint modeling. This thesis proposes a probabilistic generative model for speech called the Probabilistic Acoustic Tube (PAT). The highlights of the model are threefold. First, it is among the very first works to build a complete probabilistic model for speech. Second, it has a well-designed model for the phase spectrum of speech, which has been hard to model and is often neglected. Third, it models the AM-FM effects in speech, which are perceptually significant but often ignored in frame-based speech processing algorithms. Experiments show that the proposed model has good potential for a number of speech processing tasks.

Illinois Digital Environment for Access to Learning and Scholarship Repository

Singing information processing: techniques and applications

Author: Molina Martinez Emilio
Publication venue: UMA Editorial
Publication date: 01/01/2017

Singing voice is an essential component of music in every culture in the world, since it is an incredibly natural form of musical expression. Consequently, automatic singing voice processing has a great impact from the perspectives of industry, culture and science. In this context, this thesis contributes a varied set of techniques and applications related to singing voice processing, together with a review of the associated state of the art in each case. First, several of the best-known pitch estimators are compared for the query-by-humming use case. The results show that \cite{Boersma1993} (with a non-obvious parameter adjustment) and \cite{Mauch2014} perform very well in this use case, given the smoothness of the extracted pitch contours. In addition, a novel singing voice transcription system based on a hysteresis process defined in time and frequency is proposed, along with a Matlab tool for singing voice evaluation. The interest of the proposed method is that it achieves error rates close to the state of the art with a very simple approach. The proposed evaluation tool, in turn, is a useful resource for better defining the problem and for better evaluating the solutions proposed by future researchers. This thesis also presents a method for the automatic assessment of vocal performance. It uses dynamic time warping to align the user's performance with a reference, thereby providing a score for intonation accuracy and for rhythm accuracy. The evaluation of the system shows a high correlation between the scores given by the system and the scores annotated by a group of expert musicians. Furthermore, a method for the realistic modification of singing voice intensity is presented. This transformation is based on a parametric model of the spectral envelope, and it substantially improves perceived realism compared with commercial software such as Melodyne or Vocaloid. The drawback of the proposed approach is that it requires manual intervention, but the results obtained yield important conclusions towards automatic intensity modification with realistic results. Finally, a method for correcting dissonances in isolated chords is proposed. It is based on a multiple-F0 analysis and a frequency shift of the corresponding sinusoidal components. The evaluation was carried out by a group of trained musicians and shows a clear increase in perceived consonance after the proposed transformation.
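
The dynamic-time-warping alignment mentioned in this abstract for vocal performance assessment can be sketched as below. This is a minimal illustration, not the thesis's scoring method: representing pitch as a sequence of MIDI note numbers, using absolute pitch difference as the local cost, and the names `dtw` and `mean_pitch_deviation` are all assumptions.

```python
def dtw(user, reference):
    """Classic dynamic time warping between two pitch sequences.
    Returns (total alignment cost, warping path as (user_idx, ref_idx))."""
    n, m = len(user), len(reference)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(user[i - 1] - reference[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # user frame repeated
                                     cost[i][j - 1],      # reference frame repeated
                                     cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover the alignment.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


def mean_pitch_deviation(user, reference):
    """Hypothetical intonation measure: average absolute pitch difference
    (in semitones) along the warping path."""
    _, path = dtw(user, reference)
    return sum(abs(user[i] - reference[j]) for i, j in path) / len(path)


if __name__ == "__main__":
    reference = [60, 62, 64, 65]            # target melody (MIDI notes)
    user = [60, 60, 62.3, 64, 64, 65.5]     # slower, slightly out of tune
    total, path = dtw(user, reference)
    print("alignment cost:", total)
    print("path:", path)
    print("mean deviation:", mean_pitch_deviation(user, reference))
```

A rhythm-related score could be derived from how far the warping path departs from the diagonal, though how the thesis computes its scores is not stated in this abstract.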

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Institucional Universidad de Málaga