772 research outputs found

Data mining Mandarin tone contour shapes

Author: Zhang Shuo
Publication date: 01/01/2019

In spontaneous speech, Mandarin tones that belong to the same tone category may exhibit many different contour shapes. We explore the use of data mining and NLP techniques for understanding the variability of tones in a large corpus of Mandarin newscast speech. First, we adapt a graph-based approach to characterize the clusters (fuzzy types) of tone contour shapes observed in each tone n-gram category. Second, we show correlations between these realized contour shape types and a bag of automatically extracted linguistic features. We discuss the implications of the current study within the context of phonological and information theory.

arXiv.org e-Print Archive

Crossref

Generation of prosody and speech for Mandarin Chinese

Author: DONG MINGHUI
Publication date: 19/02/2004

Ph.D, DOCTOR OF PHILOSOPHY

ScholarBank@NUS

FastGraphTTS: An Ultrafast Syntax-Aware Speech Synthesis Framework

Author: Cheng Ning
Sun Aolan
Wang Jianzong
Xiao Jing
Zhang Xulong
Publication date: 15/09/2023

This paper integrates graph-to-sequence modelling into an end-to-end text-to-speech framework so that synthesis is aware of the syntactic structure of the input text. Specifically, the input text is parsed by a dependency parsing module to form a syntactic graph. The syntactic graph is then encoded by a graph encoder to extract syntactic hidden information, which is concatenated with the phoneme embedding and fed to the alignment and flow-based decoding modules to generate the raw audio waveform. The model is evaluated on two languages, English and Mandarin, using single-speaker, few-sample target-speaker, and multi-speaker datasets, respectively. Experimental results show better prosodic consistency between the input text and the generated audio, higher scores in subjective prosodic evaluation, and the ability to perform voice conversion. In addition, the model's efficiency is substantially improved by a purpose-designed AI chip operator that yields a 5x acceleration. Comment: Accepted by the 35th IEEE International Conference on Tools with Artificial Intelligence (ICTAI 2023).

arXiv.org e-Print Archive

A Hierarchical Context-aware Modeling Approach for Multi-aspect and Multi-granular Pronunciation Assessment

Author: Chao Fu-An
Chen Berlin
Lo Tien-Hong
Sung Yao-Ting
Wu Tzu-I
Publication date: 01/06/2023

Automatic Pronunciation Assessment (APA) plays a vital role in Computer-assisted Pronunciation Training (CAPT) when evaluating a second language (L2) learner's speaking proficiency. However, an apparent downside of most de facto methods is that they parallelize the modeling process across different speech granularities without accounting for the hierarchical and local contextual relationships among them. In light of this, a novel hierarchical approach is proposed in this paper for multi-aspect and multi-granular APA. Specifically, we first introduce the notion of sup-phonemes to explore more subtle semantic traits of L2 speakers. Second, a depth-wise separable convolution layer is exploited to better encapsulate the local context cues at the sub-word level. Finally, we use a score-restraint attention pooling mechanism to predict the sentence-level scores and optimize the component models with a multitask learning (MTL) framework. Extensive experiments carried out on a publicly available benchmark dataset, viz. speechocean762, demonstrate the efficacy of our approach in relation to some cutting-edge baselines. Comment: Accepted to Interspeech 2023.

arXiv.org e-Print Archive

The role of tone in Swedish speech segmentation

Author: Lulaci Tugba
Roll Mikael
Söderström Pelle
Publication date: 02/06/2023

Lund University Publications

Auditory prediction due to coarticulatory cues

Author: Lulaci Tugba
Roll Mikael
Söderström Pelle
Tronnier Mechtild
Publication date: 01/06/2023

Lund University Publications

Subspace Gaussian Mixture Models for Language Identification and Dysarthric Speech Intelligibility Assessment

Author: Lleida Solano Eduardo
Martínez González David
Miguel Artiaga Antonio
Publication venue: Universidad de Zaragoza, Prensas de la Universidad
Publication date: 01/01/2015

In this Thesis, we investigated how to efficiently apply subspace Gaussian mixture modeling techniques to two speech technology problems, namely automatic spoken language identification (LID) and automatic intelligibility assessment of dysarthric speech. One of the most important of these techniques was joint factor analysis (JFA). JFA is essentially a Gaussian mixture model where the mean of each component is expressed as a sum of low-dimensional factors that represent different contributions to the speech signal. This factorization makes it possible to compensate for undesired sources of variability, like the channel. JFA was investigated both as a final classifier and as a feature extractor. In the latter approach, a single subspace including all sources of variability is trained, and points in this subspace are known as i-Vectors. Thus, an i-Vector is a low-dimensional representation of a single utterance, and i-Vectors are a very powerful feature for different machine learning problems. We investigated two different LID systems according to the type of features extracted from speech. First, we extracted acoustic features representing short-time spectral information. In this case, we observed relative improvements of up to 50% with i-Vectors with respect to JFA used as a classifier. We found that the channel subspace in a JFA model also contains language information, whereas i-Vectors do not discard any language information and, moreover, help to reduce mismatches between training and testing data. For classification, we modeled the i-Vectors of each language with a Gaussian distribution whose covariance matrix was shared among languages. This method is simple and fast, and it worked well without any post-processing. Second, we introduced the use of prosodic and formant information in the i-Vector system. Its performance was below that of the acoustic system, but the two were found to be complementary, and we obtained up to a 20% relative improvement with their fusion with respect to the acoustic system alone. Given the success in LID and the fact that, in theory, i-Vectors capture all the information present in the data, we decided to use i-Vectors for another task, specifically the assessment of speech intelligibility in speakers with different types of dysarthria. Speech therapists are very interested in this technology because it would allow them to rate the intelligibility of their patients objectively and consistently. In this case, the input features were extracted from short-term spectral information, and the intelligibility was assessed from the i-Vectors calculated from a set of words uttered by the tested speaker. We found that the performance was clearly much better if training data were available from the person who would use the application. We think that this limitation could be relaxed if we had larger databases for training. However, the recording process is not easy for people with disabilities, and it is difficult to obtain large datasets of dysarthric speakers open to the research community. Finally, the same i-Vector-based system architecture for intelligibility assessment was used to predict the accuracy that an automatic speech recognition (ASR) system would obtain with dysarthric speakers. The only difference between the two was the ground truth label set used for training. Predicting the performance of an ASR system would increase the confidence of speech therapists in these systems and would reduce health-related costs. The results were not as satisfactory as in the previous case, probably because an ASR system is complex and its accuracy can be very difficult to predict from acoustic information alone. Nonetheless, we think that we opened a door to an interesting research direction for the two problems.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositorio Universidad de Zaragoza

Autoregressive neural F0 model for statistical parametric speech synthesis

Author: Takaki Shinji
Wang Xin
Yamagishi Junichi
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 19/04/2018

Crossref

Edinburgh Research Explorer

A Sound Approach to Language Matters: In Honor of Ocke-Schwen Bohn

Author: Avesani Cinzia
Baker Brett Joseph
Balling Laura Winther
Behne Dawn M.
Best Catherine
Bundgaard-Nielsen Rikke
Carlet Angélica
Cebrian Juli
Christensen Ken Ramshøj
Cooper Angela
Flege James Emil
Hejná Michaela
Hejná Mísa
Horslund Camilla Søballe
Hua Congehao
Højen Anders
Jespersen Anna
Jespersen Anna Bothe
Jongman Allard
Jørgensen Henrik
Karmeli Sophia
Kizach Johannes
Kluge Denise Cristina
Lee Goun
Li Bin
Li Yingjie
Masapollo Matthew
Mooshammer Christine
Mora Joan C.
Mora-Plaza Ingrid
Niebuhr Oliver
Nyvad Anne Mette
Piske Thorsten
Polka Linda
Rasmussen Sidsel
Ruan Yufang
Sereno Joan A.
Steinlen Anja
Sørensen Mette Hjortshøj
Tyler Michael
Vayra Mario
Vikner Sten
Wang Yue
Wayland Ratree
Whalen D. H.
Wood Johanna
Yan Mengzhu
Publication venue: Aarhus University Library
Publication date: 16/05/2019

The contributions in this Festschrift were written by Ocke’s current and former PhD students, colleagues and research collaborators. The Festschrift is divided into six sections, moving from the smallest building blocks of language, through gradually expanding objects of linguistic inquiry, to the highest levels of description, all of which have formed a part of Ocke’s career in connection with his teaching and/or his academic output: “Segments”, “Perception of Accent”, “Between Sounds and Graphemes”, “Prosody”, “Morphology and Syntax” and “Second Language Acquisition”. Each one of these illustrates a sound approach to language matters.

AU Library Scholarly Publishing Services: E-books (Aarhus University)