10 research outputs found

Developing Sparse Representations for Anchor-Based Voice Conversion

Author: Liberatore Christopher Bryant
Publication date: 25/05/2022

Voice conversion is the task of transforming speech from one speaker so that it sounds as if it were produced by another speaker, changing the identity while retaining the linguistic content. There are many methods for performing voice conversion, but these methods often have onerous training requirements or fail when one speaker has a nonnative accent. To address these issues, this dissertation presents and evaluates a novel “anchor-based” representation of speech that separates speaker content from speaker identity by modeling how speakers form English phonemes. We call the proposed method Sparse, Anchor-Based Representation of Speech (SABR), and explore methods for optimizing the parameters of this model in native-to-native and native-to-nonnative voice conversion contexts. We begin the dissertation by demonstrating, in objective and subjective tests, how sparse coding in combination with a compact, phoneme-based dictionary can be used to separate speaker identity from content. The formulation of the representation then raises several research questions. First, we propose a method for improving synthesis quality: the sparse coding residual is converted from the source speaker’s space to the target speaker’s space with a frequency warping algorithm and then added to the target speaker’s estimated spectrum. Experimentally, we find that synthesis quality is significantly improved via this transform. Second, we propose and evaluate two methods for selecting and optimizing SABR anchors in native-to-native and native-to-nonnative voice conversion. We find that the proposed methods significantly improve synthesis quality over baseline algorithms, especially in native-to-nonnative voice conversion. In a detailed analysis of the algorithms, we find they focus on phonemes that are difficult for nonnative speakers of English or that naturally have multiple acoustic states.
Following this, we examine methods for adding temporal constraints to SABR via the Fused Lasso. The proposed method significantly reduces the inter-frame variance in the sparse codes compared with other methods that incorporate temporal features into sparse coding representations. Finally, in a case study, we examine the use of the SABR methods and optimizations in the context of a computer-aided pronunciation training system for building “Golden Speakers”, or ideal models for nonnative speakers of a second language to learn correct pronunciation. Under the hypothesis that the optimal “Golden Speaker” is the learner’s own voice, synthesized with a native accent, we used SABR to build voice models for nonnative speakers and evaluated the resulting synthesis in terms of quality, identity, and accentedness. We found that even when deployed in the field, the SABR method generated synthesis with low accentedness and similar acoustic identity to the target speaker, validating the use of the method for building “golden speakers”.
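
The core idea in the abstract above, coding a speech frame sparsely over one speaker's phoneme anchors and re-synthesizing it with another speaker's anchors, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and not taken from the dissertation: the function names `sparse_code` and `convert`, the toy 3-dimensional "spectra", the non-negative L1-penalized objective, and the projected-gradient solver.

```python
def sparse_code(frame, anchors, lam=0.01, iters=500):
    """Non-negative sparse weights w that approximately minimize
    0.5 * ||frame - sum_k w[k] * anchors[k]||^2 + lam * sum(w).
    Hypothetical helper: a projected-gradient (ISTA-style) solver."""
    k, d = len(anchors), len(frame)
    # Safe step size: the squared Frobenius norm bounds the Lipschitz constant.
    step = 1.0 / sum(v * v for a in anchors for v in a)
    w = [0.0] * k
    for _ in range(iters):
        recon = [sum(w[j] * anchors[j][i] for j in range(k)) for i in range(d)]
        resid = [r - x for r, x in zip(recon, frame)]
        grad = [sum(a * r for a, r in zip(anchors[j], resid)) + lam
                for j in range(k)]
        # Gradient step followed by projection onto w >= 0.
        w = [max(0.0, wj - step * gj) for wj, gj in zip(w, grad)]
    return w


def convert(frame, src_anchors, tgt_anchors, lam=0.01):
    """Code the frame over the source anchors, then rebuild it from the
    target anchors using the same weights (assumed conversion rule)."""
    w = sparse_code(frame, src_anchors, lam)
    d = len(tgt_anchors[0])
    out = [sum(w[j] * tgt_anchors[j][i] for j in range(len(w))) for i in range(d)]
    return out, w


# Toy anchors: one row per phoneme, paired by index across the two speakers.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 4.0]]
converted, weights = convert([0.0, 1.0, 0.0], src, tgt)
print(weights, converted)
```

Because the weights carry the content and the anchors carry the identity, swapping the dictionary changes the speaker while leaving the weights untouched. The L1 penalty shrinks the active weight slightly below 1, which is the usual cost of encouraging sparsity.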

Texas A&M Repository

Grid-based approximation for voice conversion in low resource environments

Publication venue: Springer
Publication date: 21/01/2016

Springer - Publisher Connector

Building a personal speaker voice model with a universal phonetic feature space based on an artificial neural network

Author: Азаров Илья Сергеевич
Петровский Александр Александрович
Publication venue: СПб ФИЦ РАН
Publication date: 16/12/2014

The paper investigates the possibility of creating a personal voice model using transcribed speech samples of a specified speaker. The paper presents a practical way of building such a voice model and some experimental results of applying the model to voice conversion. The model uses an artificial neural network organized as an autoencoder that establishes a correspondence between the space of speech parameters and the space of possible phonetic states, unified for any voice.

Информатика и автоматизация

Development of a Two-Level Warping Algorithm and Its Application to Speech Signal Processing

Author: Al-Dulaimi Al-Waled H.
Publication venue: DigitalCommons@USU
Publication date: 01/05/2021

In many different fields there are signals that need to be aligned or “warped” in order to measure the similarity between them. When two time signals are compared, or when a pattern is sought in a larger stream of data, it may be necessary to warp one of the signals in a nonlinear way by compressing or stretching it to fit the other. Simple point-to-point comparison may give inadequate results, because one part of a signal might be compared with a different relative part of the other signal or pattern. Such cases need some sort of alignment to do the comparison. Dynamic Time Warping (DTW) is a powerful and widely used technique of time series analysis that performs such nonlinear warping in the temporal domain. The work in this dissertation develops in two directions. The first direction is to extend dynamic time warping to produce a two-level dynamic warping algorithm, with warping in both the temporal and spectral domains. While hundreds of research efforts in the last two decades have applied one-dimensional warping between time series, extending the DTW method to two or more dimensions poses a more involved problem. The two-dimensional dynamic warping algorithm developed here is ideally suited for a variety of speech signal processing tasks. The second direction is focused on two speech signal applications. The first application is the evaluation of dysarthric speech. Dysarthria is a neurological motor speech disorder, which is characterized by spectral and temporal degradation in speech production. Dysarthria management has focused primarily on teaching patients to improve their ability to produce speech or on strategies to compensate for their deficits. However, many individuals with dysarthria are not well-suited for traditional speaker-oriented intervention. Recent studies have shown that speech intelligibility can be improved by training the listener to better understand the degraded speech signal.
A computer-based training tool was developed using a two-level dynamic warping algorithm to eventually be incorporated into a program that trains listeners to learn to imitate dysarthric speech by providing subjects with feedback about the accuracy of their imitation attempts during training. The second application is voice transformation. Voice transformation techniques aim to modify a subject’s voice characteristics to make them sound like someone else, for example from a male speaker to a female speaker. The approach taken here avoids the need to find acoustic parameters, as many voice transformation methods do, and instead deals directly with spectral information. Based on the two-level DW, it is straightforward to map the source speech to target speech when both are available. The resulting spectrally warped signal introduces significant processing artifacts. Phase reconstruction was applied to the transformed signal to improve the quality of the final sound. Neural networks are trained to perform the voice transformation.
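
As a point of reference for the abstract above, the classic one-level, temporal-only DTW that the dissertation extends can be sketched in a few lines. This is not the author's two-level algorithm; the function name `dtw`, the absolute-difference local cost, and the toy sequences are assumptions chosen for illustration.

```python
def dtw(a, b):
    """Classic dynamic time warping between two 1-D sequences.
    Returns (total alignment cost, list of aligned index pairs)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local distance (assumed metric)
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats
                                 cost[i][j - 1],      # b[j-1] repeats
                                 cost[i - 1][j - 1])  # both advance
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# The second sequence lingers on its first value; DTW stretches the first to match.
total, path = dtw([0, 1, 2], [0, 0, 1, 2])
print(total, path)
```

A point-to-point comparison of these two sequences would report a mismatch at every position after the first, whereas the warped alignment pairs each value with its counterpart.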

DigitalCommons@USU

Mapping Techniques for Voice Conversion

Author: Helander Elina
Publication venue: Tampere University of Technology
Publication date: 01/01/2012

Speaker identity plays an important role in human communication. In addition to the linguistic content, speech utterances contain acoustic information of the speaker characteristics. This thesis focuses on voice conversion, a technique that aims at changing the voice of one speaker (a source speaker) into the voice of another specific speaker (a target speaker) without changing the linguistic information. The relationship between the source and target speaker characteristics is learned from the training data. Voice conversion can be used in various applications and fields: text-to-speech systems, dubbing, speech-to-speech translation, games, voice restoration, voice pathology, etc. Voice conversion offers many challenges: which features to extract from speech, how to find linguistic correspondences (alignment) between source and target features, which machine learning techniques to use for creating a mapping function between the features of the speakers, and finally, how to make the desired modifications to the speech waveform. The features can be any parameters that describe the speech and the speaker identity, e.g. spectral envelope, excitation, fundamental frequency, and phone durations. The main focus of the thesis is on the design of suitable mapping techniques between frame-level source and target features, but also aspects related to parallel data alignment and prosody conversion are addressed. The perception of the quality and the success of the identity conversion are largely subjective. Conventional statistical techniques are able to produce good similarity between the original and the converted target voices but the quality is usually degraded. The objective of this thesis is to design conversion techniques that enable successful identity conversion while maintaining the original speech quality. Due to the limited amount of data, statistical techniques are usually utilized in extracting the mapping function. 
The most popular technique is based on a Gaussian mixture model (GMM). However, conventional GMM-based conversion suffers from many problems that result in degraded speech quality. The problems are analyzed in this thesis, and a technique that combines GMM-based conversion with partial least squares regression is introduced to alleviate these problems. Additionally, approaches to solve the time-independent mapping problem associated with many algorithms are proposed. The most significant contribution of the thesis is the proposed novel dynamic kernel partial least squares regression technique that allows creating a non-linear mapping function and improves temporal correlation. The technique is straightforward, efficient and requires very little tuning. It is shown to outperform the state-of-the-art GMM-based technique using both subjective and objective tests over a variety of speaker pairs. In addition, quality is further improved when aperiodicity and binary voicing values are predicted using the same technique. The vast majority of the existing voice conversion algorithms concern the transformation of the spectral envelopes. However, prosodic features, such as fundamental frequency movements and speaking rhythm, also contain important cues of identity. It is shown in the thesis that pure prosody alone can be used, to some extent, to recognize speakers that are familiar to the listeners. Furthermore, a prosody conversion technique is proposed that transforms fundamental frequency contours and durations at syllable level. The technique is shown to improve similarity to the target speaker’s prosody and reduce roboticness compared to a conventional frame-based conversion technique. Recently, the trend has shifted from text-dependent to text-independent use cases meaning that there is no parallel data available. The techniques proposed in the thesis currently assume parallel data, i.e. that the same texts have been spoken by both speakers. 
However, excluding the prosody conversion algorithm, the proposed techniques require no phonetic information and are applicable to a small amount of training data. Moreover, many text-independent approaches are based on extracting a sort of alignment as a pre-processing step. Thus, the techniques proposed in the thesis can be exploited after the alignment process.

Trepo - Institutional Repository of Tampere University

Techniques for personalizing synthetic voices for use by people with speech disabilities

Author: Alonso Burguera Agustín
Publication date: 08/11/2023

151 pp. This thesis presents advances in the personalization of the synthetic voices employed by text-to-speech systems used by people with a speech disability. A new speaker adaptation algorithm is presented for synthetic voices based on statistical parametric synthesis. The algorithm uses only vowel segments to imitate the target speaker's voice, and it has been shown to be robust to data scarcity and to perform comparably to other state-of-the-art algorithms. The design and implementation of a voice bank is also described, in which anyone can record their real voice to generate a synthetic voice that another user can later use. In this way, people can "donate" their voice. Finally, a methodology is presented that uses several objective measures for evaluating speech signals to score the quality of the voices available in the voice bank.

Archivo Digital para la Docencia y la Investigación

Aeronautical engineering: A cumulative index to a continuing bibliography (supplement 274)

This publication is a cumulative index to the abstracts contained in supplements 262 through 273 of Aeronautical Engineering: A Continuing Bibliography. The bibliographic series is compiled through the cooperative efforts of the American Institute of Aeronautics and Astronautics (AIAA) and the National Aeronautics and Space Administration (NASA). Seven indexes are included: subject, personal author, corporate source, foreign technology, contract number, report number, and accession number.

NASA Technical Reports Server

Aeronautical engineering: A continuing bibliography with indexes (supplement 289)

This bibliography lists 792 reports, articles, and other documents introduced into the NASA scientific and technical information system in Mar. 1993. Subject coverage includes: design, construction, and testing of aircraft and aircraft engines; aircraft components, equipment, and systems; ground support systems; and theoretical and applied aspects of aerodynamics and general fluid dynamics.

NASA Technical Reports Server

Aeronautical engineering: A continuing bibliography with indexes (supplement 284)

This bibliography lists 974 reports, articles, and other documents introduced into the NASA scientific and technical information system in Oct. 1992. The coverage includes documents on design, construction, evaluation, testing, operation, and performance of aircraft (including aircraft engines) and associated components, equipment, and systems. It also includes research and development in aerodynamics, aeronautics, and ground support equipment for aeronautical vehicles.

NASA Technical Reports Server