24 research outputs found

    Bias and consistency of individual lingual articulatory behavior and its relationship to the first and second formants

    Individual variation in articulatory behavior can be characterized by bias and consistency in movement outcome. Consistency can be indexed by variable error (VE), representing the precision of individual performance, and bias by constant error (CE), representing the tendency of the movement outcome. The present study employs CE and VE to characterize individual articulatory behavior and assesses the relationship between consistency and bias in the articulatory and acoustic domains. We computed CE and VE of tongue blade and dorsum kinematic trajectories and of the first two formant curves in the production of /æ/ and /ɑ/ by 20 native U.S. English speakers. The relationships between acoustic and kinematic VE and CE were examined using gradient boosting machines. Results indicate that individual CE and VE vary over the time course of a vowel and that movement outcome is affected by linguistic constraints.
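    As a minimal sketch of how constant error and variable error can be computed over the time course of repeated trajectories, the snippet below uses plain numpy; the array shapes, the reference trajectory, and all variable names are illustrative assumptions rather than the study's actual pipeline.

```python
import numpy as np

# Hypothetical data: 30 repetitions of a tongue-dorsum trajectory,
# each time-normalized to 100 samples (position in mm along one dimension).
rng = np.random.default_rng(0)
reference = np.sin(np.linspace(0, np.pi, 100)) * 10.0        # assumed target trajectory
trials = reference + rng.normal(1.0, 0.8, size=(30, 100))    # ~1 mm bias, ~0.8 mm noise

# Constant error (CE): mean signed deviation from the reference at each
# time point -- the bias or tendency of the movement outcome.
ce = (trials - reference).mean(axis=0)

# Variable error (VE): standard deviation across repetitions at each
# time point -- the precision or consistency of the movement outcome.
ve = trials.std(axis=0, ddof=1)

print(ce[:5], ve[:5])
```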

    A cross-linguistic study of between-speaker variability in intensity dynamics in L1 and L2 spontaneous speech

    Dynamic aspects of the amplitude envelope appear to reflect speaker-specific information. Intensity dynamics, characterized as the temporal displacement of acoustic energy associated with articulatory mouth-opening (positive) and mouth-closing (negative) gestures, have been shown to explain between-speaker variability in read productions by native speakers of Zürich German. This study examines positive and negative intensity dynamics in spontaneous speech produced by Dutch speakers in their native language and in English. Acoustic analysis of informal monologues was performed to examine between-speaker variability. Negative dynamics explained a larger share of inter-speaker variability, strengthening the idea that prosodic control over the mouth-closing movement is weaker. Furthermore, there was a significant effect of language on intensity dynamics. These findings suggest that speaker-specific information may still be embedded in these time-bound measures regardless of the language in use.
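    One rough way to operationalize positive and negative intensity dynamics from a frame-level intensity contour is sketched below; the framing parameters and the use of a simple first difference are assumptions made for illustration and may well differ from the measures used in the study.

```python
import numpy as np

def intensity_dynamics(signal, sr, frame_ms=10.0):
    """Split frame-to-frame intensity change (dB) into positive (opening)
    and negative (closing) components. Illustrative operationalization only."""
    hop = int(sr * frame_ms / 1000)
    frames = [signal[i:i + hop] for i in range(0, len(signal) - hop, hop)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) + 1e-12 for f in frames])
    intensity_db = 20 * np.log10(rms)

    delta = np.diff(intensity_db)                   # dB change per frame
    positive = delta[delta > 0].mean() if np.any(delta > 0) else 0.0
    negative = delta[delta < 0].mean() if np.any(delta < 0) else 0.0
    return positive, negative

# Example with a synthetic amplitude-modulated signal
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
y = (0.5 + 0.5 * np.sin(2 * np.pi * 4 * t)) * np.sin(2 * np.pi * 220 * t)
print(intensity_dynamics(y, sr))
```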

    Mapping Techniques for Voice Conversion

    Speaker identity plays an important role in human communication. In addition to the linguistic content, speech utterances carry acoustic information about the speaker's characteristics. This thesis focuses on voice conversion, a technique that aims at changing the voice of one speaker (the source speaker) into the voice of another specific speaker (the target speaker) without changing the linguistic information. The relationship between the source and target speaker characteristics is learned from training data. Voice conversion can be used in various applications and fields: text-to-speech systems, dubbing, speech-to-speech translation, games, voice restoration, voice pathology, etc.

    Voice conversion poses many challenges: which features to extract from speech, how to find linguistic correspondences (alignment) between source and target features, which machine learning techniques to use for creating a mapping function between the features of the two speakers, and finally, how to make the desired modifications to the speech waveform. The features can be any parameters that describe the speech and the speaker identity, e.g. the spectral envelope, excitation, fundamental frequency, and phone durations. The main focus of the thesis is on the design of suitable mapping techniques between frame-level source and target features, but aspects related to parallel data alignment and prosody conversion are also addressed.

    The perception of quality and the success of the identity conversion are largely subjective. Conventional statistical techniques are able to produce good similarity between the original and the converted target voices, but quality is usually degraded. The objective of this thesis is to design conversion techniques that enable successful identity conversion while maintaining the original speech quality. Because of the limited amount of data, statistical techniques are usually employed to estimate the mapping function. The most popular technique is based on a Gaussian mixture model (GMM). However, conventional GMM-based conversion suffers from several problems that result in degraded speech quality. These problems are analyzed in the thesis, and a technique that combines GMM-based conversion with partial least squares regression is introduced to alleviate them. Additionally, approaches are proposed to solve the time-independent mapping problem associated with many algorithms.

    The most significant contribution of the thesis is a novel dynamic kernel partial least squares regression technique that allows a non-linear mapping function to be created and improves temporal correlation. The technique is straightforward, efficient, and requires very little tuning. It is shown to outperform the state-of-the-art GMM-based technique in both subjective and objective tests over a variety of speaker pairs. In addition, quality is further improved when aperiodicity and binary voicing values are predicted using the same technique.

    The vast majority of existing voice conversion algorithms concern the transformation of spectral envelopes. However, prosodic features, such as fundamental frequency movements and speaking rhythm, also contain important cues to identity. It is shown in the thesis that prosody alone can be used, to some extent, to recognize speakers who are familiar to the listeners. Furthermore, a prosody conversion technique is proposed that transforms fundamental frequency contours and durations at the syllable level. The technique is shown to improve similarity to the target speaker's prosody and to reduce roboticness compared with a conventional frame-based conversion technique.

    Recently, the trend has shifted from text-dependent to text-independent use cases, meaning that no parallel data are available. The techniques proposed in the thesis currently assume parallel data, i.e. that the same texts have been spoken by both speakers. However, excluding the prosody conversion algorithm, the proposed techniques require no phonetic information and are applicable to small amounts of training data. Moreover, many text-independent approaches are based on extracting a form of alignment as a pre-processing step, so the techniques proposed in the thesis can be applied after that alignment process.
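    A minimal sketch of frame-level spectral mapping with plain partial least squares regression, one building block of the approaches discussed above, is given below; it assumes the source and target feature frames have already been extracted and time-aligned (e.g. by DTW), and all names and dimensions are hypothetical. The dynamic kernel PLS technique proposed in the thesis additionally stacks context frames and uses a kernel mapping, which is not reproduced here.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

# Hypothetical aligned training data: one row per aligned frame pair,
# columns are spectral-envelope features (e.g. mel-cepstral coefficients).
rng = np.random.default_rng(0)
n_frames, dim = 2000, 24
X_src = rng.normal(size=(n_frames, dim))                         # source speaker frames
W = rng.normal(scale=0.3, size=(dim, dim))
Y_tgt = X_src @ W + rng.normal(scale=0.1, size=(n_frames, dim))  # target speaker frames

# Fit a linear PLS mapping from source to target feature space.
# The number of latent components trades detail against over-smoothing.
pls = PLSRegression(n_components=12)
pls.fit(X_src, Y_tgt)

# Convert unseen source frames; the converted features would then be
# passed to a vocoder / synthesis stage to produce the converted waveform.
X_new = rng.normal(size=(100, dim))
Y_converted = pls.predict(X_new)
print(Y_converted.shape)   # (100, 24)
```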

    Investigating the build-up of precedence effect using reflection masking

    The auditory processing level involved in the build-up of precedence [Freyman et al., J. Acoust. Soc. Am. 90, 874–884 (1991)] has been investigated here by employing reflection masked threshold (RMT) techniques. Given that RMT techniques are generally assumed to address lower levels of auditory signal processing, such an approach represents a bottom-up approach to the buildup of precedence. Three conditioner configurations measuring a possible buildup of reflection suppression were compared to the baseline RMT for four reflection delays ranging from 2.5 to 15 ms. No buildup of reflection suppression was observed for any of the conditioner configurations. A buildup of the template (a decrease in RMT for two of the conditioners), on the other hand, was found to be delay-dependent. For five of six listeners, RMT decreased relative to the baseline at reflection delays of 2.5 and 15 ms; for 5- and 10-ms delays, no change in threshold was observed. It is concluded that the low-level auditory processing involved in RMT is not sufficient to realize a buildup of reflection suppression. This confirms suggestions that higher-level processing is involved in precedence-effect buildup. The observed enhancement of reflection detection (RMT) may contribute to active suppression at higher processing levels.
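    For illustration only, the sketch below generates a direct sound plus a single delayed reflection, the kind of lead-lag stimulus used in reflection-masking paradigms, at delays in the 2.5–15 ms range; the noise-burst carrier, duration, and reflection level are assumptions, not the stimuli actually used in the study.

```python
import numpy as np

def direct_plus_reflection(direct, sr, delay_ms, reflection_gain_db=0.0):
    """Add a single delayed, scaled copy (the 'reflection') to a direct sound."""
    delay = int(round(sr * delay_ms / 1000.0))
    gain = 10 ** (reflection_gain_db / 20.0)
    out = np.zeros(len(direct) + delay)
    out[:len(direct)] += direct                 # direct (lead) sound
    out[delay:delay + len(direct)] += gain * direct   # delayed reflection (lag)
    return out

sr = 44100
rng = np.random.default_rng(0)
burst = rng.normal(size=int(0.005 * sr))        # 5-ms noise burst as the direct sound
for delay_ms in (2.5, 5.0, 10.0, 15.0):         # the delay range examined in the study
    stim = direct_plus_reflection(burst, sr, delay_ms, reflection_gain_db=-6.0)
```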

    Sophisticated strategies of voice disguise and their phonetic character

    Speech contains certain attributes characteristic of a given speaker, so-called idiosyncratic features. This thesis focuses on the form of these features in intentional voice disguise: whether speakers are able to change them in a substantial way, or whether they tend to remain stable in spite of intentional speech modifications. It was also investigated whether any general tendencies toward similar changes of such features under voice disguise exist among speakers. The observed features were statistical f0 indicators, f0 contours, vowel formants, long-term formant distributions, spectral characteristics of sibilants, intensity, intensity contours, speech and articulation rate, %V, and local articulation rate contours. In f0 median and standard deviation, vowel formants, LTFDs, intensity, articulation rate, and %V, prominent shifts under voice disguise were observed in general; in the majority of these parameters, the shifts differed among speakers. However, it was found that the value of %V generally tends to rise under voice disguise. Intensity also showed an increase in the majority of cases. In f0 contours, similar patterns were observed among speakers in normal speech; in disguised speech, however, greater differences appeared among speakers, with speakers tending to employ nonstandard dynamic f0 patterns more...
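    As a small sketch of two of the listed speaker-characterizing measures, the snippet below computes the median and standard deviation of f0 and the proportion of vocalic intervals (%V) from hypothetical pre-extracted data; the pitch contour and the segment labels are assumed inputs, not part of the thesis's actual toolchain.

```python
import numpy as np

# Hypothetical inputs: an f0 contour in Hz (0 = unvoiced frames) and a list of
# labelled interval durations in seconds ("V" = vocalic, "C" = consonantal).
f0_hz = np.array([0, 0, 118, 121, 125, 130, 0, 0, 142, 140, 138, 0])
intervals = [("V", 0.09), ("C", 0.06), ("V", 0.11), ("C", 0.08), ("V", 0.07)]

# Statistical f0 indicators, computed over voiced frames only.
voiced = f0_hz[f0_hz > 0]
f0_median = np.median(voiced)
f0_sd = voiced.std(ddof=1)

# %V: share of total utterance duration taken up by vocalic intervals.
vocalic = sum(d for lab, d in intervals if lab == "V")
total = sum(d for _, d in intervals)
percent_v = 100.0 * vocalic / total

print(f0_median, f0_sd, round(percent_v, 1))
```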

    Temporal processes involved in simultaneous reflection masking
