Automatic Measurement of Pre-aspiration
Pre-aspiration is defined as the period of glottal friction occurring in
sequences of vocalic/consonantal sonorants and phonetically voiceless
obstruents. We propose two machine learning methods for automatic measurement
of pre-aspiration duration: a feedforward neural network, which works at the
frame level; and a structured prediction model, which relies on manually
designed feature functions, and works at the segment level. The input for both
algorithms is a speech signal of an arbitrary length containing a single
obstruent, and the output is a pair of times which constitutes the
pre-aspiration boundaries. We train both models on a set of manually annotated
examples. Results suggest that the structured model is superior to the
frame-based model as it yields higher accuracy in predicting the boundaries and
generalizes to new speakers and new languages. Finally, we demonstrate the
applicability of our structured prediction algorithm by replicating a
linguistic analysis of pre-aspiration in Aberystwyth English with high
correlation.
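As a sketch, the segment-level approach can be written as an exhaustive scoring of candidate boundary pairs under a linear model. The feature functions and weights below are illustrative placeholders, not the paper's actual hand-designed features:

```python
def feature_vector(frames, onset, offset):
    """Hypothetical segment-level features for a candidate pre-aspiration
    span [onset, offset): mean frame energy inside the span, and the span
    duration. Stand-ins for the manually designed feature functions."""
    inside = frames[onset:offset]
    return [sum(inside) / len(inside), offset - onset]

def predict_boundaries(frames, weights, min_len=2, max_len=20):
    """Score every candidate (onset, offset) pair and return the argmax:
    the essence of segment-level structured prediction, where the output
    is a single pair of times bounding the pre-aspiration."""
    best, best_score = None, float("-inf")
    n = len(frames)
    for onset in range(n - min_len + 1):
        for offset in range(onset + min_len, min(onset + max_len, n) + 1):
            feats = feature_vector(frames, onset, offset)
            score = sum(w * f for w, f in zip(weights, feats))
            if score > best_score:
                best, best_score = (onset, offset), score
    return best
```

In a real system the weights would be learned from the manually annotated examples; here they are fixed by hand purely to illustrate the search.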
Machine Assisted Analysis of Vowel Length Contrasts in Wolof
Growing digital archives and improving algorithms for automatic analysis of
text and speech create new research opportunities for fundamental research in
phonetics. Such empirical approaches allow statistical evaluation of a much
larger set of hypotheses about phonetic variation and its conditioning factors
(among them geographical/dialectal variants). This paper illustrates this
vision and puts automatic methods to the test on a phenomenon that is not
easily observable: vowel length contrast. We focus on Wolof, an
under-resourced language from Sub-Saharan Africa. In particular, we propose
multiple features for a fine-grained evaluation of the degree of length
contrast under different factors such as read vs. semi-spontaneous speech and
standard vs. dialectal Wolof. Our measurements, made fully automatically on
more than 20k vowel tokens, show that the proposed features can highlight
different degrees of contrast for each vowel considered. We notably show that
contrast is weaker in semi-spontaneous speech and in a non-standard
semi-spontaneous dialect.
Comment: Accepted to Interspeech 201
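One plausible way to quantify a "degree of length contrast" is an effect size between the duration distributions of short and long vowels. Cohen's d, sketched below, is an illustrative choice and not necessarily one of the paper's actual features:

```python
import statistics

def contrast_strength(short_durations, long_durations):
    """Cohen's d between short- and long-vowel duration samples (in ms):
    larger values mean the two categories are better separated, i.e. a
    stronger length contrast. An illustrative measure only."""
    m_short = statistics.mean(short_durations)
    m_long = statistics.mean(long_durations)
    # Pooled standard deviation across the two samples.
    pooled_sd = ((statistics.variance(short_durations)
                  + statistics.variance(long_durations)) / 2) ** 0.5
    return (m_long - m_short) / pooled_sd
```

Comparing the value of such a feature across conditions (read vs. semi-spontaneous, standard vs. dialectal) is what lets one say where the contrast is weaker.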
Reducing Audible Spectral Discontinuities
In this paper, a common problem in diphone synthesis is discussed, viz., the occurrence of audible discontinuities at diphone boundaries. Informal observations show that spectral mismatch is most likely the cause of this phenomenon. We first set out to find an objective spectral measure for discontinuity. To this end, several spectral distance measures are related to the results of a listening experiment. Then, we studied the feasibility of extending the diphone database with context-sensitive diphones to reduce the occurrence of audible discontinuities. The number of additional diphones is limited by clustering consonant contexts that have a similar effect on the surrounding vowels, on the basis of the best-performing distance measure. A listening experiment has shown that the addition of these context-sensitive diphones significantly reduces the number of audible discontinuities.
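A minimal stand-in for such an objective measure is a spectral distance computed across the diphone join. The frame representation (e.g. MFCC-like vectors) and the Euclidean distance below are assumptions for illustration, not the paper's chosen measure:

```python
import math

def join_discontinuity(left_frames, right_frames, n_edge=3):
    """Mean Euclidean distance between the last n_edge spectral vectors
    of the left diphone and the first n_edge of the right diphone.
    Higher values suggest a larger spectral mismatch at the join."""
    pairs = zip(left_frames[-n_edge:], right_frames[:n_edge])
    distances = [math.dist(a, b) for a, b in pairs]
    return sum(distances) / len(distances)
```

Relating a measure like this to listener judgments is what allows it to be validated as a predictor of audible discontinuity.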
Structured heterogeneity in Scottish stops over the 20th Century
How and why speakers differ in the phonetic implementation of phonological contrasts, and the relationship of this "structured heterogeneity" to language change, has been a key focus over fifty years of variationist sociolinguistics. In phonetics, interest has recently grown in uncovering "structured variability" (how speakers can differ greatly in phonetic realization in nonrandom ways) as part of the long-standing goal of understanding variability in speech. The English stop voicing contrast, which combines extensive phonetic variability with phonological stability, provides an ideal setting for an approach to understanding structured variation in the sounds of a community's language that illuminates both synchrony and diachrony. This article examines the voicing contrast in a vernacular dialect (Glasgow Scots) in spontaneous speech, focusing on individual speaker variability within and across cues, including over time. Speakers differ greatly in the use of each of three phonetic cues to the contrast, while reliably using each one to differentiate voiced and voiceless stops. Interspeaker variability is highly structured: speakers lie along a continuum of use of each cue, as well as correlated use of two cues (voice onset time and closure voicing) along a single axis. Diachronic change occurs along this axis, toward a more aspiration-based and less voicing-based phonetic realization of the contrast, suggesting an important connection between synchronic and diachronic speaker variation.
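The cross-cue covariation described above can be checked with a per-speaker Pearson correlation between the two cues. The input format here (one mean value per speaker per cue) is hypothetical, not the study's actual data:

```python
from statistics import mean

def cue_correlation(vot_means, voicing_means):
    """Pearson correlation between per-speaker mean VOT (ms) and mean
    closure-voicing use. A strong negative value would indicate the
    kind of single-axis trade-off the article reports: more
    aspiration-based speakers use less closure voicing."""
    mx, my = mean(vot_means), mean(voicing_means)
    cov = sum((x - mx) * (y - my) for x, y in zip(vot_means, voicing_means))
    sx = sum((x - mx) ** 2 for x in vot_means) ** 0.5
    sy = sum((y - my) ** 2 for y in voicing_means) ** 0.5
    return cov / (sx * sy)
```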
The analysis of breathing and rhythm in speech
Speech rhythm can be described as the temporal patterning by which speech events, such as vocalic onsets, occur. Despite efforts to quantify and model speech rhythm across languages, it remains a scientifically enigmatic aspect of prosody. For instance, one challenge lies in determining how to best quantify and analyse speech rhythm. Techniques range from manual phonetic annotation to the automatic extraction of acoustic features. It is currently unclear how closely these differing approaches correspond to one another. Moreover, the primary means of speech rhythm research has been the analysis of the acoustic signal only. Investigations of speech rhythm may instead benefit from a range of complementary measures, including physiological recordings, such as of respiratory effort. This thesis therefore combines acoustic recording with inductive plethysmography (breath belts) to capture temporal characteristics of speech and speech breathing rhythms. The first part examines the performance of existing phonetic and algorithmic techniques for acoustic prosodic analysis in a new corpus of rhythmically diverse English and Mandarin speech. The second part addresses the need for an automatic speech breathing annotation technique by developing a novel function that is robust to the noisy plethysmography typical of spontaneous, naturalistic speech production. These methods are then applied in the following section to the analysis of English speech and speech breathing in a second, larger corpus. Finally, behavioural experiments were conducted to investigate listeners' perception of speech breathing using a novel gap detection task. The thesis establishes the feasibility, as well as limits, of automatic methods in comparison to manual annotation. In the speech breathing corpus analysis, they help show that speakers maintain a normative, yet contextually adaptive breathing style during speech. 
The perception experiments in turn demonstrate that listeners are sensitive to the violation of these speech breathing norms, even if unconsciously so. The thesis concludes by underscoring breathing as a necessary, yet often overlooked, component in speech rhythm planning and production.
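The idea of a breath annotation routine that tolerates noisy plethysmography can be sketched as smoothing followed by thresholded minimum detection. This toy detector is an assumption about the general approach, not the thesis's actual function:

```python
def detect_inhalations(signal, window=5, min_rise=0.1):
    """Toy breath-onset detector for a respiratory belt trace: smooth
    with a moving average, then keep local minima that are followed by
    a rise of at least min_rise (rejecting small noise dips)."""
    k = window
    smoothed = [sum(signal[max(0, i - k):i + k + 1])
                / len(signal[max(0, i - k):i + k + 1])
                for i in range(len(signal))]
    onsets = []
    for i in range(1, len(smoothed) - 1):
        if smoothed[i - 1] > smoothed[i] <= smoothed[i + 1]:  # local minimum
            ahead = smoothed[i:i + 3 * k]
            if max(ahead) - smoothed[i] >= min_rise:          # real inhalation
                onsets.append(i)
    return onsets
```

The smoothing window and rise threshold are the knobs that trade sensitivity against robustness to noise.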
Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation
We propose a self-supervised representation learning model for the task of
unsupervised phoneme boundary detection. The model is a convolutional neural
network that operates directly on the raw waveform. It is optimized to identify
spectral changes in the signal using the Noise-Contrastive Estimation
principle. At test time, a peak detection algorithm is applied over the model
outputs to produce the final boundaries. As such, the proposed model is trained
in a fully unsupervised manner with no manual annotations in the form of
target boundaries or phonetic transcriptions. We compare the proposed approach
to several unsupervised baselines using both the TIMIT and Buckeye corpora.
Results
suggest that our approach surpasses the baseline models and reaches
state-of-the-art performance on both data sets. Furthermore, we experimented
with expanding the training set with additional examples from the LibriSpeech
corpus. We evaluated the resulting model on distributions and languages that
were not seen during the training phase (English, Hebrew and German) and showed
that utilizing additional untranscribed data is beneficial for model
performance.
Comment: Interspeech 2020 paper
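The peak-detection step applied over the model's frame-level outputs can be sketched as simple local-maximum picking with a threshold; the scores and threshold here are illustrative:

```python
def pick_boundaries(scores, threshold=0.5):
    """Mark frame i as a phoneme boundary when its dissimilarity score
    is a local maximum and exceeds the threshold. The scores would come
    from the contrastive model; here they are just a toy input."""
    return [i for i in range(1, len(scores) - 1)
            if scores[i] > scores[i - 1]
            and scores[i] >= scores[i + 1]
            and scores[i] > threshold]
```

The threshold (or an adaptive variant of it) controls the precision/recall trade-off of the detected boundaries.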