Spectral Characteristics of Schwa in Czech Accented English
The English central mid lax vowel (i.e., schwa) often contributes considerably to the sound differences between native and non-native speech. Many foreign speakers of English fail to reduce certain underlying vowels to schwa, which, on the suprasegmental level of description, affects the perceived rhythm of their speech. However, the problem of capturing quantitatively the differences between native and non-native schwa poses difficulties that, to this day, have been tackled only partially. We offer a technique of measurement in the acoustic domain that has not been probed properly as yet: the distribution of acoustic energy in the vowel spectrum. Our results show that spectral slope features measured in weak vowels discriminate between Czech and British speakers of English quite reliably. Moreover, the measurements of formant bandwidths turned out to be useful for the same task, albeit less directly.
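The spectral-slope measure described above amounts to fitting a straight line to the log-magnitude spectrum of a vowel segment. A minimal NumPy sketch of that idea (the band limits, windowing, and dB-per-kHz units are our assumptions, not the paper's exact settings):

```python
import numpy as np

def spectral_slope(segment: np.ndarray, sr: int, fmin: float = 50.0,
                   fmax: float = 4000.0) -> float:
    """Slope (dB/kHz) of a line fitted to the log-magnitude spectrum
    of a windowed vowel segment, restricted to the band [fmin, fmax]."""
    spectrum = np.abs(np.fft.rfft(segment * np.hanning(len(segment))))
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sr)
    band = (freqs >= fmin) & (freqs <= fmax)
    mags_db = 20.0 * np.log10(spectrum[band] + 1e-12)
    # least-squares line through (frequency in kHz, magnitude in dB)
    slope, _ = np.polyfit(freqs[band] / 1000.0, mags_db, 1)
    return float(slope)
```

A steeper (more negative) slope indicates energy concentrated at low frequencies, as expected for a weakly articulated, reduced vowel.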
Towards dialect-inclusive recognition in a low-resource language: are balanced corpora the answer?
ASR systems are generally built for the spoken 'standard', and their performance declines for non-standard dialects/varieties. This is a problem for a language like Irish, where there is no single spoken standard, but rather three major dialects: Ulster (Ul), Connacht (Co) and Munster (Mu). As a diagnostic to quantify the effect of the speaker's dialect on recognition performance, 12 ASR systems were trained, firstly using baseline dialect-balanced training corpora, and then using modified versions of the baseline corpora, where dialect-specific materials were either subtracted or added. Results indicate that dialect-balanced corpora do not yield similar performance across the dialects: the Ul dialect consistently underperforms, whereas Mu yields the lowest WERs. There is a close relationship between the Co and Mu dialects, but one that is not symmetrical. These results will guide future corpus collection and system-building strategies to optimise for cross-dialect performance equity.
Comment: Accepted to Interspeech 2023, Dublin
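The WER figures used to compare dialects are computed as word-level edit distance normalised by reference length. A minimal reference implementation (not the authors' scoring tool):

```python
import numpy as np

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / #reference
    words, via dynamic-programming edit distance over word sequences."""
    ref, hyp = reference.split(), hypothesis.split()
    d = np.zeros((len(ref) + 1, len(hyp) + 1), dtype=int)
    d[:, 0] = np.arange(len(ref) + 1)   # deleting all reference words
    d[0, :] = np.arange(len(hyp) + 1)   # inserting all hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # substitution / match
    return d[len(ref), len(hyp)] / max(len(ref), 1)
```

For example, `wer("a b c", "a x c")` gives one substitution over three reference words, i.e. 0.33.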
Towards spoken dialect identification of Irish
The Irish language is rich in its diversity of dialects and accents. This compounds the difficulty of creating a speech recognition system for the low-resource language, as such a system must contend with a high degree of variability with limited corpora. A recent study investigating dialect bias in Irish ASR found that balanced training corpora gave rise to unequal dialect performance, with performance for the Ulster dialect being consistently worse than for the Connacht or Munster dialects. Motivated by this, the present experiments investigate spoken dialect identification of Irish, with a view to incorporating such a system into the speech recognition pipeline. Two acoustic classification models are tested, XLS-R and ECAPA-TDNN, in conjunction with a text-based classifier using a pretrained Irish-language BERT model. The ECAPA-TDNN, particularly a model pretrained for language identification on the VoxLingua107 dataset, performed best overall, with an accuracy of 73%. This was further improved to 76% by fusing the model's outputs with those of the text-based model. The Ulster dialect was identified most accurately, at 94%; however, the model struggled to disambiguate between the Connacht and Munster dialects, suggesting that a more nuanced approach may be necessary to robustly distinguish between the dialects of Irish.
Comment: Accepted to the Interspeech 2023 Workshop of the 2nd Annual Meeting of the Special Interest Group of Under-resourced Languages (SIGUL), Dublin
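The abstract does not specify how the acoustic and text classifiers' outputs are fused; a common baseline is a weighted average of the two posterior distributions (late fusion). A hypothetical sketch of that scheme, with function name and weight chosen by us:

```python
import numpy as np

DIALECTS = ["Ulster", "Connacht", "Munster"]

def fuse(acoustic_probs, text_probs, weight: float = 0.5):
    """Late fusion: weighted average of the acoustic and text classifiers'
    posterior distributions. Returns the fused distribution and the
    highest-scoring dialect label."""
    fused = weight * np.asarray(acoustic_probs, dtype=float) \
        + (1.0 - weight) * np.asarray(text_probs, dtype=float)
    fused /= fused.sum()  # renormalise to a proper distribution
    return fused, DIALECTS[int(np.argmax(fused))]
```

In practice the weight would be tuned on a held-out set; equal weighting is just the simplest starting point.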
HMM-based synthesis of creaky voice
Creaky voice, also referred to as vocal fry, is a voice quality frequently produced in many languages, in both read and conversational speech. To enhance naturalness, speech synthesis systems should be able to generate speech in all its expressive diversity, including creaky voice. The present study looks to integrate our recent developments, including creaky voice detection, prediction of creaky voice from context, and rendering of the creaky excitation, into a fully functioning and automatic HMM-based synthesis system. HMM-based synthetic creaky voices are built and evaluated in subjective listening tests, which show that the best synthetic creaky voices are rated as more natural and more creaky than a conventional voice. A non-creaky voice is also successfully transformed to use creak by modifying the F0 contour and excitation of the predicted creaky parts. The transformed voice is rated as equally natural and clearly more creaky than the original voice.
Index Terms: speech synthesis, creaky voice, contextual factors, F0 estimation, excitation modeling
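The transformation of a non-creaky voice involves modifying the F0 contour in the frames predicted as creaky (creak is typically produced at very low F0). The abstract does not give the exact modification rule, so the drop factor below is purely illustrative:

```python
import numpy as np

def apply_creak_f0(f0: np.ndarray, creaky_mask: np.ndarray,
                   drop_factor: float = 0.5) -> np.ndarray:
    """Lower the frame-level F0 contour in frames flagged as creaky.
    The excitation change would be handled separately in the vocoder;
    the 0.5 drop factor is an illustrative assumption, not the paper's value."""
    out = f0.copy()                 # leave the input contour untouched
    out[creaky_mask] *= drop_factor
    return out
```

A real system would also smooth the transitions at creaky-region boundaries to avoid audible F0 jumps.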
Pitch Patterns in Vocal Expression of “Happiness” and “Sadness” in the Reading Aloud of Prose on the Basis of Selected Audiobooks
The primary focus of this paper is to examine the way the emotional categories of “happiness” and “sadness” are expressed vocally in the reading aloud of prose. In particular, the two semantic categories were analysed in terms of pitch level and pitch variability on a corpus based on 28 works written by Charles Dickens. Passages with the intended emotional colouring were selected, and the corresponding fragments were located in the audiobooks. They were then analysed acoustically in terms of the mean F0 and the standard deviation of F0. The results for individual emotional passages were compared with a particular reader’s mean pitch and standard deviation of pitch. The differences obtained in this way supported the initial assumptions that the pitch level and its standard deviation would rise in “happy” extracts but fall in “sad” ones. Nevertheless, not all of these tendencies could be statistically validated, so additional examples taken from a selection of novels by other nineteenth-century writers were added. The statistical analysis of the larger samples confirmed the assumed tendencies but also indicated that the two semantic domains may utilise the acoustic parameters under discussion to varying degrees. While “happiness” tends to be signalled primarily by raising F0, “sadness” is communicated mostly by lowering the variability of F0. Changes in the variability of F0 seem to be of less importance in the former case, and shifts in the F0 level less significant in the latter.
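The comparison described above, of each passage against the reader's overall pitch, can be sketched as simple speaker-relative differences of mean F0 and F0 standard deviation (the function name is ours, not the paper's):

```python
import numpy as np

def speaker_relative_pitch(passage_f0: np.ndarray, baseline_f0: np.ndarray):
    """Difference of a passage's mean F0 and F0 standard deviation from the
    reader's overall baseline. Positive values mean the passage is higher
    pitched / more variable than the reader's norm."""
    return (float(np.mean(passage_f0) - np.mean(baseline_f0)),
            float(np.std(passage_f0) - np.std(baseline_f0)))
```

Under the paper's hypothesis, “happy” passages should yield positive differences on both measures and “sad” passages negative ones.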
Lack of mutations of exon 2 of the MEN1 gene in endocrine and nonendocrine sporadic tumors
Laugh Like You Mean It: Authenticity Modulates Acoustic, Physiological and Perceptual Properties of Laughter
Several authors have recently presented evidence for perceptual and neural distinctions between genuine and acted expressions of emotion. Here, we describe how differences in authenticity affect the acoustic and perceptual properties of laughter. In an acoustic analysis, we contrasted spontaneous, authentic laughter with volitional, fake laughter, finding that spontaneous laughter was higher in pitch, longer in duration, and had different spectral characteristics from volitional laughter produced under full voluntary control. In a behavioral experiment, listeners perceived spontaneous and volitional laughter as distinct in arousal, valence, and authenticity. Multiple regression analyses further revealed that acoustic measures could significantly predict these affective and authenticity judgements, with the notable exception of authenticity ratings for spontaneous laughter. The combination of acoustic predictors differed according to the laughter type, where volitional laughter ratings were uniquely predicted by harmonics-to-noise ratio (HNR). To better understand the role of HNR in terms of the physiological effects on vocal tract configuration as a function of authenticity during laughter production, we ran an additional experiment in which phonetically trained listeners rated each laugh for breathiness, nasality, and mouth opening. Volitional laughter was found to be significantly more nasal than spontaneous laughter, and the item-wise physiological ratings also significantly predicted affective judgements obtained in the first experiment. Our findings suggest that, as an alternative to traditional acoustic measures, ratings of phonatory and articulatory features can be useful descriptors of the acoustic qualities of nonverbal emotional vocalizations, and of their perceptual implications.
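The multiple regressions predicting listener ratings from acoustic measures amount to ordinary least squares with an intercept. A minimal sketch (the predictor set, e.g. pitch, duration, HNR, is illustrative; the paper's actual model specification is not given in the abstract):

```python
import numpy as np

def fit_ratings(acoustic: np.ndarray, ratings: np.ndarray) -> np.ndarray:
    """Ordinary least squares: predict perceptual ratings (one per item) from
    item-wise acoustic measures. Returns [intercept, coef_1, ..., coef_k]."""
    X = np.column_stack([np.ones(len(acoustic)), acoustic])  # add intercept
    coefs, *_ = np.linalg.lstsq(X, ratings, rcond=None)
    return coefs
```

With more predictors than the toy case, one would also inspect fit statistics and per-coefficient significance, as the multiple-regression analyses in the study do.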