32 research outputs found

    Representing Low-Resource Languages and Dialects: Improved Neural Methods for Spoken Language Processing

    Languages are fundamental to human communication and serve as a means to express social and cultural values. However, many people treat languages as homogeneous entities, disregarding the fact that they are often composed of multiple varieties. These language varieties may be tied to certain geographical locations or the cultural identity of the speakers. Studying language variation can thus provide valuable insights into how language varieties relate to their linguistic communities. Most language varieties do not correspond to administrative boundaries, such as provinces or states within nations, and neighboring varieties often transition gradually. In this dissertation, we presented a new method to describe and model linguistic diversity. Specifically, we leveraged deep learning or artificial neural network models to quantify differences between the pronunciations of speakers from different language varieties. This new method assesses the differences between language varieties more accurately and efficiently than previously used methods. Additionally, we investigated the use of these neural network models to develop speech technology to help empower language varieties. We developed an audio-based search algorithm that can automatically identify occurrences of a spoken search term in a large collection of spoken materials, improving access to resources that would normally require manual annotation. Furthermore, we presented approaches to improve speech recognition performance for several language varieties from different language families. This technology could, for example, be used to generate subtitles for videos or television broadcasts. This can be a promising step towards the important goal of developing speech technology that is inclusive of the world’s languages.

    A New Acoustic-Based Pronunciation Distance Measure

    We present an acoustic distance measure for comparing pronunciations, and apply the measure to assess foreign accent strength in American-English by comparing speech of non-native American-English speakers to a collection of native American-English speakers. An acoustic-only measure is valuable as it does not require the time-consuming and error-prone process of phonetically transcribing speech samples, which is necessary for current edit distance-based approaches. We minimize speaker variability in the data set by employing speaker-based cepstral mean and variance normalization, and compute word-based acoustic distances using the dynamic time warping algorithm. Our results indicate a strong correlation of r = −0.71 (p < 0.0001) between the acoustic distances and human judgments of native-likeness provided by more than 1,100 native American-English raters. Therefore, the convenient acoustic measure performs only slightly below the state-of-the-art transcription-based performance of r = −0.77. We also report the results of several small experiments which show that the acoustic measure is not only sensitive to segmental differences, but also to intonational differences and durational differences. However, it is not immune to unwanted differences caused by using a different recording device.
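
    The two processing steps named in this abstract (speaker-based cepstral mean and variance normalization, then word-based dynamic time warping) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the dictionary layout (word mapped to a list of feature frames), the Euclidean frame distance, and the normalization of the DTW cost by the summed sequence lengths are all assumptions made for the example.

```python
import math


def normalize_speaker(words):
    """Speaker-based mean/variance normalization (assumed layout:
    dict mapping each word to a list of feature frames)."""
    pooled = [frame for frames in words.values() for frame in frames]
    n, dims = len(pooled), len(pooled[0])
    means = [sum(f[d] for f in pooled) / n for d in range(dims)]
    # Fall back to 1.0 when a dimension is constant, to avoid dividing by zero.
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in pooled) / n) or 1.0
        for d in range(dims)
    ]
    return {
        word: [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]
        for word, frames in words.items()
    }


def dtw_distance(a, b):
    """Dynamic time warping cost between two frame sequences.
    Dividing by len(a) + len(b) is an assumption of this sketch."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = math.dist(a[i - 1], b[j - 1])  # Euclidean frame distance
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m] / (n + m)


def speaker_distance(speaker_a, speaker_b):
    """Average word-based DTW distance over the words both speakers share."""
    shared = sorted(set(speaker_a) & set(speaker_b))
    if not shared:
        raise ValueError("speakers share no words")
    a, b = normalize_speaker(speaker_a), normalize_speaker(speaker_b)
    return sum(dtw_distance(a[w], b[w]) for w in shared) / len(shared)
```

    In a setting like the one described, the frames would be cepstral feature vectors per word recording, and a non-native speaker's score could be taken as the average of `speaker_distance` against each native speaker; that aggregation is likewise an assumption here.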

    Adapting Monolingual Models: Data can be Scarce when Language Similarity is High

    For many (minority) languages, the resources needed to train large models are not available. We investigate the performance of zero-shot transfer learning with as little data as possible, and the influence of language similarity in this process. We retrain the lexical layers of four BERT-based models using data from two low-resource target language varieties, while the Transformer layers are independently fine-tuned on a POS-tagging task in the model's source language. By combining the new lexical layers and fine-tuned Transformer layers, we achieve high task performance for both target languages. With high language similarity, 10MB of data appears sufficient to achieve substantial monolingual transfer performance. Monolingual BERT-based models generally achieve higher downstream task performance after retraining the lexical layer than multilingual BERT, even when the target language is included in the multilingual model.

    Neural representations for modeling variation in speech

    Variation in speech is often quantified by comparing phonetic transcriptions of the same utterance. However, manually transcribing speech is time-consuming and error-prone. As an alternative, therefore, we investigate the extraction of acoustic embeddings from several self-supervised neural models. We use these representations to compute word-based pronunciation differences between non-native and native speakers of English, and between Norwegian dialect speakers. For comparison with several earlier studies, we evaluate how well these differences match human perception by comparing them with available human judgements of similarity. We show that speech representations extracted from a specific type of neural model (i.e. Transformers) lead to a better match with human perception than two earlier approaches on the basis of phonetic transcriptions and MFCC-based acoustic features. We furthermore find that features from the neural models are generally better extracted from one of the middle hidden layers than from the final layer. We also demonstrate that neural speech representations not only capture segmental differences, but also intonational and durational differences that cannot adequately be represented by a set of discrete symbols used in phonetic transcriptions. Comment: Submitted to Journal of Phonetic