    Native Language Identification on Text and Speech

    This paper presents an ensemble system that combines the output of multiple SVM classifiers for native language identification (NLI). The system was submitted to the fusion track of the NLI Shared Task 2017, which featured student essays and spoken responses, in the form of audio transcriptions and iVectors, by non-native English speakers of eleven native languages. Our system competed under the team name ZCD and was based on an ensemble of SVM classifiers trained on character n-grams, achieving 83.58% accuracy and ranking 3rd in the shared task. Comment: Proceedings of the Workshop on Innovative Use of NLP for Building Educational Applications (BEA

    Machine Assisted Analysis of Vowel Length Contrasts in Wolof

    Growing digital archives and improving algorithms for automatic analysis of text and speech create new opportunities for fundamental research in phonetics. Such empirical approaches allow statistical evaluation of a much larger set of hypotheses about phonetic variation and its conditioning factors (among them geographical and dialectal variants). This paper illustrates this vision and proposes to challenge automatic methods with the analysis of a phenomenon that is not easily observable: vowel length contrast. We focus on Wolof, an under-resourced language from Sub-Saharan Africa. In particular, we propose multiple features for a fine-grained evaluation of the degree of length contrast under different factors, such as read vs. semi-spontaneous speech and standard vs. dialectal Wolof. Our measurements, made fully automatically on more than 20k vowel tokens, show that the proposed features can highlight different degrees of contrast for each vowel considered. We notably show that contrast is weaker in semi-spontaneous speech and in a non-standard semi-spontaneous dialect. Comment: Accepted to Interspeech 201

    Workshop on Advanced Corpus Solutions

    Vowel duration and the voicing effect across dialects of English

    The ‘voicing effect’ – the durational difference in vowels preceding voiced and voiceless consonants – is a well-documented phenomenon in English, where it plays a key role in the production and perception of the English final voicing contrast. Despite this supposed importance, little is known about how robust this effect is in spontaneous connected speech, which is itself subject to a range of linguistic factors. Similarly, little attention has been paid to variability in the voicing effect across dialects of English, beyond analyses of specific varieties. Our findings show that the voicing of the following consonant exhibits a weaker-than-expected effect in spontaneous speech, interacting with manner, vowel height, speech rate, and word frequency. English dialects appear to demonstrate a continuum of potential voicing effect sizes, where varieties with dialect-specific phonological rules exhibit the most extreme values. The results suggest that the voicing effect in English is both substantially weaker in spontaneous connected speech than previously assumed, and subject to a wide range of dialectal variability.

    COMPUTER CORPORA AND THEIR USE IN LANGUAGE ANALYSIS

    Towards dialect-inclusive recognition in a low-resource language: are balanced corpora the answer?

    ASR systems are generally built for the spoken 'standard', and their performance declines for non-standard dialects/varieties. This is a problem for a language like Irish, where there is no single spoken standard, but rather three major dialects: Ulster (Ul), Connacht (Co) and Munster (Mu). As a diagnostic to quantify the effect of the speaker's dialect on recognition performance, 12 ASR systems were trained, first using baseline dialect-balanced training corpora, and then using modified versions of the baseline corpora, where dialect-specific materials were either subtracted or added. Results indicate that dialect-balanced corpora do not yield similar performance across the dialects: the Ul dialect consistently underperforms, whereas Mu yields the lowest WERs. There is a close relationship between the Co and Mu dialects, but one that is not symmetrical. These results will guide future corpus collection and system building strategies to optimise for cross-dialect performance equity. Comment: Accepted to Interspeech 2023, Dubli
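
The abstract above compares dialects by word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: word-level edit distance between reference and hypothesis, divided by the reference length. The function name `word_error_rate` and the placeholder token strings are illustrative assumptions, not taken from the paper, and the authors' actual scoring tool and text normalisation are not stated in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Illustrative sketch only: tokens are split on whitespace, with no
    normalisation of case or punctuation.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical placeholder tokens: one substitution and one deletion
# against a four-word reference.
example_wer = word_error_rate("a b c d", "a x c")
```

A per-dialect comparison like the one described would average this quantity over the test utterances of each dialect; a lower value means better recognition.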