
    Automatic Transcription of Northern Prinmi Oral Art: Approaches and Challenges to Automatic Speech Recognition for Language Documentation

    One significant issue facing language documentation efforts is the transcription bottleneck: each documented recording must be transcribed and annotated, and these tasks are extremely labor intensive (Ćavar et al., 2016). Researchers have sought to accelerate these tasks with partial automation via forced alignment, natural language processing, and automatic speech recognition (ASR) (Neubig et al., 2020). Neural network approaches, especially transformer-based ones, have enabled large advances in ASR over the last decade. Models like XLSR-53 promise improved performance on under-resourced languages by leveraging massive data sets from many different languages (Conneau et al., 2020). This project extends these efforts to a novel context, applying XLSR-53 to Northern Prinmi, a Tibeto-Burman Qiangic language spoken in Southwest China (Daudey & Pincuo, 2020). Specifically, this thesis aims to answer two questions. First, is the XLSR-53 ASR model useful for first-pass transcription of oral art recordings from Northern Prinmi, an under-resourced tonal language? Second, does preprocessing target transcripts to combine grapheme clusters (multi-character representations of lexical tones and characters with modifying diacritics) into more phonologically salient units improve the model's predictions? Results indicate that, with substantial adaptations, XLSR-53 will be useful for this task, and that preprocessing to combine grapheme clusters does improve model performance.
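The grapheme-cluster preprocessing described in this abstract can be illustrated with a minimal sketch. It is not the thesis's actual pipeline: the function name `split_into_units` is hypothetical, and the sketch assumes that tones are written as runs of superscript digits and that diacritics are Unicode combining marks. Each base character is merged with its diacritics, and each run of tone digits becomes a single unit that a model vocabulary could treat as one symbol.

```python
import unicodedata

# Assumption for illustration only: lexical tones are transcribed as runs of
# superscript digits (e.g. "⁵⁵"). The real orthography may differ.
TONE_CHARS = set("⁰¹²³⁴⁵⁶⁷⁸⁹")


def split_into_units(text):
    """Split a transcript into units, merging grapheme clusters.

    A base character plus its combining diacritics becomes one unit, and a
    run of consecutive tone characters becomes one unit.
    """
    # NFD separates precomposed letters into base + combining marks, so that
    # diacritics are handled uniformly regardless of the input encoding.
    text = unicodedata.normalize("NFD", text)
    units = []
    for ch in text:
        if units and unicodedata.combining(ch):
            # Attach a combining diacritic to the preceding unit.
            units[-1] += ch
        elif units and ch in TONE_CHARS and units[-1][-1] in TONE_CHARS:
            # Extend an existing run of tone characters.
            units[-1] += ch
        else:
            units.append(ch)
    return units
```

Calling `split_into_units` on a transcript yields a list of units from which a vocabulary can be built, so that a nasalized vowel or a two-digit tone is predicted as a single target symbol instead of several separate characters.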

    Cross-Lingual and Low-Resource Sentiment Analysis

    Identifying sentiment in a low-resource language is essential for understanding opinions internationally and for responding to the urgent needs of locals affected by disaster incidents in different world regions. While tools and resources for recognizing sentiment in high-resource languages are plentiful, determining the most effective methods for achieving this task in a low-resource language which lacks annotated data is still an open research question. Most existing approaches for cross-lingual sentiment analysis to date have relied on high-resource machine translation systems, large amounts of parallel data, or resources only available for Indo-European languages. This work presents methods, resources, and strategies for identifying sentiment cross-lingually in a low-resource language. We introduce a cross-lingual sentiment model which can be trained on a high-resource language and applied directly to a low-resource language. The model offers the feature of lexicalizing the training data using a bilingual dictionary, but can perform well without any translation into the target language. Through an extensive experimental analysis, evaluated on 17 target languages, we show that the model performs well with bilingual word vectors pre-trained on an appropriate translation corpus. We compare in-genre and in-domain parallel corpora, out-of-domain parallel corpora, in-domain comparable corpora, and monolingual corpora, and show that a relatively small, in-domain parallel corpus works best as a transfer medium if it is available. We describe the conditions under which other resources and embedding generation methods are successful, and these include our strategies for leveraging in-domain comparable corpora for cross-lingual sentiment analysis. 
    To enhance the ability of the cross-lingual model to identify sentiment in the target language, we present new feature representations for sentiment analysis that are incorporated in the cross-lingual model: bilingual sentiment embeddings that are used to create bilingual sentiment scores, and a method for updating the sentiment embeddings during training by lexicalization of the target language. This feature configuration works best for the largest number of target languages in both untargeted and targeted cross-lingual sentiment experiments. The cross-lingual model is studied further by evaluating the role of the source language, which has traditionally been assumed to be English. We build cross-lingual models using 15 source languages, including two non-European and non-Indo-European source languages: Arabic and Chinese. We show that language families play an important role in the performance of the model, as does the morphological complexity of the source language. In the last part of the work, we focus on sentiment analysis towards targets. We study Arabic as a representative morphologically complex language and develop models and morphological representation features for identifying entity targets and sentiment expressed towards them in Arabic open-domain text. Finally, we adapt our cross-lingual sentiment models for the detection of sentiment towards targets. Through cross-lingual experiments on Arabic and English, we demonstrate that our findings regarding resources, features, and language also hold true for the transfer of targeted sentiment.

    Metaphors in spoken academic discourse in German and English

    Metaphors have been increasingly associated with cognitive functions, which means that metaphors structure how we think and express ourselves. Metaphors are embodied in our basic physical experience, which is one reason why certain abstract concepts are expressed in more concrete terms, such as visible entities, journeys and other types of movement, spaces, etc. This communicative relevance also applies to specialised, institutionalised settings and genres, such as those produced in or related to higher education institutions, among which is spoken academic discourse. A significant research gap exists regarding spoken academic discourse and the metaphors used in it, even though rising student numbers in higher education and international research and cooperation, e.g. in the form of invited lectures, make spoken academic discourse nearly omnipresent. In this context, research talks are a key research genre. A mixed-methods study was conducted to investigate metaphors in a corpus of eight fully transcribed conference talks and invited lectures by German and English L1 speakers, totalling 440 minutes. A wide range of categories and functions were identified in the corpus. Abstract research concepts, such as results or theories, are expressed in terms of concrete visual entities that can be seen or shown, but also in terms of journeys or other forms of movement. The functions of these metaphors are simplification, rhetorical emphasis, theory construction, or pedagogic illustration. For both the speaker and the audience or discussants, anthropomorphism makes abstract and complex ideas concretely imaginable and at the same time more interesting, because the contents of the talk appear livelier and hence closer to their own experience, which secures the audience's attention. These metaphor categories are present in both the English and the German subcorpus of this study, with similar functions.