116 research outputs found

    Multi-reference WER for evaluating ASR for languages with no orthographic rule

    Get PDF

    Multi-dialect Arabic broadcast speech recognition

    Get PDF
    Dialectal Arabic speech research suffers from the lack of labelled resources and standardised orthography. There are three main challenges in dialectal Arabic speech recognition: (i) finding labelled dialectal Arabic speech data, (ii) training robust dialectal speech recognition models from limited labelled data and (iii) evaluating speech recognition for dialects with no orthographic rules. This thesis is concerned with the following three contributions: Arabic Dialect Identification: We are mainly dealing with Arabic speech without prior knowledge of the spoken dialect. Arabic dialects could be sufficiently diverse to the extent that one can argue that they are different languages rather than dialects of the same language. We have two contributions: First, we use crowdsourcing to annotate a multi-dialectal speech corpus collected from Al Jazeera TV channel. We obtained utterance level dialect labels for 57 hours of high-quality consisting of four major varieties of dialectal Arabic (DA), comprised of Egyptian, Levantine, Gulf or Arabic peninsula, North African or Moroccan from almost 1,000 hours. Second, we build an Arabic dialect identification (ADI) system. We explored two main groups of features, namely acoustic features and linguistic features. For the linguistic features, we look at a wide range of features, addressing words, characters and phonemes. With respect to acoustic features, we look at raw features such as mel-frequency cepstral coefficients combined with shifted delta cepstra (MFCC-SDC), bottleneck features and the i-vector as a latent variable. We studied both generative and discriminative classifiers, in addition to deep learning approaches, namely deep neural network (DNN) and convolutional neural network (CNN). In our work, we propose Arabic as a five class dialect challenge comprising of the previously mentioned four dialects as well as modern standard Arabic. Arabic Speech Recognition: We introduce our effort in building Arabic automatic speech recognition (ASR) and we create an open research community to advance it. This section has two main goals: First, creating a framework for Arabic ASR that is publicly available for research. We address our effort in building two multi-genre broadcast (MGB) challenges. MGB-2 focuses on broadcast news using more than 1,200 hours of speech and 130M words of text collected from the broadcast domain. MGB-3, however, focuses on dialectal multi-genre data with limited non-orthographic speech collected from YouTube, with special attention paid to transfer learning. Second, building a robust Arabic ASR system and reporting a competitive word error rate (WER) to use it as a potential benchmark to advance the state of the art in Arabic ASR. Our overall system is a combination of five acoustic models (AM): unidirectional long short term memory (LSTM), bidirectional LSTM (BLSTM), time delay neural network (TDNN), TDNN layers along with LSTM layers (TDNN-LSTM) and finally TDNN layers followed by BLSTM layers (TDNN-BLSTM). The AM is trained using purely sequence trained neural networks lattice-free maximum mutual information (LFMMI). The generated lattices are rescored using a four-gram language model (LM) and a recurrent neural network with maximum entropy (RNNME) LM. Our official WER is 13%, which has the lowest WER reported on this task. Evaluation: The third part of the thesis addresses our effort in evaluating dialectal speech with no orthographic rules. Our methods learn from multiple transcribers and align the speech hypothesis to overcome the non-orthographic aspects. Our multi-reference WER (MR-WER) approach is similar to the BLEU score used in machine translation (MT). We have also automated this process by learning different spelling variants from Twitter data. We mine automatically from a huge collection of tweets in an unsupervised fashion to build more than 11M n-to-m lexical pairs, and we propose a new evaluation metric: dialectal WER (WERd). Finally, we tried to estimate the word error rate (e-WER) with no reference transcription using decoding and language features. We show that our word error rate estimation is robust for many scenarios with and without the decoding features

    Computational Sociolinguistics: A Survey

    Get PDF
    Language is a social phenomenon and variation is inherent to its social nature. Recently, there has been a surge of interest within the computational linguistics (CL) community in the social dimension of language. In this article we present a survey of the emerging field of "Computational Sociolinguistics" that reflects this increased interest. We aim to provide a comprehensive overview of CL research on sociolinguistic themes, featuring topics such as the relation between language and social identity, language use in social interaction and multilingual communication. Moreover, we demonstrate the potential for synergy between the research communities involved, by showing how the large-scale data-driven methods that are widely used in CL can complement existing sociolinguistic studies, and how sociolinguistics can inform and challenge the methods and assumptions employed in CL studies. We hope to convey the possible benefits of a closer collaboration between the two communities and conclude with a discussion of open challenges.Comment: To appear in Computational Linguistics. Accepted for publication: 18th February, 201

    Arabic Dialect Texts Classification

    Get PDF
    This study investigates how to classify Arabic dialects in text by extracting features which show the differences between dialects. There has been a lack of research about classification of Arabic dialect texts, in comparison to English and some other languages, due to the lack of Arabic dialect text corpora in comparison with what is available for dialects of English and some other languages. What is more, there is an increasing use of Arabic dialects in social media, so this text is now considered quite appropriate as a medium of communication and as a source of a corpus. We collected tweets from Twitter, comments from Facebook and online newspapers from five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. The research sought to: 1) create a dataset of Arabic dialect texts to use in training and testing the system of classification, 2) find appropriate features to classify Arabic dialects: lexical (word and multi-word-unit) and grammatical variation across dialects, 3) build a more sophisticated filter to extract features from Arabic-character written dialect text files. In this thesis, the first part describes the research motivation to show the reason for choosing the Arabic dialects as a research topic. The second part presents some background information about the Arabic language and its dialects, and the literature review shows previous research about this subject. The research methodology part shows the initial experiment to classify Arabic dialects. The results of this experiment showed the need to create an Arabic dialect text corpus, by exploring Twitter and online newspaper. The corpus used to train the ensemble classifier and to improve the accuracy of classification the corpus was extended by collecting tweets from Twitter based on the spatial coordinate points and comments from Facebook posts. The corpus was annotated with dialect labels and used in automatic dialect classification experiments. The last part of this thesis presents the results of classification, conclusions and future work

    Final FLaReNet deliverable: Language Resources for the Future - The Future of Language Resources

    Get PDF
    Language Technologies (LT), together with their backbone, Language Resources (LR), provide an essential support to the challenge of Multilingualism and ICT of the future. The main task of language technologies is to bridge language barriers and to help creating a new environment where information flows smoothly across frontiers and languages, no matter the country, and the language, of origin. To achieve this goal, all players involved need to act as a community able to join forces on a set of shared priorities. However, until now the field of Language Resources and Technology has long suffered from an excess of individuality and fragmentation, with a lack of coherence concerning the priorities for the field, the direction to move, not to mention a common timeframe. The context encountered by the FLaReNet project was thus represented by an active field needing a coherence that can only be given by sharing common priorities and endeavours. FLaReNet has contributed to the creation of this coherence by gathering a wide community of experts and making them participate in the definition of an exhaustive set of recommendations

    AIUCD2018 - Book of Abstracts

    Get PDF
    Questo volume raccoglie gli abstract dei paper presentati al Settimo Convegno Annuale AIUCD 2018 (Bari, 31 gennaio – 2 febbraio 2018) dal titolo "Patrimoni culturali nell’era digitale. Memorie, culture umanistiche e tecnologia" (Cultural Heritage in the Digital Age. Memory, Humanities and Technologies). Gli abstract pubblicati in questo volume hanno ottenuto il parere favorevole da parte di valutatori esperti della materia, attraverso un processo di revisione anonima mediante double-blind peer review sotto la responsabilità del Comitato Scientifico di AIUCD. Il programma della conferenza AIUCD 2018 è disponibile online all'indirizzo http://www.aiucd2018.uniba.it/

    AIUCD2018 - Book of Abstracts

    Get PDF
    Questo volume raccoglie gli abstract dei paper presentati al Settimo Convegno Annuale AIUCD 2018 (Bari, 31 gennaio – 2 febbraio 2018) dal titolo "Patrimoni culturali nell’era digitale. Memorie, culture umanistiche e tecnologia" (Cultural Heritage in the Digital Age. Memory, Humanities and Technologies). Gli abstract pubblicati in questo volume hanno ottenuto il parere favorevole da parte di valutatori esperti della materia, attraverso un processo di revisione anonima mediante double-blind peer review sotto la responsabilità del Comitato Scientifico di AIUCD. Il programma della conferenza AIUCD 2018 è disponibile online all'indirizzo http://www.aiucd2018.uniba.it/
    • …
    corecore