535 research outputs found

    Automating the creation of speech recognition systems for under-resourced languages

    © 2015 IEEE. More than 7100 languages are spoken in the world, and a significant part of these languages suffer from the absence of speech services; their speakers therefore cannot use such services in their native languages and have to learn and use other languages in order to interact with modern information technologies. This paper describes an approach to automating the creation of speech recognition systems for under-resourced languages. The aim is to simplify and speed up this process by providing the necessary tools and organizing the process of system development and testing. The results of building phoneme and speech recognition systems for the Tatar language (the 3rd most spoken language in Russia) demonstrate the feasibility of using the proposed platform for under-resourced languages.

    Final FLaReNet deliverable: Language Resources for the Future - The Future of Language Resources

    Language Technologies (LT), together with their backbone, Language Resources (LR), provide essential support to the challenge of Multilingualism and ICT of the future. The main task of language technologies is to bridge language barriers and to help create a new environment where information flows smoothly across frontiers and languages, no matter the country, and the language, of origin. To achieve this goal, all players involved need to act as a community able to join forces on a set of shared priorities. However, the field of Language Resources and Technology has long suffered from an excess of individuality and fragmentation, with a lack of coherence concerning the priorities for the field and the direction in which to move, not to mention a common timeframe. The context encountered by the FLaReNet project was thus that of an active field needing a coherence that can only be given by sharing common priorities and endeavours. FLaReNet has contributed to the creation of this coherence by gathering a wide community of experts and involving them in the definition of an exhaustive set of recommendations.

    KenSwQuAD -- A Question Answering Dataset for Swahili Low Resource Language

    The need for Question Answering datasets in low-resource languages is the motivation of this research, leading to the development of the Kencorpus Swahili Question Answering Dataset, KenSwQuAD. This dataset is annotated from raw story texts in Swahili, a low-resource language that is predominantly spoken in Eastern Africa and in other parts of the world. Question Answering (QA) datasets are important for machine comprehension of natural language in tasks such as internet search and dialog systems. Machine learning systems need training data such as the gold standard Question Answering set developed in this research. The research engaged annotators to formulate QA pairs from Swahili texts collected by the Kencorpus project, a Kenyan languages corpus. The project annotated 1,445 of the total 2,585 texts with at least 5 QA pairs each, resulting in a final dataset of 7,526 QA pairs. A quality assurance set of 12.5% of the annotated texts confirmed that the QA pairs were all correctly annotated. A proof of concept applying the set to the QA task confirmed that the dataset is usable for such tasks. KenSwQuAD has also contributed to the resourcing of the Swahili language. Comment: 17 pages, 1 figure, 10 tables

    Speech recognition, machine translation, and corpus analysis for identifying farmer demands and targeting digital extension

    The increasing capabilities of Artificial Intelligence-augmented data analytics present significant opportunities for agricultural extension organizations operating in the Global South. In this project, we supported Farm Radio International (FRI) in investigating the possibility of automating the translation and analysis of farmers' voice message data. This report reviews several approaches to overcoming technical constraints and then presents an approach that uses recent innovations in unsupervised learning to deliver highly accurate speech recognition and machine translation in a diverse set of languages.

    Punctuation Prediction for Norwegian: Using Established Approaches for Under-Resourced Languages

    Master's thesis in Information Science (INFO390, MASV-INF)

    Integration of Phonotactic Features for Language Identification on Code-Switched Speech

    Abstract: In this paper, phoneme sequences are used as language information to perform code-switched language identification (LID). With a one-pass recognition system, the spoken sounds are converted into phonetically arranged sequences of sounds. The acoustic models are robust enough to handle multiple languages when emulating multiple hidden Markov models (HMMs). To determine the phoneme similarity among our target languages, we report two methods of phoneme mapping. Statistical phoneme-based bigram language models (LMs) are integrated into speech decoding to eliminate possible phone mismatches. A supervised support vector machine (SVM) is used to learn to recognize the phonetic information of mixed-language speech based on recognized phone sequences. As the back-end decision is taken by an SVM, the likelihood scores of segments with monolingual phone occurrence are used to classify language identity. The system was tested on a speech corpus of Sepedi and English, two languages that are often mixed. Our system is evaluated by measuring the ASR performance and the LID performance separately. The systems obtained a promising ASR accuracy with a data-driven phone merging approach modelled using 16 Gaussian mixtures per state. In code-switched speech and monolingual speech segments respectively, the proposed systems achieved an acceptable ASR and LID accuracy.
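
    The phoneme-based bigram language models mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' system: the toy phone sequences, the add-one smoothing, and the function names (`train_bigram_lm`, `log_likelihood`, `identify_language`) are all assumptions made for illustration, and the SVM back end and HMM decoding described in the abstract are left out. The sketch only shows how per-language bigram statistics over phone sequences can yield likelihood scores that separate two languages.

```python
import math
from collections import Counter

def train_bigram_lm(sequences):
    """Count phone bigrams and their left contexts, with boundary markers."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab

def log_likelihood(seq, model, vocab_size):
    """Log-probability of a phone sequence under an add-one smoothed bigram LM."""
    bigrams, contexts, _ = model
    padded = ["<s>"] + list(seq) + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        # Add-one smoothing (an assumption) keeps unseen bigrams non-zero.
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
    return total

def identify_language(seq, models):
    """Pick the language whose bigram LM scores the sequence highest."""
    shared_vocab = set().union(*(m[2] for m in models.values()))
    return max(models, key=lambda lang: log_likelihood(seq, models[lang], len(shared_vocab)))

# Hypothetical toy phone sequences, not taken from the paper's corpus.
models = {
    "sepedi": train_bigram_lm([["d", "u", "m", "e", "l", "a"], ["l", "e", "k", "a", "e"]]),
    "english": train_bigram_lm([["h", "e", "l", "o"], ["th", "i", "s"]]),
}

print(identify_language(["d", "u", "m", "e"], models))
print(identify_language(["h", "e", "l", "o"], models))
```

    In a fuller pipeline, per-segment scores like these could be passed as features to a classifier such as an SVM rather than compared directly.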

    The European Language Resources and Technologies Forum: Shaping the Future of the Multilingual Digital Europe

    Proceedings of the 1st FLaReNet Forum on the European Language Resources and Technologies, held in Vienna, at the Austrian Academy of Science, on 12-13 February 2009