121 research outputs found

    AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages

    Africa is home to over 2000 languages from over six language families and has the highest linguistic diversity among all continents. This includes 75 languages with at least one million speakers each. Yet, there is little NLP research conducted on African languages. Crucial in enabling such research is the availability of high-quality annotated datasets. In this paper, we introduce AfriSenti, which consists of 14 sentiment datasets of 110,000+ tweets in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda, Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili, Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families, annotated by native speakers. The data is used in SemEval 2023 Task 12, the first Afro-centric SemEval shared task. We describe the data collection methodology, annotation process, and related challenges when curating each of the datasets. We conduct experiments with different sentiment classification baselines and discuss their usefulness. We hope AfriSenti enables new work on under-represented languages. The dataset is available at https://github.com/afrisenti-semeval/afrisent-semeval-2023 and can also be loaded as a Hugging Face dataset (https://huggingface.co/datasets/shmuhammad/AfriSenti). Comment: 15 pages, 6 figures, 9 tables.

    Stimulated training for automatic speech recognition and keyword search in limited resource conditions

    © 2017 IEEE. Training neural network acoustic models on limited quantities of data is a challenging task. A number of techniques have been proposed to improve generalisation. This paper investigates one such technique, called stimulated training. It enables standard criteria such as cross-entropy to enforce spatial constraints on activations originating from different units. Having different regions active depending on the input unit may help the network discriminate better and, as a consequence, yield lower error rates. This paper investigates stimulated training for automatic speech recognition of a number of languages representing different families, alphabets, phone sets and vocabulary sizes. In particular, it looks at ensembles of stimulated networks to ensure that improved generalisation will withstand system combination effects. In order to assess stimulated training beyond 1-best transcription accuracy, this paper looks at keyword search as a proxy for assessing the quality of lattices. Experiments are conducted on IARPA Babel program languages, including the surprise language of the OpenKWS 2016 competition.

    Low-resource speech recognition and keyword-spotting

    © Springer International Publishing AG 2017. The IARPA Babel program ran from March 2012 to November 2016. The aim of the program was to develop agile and robust speech technology that can be rapidly applied to any human language in order to provide effective search capability on large quantities of real-world data. This paper describes some of the developments in speech recognition and keyword spotting during the lifetime of the project. Two technical areas are briefly discussed, with a focus on techniques developed at Cambridge University: the application of deep learning to low-resource speech recognition, and efficient approaches for keyword spotting. Finally, a brief analysis of the Babel language characteristics and per-language performance is presented.

    Advances in Image Processing, Analysis and Recognition Technology

    For many decades, researchers have been trying to make computer analysis of images as effective as human vision. For this purpose, many algorithms and systems have been created. The whole process covers various stages, including image processing, representation and recognition. The results of this work can be applied to many computer-assisted areas of everyday life. They improve particular activities and provide handy tools, which are sometimes only for entertainment but quite often significantly increase our safety. In fact, the range of practical applications of image processing algorithms is particularly wide. Moreover, the rapid growth of computational complexity and computer efficiency has allowed for the development of more sophisticated and effective algorithms and tools. Although significant progress has been made so far, many issues remain, resulting in the need for novel approaches.

    Book reports

    Advancing Multilingual Handwritten Numeral Recognition with Attention-driven Transfer Learning

    As deep learning continues to evolve, we have observed huge breakthroughs in medical imaging, video and frame generation, optical character recognition (OCR), and other domains. In data analysis and document processing, the recognition of handwritten numerals plays a crucial role. This work has led to remarkable changes in OCR, historical handwritten document analysis, and postal automation. In this study, we present a novel framework that goes beyond digit recognition in only one language. Unlike common methods that focus on a limited set of languages, our method provides a comprehensive solution for recognition of handwritten digit images in 12 different languages. These languages were chosen because most of them have fairly distant representations in latent space. We utilize transfer learning, as it reduces the computational cost while maintaining the quality of enhanced images and the models' recognition accuracy. Another strength of our approach is an innovative attention-based module called the MRA module. Our experiments confirm that applying this module yields major progress in both image quality and the accuracy of handwritten digit recognition. Notably, we achieved high precision, with improvements of nearly 2% in specific languages compared to earlier techniques. In this work, we present a robust and cost-effective approach that handles multilingual handwritten numeral recognition across a wide range of languages. The code and further implementation details are available at https://github.com/CVLab-SHUT/HandWrittenDigitRecognition

    Handbook of Stemmatology

    Stemmatology studies aspects of textual criticism that use genealogical methods. This handbook is the first to cover the entire field, encompassing both theoretical and practical aspects and ranging from traditional to digital methods. Authors from all the disciplines involved examine topics such as the material aspects of text traditions, methods of traditional textual criticism and their genesis, and modern digital approaches used in the field.