AfriSenti: A Twitter Sentiment Analysis Benchmark for African Languages
Africa is home to over 2000 languages from over six language families and has
the highest linguistic diversity among all continents. This includes 75
languages with at least one million speakers each. Yet, there is little NLP
research conducted on African languages. Crucial in enabling such research is
the availability of high-quality annotated datasets. In this paper, we
introduce AfriSenti, which consists of 14 sentiment datasets of 110,000+ tweets
in 14 African languages (Amharic, Algerian Arabic, Hausa, Igbo, Kinyarwanda,
Moroccan Arabic, Mozambican Portuguese, Nigerian Pidgin, Oromo, Swahili,
Tigrinya, Twi, Xitsonga, and Yorùbá) from four language families annotated
by native speakers. The data is used in SemEval 2023 Task 12, the first
Afro-centric SemEval shared task. We describe the data collection methodology,
annotation process, and related challenges when curating each of the datasets.
We conduct experiments with different sentiment classification baselines and
discuss their usefulness. We hope AfriSenti enables new work on
under-represented languages. The dataset is available at
https://github.com/afrisenti-semeval/afrisent-semeval-2023 and can also be
loaded as a Hugging Face dataset
(https://huggingface.co/datasets/shmuhammad/AfriSenti). Comment: 15 pages, 6 figures, 9 tables
Stimulated training for automatic speech recognition and keyword search in limited resource conditions
© 2017 IEEE. Training neural network acoustic models on limited quantities of data is a challenging task. A number of techniques have been proposed to improve generalisation. This paper investigates one such technique called stimulated training. It enables standard criteria such as cross-entropy to enforce spatial constraints on activations originating from different units. Having different regions active depending on the input unit may help the network discriminate better and, as a consequence, yield lower error rates. This paper investigates stimulated training for automatic speech recognition of a number of languages representing different families, alphabets, phone sets and vocabulary sizes. In particular, it looks at ensembles of stimulated networks to ensure that improved generalisation will withstand system combination effects. In order to assess stimulated training beyond 1-best transcription accuracy, this paper looks at keyword search as a proxy for assessing the quality of lattices. Experiments are conducted on IARPA Babel program languages, including the surprise language of the OpenKWS 2016 competition.
Low-resource speech recognition and keyword-spotting
© Springer International Publishing AG 2017. The IARPA Babel program ran from March 2012 to November 2016. The aim of the program was to develop agile and robust speech technology that can be rapidly applied to any human language in order to provide effective search capability on large quantities of real-world data. This paper describes some of the developments in speech recognition and keyword spotting during the lifetime of the project. Two technical areas are briefly discussed, with a focus on techniques developed at Cambridge University: the application of deep learning for low-resource speech recognition, and efficient approaches for keyword spotting. Finally, a brief analysis of the Babel speech language characteristics and language performance is presented.
Advances in Image Processing, Analysis and Recognition Technology
For many decades, researchers have been trying to make computers’ analysis of images as effective as the human visual system. For this purpose, many algorithms and systems have been created. The whole process covers various stages, including image processing, representation and recognition. The results of this work can be applied to many computer-assisted areas of everyday life. They improve particular activities and provide handy tools, which are sometimes only for entertainment but quite often significantly increase our safety. In fact, the range of practical applications of image processing algorithms is particularly wide. Moreover, the rapid growth of computational complexity and computer efficiency has allowed for the development of more sophisticated and effective algorithms and tools. Although significant progress has been made so far, many issues remain, resulting in the need for the development of novel approaches.
Advancing Multilingual Handwritten Numeral Recognition with Attention-driven Transfer Learning
As deep learning continues to evolve, we have observed huge breakthroughs in the fields of medical imaging, video and frame generation, optical character recognition (OCR), and other domains. In the field of data analysis and document processing, the recognition of handwritten numerals plays a crucial role. This work has led to remarkable changes in OCR, historical handwritten document analysis, and postal automation. In this study, we present a novel framework to overcome this challenge, going beyond digit recognition in only one language. Unlike common methods that focus on a limited set of languages, our method provides a comprehensive solution for the recognition of handwritten digit images in 12 different languages. These specific languages are chosen because most of them have fairly distant representations in latent space. We utilize transfer learning, as it reduces the computational cost and maintains the quality of enhanced images and the models’ recognition accuracy. Another strength of our approach is the innovative attention-based module called the MRA module. Our experiments confirm that by applying this module, major progress is made in both image quality and the accuracy of handwritten digit recognition. Notably, we reached high precision, with improvements of nearly 2% in specific languages compared to earlier techniques. In this work, we present a robust and cost-effective approach that handles multilingual handwritten numeral recognition across a wide range of languages. The code and further implementation details are available at https://github.com/CVLab-SHUT/HandWrittenDigitRecognition
Handbook of Stemmatology
Stemmatology studies aspects of textual criticism that use genealogical methods. This handbook is the first to cover the entire field, encompassing both theoretical and practical aspects, ranging from traditional to digital methods. Authors from all the disciplines involved examine topics such as the material aspects of text traditions, methods of traditional textual criticism and their genesis, and modern digital approaches used in the field.