Initial Normalization of User Generated Content: Case Study in a Multilingual Setting
We address the problem of normalizing user generated content in a multilingual setting. Specifically, we target comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh or Russian, or in a mixture of both. Moreover, such comments are noisy, i.e., difficult to process due to (mostly) intentional breaches of spelling conventions, which aggravates the data sparseness problem. Therefore, we propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, showing that in both cases normalization improves overall accuracy.
Graphemic Normalization of the Perso-Arabic Script
Since its original appearance in 1991, the Perso-Arabic script representation
in Unicode has grown from 169 to over 440 atomic isolated characters spread
over several code pages representing standard letters, various diacritics and
punctuation for the original Arabic and numerous other regional orthographic
traditions. This paper documents the challenges that Perso-Arabic presents
beyond the best-documented languages, such as Arabic and Persian, building on
earlier work by the expert community. We particularly focus on the situation in
natural language processing (NLP), which is affected by multiple, often
neglected, issues such as the use of visually ambiguous yet canonically
nonequivalent letters and the mixing of letters from different orthographies.
Among the contributing conflating factors are the lack of input methods, the
instability of modern orthographies, insufficient literacy, and loss or lack of
orthographic tradition. We evaluate the effects of script normalization on
eight languages from diverse language families in the Perso-Arabic script
diaspora on machine translation and statistical language modeling tasks. Our
results indicate statistically significant improvements in performance in most
conditions for all the languages considered when normalization is applied. We
argue that better understanding and representation of Perso-Arabic script
variation within regional orthographic traditions, where those are present, is
crucial for further progress of modern computational NLP techniques especially
for languages with a paucity of resources.
Comment: Pre-print to appear in the Proceedings of Grapholinguistics in the
21st Century (G21C), 2022. Telecom Paris, Palaiseau, France, June 8-10, 2022.
41 pages, 38 tables, 3 figures.
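The point about visually ambiguous yet canonically nonequivalent letters can be illustrated with a minimal Python sketch. It is not the paper's method: the mapping table `VISUAL_MAP` and the function `normalize_visual` are hypothetical names introduced here, and the mapping direction (toward Persian-style letters) is an assumption that would differ per target orthography. The sketch only shows that standard Unicode normalization leaves such letter pairs distinct, so an extra language-aware step is needed to merge them.

```python
import unicodedata

# Hypothetical, deliberately tiny mapping for illustration only. These pairs
# can render alike in some word positions but are distinct code points.
VISUAL_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
}

def normalize_visual(text: str) -> str:
    """Apply NFC, then the illustrative letter mapping above."""
    text = unicodedata.normalize("NFC", text)
    return "".join(VISUAL_MAP.get(ch, ch) for ch in text)

arabic_yeh, farsi_yeh = "\u064A", "\u06CC"

# Neither NFC nor NFKC unifies the two letters.
print(unicodedata.normalize("NFC", arabic_yeh) == farsi_yeh)   # False
print(unicodedata.normalize("NFKC", arabic_yeh) == farsi_yeh)  # False

# The extra mapping step does.
print(normalize_visual(arabic_yeh) == farsi_yeh)               # True
```

A table-driven step like this is only safe when the target language is known, because the same code point can be the correct letter in one orthography and a substitution error in another.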
Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography
Texts written in Old Literary Finnish represent the first literary works ever
written in Finnish, starting from the 16th century. There have been several
projects in Finland that have digitized old publications and made them
available for research use. However, using modern NLP methods on such data
poses great challenges. In this paper we propose an approach for simultaneously
normalizing and lemmatizing Old Literary Finnish into modern spelling. Our best
model reaches 96.3% accuracy in texts written by Agricola and 87.7%
accuracy in other contemporary out-of-domain text. Our method has been made
freely available on Zenodo and GitHub.
Comment: la 28e Conférence sur le Traitement Automatique des Langues
Naturelles (TALN)
Beyond Arabic: Software for Perso-Arabic Script Manipulation
This paper presents an open-source software library that provides a set of
finite-state transducer (FST) components and corresponding utilities for
manipulating the writing systems of languages that use the Perso-Arabic script.
The operations include various levels of script normalization, including visual
invariance-preserving operations that subsume and go beyond the standard
Unicode normalization forms, as well as transformations that modify the visual
appearance of characters in accordance with the regional orthographies for
eleven contemporary languages from diverse language families. The library also
provides simple FST-based romanization and transliteration. We additionally
attempt to formalize the typology of Perso-Arabic characters by providing
one-to-many mappings from Unicode code points to the languages that use them.
While our work focuses on the Arabic script diaspora rather than Arabic itself,
this approach could be adopted for any language that uses the Arabic script,
thus providing a unified framework for treating a script family used by close
to a billion people.
Comment: Preprint to appear in the Proceedings of the 7th Arabic Natural
Language Processing Workshop (WANLP 2022) at EMNLP, Abu Dhabi, United Arab
Emirates, December 7-11, 2022. 7 pages.
Social Media Text Classification by Enhancing Well-Formed Text Trained Model
Social media are a powerful communication tool in our era of digital information. The large amount of user-generated data is a useful novel source of data, even though it is not easy to extract the treasures from this vast and noisy trove. Since classification is an important part of text mining, many techniques have been proposed to classify this kind of information. We developed an effective technique of social media text classification by semi-supervised learning utilizing an online news source consisting of well-formed text. The system first automatically extracts news categories, well-categorized by publishers, as classes for topic classification. A bag of words taken from news articles provides the initial keywords related to their category in the form of word vectors. The principal task is to retrieve a set of new productive keywords. Term Frequency-Inverse Document Frequency (TF-IDF) weighting and the Word Article Matrix (WAM) are used as the main methods. A modification of WAM is recomputed until it becomes the most effective model for social media text classification. The key success factor was enhancing our model with effective keywords from social media. A promising result of 99.50% accuracy was achieved, with Precision, Recall, and F-measure all above 98.5%, after updating the model three times.
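The TF-IDF weighting mentioned in this abstract can be sketched as follows. This is a minimal illustration of one common variant (relative term frequency times an unsmoothed logarithmic inverse document frequency), not necessarily the formula the authors used. The function name `tf_idf` and the toy documents are assumptions made for the example, and the Word Article Matrix step is not shown.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    weight = (count of term in doc / doc length) * log(N / document frequency)
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights

# Toy "news articles", already tokenized (illustrative data only).
docs = [
    ["goal", "match", "team"],
    ["election", "vote", "team"],
]
for weights in tf_idf(docs):
    print(weights)
```

Under this variant a term that occurs in every document, such as "team" here, gets weight zero, while terms specific to one category keep a positive weight. This is why such weights are often used to pick candidate keywords per class.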
Natural language processing for similar languages, varieties, and dialects: A survey
There has been a lot of recent interest in the natural language processing (NLP) community in the computational processing of language varieties and dialects, with the aim of improving the performance of applications such as machine translation, speech recognition, and dialogue systems. Here, we attempt to survey this growing field of research, with a focus on computational methods for processing similar languages, varieties, and dialects. In particular, we discuss the most important challenges when dealing with diatopic language variation, and we present some of the available datasets, the process of data collection, and the most common data collection strategies used to compile datasets for similar languages, varieties, and dialects. We further present a number of studies on computational methods developed and/or adapted for preprocessing, normalization, part-of-speech tagging, and parsing similar languages, language varieties, and dialects. Finally, we discuss relevant applications such as language and dialect identification and machine translation for closely related languages, language varieties, and dialects.
Non peer reviewed