3 research outputs found

Initial Normalization of User Generated Content: Case Study in a Multilingual Setting

Authors: Makazhanov Aibek, Myrzakhmetov Bagdat, Yessenbayev Zhandos
Publication venue: The IEEE 12th International Conference Application of Information and Communication Technologies
Publication date: 01/10/2018

We address the problem of normalizing user-generated content in a multilingual setting. Specifically, we target comment sections of popular Kazakhstani Internet news outlets, where comments almost always appear in Kazakh, in Russian, or in a mixture of both. Moreover, such comments are noisy, i.e. difficult to process due to (mostly) intentional breaches of spelling conventions, which aggravates the data sparseness problem. Therefore, we propose a simple yet effective normalization method that accounts for multilingual input. We evaluate our approach extrinsically, on the tasks of language identification and sentiment analysis, showing that in both cases normalization improves overall accuracy.

Crossref

Nazarbayev University Repository

Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography

Authors: Alnajjar Khalid, Hämäläinen Mika, Partanen Niko
Publication venue: Association pour le Traitement Automatique des Langues
Publication date: 01/01/2021

Texts written in Old Literary Finnish represent the first literary works written in Finnish, dating from the 16th century. Several projects in Finland have digitized old publications and made them available for research use. However, applying modern NLP methods to such data poses great challenges. In this paper we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling. Our best model reaches 96.3% accuracy on texts written by Agricola and 87.7% accuracy on other contemporary out-of-domain text. Our method has been made freely available on Zenodo and GitHub.

Comment: la 28e Conférence sur le Traitement Automatique des Langues Naturelles (TALN)

arXiv.org e-Print Archive

Helsingin yliopiston digitaalinen arkisto

Noisy Uyghur Text Normalization

Authors: Akici Ruket C., Tursun Osman
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2017

Uyghur is the second largest and most actively used social media language in China. However, a non-negligible and growing share of the Uyghur text appearing in social media is written unsystematically in the Latin alphabet. Uyghur text in this form is incomprehensible and ambiguous even to native Uyghur speakers. In addition, it offers little basis for advancing NLP tasks related to the Uyghur language. Restoring noisy Uyghur text written in unsystematic Latin alphabets, and preventing its spread, will be essential to protecting the Uyghur language and improving the accuracy of Uyghur NLP tasks. To this end, in this work we propose and compare the noisy channel model and the neural encoder-decoder model as normalization methods.

Queensland University of Technology ePrints Archive