9 research outputs found

    Text Clustering Analysis of the Tourist Area Reopening Policy During the Pandemic Based on Spatial Density (DBSCAN)

    Clustering is a method for grouping data into clusters using certain parameters, so that objects within a cluster share a high degree of similarity. In this study, text clustering is applied to comments on YouTube videos discussing the policy of reopening tourist areas during the pandemic, using the DBSCAN algorithm and comparing it against K-Means. The study obtained a Silhouette Score of 0.732 for DBSCAN and 0.637 for K-Means. Analysis and topic identification of the DBSCAN clusters show that the clusters formed by DBSCAN are better than those formed by K-Means, as can be seen from the most frequent words in each cluster: the first and third clusters formed by K-Means still share the same words, namely "moga" and "masuk". The topic of each DBSCAN cluster is also easier to summarize than with K-Means, because each cluster is well categorized by the kind of topic in its comments. The topic of the first DBSCAN cluster concerns Indonesia's quarantine policy upon reopening the country to foreign tourists, while the second concerns hopes for a quick recovery and gratitude for the policies implemented. In addition, DBSCAN produced 9 noise points. It is concluded that the DBSCAN algorithm is better than K-Means for clustering text data in the form of comments.
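    As a rough illustration of the comparison described above, the following minimal sketch clusters a handful of comments with both DBSCAN and K-Means using scikit-learn; the toy corpus and the eps, min_samples, and n_clusters values are placeholders, not the study's actual data or settings.

```python
# Minimal sketch of the DBSCAN vs K-Means comparison on TF-IDF text vectors.
# Corpus and parameters are illustrative placeholders, not the study's own.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score

comments = [
    "semoga pariwisata indonesia cepat bangkit",
    "kebijakan karantina untuk wisatawan mancanegara",
    "syukur kebijakan ini diterapkan",
    "karantina wajib saat masuk indonesia",
]

X = TfidfVectorizer().fit_transform(comments)

db = DBSCAN(eps=1.0, min_samples=2).fit(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# For DBSCAN, silhouette is computed on non-noise points (label -1 = noise).
mask = db.labels_ != -1
if mask.sum() > 1 and len(set(db.labels_[mask])) > 1:
    print("DBSCAN silhouette:", silhouette_score(X[mask], db.labels_[mask]))
print("K-Means silhouette:", silhouette_score(X, km.labels_))
print("DBSCAN noise points:", (db.labels_ == -1).sum())
```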

    Lexical Normalization for Code-switched Data and its Effect on POS Tagging

    Lexical normalization, the translation of non-canonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media. Yet using multiple languages in one utterance, also called code-switching (CS), is frequently overlooked by these normalization systems, despite its common use in social media. In this paper, we propose three normalization models specifically designed to handle code-switched data, which we evaluate for two language pairs: Indonesian-English (Id-En) and Turkish-German (Tr-De). For the latter, we introduce novel normalization layers and their corresponding language ID and POS tags for the dataset, and evaluate the downstream effect of normalization on POS tagging. Results show that our CS-tailored normalization models outperform the Id-En state of the art and Tr-De monolingual models, and lead to a 5.4% relative performance increase for POS tagging compared to unnormalized input.
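    To make the task concrete, the following made-up Indonesian-English example sketches what word-level normalization of code-switched text looks like; the token pairs are purely illustrative and are not drawn from the paper's dataset.

```python
# Illustrative word-level normalization of an Indonesian-English (Id-En)
# code-switched utterance; these pairs are invented examples of the task
# format, not items from the paper's data.
raw  = ["gue", "udah", "booking", "hotelnya", "yg", "best"]
norm = ["saya", "sudah", "booking", "hotelnya", "yang", "best"]
# The English words ("booking", "best") are kept as-is; only non-canonical
# Indonesian tokens are mapped to their standard forms.
for r, n in zip(raw, norm):
    print(f"{r}\t{n}")
```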

    Nazief-Adriani Stemmer with Non-Standard Affixes for Normalizing Conversational Language on Social Media

    The use of non-standard language is increasingly prevalent in communication on social media. Non-standard usage is not limited to sentences, clauses, or phrases, but extends to individual words. In this study, non-standard words (NSW) are normalized to standard Indonesian words (SW). The Nazief-Adriani stemmer (NAS) is extended into a non-standard stemmer (NSS) by improving its ability to detect non-standard affixes; the aim of the study is to compare NAS and NSS for NSW normalization. The Needleman-Wunsch similarity algorithm is used to score candidate matches. Mean Reciprocal Rank (MRR) testing on 3,438 NSWs shows that NSS with the number of queries set to 9 (Q = 9) achieves the highest MRR of 79.26%, with an average of 50.48%, while NAS with Q = 9 achieves a highest MRR of 72.87% and an average of 47.23%. Across both MRR tests, three initial letters yielded the highest stemming results under both NAS and NSS, namely r, f, and j. The most significant increase in MRR occurs for the initial letters 'd', 'n', and 't', which are the initial letters of several non-standard affixes.
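    The following small sketch shows how Mean Reciprocal Rank is computed in a setting like the one above, where a stemmer returns a ranked list of candidate standard words for each non-standard word; the candidate lists below are invented for illustration.

```python
# Hedged sketch of Mean Reciprocal Rank (MRR): for each non-standard word,
# the system returns a ranked candidate list, and MRR averages 1/rank of
# the correct standard word. Candidates here are made up.
def mrr(ranked_candidates, gold):
    total = 0.0
    for cands, g in zip(ranked_candidates, gold):
        if g in cands:
            total += 1.0 / (cands.index(g) + 1)  # rank is 1-based
        # if the gold word is absent, its reciprocal rank is 0
    return total / len(gold)

candidates = [
    ["makan", "makanan", "memakan"],   # candidates for NSW "mkn"
    ["sudah", "sedang", "sesudah"],    # candidates for NSW "udah"
]
gold = ["makan", "sudah"]
print(mrr(candidates, gold))  # 1.0: both gold words ranked first
```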

    MultiLexNorm: A Shared Task on Multilingual Lexical Normalization

    Lexical normalization is the task of transforming an utterance into its standardized form. This task is beneficial for downstream analysis, as it provides a way to harmonize (often spontaneous) linguistic variation. Such variation is typical of social media, on which information is shared in a multitude of ways, including diverse languages and code-switching. Since the seminal work of Han and Baldwin (2011) a decade ago, lexical normalization has attracted attention in English and many other languages. However, a common benchmark for comparing systems across languages, with a homogeneous data and evaluation setup, has been lacking. The MultiLexNorm shared task sets out to fill this gap. We provide the largest publicly available multilingual lexical normalization benchmark, including 12 language variants. We propose a homogenized evaluation setup with both intrinsic and extrinsic evaluation. As extrinsic evaluation, we use dependency parsing and part-of-speech tagging with adapted evaluation metrics (a-LAS, a-UAS, and a-POS) to account for alignment discrepancies. The shared task, hosted at W-NUT 2021, attracted 9 participants and 18 submissions. The results show that neural normalization systems outperform the previous state-of-the-art system by a large margin. Downstream parsing and part-of-speech tagging performance is positively affected, but to varying degrees, with improvements of up to 1.72 a-LAS, 0.85 a-UAS, and 1.54 a-POS for the winning system.
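    On the intrinsic side of such an evaluation, a standard metric for lexical normalization is the Error Reduction Rate, which normalizes word-level accuracy against the trivial leave-as-is baseline; the sketch below assumes this metric (whether it matches the shared task's exact implementation is an assumption) and uses invented data.

```python
# Sketch of Error Reduction Rate (ERR) for lexical normalization: accuracy
# normalized against the leave-the-word-as-is baseline. Treating this as the
# shared task's intrinsic metric is an assumption; the data is invented.
def err(raw, gold, pred):
    correct = sum(p == g for p, g in zip(pred, gold))
    baseline = sum(r == g for r, g in zip(raw, gold))  # leave-as-is baseline
    return (correct - baseline) / (len(gold) - baseline)

raw  = ["u", "r", "right", "tho"]
gold = ["you", "are", "right", "though"]
pred = ["you", "are", "right", "tho"]
print(err(raw, gold, pred))  # 2 of the 3 needed normalizations fixed ~ 0.67
```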

    Prompting Large Language Models to Generate Code-Mixed Texts: The Case of South East Asian Languages

    While code-mixing is a common linguistic practice in many parts of the world, collecting high-quality and low-cost code-mixed data remains a challenge for natural language processing (NLP) research. The proliferation of Large Language Models (LLMs) in recent times compels one to ask: can these systems be used for data generation? In this article, we explore prompting LLMs in a zero-shot manner to create code-mixed data for five languages in South East Asia (SEA): Indonesian, Malay, Chinese, Tagalog, and Vietnamese, as well as the creole language Singlish. We find that ChatGPT shows the most potential, capable of producing code-mixed text 68% of the time when the term "code-mixing" is explicitly defined. Moreover, both ChatGPT's and InstructGPT's (davinci-003) performance in generating Singlish texts is noteworthy, averaging a 96% success rate across a variety of prompts. The code-mixing proficiency of ChatGPT and InstructGPT, however, is dampened by word-choice errors that lead to semantic inaccuracies. Other multilingual models, such as BLOOMZ and Flan-T5-XXL, are unable to produce code-mixed texts at all. By highlighting the limited promise of LLMs in a specific form of low-resource data generation, we call for a measured approach when applying similar techniques to other data-scarce NLP contexts.
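    A minimal sketch of this kind of zero-shot prompting, using the OpenAI Python client, might look as follows; the prompt wording and model name are assumptions for illustration, not the article's actual prompts or models.

```python
# Hedged sketch of zero-shot prompting for code-mixed data generation with
# the OpenAI Python client. Prompt text and model name are illustrative
# assumptions, not the article's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Code-mixing is the practice of alternating between two or more "
    "languages within a single sentence. "
    # The article found that explicitly defining "code-mixing" like this
    # raised ChatGPT's success rate.
    "Generate one code-mixed sentence in Indonesian and English about "
    "ordering food."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```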

    Automatic Normalization of Finnish Social Media Text

    Social media provides huge amounts of potential data for natural language processing, but using this data can be challenging. Finnish social media text differs greatly from standard Finnish, and models trained on standard data may not be able to handle the differences adequately. Text normalization is the process of transforming non-standard language into its standardized form. It provides a way both to process non-standard data with standard natural language processing tools and to obtain more data for training new tools for different tasks. In this thesis, I experiment with bidirectional recurrent neural network (BRNN) models and models based on the ByT5 foundation model, as well as the Murre normalizer, to see whether existing tools are suitable for normalizing Finnish social media text. I manually normalize a small set of data from the Ylilauta and Suomi24 corpora to use as a test set. For training the models, I use the Samples of Spoken Finnish corpus and Wikipedia data with added synthetic noise. The results of this thesis show that no existing tools are suitable for normalizing Finnish written on social media, and that there is a lack of suitable data for training models for this task. The ByT5-based models perform better than the BRNN models.
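    The synthetic-noise idea mentioned above can be sketched as follows: standard text is corrupted into pseudo-social-media text so that (noisy, clean) pairs can be used to train a normalizer. The noise operation and rate below are illustrative assumptions, not the thesis's actual noise model.

```python
# Minimal sketch of synthetic noise injection for training a normalizer:
# corrupt clean standard text, then pair each noisy word with its clean
# form. The single noise operation and its rate are illustrative only.
import random

def add_noise(word, p=0.3, rng=random.Random(0)):
    if len(word) > 3 and rng.random() < p:
        i = rng.randrange(1, len(word) - 1)
        return word[:i] + word[i + 1:]   # drop one internal character
    return word

clean = "minä en tiedä mitä tämä tarkoittaa".split()
noisy = [add_noise(w) for w in clean]
pairs = list(zip(noisy, clean))  # training pairs: noisy -> clean
print(pairs)
```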