
    SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020)

We present the results and main findings of SemEval-2020 Task 12 on Multilingual Offensive Language Identification in Social Media (OffensEval 2020). The task comprises three subtasks corresponding to the hierarchical taxonomy of the OLID schema (Zampieri et al., 2019a) from OffensEval 2019. Subtask A featured five languages: English, Arabic, Danish, Greek, and Turkish; English additionally featured Subtasks B and C. OffensEval 2020 was one of the most popular tasks at SemEval-2020, attracting a large number of participants across all subtasks and all languages. A total of 528 teams signed up to participate in the task, 145 teams submitted systems during the evaluation period, and 70 submitted system description papers.

    Sentiment Analysis for the Low-Resourced Latinised Arabic "Arabizi"

The expansion of digital communication from private mobile messaging into the public sphere through social media presented an opportunity for data science research and industry to mine the generated big data for automated information extraction. A popular information extraction task is sentiment analysis, which aims to extract polarity opinions (positive, negative, or neutral) from written natural language. It helps organisations better understand the public’s opinion of events, news, public figures, and products. However, sentiment analysis has advanced much further for English than for Arabic, and while sentiment analysis for Arabic is developing in the Natural Language Processing (NLP) literature, a popular variety of Arabic, Arabizi, has been overlooked. Arabizi is an informal transcription of spoken dialectal Arabic in Latin script used for social texting. It is common among Arab youth, yet it has been left out of work on Arabic sentiment analysis because of its linguistic complexities. Like Arabic, Arabizi is rich in inflectional morphology, but it is also codeswitched with English or French and transcribed without adhering to any standard orthography. The rich morphology, inconsistent orthography, and codeswitching compound to multiply the lexical sparsity of the language: each Arabizi word can be spelled in many ways, on top of the mixing of other languages within the same textual context. The resulting high degree of lexical sparsity defies the very basis of sentiment analysis, the classification of positive and negative words. Arabizi also suffers a severe shortage of the data resources required to set up any sentiment analysis approach. In this thesis, we tackle this gap by conducting research on sentiment analysis for Arabizi. We address the sparsity challenge by harvesting Arabizi data from multilingual social media text using deep learning, building Arabizi resources for sentiment analysis. We developed six new morphologically and orthographically rich Arabizi sentiment lexicons and set the baseline for Arabizi sentiment analysis on social media.
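To make the sparsity problem concrete, below is a minimal, hypothetical sketch of a lexicon-based polarity step for Arabizi. The lexicon entries, the normalisation rule, and the tokeniser are illustrative assumptions, not the thesis's actual resources; the point is only that spelling variants (such as letter repetition) must be normalised before any lexicon lookup can work.

```python
import re

# Hypothetical Arabizi sentiment entries; the thesis builds six far richer
# lexicons covering morphological and orthographic variants.
POSITIVE = {"7elw", "mni7", "jamil"}      # roughly "sweet", "good", "beautiful"
NEGATIVE = {"3atel", "kharban", "wa7esh"}  # roughly "useless", "broken", "bad"

def normalise(token: str) -> str:
    """Collapse runs of 3+ repeated letters ('7elwwww' -> '7elw') to
    reduce lexical sparsity before lexicon lookup."""
    return re.sub(r"(.)\1{2,}", r"\1", token.lower())

def polarity(text: str) -> str:
    tokens = [normalise(t) for t in re.findall(r"[\w']+", text)]
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(polarity("el jaw 7elwwww lyom"))  # -> positive
```

Without the normalisation step, "7elwwww" would miss the lexicon entry entirely, which is exactly the sparsity effect the abstract describes.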

    Data analytics 2016: proceedings of the fifth international conference on data analytics


    Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies

    First International Workshop on Lexical Resources

Lexical resources are one of the main sources of linguistic information for research and applications in Natural Language Processing and related fields. In recent years, advances have been achieved both in symbolic aspects of lexical resource development (lexical formalisms, rule-based tools) and in statistical techniques for the acquisition and enrichment of monolingual and multilingual lexical resources. The latter have allowed for faster development of large-scale morphological, syntactic, and/or semantic resources, for widely used as well as resource-scarce languages. Moreover, the notion of a dynamic lexicon is used increasingly to account for the fact that the lexicon undergoes permanent evolution. This workshop aims to sketch a broad picture of the state of the art in lexical resource modeling and development. It is also dedicated to research on the application of lexical resources to improving corpus-based studies and language processing tools, both in NLP and in other language-related fields such as linguistics, translation studies, and didactics.

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on its parallel training data; phrases absent from that data are not translated correctly. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system using external morphological resources rather than additional parallel data. A set of new phrase associations is added to the translation and reordering models; each corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translation, and results showed improved performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add other types of information to the model.
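As a rough illustration of the idea, not the paper's implementation, the sketch below generates morphological variants of an existing phrase pair from a toy lemma-to-forms table and weights each new association by a string-similarity score. Here difflib's ratio stands in for the paper's morphosyntactic similarity score, and the morphological table and phrase-table entries are invented.

```python
from difflib import SequenceMatcher

# Hypothetical morphological resource: word -> inflected forms.
# The paper uses external En/Fr morphological lexicons instead.
MORPH = {
    "maison": ["maisons"],
    "house": ["houses"],
}

def variants(phrase: str) -> list:
    """All phrases obtained by swapping one word for a morphological variant."""
    words = phrase.split()
    out = []
    for i, w in enumerate(words):
        for form in MORPH.get(w, []):
            out.append(" ".join(words[:i] + [form] + words[i + 1:]))
    return out

def expand(phrase_table: list) -> list:
    """Create new source/target/both variant associations, each weighted by
    its string similarity to the existing association."""
    new = []
    for src, tgt, prob in phrase_table:
        for s in [src] + variants(src):
            for t in [tgt] + variants(tgt):
                if (s, t) == (src, tgt):
                    continue  # keep only genuinely new associations
                sim = SequenceMatcher(None, src + " " + tgt, s + " " + t).ratio()
                new.append((s, t, prob * sim))
    return new

for entry in expand([("the house", "la maison", 0.8)]):
    print(entry)  # e.g. ('the houses', 'la maisons', ...)
```

In a real system the new entries would then be filtered and rescored before being merged into the translation and reordering models.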

    Uncertainty in deliberate lexical interventions

Language managers in their different forms (language planners, terminologists, professional neologists, …) have long tried to intervene in the lexical usage of speakers, with varying degrees of success: some of their lexical items (partly) penetrate language use, others do not. Based on electronic networks of practice in the Esperanto speech community, Mélanie Maradan establishes the foundation for a new method of extracting speakers’ opinions on lexical items from text corpora. The method is intended as a tool for language managers to detect and explore, in context, the reasons why speakers might accept or reject lexical items. Mélanie Maradan holds a master’s degree in translation and terminology from the University of Geneva, Switzerland, as well as a joint doctoral degree in multilingual information processing and philosophy (Dr. phil.) from the universities of Geneva and Hildesheim, Germany. Her research interests include planned languages (Esperanto studies) as well as neology and corpus linguistics. She works as a professional translator and terminologist in Switzerland.

    Sentiment analysis and resources for informal Arabic text on social media

Online content posted by Arab users on social networks does not generally abide by grammatical and spelling rules. These posts, or comments, are valuable because they contain users’ opinions of objects such as products, policies, institutions, and people. These opinions are important material for commercial and governmental institutions: commercial institutions can use them to steer marketing campaigns, optimise their products, and learn their products’ weaknesses and/or strengths, while governmental institutions can detect public opinion before or after legislating a new policy or law and learn about the main issues that concern citizens. However, the huge size of online data and its noisy nature hinder manual extraction and classification of the opinions present in online comments. Given the irregularity of dialectal (informal) Arabic, tools developed for formally correct Arabic are of limited use, specifically when employed in sentiment analysis (SA) of social media content. This research implemented a system that addresses this challenge. The work divides roughly into three blocks: building a corpus for SA and manually tagging it to check the performance of the constructed lexicon-based (LB) classifier; building a sentiment lexicon consisting of three different sets of patterns (negative, positive, and spam); and implementing a classifier that employs the lexicon to classify Facebook comments. In addition to providing resources for dialectal Arabic SA and classifying Facebook comments, this work categorises reasons behind incorrect classification, provides preliminary solutions for some of them with a focus on negation, and uses regular expressions to detect the presence of lexemes. It also illustrates how the constructed classifier works along with its different levels of reporting, compares the performance of the LB classifier against a Naïve Bayes classifier, and addresses how NLP tools such as POS tagging and Named Entity Recognition can be employed in SA. The work further studies the performance of the implemented LB classifier and the developed sentiment lexicon on corpora used in the literature, as well as the performance of lexicons from the literature on the corpora constructed in this research. With minor changes, the classifier can also be used for domain classification of documents (sports, science, news, etc.). The work ends with a discussion of research questions arising from the research reported.
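The sketch below shows the general shape of such a lexicon-based step: regular-expression patterns detect lexemes regardless of inflection, and a preceding negation particle flips the matched polarity. The patterns, negation particles, and scoring are hypothetical stand-ins for the thesis's actual lexicon and rules.

```python
import re

# Hypothetical pattern lexicon; the thesis compiles negative, positive,
# and spam pattern sets for dialectal Arabic.
POS_PAT = [re.compile(r"حلو\w*"), re.compile(r"ممتاز\w*")]
NEG_PAT = [re.compile(r"سيء\w*"), re.compile(r"فاشل\w*")]
NEGATORS = {"ما", "مش", "لا"}  # common dialectal negation particles

def classify(comment: str) -> str:
    score = 0
    tokens = comment.split()
    for i, tok in enumerate(tokens):
        # A negation particle immediately before a cue flips its polarity.
        negated = i > 0 and tokens[i - 1] in NEGATORS
        for pat in POS_PAT:
            if pat.match(tok):
                score += -1 if negated else 1
        for pat in NEG_PAT:
            if pat.match(tok):
                score += 1 if negated else -1
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(classify("الفيلم مش حلو"))  # negation flips the positive cue -> negative
```

Matching with `\w*` suffixes lets one pattern cover inflected forms of a lexeme, which is why the thesis favours regular expressions over exact word lists.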

    Using Comparable Corpora to Augment Statistical Machine Translation Models in Low Resource Settings

Statistical machine translation (SMT) models have previously been estimated from parallel corpora, i.e. pairs of translated sentences. In this thesis, we directly incorporate comparable corpora into the estimation of end-to-end SMT models. In contrast to parallel corpora, comparable corpora are pairs of monolingual corpora that have some cross-lingual similarities, for example topic or publication date, but that do not necessarily contain any direct translations. Comparable corpora are more readily available in large quantities than parallel corpora, which require significant human effort to compile. We use comparable corpora to estimate machine translation model parameters and show that doing so improves performance in settings where a limited amount of parallel data is available for training. The major contributions of this thesis are the following:

* We release ‘language packs’ for 151 human languages, which include bilingual dictionaries, comparable corpora of Wikipedia document pairs, comparable corpora of time-stamped news text that we harvested from the web, and, for non-Roman-script languages, dictionaries of name pairs, which are likely to be transliterations.
* We present a novel technique for using a small number of example word translations to learn a supervised model for bilingual lexicon induction that takes advantage of a wide variety of signals of translation equivalence estimable over comparable corpora (see the sketch after this list).
* We show that using comparable corpora to induce new translations and estimate new phrase table feature functions improves end-to-end statistical machine translation performance for low resource language pairs as well as domains.
* We present a novel algorithm for composing multiword phrase translations from multiple unigram translations, using comparable corpora to prune the large space of hypothesis translations; the induced phrase translations improve machine translation performance beyond that of the component unigrams.

This thesis focuses on critical low resource machine translation settings, where insufficient parallel corpora exist for training statistical models. We experiment with both low resource language pairs and low resource domains of text. We present results from our novel error analysis methodology, which show that most translation errors in low resource settings are due to unseen source language words and phrases and unseen target language translations. We also find room for fixing errors due to how different translations are weighted, or scored, in the models. We target both error types: we use comparable corpora to induce new word and phrase translations and to estimate novel translation feature scores. Our experiments show that augmenting baseline SMT systems with new translations and features estimated over comparable corpora improves translation performance significantly. Additionally, our techniques expand the applicability of statistical machine translation to language pairs for which zero parallel text is available.
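Below is a minimal sketch of the supervised bilingual lexicon induction idea under assumed data: two illustrative signals computable from comparable corpora (temporal similarity of time-stamped frequency profiles and orthographic similarity) are combined as features of a logistic-regression classifier trained on a small seed dictionary. All words, counts, and feature choices here are invented for illustration.

```python
import numpy as np
from difflib import SequenceMatcher
from sklearn.linear_model import LogisticRegression

def temporal_sim(src_counts, tgt_counts) -> float:
    """Cosine similarity of time-stamped frequency profiles: true
    translations tend to spike in the news on the same dates."""
    a, b = np.asarray(src_counts, float), np.asarray(tgt_counts, float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def edit_sim(src: str, tgt: str) -> float:
    """Crude orthographic signal, useful for cognates and transliterations."""
    return SequenceMatcher(None, src, tgt).ratio()

def featurise(pair, profiles):
    src, tgt = pair
    return [temporal_sim(profiles[src], profiles[tgt]), edit_sim(src, tgt)]

# Invented per-word daily frequency profiles from comparable news corpora.
profiles = {
    "crise": [5, 9, 2, 0], "crisis": [6, 8, 3, 0],
    "pomme": [0, 1, 0, 4], "apple":  [0, 2, 0, 5],
}
# Invented seed dictionary: known translations (1) and mismatches (0).
train = [(("crise", "crisis"), 1), (("pomme", "apple"), 1),
         (("crise", "apple"), 0), (("pomme", "crisis"), 0)]

X = [featurise(pair, profiles) for pair, _ in train]
y = [label for _, label in train]
model = LogisticRegression().fit(X, y)

# Score a candidate pair drawn from the comparable corpora.
print(model.predict_proba([featurise(("crise", "crisis"), profiles)])[0, 1])
```

In the thesis, many more signals (for example contextual and topic similarity) are combined, and the classifier's scores feed into the phrase table as new feature functions.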