
    Benefits of data augmentation for NMT-based text normalization of user-generated content

    One of the most persistent characteristics of written user-generated content (UGC) is the use of non-standard words. This characteristic makes UGC considerably harder to process and analyze automatically. Text normalization is the task of transforming lexical variants into their canonical forms and is often used as a pre-processing step for conventional NLP tasks in order to overcome the performance drop that NLP systems experience when applied to UGC. In this work, we follow a Neural Machine Translation approach to text normalization. Training such an encoder-decoder model requires large parallel corpora of sentence pairs. However, obtaining large data sets of UGC paired with normalized versions is not trivial, especially for languages other than English. In this paper, we explore how to overcome this data bottleneck for Dutch, a low-resource language. We start from a tiny, publicly available parallel Dutch data set comprising three UGC genres and compare two different approaches. The first is to manually normalize and add training data, a time- and money-consuming task. The second is a set of data augmentation techniques which increase data size by converting existing resources into synthesized non-standard forms. Our results reveal that a combination of both approaches leads to the best results.
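
    The abstract does not spell out the augmentation rules, but a minimal Python sketch of the general idea (converting clean sentences into synthetic non-standard variants to build parallel training pairs) might look as follows. The substitution table and drop probability are purely illustrative assumptions, not the thesis's actual rules.

```python
import random

# Illustrative (hypothetical) substitution rules; the thesis's actual
# augmentation rules for Dutch UGC are not specified in the abstract.
CHAR_SUBS = {"ij": "y", "ei": "y", "dat": "da", "niet": "ni"}

def synthesize_ugc(clean: str, drop_prob: float = 0.05) -> str:
    """Turn a canonical sentence into a synthetic non-standard variant."""
    noisy = clean.lower()
    for canonical, variant in CHAR_SUBS.items():
        if random.random() < 0.5:
            noisy = noisy.replace(canonical, variant)
    # Randomly drop characters to mimic typos and abbreviation.
    noisy = "".join(ch for ch in noisy if random.random() > drop_prob)
    return noisy

def make_parallel_pairs(corpus):
    """Build (source=noisy, target=clean) pairs for encoder-decoder training."""
    return [(synthesize_ugc(sent), sent) for sent in corpus]

print(make_parallel_pairs(["Ik weet niet of dat een goed idee is."]))
```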

    Robust Parsing for Ungrammatical Sentences

    Natural Language Processing (NLP) is a research area that specializes in studying computational approaches to human language. However, not all natural language sentences are grammatically correct. Sentences that are ungrammatical, awkward, or overly casual/colloquial appear in a variety of NLP applications, from product reviews and social media analysis to intelligent language tutors and multilingual processing. In this thesis, we focus on parsing, because it is an essential component of many NLP applications. We investigate how the performance of statistical parsers degrades when dealing with ungrammatical sentences. We also hypothesize that breaking parse trees apart at problematic parts prevents NLP applications from being degraded by incorrect syntactic analyses. A parser is robust if it can overlook problems such as grammar mistakes and produce a parse tree that closely resembles the correct analysis for the intended sentence. We develop a robustness evaluation metric and conduct a series of experiments to compare the performance of state-of-the-art parsers on ungrammatical sentences. The evaluation results show that ungrammatical sentences present challenges for statistical parsers, because the well-formed syntactic trees they produce may not be appropriate for ungrammatical sentences. We also define a new framework for reviewing the parses of ungrammatical sentences and extracting the coherent parts whose syntactic analyses make sense. We call this task parse tree fragmentation. The experimental results suggest that the proposed fragmentation framework is a promising way to handle syntactically unusual sentences.
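
    The thesis's fragmentation method is not detailed in the abstract; the following sketch, using NLTK's Tree and a hypothetical set of problematic words, illustrates one simple way such a cut could work: keep subtrees that contain no problematic words, and recurse past the ones that do.

```python
from nltk import Tree  # pip install nltk

def fragment(tree: Tree, bad_words: set) -> list:
    """Split a parse into coherent fragments by cutting around
    subtrees that cover problematic (e.g., ungrammatical) words."""
    if not set(tree.leaves()) & bad_words:
        return [tree]            # whole subtree is coherent: keep it intact
    if tree.height() <= 2:       # a single problematic word: cut it out
        return []
    fragments = []
    for child in tree:
        if isinstance(child, Tree):
            fragments.extend(fragment(child, bad_words))
    return fragments

parse = Tree.fromstring("(S (NP (PRP I)) (VP (VBP has) (NP (DT a) (NN question))))")
for frag in fragment(parse, bad_words={"has"}):
    print(frag)   # -> (NP (PRP I)) and (NP (DT a) (NN question))
```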

    Phonetic normalization as a means to improve toxicity detection

    Over time, online communities and electronic means of communication have become ever more prevalent. With this increase, the number of users exploiting those means to create and spread harmful, or toxic, content has grown as well. To protect these communities, moderation becomes a critical matter. While it is possible to hire a team of moderators, such a team would have to grow continuously, so most platforms turn to automatic means of detection as a step in their moderation process. Examples of such automatic means are blacklists and whitelists, but these methods are easily subverted by harmful users. A common subversion technique is substituting a word with a phonetically similar word, or with a combination of letters that visually resembles the intended word. This thesis offers a novel approach to moderation that specifically targets phonetic substitutions: a normalizer that reconstructs how a word should be read and infers the obfuscated word, nullifying the effects of subversion. Once normalized phonetically, messages are then passed to existing classification systems for automatic moderation.
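
    As a rough illustration of this idea, the toy normalizer below maps obfuscated tokens to known lexicon entries via a crude phonetic key. The key function and the tiny lexicon are assumptions standing in for the thesis's actual pronunciation model.

```python
import re

def phonetic_key(word: str) -> str:
    """A crude phonetic key (a toy stand-in for a real grapheme-to-phoneme
    model): map look-/sound-alike sequences, collapse repeated letters."""
    w = word.lower()
    for pattern, repl in [("ph", "f"), ("ck", "k"), ("0", "o"), ("3", "e"),
                          ("1", "i"), ("$", "s"), ("@", "a"), ("5", "s")]:
        w = w.replace(pattern, repl)
    w = re.sub(r"[^a-z]", "", w)          # drop leftover symbols
    w = re.sub(r"(.)\1+", r"\1", w)       # collapse repeated letters
    return w

# Hypothetical lexicon of words the downstream classifier knows, indexed by key.
LEXICON = {phonetic_key(w): w for w in ["stupid", "idiot", "hello"]}

def normalize(token: str) -> str:
    """Map an obfuscated token back to its lexicon form if the keys match."""
    return LEXICON.get(phonetic_key(token), token)

print([normalize(t) for t in "h3llo you $tup1d idi0t".split()])
# -> ['hello', 'you', 'stupid', 'idiot']
```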

    Macro-micro approach for mining public sociopolitical opinion from social media

    During the past decade, we have witnessed the emergence of social media, which has gained prominence as a means for the general public to exchange opinions on a broad range of topics. Furthermore, its social and temporal dimensions make it a rich resource for policy makers and organisations seeking to understand public opinion. In this thesis, we present our research in understanding public opinion on Twitter along three dimensions: sentiment, topics and summary. In the first line of our work, we study how to classify public sentiment on Twitter. We focus on the task of multi-target-specific sentiment recognition on Twitter, and propose an approach which utilises the syntactic information from the parse tree in conjunction with the left-right context of the target. We show state-of-the-art performance on two datasets, including a multi-target Twitter corpus on UK elections which we make publicly available for the research community. Additionally, we conduct two preliminary studies: cross-domain emotion classification on discourse around arts and cultural experiences, and social spam detection to improve the signal-to-noise ratio of our sentiment corpus. Our second line of work focuses on automatic topical clustering of tweets. Our aim is to group tweets into a number of clusters, with each cluster representing a meaningful topic, story, event or a reason behind a particular choice of sentiment. We explore various ways of tackling this challenge and propose a two-stage hierarchical topic modelling system that is efficient and effective in achieving our goal. Lastly, in our third line of work, we study the task of summarising tweets on common topics, with the goal of providing informative summaries of real-world events/stories or explanations of the sentiment expressed towards an issue/entity. As most existing tweet summarisation approaches rely on extractive methods, we propose to apply a state-of-the-art neural abstractive summarisation model to tweets. We also tackle the challenge of cross-medium supervised summarisation with no target-medium training resources. To the best of our knowledge, there is no existing work studying neural abstractive summarisation on tweets. In addition, we present a system for providing interactive visualisation of topic-entity sentiments and the corresponding summaries in chronological order. Throughout the work presented in this thesis, we conduct experiments to evaluate and verify the effectiveness of our proposed models, comparing them to relevant baseline methods. Most of our evaluations are quantitative; however, we also perform qualitative analyses where appropriate. This thesis provides insights and findings that can be used for better understanding public opinion in social media.
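
    The abstract mentions modelling the left-right context of a sentiment target; a minimal sketch of that input layout is shown below (the thesis encodes these contexts with neural models and parse-tree features, which are omitted here). The tweet and targets are invented for illustration.

```python
def target_contexts(tokens, target):
    """Split a tweet into left/right contexts around a sentiment target,
    the input layout used by target-dependent sentiment classifiers."""
    i = tokens.index(target)
    left = tokens[: i + 1]   # left context, target included
    right = tokens[i:]       # right context, target included
    return left, right

tweet = "labour nailed the debate but the tories looked lost".split()
for target in ("labour", "tories"):
    left, right = target_contexts(tweet, target)
    print(target, "| L:", left, "| R:", right)
```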

    A Survey on Semantic Processing Techniques

    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics, and the depth and breadth of research in computational semantic processing can be greatly improved with new technologies. In this survey, we analyze five semantic processing tasks, namely word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions.
    Comment: Published in Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal-contribution mark is missing from the published version due to publication policies; please contact Prof. Erik Cambria for details.

    Towards the Understanding of Private Content – Content-based Privacy Assessment and Protection in Social Networks

    In the wake of the Facebook data breach scandal, users begin to realize how vulnerable their personal data is and how blindly they trust the online social networks (OSNs) by giving them an inordinate amount of private data that touch on unlimited areas of their lives. In particular, studies show that users sometimes reveal too much information or unintentionally release regretful messages, especially when they are careless, emotional, or unaware of privacy risks. Additionally, friends on social media platforms are also found to be adversarial and may leak one's private information. Threats from within users' friend networks – insider threats by humans or bots – may be more concerning because they are much less likely to be mitigated through existing solutions, e.g., the use of privacy settings. Therefore, we argue that the key component of privacy protection in social networks is protecting sensitive/private content, i.e., privacy as the ability to control the dissemination of information. A mechanism to automatically identify potentially sensitive/private posts and alert users before they are posted is urgently needed. In this dissertation, we propose a context-aware, text-based quantitative model for private information assessment, namely PrivScore, which is expected to serve as the foundation of a privacy leakage alerting mechanism. We first explicitly research and study topics that might contain private content. Based on this knowledge, we solicit diverse opinions on the sensitiveness of private information from crowdsourcing workers, and examine the responses to discover a perceptual model behind the consensuses and disagreements. We then develop a computational scheme using deep neural networks to compute a context-free PrivScore (i.e., the "consensus" privacy score among average users). Finally, we integrate tweet histories, topic preferences and social contexts to generate a personalized context-aware PrivScore. This privacy scoring mechanism could be employed to identify potentially private messages and alert users to think again before posting them to OSNs. It could also benefit non-human users such as social media chatbots.
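
    The abstract does not give a scoring formula; purely as an assumed illustration, a personalized score might blend the context-free consensus score with user-specific signals along these lines (the dissertation learns such a combination with deep neural networks rather than fixing weights by hand).

```python
def context_aware_privscore(consensus: float, topic_affinity: float,
                            history_rate: float, audience_size: float,
                            w=(0.6, 0.2, 0.1, 0.1)) -> float:
    """Blend a context-free 'consensus' privacy score with personal
    signals. Weights and signals are illustrative assumptions only."""
    score = (w[0] * consensus          # crowd-perceived sensitivity
             + w[1] * topic_affinity   # how private this topic is for the user
             + w[2] * history_rate     # how often the user posts such content
             + w[3] * audience_size)   # exposure of the post
    return min(max(score, 0.0), 1.0)

if context_aware_privscore(0.8, 0.7, 0.2, 0.9) > 0.5:
    print("Warning: this draft may reveal private information.")
```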