
    MoNoise: Modeling Noise Using a Modular Normalization System

    We propose MoNoise: a normalization model focused on generalizability and efficiency; it aims to be easily reusable and adaptable. Normalization is the task of translating texts from a non-canonical domain to a more canonical domain, in our case from social media data to standard language. Our proposed model is based on modular candidate generation, in which each module is responsible for a different type of normalization action. The most important generation modules are a spelling correction system and a word embeddings module. Depending on the definition of the normalization task, a static lookup list can be crucial for performance. We train a random forest classifier to rank the candidates, which generalizes well to all the different types of normalization actions. Most features for the ranking originate from the generation modules; besides these features, N-gram features prove to be an important source of information. We show that MoNoise beats the state-of-the-art on different normalization benchmarks for English and Dutch, which all define the task of normalization slightly differently.
    Comment: Source code: https://bitbucket.org/robvanderg/monois
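    The modular design described here can be illustrated with a toy sketch: several generation modules propose candidates, and a random forest ranks them on features that largely come from those modules. All module implementations, features, and training data below are invented for illustration and are not the actual MoNoise system (linked above).

```python
# Toy sketch of a modular normalization pipeline: candidate generation
# modules followed by a feature-based random forest ranker.
import re
from sklearn.ensemble import RandomForestClassifier

LOOKUP = {"pix": "pictures", "u": "you"}  # static lookup-list module

def lookup_candidates(word):
    return [LOOKUP[word]] if word in LOOKUP else []

def spelling_candidates(word):
    # Stand-in for a spelling-correction module: collapse letter
    # repetitions ("sooo" -> "so").
    return [re.sub(r"(.)\1+", r"\1", word)]

def generate(word):
    # Each module proposes candidates; the original word is always kept.
    cands = {word}
    for module in (lookup_candidates, spelling_candidates):
        cands.update(module(word))
    return sorted(cands)

def features(word, cand):
    # Features largely originate from the generating modules; n-gram
    # language-model scores (omitted in this toy) are another key signal.
    return [int(cand == word),
            int(cand in lookup_candidates(word)),
            abs(len(cand) - len(word))]

# Train the ranker on (word, candidate, is-gold) pairs; a two-word
# toy corpus stands in for an annotated normalization corpus.
X, y = [], []
for word, gold in [("pix", "pictures"), ("sooo", "so")]:
    for cand in generate(word):
        X.append(features(word, cand))
        y.append(int(cand == gold))
ranker = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

def normalize(word):
    cands = generate(word)
    scores = ranker.predict_proba([features(word, c) for c in cands])[:, 1]
    return max(zip(scores, cands))[1]

print(normalize("pix"))  # -> "pictures" on this toy data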

    An In-depth Analysis of the Effect of Lexical Normalization on the Dependency Parsing of Social Media

    Existing natural language processing systems have often been designed with standard texts in mind. However, when these tools are used on the substantially different texts from social media, their performance drops dramatically. One solution is to translate social media data to standard language before processing; this is also called normalization. It is well known that this improves performance for many natural language processing tasks on social media data. However, little is known about which types of normalization replacements have the most effect. Furthermore, it is unknown what the weaknesses of existing lexical normalization systems are in an extrinsic setting. In this paper, we analyze the effect of manual as well as automatic lexical normalization for dependency parsing. Based on our analysis, we conclude that for most categories, automatic normalization scores close to manually annotated normalization, and that small annotation differences are important to take into consideration when exploiting normalization in a pipeline setup.
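    The extrinsic (pipeline) setup evaluated here can be sketched as follows. The parser below is only a placeholder, and the toy sentence, gold tree, and labelled attachment score (LAS) computation are illustrative, not taken from the paper.

```python
# Sketch of an extrinsic evaluation pipeline: parse the raw and the
# normalized token sequence, then compare LAS against a gold tree.
def parse(tokens):
    # Placeholder parser: attaches every token to the first one. A real
    # parser's output would differ between raw and normalized input.
    return [(0, "root")] + [(1, "dep")] * (len(tokens) - 1)

def las(pred, gold):
    # Labelled attachment score: share of tokens whose head index and
    # dependency label both match the gold annotation.
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

raw        = ["new", "pix", "comming", "tomorroe"]
normalized = ["new", "pictures", "coming", "tomorrow"]
gold = [(2, "amod"), (3, "nsubj"), (0, "root"), (3, "obl:tmod")]  # toy tree;
# heads are 1-based token indices, 0 marks the artificial root.

gain = las(parse(normalized), gold) - las(parse(raw), gold)
print(f"LAS gain from normalization: {gain:+.3f}")
```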

    To Normalize, or Not to Normalize: The Impact of Normalization on Part-of-Speech Tagging

    Does normalization help Part-of-Speech (POS) tagging accuracy on noisy, non-canonical data? To the best of our knowledge, little is known about the actual impact of normalization in a real-world scenario, where gold error detection is not available. We investigate the effect of automatic normalization on POS tagging of tweets. We also compare normalization to strategies that leverage large amounts of unlabeled data kept in its raw form. Our results show that normalization helps, but does not add consistently beyond just word embedding layer initialization. The latter approach yields a tagging model that is competitive with a Twitter state-of-the-art tagger.
    Comment: In WNUT 201
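    A minimal sketch of the embedding-initialization strategy compared here: the tagger's word embedding layer is initialized from vectors pretrained on large amounts of raw (un-normalized) tweets, then fine-tuned. PyTorch is used below; the vocabulary and random stand-in vectors are invented for illustration.

```python
# Sketch: initialize a POS tagger's embedding layer from pretrained
# vectors trained on raw tweets, instead of normalizing the input.
import torch
import torch.nn as nn

vocab = {"<unk>": 0, "new": 1, "pix": 2, "comming": 3}
dim = 50
pretrained = torch.randn(len(vocab), dim)  # stand-in for skip-gram vectors

embedding = nn.Embedding(len(vocab), dim)
embedding.weight.data.copy_(pretrained)    # initialize, then fine-tune

# The tagger proper (e.g. a BiLSTM over these embeddings) is trained on
# annotated tweets; the raw-text pretraining does the heavy lifting.
ids = torch.tensor([vocab.get(w, 0) for w in ["new", "pix", "comming"]])
vectors = embedding(ids)                   # (3, 50) input to the tagger
print(vectors.shape)
```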

    Normalization and parsing algorithms for uncertain input

    The automatic analysis (parsing) of natural language is an important ingredient for many natural language processing applications (search engines, automatic translation, speech processing, etc.), as it is the first step towards interpretation. For standard texts, like well-edited news articles, current parsers perform very well. However, for user-generated content, such as tweets, parser performance drops dramatically.

    In this research, we attempt to improve the automatic analysis of spontaneous language by translating it to 'normal' language. For example, the sentence "new pix comming tomorroe" is translated to "new pictures coming tomorrow". In this example sentence, a variety of phenomena occur: 'pix' is a replacement based on the pronunciation, whereas 'comming' is probably a typo. This translation is also referred to as 'normalization'. Based on the observation that the normalization problem actually consists of multiple sub-problems, we developed a modular normalization model: MoNoise. This normalization model reaches a new state-of-the-art performance on a variety of languages.

    Normalizing social media texts leads to a performance increase for syntactic parsers. In the basic setup, we use only the single best normalization candidate for each word, which might lead to error propagation. Hence, we introduce two novel methods to let the parser take multiple normalization candidates into account per position, leading to further improvements in parser performance.
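    A minimal sketch of the multi-candidate idea: instead of committing to the single best normalization per word, the parser is given several weighted candidates per position. The lattice structure and candidate probabilities below are invented stand-ins for ranker outputs, not the methods from the thesis.

```python
# Sketch: multiple weighted normalization candidates per position,
# represented as a word lattice the parser can search over.
from itertools import product

lattice = [
    [("new", 0.98)],
    [("pix", 0.40), ("pictures", 0.55), ("picks", 0.05)],
    [("coming", 0.90), ("comming", 0.10)],
    [("tomorrow", 0.95), ("tomorroe", 0.05)],
]

def paths(lattice):
    # Enumerate every normalization sequence with its probability; a
    # parser would score each path jointly with its parse instead of
    # committing to one sequence up front.
    for combo in product(*lattice):
        words = [w for w, _ in combo]
        prob = 1.0
        for _, p in combo:
            prob *= p
        yield words, prob

best_words, best_prob = max(paths(lattice), key=lambda wp: wp[1])
print(best_words, round(best_prob, 3))
```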

    Challenges in Annotating and Parsing Spoken, Code-switched, Frisian-Dutch Data


    Modeling Input Uncertainty in Neural Network Dependency Parsing
