
    Neural text normalization for Turkish social media

    This is an accepted manuscript of an article published by IEEE in 2018 3rd International Conference on Computer Science and Engineering (UBMK) on 10/12/2018, available online: https://ieeexplore.ieee.org/document/8566406 The accepted version of the publication may differ from the final published version. Social media has become a rich data source for natural language processing tasks thanks to its worldwide use; however, social media data is hard to process because of its informal nature. Text normalization is the task of transforming noisy text into its canonical form. It generally serves as a preprocessing step for other NLP tasks that are applied to noisy text. In this study, we apply two approaches to Turkish text normalization: a Contextual Normalization approach using distributed representations of words, and a Sequence-to-Sequence Normalization approach using neural encoder-decoder models. Because the approaches applied to Turkish and to other languages are mostly rule-based, new rules must be added to the normalization model to detect new error patterns arising from changes in language use on social media. In contrast to rule-based approaches, the proposed approaches can normalize error patterns that change over time: the normalization model is updated by training on a new dataset. The proposed methods therefore address the dependency on language change in social media by updating the normalization model without defining new rules.

    Two Tales of the World: Comparison of Widely Used World News Datasets GDELT and EventRegistry

    In this work, we compare GDELT and Event Registry, which monitor news articles worldwide and provide big data to researchers, in terms of scale, news sources, and news geography. We found significant differences in scale and news sources, but surprisingly, we observed high similarity in news geography between the two datasets. Comment: To appear in ICWSM'1