
    Lexical Normalization for Code-switched Data and its Effect on POS Tagging

    Lexical normalization, the translation of non-canonical data to standard language, has been shown to improve the performance of many natural language processing tasks on social media. Yet the use of multiple languages in one utterance, also called code-switching (CS), is frequently overlooked by these normalization systems, despite its common use in social media. In this paper, we propose three normalization models specifically designed to handle code-switched data, which we evaluate for two language pairs: Indonesian-English (Id-En) and Turkish-German (Tr-De). For the latter, we introduce novel normalization layers and their corresponding language ID and POS tags for the dataset, and evaluate the downstream effect of normalization on POS tagging. Results show that our CS-tailored normalization models outperform the Id-En state of the art and Tr-De monolingual models, and lead to a 5.4% relative performance increase for POS tagging compared to unnormalized input.
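    To illustrate the normalization task itself (not the three CS-tailored models, whose details are in the paper), the following minimal Python sketch replaces non-canonical tokens with standard forms via a lookup lexicon before the text would be handed to a POS tagger; the lexicon entries and the example sentence are invented for illustration.

        # A minimal lookup-based normalization sketch; not the paper's models.
        # Hypothetical mapping from non-canonical tokens to standard forms.
        NORM_LEXICON = {
            "yg": "yang",     # Indonesian shorthand
            "gk": "tidak",
            "pls": "please",  # English shorthand
            "u": "you",
        }

        def normalize(tokens):
            """Replace each token with its standard form when one is known."""
            return [NORM_LEXICON.get(tok.lower(), tok) for tok in tokens]

        if __name__ == "__main__":
            tweet = "yg penting u dtg pls".split()
            # The normalized tokens would then be fed to a downstream POS tagger.
            print(normalize(tweet))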

    An Efficient Architecture for Predicting the Case of Characters using Sequence Models

    The dearth of clean textual data often acts as a bottleneck in several natural language processing applications. The available data often lacks proper case (uppercase or lowercase) information, which is common when text is obtained from social media, messaging applications and other online platforms. This paper attempts to solve this problem by restoring the correct case of characters, a task commonly known as truecasing. Doing so improves the accuracy of several tasks further down the NLP pipeline. Our proposed architecture uses a combination of convolutional neural networks (CNN), bi-directional long short-term memory networks (LSTM) and conditional random fields (CRF), which work at the character level without any explicit feature engineering. In this study we compare our approach to previous statistical and deep-learning-based approaches. Our method shows an increase of 0.83 in F1 score over the current state of the art. Since truecasing acts as a preprocessing step in several applications, every increase in F1 score leads to a significant improvement in downstream language processing tasks. Comment: to be published in the IEEE ICSC 2020 proceedings.
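    To make the character-level setup concrete, here is a minimal truecasing sketch in PyTorch. It keeps only a BiLSTM tagger with a per-character case classifier; the CNN and CRF layers of the published model are omitted, and the vocabulary size and dimensions are placeholder values, so this is an illustration of the idea rather than the paper's implementation.

        # Simplified character-level truecaser: BiLSTM tagger only, no CNN/CRF.
        import torch
        import torch.nn as nn

        class TruecaserSketch(nn.Module):
            def __init__(self, n_chars=128, emb_dim=32, hidden=64, n_labels=2):
                super().__init__()
                self.emb = nn.Embedding(n_chars, emb_dim)
                self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
                self.out = nn.Linear(2 * hidden, n_labels)  # 0 = lowercase, 1 = uppercase

            def forward(self, char_ids):                  # (batch, seq_len)
                h, _ = self.lstm(self.emb(char_ids))      # (batch, seq_len, 2 * hidden)
                return self.out(h)                        # per-character case logits

        # Example: encode a lowercased sentence as ASCII codes and predict case labels.
        text = "truecasing restores case information"
        ids = torch.tensor([[min(ord(c), 127) for c in text]])
        logits = TruecaserSketch()(ids)
        print(logits.argmax(-1))  # untrained model, so predictions are meaningless here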

    Robust Named Entity Recognition with Truecasing Pretraining

    Although modern named entity recognition (NER) systems show impressive performance on standard datasets, they perform poorly when presented with noisy data. In particular, capitalization is a strong signal for entities in many languages, and even state-of-the-art models overfit to this feature, with drastically lower performance on uncapitalized text. In this work, we address the robustness of NER systems on data with noisy or uncertain casing, using a pretraining objective that predicts casing in text, i.e. a truecaser, leveraging unlabeled data. The pretrained truecaser is combined with a standard BiLSTM-CRF model for NER by appending its output distributions to the character embeddings. In experiments over several datasets of varying domain and casing quality, we show that our new model improves performance on uncased text, even adding value to uncased BERT embeddings. Our method achieves a new state of the art on the WNUT17 shared task dataset. Comment: Accepted to AAAI 2020.
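    The combination step described above can be sketched as a simple tensor concatenation; the shapes, dimensions and variable names below are illustrative rather than taken from the paper's released code.

        # Appending truecaser output distributions to character embeddings.
        import torch

        batch, seq_len = 2, 10
        char_emb_dim, case_classes = 50, 2

        char_embeddings = torch.randn(batch, seq_len, char_emb_dim)   # from the NER model
        truecaser_logits = torch.randn(batch, seq_len, case_classes)  # from the pretrained truecaser
        case_distribution = truecaser_logits.softmax(dim=-1)

        # Concatenate the per-character case distribution onto each character embedding
        # before the sequence is passed to the BiLSTM-CRF encoder.
        augmented = torch.cat([char_embeddings, case_distribution], dim=-1)
        print(augmented.shape)  # torch.Size([2, 10, 52])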

    Outage Detection via Real-time Social Stream Analysis: Leveraging the Power of Online Complaints

    Over the past couple of years, Netflix has significantly expanded its online streaming offerings, which now encompass multiple delivery platforms and thousands of titles available for instant viewing. This paper documents the design and development of an outage detection system for the online services provided by Netflix. Unlike other internal quality control measures used at Netflix, this system uses only publicly available information, namely the tweets (Twitter posts) that mention the word “Netflix,” and has been developed and deployed externally, on servers independent of the Netflix infrastructure. This paper discusses the system and provides an assessment of the accuracy of its real-time detection and alert mechanisms.
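    The abstract does not spell out the detection rule, but a rate-based detector of the kind such systems typically use can be sketched as follows: raise an alert when the per-minute count of complaint tweets mentioning "Netflix" exceeds a rolling baseline by several standard deviations. The window size, threshold and counts below are invented for illustration.

        # Rate-based anomaly detection over per-minute complaint tweet counts.
        from statistics import mean, stdev

        def detect_outage(counts_per_minute, window=30, sigmas=3.0):
            """Yield minute indices whose tweet volume is anomalously high."""
            for i in range(window, len(counts_per_minute)):
                baseline = counts_per_minute[i - window:i]
                mu, sd = mean(baseline), stdev(baseline)
                if counts_per_minute[i] > mu + sigmas * max(sd, 1.0):
                    yield i

        counts = [12, 10, 14, 11, 9, 13] * 6 + [95]     # synthetic spike at the end
        print(list(detect_outage(counts, window=30)))   # -> [36]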

    Automatic punctuation restoration with BERT models

    We present an approach for automatic punctuation restoration with BERT models for English and Hungarian. For English, we conduct our experiments on TED Talks, a commonly used benchmark for punctuation restoration, while for Hungarian we evaluate our models on the Szeged Treebank dataset. Our best models achieve a macro-averaged F1 score of 79.8 for English and 82.2 for Hungarian. Our code is publicly available.
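    For readers unfamiliar with the setup, punctuation restoration with BERT is usually framed as token classification. The sketch below uses the Hugging Face transformers API with an assumed four-way label set and the bert-base-uncased checkpoint; neither is necessarily the paper's exact configuration (which also covers Hungarian), so treat it as a schematic rather than a reproduction.

        # Punctuation restoration as token classification with a BERT encoder.
        import torch
        from transformers import AutoTokenizer, AutoModelForTokenClassification

        LABELS = ["O", "COMMA", "PERIOD", "QUESTION"]  # assumed punctuation label set

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModelForTokenClassification.from_pretrained(
            "bert-base-uncased", num_labels=len(LABELS)
        )

        text = "hello how are you i am fine"            # ASR-style input without punctuation
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits             # (1, seq_len, num_labels)
        pred = logits.argmax(-1)[0]                     # untrained head, so labels are arbitrary
        for tok, lab in zip(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0]), pred):
            print(tok, LABELS[int(lab)])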

    Legitimation Strategies on President Trump’s Twitter Account : A Case Study on @realDonaldTrump’s Tweets Related to the Russian Interference Investigation

    This thesis studies how the President of the United States of America, Donald Trump, uses his Twitter account, @realDonaldTrump, to (de)legitimate his views and stance on the Russian interference investigation and the charges he faced. In addition, this study examines whether the different (de)legitimation strategies used change once the official results of the “Report On The Investigation Into Russian Interference In The 2016 Presidential Election” are published. The aim was also to examine the underlying meanings behind these different (de)legitimation strategies. In order to answer my research questions, I compiled a micro-corpus of @realDonaldTrump’s tweets. To locate the relevant tweets, I used a selection of keywords that were used to discuss the matter. Retweets, direct quotes, pictures, videos and comment sections were left out, since the focus was mainly on @realDonaldTrump’s own words. The corpus was analyzed with Van Leeuwen’s (2007) framework of (de)legitimation and Van Dijk’s (1998) theory of the ideological square. This study also aimed to test whether these frameworks can be applied to this type of topic and data. After carefully examining the corpus, it became evident that @realDonaldTrump relies heavily on delegitimating his opposition through the use of moral evaluation, morally loaded language and other modal elements. He also uses these elements to legitimate his own side. Legitimation through authorization is used to corroborate his stance. @realDonaldTrump emphasizes the negative aspects of the other side, while emphasizing what is positive on his own side. He also concentrates on suppressing negative aspects related to him or his team. Rationality and mythopoesis are also employed, but in some cases these strategies remain ambiguous and open to differing interpretations. The biggest perceivable changes around the time of publication of the results of the investigation are related to the lead investigator, Robert Mueller. In particular, Mueller’s placement on the Us/Them axis and in the ideological square varies. It becomes clear that these (de)legitimation strategies are used to convince the reader of President Trump’s innocence, decrease the legitimacy of the investigation and draw the reader to the side of @realDonaldTrump. This study showed that both of the frameworks can be successfully applied to this topic and dataset, even though some limitations exist. That said, the scope was relatively small, and further research on a different topic in the field of political discourse, possibly with a larger dataset, could prove very interesting.

    Using named entity recognition for relevance detection in social network messages

    The continuous growth of social networks over the past decade has led to massive amounts of information being generated on a daily basis. While a lot of this information is merely personal or simply irrelevant to a general audience, relevant news being transmitted through social networks is an increasingly common phenomenon, and detecting such news automatically has therefore become a field of interest and active research. The contribution of the present thesis consists in studying the importance of named entities in the task of relevance detection. With that in mind, the goal of this work was twofold: 1) to implement or find the best named entity recognition (NER) tools for social media texts, and 2) to analyze the importance of entities extracted from posts as features for relevance detection with machine learning. There are already well-known named entity recognition tools; however, most state-of-the-art tools show a significant decrease in performance when tested on social media texts rather than news media texts. This is mainly due to the informal character of social media texts: the absence of context, the lack of proper punctuation, wrong capitalization, the use of characters to represent emoticons, spelling errors and even the use of different languages in the same text. To address these problems, four state-of-the-art toolkits - Stanford NLP, GATE with TwitIE, Twitter NLP tools and OpenNLP - were tested on social media datasets. In addition, we tried to understand how differently these toolkits predicted named entities, in terms of their precision and recall for three entity types (Person, Location, Organization), and how they could complement each other in order to achieve a combined performance superior to each individual one, creating an ensemble of toolkits. Following the extraction of entities using the developed ensemble, different features were generated based on these entities. These features included the number of persons, locations and organizations mentioned in a post, statistics retrieved from The Guardian's open API, and were also combined with word embedding features. Multiple machine learning models were then trained on a manually annotated dataset of tweets. The performances obtained with different combinations of selected features, ML algorithms, hyperparameters and datasets were analyzed. Our results showed that using an ensemble of toolkits can improve the recognition of specific entity types, depending on the voting criteria, and can even improve the overall average performance across the entity types Person, Location and Organization. The relevance analysis showed that named entities can indeed be useful for relevance detection, not only when used alone, achieving up to 74% AUC, but also when combined with other features such as word embeddings, achieving a maximum AUC of 94%, a 2.6% improvement over using word embeddings alone.
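    As a sketch of the ensemble idea, the following Python snippet performs per-token majority voting over labels produced by several NER tools; the tool outputs are invented, whereas the thesis itself compares several voting criteria over Stanford NLP, GATE with TwitIE, Twitter NLP tools and OpenNLP.

        # Per-token majority voting over label sequences from multiple NER tools.
        from collections import Counter

        def majority_vote(predictions_per_tool, min_votes=2):
            """Combine per-token label sequences; fall back to 'O' without agreement."""
            merged = []
            for token_labels in zip(*predictions_per_tool):
                label, votes = Counter(token_labels).most_common(1)[0]
                merged.append(label if votes >= min_votes and label != "O" else "O")
            return merged

        tool_a = ["B-PER", "O", "B-LOC"]
        tool_b = ["B-PER", "O", "O"]
        tool_c = ["O",     "O", "B-LOC"]
        print(majority_vote([tool_a, tool_b, tool_c]))  # -> ['B-PER', 'O', 'B-LOC']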

    Explaining non-performing loans in Greece: a comparative study on the effects of recession and banking practices

    Using a new dataset of macroeconomic and banking-related variables, we attempt to explain the evolution of “bad” loans in Greece over the period 2005-2015. Our findings suggest that the sharp increase in non-performing loans (NPLs) following the outbreak of the sovereign debt crisis can be attributed mainly to the unprecedented contraction of domestic economic activity and the subsequent rise in unemployment. Furthermore, our results offer no empirical evidence in support of a range of examined hypotheses assuming overly aggressive lending practices by major Greek credit institutions or any systematic efforts to boost current earnings by extending credit to clients of lower credit quality. We find that the transmission of macroeconomic shocks to NPLs takes place relatively fast, with the estimated magnitude of the respective responses being broadly comparable to that documented in earlier studies for other euro area periphery economies. Overall, our results support a swift implementation of the reforms agreed with official lenders in the context of the new (3rd) bailout programme. These envisage the modernization of the country’s private sector insolvency framework and the creation of a more efficient model for the management of NPLs. A vigorous implementation of these reforms is key to allowing a resumption of positive credit creation, by freeing up valuable resources that are currently trapped in unproductive sectors of the domestic economy. This, in turn, would facilitate a speedier return to positive economic growth and a gradual reduction in unemployment.