100 research outputs found

    Automatic fake news detection on Twitter

    Nowadays, information is easily accessible online, from articles by reliable news agencies to reports from independent reporters, to extreme views published by unknown individuals. Moreover, social media platforms are becoming increasingly important in everyday life, where users can obtain the latest news and updates, share links to any information they want to spread, and post their own opinions. Such information may create difficulties for information consumers as they try to distinguish fake news from genuine news. Indeed, users may not necessarily be aware that the information they encounter is false, and may not have the time or resources to fact-check every claim they encounter online. With the amount of information created and shared daily, it is also not feasible for journalists to manually fact-check every published news article, sentence or tweet. Therefore, an automatic fact-checking system that identifies check-worthy claims and tweets, and then fact-checks them, can help inform the public about fake news circulating online. Existing fake news detection systems mostly rely on the computational power of machine learning models to identify fake news automatically. Some researchers have focused on extracting the semantic and contextual meaning from news articles, statements, and tweets. These methods aim to identify fake news by analysing the differences in writing style between fake news and factual news. Other researchers have investigated using social network information to detect fake news. These methods aim to distinguish fake news from factual news based on how the news spreads and on statistical information about the users who engage with it. In this thesis, we propose a novel end-to-end fake news detection framework that leverages both textual features and social network features, which can be extracted from news, tweets, and their engaging users. Specifically, our proposed end-to-end framework is able to process a Twitter feed, identify check-worthy tweets and sentences using textual features and embedded entity features, and fact-check the claims using previously unexplored information, such as existing fake news collections and user network embeddings. Our ultimate aim is to rank tweets and claims based on their check-worthiness, so as to focus the available computational power on fact-checking the tweets and claims that are important and potentially fake. In particular, we leverage existing fake news collections to identify recurring fake news, while we exploit Twitter users’ engagement with check-worthy news to identify fake news that is spreading on Twitter. To identify fake news effectively, we first propose the fake news detection framework (FNDF), which consists of a check-worthiness identification phase and a fact-checking phase. These two phases are divided into three tasks: check-worthiness identification (Phase 1, Task 1); recurring fake news identification (Phase 2, Task 2); and social network structure-assisted fake news detection (Phase 2, Task 3). We conduct experiments on two large publicly available datasets, namely the MM-COVID and stance detection (SD) datasets. The experimental results show that our proposed framework, FNDF, can indeed identify fake news more effectively than existing state-of-the-art (SOTA) models, with significant increases in F1 score of 23.2% and 4.0% on the two tested datasets, respectively.
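    The two-phase shape of this framework can be pictured with a minimal, purely illustrative sketch; the cue-word and overlap heuristics below are placeholders standing in for the entity-aware language models, fake news collection matching and user network embeddings that the thesis actually proposes.

```python
# Minimal, purely illustrative pipeline: rank tweets by check-worthiness
# (Phase 1), then fact-check only the top-ranked ones (Phase 2).
# The heuristics are placeholders, not the FNDF models described above.
from typing import List, Tuple

CLAIM_CUES = {"cure", "vaccine", "confirmed", "breaking", "proof"}
KNOWN_FAKE_CLAIMS = ["garlic cures covid", "5g spreads the virus"]


def check_worthiness_score(tweet: str) -> float:
    """Phase 1 placeholder: more claim-like cue words -> more check-worthy."""
    tokens = set(tweet.lower().split())
    return len(tokens & CLAIM_CUES) / max(len(tokens), 1)


def is_recurring_fake(tweet: str) -> bool:
    """Phase 2 placeholder: naive token overlap with an existing fake news collection."""
    tokens = set(tweet.lower().split())
    return any(len(tokens & set(claim.split())) >= 2 for claim in KNOWN_FAKE_CLAIMS)


def run_pipeline(feed: List[str], top_k: int = 2) -> List[Tuple[str, float]]:
    """Return the top-k most check-worthy tweets that look like recurring fake news."""
    ranked = sorted(feed, key=check_worthiness_score, reverse=True)[:top_k]
    return [(t, check_worthiness_score(t)) for t in ranked if is_recurring_fake(t)]


feed = [
    "breaking: garlic cures covid, doctors confirmed",
    "lovely weather in Glasgow today",
    "new proof that 5g spreads the virus",
]
print(run_pipeline(feed))
```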
    To identify check-worthy tweets and claims effectively, we incorporate embedded entities with language representations to form a vector representation of a given text and to determine whether the text is check-worthy. We conduct experiments using three publicly available datasets, namely the CLEF 2019 and CLEF 2020 CheckThat! Lab check-worthy sentence detection datasets, and the CLEF 2021 CheckThat! Lab check-worthy tweet detection dataset. The experimental results show that combining entity representations with language model representations enhances the language model’s performance in identifying check-worthy tweets and sentences. Specifically, combining embedded entities with the language model results in as much as a 177.6% increase in MAP when ranking check-worthy tweets, and a 92.9% increase when ranking check-worthy sentences. Moreover, we conduct an ablation study on the proposed end-to-end framework, FNDF, and show that including a model for identifying check-worthy tweets and claims in our end-to-end framework can significantly increase the F1 score, by as much as 14.7%, compared to not including this model. To identify recurring fake news effectively, we propose an ensemble of BM25 scores and the BERT language model. Experiments were conducted on two datasets, namely the WSDM Cup 2019 Fake News Challenge dataset and the MM-COVID dataset. The experimental results show that enriching the BERT language model with BM25 scores helps the BERT model identify fake news significantly more accurately, by 4.4%. Moreover, the ablation study on the end-to-end fake news detection framework, FNDF, shows that including the recurring fake news identification model in our proposed framework results in a significant increase in F1 score of as much as 15.5%, compared to not including this task. To leverage the user network structure in detecting fake news, we first obtain unsupervised user network embeddings based on the users’ friendship or follower connections on Twitter. Next, we use the embeddings of the users who engaged with the news to represent a check-worthy tweet/claim, and thus predict whether it is fake news. Our results show that using user network embeddings to represent check-worthy tweets/sentences significantly outperforms the SOTA model, which uses language models to represent the tweets/sentences and complex networks requiring handcrafted features, by 12.0% in terms of F1 score. Furthermore, including the user network-assisted fake news detection model in our end-to-end framework, FNDF, significantly increases the F1 score by as much as 29.3%. Overall, this thesis shows that an end-to-end fake news detection framework, FNDF, which identifies check-worthy tweets and claims and then fact-checks them by identifying recurring fake news and leveraging social network users’ connections, can effectively identify fake news online.
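    As a hedged illustration of the user-network idea described above (not the thesis implementation), one could represent each check-worthy tweet by averaging precomputed network embeddings of its engaging users and train a standard classifier on top; the data below are random placeholders.

```python
# Illustrative sketch: represent each check-worthy tweet by the mean of the
# network embeddings of the users who engaged with it, then train a binary
# fake/real classifier on top. All data here are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression


def tweet_representation(engaging_user_ids, user_embeddings, dim=128):
    """Average the precomputed network embeddings of the engaging users."""
    vectors = [user_embeddings[u] for u in engaging_user_ids if u in user_embeddings]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)


# In practice the embeddings would come from an unsupervised graph embedding
# of the follower/friendship network (e.g. node2vec); random vectors stand in here.
rng = np.random.default_rng(0)
user_embeddings = {f"user_{i}": rng.normal(size=128) for i in range(1000)}

# Toy engagement data: (ids of engaging users, label) with label 1 = fake news.
engagements = [
    ([f"user_{j}" for j in rng.integers(0, 1000, size=20)], i % 2)
    for i in range(200)
]

X = np.stack([tweet_representation(users, user_embeddings) for users, _ in engagements])
y = np.array([label for _, label in engagements])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print("training accuracy:", clf.score(X, y))
```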

    Towards More Human-Like Text Summarization: Story Abstraction Using Discourse Structure and Semantic Information.

    PhD Thesis. With the massive amount of textual data being produced every day, the ability to effectively summarise text documents is becoming increasingly important. Automatic text summarization entails the selection and generalisation of the most salient points of a text in order to produce a summary. Approaches to automatic text summarization fall into one of two categories: abstractive or extractive approaches. Extractive approaches involve the selection and concatenation of spans of text from a given document. Research in automatic text summarization began with extractive approaches, scoring and selecting sentences based on the frequency and proximity of words. In contrast, abstractive approaches are based on a process of interpretation, semantic representation, and generalisation. This is closer to the processes that psycholinguistics tells us humans perform when reading, remembering and summarizing. However, in the sixty years since its inception, the field has largely remained focused on extractive approaches. This thesis aims to answer the following questions. Does knowledge about the discourse structure of a text aid the recognition of summary-worthy content? If so, which specific aspects of discourse structure provide the greatest benefit? Can this structural information be used to produce abstractive summaries, and are these more informative than extractive summaries? To thoroughly examine these questions, they are each considered in isolation, and as a whole, on the basis of both manual and automatic annotations of texts. Manual annotations facilitate an investigation into the upper bounds of what can be achieved by the approach described in this thesis. Results based on automatic annotations show how this same approach is impacted by the current performance of imperfect preprocessing steps, and indicate its feasibility. Extractive approaches to summarization are intrinsically limited by the surface text of the input document, in terms of both content selection and summary generation. Beginning with a motivation for moving away from these commonly used methods of producing summaries, I set out my methodology for a more human-like approach to automatic summarization which examines the benefits of using discourse-structural information. The potential benefit of this is twofold: moving away from a reliance on the wording of a text in order to detect important content, and generating concise summaries that are independent of the input text. The importance of discourse structure in signalling key textual material has previously been recognised; however, it has seen little applied use in the field of automatic summarization. A consideration of evaluation metrics also features significantly in the proposed methodology. These play a role both in preprocessing steps and in the evaluation of the final summary product. I provide evidence which indicates a disparity between the performance of coreference resolution systems as indicated by their standard evaluation metrics, and their performance in extrinsic tasks. Additionally, I point out a range of problems for the most commonly used metric, ROUGE, and suggest that at present summary evaluation should not be automated. To illustrate the general solutions proposed to the questions raised in this thesis, I use Russian Folk Tales as an example domain. This genre of text has been studied in depth and, most importantly, it has a rich narrative structure that has been recorded in detail.
    The rules of this formalism are suitable for the narrative structure reasoning system presented as part of this thesis. The specific discourse-structural elements considered cover the narrative structure of a text, coreference information, and the story-roles fulfilled by different characters. The proposed narrative structure reasoning system produces high-level interpretations of a text according to the rules of a given formalism. For the example domain of Russian Folktales, a system is implemented which constructs such interpretations of a tale according to an existing set of rules and restrictions. I discuss how this process of detecting narrative structure can be transferred to other genres, and a key factor in the success of this process: how constrained the rules of the formalism are. The system enumerates all possible interpretations according to a set of constraints, meaning a less restricted rule set leads to a greater number of interpretations. For the example domain, sentence-level discourse-structural annotations are then used to predict summary-worthy content. The results of this study are analysed in three parts. First, I examine the relative utility of individual discourse features and provide a qualitative discussion of these results. Second, the predictive abilities of these features when manually annotated are compared to when they are annotated with varying degrees of automation. Third, these results are compared to the predictive capabilities of classic extractive algorithms. I show that discourse features can be used to more accurately predict summary-worthy content than classic extractive algorithms. This holds true for automatically obtained annotations, but with a much clearer difference when using manual annotations. The classifiers learned in the prediction of summary-worthy sentences are subsequently used to inform the production of both extractive and abstractive summaries of a given length. A human-based evaluation is used to compare these summaries, as well as the outputs of a classic extractive summarizer. I analyse the impact of knowledge about discourse structure, obtained both manually and automatically, on summary production. This allows for some insight into the knock-on effects on summary production that can occur from inaccurate discourse information (narrative structure and coreference information). My analyses show that even given inaccurate discourse information, the resulting abstractive summaries are considered more informative than their extractive counterparts. With human-level knowledge about discourse structure, these results are even clearer. In conclusion, this research provides a framework which can be used to detect the narrative structure of a text, and shows its potential to provide a more human-like approach to automatic summarization. I show the limit of what is achievable with this approach both when manual annotations are obtainable, and when only automatic annotations are feasible. Nevertheless, this thesis supports the suggestion that the future of summarization lies with abstractive and not extractive techniques.
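    As a purely illustrative sketch of the prediction step described above, a standard classifier can be trained on sentence-level discourse features; the feature names and values below are invented placeholders rather than the feature set used in the thesis.

```python
# Illustrative only: a classifier over sentence-level discourse features
# predicting whether a sentence is summary-worthy. Feature names and values
# are invented placeholders, not the feature set used in the thesis.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Example features per sentence:
# [narrative_function_id, mentions_protagonist, n_coref_chains_touched, relative_position]
X_train = np.array([
    [3, 1, 4, 0.05],
    [0, 0, 1, 0.40],
    [7, 0, 2, 0.95],
    [2, 1, 3, 0.10],
    [1, 0, 0, 0.60],
    [6, 1, 2, 0.80],
])
y_train = np.array([1, 0, 1, 1, 0, 1])  # 1 = sentence appears in the reference summary

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

X_new = np.array([[5, 0, 3, 0.50]])
print("summary-worthy probability:", clf.predict_proba(X_new)[0, 1])
```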

    Cross-Platform Text Mining and Natural Language Processing Interoperability - Proceedings of the LREC2016 conference

    No abstract available

    Essential Speech and Language Technology for Dutch: Results by the STEVIN-programme

    Computational Linguistics; Germanic Languages; Artificial Intelligence (incl. Robotics); Computing Methodologies

    Knowledge acquisition for coreference resolution

    This thesis addresses the problem of statistical coreference resolution. Theoretical studies describe coreference as a complex linguistic phenomenon, affected by various different factors. State-of-the-art statistical approaches, on the contrary, rely on rather simple knowledge-poor modeling. This thesis aims at bridging the gap between the theory and the practice. We use insights from linguistic theory to identify relevant linguistic parameters of co-referring descriptions. We consider different types of information, from the most shallow name-matching measures to deeper syntactic, semantic, and discourse knowledge. We empirically assess the validity of the investigated theoretic predictions for the corpus data. Our data-driven evaluation experiments confirm that various linguistic parameters, suggested by theoretical studies, interact with coreference and may therefore provide valuable information for resolution systems. At the same time, our study raises several issues concerning the coverage of theoretic claims. It thus brings feedback to linguistic theory. We use the investigated knowledge sources to build a linguistically informed statistical coreference resolution engine.
    This framework allows us to combine the flexibility and robustness of a machine learning-based approach with a wide variety of data from different levels of linguistic description. Our evaluation experiments with different machine learners show that our linguistically informed model, on the one hand, outperforms algorithms based on a single knowledge source and, on the other hand, yields the best result on the MUC-7 data reported in the literature (an F-score of 65.4% with the SVM-light learning algorithm). The learning curves for our classifiers show no signs of convergence. This suggests that our approach makes a good basis for further experimentation: one can obtain even better results by annotating more material or by using the existing data more intelligently. Our study proves that statistical approaches to the coreference resolution task can and should benefit from linguistic theories: even imperfect knowledge, extracted from raw text data with off-the-shelf error-prone NLP modules, helps achieve significant improvements.
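    A schematic sketch of a feature-based pairwise coreference classifier in the spirit described above; scikit-learn's SVC stands in for SVM-light, and the toy features only hint at the shallow-to-deep knowledge sources investigated.

```python
# Schematic sketch of a feature-based pairwise coreference classifier.
# sklearn's SVC stands in for SVM-light; features and data are illustrative only.
import numpy as np
from sklearn.svm import SVC

# Features for an (antecedent, anaphor) mention pair, e.g.:
# [string_match, head_match, gender_agree, number_agree, sentence_distance, semclass_agree]
X_pairs = np.array([
    [1, 1, 1, 1, 0, 1],   # "Mr. Smith" ... "Smith"
    [0, 0, 1, 1, 1, 1],   # "the president" ... "he"
    [0, 0, 0, 1, 2, 0],   # "the company" ... "she"
    [0, 1, 1, 1, 3, 1],   # "a red car" ... "the car"
    [0, 0, 0, 0, 5, 0],   # unrelated mentions far apart
    [0, 0, 1, 0, 4, 0],   # partial agreement only
])
y_pairs = np.array([1, 1, 0, 1, 0, 0])  # 1 = coreferent, 0 = not coreferent

model = SVC(kernel="rbf").fit(X_pairs, y_pairs)
print(model.predict([[1, 1, 1, 1, 1, 1]]))  # classify a new mention pair
```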

    A requirement-driven approach for modelling software architectures

    Throughout the software development lifecycle (SDLC) there are many pitfalls which software engineers have to face. Regardless of the methodology adopted, whether classic methodologies such as waterfall or more modern ones such as agile or scrum, defects can be injected in any phase of the SDLC. The main avenue to detect and remove defects is through Quality Assurance (QA) activities. These planned activities to detect, fix and remove defects occur later in the lifecycle, and less effort is spent in the initial phases of the SDLC to detect, remove or prevent the injection of defects. In fact, the cost of detecting and fixing a defect in the later phases of the SDLC, such as development, deployment, maintenance and support, is much higher than detecting and fixing defects in the initial phases of the SDLC. The software architecture of the application, which can be regarded as the fundamental structures of a software system, also has an influence on defect injection. The impact of detecting and fixing defects late is exacerbated for software architectures that are distributed, such as service-oriented architectures or microservices. Thus, the aim of this research is to develop a semi-automated framework to translate requirements into design, so as to reduce the introduction of defects from the early phases of the SDLC. Part of the objectives of this work is to conceptualize a design for architectural paradigms such as object-oriented and service-oriented programming. The proposed framework uses a series of techniques from Natural Language Processing (NLP) and a blend of techniques from intelligent learning systems, such as ontologies and neural networks, to partially automate the translation of requirements into a design. The novelty focuses on moulding the design into an architecture which is better adapted for distributed systems. The framework is evaluated with a case study where the design and architecture produced by the framework are compared to a design and architecture drawn up by a software architect. In addition, the evaluation using a case study aims to demonstrate the use of the framework and how the individual design and architecture artefacts fare.
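    As a hypothetical illustration of one small NLP step such a framework might include, candidate classes and operations can be mined from a requirement sentence with spaCy; the actual framework combines NLP with ontologies and neural networks and is far more involved.

```python
# Hypothetical sketch of one early step: mining candidate classes (noun phrases)
# and candidate operations (verbs) from a requirement sentence with spaCy.
# This is not the framework described above, only an illustration of the idea.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

requirement = ("The system shall allow a registered customer to place an order "
               "and track its delivery status.")

doc = nlp(requirement)
candidate_classes = {chunk.root.lemma_ for chunk in doc.noun_chunks}
candidate_operations = {tok.lemma_ for tok in doc if tok.pos_ == "VERB"}

print("candidate classes:   ", candidate_classes)     # e.g. system, customer, order, status
print("candidate operations:", candidate_operations)  # e.g. allow, place, track
```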

    Entity-Oriented Search

    This open access book covers all facets of entity-oriented search—where “search” can be interpreted in the broadest sense of information access—from a unified point of view, and provides a coherent and comprehensive overview of the state of the art. It represents the first synthesis of research in this broad and rapidly developing area. Selected topics are discussed in-depth, the goal being to establish fundamental techniques and methods as a basis for future research and development. Additional topics are treated at a survey level only, containing numerous pointers to the relevant literature. A roadmap for future research, based on open issues and challenges identified along the way, rounds out the book. The book is divided into three main parts, sandwiched between introductory and concluding chapters. The first two chapters introduce readers to the basic concepts, provide an overview of entity-oriented search tasks, and present the various types and sources of data that will be used throughout the book. Part I deals with the core task of entity ranking: given a textual query, possibly enriched with additional elements or structural hints, return a ranked list of entities. This core task is examined in a number of different variants, using both structured and unstructured data collections, and numerous query formulations. In turn, Part II is devoted to the role of entities in bridging unstructured and structured data. Part III explores how entities can enable search engines to understand the concepts, meaning, and intent behind the query that the user enters into the search box, and how they can provide rich and focused responses (as opposed to merely a list of documents)—a process known as semantic search. The final chapter concludes the book by discussing the limitations of current approaches, and suggesting directions for future research. Researchers and graduate students are the primary target audience of this book. A general background in information retrieval is sufficient to follow the material, including an understanding of basic probability and statistics concepts as well as a basic knowledge of machine learning concepts and supervised learning algorithms
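    The core entity ranking task described above can be illustrated with a toy sketch that ranks entities by the textual similarity of their descriptions to a keyword query; real entity-oriented search systems use much richer models (structured data, fielded entity representations, learning to rank).

```python
# Toy illustration of entity ranking: given a keyword query, rank entities by
# the TF-IDF cosine similarity of their textual descriptions to the query.
# Entities and descriptions are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

entities = {
    "Ada Lovelace": "English mathematician regarded as the first computer programmer.",
    "Alan Turing": "British mathematician and pioneer of theoretical computer science.",
    "Marie Curie": "Physicist and chemist who conducted pioneering research on radioactivity.",
}

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(entities.values())

def rank_entities(query: str):
    """Return (entity, score) pairs sorted by similarity to the query."""
    query_vec = vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_matrix).ravel()
    return sorted(zip(entities.keys(), scores), key=lambda x: x[1], reverse=True)

print(rank_entities("pioneer of computer science"))
```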