10 research outputs found

    Real-time Event Detection Using Self-Evolving Contextual Analysis (SECA) Approach

    Get PDF

    Specious Sites: Tracking the Spread and Sway of Spurious News Stories at Scale

    Full text link
    Misinformation, propaganda, and outright lies proliferate on the web, with some narratives having dangerous real-world consequences for public health, elections, and individual safety. However, despite the impact of misinformation, the research community largely lacks automated and programmatic approaches for tracking news narratives across online platforms. In this work, utilizing daily scrapes of 1,334 unreliable news websites, the large language model MPNet, and DP-Means clustering, we introduce a system to automatically identify and track the narratives spread within online ecosystems. Identifying 52,036 narratives on these 1,334 websites, we describe the most prevalent narratives spread in 2022 and identify the most influential websites that originate and amplify narratives. Finally, we show how our system can be utilized to detect new narratives originating from unreliable news websites and to aid fact-checkers in more quickly addressing misinformation. We release code and data at https://github.com/hanshanley/specious-sites.
    Comment: Accepted to IEEE S&P 2024. Updated email.
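    The abstract only outlines the pipeline at a high level; the sketch below illustrates how such a system might be wired together, assuming the sentence-transformers checkpoint "all-mpnet-base-v2" as the MPNet encoder and a small hand-rolled DP-Means routine. The threshold lam and the example passages are illustrative placeholders, not the authors' configuration.

```python
# A minimal sketch, assuming the "all-mpnet-base-v2" checkpoint as the MPNet
# encoder and a hand-rolled DP-Means routine; lam and the passages are
# illustrative, not the authors' exact setup.
import numpy as np
from sentence_transformers import SentenceTransformer

def dp_means(X, lam, n_iter=10):
    """Tiny DP-Means: a point opens a new cluster (narrative) whenever it is
    farther than lam from every existing centroid."""
    centroids = [X[0].copy()]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        for i, x in enumerate(X):
            dists = np.linalg.norm(np.vstack(centroids) - x, axis=1)
            if dists.min() > lam:
                centroids.append(x.copy())       # open a new narrative cluster
                assign[i] = len(centroids) - 1
            else:
                assign[i] = int(dists.argmin())
        for k in range(len(centroids)):          # recompute centroids
            members = X[assign == k]
            if len(members):
                centroids[k] = members.mean(axis=0)
    return assign, np.vstack(centroids)

model = SentenceTransformer("all-mpnet-base-v2")          # MPNet sentence encoder
passages = ["example article passage one", "example article passage two"]
emb = model.encode(passages, normalize_embeddings=True)   # unit-normalised embeddings
labels, _ = dp_means(np.asarray(emb), lam=0.8)            # lam controls narrative granularity
```

    Unlike k-means, DP-Means replaces the fixed cluster count with the single distance threshold lam, which is what lets the number of narrative clusters grow as new stories appear in the daily scrapes.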

    Combining Supervised and Unsupervised Learning to Detect and Semantically Aggregate Crisis-Related Twitter Content

    Get PDF
    Twitter is an immediate and almost ubiquitous platform and can therefore be a valuable source of information during disasters. Current methods for identifying and classifying crisis-related content are often based on single tweets, i.e., information already known from the past is neglected. In this paper, the combination of tweet-wise pre-trained neural networks and unsupervised semantic clustering is proposed and investigated. The intention is to (1) enhance the generalization capability of pre-trained models, (2) handle massive amounts of streaming data, (3) reduce information overload by identifying potentially crisis-related content, and (4) obtain a semantically aggregated data representation that allows for further automated, manual, and visual analyses. Latent representations of each tweet based on pre-trained sentence embedding models are used for both clustering and tweet classification. For fast, robust, and time-continuous processing, subsequent time periods are clustered individually according to a Chinese restaurant process. Clusters without any tweet classified as crisis-related are pruned. Data aggregation over time is ensured by merging semantically similar clusters. A comparison of our hybrid method with a similar clustering approach, as well as first quantitative and qualitative results from experiments with two different labeled data sets, demonstrates the great potential for crisis-related Twitter stream analyses.
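    As a rough illustration of the window-wise processing described above, the sketch below sequentially assigns unit-normalised tweet embeddings to clusters (a simple similarity-threshold stand-in for the Chinese restaurant process), prunes clusters containing no tweet classified as crisis-related, and merges surviving clusters with semantically similar clusters from earlier windows. Function names and thresholds are assumptions for this sketch, not the paper's implementation.

```python
# Hedged sketch of per-window clustering, pruning, and cross-window merging.
# embs is assumed to be an (n, d) numpy array of unit-normalised embeddings.
import numpy as np

def cluster_window(embs, crisis_labels, new_cluster_thr=0.6):
    """Assign one window's tweet embeddings to clusters; a tweet opens a new
    cluster when it is too dissimilar to every existing centroid (a crude
    stand-in for the Chinese restaurant process)."""
    clusters = []                                   # each: centroid, member indices, crisis flag
    for i, e in enumerate(embs):
        sims = [float(c["centroid"] @ e) for c in clusters]
        if not sims or max(sims) < new_cluster_thr:
            clusters.append({"centroid": e.copy(), "idx": [i], "crisis": bool(crisis_labels[i])})
        else:
            c = clusters[int(np.argmax(sims))]
            c["idx"].append(i)
            c["crisis"] = c["crisis"] or bool(crisis_labels[i])
            centroid = embs[c["idx"]].mean(axis=0)
            c["centroid"] = centroid / np.linalg.norm(centroid)
    return [c for c in clusters if c["crisis"]]     # prune clusters with no crisis-related tweet

def merge_windows(previous, current, merge_thr=0.8):
    """Aggregate over time: fold a new cluster into the most similar earlier
    cluster when their centroids are close enough, otherwise keep it as new."""
    merged = list(previous)
    for c in current:
        sims = [float(c["centroid"] @ p["centroid"]) for p in merged]
        if sims and max(sims) >= merge_thr:
            merged[int(np.argmax(sims))]["idx"] += c["idx"]
        else:
            merged.append(c)
    return merged
```

    The crisis flags here would come from the tweet-wise pre-trained classifier mentioned in the abstract; the clustering itself only consumes the sentence embeddings.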

    Concept Drift Adaptation in Text Stream Mining Settings: A Comprehensive Review

    Full text link
    Due to the advent and growing popularity of the Internet, people have been producing and disseminating textual data in several ways, such as reviews, social media posts, and news articles. As a result, numerous researchers have been working on discovering patterns in textual data, especially because social media posts function as social sensors, indicating people's opinions, interests, etc. However, most natural language processing tasks are addressed using traditional machine learning methods and static datasets. This setting can lead to several problems, such as an outdated dataset, which may no longer correspond to reality, and an outdated model, whose performance degrades over time. Concept drift, which corresponds to changes in data distributions and patterns, is another aspect that emphasizes these issues. In a text stream scenario, it is even more challenging due to characteristics such as high speed and sequentially arriving data. In addition, models for this type of scenario must adhere to the constraints mentioned above while learning from the stream, storing texts only for a limited time and consuming little memory. In this study, we performed a systematic literature review of concept drift adaptation in text stream scenarios. Considering well-defined criteria, we selected 40 papers to unravel aspects such as text drift categories, types of text drift detection, model update mechanisms, the addressed stream mining tasks, types of text representations, and text representation update mechanisms. In addition, we discuss drift visualization and simulation and list the real-world datasets used in the selected papers. This paper therefore comprehensively reviews concept drift adaptation in text stream mining scenarios.
    Comment: 49 pages.
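    The review covers many drift detection and model update mechanisms; as a minimal illustration of the general idea, the sketch below compares the term distribution of the current text window against a reference window and flags drift when their Hellinger distance exceeds a threshold. The statistic and the threshold are assumptions for illustration, not a method taken from any particular surveyed paper.

```python
# Minimal sketch of distribution-based drift detection on a text stream.
from collections import Counter
import math

def window_profile(texts):
    """Normalised term-frequency profile of one window of the stream."""
    counts = Counter(tok for t in texts for tok in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def hellinger(p, q):
    """Hellinger distance between two term distributions (0 = identical, 1 = disjoint)."""
    vocab = set(p) | set(q)
    s = sum((math.sqrt(p.get(w, 0.0)) - math.sqrt(q.get(w, 0.0))) ** 2 for w in vocab)
    return math.sqrt(s / 2)

def detect_drift(reference_texts, window_texts, threshold=0.4):
    """Signal drift (i.e., a model update is needed) when the windows diverge too much."""
    return hellinger(window_profile(reference_texts), window_profile(window_texts)) > threshold
```

    Keeping only per-window profiles respects the constraint mentioned above of storing texts for a limited time and using little memory.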

    Development of research trends evolution model for computer science for Malaysian publication

    Get PDF
    Nowadays, many studies investigate research trends by applying text mining to publication data. However, most of these studies only investigated the gaps in existing research trends models, and neither the execution of text mining on bibliometric elements nor the timeline windows representing the "trends" was clarified. Thus, this study aimed to develop a conceptual model for research trends in Malaysian publications, specifically incorporating the textual element of bibliometrics and the execution of timeline windows to identify research trends. In the context of research trends, the evolution or growth of a research area from one period to another is important: what has happened, what is currently happening, and which research trends are likely to emerge in the near future, so that others can continue the research development. The elements of the newly developed model were extracted from the literature review and adapted from one of the selected models. The new model consists of three stages: the first stage consists of three elements, beginning with selecting the document collection; the second stage is the selection of the bibliometric element; and the third stage is the execution of text mining, namely co-word analysis on the selected textual bibliometric element with the implementation of two timeline windows (fixed-time and sliding-time windows). The execution of the third stage was supported by the CiteSpace tool. The newly developed model was tested with data downloaded from two databases, Scopus (10,052 publications) and Web of Science (WoS) (22,088 publications), covering the period from 1995 to 2019. The study found that the research trend pattern became more active from 2002 onwards and that the research topics became fresher and more unconventional throughout the timelines. Artificial intelligence, network communication, and wireless sensor networks were the hottest and most enduring topics, while knowledge management, internet banking, online shopping, and eCommerce were alternative options for computer science researchers. The evolution and blooming within each timeline show that researchers are investigating each topic thoroughly. In addition, some smaller topics do not appear in the fixed timeline windows but instead emerge from the sliding timeline windows, such as system development, shared banking services, virtual team collaboration, and internet policy. The study also captured highlighted keywords that could provide hints or serve as initial ideas for the next research journey. Expert evaluation and validation were carried out, as the interpretation of the experimental results requires experts' expertise, experience, and views. Semi-structured interviews were conducted with thirteen experts with substantial expertise in research and development. From the discussion, most experts agreed that the model could help others identify research trends and potential new research topics for future research. The newly developed model could benefit those who need hints for their next exploration and help those keen to understand how to execute text mining on bibliometric elements.
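    To make the third stage more concrete, the following sketch shows one way co-word analysis could be run over both fixed and sliding timeline windows on keyword lists extracted from publication records; the record layout, window sizes, and example data are assumptions for illustration, not the thesis' CiteSpace workflow.

```python
# Hedged sketch of co-word analysis over fixed vs. sliding timeline windows.
from collections import Counter
from itertools import combinations

def co_word_counts(records):
    """records: iterable of (year, [keywords]) pairs -> keyword-pair co-occurrence counts."""
    pairs = Counter()
    for _, keywords in records:
        for a, b in combinations(sorted(set(keywords)), 2):
            pairs[(a, b)] += 1
    return pairs

def windows(records, start, end, size, step):
    """step == size gives fixed (non-overlapping) windows; step < size gives sliding windows."""
    lo = start
    while lo <= end:
        hi = min(lo + size - 1, end)
        yield (lo, hi), [r for r in records if lo <= r[0] <= hi]
        lo += step

# Illustrative records only; real input would be keyword fields from Scopus/WoS exports.
records = [(1996, ["artificial intelligence", "expert systems"]),
           (2005, ["wireless sensor networks", "routing"]),
           (2006, ["wireless sensor networks", "energy efficiency"])]
for span, recs in windows(records, 1995, 2019, size=5, step=5):   # fixed 5-year windows
    print(span, co_word_counts(recs).most_common(3))
```

    Running the same loop with, say, step=1 and size=5 yields the sliding-window view, which is how smaller topics that are invisible in fixed windows can still surface.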

    Interactions in Information Spread

    Full text link
    Since the development of writing 5,000 years ago, human-generated data has been produced at an ever-increasing pace. Classical archival methods aimed at easing information retrieval, but nowadays archiving alone is no longer enough. The amount of data generated daily is beyond human comprehension and calls for new information retrieval strategies. Instead of referencing every single data piece, as in traditional archival techniques, a more relevant approach consists in understanding the overall ideas conveyed in data flows. To spot such general tendencies, a precise comprehension of the underlying data generation mechanisms is required. In the rich literature tackling this problem, the question of information interaction remains nearly unexplored. First, we investigate the frequency of such interactions. Building on recent advances in Stochastic Block Modelling, we explore the role of interactions in several social networks and find that interactions are rare in these datasets. Then, we ask how interactions evolve over time: earlier data pieces should not have an everlasting influence on subsequent data generation mechanisms. We model this using advances in dynamic network inference and conclude that interactions are brief. Finally, we design a framework that jointly models rare and brief interactions based on Dirichlet-Hawkes Processes. We argue that this new class of models fits brief and sparse interaction modelling. We conduct a large-scale application on Reddit and find that interactions play a minor role in this dataset. From a broader perspective, our work results in a collection of highly flexible models and in a rethinking of core concepts of machine learning. Consequently, we open a range of novel perspectives, both in terms of real-world applications and in terms of technical contributions to machine learning.
    Comment: PhD thesis defended on 2022/09/1
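    As a small illustration of the building block behind the Dirichlet-Hawkes framework mentioned above, the sketch below implements a univariate Hawkes intensity with an exponential kernel and an Ogata-thinning simulator; a large decay rate beta makes the influence of past events fade quickly, which mirrors the finding that interactions are brief. All parameter values are illustrative assumptions, not the thesis' estimates.

```python
# Minimal Hawkes-process sketch: self-exciting intensity with exponential decay.
import math
import random

def hawkes_intensity(t, history, mu=0.2, alpha=0.5, beta=2.0):
    """lambda(t) = mu + alpha * sum_{t_i < t} beta * exp(-beta * (t - t_i));
    a large beta means past events stop mattering quickly (brief interactions)."""
    return mu + alpha * sum(beta * math.exp(-beta * (t - ti)) for ti in history if ti < t)

def simulate(T=10.0, mu=0.2, alpha=0.5, beta=2.0, seed=0):
    """Ogata thinning: propose candidates from an upper bound on the intensity,
    accept each with probability lambda(candidate) / bound."""
    random.seed(seed)
    events, t = [], 0.0
    while t < T:
        lam_bar = hawkes_intensity(t, events, mu, alpha, beta) + alpha * beta  # bound just after t
        t += random.expovariate(lam_bar)
        if t < T and random.random() <= hawkes_intensity(t, events, mu, alpha, beta) / lam_bar:
            events.append(t)
    return events

print(simulate())
```

    The Dirichlet part of the thesis' framework additionally assigns events to latent topics/clusters; the plain Hawkes dynamics shown here only capture the temporal self-excitation.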