10 research outputs found
Real-time Event Detection Using Self-Evolving Contextual Analysis (SECA) Approach
Specious Sites: Tracking the Spread and Sway of Spurious News Stories at Scale
Misinformation, propaganda, and outright lies proliferate on the web, with
some narratives having dangerous real-world consequences on public health,
elections, and individual safety. However, despite the impact of
misinformation, the research community largely lacks automated and programmatic
approaches for tracking news narratives across online platforms. In this work,
utilizing daily scrapes of 1,334 unreliable news websites, the large-language
model MPNet, and DP-Means clustering, we introduce a system to automatically
identify and track the narratives spread within online ecosystems. Identifying
52,036 narratives on these 1,334 websites, we describe the most prevalent
narratives spread in 2022 and identify the most influential websites that
originate and amplify narratives. Finally, we show how our system can be
utilized to detect new narratives originating from unreliable news websites and
to aid fact-checkers in more quickly addressing misinformation. We release code
and data at https://github.com/hanshanley/specious-sites. Accepted to IEEE S&P 2024.
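The pipeline above embeds articles with the MPNet language model and groups them into narratives with DP-Means clustering, which, unlike k-means, does not fix the number of clusters upfront. As a rough illustration of that clustering step (a generic DP-Means sketch in NumPy, not the authors' implementation; the distance penalty `lambda_` and the toy data are assumptions, and in practice the rows of `X` would be MPNet sentence embeddings):

```python
import numpy as np

def dp_means(X, lambda_, n_iter=10):
    """DP-Means: k-means-like updates, but a point farther than lambda_
    from every centroid opens a new cluster (so k is not fixed upfront)."""
    centroids = [X[0].copy()]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        for i, x in enumerate(X):
            dists = np.linalg.norm(np.asarray(centroids) - x, axis=1)
            j = int(np.argmin(dists))
            if dists[j] > lambda_:        # too far from all centroids
                centroids.append(x.copy())
                j = len(centroids) - 1
            assign[i] = j
        for j in range(len(centroids)):   # re-estimate centroids
            members = X[assign == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return assign, np.asarray(centroids)
```

The single penalty `lambda_` replaces the choice of k, which suits narrative tracking where the number of distinct stories is unknown and grows over time.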
Combining Supervised and Unsupervised Learning to Detect and Semantically Aggregate Crisis-Related Twitter Content
Twitter is an immediate and almost ubiquitous platform and can therefore be a valuable source of information during disasters. Current methods for identifying and classifying crisis-related content are often based on single tweets, i.e., already known information from the past is neglected. In this paper, the combination of tweet-wise pre-trained neural networks and unsupervised semantic clustering is proposed and investigated. The intention is (1) to enhance the generalization capability of pre-trained models, (2) to handle massive amounts of stream data, (3) to reduce information overload by identifying potentially crisis-related content, and (4) to obtain a semantically aggregated data representation that allows for further automated, manual, and visual analyses. Latent representations of each tweet, based on pre-trained sentence embedding models, are used for both clustering and tweet classification.
For fast, robust, and time-continuous processing, subsequent time periods are clustered individually according to a Chinese restaurant process. Clusters without any tweet classified as crisis-related are pruned. Data aggregation over time is ensured by merging semantically similar clusters. A comparison of our hybrid method with a similar clustering approach, as well as first quantitative and qualitative results from experiments with two different labeled data sets, demonstrates the great potential of this approach for crisis-related Twitter stream analyses.
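The period-wise clustering and cross-period merging described above can be approximated with a simple threshold-based online assignment over tweet embeddings. The sketch below is a deterministic stand-in for the probabilistic Chinese-restaurant-process assignment, not the paper's method; the similarity thresholds and the `(year, keywords)`-free record format are assumptions, and inputs are assumed to be sentence-embedding vectors:

```python
import numpy as np

def cluster_period(embs, sim_threshold=0.7):
    """Assign each embedding to its most similar centroid by cosine
    similarity, or open a new cluster when nothing is similar enough."""
    centroids, counts, assign = [], [], []
    for e in embs:
        e = e / np.linalg.norm(e)
        if centroids:
            sims = np.asarray(centroids) @ e
            j = int(np.argmax(sims))
            if sims[j] >= sim_threshold:
                # running-mean update, re-normalised to keep cosine valid
                c = (centroids[j] * counts[j] + e) / (counts[j] + 1)
                centroids[j] = c / np.linalg.norm(c)
                counts[j] += 1
                assign.append(j)
                continue
        centroids.append(e)
        counts.append(1)
        assign.append(len(centroids) - 1)
    return assign, centroids

def merge_periods(cents_a, cents_b, sim_threshold=0.8):
    """Map a new period's clusters onto semantically similar earlier ones,
    ensuring data aggregation over time."""
    mapping = {}
    for j, c in enumerate(cents_b):
        sims = np.asarray(cents_a) @ c
        k = int(np.argmax(sims))
        if sims[k] >= sim_threshold:
            mapping[j] = k   # cluster j merges into earlier cluster k
    return mapping
```

Pruning would simply drop any cluster whose members contain no tweet classified as crisis-related before the merge step.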
Concept Drift Adaptation in Text Stream Mining Settings: A Comprehensive Review
Due to the advent and increase in the popularity of the Internet, people have
been producing and disseminating textual data in several ways, such as reviews,
social media posts, and news articles. As a result, numerous researchers have
been working on discovering patterns in textual data, especially because social
media posts function as social sensors, indicating people's opinions,
interests, etc. However, most tasks regarding natural language processing are
addressed using traditional machine learning methods and static datasets. This
setting can lead to several problems, such as an outdated dataset, which may not correspond to reality, and an outdated model, whose performance degrades over time. Concept drift, which corresponds to changes in data distribution and patterns, further emphasizes these issues. In a text stream scenario, it is even more challenging due to characteristics such as high speed and sequentially arriving data. In addition, models for this
type of scenario must adhere to the constraints mentioned above while learning
from the stream by storing texts for a limited time and consuming low memory.
In this study, we performed a systematic literature review regarding concept
drift adaptation in text stream scenarios. Considering well-defined criteria,
we selected 40 papers to unravel aspects such as text drift categories, types
of text drift detection, model update mechanism, the addressed stream mining
tasks, types of text representations, and text representation update mechanism.
In addition, we discussed drift visualization and simulation and listed
real-world datasets used in the selected papers. Therefore, this paper
comprehensively reviews the concept drift adaptation in text stream mining
scenarios.
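The drift detection and model-update mechanisms surveyed above can be illustrated with a minimal windowed detector over a stream of text embeddings. This is a generic mean-shift sketch, not any specific method from the reviewed papers; the window size and threshold are assumptions, and it respects the stated constraint of storing texts only for a limited time (one reference window):

```python
import numpy as np

def drift_score(ref_window, cur_window):
    """Mean-shift between two windows of text embeddings (Euclidean norm)."""
    return float(np.linalg.norm(
        np.mean(cur_window, axis=0) - np.mean(ref_window, axis=0)))

def detect_drift(stream, window=100, threshold=0.5):
    """Flag window starts whose mean drifts away from the reference window,
    and reset the reference there (a simple model-update mechanism)."""
    ref = stream[:window]
    alarms = []
    for start in range(window, len(stream) - window + 1, window):
        cur = stream[start:start + window]
        if drift_score(ref, cur) > threshold:
            alarms.append(start)
            ref = cur  # adapt: the drifted window becomes the new reference
    return alarms
```

Real detectors from the surveyed literature replace the mean-shift score with statistical tests or classifier-error monitoring, but the window-compare-adapt loop is the shared skeleton.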
Development of research trends evolution model for computer science for Malaysian publication
Nowadays, many studies of research trends analyse publications using a text mining approach. However, most of these studies only investigated the gaps faced by existing research trends models, and neither the execution of text mining on bibliometric elements nor the timeline windows representing the "trends" was clarified. Thus, this study aimed to develop a conceptual model for research trends in Malaysian publications, specifically by incorporating the text element of bibliometrics and the execution of timeline windows to identify research trends. In the context of research trends, the evolution or growth of a research area from one period to another is important: what has happened, what is currently happening, and what potential research trends may emerge in the near future for others to continue the research development. The elements of the newly developed model were extracted from the literature review and adapted from one of the selected models. The new model consists of three stages: the first stage consists of three elements, including selecting the document collection; the second stage is the selection of the bibliometric element; and the third stage is the execution of text mining, i.e., co-word analysis of the selected textual bibliometric element under two timeline windows (fixed-time and sliding-time windows). The execution of the third stage is supported by the tool CiteSpace. The newly developed model was tested on data downloaded from two databases, Scopus (10,052 publications) and Web of Science (WoS) (22,088 publications), covering the period from 1995 to 2019. This study identified that the research trend pattern became more active from 2002 onwards. Besides that, research topics became fresher and more unconventional throughout the timelines.
Research topics on artificial intelligence, network communication, and wireless sensor networks are the hottest and most enduring topics. Besides that, knowledge management, internet banking, online shopping, and eCommerce were alternative options for computer science researchers. The evolution and blooming within each timeline show that researchers are investigating each topic thoroughly. In addition, some small topics do not appear in fixed timeline windows but instead emerge from sliding timeline windows, such as system development, shared banking service, virtual team collaboration, and internet policy. This study also captured highlighted keywords that could serve as hints or initial ideas for the next research journey. Experts' evaluation and validation were carried out, as interpreting the experimental results requires experts' expertise, experience, and views. A semi-structured interview was conducted with thirteen experts with remarkable expertise in research and development. From the discussion, most experts agreed that the model could help others identify research trends and potential new research topics emerging for future research. The newly developed model could benefit those who need hints for their next exploration and help those keen to understand how to execute text mining within bibliometric elements.
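The contrast between fixed and sliding timeline windows in the co-word analysis above can be sketched in a few lines. This is an illustrative toy (the study itself used CiteSpace); the `(year, keywords)` record format and the window parameters are assumptions:

```python
from collections import Counter
from itertools import combinations

def coword_counts(records):
    """Count keyword pairs co-occurring within the same publication.
    `records` is a list of (year, [keywords]) tuples."""
    pairs = Counter()
    for _, keywords in records:
        for a, b in combinations(sorted(set(keywords)), 2):
            pairs[(a, b)] += 1
    return pairs

def fixed_windows(records, start, end, width):
    """Non-overlapping windows: [start, start+width), [start+width, ...);
    `end` is exclusive."""
    return {(y, y + width - 1):
            coword_counts([r for r in records if y <= r[0] < y + width])
            for y in range(start, end, width)}

def sliding_windows(records, start, end, width, step=1):
    """Overlapping windows shifted by `step` years; `end` is the last year
    to cover. Short-lived topics invisible in fixed windows surface here."""
    return {(y, y + width - 1):
            coword_counts([r for r in records if y <= r[0] < y + width])
            for y in range(start, end - width + 2, step)}
```

A topic active only during, say, 1998-2001 straddles the 1995-1999 and 2000-2004 fixed windows and is diluted in both, but dominates some sliding window, which is exactly why the study's sliding-window pass revealed small topics such as shared banking service.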
Interactions in Information Spread
Since the development of writing 5,000 years ago, human-generated data has been produced at an ever-increasing pace. Classical archival methods aimed at easing information retrieval. Nowadays, archiving alone is no longer enough. The amount of data generated daily is beyond human comprehension and calls for new information retrieval strategies. Instead of referencing every single data piece, as in traditional archival techniques, a more relevant approach consists of understanding the overall ideas conveyed in data flows. To spot such general
tendencies, a precise comprehension of the underlying data generation
mechanisms is required. In the rich literature tackling this problem, the
question of information interaction remains nearly unexplored. First, we
investigate the frequency of such interactions. Building on recent advances
made in Stochastic Block Modelling, we explore the role of interactions in
several social networks. We find that interactions are rare in these datasets.
Then, we wonder how interactions evolve over time. Earlier data pieces should
not have an everlasting influence on later data generation mechanisms. We
model this using dynamic network inference advances. We conclude that
interactions are brief. Finally, we design a framework that jointly models rare
and brief interactions based on Dirichlet-Hawkes Processes. We argue that this
new class of models fits brief and sparse interaction modelling. We conduct a
large-scale application on Reddit and find that interactions play a minor role
in this dataset. From a broader perspective, our work results in a collection
of highly flexible models and in a rethinking of core concepts of machine
learning. Consequently, we open a range of novel perspectives both in terms of
real-world applications and in terms of technical contributions to machine
learning. PhD thesis defended on 2022/09/1.
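The Dirichlet-Hawkes framework named above builds on Hawkes processes, whose exponential kernel is what makes the influence of past events brief. A minimal sketch of the standard exponential-kernel conditional intensity (the generic textbook form, not the thesis's full model; the parameter values are illustrative):

```python
import numpy as np

def hawkes_intensity(t, events, mu=0.1, alpha=0.5, beta=1.0):
    """Exponential-kernel Hawkes conditional intensity:
    lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i)).
    mu is the background rate, alpha the excitation a past event adds,
    and beta the decay rate: a large beta makes each event's influence
    brief, mirroring the finding that interactions are short-lived."""
    past = np.asarray(events)
    past = past[past < t]
    return float(mu + np.sum(alpha * np.exp(-beta * (t - past))))
```

In the Dirichlet-Hawkes combination, a Dirichlet-process prior additionally decides which cluster (topic) each event excites, so rare interactions correspond to most events exciting only their own cluster.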