SCStory: Self-supervised and Continual Online Story Discovery
We present SCStory, a framework for online story discovery that helps people
digest rapidly published news article streams in real time without human
annotations. To organize news article streams into stories, existing approaches
directly encode the articles and cluster them based on representation
similarity. However, these methods yield noisy and inaccurate story discovery
results because the generic article embeddings do not effectively reflect the
story-indicative semantics in an article and cannot adapt to the rapidly
evolving news article streams. SCStory employs self-supervised and continual
learning with a novel idea of story-indicative adaptive modeling of news
article streams. With a lightweight hierarchical embedding module that first
learns sentence representations and then article representations, SCStory
identifies story-relevant information in news articles and uses it to
discover stories. The embedding module is continuously updated to adapt to
evolving news streams with a contrastive learning objective, supported by two
techniques, confidence-aware memory replay and prioritized augmentation,
which address the absence of labels and the scarcity of data. Thorough
experiments on real and latest news data sets demonstrate that SCStory
outperforms existing state-of-the-art algorithms for unsupervised online
story discovery.
Comment: Presented at WWW'2
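The core idea of a story-indicative hierarchical embedding (sentence representations pooled into an article representation) can be illustrated with a minimal sketch. The function name, softmax attention scheme, and use of a story centroid as the attention query are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def article_embedding(sentence_embs: np.ndarray, story_centroid: np.ndarray) -> np.ndarray:
    """Pool sentence embeddings into one article embedding, weighting each
    sentence by its similarity to a story centroid (a stand-in for
    story-indicative attention). All vectors are assumed L2-normalized."""
    scores = sentence_embs @ story_centroid      # one relevance score per sentence
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                     # softmax attention weights
    emb = weights @ sentence_embs                # weighted average of sentences
    return emb / np.linalg.norm(emb)             # re-normalize the article vector
```

Sentences more aligned with the story theme dominate the pooled vector, which is the sense in which the representation is "story-indicative" rather than generic.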
Unsupervised Story Discovery from Continuous News Streams via Scalable Thematic Embedding
Unsupervised discovery of stories with correlated news articles in real-time
helps people digest massive news streams without expensive human annotations. A
common approach of the existing studies for unsupervised online story discovery
is to represent news articles with symbolic- or graph-based embedding and
incrementally cluster them into stories. Recent large language models are
expected to improve the embedding further, but straightforwardly adopting
such models by indiscriminately encoding all information in articles is
ineffective for text-rich and evolving news streams. In this work, we
propose a novel thematic embedding with an off-the-shelf pretrained sentence
encoder to dynamically represent articles and stories by considering their
shared temporal themes. To realize the idea for unsupervised online story
discovery, a scalable framework USTORY is introduced with two main techniques,
theme- and time-aware dynamic embedding and novelty-aware adaptive clustering,
fueled by lightweight story summaries. A thorough evaluation with real news
data sets demonstrates that USTORY achieves higher story discovery performance
than baselines while being robust and scalable to various streaming settings.
Comment: Accepted by SIGIR'2
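The novelty-aware clustering step can be sketched in a few lines: an incoming article joins its most similar story, or opens a new one when nothing is similar enough. The function name and threshold value are illustrative assumptions:

```python
import numpy as np

def assign_story(article_emb: np.ndarray, story_centroids: list, novelty_threshold: float = 0.5):
    """Novelty-aware adaptive clustering (sketch): attach the article to its
    most similar story centroid, or signal a new story (None) when no
    similarity clears the novelty threshold. Vectors are assumed
    L2-normalized so the dot product is cosine similarity."""
    if not story_centroids:
        return None                                   # first article: new story
    sims = [float(article_emb @ c) for c in story_centroids]
    best = int(np.argmax(sims))
    return best if sims[best] >= novelty_threshold else None
```

In the full framework the centroids would themselves be theme- and time-aware summaries rather than plain means.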
NETS: Extremely fast outlier detection from a data stream via set-based processing
This paper addresses the problem of efficiently detecting outliers from a data stream as old data points expire from, and new data points enter, the sliding window. The proposed method builds on a newly observed characteristic of data streams: the locations of data points in the data space typically change very little between successive windows. This observation reveals that existing distance-based outlier detection algorithms perform many unnecessary computations that are repetitive and/or cancel each other out. We therefore propose a novel set-based approach to detecting outliers, whereby data points at similar locations are grouped and the detection of outliers or inliers is handled at the group level. Specifically, a new algorithm NETS achieves a remarkable performance improvement by realizing set-based early identification of outliers or inliers and by exploiting the net effect between expired and new data points. Additionally, NETS maintains this efficiency even for high-dimensional data streams through two-level dimensional filtering. Comprehensive experiments on six real-world data streams show 5 to 25 times faster processing than state-of-the-art algorithms with comparable memory consumption. We assert that NETS opens a new possibility for real-time data stream outlier detection.
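The set-based early-identification idea can be sketched with a simple grid: if the cell width is chosen so that the cell diagonal is at most the neighbor radius R, then any cell holding more than k points makes every point in it an inlier at once (each point already has at least k in-cell neighbors), and only the remaining cells need point-level checks. The grid layout and parameters below are illustrative, not NETS itself:

```python
from collections import Counter

def cell_of(point, cell_width):
    """Map a point to its integer grid-cell index."""
    return tuple(int(c // cell_width) for c in point)

def classify_cells(points, cell_width, k):
    """Set-based early identification (sketch): split cells into those whose
    population alone guarantees all their points are inliers, and those
    that still need point-level distance computations."""
    counts = Counter(cell_of(p, cell_width) for p in points)
    sure_inlier_cells = {c for c, n in counts.items() if n > k}
    undecided_cells = set(counts) - sure_inlier_cells
    return sure_inlier_cells, undecided_cells
```

Handling the window update at this group level (counting how many points each cell gains and loses per slide) is what lets the net effect of expired and new points cancel out without per-point recomputation.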
MEGClass: Extremely Weakly Supervised Text Classification via Mutually-Enhancing Text Granularities
Text classification is essential for organizing unstructured text.
Traditional methods rely on human annotations or, more recently, a set of class
seed words for supervision, which can be costly, particularly for specialized
or emerging domains. To address this, using class surface names alone as
extremely weak supervision has been proposed. However, existing approaches
treat different levels of text granularity (documents, sentences, or words)
independently, disregarding inter-granularity class disagreements and the
context identifiable exclusively through joint extraction. In order to tackle
these issues, we introduce MEGClass, an extremely weakly-supervised text
classification method that leverages Mutually-Enhancing Text Granularities.
MEGClass utilizes coarse- and fine-grained context signals obtained by jointly
considering a document's most class-indicative words and sentences. This
approach enables the learning of a contextualized document representation that
captures the most discriminative class indicators. By preserving the
heterogeneity of potential classes, MEGClass can select the most informative
class-indicative documents as iterative feedback to enhance the initial
word-based class representations and ultimately fine-tune a pre-trained text
classifier. Extensive experiments on seven benchmark datasets demonstrate that
MEGClass outperforms other weakly and extremely weakly supervised methods.
Comment: Code: https://github.com/pkargupta/MEGClass
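The multi-granularity idea of letting the most class-indicative sentences shape the document-level decision can be sketched as follows. The weighting scheme (gap between the top-two class similarities) and the function name are illustrative assumptions, not MEGClass itself:

```python
import numpy as np

def document_class_scores(sent_class_sims: np.ndarray) -> np.ndarray:
    """Combine sentence-level class signals into document-level class scores.
    Rows are sentences, columns are classes. Each sentence votes with weight
    proportional to how class-indicative it is, measured here as the margin
    between its two highest class similarities."""
    top2 = np.sort(sent_class_sims, axis=1)[:, -2:]   # two largest per sentence
    weights = top2[:, 1] - top2[:, 0]                 # class-indicativeness margin
    weights = weights / (weights.sum() + 1e-12)       # normalize the votes
    return weights @ sent_class_sims                  # weighted class scores
```

A nearly uninformative sentence (similar scores for all classes) contributes almost nothing, while a sharply class-indicative sentence dominates, mirroring how coarse- and fine-grained signals reinforce each other.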
Adaptive Model Pooling for Online Deep Anomaly Detection from a Complex Evolving Data Stream
Online anomaly detection from a data stream is critical for the safety and
security of many applications but is facing severe challenges due to complex
and evolving data streams from IoT devices and cloud-based infrastructures.
Unfortunately, existing approaches fall short of these challenges: online
anomaly detection methods struggle to handle the complexity, while offline
deep anomaly detection methods suffer from the evolving data
distribution. This paper presents a framework for online deep anomaly
detection, ARCUS, which can be instantiated with any autoencoder-based deep
anomaly detection method. It handles complex and evolving data streams
using an adaptive model pooling approach with two novel techniques:
concept-driven inference and drift-aware model pool update; the former detects
anomalies with a combination of models most appropriate for the complexity, and
the latter adapts the model pool dynamically to fit the evolving data streams.
In comprehensive experiments with ten data sets which are both high-dimensional
and concept-drifted, ARCUS improved the anomaly detection accuracy of the
streaming variants of state-of-the-art autoencoder-based methods and that of
the state-of-the-art streaming anomaly detection methods by up to 22% and 37%,
respectively.
Comment: Accepted by KDD 2022 Research Track
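The pool-based control flow (pick the best-fitting model for the current batch; add a model when none fits) can be sketched with a toy stand-in where each "model" is just a stored batch mean and "reconstruction error" is the distance to it. The mean-based models, class name, and threshold rule are assumptions for illustration, not the actual autoencoder pool:

```python
import numpy as np

class ToyModelPool:
    """Toy stand-in for adaptive model pooling over a drifting stream."""

    def __init__(self, drift_threshold: float):
        self.means = []              # one "model" per concept seen so far
        self.tau = drift_threshold

    def score(self, batch: np.ndarray) -> float:
        """Return the batch's anomaly score and update the pool."""
        mu = batch.mean(axis=0)
        if not self.means:
            self.means.append(mu)    # bootstrap with the first concept
            return 0.0
        errors = [float(np.linalg.norm(mu - m)) for m in self.means]
        best = min(errors)           # inference via the best-fitting model
        if best > self.tau:          # drift: no existing model explains batch
            self.means.append(mu)    # drift-aware pool update
        return best
```

Replacing the stored means with trained autoencoders and the distance with reconstruction error recovers the shape of the real design: inference combines the most appropriate models, and the pool grows or adapts as concepts drift.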
COVID-EENet: Predicting Fine-Grained Impact of COVID-19 on Local Economies
Assessing the impact of the COVID-19 crisis on economies is fundamental to tailoring government responses for recovering from the crisis. In this paper, we present a novel approach to assessing the economic impact with a large-scale credit card transaction dataset at a fine granularity. For this purpose, we develop COVID-EENet, a fine-grained economic-epidemiological modeling framework featuring a two-level deep neural network. In support of fine-grained economic-epidemiological modeling, COVID-EENet learns the impact of nearby mass infection cases on the changes of local economies in each district. Through experiments using the nationwide dataset, given a set of active mass infection cases, COVID-EENet is shown to precisely predict the sales changes over two or four weeks for each district and business category. Policymakers can therefore be informed of the predicted impact and put the most effective mitigation measures in place. Overall, we believe that our work opens a new perspective on using financial data to recover from an economic crisis. For public use on this urgent problem, we release the source code at https://github.com/kaist-dmlab/COVID-EENet.