7 research outputs found

    Unsupervised Story Discovery from Continuous News Streams via Scalable Thematic Embedding

    Full text link
    Unsupervised discovery of stories with correlated news articles in real time helps people digest massive news streams without expensive human annotations. A common approach in existing studies of unsupervised online story discovery is to represent news articles with symbolic- or graph-based embeddings and incrementally cluster them into stories. Recent large language models are expected to improve the embeddings further, but a straightforward adoption of the models, indiscriminately encoding all information in the articles, is ineffective for text-rich and evolving news streams. In this work, we propose a novel thematic embedding with an off-the-shelf pretrained sentence encoder to dynamically represent articles and stories by considering their shared temporal themes. To realize the idea for unsupervised online story discovery, a scalable framework USTORY is introduced with two main techniques, theme- and time-aware dynamic embedding and novelty-aware adaptive clustering, fueled by lightweight story summaries. A thorough evaluation with real news data sets demonstrates that USTORY achieves higher story discovery performance than baselines while being robust and scalable to various streaming settings. (Comment: Accepted by SIGIR'23)
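    To make the two techniques above concrete, below is a minimal, illustrative sketch of theme- and time-aware article embedding and novelty-aware story assignment. The `encode` stub, the decay constant `tau`, and the `novelty_threshold` are assumptions for illustration, not the authors' USTORY implementation.

```python
import numpy as np

def encode(sentences):
    # Hypothetical stand-in for any off-the-shelf pretrained sentence encoder.
    rng = np.random.default_rng(abs(hash(tuple(sentences))) % (2**32))
    return rng.normal(size=(len(sentences), 384))

def article_embedding(sentences, theme_vec, timestamp, now, tau=3.0):
    """Weight sentence embeddings by similarity to the story theme and decay by age."""
    S = encode(sentences)                                        # (n_sentences, dim)
    sims = S @ theme_vec / (np.linalg.norm(S, axis=1) * np.linalg.norm(theme_vec) + 1e-9)
    w = np.exp(sims) * np.exp(-(now - timestamp) / tau)          # theme- and time-aware weights
    return (w[:, None] * S).sum(axis=0) / (w.sum() + 1e-9)

def assign_to_story(article_vec, story_centroids, novelty_threshold=0.4):
    """Novelty-aware assignment: join the most similar story or start a new one."""
    if not story_centroids:
        return None                                              # no stories yet -> new story
    sims = [c @ article_vec / (np.linalg.norm(c) * np.linalg.norm(article_vec) + 1e-9)
            for c in story_centroids]
    best = int(np.argmax(sims))
    return best if sims[best] >= novelty_threshold else None     # None signals a novel story
```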

    NETS: Extremely fast outlier detection from a data stream via set-based processing

    Get PDF
    This paper addresses the problem of efficiently detecting outliers from a data stream as old data points expire from, and new data points enter, the sliding window. The proposed method builds on a newly observed characteristic of data streams: the locations of data points in the data space typically change very little between consecutive windows. This observation leads to the finding that existing distance-based outlier detection algorithms perform a large amount of unnecessary computation that is either repeated across window slides or canceled out by opposing changes. Thus, in this paper, we propose a novel set-based approach to detecting outliers, whereby data points at similar locations are grouped and outliers or inliers are identified at the group level. Specifically, a new algorithm, NETS, is proposed to achieve a remarkable performance improvement by realizing set-based early identification of outliers or inliers and taking advantage of the net effect between expired and new data points. Additionally, NETS achieves the same efficiency even for high-dimensional data streams through two-level dimensional filtering. Comprehensive experiments using six real-world data streams show 5 to 25 times faster processing than state-of-the-art algorithms with comparable memory consumption. We assert that NETS opens a new possibility for real-time data stream outlier detection.
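    As an illustration of the set-based idea, the sketch below groups points into grid cells and applies only the net per-cell change at each window slide; the cell width, the threshold `k`, and the helper names are assumptions for illustration, not the authors' NETS implementation.

```python
from collections import Counter

def cell_of(point, cell_width):
    # Map a point to the grid cell (group) that contains it.
    return tuple(int(x // cell_width) for x in point)

def slide_window(cell_counts, expired, inserted, cell_width, k):
    """Apply the *net* per-cell change of one window slide and report which
    cells still need neighbor-level checking versus those decided early."""
    delta = Counter(cell_of(p, cell_width) for p in inserted)
    delta.subtract(Counter(cell_of(p, cell_width) for p in expired))
    changed = {c for c, d in delta.items() if d != 0}            # untouched cells need no work
    for c in changed:
        cell_counts[c] = cell_counts.get(c, 0) + delta[c]
    # Early identification: if a cell alone holds at least k points (and its
    # diameter is within the distance threshold), all of its points are inliers.
    early_inlier_cells = {c for c in changed if cell_counts.get(c, 0) >= k}
    return changed - early_inlier_cells, early_inlier_cells
```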

    MEGClass: Extremely Weakly Supervised Text Classification via Mutually-Enhancing Text Granularities

    Full text link
    Text classification is essential for organizing unstructured text. Traditional methods rely on human annotations or, more recently, a set of class seed words for supervision, which can be costly, particularly for specialized or emerging domains. To address this, using class surface names alone as extremely weak supervision has been proposed. However, existing approaches treat different levels of text granularity (documents, sentences, or words) independently, disregarding inter-granularity class disagreements and the context that can be identified only through joint extraction across granularities. To tackle these issues, we introduce MEGClass, an extremely weakly supervised text classification method that leverages Mutually-Enhancing Text Granularities. MEGClass utilizes coarse- and fine-grained context signals obtained by jointly considering a document's most class-indicative words and sentences. This approach enables the learning of a contextualized document representation that captures the most discriminative class indicators. By preserving the heterogeneity of potential classes, MEGClass can select the most informative class-indicative documents as iterative feedback to enhance the initial word-based class representations and ultimately fine-tune a pre-trained text classifier. Extensive experiments on seven benchmark datasets demonstrate that MEGClass outperforms other weakly and extremely weakly supervised methods. (Code: https://github.com/pkargupta/MEGClass)
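    The sketch below illustrates, in a simplified form, how sentence-level class signals could be weighted and combined into a document-level prediction; the `embed` stub, the confidence heuristic, and all parameters are assumptions rather than the MEGClass implementation.

```python
import numpy as np

def embed(texts):
    # Hypothetical stand-in for a pretrained text encoder.
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 128))

def classify_document(sentences, class_names):
    """Combine sentence-level class affinities into a document-level prediction."""
    S, C = embed(sentences), embed(class_names)
    S = S / np.linalg.norm(S, axis=1, keepdims=True)
    C = C / np.linalg.norm(C, axis=1, keepdims=True)
    sent_scores = S @ C.T                                   # per-sentence class affinity
    # Weight each sentence by how class-indicative it is (margin over the median class).
    conf = sent_scores.max(axis=1) - np.median(sent_scores, axis=1)
    weights = np.exp(conf) / np.exp(conf).sum()
    doc_scores = weights @ sent_scores                      # contextualized document-level scores
    return int(doc_scores.argmax()), float(doc_scores.max())

# The most confidently classified documents would then serve as iterative feedback
# to refine the class representations and fine-tune a pretrained classifier.
```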

    Adaptive Model Pooling for Online Deep Anomaly Detection from a Complex Evolving Data Stream

    Full text link
    Online anomaly detection from a data stream is critical for the safety and security of many applications but faces severe challenges due to complex and evolving data streams from IoT devices and cloud-based infrastructures. Unfortunately, existing approaches fall short of these challenges: online anomaly detection methods struggle to handle the complexity, while offline deep anomaly detection methods suffer from the evolving data distribution. This paper presents ARCUS, a framework for online deep anomaly detection that can be instantiated with any autoencoder-based deep anomaly detection method. It handles complex and evolving data streams using an adaptive model pooling approach with two novel techniques: concept-driven inference and drift-aware model pool update. The former detects anomalies with the combination of models most appropriate for the complexity, and the latter adapts the model pool dynamically to fit the evolving data streams. In comprehensive experiments with ten high-dimensional, concept-drifted data sets, ARCUS improved the anomaly detection accuracy of the streaming variants of state-of-the-art autoencoder-based methods and that of state-of-the-art streaming anomaly detection methods by up to 22% and 37%, respectively. (Comment: Accepted by KDD 2022 Research Track)
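    Below is a rough, illustrative sketch of an adaptive model pool with concept-driven inference (weighting models by how well they fit the current batch) and a drift-aware pool update (adding a model when none fits); the toy linear autoencoder and the `drift_threshold` are placeholders, not the ARCUS code.

```python
import numpy as np

class LinearAutoencoder:
    """Toy PCA-style autoencoder used only for illustration."""
    def fit(self, X, dim=4):
        self.mean = X.mean(axis=0)
        _, _, Vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.V = Vt[:dim].T
        return self
    def errors(self, X):
        Z = (X - self.mean) @ self.V
        return np.linalg.norm(X - self.mean - Z @ self.V.T, axis=1)

def process_batch(pool, X, drift_threshold=2.0):
    """Concept-driven inference plus a drift-aware pool update for one batch."""
    if not pool:
        pool.append(LinearAutoencoder().fit(X))
    errs = np.stack([m.errors(X) for m in pool])          # (n_models, n_points)
    fit = 1.0 / (errs.mean(axis=1) + 1e-9)                # how well each model fits the batch
    weights = fit / fit.sum()
    scores = weights @ errs                               # combined anomaly scores per point
    if errs.mean(axis=1).min() > drift_threshold:         # no model fits well: concept drift
        pool.append(LinearAutoencoder().fit(X))           # grow the pool with a new model
    return scores
```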

    COVID-EENet: Predicting Fine-Grained Impact of COVID-19 on Local Economies

    No full text
    Assessing the impact of the COVID-19 crisis on economies is fundamental to tailoring government responses for recovery from the crisis. In this paper, we present a novel approach to assessing the economic impact at a fine granularity using a large-scale credit card transaction dataset. For this purpose, we develop COVID-EENet, a fine-grained economic-epidemiological modeling framework featuring a two-level deep neural network. In support of the fine-grained modeling, COVID-EENet learns the impact of nearby mass infection cases on the changes of local economies in each district. In experiments on the nationwide dataset, given a set of active mass infection cases, COVID-EENet is shown to precisely predict the sales changes over the following two or four weeks for each district and business category. Policymakers can therefore be informed of the predicted impact and put in place the most effective mitigation measures. Overall, we believe that our work opens a new perspective on using financial data to recover from economic crises. For public use in this urgent problem, we release the source code at https://github.com/kaist-dmlab/COVID-EENet
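    As a highly simplified illustration of a two-level network that maps nearby infection-case features and a district profile to per-category sales changes, the sketch below uses assumed input sizes and layer widths; it is not the COVID-EENet architecture.

```python
import torch
import torch.nn as nn

class SalesChangePredictor(nn.Module):
    def __init__(self, case_dim=16, district_dim=8, n_categories=10):
        super().__init__()
        # Level 1: encode each nearby mass-infection case.
        self.case_encoder = nn.Sequential(nn.Linear(case_dim, 32), nn.ReLU())
        # Level 2: combine the aggregated case context with the district profile.
        self.head = nn.Sequential(nn.Linear(32 + district_dim, 64), nn.ReLU(),
                                  nn.Linear(64, n_categories))

    def forward(self, cases, district):
        # cases: (n_cases, case_dim), district: (district_dim,)
        ctx = self.case_encoder(cases).mean(dim=0)        # aggregate nearby infection cases
        return self.head(torch.cat([ctx, district]))      # predicted sales change per category
```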