Dense vs. Sparse representations for news stream clustering
The abundance of news being generated on a daily basis has made it hard, if not impossible, to monitor all news developments. Thus, there is an increasing need for accurate tools that can organize the news for easier exploration. Typically, this means clustering the news stream, and then connecting the clusters into story lines. Here, we focus on the clustering step, using a local topic graph and a community detection algorithm. Traditionally, news clustering was done using sparse vector representations with TF–IDF weighting, but more recently dense representations have emerged as a popular alternative. Here, we compare these two representations, as well as combinations thereof. The evaluation results on a standard dataset show a sizeable improvement over the state of the art both for the standard F1 as well as for a BCubed version thereof, which we argue is more suitable for the task.
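For readers unfamiliar with the BCubed measure mentioned in this abstract, the following is a minimal sketch of the commonly used item-averaged BCubed precision, recall, and F1. It is an illustration only: the abstract does not specify the exact variant the authors evaluated, and the function name `bcubed` and the toy labels are assumptions introduced here.

```python
from collections import Counter


def bcubed(pred, gold):
    """Item-averaged BCubed precision, recall and F1.

    `pred` and `gold` are parallel lists: pred[i] is the cluster assigned
    to item i, gold[i] is its reference story label. (Hypothetical helper,
    not taken from the paper.)
    """
    if len(pred) != len(gold) or not pred:
        raise ValueError("pred and gold must be non-empty and equally long")

    n = len(pred)
    overlap = Counter(zip(pred, gold))  # items sharing both cluster and label
    pred_size = Counter(pred)           # size of each predicted cluster
    gold_size = Counter(gold)           # size of each reference story

    # Per-item precision: share of the item's cluster that has its gold label.
    precision = sum(overlap[(p, g)] / pred_size[p] for p, g in zip(pred, gold)) / n
    # Per-item recall: share of the item's gold story found in its cluster.
    recall = sum(overlap[(p, g)] / gold_size[g] for p, g in zip(pred, gold)) / n

    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: five articles, two predicted clusters, two reference stories.
p, r, f = bcubed([0, 0, 0, 1, 1], ["a", "a", "b", "b", "b"])
```

Unlike a cluster-level F1, this measure scores every article individually, so a single misplaced article lowers the score in proportion to the clusters it touches.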
Report on the Second International Workshop on Narrative Extraction from Texts (Text2Story 2019)
The Second International Workshop on Narrative Extraction from Texts (Text2Story’19 [http://text2story19.inesctec.pt/]) was held on the 14th of April 2019, in conjunction with the 41st European Conference on Information Retrieval (ECIR 2019) in Cologne, Germany. The workshop provided a platform for researchers in IR, NLP, and design and visualization to come together and share the recent advances in extraction and formal representation of narratives. The workshop consisted of two invited talks, ten research paper presentations, and a poster and demo session. The proceedings of the workshop are available online at http://ceur-ws.org/Vol-2342/.
SCStory: Self-supervised and Continual Online Story Discovery
We present SCStory, a framework for online story discovery that helps people
digest rapidly published news article streams in real time without human
annotations. To organize news article streams into stories, existing approaches
directly encode the articles and cluster them based on representation
similarity. However, these methods yield noisy and inaccurate story discovery
results because the generic article embeddings do not effectively reflect the
story-indicative semantics in an article and cannot adapt to the rapidly
evolving news article streams. SCStory employs self-supervised and continual
learning with a novel idea of story-indicative adaptive modeling of news
article streams. With a lightweight hierarchical embedding module that first
learns sentence representations and then article representations, SCStory
identifies story-relevant information in news articles and uses it to
discover stories. The embedding module is continuously updated to adapt to
evolving news streams with a contrastive learning objective, backed up by two
unique techniques, confidence-aware memory replay and prioritized-augmentation,
employed for label absence and data scarcity problems. Thorough experiments on
real and recent news data sets demonstrate that SCStory outperforms
existing state-of-the-art algorithms for unsupervised online story discovery.
Comment: Presented at WWW'2
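To make the idea of a contrastive learning objective concrete, here is a minimal sketch of a generic InfoNCE-style loss for one anchor article. This is not SCStory's actual objective, which the abstract does not spell out; the helper names, the toy vectors, and the temperature value are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE loss: small when the anchor is closer to the positive
    (e.g. an article of the same story) than to the negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_denominator - logits[0]


# Toy vectors standing in for learned article embeddings.
loss_aligned = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
loss_misaligned = info_nce([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

Minimizing such a loss pulls embeddings of same-story articles together and pushes others apart, which is the general effect a contrastive objective is meant to have.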
Unsupervised Story Discovery from Continuous News Streams via Scalable Thematic Embedding
Unsupervised discovery of stories with correlated news articles in real time
helps people digest massive news streams without expensive human annotations. A
common approach of the existing studies for unsupervised online story discovery
is to represent news articles with symbolic- or graph-based embedding and
incrementally cluster them into stories. Recent large language models are
expected to improve the embedding further, but a straightforward adoption of
the models by indiscriminately encoding all information in articles is
ineffective for dealing with text-rich and evolving news streams. In this work, we
propose a novel thematic embedding with an off-the-shelf pretrained sentence
encoder to dynamically represent articles and stories by considering their
shared temporal themes. To realize the idea for unsupervised online story
discovery, a scalable framework USTORY is introduced with two main techniques,
theme- and time-aware dynamic embedding and novelty-aware adaptive clustering,
fueled by lightweight story summaries. A thorough evaluation with real news
data sets demonstrates that USTORY achieves higher story discovery performance
than baselines while being robust and scalable to various streaming settings.
Comment: Accepted by SIGIR'2
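The general pattern of incrementally clustering articles into stories, with a check for whether an article is novel enough to start a new story, can be sketched as follows. This is a simplified illustration and not the USTORY algorithm: it uses plain cosine similarity against running story centroids, and the function name `assign_online`, the toy embeddings, and the `threshold` value are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def assign_online(embeddings, threshold=0.8):
    """Assign each incoming article embedding to a story, in arrival order.

    An article joins the most similar existing story if the similarity
    reaches `threshold`; otherwise it is treated as novel and opens a new
    story. Returns one story index per article.
    """
    story_sums = []  # per-story sum of member vectors (same direction as the mean)
    labels = []
    for vec in embeddings:
        best_story, best_sim = None, -1.0
        for idx, total in enumerate(story_sums):
            sim = cosine(vec, total)
            if sim > best_sim:
                best_story, best_sim = idx, sim
        if best_story is not None and best_sim >= threshold:
            story_sums[best_story] = [a + b for a, b in zip(story_sums[best_story], vec)]
            labels.append(best_story)
        else:
            story_sums.append(list(vec))
            labels.append(len(story_sums) - 1)
    return labels


# Toy stream: two articles on one topic, then two on another.
labels = assign_online([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
```

A fixed threshold is the simplest possible novelty test; an adaptive scheme such as the one the abstract describes would presumably adjust this decision as the stream evolves.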