4 research outputs found

Grouping business news stories based on salience of named entities

Author: Du Mian
Escoter Llorenc
Katinskaia Anisia
Pivovarova Lidia
Yangarber Roman
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2017

In news aggregation systems focused on broad news domains, certain stories may appear in multiple articles. Depending on the relative importance of the story, the number of versions can reach dozens or hundreds within a day. The text in these versions may be nearly identical or quite different. Linking multiple versions of a story into a single group brings several important benefits to the end-user: reducing the cognitive load on the reader, as well as signaling the relative importance of the story. We present a grouping algorithm, and explore several vector-based representations of input documents: from a baseline using keywords, to a method using salience, a measure of importance of named entities in the text. We demonstrate that features beyond keywords yield substantial improvements, verified on a manually-annotated corpus of business news stories.

Peer reviewed
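The keyword baseline mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' grouping algorithm, which the abstract does not specify. It assumes bag-of-words count vectors, cosine similarity, and a greedy single-pass grouping rule; the names `keyword_vector`, `cosine`, `group_stories` and the threshold value 0.5 are all hypothetical.

```python
import math
import re
from collections import Counter


def keyword_vector(text):
    """Bag-of-words term counts over lowercase alphabetic tokens."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def group_stories(texts, threshold=0.5):
    """Greedy single-pass grouping (an assumed rule, not the paper's).

    Each text joins the existing group whose first member is most similar,
    provided the similarity reaches the threshold; otherwise it starts a
    new group. Groups are lists of indices into `texts`.
    """
    vectors = [keyword_vector(t) for t in texts]
    groups = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for group in groups:
            sim = cosine(vec, vectors[group[0]])
            if sim >= best_sim:
                best, best_sim = group, sim
        if best is None:
            groups.append([i])
        else:
            best.append(i)
    return groups


texts = [
    "Acme buys Widget Corp for two billion",
    "Acme buys Widget Corp in two billion deal",
    "Central bank raises interest rates",
]
groups = group_stories(texts)
```

A salience-based representation, as described in the abstract, would replace `keyword_vector` with weights over named entities; that step is left out here because the abstract does not define how salience is computed.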

Crossref

Helsingin yliopiston digitaalinen arkisto

Measuring Latent Variables in space and/or time: A Gender Statistics exercise

Author: Bertarelli G
Crippa F
Mecatti F
Publication venue: Athens
Publication date: 01/01/2017

Archivio della ricerca della Scuola Superiore Sant'Anna

A hybrid framework for news clustering based on the DBSCAN-Martingale and LDA

Author: Gialampoukidis Ilias
Kompatsiaris Ioannis
Vrochidis Stefanos
Wanner Leo
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2016

Nowadays there is an important need by journalists and media monitoring companies to cluster news in large amounts of web articles, in order to ensure fast access to their topics or events of interest. Our aim in this work is to identify groups of news articles that share a common topic or event, without a priori knowledge of the number of clusters. The estimation of the correct number of topics is a challenging issue, due to the existence of “noise”, i.e. news articles which are irrelevant to all other topics. In this context, we introduce a novel density-based news clustering framework, in which the assignment of news articles to topics is done by the well-established Latent Dirichlet Allocation, but the estimation of the number of clusters is performed by our novel method, called “DBSCAN-Martingale”, which allows for extracting noise from the dataset and progressively extracts clusters from an OPTICS reachability plot. We evaluate our framework and the DBSCAN-Martingale on the 20newsgroups-mini dataset and on 220 web news articles, which are references to specific Wikipedia pages. Among twenty methods for news clustering, without knowing the number of clusters k, the framework of DBSCAN-Martingale provides the correct number of clusters and the highest Normalized Mutual Information.

This work was supported by the projects MULTISENSOR (FP7-610411) and KRISTINA (H2020-645012), funded by the European Commission.
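The idea of density-based clustering with noise extraction and no preset number of clusters can be illustrated with a minimal sketch of plain DBSCAN. This is not the DBSCAN-Martingale described in the abstract, and it does not include the OPTICS reachability plot or the Latent Dirichlet Allocation step. The function name `dbscan`, the toy 2-D points, and the parameter values are assumptions chosen for illustration.

```python
import math


def dbscan(points, eps, min_pts):
    """Plain DBSCAN over points given as coordinate tuples.

    Returns one label per point: 0, 1, ... for clusters and -1 for noise.
    The number of clusters is discovered, not supplied.
    """
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so expand through it
    return labels


# Two dense groups and one isolated point (toy data, not from the paper).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
n_clusters = len(set(labels) - {-1})
```

In a pipeline like the one the abstract describes, an estimate such as `n_clusters` would then be passed to a topic model as the number of topics; how the DBSCAN-Martingale arrives at its estimate is not detailed in the abstract and is not reproduced here.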

UPF Digital Repository

A hybrid framework for news clustering based on the DBSCAN-Martingale and LDA

Author: Gialampoukidis Ilias
Kompatsiaris Ioannis
Vrochidis Stefanos
Wanner Leo
Publication venue: Springer Fachmedien Wiesbaden GmbH
Publication date

RECERCAT