Growing Story Forest Online from Massive Breaking News

Author: Kong Linglong
Lai Kunfeng
Liu Bang
Niu Di
Xu Yu
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 28/02/2018

We describe our experience of implementing a news content organization system at Tencent that discovers events from vast streams of breaking news and evolves news story structures in an online fashion. Our real-world system has distinct requirements in contrast to previous studies on topic detection and tracking (TDT) and event timeline or graph generation, in that we 1) need to accurately and quickly extract distinguishable events from massive streams of long text documents that cover diverse topics and contain highly redundant information, and 2) must develop the structures of event stories in an online manner, without repeatedly restructuring previously formed stories, in order to guarantee a consistent user viewing experience. To address these challenges, we propose Story Forest, a set of online schemes that automatically clusters streaming documents into events, while connecting related events in growing trees to tell evolving stories. We conducted an extensive evaluation on 60 GB of real-world Chinese news data, together with detailed pilot user experience studies, although our ideas are not language-dependent and can easily be extended to other languages. The results demonstrate the superior capability of Story Forest to accurately identify events and organize news text into a logical structure that is appealing to human readers, compared to multiple existing algorithm frameworks. Comment: Accepted by CIKM 2017, 9 pages

arXiv.org e-Print Archive

Crossref

BlogForever D2.6: Data Extraction Methodology

Author: Banos V.
Davis R.
Gkotsis G.
Pincent E.
Stepanyan K.
Publication date: 25/10/2013

This report outlines an inquiry into the area of web data extraction, conducted within the context of blog preservation. The report reviews theoretical advances and practical developments for implementing data extraction. The inquiry is extended through an experiment that demonstrates the effectiveness and feasibility of implementing some of the suggested approaches. More specifically, the report discusses an approach based on unsupervised machine learning that employs the RSS feeds and HTML representations of blogs. It outlines the possibilities of extracting semantics available in blogs and demonstrates the benefits of exploiting available standards such as microformats and microdata. The report proceeds to propose a methodology for extracting and processing blog data to further inform the design and development of the BlogForever platform.

ZENODO

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Quootstrap: Scalable Unsupervised Extraction of Quotation-Speaker Pairs from Large News Corpora via Bootstrapping

Author: Pavllo Dario
Piccardi Tiziano
West Robert
Publication date: 07/04/2018

We propose Quootstrap, a method for extracting quotations, as well as the names of the speakers who uttered them, from large news corpora. Whereas prior work has addressed this problem primarily with supervised machine learning, our approach follows a fully unsupervised bootstrapping paradigm. It leverages the redundancy present in large news corpora, more precisely, the fact that the same quotation often appears across multiple news articles in slightly different contexts. Starting from a few seed patterns, such as ["Q", said S.], our method extracts a set of quotation-speaker pairs (Q, S), which are in turn used for discovering new patterns expressing the same quotations; the process is then repeated with the larger pattern set. Our algorithm is highly scalable, which we demonstrate by running it on the large ICWSM 2011 Spinn3r corpus. Validating our results against a crowdsourced ground truth, we obtain 90% precision at 40% recall using a single seed pattern, with significantly higher recall values for more frequently reported (and thus likely more interesting) quotations. Finally, we showcase the usefulness of our algorithm's output for computational social science by analyzing the sentiment expressed in our extracted quotations. Comment: Accepted at the 12th International Conference on Web and Social Media (ICWSM), 2018
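
The bootstrapping loop described in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy corpus, the `{Q}` and `{S}` placeholder notation, the speaker-name regex, and all function names are assumptions made for illustration, and the sketch leaves out any pattern filtering or scoring that a real system would need to stay precise.

```python
import re

# Hypothetical toy corpus: the same quotation appears in two different contexts.
CORPUS = [
    '"We will win", said Alice Smith.',
    'Alice Smith told reporters: "We will win"',
    'Bob Jones told reporters: "Taxes are too high"',
]

# Seed pattern modelled on the abstract's ["Q", said S.] example.
# The {Q}/{S} placeholder syntax is an assumption of this sketch.
SEED_PATTERNS = ['"{Q}", said {S}.']


def pattern_to_regex(pattern):
    """Compile a pattern with {Q} and {S} placeholders into a regex."""
    pieces = []
    for part in re.split(r"(\{Q\}|\{S\})", pattern):
        if part == "{Q}":
            pieces.append(r'(?P<Q>[^"]+)')  # quotation: any run without quotes
        elif part == "{S}":
            # Speaker: two or more capitalised words (a crude assumption).
            pieces.append(r"(?P<S>[A-Z][a-z]+(?: [A-Z][a-z]+)+)")
        else:
            pieces.append(re.escape(part))  # literal context
    return re.compile("^" + "".join(pieces) + "$")


def extract_pairs(corpus, patterns):
    """Apply every pattern to every sentence and collect (Q, S) pairs."""
    pairs = set()
    for pattern in patterns:
        regex = pattern_to_regex(pattern)
        for sentence in corpus:
            match = regex.match(sentence)
            if match:
                pairs.add((match.group("Q"), match.group("S")))
    return pairs


def induce_patterns(corpus, pairs):
    """Turn sentences that contain a known pair into new patterns."""
    patterns = set()
    for quote, speaker in pairs:
        for sentence in corpus:
            if quote in sentence and speaker in sentence:
                patterns.add(
                    sentence.replace(quote, "{Q}").replace(speaker, "{S}")
                )
    return patterns


def bootstrap(corpus, seed_patterns, rounds=2):
    """Alternate pair extraction and pattern discovery for a few rounds."""
    patterns = set(seed_patterns)
    pairs = set()
    for _ in range(rounds):
        pairs |= extract_pairs(corpus, patterns)
        patterns |= induce_patterns(corpus, pairs)
    return pairs, patterns


if __name__ == "__main__":
    found_pairs, found_patterns = bootstrap(CORPUS, SEED_PATTERNS)
    print(sorted(found_pairs))
    print(sorted(found_patterns))
```

In this toy run the seed pattern only matches the first sentence. The pair it yields lets the sketch learn a second pattern from the second sentence, and that pattern in turn recovers the Bob Jones quotation, which the seed alone would have missed.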

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Exploring open information via event network

Author: Chen Yanping
Hao Yazhou
Liu Huan
Shah Nazaraf
Tian Feng
Zheng Qinghua
Publication venue: 'Cambridge University Press (CUP)'
Publication date: 01/01/2017

Crossref

Coventry University Pure Portal