Search CORE

151,259 research outputs found

Quootstrap: Scalable Unsupervised Extraction of Quotation-Speaker Pairs from Large News Corpora via Bootstrapping

Author: Pavllo Dario
Piccardi Tiziano
West Robert
Publication venue
Publication date: 07/04/2018
Field of study

We propose Quootstrap, a method for extracting quotations, as well as the names of the speakers who uttered them, from large news corpora. Whereas prior work has addressed this problem primarily with supervised machine learning, our approach follows a fully unsupervised bootstrapping paradigm. It leverages the redundancy present in large news corpora, more precisely, the fact that the same quotation often appears across multiple news articles in slightly different contexts. Starting from a few seed patterns, such as ["Q", said S.], our method extracts a set of quotation-speaker pairs (Q, S), which are in turn used for discovering new patterns expressing the same quotations; the process is then repeated with the larger pattern set. Our algorithm is highly scalable, which we demonstrate by running it on the large ICWSM 2011 Spinn3r corpus. Validating our results against a crowdsourced ground truth, we obtain 90% precision at 40% recall using a single seed pattern, with significantly higher recall values for more frequently reported (and thus likely more interesting) quotations. Finally, we showcase the usefulness of our algorithm's output for computational social science by analyzing the sentiment expressed in our extracted quotations.Comment: Accepted at the 12th International Conference on Web and Social Media (ICWSM), 201

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Noise and nonlinearities in high-throughput data

Author: Bagnoli F
Bagnoli F
Franco Bagnoli
Koukolíková-Nicola Z
Minka T
Nguyen V-A Nicola-Koulikova Z Bagnoli F Lió P Ho Tu Bao Zhou Zhi-Hua
Pietro Lió
Rajan J J
Viet-Anh Nguyen
Zdena Koukolíková-Nicola
Publication venue: 'IOP Publishing'
Publication date: 05/01/2010
Field of study

High-throughput data analyses are becoming common in biology, communications, economics and sociology. The vast amounts of data are usually represented in the form of matrices and can be considered as knowledge networks. Spectra-based approaches have proved useful in extracting hidden information within such networks and for estimating missing data, but these methods are based essentially on linear assumptions. The physical models of matching, when applicable, often suggest non-linear mechanisms, that may sometimes be identified as noise. The use of non-linear models in data analysis, however, may require the introduction of many parameters, which lowers the statistical weight of the model. According to the quality of data, a simpler linear analysis may be more convenient than more complex approaches. In this paper, we show how a simple non-parametric Bayesian model may be used to explore the role of non-linearities and noise in synthetic and experimental data sets.Comment: 12 pages, 3 figure

arXiv.org e-Print Archive

Crossref