Generalized Score Matching for Non-Negative Data
A common challenge in estimating parameters of probability density functions
is the intractability of the normalizing constant. While in such cases maximum
likelihood estimation may be implemented using numerical integration, the
approach becomes computationally intensive. The score matching method of
Hyvärinen [2005] avoids direct calculation of the normalizing constant and
yields closed-form estimates for exponential families of continuous
distributions over $\mathbb{R}^m$. Hyvärinen [2007] extended the approach to
distributions supported on the non-negative orthant, $\mathbb{R}_+^m$. In this
paper, we give a generalized form of score matching for non-negative data that
improves estimation efficiency. As an example, we consider a general class of
pairwise interaction models. Addressing an overlooked inexistence problem, we
generalize the regularized score matching method of Lin et al. [2016] and
improve its theoretical guarantees for non-negative Gaussian graphical models.
Comment: 70 pages, 76 figure
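To make the closed-form property concrete, here is a minimal sketch of the original score matching estimator (the Hyvärinen [2005] setting on the real line, not the generalized non-negative method of this paper). It assumes a zero-mean univariate Gaussian with unknown precision `kappa`, for which the empirical objective is the sample mean of `-kappa + 0.5 * (kappa * x)**2` and the minimizer is `1 / mean(x**2)`. The function names and the simulated data are illustrative assumptions.

```python
import random

def score_matching_objective(kappa, xs):
    # For p(x) proportional to exp(-kappa * x^2 / 2):
    #   d/dx log p = -kappa * x,  d^2/dx^2 log p = -kappa.
    # Empirical score matching objective: mean of (second derivative
    # + half the squared first derivative); no normalizing constant appears.
    return sum(-kappa + 0.5 * (kappa * x) ** 2 for x in xs) / len(xs)

def score_matching_precision(xs):
    # Setting dJ/dkappa = -1 + kappa * mean(x^2) to zero gives the
    # closed-form estimate kappa = 1 / mean(x^2).
    return len(xs) / sum(x * x for x in xs)

# Simulated data (an assumption for illustration): true precision 4.0.
rng = random.Random(0)
true_kappa = 4.0
samples = [rng.gauss(0.0, true_kappa ** -0.5) for _ in range(20000)]

kappa_hat = score_matching_precision(samples)
print(f"estimated precision: {kappa_hat:.3f}")
```

Because the objective is quadratic in `kappa`, the estimate needs neither numerical integration nor iterative optimization, which is the advantage the abstract describes.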
Non-Parametric Causality Detection: An Application to Social Media and Financial Data
According to behavioral finance, stock market returns are influenced by
emotional, social and psychological factors. Several recent works support this
theory by providing evidence of correlation between stock market prices and
collective sentiment indexes measured using social media data. However, a pure
correlation analysis is not sufficient to prove that stock market returns are
influenced by such emotional factors since both stock market prices and
collective sentiment may be driven by a third unmeasured factor. Controlling
for factors that could influence the study by applying multivariate regression
models is challenging given the complexity of stock market data. False
assumptions about the linearity or non-linearity of the model and inaccuracies
in model specification may result in misleading conclusions.
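The confounding argument above can be illustrated with a small simulation. This is a toy sketch, not the framework proposed in the paper: the variable names (`hidden`, `sentiment`, `returns`) and the linear data-generating process are assumptions chosen so that sentiment has no causal effect on returns at all, yet the two are strongly correlated.

```python
import random

def pearson(a, b):
    # Sample Pearson correlation coefficient.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / (var_a * var_b) ** 0.5

def residuals(target, control):
    # Remove the least-squares linear effect of `control` from `target`.
    n = len(target)
    mt, mc = sum(target) / n, sum(control) / n
    slope = (sum((c - mc) * (t - mt) for c, t in zip(control, target))
             / sum((c - mc) ** 2 for c in control))
    return [(t - mt) - slope * (c - mc) for t, c in zip(target, control)]

rng = random.Random(0)
n = 5000
# A hidden factor drives both series; neither series affects the other.
hidden = [rng.gauss(0, 1) for _ in range(n)]
sentiment = [h + 0.5 * rng.gauss(0, 1) for h in hidden]
returns = [h + 0.5 * rng.gauss(0, 1) for h in hidden]

raw_corr = pearson(sentiment, returns)
adjusted_corr = pearson(residuals(sentiment, hidden), residuals(returns, hidden))
print(f"raw: {raw_corr:.2f}, after controlling for the hidden factor: {adjusted_corr:.2f}")
```

The adjustment works here only because the simulated relationships are linear by construction and the third factor is observed; with real market data neither can be taken for granted, which is the difficulty the abstract points to.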
In this work, we propose a novel framework for causal inference that does not
require any assumption about the statistical relationships among the variables
of the study and can effectively control a large number of factors. We apply
our method in order to estimate the causal impact that information posted in
social media may have on stock market returns of four big companies. Our
results indicate that social media data not only correlate with stock market
returns but also influence them.
Comment: Physica A: Statistical Mechanics and its Applications 201
Query-Driven Sampling for Collective Entity Resolution
Probabilistic databases play a preeminent role in the processing and
management of uncertain data. Recently, many database research efforts have
integrated probabilistic models into databases to support tasks such as
information extraction and labeling. Many of these efforts are based on
batch-oriented inference, which inhibits a real-time workflow. One important task is
entity resolution (ER). ER is the process of determining records (mentions) in
a database that correspond to the same real-world entity. Traditional pairwise
ER methods can lead to inconsistencies and low accuracy due to localized
decisions. Leading ER systems solve this problem by collectively resolving all
records using a probabilistic graphical model and Markov chain Monte Carlo
(MCMC) inference. However, for large datasets this is an extremely expensive
process. One key observation is that such an exhaustive ER process incurs a huge
up-front cost, which is wasteful in practice because most users are interested
in only a small subset of entities. In this paper, we advocate pay-as-you-go
entity resolution by developing a number of query-driven collective ER
techniques. We introduce two classes of SQL queries that involve ER operators
--- selection-driven ER and join-driven ER. We implement novel variations of
the MCMC Metropolis-Hastings algorithm to generate biased samples and
selectivity-based scheduling algorithms to support the two classes of ER
queries. Finally, we show that query-driven ER algorithms can converge and
return results within minutes over a database populated with the extraction
from a newswire dataset containing 71 million mentions.
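For readers unfamiliar with the base algorithm, the following is a minimal sketch of a generic random-walk Metropolis-Hastings sampler. It is not the query-driven variant described in the abstract, and it samples a simple continuous target rather than entity clusterings; the function name, the step size, and the standard normal target are assumptions made for illustration.

```python
import math
import random

def metropolis_hastings(log_target, x0, n_steps, step=1.0, seed=0):
    # Random-walk Metropolis-Hastings with a symmetric Gaussian proposal.
    # `log_target` is the log of the target density up to an additive constant.
    rng = random.Random(seed)
    x, logp = x0, log_target(x0)
    samples = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)
        logp_new = log_target(proposal)
        # Accept with probability min(1, target(proposal) / target(x)).
        # The proposal is symmetric, so no proposal-ratio correction is needed.
        if rng.random() < math.exp(min(0.0, logp_new - logp)):
            x, logp = proposal, logp_new
        samples.append(x)
    return samples

# Target: standard normal, specified without its normalizing constant.
chain = metropolis_hastings(lambda x: -0.5 * x * x, x0=0.0, n_steps=20000)
chain_mean = sum(chain) / len(chain)
chain_var = sum((s - chain_mean) ** 2 for s in chain) / len(chain)
print(f"mean: {chain_mean:.2f}, variance: {chain_var:.2f}")
```

A query-driven sampler in the spirit of the abstract would presumably bias the proposal toward the records a query touches; that part is not shown here.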