Wrapper Maintenance: A Machine Learning Approach
The proliferation of online information sources has led to an increased use
of wrappers for extracting data from Web sources. While most of the previous
research has focused on quick and efficient generation of wrappers, the
development of tools for wrapper maintenance has received less attention. This
is an important research problem because Web sources often change in ways that
prevent the wrappers from extracting data correctly. We present an efficient
algorithm that learns structural information about data from positive examples
alone. We describe how this information can be used for two wrapper maintenance
applications: wrapper verification and reinduction. The wrapper verification
system detects when a wrapper is not extracting correct data, usually because
the Web source has changed its format. The reinduction algorithm automatically
recovers from changes in the Web source by identifying data on Web pages so
that a new wrapper may be generated for this source. To validate our approach,
we monitored 27 wrappers over a period of a year. The verification algorithm
correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes,
resulting in precision of 0.73 and recall of 0.95. We validated the reinduction
algorithm on ten Web sources. We were able to successfully reinduce the
wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data
extraction task
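
As a quick check on the figures quoted above, the recall follows directly from the 35 of 37 detected changes. The sketch below uses the standard definitions of precision and recall; the function names are illustrative and do not come from the paper. The abstract does not say how the 16 mistakes split into false alarms and misses, so precision is defined but not recomputed here.

```python
def recall(true_positives: int, false_negatives: int) -> float:
    """Fraction of real wrapper changes that the verifier detected."""
    return true_positives / (true_positives + false_negatives)


def precision(true_positives: int, false_positives: int) -> float:
    """Fraction of reported changes that were real changes."""
    return true_positives / (true_positives + false_positives)


# From the abstract: 35 of the 37 wrapper changes were discovered,
# so 2 changes were missed (false negatives).
detected = 35
missed = 37 - 35

verification_recall = recall(detected, missed)
print(f"recall = {verification_recall:.2f}")
```

Rounded to two decimals, this agrees with the reported recall of 0.95.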
Equivariant differential characters and symplectic reduction
We describe equivariant differential characters (classifying equivariant
circle bundles with connections), their prequantization, and reduction
Robust computation of linear models by convex relaxation
Consider a dataset of vector-valued observations that consists of noisy
inliers, which are explained well by a low-dimensional subspace, along with
some number of outliers. This work describes a convex optimization problem,
called REAPER, that can reliably fit a low-dimensional model to this type of
data. This approach parameterizes linear subspaces using orthogonal projectors,
and it uses a relaxation of the set of orthogonal projectors to reach the
convex formulation. The paper provides an efficient algorithm for solving the
REAPER problem, and it documents numerical experiments which confirm that
REAPER can dependably find linear structure in synthetic and natural data. In
addition, when the inliers lie near a low-dimensional subspace, there is a
rigorous theory that describes when REAPER can approximate this subspace.
Comment: Formerly titled "Robust computation of linear models, or How to find
a needle in a haystack
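
For orientation, the relaxation described in this abstract can be sketched as follows. The notation is an assumption, not taken from the listing: $\mathcal{X}$ is the set of observations, $d$ is the target subspace dimension, and $P$ is a symmetric matrix. The orthogonal projectors of rank $d$ (the non-convex set $P^2 = P$, $\operatorname{tr} P = d$) are replaced by their convex hull:

```latex
\begin{equation*}
\operatorname*{minimize}_{P}\ \sum_{x \in \mathcal{X}} \lVert x - P x \rVert_2
\quad \text{subject to} \quad 0 \preceq P \preceq I, \quad \operatorname{tr} P = d .
\end{equation*}
```

The residual norms are summed without squaring, which is what limits the influence of outliers relative to a least-squares fit.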
VIP: Incorporating Human Cognitive Biases in a Probabilistic Model of Retweeting
Information spread in social media depends on a number of factors, including
how the site displays information, how users navigate it to find items of
interest, users' tastes, and the `virality' of information, i.e., its
propensity to be adopted, or retweeted, upon exposure. Probabilistic models can
learn users' tastes from the history of their item adoptions and recommend new
items to users. However, current models ignore cognitive biases that are known
to affect behavior. Specifically, people pay more attention to items at the top
of a list than those in lower positions. As a consequence, items near the top
of a user's social media stream have higher visibility, and are more likely to
be seen and adopted, than those appearing below. Another bias is due to the
item's fitness: some items have a high propensity to spread upon exposure
regardless of the interests of adopting users. We propose a probabilistic model
that incorporates human cognitive biases and personal relevance in the
generative model of information spread. We use the model to predict how
messages containing URLs spread on Twitter. Our work shows that models of user
behavior that account for cognitive factors can better describe and predict
user behavior in social media.
Comment: SBP 201
Variational Data Assimilation via Sparse Regularization
This paper studies the role of sparse regularization in a properly chosen
basis for variational data assimilation (VDA) problems. Specifically, it
focuses on data assimilation of noisy and down-sampled observations while the
state variable of interest exhibits sparsity in the real or transformed domain.
We show that in the presence of sparsity, the $\ell_1$-norm regularization
produces more accurate and stable solutions than the classic data assimilation
methods. To motivate further developments of the proposed methodology,
assimilation experiments are conducted in the wavelet and spectral domain using
the linear advection-diffusion equation
Narrative Health Communication and Behavior Change: The Influence of Exemplars in the News on Intention to Quit Smoking.
This study investigated psychological mechanisms underlying the effect of narrative health communication on behavioral intention. Specifically, the study examined how exemplification in news about successful smoking cessation affects recipients' narrative engagement, thereby changing their intention to quit smoking. Nationally representative samples of U.S. adult smokers participated in 2 experiments. The results from the 2 experiments consistently showed that smokers reading a news article with an exemplar experienced greater narrative engagement compared to those reading an article without an exemplar. Those who reported more engagement were in turn more likely to report greater smoking cessation intentions
Why Do Cascade Sizes Follow a Power-Law?
We introduce a random directed acyclic graph and use it to model the
information diffusion network. Subsequently, we analyze the cascade generation
model (CGM) introduced by Leskovec et al. [19]. Until now, only empirical
studies of this model have been done. In this paper, we present the first
theoretical proof that the sizes of cascades generated by the CGM follow a
power-law distribution, which is consistent with multiple empirical analyses of
large social networks. We compared the assumptions of our model with the
Twitter social network and tested the goodness of approximation.
Comment: 8 pages, 7 figures, accepted to WWW 201