Wrapper Maintenance: A Machine Learning Approach
The proliferation of online information sources has led to an increased use
of wrappers for extracting data from Web sources. While most of the previous
research has focused on quick and efficient generation of wrappers, the
development of tools for wrapper maintenance has received less attention. This
is an important research problem because Web sources often change in ways that
prevent the wrappers from extracting data correctly. We present an efficient
algorithm that learns structural information about data from positive examples
alone. We describe how this information can be used for two wrapper maintenance
applications: wrapper verification and reinduction. The wrapper verification
system detects when a wrapper is not extracting correct data, usually because
the Web source has changed its format. The reinduction algorithm automatically
recovers from changes in the Web source by identifying data on Web pages so
that a new wrapper may be generated for this source. To validate our approach,
we monitored 27 wrappers over a period of a year. The verification algorithm
correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes,
resulting in precision of 0.73 and recall of 0.95. We validated the reinduction
algorithm on ten Web sources. We were able to successfully reinduce the
wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data
extraction task.
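The verification idea above (learn structure from positive examples only, then flag a wrapper whose output no longer fits) can be illustrated with a toy sketch. This is not the paper's algorithm: the token classes, the function names `token_pattern`, `learn_patterns`, and `looks_changed`, and the `min_match` threshold are all assumptions made for illustration.

```python
import re
from collections import Counter

def token_pattern(text):
    """Map a string to a coarse sequence of token types (hypothetical classes)."""
    types = []
    for tok in re.findall(r"[A-Za-z]+|\d+|[^\sA-Za-z\d]", text):
        if tok.isdigit():
            types.append("NUM")
        elif tok.isalpha():
            types.append("CAP" if tok[0].isupper() else "LOW")
        else:
            types.append(tok)  # punctuation is kept as its own type
    return tuple(types)

def learn_patterns(positive_examples):
    """Count the token-type patterns seen in correctly extracted examples."""
    return Counter(token_pattern(x) for x in positive_examples)

def looks_changed(patterns, new_extractions, min_match=0.5):
    """Flag a likely format change when too few new extractions match a
    previously learned pattern. min_match is an illustrative threshold."""
    if not new_extractions:
        return True
    matched = sum(token_pattern(x) in patterns for x in new_extractions)
    return matched / len(new_extractions) < min_match
```

For example, patterns learned from phone numbers such as "(310) 555-1234" would accept another number in the same layout and reject stray markup like "<td>Phone</td>".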
The symplectic Deligne-Mumford stack associated to a stacky polytope
We discuss a symplectic counterpart of the theory of stacky fans. First, we
define a stacky polytope and construct the symplectic Deligne-Mumford stack
associated to the stacky polytope. Then we establish a relation between stacky
polytopes and stacky fans: the stack associated to a stacky polytope is
equivalent to the stack associated to a stacky fan if the stacky fan
corresponds to the stacky polytope. Comment: 20 pages; v2: To appear in Results in Mathematic
Equivariant differential characters and symplectic reduction
We describe equivariant differential characters (classifying equivariant
circle bundles with connections), their prequantization, and reduction.
VIP: Incorporating Human Cognitive Biases in a Probabilistic Model of Retweeting
Information spread in social media depends on a number of factors, including
how the site displays information, how users navigate it to find items of
interest, users' tastes, and the `virality' of information, i.e., its
propensity to be adopted, or retweeted, upon exposure. Probabilistic models can
learn users' tastes from the history of their item adoptions and recommend new
items to users. However, current models ignore cognitive biases that are known
to affect behavior. Specifically, people pay more attention to items at the top
of a list than those in lower positions. As a consequence, items near the top
of a user's social media stream have higher visibility, and are more likely to
be seen and adopted, than those appearing below. Another bias is due to the
item's fitness: some items have a high propensity to spread upon exposure
regardless of the interests of adopting users. We propose a probabilistic model
that incorporates human cognitive biases and personal relevance in the
generative model of information spread. We use the model to predict how
messages containing URLs spread on Twitter. Our work shows that models of user
behavior that account for cognitive factors can better describe and predict
user behavior in social media. Comment: SBP 201
Analysis of the Precipitation Detection Algorithm for the GEONOR T-200B Precipitation Gauge to Improve Accuracy
In an effort to improve the precipitation detection algorithm for the Geonor All Weather Precipitation Gauge, an automated truth algorithm has been created to detect errors in the original algorithm. The original algorithm detects precipitation in real time and uses the rate of precipitation to indicate an event. The automated truth algorithm does not run in real time and instead uses precipitation accumulation to indicate an event. Because the automated truth is delayed, it can consider the data collected both before and after the point it is analyzing. The automated truth is already more accurate than the original algorithm, but its accuracy can be improved further. The goal of this study was to develop ways to improve the automated truth algorithm’s accuracy so that it can be compared against the original algorithm to detect errors. Ultimately, it will be used to detect errors in the original algorithm across years of data. To improve the truth algorithm, we created a human truth output using data collected over a four-month period by four Geonor gauges located at NCAR’s Marshall Test Field in Boulder, CO. The human truth was created by two individuals who observed the Geonor accumulation data and indicated when an event occurred. Because humans are able to process and analyze images more precisely than computers, this human truth is considered the most accurate output. It was completed using a web-based plotting tool to create graphs that can be further analyzed. The human truth output will be compared to the automated truth output to detect errors in the algorithm, so that scientists can correct these errors and improve the automated truth algorithm.
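The contrast between the two detection styles described above can be sketched as follows. This is only an illustration, not the gauge's actual algorithms: the function names, the thresholds, and the window size are hypothetical, and the input is assumed to be a list of cumulative accumulation readings at a fixed sampling interval.

```python
def detect_by_rate(accum, rate_threshold=0.1):
    """Real-time style: flag sample i when the increase since sample i-1
    exceeds rate_threshold. Only past data is used."""
    flags = []
    for i, cur in enumerate(accum):
        flags.append(i > 0 and cur - accum[i - 1] > rate_threshold)
    return flags

def detect_by_accumulation(accum, half_window=2, accum_threshold=0.2):
    """Delayed style: flag sample i when the total accumulation across a
    window centered on i (looking both back and ahead) exceeds the threshold."""
    n = len(accum)
    flags = []
    for i in range(n):
        lo = max(0, i - half_window)
        hi = min(n - 1, i + half_window)
        flags.append(accum[hi] - accum[lo] > accum_threshold)
    return flags
```

With a slow, steady increase, each step can stay below the rate threshold while the windowed accumulation still crosses its threshold. That is the kind of event a delayed, accumulation-based check can catch and a rate-based check can miss.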
Symplectic Partially Hyperbolic Automorphisms of 6-Torus
We study topological properties of automorphisms of a 6-dimensional torus
generated by integer matrices symplectic with respect to either the standard
symplectic structure in six-dimensional linear space or a nonstandard
symplectic structure given by an integer skew-symmetric non-degenerate matrix.
Such a symplectic matrix generates a partially hyperbolic automorphism of the
torus if it has eigenvalues both outside and on the unit circle. We study the
case (2,2,2), where the numbers are the dimensions of the stable, center, and
unstable subspaces of the matrix. We examine the transitive and decomposable
cases that can occur and present a classification in both cases. Comment: 15 pages, 0 figures. arXiv admin note: text overlap with
arXiv:2001.1072
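As a small numerical illustration of the definitions in this abstract, the sketch below checks the symplectic condition M^T J M = J for an integer matrix and counts eigenvalues inside, on, and outside the unit circle. The helper names and the tolerance are assumptions, and the example matrix is a simple decomposable one built here for illustration, not one taken from the paper.

```python
import numpy as np

def standard_form(n):
    """Standard symplectic form J = [[0, I], [-I, 0]] on R^(2n)."""
    I = np.eye(n, dtype=int)
    Z = np.zeros((n, n), dtype=int)
    return np.block([[Z, I], [-I, Z]])

def is_symplectic(M, J):
    """Check M^T J M = J exactly, using integer arithmetic."""
    return np.array_equal(M.T @ J @ M, J)

def subspace_dimensions(M, tol=1e-8):
    """Return (stable, center, unstable): the number of eigenvalues with
    modulus below, equal to (within tol), and above 1."""
    moduli = np.abs(np.linalg.eigvals(M))
    stable = int(np.sum(moduli < 1 - tol))
    unstable = int(np.sum(moduli > 1 + tol))
    return stable, len(moduli) - stable - unstable, unstable

# Illustrative block-diagonal example M = diag(A, (A^{-1})^T), which is
# symplectic for the standard form whenever A is invertible over the integers.
A = np.array([[2, 1, 0], [1, 1, 0], [0, 0, 1]])
A_inv_T = np.array([[1, -1, 0], [-1, 2, 0], [0, 0, 1]])
Z = np.zeros((3, 3), dtype=int)
M = np.block([[A, Z], [Z, A_inv_T]])

print(is_symplectic(M, standard_form(3)))
print(subspace_dimensions(M))
```

The same `is_symplectic` check applies to a nonstandard structure by passing any integer skew-symmetric non-degenerate matrix as `J`.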
Why Do Cascade Sizes Follow a Power-Law?
We introduce a random directed acyclic graph and use it to model the
information diffusion network. We then analyze the cascade generation
model (CGM) introduced by Leskovec et al. [19]. Until now, this model has been
studied only empirically. In this paper, we present the first
theoretical proof that the sizes of cascades generated by the CGM follow a
power-law distribution, which is consistent with multiple empirical analyses of
large social networks. We compared the assumptions of our model with the
Twitter social network and tested the goodness of approximation. Comment: 8 pages, 7 figures, accepted to WWW 201
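A simplified simulation in the spirit of this setup is sketched below. It rests on assumptions not stated in the abstract: the random DAG is taken to have each forward edge i -> j (i < j) present independently with probability `p`, and the cascade rule is a basic independent-cascade step with a single transmission probability `beta`. The function names are hypothetical, and the paper's exact graph model and CGM details may differ.

```python
import random

def random_dag(n, p, rng):
    """Random DAG on nodes 0..n-1: edge i -> j (i < j) is present
    independently with probability p (an assumed model)."""
    return {i: [j for j in range(i + 1, n) if rng.random() < p] for i in range(n)}

def cascade_size(dag, start, beta, rng):
    """Spread from `start`: each infected node gets one chance to infect
    each not-yet-infected out-neighbor, succeeding with probability beta."""
    infected = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for nbr in dag[node]:
            if nbr not in infected and rng.random() < beta:
                infected.add(nbr)
                frontier.append(nbr)
    return len(infected)

def sample_cascade_sizes(n, p, beta, trials, seed=0):
    """Build one random DAG and record cascade sizes from random start nodes."""
    rng = random.Random(seed)
    dag = random_dag(n, p, rng)
    return [cascade_size(dag, rng.randrange(n), beta, rng) for _ in range(trials)]
```

A histogram of the sizes returned by `sample_cascade_sizes` on a log-log scale would be one informal way to compare simulated cascades against a power-law shape.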