
    Wrapper Maintenance: A Machine Learning Approach

    The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes, resulting in a precision of 0.73 and a recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task.
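
The precision and recall figures quoted for the verification experiment follow the standard definitions. The sketch below is illustrative only: the function names are hypothetical, and it assumes the 2 undetected wrapper changes count as false negatives. The abstract does not say how the 16 mistakes split into false alarms and misses, so only the recall figure is reproduced here.

```python
def precision(true_pos: int, false_pos: int) -> float:
    """Fraction of reported changes that were real changes."""
    return true_pos / (true_pos + false_pos)


def recall(true_pos: int, false_neg: int) -> float:
    """Fraction of real changes that were reported."""
    return true_pos / (true_pos + false_neg)


# Counts taken from the abstract: 35 of 37 wrapper changes were discovered.
detected_changes = 35
total_changes = 37

# Assumption: the 2 undetected changes are the false negatives.
verification_recall = recall(detected_changes, total_changes - detected_changes)
print(round(verification_recall, 2))  # prints 0.95
```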

    The symplectic Deligne-Mumford stack associated to a stacky polytope

    We discuss a symplectic counterpart of the theory of stacky fans. First, we define a stacky polytope and construct the symplectic Deligne-Mumford stack associated to the stacky polytope. Then we establish a relation between stacky polytopes and stacky fans: the stack associated to a stacky polytope is equivalent to the stack associated to a stacky fan if the stacky fan corresponds to the stacky polytope. Comment: 20 pages; v2: To appear in Results in Mathematic

    Equivariant differential characters and symplectic reduction

    We describe equivariant differential characters (classifying equivariant circle bundles with connections), their prequantization, and reduction.

    VIP: Incorporating Human Cognitive Biases in a Probabilistic Model of Retweeting

    Information spread in social media depends on a number of factors, including how the site displays information, how users navigate it to find items of interest, users' tastes, and the 'virality' of information, i.e., its propensity to be adopted, or retweeted, upon exposure. Probabilistic models can learn users' tastes from the history of their item adoptions and recommend new items to users. However, current models ignore cognitive biases that are known to affect behavior. Specifically, people pay more attention to items at the top of a list than those in lower positions. As a consequence, items near the top of a user's social media stream have higher visibility, and are more likely to be seen and adopted, than those appearing below. Another bias is due to the item's fitness: some items have a high propensity to spread upon exposure regardless of the interests of adopting users. We propose a probabilistic model that incorporates human cognitive biases and personal relevance in the generative model of information spread. We use the model to predict how messages containing URLs spread on Twitter. Our work shows that models of user behavior that account for cognitive factors can better describe and predict user behavior in social media. Comment: SBP 201

    Analysis of the Precipitation Detection Algorithm for the GEONOR T-200B Precipitation Gauge to Improve Accuracy

    In an effort to improve the precipitation detection algorithm for the Geonor All Weather Precipitation Gauge, an automated truth algorithm has been created to detect errors in the original algorithm. The original algorithm detects precipitation in real time and uses the rate of precipitation to indicate an event. The automated truth does not detect in real time, and focuses on precipitation accumulation to indicate an event. Since the automated truth is delayed, it is able to consider the data collected before and after the point it is analyzing. The automated truth is already more accurate than the original algorithm, but the accuracy can be improved further. The goal of this study was to develop ways to improve the automated truth algorithm's accuracy in order to compare it to the original algorithm to detect errors. Ultimately, this will be used to detect errors in the original algorithm for years of data. In order to improve the truth algorithm, we created a human truth output using data collected over a four-month time period by four Geonor gauges located at NCAR's Marshall Test Field in Boulder, CO. The human truth was created by two individuals who observed the Geonor accumulation data and indicated when an event occurred. Because humans are able to process and analyze images more precisely than computers, this human truth is considered the most accurate output. It was completed using a web-based plotting tool to create graphs that can be further analyzed. The human truth output will be compared to the automated truth output in order to detect errors in the algorithm so that scientists will be able to correct these errors and improve the automated truth algorithm.
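
The contrast between a real-time, rate-based detector and a delayed, accumulation-based one can be sketched as follows. This is not the Geonor algorithm or the automated truth algorithm from the study: the function names, thresholds, window size, and sample readings are all hypothetical, chosen only to show why a detector that can look both before and after a point may catch slow events that a rate threshold misses.

```python
from typing import List


def detect_by_rate(accum: List[float], rate_threshold: float) -> List[bool]:
    """Real-time style: flag a sample when the increase since the
    previous sample meets the threshold. Only past data is used."""
    flags = [False]  # the first sample has no predecessor
    for prev, cur in zip(accum, accum[1:]):
        flags.append(cur - prev >= rate_threshold)
    return flags


def detect_by_accumulation(
    accum: List[float], half_window: int, accum_threshold: float
) -> List[bool]:
    """Delayed style: flag a sample when the total accumulation over a
    window centered on it (clipped at the ends) meets the threshold."""
    last = len(accum) - 1
    flags = []
    for i in range(len(accum)):
        lo = max(0, i - half_window)
        hi = min(last, i + half_window)
        flags.append(accum[hi] - accum[lo] >= accum_threshold)
    return flags


# Hypothetical cumulative gauge readings (mm) for a slow, steady drizzle.
readings = [0.0, 0.0, 0.125, 0.25, 0.375, 0.5, 0.5, 0.5]

rate_flags = detect_by_rate(readings, rate_threshold=0.25)
accum_flags = detect_by_accumulation(readings, half_window=2, accum_threshold=0.375)

print(any(rate_flags))  # prints False: no single step is steep enough
print(accum_flags)      # three samples in the middle of the drizzle are flagged
```

In this toy series every individual step stays below the rate threshold, so the rate-based detector reports nothing, while the windowed accumulation check flags the middle of the event.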

    Mining social semantics on the social web


    Symplectic Partially Hyperbolic Automorphisms of 6-Torus

    We study topological properties of automorphisms of a 6-dimensional torus generated by integer matrices that are symplectic with respect to either the standard symplectic structure in six-dimensional linear space or a nonstandard symplectic structure given by an integer skew-symmetric non-degenerate matrix. Such a symplectic matrix generates a partially hyperbolic automorphism of the torus if it has eigenvalues both outside and on the unit circle. We study the case (2,2,2), where the numbers are the dimensions of the stable, center, and unstable subspaces of the matrix. We study the transitive and decomposable cases possible here and present a classification in both cases. Comment: 15 pages, 0 figures. arXiv admin note: text overlap with arXiv:2001.1072
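
The symplecticity condition mentioned above, M^T J M = J, can be checked directly for an integer matrix. The sketch below is a toy illustration and not taken from the paper: the helper names are hypothetical, J is assumed to be the standard form [[0, I], [-I, 0]], and the example matrix is a decomposable one built from two copies of the cat map [[2, 1], [1, 1]] (one eigenvalue inside and one outside the unit circle each) and a quarter-turn rotation [[0, -1], [1, 0]] (eigenvalues ±i on the unit circle), which gives dimensions (2,2,2).

```python
from typing import List, Sequence, Tuple

Matrix = List[List[int]]
Block = Tuple[Tuple[int, int], Tuple[int, int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Exact integer matrix product."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def standard_form(n: int) -> Matrix:
    """Standard symplectic form J = [[0, I], [-I, 0]] of size 2n."""
    j = [[0] * (2 * n) for _ in range(2 * n)]
    for i in range(n):
        j[i][n + i] = 1
        j[n + i][i] = -1
    return j


def is_symplectic(m: Matrix, j: Matrix) -> bool:
    """Check M^T J M == J using exact integer arithmetic."""
    return matmul(matmul(transpose(m), j), m) == j


def from_planes(blocks: Sequence[Block]) -> Matrix:
    """Build a 2n x 2n matrix acting on each coordinate plane (q_i, p_i)
    by the 2x2 block [[a, b], [c, d]], in the ordering (q_1..q_n, p_1..p_n)."""
    n = len(blocks)
    m = [[0] * (2 * n) for _ in range(2 * n)]
    for i, ((a, b), (c, d)) in enumerate(blocks):
        m[i][i], m[i][n + i] = a, b
        m[n + i][i], m[n + i][n + i] = c, d
    return m


cat_map = ((2, 1), (1, 1))    # hyperbolic, determinant 1
rotation = ((0, -1), (1, 0))  # eigenvalues on the unit circle, determinant 1

M = from_planes([cat_map, cat_map, rotation])
J = standard_form(3)
print(is_symplectic(M, J))  # prints True
```

For a nonstandard symplectic structure, the same check applies with `J` replaced by the given integer skew-symmetric non-degenerate matrix.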

    Why Do Cascade Sizes Follow a Power-Law?

    We introduce a random directed acyclic graph and use it to model the information diffusion network. Subsequently, we analyze the cascade generation model (CGM) introduced by Leskovec et al. [19]. Until now, only empirical studies of this model have been done. In this paper, we present the first theoretical proof that the sizes of cascades generated by the CGM follow the power-law distribution, which is consistent with multiple empirical analyses of large social networks. We compared the assumptions of our model with the Twitter social network and tested the goodness of approximation. Comment: 8 pages, 7 figures, accepted to WWW 201