Search CORE

34,835 research outputs found

Wrapper Maintenance: A Machine Learning Approach

Author: Knoblock C. A.
Lerman K.
Minton S. N.
Publication venue: 'AI Access Foundation'
Publication date: 23/06/2011
Field of study

The proliferation of online information sources has led to an increased use of wrappers for extracting data from Web sources. While most of the previous research has focused on quick and efficient generation of wrappers, the development of tools for wrapper maintenance has received less attention. This is an important research problem because Web sources often change in ways that prevent the wrappers from extracting data correctly. We present an efficient algorithm that learns structural information about data from positive examples alone. We describe how this information can be used for two wrapper maintenance applications: wrapper verification and reinduction. The wrapper verification system detects when a wrapper is not extracting correct data, usually because the Web source has changed its format. The reinduction algorithm automatically recovers from changes in the Web source by identifying data on Web pages so that a new wrapper may be generated for this source. To validate our approach, we monitored 27 wrappers over a period of a year. The verification algorithm correctly discovered 35 of the 37 wrapper changes, and made 16 mistakes, resulting in precision of 0.73 and recall of 0.95. We validated the reinduction algorithm on ten Web sources. We were able to successfully reinduce the wrappers, obtaining precision and recall values of 0.90 and 0.80 on the data extraction task

arXiv.org e-Print Archive

Crossref

Four Lessons in Versatility or How Query Languages Adapt to the Web

Author: A. Bonifati
A. Gelder van
A. Polleres
A. Polleres
A.C. Klug
B. Adida
B. Cooper
B. Jenner
D. Olteanu
D. Olteanu
D. Recordon
D.D. Chamberlin
D.R. Fulkerson
E. Augurusa
F. Bry
F. Bry
F. Bry
F. Wei
F. Weigel
G. Gottlob
G. Karvounarakis
H. Björklund
H. Garcia-Molina
H. Meuss
H. Meuss
H. Przymusinska
H. Tamaki
H. Wang
H.V. Jagadish
J. Bailey
J. Euzenat
J. Pérez
J. Pérez
J. Pérez
J.D. Ullman
J.J. Carroll
J.V.D. Bussche
K. Kochut
K.A. Ross
K.R. Apt
K.S. Booth
L. Cabibbo
M. Habib
M. Kay
M. Marx
M. Marx
N. Bruno
N. Walsh
P. Boncz
P. Buneman
P. Cholak
P. O’Neil
P.G. Kolaitis
P.P. Schneider
R. Agrawal
R. Fagin
R. Goldman
R. Hull
R. Khare
R. Khare
R. Schenkel
S. Abiteboul
S. Abiteboul
S. Abiteboul
S. Al-Khalifa
S. Berger
S. Groppe
S. Trißl
T. Chen
T. Furche
T. Grust
T. Schwentick
T.C. Przymusinski
U. Assmann
W. Akhtar
W. Chen
W.L. Hsu
W.L. Hsu
Z. Chen
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009
Field of study

Exposing not only human-centered information, but machine-processable data on the Web is one of the commonalities of recent Web trends. It has enabled a new kind of applications and businesses where the data is used in ways not foreseen by the data providers. Yet this exposition has fractured the Web into islands of data, each in different Web formats: Some providers choose XML, others RDF, again others JSON or OWL, for their data, even in similar domains. This fracturing stifles innovation as application builders have to cope not only with one Web stack (e.g., XML technology) but with several ones, each of considerable complexity. With Xcerpt we have developed a rule- and pattern based query language that aims to give shield application builders from much of this complexity: In a single query language XML and RDF data can be accessed, processed, combined, and re-published. Though the need for combined access to XML and RDF data has been recognized in previous work (including the W3C’s GRDDL), our approach differs in four main aspects: (1) We provide a single language (rather than two separate or embedded languages), thus minimizing the conceptual overhead of dealing with disparate data formats. (2) Both the declarative (logic-based) and the operational semantics are unified in that they apply for querying XML and RDF in the same way. (3) We show that the resulting query language can be implemented reusing traditional database technology, if desirable. Nevertheless, we also give a unified evaluation approach based on interval labelings of graphs that is at least as fast as existing approaches for tree-shaped XML data, yet provides linear time and space querying also for many RDF graphs. We believe that Web query languages are the right tool for declarative data access in Web applications and that Xcerpt is a significant step towards a more convenient, yet highly efficient data access in a “Web of Data”

CiteSeerX

Crossref

Open Access LMU

Inference of termination conditions for numerical loops

Author: De Schreye Danny
Serebrenik Alexander
Publication venue
Publication date: 01/01/2001
Field of study

We present a new approach to termination analysis of numerical computations in logic programs. Traditional approaches fail to analyse them due to non well-foundedness of the integers. We present a technique that allows to overcome these difficulties. Our approach is based on transforming a program in way that allows integrating and extending techniques originally developed for analysis of numerical computations in the framework of query-mapping pairs with the well-known framework of acceptability. Such an integration not only contributes to the understanding of termination behaviour of numerical computations, but also allows to perform a correct analysis of such computations automatically, thus, extending previous work on a constraints-based approach to termination. In the last section of the paper we discuss possible extensions of the technique, including incorporating general term orderings.Comment: Presented at WST200

arXiv.org e-Print Archive

CiteSeerX

Repository TU/e

Pure OAI Repository

: Méthodes d'Inférence Symbolique pour les Bases de Données

Author: Staworko Slawomir
Publication venue: HAL CCSD
Publication date: 14/12/2015
Field of study

This dissertation is a summary of a line of research, that I wasactively involved in, on learning in databases from examples. Thisresearch focused on traditional as well as novel database models andlanguages for querying, transforming, and describing the schema of adatabase. In case of schemas our contributions involve proposing anoriginal languages for the emerging data models of Unordered XML andRDF. We have studied learning from examples of schemas for UnorderedXML, schemas for RDF, twig queries for XML, join queries forrelational databases, and XML transformations defined with a novelmodel of tree-to-word transducers.Investigating learnability of the proposed languages required us toexamine closely a number of their fundamental properties, often ofindependent interest, including normal forms, minimization,containment and equivalence, consistency of a set of examples, andfinite characterizability. Good understanding of these propertiesallowed us to devise learning algorithms that explore a possibly largesearch space with the help of a diligently designed set ofgeneralization operations in search of an appropriate solution.Learning (or inference) is a problem that has two parameters: theprecise class of languages we wish to infer and the type of input thatthe user can provide. We focused on the setting where the user inputconsists of positive examples i.e., elements that belong to the goallanguage, and negative examples i.e., elements that do not belong tothe goal language. In general using both negative and positiveexamples allows to learn richer classes of goal languages than usingpositive examples alone. However, using negative examples is oftendifficult because together with positive examples they may cause thesearch space to take a very complex shape and its exploration may turnout to be computationally challenging.Ce mémoire est une courte présentation d’une direction de recherche, à laquelle j’ai activementparticipé, sur l’apprentissage pour les bases de données à partir d’exemples. Cette recherches’est concentrée sur les modèles et les langages, aussi bien traditionnels qu’émergents, pourl’interrogation, la transformation et la description du schéma d’une base de données. Concernantles schémas, nos contributions consistent en plusieurs langages de schémas pour les nouveaumodèles de bases de données que sont XML non-ordonné et RDF. Nous avons ainsi étudiél’apprentissage à partir d’exemples des schémas pour XML non-ordonné, des schémas pour RDF,des requêtes twig pour XML, les requêtes de jointure pour bases de données relationnelles et lestransformations XML définies par un nouveau modèle de transducteurs arbre-à-mot.Pour explorer si les langages proposés peuvent être appris, nous avons été obligés d’examinerde près un certain nombre de leurs propriétés fondamentales, souvent souvent intéressantespar elles-mêmes, y compris les formes normales, la minimisation, l’inclusion et l’équivalence, lacohérence d’un ensemble d’exemples et la caractérisation finie. Une bonne compréhension de cespropriétés nous a permis de concevoir des algorithmes d’apprentissage qui explorent un espace derecherche potentiellement très vaste grâce à un ensemble d’opérations de généralisation adapté àla recherche d’une solution appropriée.L’apprentissage (ou l’inférence) est un problème à deux paramètres : la classe précise delangage que nous souhaitons inférer et le type d’informations que l’utilisateur peut fournir. Nousnous sommes placés dans le cas où l’utilisateur fournit des exemples positifs, c’est-à-dire deséléments qui appartiennent au langage cible, ainsi que des exemples négatifs, c’est-à-dire qui n’enfont pas partie. En général l’utilisation à la fois d’exemples positifs et négatifs permet d’apprendredes classes de langages plus riches que l’utilisation uniquement d’exemples positifs. Toutefois,l’utilisation des exemples négatifs est souvent difficile parce que les exemples positifs et négatifspeuvent rendre la forme de l’espace de recherche très complexe, et par conséquent, son explorationinfaisable

Thèses en Ligne

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Differentially Private Publication of Sparse Data

Author: Cormode Graham
Procopiuc Magda
Srivastava Divesh
Tran Thanh T. L.
Publication venue
Publication date: 04/03/2011
Field of study

The problem of privately releasing data is to provide a version of a dataset without revealing sensitive information about the individuals who contribute to the data. The model of differential privacy allows such private release while providing strong guarantees on the output. A basic mechanism achieves differential privacy by adding noise to the frequency counts in the contingency tables (or, a subset of the count data cube) derived from the dataset. However, when the dataset is sparse in its underlying space, as is the case for most multi-attribute relations, then the effect of adding noise is to vastly increase the size of the published data: it implicitly creates a huge number of dummy data points to mask the true data, making it almost impossible to work with. We present techniques to overcome this roadblock and allow efficient private release of sparse data, while maintaining the guarantees of differential privacy. Our approach is to release a compact summary of the noisy data. Generating the noisy data and then summarizing it would still be very costly, so we show how to shortcut this step, and instead directly generate the summary from the input data, without materializing the vast intermediate noisy data. We instantiate this outline for a variety of sampling and filtering methods, and show how to use the resulting summary for approximate, private, query answering. Our experimental study shows that this is an effective, practical solution, with comparable and occasionally improved utility over the costly materialization approach

arXiv.org e-Print Archive

CiteSeerX

Learning Linear Temporal Properties

Author: Gavran Ivan
Neider Daniel
Publication venue
Publication date: 01/01/2018
Field of study

We present two novel algorithms for learning formulas in Linear Temporal Logic (LTL) from examples. The first learning algorithm reduces the learning task to a series of satisfiability problems in propositional Boolean logic and produces a smallest LTL formula (in terms of the number of subformulas) that is consistent with the given data. Our second learning algorithm, on the other hand, combines the SAT-based learning algorithm with classical algorithms for learning decision trees. The result is a learning algorithm that scales to real-world scenarios with hundreds of examples, but can no longer guarantee to produce minimal consistent LTL formulas. We compare both learning algorithms and demonstrate their performance on a wide range of synthetic benchmarks. Additionally, we illustrate their usefulness on the task of understanding executions of a leader election protocol

arXiv.org e-Print Archive

Crossref

MPG.PuRe

Inductive queries for a drug designing robot scientist

Author: A. Lingas
C. Hansch
C.A. Lipinski
D.R. Jones
D.R. Jones
H. Blockeel
J. Matousek
L. Raedt De
R.D. King
R.D. King
T. Gärtner
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010
Field of study

It is increasingly clear that machine learning algorithms need to be integrated in an iterative scientific discovery loop, in which data is queried repeatedly by means of inductive queries and where the computer provides guidance to the experiments that are being performed. In this chapter, we summarise several key challenges in achieving this integration of machine learning and data mining algorithms in methods for the discovery of Quantitative Structure Activity Relationships (QSARs). We introduce the concept of a robot scientist, in which all steps of the discovery process are automated; we discuss the representation of molecular data such that knowledge discovery tools can analyse it, and we discuss the adaptation of machine learning and data mining algorithms to guide QSAR experiments

Lirias

Crossref

Bournemouth University Research Online

The University of Manchester - Institutional Repository

DIAL UCLouvain