2,356 research outputs found
Joining Extractions of Regular Expressions
Regular expressions with capture variables, also known as "regex formulas,"
extract relations of spans (interval positions) from text. These relations can
be further manipulated via Relational Algebra as studied in the context of
document spanners, Fagin et al.'s formal framework for information extraction.
We investigate the complexity of querying text by Conjunctive Queries (CQs) and
Unions of CQs (UCQs) on top of regex formulas. We show that the lower bounds
(NP-completeness and W[1]-hardness) from the relational world also hold in our
setting; in particular, hardness hits already single-character text! Yet, the
upper bounds from the relational world do not carry over. Unlike the relational
world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source
of hardness is that it may be intractable to instantiate the relation defined
by a regex formula, simply because it has an exponential number of tuples. Yet,
we are able to establish general upper bounds. In particular, UCQs can be
evaluated with polynomial delay, provided that every CQ has a bounded number of
atoms (while unions and projection can be arbitrary). Furthermore, UCQ
evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the
parameter is the size of the UCQ
Joining extractions of regular expressions
Regular expressions with capture variables, also known as “regex formulas,” extract relations of spans (interval positions) from text. These relations can be further manipulated via the relational
Algebra as studied in the context of “document spanners,” Fagin et al.’s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. Such queries have been investigated in prior work on document spanners, but little is known about the (combined) complexity of their evaluation. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness hits already single-character text. Yet, the upper bounds from the relational world do not carry over. Unlike the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it has an exponential number of tuples. Yet, we are able to establish general upper bounds. In particular, UCQs can be evaluated with polynomial delay, provided that every CQ has a bounded number of atoms (while unions and projection can be arbitrary). Furthermore, UCQ evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the parameter is the size of the UCQ
vSPARQL: A View Definition Language for the Semantic Web
Translational medicine applications would like to leverage the biological and biomedical ontologies, vocabularies, and data sets available on the semantic web. We present a general solution for RDF information set reuse inspired by database views. Our view definition language, vSPARQL, allows applications to specify the exact content that they are interested in and how that content should be restructured or modified. Applications can access relevant content by querying against these view definitions. We evaluate the expressivity of our approach by defining views for practical use cases and comparing our view definition language to existing query languages
The Complexity of Aggregates over Extractions by Regular Expressions
Regular expressions with capture variables, also known as "regex-formulas", extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggregate functions, such as sum, average, and quantile, on top of regular document spanners. To this end, we formally define aggregate functions over regular document spanners and analyze the computational complexity of exact and approximate computation. More precisely, we show that in a restricted case, all studied aggregate functions can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS)
Declarative approach to data extraction of web pages
Thesis submitted to Faculdade de Ciências e Tecnologia of the Universidade Nova de Lisboa, in partial fulfilment of the requirements for the degree of Master in Computer ScienceIn the last few years, we have been witnessing a noticeable WEB evolution with the introduction of significant improvements at technological level, such as the emergence of XHTML, CSS,Javascript, and Web2.0, just to name ones. This, combined with other factors such as physical
expansion of the Web, as well as its low cost, have been the great motivator for the organizations and the general public to join, with a consequent growth in the number of users and thus influencing the volume of the largest global data repository.
In consequence, there was an increasing need for regular data acquisition from the WEB, and because of its frequency, length or complexity, it would only be viable to obtain through automatic extractors. However, two main difficulties are inherent to automatic extractors. First, much of the
Web's information is presented in visual formats mainly directed for human reading. Secondly, the introduction of dynamic webpages, which are brought together in local memory from different sources, causing some pages not to have a source file.
Therefore, this thesis proposes a new and more modern extractor, capable of supporting the Web evolution, as well as being generic, so as to be able to be used in any situation, and capable of being extended and easily adaptable to a more particular use. This project is an extension of an earlier one which had the capability of extractions on semi-structured text files. However it evolved to a modular extraction system capable of extracting data from webpages, semi-structured text files and be expanded to support other data source types. It also contains a more complete and generic
validation system and a new data delivery system capable of performing the earlier deliveries as well as new generic ones.
A graphical editor was also developed to support the extraction system features and to allow a domain expert without computer knowledge to create extractions with only a few simple and intuitive interactions on the rendered webpage
Automatizing Software Cognitive Complexity Reduction
We model the cognitive complexity reduction of a method as an optimization problem where the search space contains all sequences of Extract Method refactoring opportunities. We then propose a novel approach that searches for feasible code extractions allowing developers to apply them, all in an automated way. This will allow software developers to make informed decisions while reducing the complexity of their code. We evaluated our approach over 10 open-source software projects and was able to fix 78% of the 1,050 existing cognitive complexity issues reported by SonarQube. We finally discuss the limitations of the proposed approach and provide interesting findings and guidelines for developers.Universidad de Málaga (grants B1-2020_01 and B4-2019-05)
Project PID2020-116727RB-I00 funded by MCIN/AEI /10.13039/501100011033
Rubén Saborido was recipient of a Juan de la Cierva grant FJC2018-038537-I funded by MCIN/AEI /10.13039/501100011033.
Javier Ferrer was supported by a postdoc grant (DOC/00488) funded by the Andalusian Ministry of Economic Transformation, Industry, Knowledge and Universities
A Purely Regular Approach to Non-Regular Core Spanners
The regular spanners (characterised by vset-automata) are closed under the
algebraic operations of union, join and projection, and have desirable
algorithmic properties. The core spanners (introduced by Fagin, Kimelfeld,
Reiss, and Vansummeren (PODS 2013, JACM 2015) as a formalisation of the core
functionality of the query language AQL used in IBM's SystemT) additionally
need string equality selections and it has been shown by Freydenberger and
Holldack (ICDT 2016, Theory of Computing Systems 2018) that this leads to high
complexity and even undecidability of the typical problems in static analysis
and query evaluation. We propose an alternative approach to core spanners: by
incorporating the string-equality selections directly into the regular language
that represents the underlying regular spanner (instead of treating it as an
algebraic operation on the table extracted by the regular spanner), we obtain a
fragment of core spanners that, while having slightly weaker expressive power
than the full class of core spanners, arguably still covers the intuitive
applications of string equality selections for information extraction and has
much better upper complexity bounds of the typical problems in static analysis
and query evaluation
- …