398 research outputs found

Weight Annotation in Information Extraction

Author: Benny Kimelfeld
Johannes Doleschal
Liat Peterfreund
Wim Martens
Publication venue: Logical Methods in Computer Science e.V.
Publication date: 01/01/2022

The framework of document spanners abstracts the task of information extraction from text as a function that maps every document (a string) into a relation over the document's spans (intervals identified by their start and end indices). For instance, the regular spanners are the closure under the Relational Algebra (RA) of the regular expressions with capture variables, and the expressive power of the regular spanners is precisely captured by the class of VSet-automata -- a restricted class of transducers that mark the endpoints of selected spans. In this work, we embark on the investigation of document spanners that can annotate extractions with auxiliary information such as confidence, support, and confidentiality measures. To this end, we adopt the abstraction of provenance semirings by Green et al., where tuples of a relation are annotated with the elements of a commutative semiring, and where the annotation propagates through the positive RA operators via the semiring operators. Hence, the proposed spanner extension, referred to as an annotator, maps every string into an annotated relation over the spans. As a specific instantiation, we explore weighted VSet-automata that, similarly to weighted automata and transducers, attach semiring elements to transitions. We investigate key aspects of expressiveness, such as the closure under the positive RA, and key aspects of computational complexity, such as the enumeration of annotated answers and their ranked enumeration in the case of ordered semirings. For a number of these problems, fundamental properties of the underlying semiring, such as positivity, are crucial for establishing tractability.
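As a rough illustration of the semiring annotation described in this abstract, the sketch below models an annotated relation as a Python dict from span tuples to semiring elements, with union combining annotations by the semiring's addition and natural join by its multiplication. The names `Semiring`, `row`, `union`, `join`, and `VITERBI` are hypothetical helpers chosen for this example, not taken from the paper, and the (max, ×) semiring over confidences is just one possible instantiation.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, FrozenSet


@dataclass(frozen=True)
class Semiring:
    """A commutative semiring given by its two constants and two operations."""
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]
    times: Callable[[Any, Any], Any]


def row(**spans) -> FrozenSet:
    """Build a hashable tuple that assigns a span (start, end) to each variable."""
    return frozenset(spans.items())


def union(K: Semiring, r: Dict, s: Dict) -> Dict:
    """Union: annotations of the same tuple are combined with K.plus."""
    out = dict(r)
    for t, a in s.items():
        out[t] = K.plus(out.get(t, K.zero), a)
    return out


def join(K: Semiring, r: Dict, s: Dict) -> Dict:
    """Natural join: tuples that agree on shared variables are merged,
    and their annotations are combined with K.times."""
    out: Dict = {}
    for t1, a1 in r.items():
        for t2, a2 in s.items():
            d1, d2 = dict(t1), dict(t2)
            if all(d1[v] == d2[v] for v in d1.keys() & d2.keys()):
                merged = frozenset({**d1, **d2}.items())
                out[merged] = K.plus(out.get(merged, K.zero), K.times(a1, a2))
    return out


# Assumed instantiation: confidences in [0, 1] with max as plus and product as times.
VITERBI = Semiring(zero=0.0, one=1.0, plus=max, times=lambda a, b: a * b)

r = {row(x=(0, 3)): 0.5, row(x=(4, 7)): 0.25}
s = {row(x=(0, 3), y=(8, 10)): 0.5}
r2 = {row(x=(0, 3)): 0.75}

joined = join(VITERBI, r, s)    # only x=(0, 3) has a join partner
merged = union(VITERBI, r, r2)  # x=(0, 3) keeps the larger confidence
```

Swapping in a different `Semiring` value (for example, natural numbers with ordinary addition and multiplication) changes what the annotations mean without touching `union` or `join`.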

Directory of Open Access Journals

Constant-Delay Enumeration for SLP-Compressed Documents

Author: Riveros Cristian
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 26th International Conference on Database Theory (ICDT 2023)
Publication date: 01/01/2023

Dagstuhl Research Online Publication Server

Optimization and Parallelization of RegEx Based Information Extraction

Author: Doleschal Johannes
Publication date: 01/01/2021

EPub Bayreuth

The Complexity of Aggregates over Extractions by Regular Expressions

Author: Bratman Noa
Doleschal Johannes
Kimelfeld Benny
Martens Wim
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 24th International Conference on Database Theory (ICDT 2021)
Publication date: 01/01/2021

Regular expressions with capture variables, also known as "regex-formulas", extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex-formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggregate functions, such as sum, average, and quantile, on top of regular document spanners. To this end, we formally define aggregate functions over regular document spanners and analyze the computational complexity of exact and approximate computation. More precisely, we show that in a restricted case, all studied aggregate functions can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS).
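To make the notion of spans and aggregates over extractions concrete, here is a minimal sketch using Python's standard `re` module with named groups standing in for capture variables. It is only an approximation of the setting in this abstract: `re.finditer` yields non-overlapping leftmost matches rather than all possible matches, and spans here are 0-based half-open index pairs. The helper `extract_spans` and the sample document are invented for illustration.

```python
import re


def extract_spans(pattern: str, document: str):
    """Return one row per match, mapping each capture variable to its span (start, end)."""
    rows = []
    for m in re.finditer(pattern, document):
        rows.append({var: m.span(var) for var in m.groupdict()})
    return rows


document = "apples 3, pears 12, plums 7"
pattern = r"(?P<item>[a-z]+) (?P<qty>\d+)"

# A relation over spans: each row assigns a span to the variables item and qty.
rows = extract_spans(pattern, document)

# A sum aggregate: read the substring under each qty span as a number and add them up.
total = sum(int(document[start:end]) for start, end in (r["qty"] for r in rows))
```

Keeping spans instead of extracted strings means two occurrences of the same text at different positions stay distinct rows, which matters when an aggregate counts or sums over extractions.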

Dagstuhl Research Online Publication Server

Ranked Enumeration of MSO Logic on Words

Author: Bourhis Pierre
Grez Alejandro
Jachiet Louis
Riveros Cristian
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 24th International Conference on Database Theory (ICDT 2021)
Publication date: 01/01/2021

In recent years, enumeration algorithms with bounded delay have attracted a lot of attention for several data management tasks. Given a query and the data, the task is to preprocess the data and then enumerate all the answers to the query one by one and without repetitions. This enumeration scheme is typically useful when the solutions are treated on the fly or when we want to stop the enumeration once the pertinent solutions have been found. However, with the current schemes, there is no restriction on the order in which the solutions are given, and this order usually depends on the techniques used and not on the relevance for the user. In this paper we study the enumeration of monadic second order logic (MSO) over words when the solutions are ranked. We present a framework based on MSO cost functions that allows expressing MSO formulae on words with a cost associated with each solution. We then demonstrate the generality of our framework, which subsumes, for instance, document spanners and adds ranking to them. The main technical result of the paper is an algorithm for enumerating all the solutions of formulae in increasing order of cost efficiently, namely, with a linear preprocessing phase and logarithmic delay between solutions. The novelty of this algorithm is based on using functional data structures, in particular, by extending functional Brodal queues to suit the ranked enumeration of MSO on words.

INRIA a CCSD electronic archive server

HAL Descartes

Dagstuhl Research Online Publication Server

Hal-Diderot

Update-Aware Information Extraction

Author: Kassaie Besat
Publication venue: 'University of Waterloo'
Publication date: 14/11/2023

Information extraction programs (extractors) can be applied to documents to isolate structured versions of some content by creating tabular records corresponding to facts found in the documents. When extracted relations or source documents are updated, we wish to ensure that those changes are propagated correctly. That is, we recommend that extracted relations be treated as materialized views over the document database. Because extraction is expensive, maintaining extracted relations in the presence of frequent document updates comes at a high execution cost. We propose a practical framework to effectively update extracted views to represent the most recent version of documents. Our approach entails conducting static analyses of extraction and update programs within a framework compatible with SystemT, a renowned extraction framework based on regular expressions. We describe a multi-level verification process aimed at efficiently identifying document updates for which we can autonomously compute the updated extracted views. Through comprehensive experimentation, we demonstrate the effectiveness of our approach within real-world extraction scenarios. For the reverse problem, we need to translate updates on extracted views into corresponding document updates. We rely on a translation mechanism that is based on value substitution in the source documents. We classify extractors amenable to value substitution as stable extractors. We again leverage static analyses of extraction programs to study stability for extractors expressed in a significant subset of JAPE, another rule-based extraction language. Using a document spanner representation of the JAPE program, we identify four sufficient properties for being able to translate updates back to the documents and use them to verify whether an input JAPE program is stable.

University of Waterloo's Institutional Repository

Setting up Extended Enterprises: A Data Aspects Framework

Author: Goethal Frank
Lemahieu Wilfried
Snoeck Monique
Vandenbulcke Jacques
Publication venue: AIS Electronic Library (AISeL)
Publication date: 05/12/2005

Nowadays, companies want to share information. When doing so, many issues have to be taken care of, and many options are available for most of these issues. Realizing B2Bi is a very complex task. The aim of this paper is to make it possible to oversee the complexity of information sharing in a B2B context by structuring the issues that have to be taken care of in a new framework, the DA (Data Aspects) framework, and by relating this framework to the existing FADEE framework.

AIS Electronic Library (AISeL)

The COVID-19 pandemic: Towards a societally engaged IB perspective

Author: Becker-Ritterspach Florian
Boussebaa Mehdi
Curran Louise
de Jonge Alice
Dörrenbächer Christoph
Khan Zaheer
Sinkovics Rudolf R.
Publication venue: 'Emerald'
Publication date: 06/05/2021

Peer reviewed. Postprint

Aberdeen University Research

Durham Research Online

Enlighten

The Complexity of Aggregates over Extractions by Regular Expressions

Author: Benny Kimelfeld
Johannes Doleschal
Wim Martens
Publication venue: Logical Methods in Computer Science e.V.
Publication date: 01/08/2023

Regular expressions with capture variables, also known as regex-formulas, extract relations of spans (intervals identified by their start and end indices) from text. In turn, the class of regular document spanners is the closure of the regex-formulas under the Relational Algebra. We investigate the computational complexity of querying text by aggregate functions, such as sum, average, and quantile, on top of regular document spanners. To this end, we formally define aggregate functions over regular document spanners and analyze the computational complexity of exact and approximate computation. More precisely, we show that in a restricted case, all studied aggregate functions can be computed in polynomial time. In general, however, even though exact computation is intractable, some aggregates can still be approximated with fully polynomial-time randomized approximation schemes (FPRAS).

Directory of Open Access Journals