382 research outputs found

Grammars for Document Spanners

Author: Peterfreund Liat
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 24th International Conference on Database Theory (ICDT 2021)
Publication date: 01/01/2021

We propose a new grammar-based language for defining information extractors from documents (text) that is built upon the well-studied framework of document spanners for extracting structured data from text. While previously studied formalisms for document spanners are mainly based on regular expressions, we use an extension of context-free grammars, called extraction grammars, to define the new class of context-free spanners. Extraction grammars are simply context-free grammars extended with variables that capture interval positions of the document, namely spans. While regular expressions are efficient for tokenizing and tagging, context-free grammars are also efficient for capturing structural properties. Indeed, we show that context-free spanners are strictly more expressive than their regular counterparts. We reason about the expressive power of our new class and present a pushdown-automata model that captures it. We show that extraction grammars can be evaluated with polynomial data complexity. Nevertheless, as the degree of the polynomial depends on the query, we present an enumeration algorithm for unambiguous extraction grammars that, after quintic preprocessing, outputs the results sequentially, without repetitions, with a constant delay between every two consecutive ones.
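As a rough illustration of what a context-free spanner can extract, the sketch below enumerates every span of a document whose content is a non-empty balanced parenthesis string, a property no regular expression can capture. It is a naive brute-force check, not the evaluation or enumeration algorithm from the paper; the helper names `is_balanced` and `balanced_spans` are hypothetical, and spans are written as 0-indexed half-open pairs `(i, j)`.

```python
def is_balanced(s: str) -> bool:
    """Return True if s consists only of parentheses and is balanced."""
    depth = 0
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # a closing parenthesis with no matching opener
                return False
        else:
            return False  # only parentheses are allowed in this toy language
    return depth == 0


def balanced_spans(doc: str) -> list[tuple[int, int]]:
    """All spans (i, j) with i < j such that doc[i:j] is balanced.

    Brute force over all O(n^2) spans, each checked in O(n) time.
    """
    n = len(doc)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n + 1)
        if is_balanced(doc[i:j])
    ]


spans = balanced_spans("(())()")
print(spans)  # [(0, 4), (0, 6), (1, 3), (4, 6)]
```

Each output pair plays the role of one tuple assigned to a single span variable; an extraction grammar would express the same relation declaratively by wrapping the balanced-string nonterminal in a variable.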

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Dagstuhl Research Online Publication Server

Joining Extractions of Regular Expressions

Authors: Freydenberger Dominik D., Kimelfeld Benny, Peterfreund Liat
Publication date: 30/03/2017

Regular expressions with capture variables, also known as "regex formulas," extract relations of spans (interval positions) from text. These relations can be further manipulated via Relational Algebra as studied in the context of document spanners, Fagin et al.'s formal framework for information extraction. We investigate the complexity of querying text by Conjunctive Queries (CQs) and Unions of CQs (UCQs) on top of regex formulas. We show that the lower bounds (NP-completeness and W[1]-hardness) from the relational world also hold in our setting; in particular, hardness already holds for single-character text! Yet, the upper bounds from the relational world do not carry over. Unlike the relational world, acyclic CQs, and even gamma-acyclic CQs, are hard to compute. The source of hardness is that it may be intractable to instantiate the relation defined by a regex formula, simply because it has an exponential number of tuples. Yet, we are able to establish general upper bounds. In particular, UCQs can be evaluated with polynomial delay, provided that every CQ has a bounded number of atoms (while unions and projection can be arbitrary). Furthermore, UCQ evaluation is solvable with FPT (Fixed-Parameter Tractable) delay when the parameter is the size of the UCQ.
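To make the notion of a span relation concrete, here is a minimal sketch for a regex formula with a single capture variable, written as prefix · x{body} · suffix. Unlike an ordinary regex search, the spanner semantics collects every way the whole document can be split to match, so the function below simply tries all splits. The function `spans_of` and its three-pattern interface are assumptions made for illustration, not notation from the paper, and spans are 0-indexed half-open pairs.

```python
import re


def spans_of(doc: str, prefix: str, body: str, suffix: str) -> list[tuple[int, int]]:
    """All spans (i, j) such that doc[:i] matches prefix, doc[i:j] matches
    body, and doc[j:] matches suffix (each as a full match).

    Brute force over every split of the document; intended only to show
    the semantics, not to be efficient.
    """
    n = len(doc)
    result = []
    for i in range(n + 1):
        for j in range(i, n + 1):
            if (
                re.fullmatch(prefix, doc[:i])
                and re.fullmatch(body, doc[i:j])
                and re.fullmatch(suffix, doc[j:])
            ):
                result.append((i, j))
    return result


# The formula  .* x{a+} .*  evaluated on the document "aab".
relation = spans_of("aab", r".*", r"a+", r".*")
print(relation)  # [(0, 1), (0, 2), (1, 2)]
```

Even this one-variable formula yields several tuples on a three-character document; with k variables the relation can hold on the order of n^(2k) tuples, which is why materializing it before joining can be intractable.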

arXiv.org e-Print Archive

Crossref

Loughborough University Institutional Repository

Japan’s Prostitution Prevention Law: The Case of the Missing Geisha

Author: Peterfreund Tenica
Publication venue: eRepository @ Seton Hall
Publication date: 01/01/2010

Evaluation Report

Author: Peterfreund Alan
Publication venue: ScholarWorks@UMass Amherst
Publication date: 01/01/2014

This evaluation report synthesizes the results of evaluation activities conducted by SageFox Consulting Group of the STEM DIGITAL project led by the UMass STEM Ed Institute for its no-cost extension year, covering the period September 2013 to August 2014. The goals of the program are to facilitate the participants’ abilities to stimulate student interest in STEM careers while engaging them in ways to think critically about their environment. Participating teachers incorporated digital cameras and Analyzing Digital Images (ADI) software into lab activities focusing on environmental science. STEM DIGITAL materials focused on three strands related to (1) ozone and air quality, (2) arsenic and soil contamination, and (3) water quality.

ScholarWorks@UMass Amherst

SQL Nulls and Two-Valued Logic

Authors: Libkin Leonid, Peterfreund Liat
Publication venue: Association for Computing Machinery (ACM)
Publication date: 18/06/2023

Edinburgh Research Explorer

Detecting Ambiguity in Prioritized Database Repairing

Authors: Kimelfeld Benny, Livshits Ester, Peterfreund Liat
Publication venue: LIPIcs - Leibniz International Proceedings in Informatics. 20th International Conference on Database Theory (ICDT 2017)
Publication date: 01/01/2017

In its traditional definition, a repair of an inconsistent database is a consistent database that differs from the inconsistent one in a "minimal way." Often, repairs are not equally legitimate, as it is desired to prefer one over another; for example, one fact is regarded as more reliable than another, or a more recent fact should be preferred to an earlier one. Motivated by these considerations, researchers have introduced and investigated the framework of preferred repairs, in the context of denial constraints and subset repairs. There, a priority relation between facts is lifted towards a priority relation between consistent databases, and repairs are restricted to the ones that are optimal in the lifted sense. Three notions of lifting (and optimal repairs) have been proposed: Pareto, global, and completion. In this paper we investigate the complexity of deciding whether the priority relation suffices to clean the database unambiguously, or in other words, whether there is exactly one optimal repair. We show that the different lifting semantics entail highly different complexities. Under Pareto optimality, the problem is coNP-complete, in data complexity, for every set of functional dependencies (FDs), except for the tractable case of (equivalence to) one FD per relation. Under global optimality, one FD per relation is still tractable, but we establish Pi-2-p-completeness for a relation with two FDs. In contrast, under completion optimality the problem is solvable in polynomial time for every set of FDs. In fact, we present a polynomial-time algorithm for arbitrary conflict hypergraphs. We further show that under a general assumption of transitivity, this algorithm solves the problem even for global optimality. The algorithm is extremely simple, but its proof of correctness is quite intricate.

Dagstuhl Research Online Publication Server