3,966 research outputs found
Stream Processing using Grammars and Regular Expressions
In this dissertation we study regular expression based parsing and the use of
grammatical specifications for the synthesis of fast, streaming
string-processing programs.
In the first part we develop two linear-time algorithms for regular
expression based parsing with Perl-style greedy disambiguation. The first
algorithm operates in two passes in a semi-streaming fashion, using a constant
amount of working memory and an auxiliary tape storage which is written in the
first pass and consumed by the second. The second algorithm is a single-pass
and optimally streaming algorithm which outputs as much of the parse tree as is
semantically possible based on the input prefix read so far, and resorts to
buffering as many symbols as is required to resolve the next choice. Optimality
is obtained by performing a PSPACE-complete pre-analysis on the regular
expression.
In the second part we present Kleenex, a language for expressing
high-performance streaming string processing programs as regular grammars with
embedded semantic actions, and its compilation to streaming string transducers
with worst-case linear-time performance. Its underlying theory is based on
transducer decomposition into oracle and action machines, and a finite-state
specialization of the streaming parsing algorithm presented in the first part.
In the second part we also develop a new linear-time streaming parsing
algorithm for parsing expression grammars (PEG) which generalizes the regular
grammars of Kleenex. The algorithm is based on a bottom-up tabulation algorithm
reformulated using least fixed points and evaluated using an instance of the
chaotic iteration scheme by Cousot and Cousot
An Intelligent Text Extraction and Navigation System
We present sppc, a high-performance system for intelligent text extraction and navigation from German free text documents. The main purpose of sppc is to extract as much linguistic structure as possible for performing domain-specific processing. sppc consists of a set of domain-independent shallow core components which are realized by means of cascaded weighted finite state machines and generic dynamic tries. All extracted information is represented uniformly in one data structure (called the text chart) in a highly compact and linked form in order to support indexing and navigation through the set of solutions. Germa
Proceedings
Proceedings of the NODALIDA 2011 Workshop
Constraint Grammar Applications.
Editors: Eckhard Bick, Kristin Hagen, Kaili Müürisep, Trond Trosterud.
NEALT Proceedings Series, Vol. 14 (2011), vi+69 pp.
© 2011 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/19231
Tehokas sisäänpäindeterministisiin automaatteihin perustuva Constraint Grammar -jäsennin
Proceeding volume: 14 (2011)Pappret conceptualizes parsning med Constraint Grammar på ett nytt sätt som en process med två viktiga representationer. En representation innehåller lokala tvetydighet och den andra sammanfattar egenskaperna hos den lokala tvetydighet klasser. Båda representationer manipuleras med ren finite-state metoder, men deras samtrafik är en ad hoc -tillämpning av rationella potensserier. Den nya tolkningen av parsning systemet har flera praktiska fördelar, bland annat det inåt deterministiska sättet att beräkna, representera och räkna om alla potentiella tillämpningar av reglerna i meningen.Paperi uudelleenkonseptualisoi Constraint Grammarin sellaisena viitekehyksenä, jossa säännöt tarkentavat paikallisen ambiguiteetin tiivistä esitysmuotoa samalla kun sääntöjen ehdot sovitetaan piirrevektoreita vasten, jotka esittävät tiivistetyjen esitysmuotojen summia. Molemmat näkökulmat monitulkintaisuuteen käsitellään käyttäen puhtaita (pure) äärellistilaisia operaatioita. Tiivis esitysmuoto kuvataan piirrevektoreihin rationaalisten potenssisarjojen avulla. Tämä yhteys ei ole yhtään vähemmän puhdas kuin aikaisemmin vallalla ollut tulkinta, jonka edellyttää että leksikaalisen transduktorin tuottama sanan luentajoukko maagisesti linearisoidaan merkatuksi luentojen peräkkäinasetteluksi, joka syötetään puhtaille (äärellistilaisille) transduktoreille. Esitetyllä lähestymistavalla on useita käytännöllisiä etuja: mm. sisäänpäin deterministinen tapa laskea, esittää ja ylläpitää kaikki mahdolliset kohdat, joissa säännöt voivat soveltua virkkeeseen.The paper reconceptualizes Constraint Grammar as a framework where the rules refine the compact representations of local ambiguity while the rule conditions are matched against a string of feature vectors that summarize the compact representations. Both views to the ambiguity are processed with pure finite-state operations. The compact representations are mapped to feature vectors with the aid of a rational power series. This magical interconnection is not less pure than a prevalent interpretation that requires that the reading set provided by a lexical transducer is magically linearized to a marked concatenation of readings given to pure transducers. The current approach has several practical benefits, including the inward deterministic way to compute, represent and maintain all the applications of the rules in the sentence.Peer reviewe
A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs
International audienceThis work presents a method that enables Arabic NLP community to build scalable lexical resources. The proposed method is low cost and efficient in time in addition to its scalability and extendibility. The latter is reflected in the ability for the method to be incremental in both aspects, processing resources and generating lexicons. Using a corpus; firstly, tokens are drawn from the corpus and lemmatized. Secondly, finite state transducers (FSTs) are generated semi-automatically. Finally, FSTsare used to produce all possible inflected verb forms with their full morphological features. Among the algorithm’s strength is its ability to generate transducers having 184 transitions, which is very cumbersome, if manually designed. The second strength is a new inflection scheme of Arabic verbs; this increases the efficiency of FST generation algorithm. The experimentation uses a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The resulting open lexical resources coverage is high. Our resources cover more than 70% Arabic verbs. The built resources contain 16,855 verb lemmas and 11,080,355 fully, partially and not vocalized verbal inflected forms. All these resources are being made public and currently used as an open package in the Unitex framework available under the LGPL license
Generalising tree traversals and tree transformations to DAGs:Exploiting sharing without the pain
We present a recursion scheme based on attribute grammars that can be transparently applied to trees and acyclic graphs. Our recursion scheme allows the programmer to implement a tree traversal or a tree transformation and then apply it to compact graph representations of trees instead. The resulting graph traversal or graph transformation avoids recomputation of intermediate results for shared nodes – even if intermediate results are used in different contexts. Consequently, this approach leads to asymptotic speedup proportional to the compression provided by the graph representation. In general, however, this sharing of intermediate results is not sound. Therefore, we complement our implementation of the recursion scheme with a number of correspondence theorems that ensure soundness for various classes of traversals. We illustrate the practical applicability of the implementation as well as the complementing theory with a number of examples
- …