280 research outputs found

An integrated architecture for shallow and deep processing

Author: Becker Markus
Crysmann Berthold
Frank Anette
Kiefer Bernd
Krieger Hans-Ulrich
Müller Stefan
Neumann Günter
Piskorski Jakub
Schäfer Ulrich
Siegel Melanie
Uszkoreit Hans
Xu Feiyu
Publication date: 21/12/2011

We present an architecture for the integration of shallow and deep NLP components, aimed at the flexible combination of different language technologies for a range of practical current and future applications. In particular, we describe the integration of a high-level HPSG parsing system with different high-performance shallow components, ranging from named entity recognition to chunk parsing and shallow clause recognition. The NLP components enrich a representation of natural language text with layers of new XML meta-information using a single shared data structure, called the text chart. We describe details of the integration methods, and show how information extraction and language checking applications for real-world German text benefit from a deep grammatical analysis.
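The idea of several components writing annotation layers into one shared structure can be sketched as follows. This is only an illustration under assumptions: the element names (`chart`, `text`, `layer`, `span`), the helper functions, and the sample sentence are hypothetical and are not taken from the paper, whose actual text chart format is not given in the abstract.

```python
import xml.etree.ElementTree as ET

def new_chart(text):
    """Create a hypothetical shared chart that holds the raw text once."""
    chart = ET.Element("chart")
    ET.SubElement(chart, "text").text = text
    return chart

def add_layer(chart, name, spans):
    """Let one component add a named layer of (start, end, label) spans."""
    layer = ET.SubElement(chart, "layer", {"name": name})
    for start, end, label in spans:
        ET.SubElement(layer, "span",
                      {"start": str(start), "end": str(end), "label": label})
    return layer

def spans_in(chart, name):
    """Read a layer back as (covered text, label) pairs."""
    text = chart.findtext("text")
    layer = chart.find(f"layer[@name='{name}']")
    return [(text[int(s.get("start")):int(s.get("end"))], s.get("label"))
            for s in layer.findall("span")]

chart = new_chart("Anna Meier works in Berlin.")
# A shallow named-entity component and a chunker each contribute one layer.
add_layer(chart, "named_entities", [(0, 10, "PERSON"), (20, 26, "LOCATION")])
add_layer(chart, "chunks", [(0, 10, "NP"), (11, 16, "VP"), (17, 26, "PP")])
entities = spans_in(chart, "named_entities")
```

Because every layer refers to character offsets in the same text, a later component (for example a deep parser) could read the earlier layers without depending on the component that produced them.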

Hochschulschriftenserver - Universität Frankfurt am Main

Rumble: Data Independence for Large Messy Data Sets

Author: Alonso Gustavo
Cikis Can Berker
Fourny Ghislain
Irimescu Stefan
Müller Ingo
Publication date: 06/05/2020

This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous and nested collections of JSON objects, leveraging the parallel capabilities of Spark so as to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings showing that JSONiq can efficiently run on Spark to query billions of objects into, at least, the TB range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this kind of commonly found input is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, occasionally even a gain, over Spark SQL for structured data, and a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way as SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does on highly structured tables.
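To make "heterogeneous and nested" concrete, here is a small sketch of a FLWOR-shaped pipeline (for, where, group by, return) written in plain Python rather than JSONiq or Spark. The sample records, the field names, and the helper functions are assumptions made for illustration; this is not Rumble's implementation or its mapping to Spark.

```python
import json
from collections import defaultdict

# Hypothetical input: "tags" may be a list, a single string, or absent,
# and "country" may be missing altogether.
RAW_LINES = [
    '{"id": 1, "country": "CH", "tags": ["a", "b"]}',
    '{"id": 2, "country": "CH", "tags": "a"}',
    '{"id": 3, "country": "DE"}',
    '{"id": 4, "tags": ["c"]}',
]

def as_sequence(value):
    """Normalize a missing value, a scalar, or a list to a list."""
    if value is None:
        return []
    if isinstance(value, list):
        return value
    return [value]

def count_tags_by_country(lines):
    """For each object, keep those with a country, group by it, count tags."""
    counts = defaultdict(int)
    for line in lines:                       # "for" clause
        obj = json.loads(line)
        country = obj.get("country")
        if country is None:                  # "where" clause
            continue
        # "group by" plus aggregation over the normalized tag sequence
        counts[country] += len(as_sequence(obj.get("tags")))
    return dict(counts)

result = count_tags_by_country(RAW_LINES)
```

The explicit normalization in `as_sequence` is the kind of per-field handling that a query language with built-in support for sequences and absent values is meant to make unnecessary.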

arXiv.org e-Print Archive

Repository for Publications and Research Data

Static and dynamic semantics of NoSQL languages

Author: Giuseppe Castagna
Jérôme Siméon
K.
Kim Nguyen
Martens W.
Nguyen K.
Tannen V.
Véronique Benzaken
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2013

We present a calculus for processing semistructured data that spans differences of application area among several novel query languages, broadly categorized as "NoSQL". This calculus lets users define their own operators, capturing a wider range of data processing capabilities, whilst providing a typing precision so far typical only of primitive hard-coded operators. The type inference algorithm is based on semantic type checking, resulting in type information that is both precise and flexible enough to handle structured and semistructured data. We illustrate the use of this calculus by encoding a large fragment of Jaql, including operations and iterators over JSON, embedded SQL expressions, and co-grouping, and show how the encoding directly yields a typing discipline for Jaql as it is, namely without the addition of any type definition or type annotation in the code.

arXiv.org e-Print Archive

HAL-CentraleSupelec

Crossref

Hal-Diderot

A TYPE ANALYSIS OF REWRITE STRATEGIES

Author: Mametjanov Azamat
Publication venue: DigitalCommons@UNO
Publication date: 01/12/2010

Rewrite strategies provide an algorithmic rewriting of terms using strategic compositions of rewrite rules. Due to the programmability of rewrites, errors are often made due to incorrect compositions of rewrites or incorrect application of rewrites to a term within a strategic rewriting program. In practical applications of strategic rewriting, testing and debugging become substantially time-intensive for large programs applied to large inputs derived from large term grammars. In essence, determining which rewrite in what position in a term did or did not fire comes down to logging, tracing and/or diff-like comparison of inputs to outputs. In this thesis, we explore type-enabled analysis of strategic rewriting programs to detect errors statically. In particular, we introduce high-precision types to closely approximate the dynamic behavior of rewriting. We also use union types to track sets of types due to the presence of strategic compositions. In this framework of high-precision strategic typing, we develop and implement an expressive type system for a representative strategic rewriting language TL. The results of this research are sufficiently broad to be adapted to other strategic rewriting languages. In particular, the type-inferencing algorithm does not require explicit type annotations for minimal impact on an existing language. Based on our experience with the implementation, the type system significantly reduces the time and effort to program correct rewrite strategies while performing the analysis on the order of thousands of source lines of code per second.
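For readers unfamiliar with strategic rewriting, the following sketch shows what "strategic compositions of rewrite rules" can look like. It uses generic combinators (choice, attempt, bottom-up traversal) over terms encoded as nested tuples; the encoding, the rule names, and the combinator names are assumptions for illustration and do not reproduce TL's syntax or the thesis's type system.

```python
def rule_add_zero(term):
    """Rewrite ("add", x, ("num", 0)) to x; return None if the rule does not fire."""
    if (isinstance(term, tuple) and len(term) == 3
            and term[0] == "add" and term[2] == ("num", 0)):
        return term[1]
    return None

def rule_mul_one(term):
    """Rewrite ("mul", x, ("num", 1)) to x; return None if the rule does not fire."""
    if (isinstance(term, tuple) and len(term) == 3
            and term[0] == "mul" and term[2] == ("num", 1)):
        return term[1]
    return None

def choice(first, second):
    """Try `first`; if it fails (returns None), try `second`."""
    def strategy(term):
        result = first(term)
        return result if result is not None else second(term)
    return strategy

def attempt(inner):
    """Apply `inner`; on failure, leave the term unchanged."""
    def strategy(term):
        result = inner(term)
        return term if result is None else result
    return strategy

def bottom_up(inner):
    """Apply `inner` to all subterms first, then to the rebuilt term."""
    def strategy(term):
        if isinstance(term, tuple):
            children = []
            for child in term[1:]:
                new_child = strategy(child) if isinstance(child, tuple) else child
                if new_child is None:
                    return None  # propagate failure from a subterm
                children.append(new_child)
            term = (term[0],) + tuple(children)
        return inner(term)
    return strategy

simplify = bottom_up(attempt(choice(rule_add_zero, rule_mul_one)))
term = ("mul", ("add", ("var", "x"), ("num", 0)), ("num", 1))
simplified = simplify(term)
```

In an untyped setting like this one, a rule applied to the wrong kind of term simply fails silently, which is the class of error a static analysis of strategies aims to catch before the program runs.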

The University of Nebraska, Omaha

an extensible tuplespace as XML-middleware

Author: Liebsch Franziska
Nguyen Duc M.
Tolksdorf Robert
Publication date: 01/01/2003

XMLSpaces.NET implements the Linda concept as a middleware for XML documents. It introduces an extended matching flexibility on nested tuples and richer data types for fields, including objects and XML documents. It is completely XML-based since data, tuples and tuplespaces are seen as trees represented as XML documents. XMLSpaces.NET is extensible in that it supports a hierarchy of matching relations on tuples and an open set of matching relations amongst data, documents and objects. It is currently being implemented on the .NET platform.
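The Linda concept with matching on nested tuples can be sketched in a few lines. This is a plain-Python illustration, not the XMLSpaces.NET API: the class, the method names, and the matching rule (a type in the template matches any value of that type) are assumptions, and unlike classic Linda the read and take operations here return None instead of blocking when nothing matches.

```python
def matches(template, value):
    """Match a template against a value, recursing into nested tuples."""
    if isinstance(template, type):
        # A type acts as a typed wildcard, e.g. int matches any integer.
        return isinstance(value, template)
    if isinstance(template, tuple):
        return (isinstance(value, tuple)
                and len(template) == len(value)
                and all(matches(t, v) for t, v in zip(template, value)))
    # Anything else must be equal to the actual field.
    return template == value

class TupleSpace:
    """Minimal, non-blocking tuple space (hypothetical sketch)."""

    def __init__(self):
        self._tuples = []

    def out(self, tup):
        """Write a tuple into the space."""
        self._tuples.append(tup)

    def rd(self, template):
        """Return the first matching tuple without removing it, or None."""
        for tup in self._tuples:
            if matches(template, tup):
                return tup
        return None

    def in_(self, template):
        """Return and remove the first matching tuple, or None."""
        tup = self.rd(template)
        if tup is not None:
            self._tuples.remove(tup)
        return tup

space = TupleSpace()
space.out(("order", 42, ("item", "book")))
template = ("order", int, ("item", str))
read_result = space.rd(template)
taken = space.in_(template)
after_take = space.rd(template)
```

Swapping `matches` for a different relation is one way to picture the extensibility the abstract describes, where several matching relations can coexist.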

Institutional Repository of the Freie Universität Berlin

An Iteration on the Horizon Simulation Framework to Include .NET and Python Scripting

Author: Yost Morgan
Publication venue: DigitalCommons@CalPoly
Publication date: 01/06/2016

Modeling and Simulation is a crucial element of the aerospace engineering design process because it allows designers to thoroughly test their solution before investing in the resources to create it. The Horizon Simulation Framework (HSF) v3.0 is an aerospace modeling and simulation tool that allows the user to verify system level requirements in the early phases of the design process. A low fidelity model of the system that is created by the user is exhaustively tested within the built-in Day-in-the-Life simulator to provide useful information in the form of failed requirements, system bottlenecks and leverage points, and potential schedules of operations. The model can be stood up quickly with Extensible Markup Language (XML) input files or can be custom-built with Python scripts that interact with the framework at runtime. The goal of the work presented in this thesis is to progress HSF from v2.3 to v3.0 in order to take advantage of current software development technologies. This includes converting the codebase from C++ and Lua scripting to C# and Python scripting. The particulars of the considerations, benefits, and implementation of the new framework are discussed in detail. The simulation data and performance run time of the new framework were compared to that of the old framework. The new framework was found to produce similar data outputs with a faster run time.

DigitalCommons@CalPoly