282 research outputs found

    An integrated architecture for shallow and deep processing

    We present an architecture for integrating shallow and deep NLP components, aimed at the flexible combination of different language technologies for a range of current and future practical applications. In particular, we describe the integration of a high-level HPSG parsing system with different high-performance shallow components, ranging from named entity recognition to chunk parsing and shallow clause recognition. The NLP components enrich a representation of natural language text with layers of new XML meta-information using a single shared data structure, called the text chart. We describe details of the integration methods, and show how information extraction and language checking applications for real-world German text benefit from a deep grammatical analysis.

    Rumble: Data Independence for Large Messy Data Sets

    This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous and nested collections of JSON objects, leveraging the parallel capabilities of Spark so as to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings showing that JSONiq can efficiently run on Spark to query billions of objects, reaching at least the TB range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this commonly found kind of input is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, occasionally even a gain, over Spark SQL for structured data, and a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way as SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does on highly structured tables. Comment: Preprint, 9 pages
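    To make the FLWOR idea concrete, the following is a minimal plain-Python sketch of what a for / where / group by / return query does on heterogeneous, nested JSON. It is not Rumble's implementation and does not use Spark; the sample records, the field names (`user`, `country`, `score`) and the function name are all hypothetical, chosen only to show missing fields and a field whose type varies between records.

```python
import json
from collections import defaultdict

# Hypothetical messy input: one JSON object per line, with missing fields,
# nested objects, and a "score" field that is sometimes a string.
RAW_LINES = [
    '{"user": {"name": "ada", "country": "UK"}, "score": 10}',
    '{"user": {"name": "bob"}, "score": 7}',
    '{"user": {"name": "cyd", "country": "UK"}, "score": "5"}',
    '{"user": {"name": "dee", "country": "CH"}}',
    '{"user": {"name": "eve", "country": "CH"}, "score": 3}',
]


def total_score_by_country(lines):
    """FLWOR-style pipeline: iterate, filter, group, then build results."""
    totals = defaultdict(int)
    for line in lines:                                # "for" clause
        obj = json.loads(line)
        score = obj.get("score")
        if not isinstance(score, int):                # "where": drop missing or non-integer scores
            continue
        country = obj.get("user", {}).get("country")  # absent field becomes None
        totals[country] += score                      # "group by" with a sum aggregate
    # "return": one object per group, ordered by the textual form of the key
    ordered = sorted(totals.items(), key=lambda kv: str(kv[0]))
    return [{"country": c, "total": t} for c, t in ordered]


result = total_score_by_country(RAW_LINES)
```

    The records with a string score or no score are filtered out rather than causing an error, and the record with no country lands in its own group. Tolerating that kind of irregularity inside the query is the behaviour the abstract attributes to JSONiq on nested, heterogeneous data.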

    Static and dynamic semantics of NoSQL languages

    We present a calculus for processing semistructured data that spans differences of application area among several novel query languages, broadly categorized as "NoSQL". This calculus lets users define their own operators, capturing a wider range of data processing capabilities, whilst providing a typing precision so far typical only of primitive hard-coded operators. The type inference algorithm is based on semantic type checking, resulting in type information that is both precise and flexible enough to handle structured and semistructured data. We illustrate the use of this calculus by encoding a large fragment of Jaql, including operations and iterators over JSON, embedded SQL expressions, and co-grouping, and show how the encoding directly yields a typing discipline for Jaql as it is, namely without the addition of any type definition or type annotation in the code.

    Implementation of a XQuery engine for large documents in CanstoreX

    XML is a markup language used for storing documents that contain structured information. Its flexibility helps in storing, processing and querying diverse and complex documents with any structure. While XML could theoretically be used to handle any document, the currently available parsers require large amounts of main memory, which severely restricts the size of XML documents. As a result, some technologies have been developed to break XML documents into smaller chunks and allow the parsers to load only a specific portion of the document when needed. Two major but diametrically opposite approaches for storing an XML document on disk have emerged. The first breaks an XML document into parent-child pairs and stores them in relational storage. The second approach builds a native storage for XML that attempts to directly capture the XML hierarchy. Canonical Storage for XML (CanStoreX) is a native storage technology being developed by our group at Iowa State University that has been tested for pagination of XML documents up to 100 gigabytes in size. CanStoreX requires that every page is a self-contained XML document in its own right. Thus the pages themselves form an XML-like hierarchy. XML can be used to encode a variety of data. Examples are system configuration, metadata, documents such as books, relational data, and object-oriented data. An array of technologies has developed to process XML documents. Our major interest in XML lies in the view that an XML document can be considered a database which can then be queried. There exist several query engines for XML. Kweelt is an excellent early platform that supports Quilt, a preliminary query language that was subsequently extended into XQuery, which has been standardized by the World Wide Web Consortium (W3C) and supersedes Quilt. The original Kweelt uses a DOM parser; therefore it can only handle small documents. The main focus of this thesis is to deploy CanStoreX to query documents that are gigabytes in size. The resulting platform has been extensively tested.


    Rewrite strategies provide an algorithmic rewriting of terms using strategic compositions of rewrite rules. Because rewrites are programmable, errors often arise from incorrect compositions of rewrites or incorrect application of rewrites to a term within a strategic rewriting program. In practical applications of strategic rewriting, testing and debugging become substantially time-intensive for large programs applied to large inputs derived from large term grammars. In essence, determining which rewrite in what position in a term did or did not fire comes down to logging, tracing and/or diff-like comparison of inputs to outputs. In this thesis, we explore type-enabled analysis of strategic rewriting programs to detect errors statically. In particular, we introduce high-precision types to closely approximate the dynamic behavior of rewriting. We also use union types to track sets of types due to the presence of strategic compositions. In this framework of high-precision strategic typing, we develop and implement an expressive type system for a representative strategic rewriting language, TL. The results of this research are sufficiently broad to be adapted to other strategic rewriting languages. In particular, the type-inferencing algorithm does not require explicit type annotations, for minimal impact on an existing language. Based on our experience with the implementation, the type system significantly reduces the time and effort to program correct rewrite strategies while performing the analysis on the order of thousands of source lines of code per second.
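    As a rough illustration of what "strategic compositions of rewrite rules" means, here is a minimal Python sketch of a few common strategy combinators. It is not the TL language and contains no type analysis; the term encoding (tuples whose first element is an operator name), the two example rules and all function names are assumptions made for this sketch.

```python
# Terms are tuples ("op", child, ...) or plain leaves (ints, strings).
# A strategy maps a term to a rewritten term, or to None when it does not apply.

def rule_add_zero(t):
    """Rewrite rule: x + 0 -> x."""
    if isinstance(t, tuple) and len(t) == 3 and t[0] == "add" and t[2] == 0:
        return t[1]
    return None


def rule_mul_one(t):
    """Rewrite rule: x * 1 -> x."""
    if isinstance(t, tuple) and len(t) == 3 and t[0] == "mul" and t[2] == 1:
        return t[1]
    return None


def choice(s1, s2):
    """Apply s1; if it fails, apply s2 instead."""
    def run(t):
        r = s1(t)
        return r if r is not None else s2(t)
    return run


def attempt(s):
    """Apply s; if it fails, leave the term unchanged."""
    def run(t):
        r = s(t)
        return t if r is None else r
    return run


def all_children(s):
    """Apply s to every direct child; fail if any child fails."""
    def run(t):
        if not isinstance(t, tuple):
            return t  # leaves have no children
        kids = [s(c) for c in t[1:]]
        if any(k is None for k in kids):
            return None
        return (t[0], *kids)
    return run


def bottomup(s):
    """Rewrite the children first, then the node itself."""
    def run(t):
        r = all_children(run)(t)
        return None if r is None else s(r)
    return run


simplify = bottomup(attempt(choice(rule_add_zero, rule_mul_one)))
simplified = simplify(("add", ("mul", "x", 1), 0))
```

    Here `simplified` is `"x"`, while `simplify(("add", "x", 1))` returns its input unchanged because neither rule applies. Nothing in the result says which rule failed to apply or at which position, which is the debugging problem the abstract describes.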

    an extensible tuplespace as XML-middleware

    XMLSpaces.NET implements the Linda concept as a middleware for XML documents. It introduces extended matching flexibility on nested tuples and richer data types for fields, including objects and XML documents. It is completely XML-based, since data, tuples and tuplespaces are seen as trees represented as XML documents. XMLSpaces.NET is extensible in that it supports a hierarchy of matching relations on tuples and an open set of matching relations among data, documents and objects. It is currently being implemented on the .NET platform.
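    For readers unfamiliar with Linda, the sketch below shows the core idea in Python: a shared space of tuples that are written with `out` and retrieved by matching against a template, including nested tuples. It is not the XMLSpaces.NET API. The class and function names are hypothetical, `take` stands in for Linda's `in` (a reserved word in Python), and the operations return `None` instead of blocking, unlike classic Linda.

```python
def matches(template, value):
    """Match a template against a value.

    A type in the template (e.g. int) matches any value of that type,
    a tuple matches element by element (so nesting works), and anything
    else must be equal to the value.
    """
    if isinstance(template, type):
        return isinstance(value, template)
    if isinstance(template, tuple):
        return (
            isinstance(value, tuple)
            and len(template) == len(value)
            and all(matches(t, v) for t, v in zip(template, value))
        )
    return template == value


class TupleSpace:
    """Single-process, non-blocking tuple space (an assumption of this sketch)."""

    def __init__(self):
        self._tuples = []

    def out(self, tup):
        """Write a tuple into the space."""
        self._tuples.append(tup)

    def rd(self, template):
        """Return the first matching tuple without removing it, or None."""
        for tup in self._tuples:
            if matches(template, tup):
                return tup
        return None

    def take(self, template):
        """Return and remove the first matching tuple, or None."""
        tup = self.rd(template)
        if tup is not None:
            self._tuples.remove(tup)
        return tup


space = TupleSpace()
space.out(("point", (1, 2)))
space.out(("point", (3, "a")))
found = space.rd(("point", (int, int)))
```

    The single `matches` function is where an extensible design would plug in additional matching relations, for example a relation that compares XML documents structurally rather than by equality.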

    An Iteration on the Horizon Simulation Framework to Include .NET and Python Scripting

    Modeling and simulation is a crucial element of the aerospace engineering design process because it allows designers to thoroughly test their solution before investing in the resources to create it. The Horizon Simulation Framework (HSF) v3.0 is an aerospace modeling and simulation tool that allows the user to verify system-level requirements in the early phases of the design process. A low-fidelity model of the system, created by the user, is exhaustively tested within the built-in Day-in-the-Life simulator to provide useful information in the form of failed requirements, system bottlenecks and leverage points, and potential schedules of operations. The model can be stood up quickly with Extensible Markup Language (XML) input files or can be custom-built with Python scripts that interact with the framework at runtime. The goal of the work presented in this thesis is to progress HSF from v2.3 to v3.0 in order to take advantage of current software development technologies. This includes converting the codebase from C++ and Lua scripting to C♯ and Python scripting. The particulars of the considerations, benefits, and implementation of the new framework are discussed in detail. The simulation data and run-time performance of the new framework were compared to those of the old framework. The new framework was found to produce similar data outputs with a faster run time.