1,099 research outputs found

    Rumble: Data Independence for Large Messy Data Sets

    This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous and nested collections of JSON objects, leveraging the parallel capabilities of Spark so as to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings showing that JSONiq can efficiently run on Spark to query billions of objects into, at least, the TB range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this kind of input, commonly found, is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, occasionally even a gain, over Spark SQL for structured data, and a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way as SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does on highly structured tables. Comment: Preprint, 9 pages

    A modelling method using Movie SimCon/ExSpect


    FLICK: developing and running application-specific network services

    Data centre networks are increasingly programmable, with application-specific network services proliferating, from custom load-balancers to middleboxes providing caching and aggregation. Developers must currently implement these services using traditional low-level APIs, which neither support natural operations on application data nor provide efficient performance isolation. We describe FLICK, a framework for the programming and execution of application-specific network services on multi-core CPUs. Developers write network services in the FLICK language, which offers high-level processing constructs and application-relevant data types. FLICK programs are translated automatically to efficient, parallel task graphs, implemented in C++ on top of a user-space TCP stack. Task graphs have bounded resource usage at runtime, which means that the graphs of multiple services can execute concurrently without interference using cooperative scheduling. We evaluate FLICK with several services (an HTTP load-balancer, a Memcached router and a Hadoop data aggregator), showing that it achieves good performance while reducing development effort.

    Curriculum Guidelines for Undergraduate Programs in Data Science

    The Park City Math Institute (PCMI) 2016 Summer Undergraduate Faculty Program met for the purpose of composing guidelines for undergraduate programs in Data Science. The group consisted of 25 undergraduate faculty from a variety of institutions in the U.S., primarily from the disciplines of mathematics, statistics and computer science. These guidelines are meant to provide some structure for institutions planning for or revising a major in Data Science.

    CAMILA formal software engineering supported by functional programming

    This paper describes two experiences in teaching a formal approach to software engineering at undergraduate level, supported by Camila, a functional-programming-based tool. Carried out in different institutions, each of them addresses a particular topic in the area: requirement analysis and generic systems design in the first case, specification and implementation development in the second. Camila, the common framework to both experiences, animates a set-based language extended with a mild use of category theory, which can be reasoned upon for program calculation and classification purposes. The project affiliates itself to, but is not restricted to, the research in exploring Functional Programming as a rapid prototyping environment for formal software models. Its kernel is fully connectable to external applications and equipped with a component repository and distribution facilities. The paper explains how Camila is being used in educational practice as a tool to think with, providing a kind of cross-fertilization between students' understanding of different parts of the curriculum. Furthermore, it helps in developing a number of engineering skills, namely the ability to analyze and classify information problems and models and to resort to the combined use of different programming frameworks in approaching them. Track: Conferencia latinoamericana de programación funcional. Red de Universidades con Carreras en Informática (RedUNCI)