
    Rumble: Data Independence for Large Messy Data Sets

    This paper introduces Rumble, an engine that executes JSONiq queries on large, heterogeneous and nested collections of JSON objects, leveraging the parallel capabilities of Spark so as to provide a high degree of data independence. The design is based on two key insights: (i) how to map JSONiq expressions to Spark transformations on RDDs and (ii) how to map JSONiq FLWOR clauses to Spark SQL on DataFrames. We have developed a working implementation of these mappings showing that JSONiq can efficiently run on Spark to query billions of objects into, at least, the TB range. The JSONiq code is concise in comparison to Spark's host languages while seamlessly supporting the nested, heterogeneous data sets that Spark SQL does not. The ability to process this commonly found kind of input is paramount for data cleaning and curation. The experimental analysis indicates that there is no excessive performance loss, occasionally even a gain, over Spark SQL for structured data, and a performance gain over PySpark. This demonstrates that a language such as JSONiq is a simple and viable approach to large-scale querying of denormalized, heterogeneous, arborescent data sets, in the same way as SQL can be leveraged for structured data sets. The results also illustrate that Codd's concept of data independence makes as much sense for heterogeneous, nested data sets as it does on highly structured tables. Comment: Preprint, 9 pages
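    The first insight in the abstract above, mapping query expressions to filter and map style transformations, can be illustrated with a small sketch. This is not Rumble's implementation: plain Python generators stand in for Spark RDD transformations, and the helper `flwor`, the sample `people` data and the JSONiq query in the comment are all assumptions made up for illustration.

```python
def flwor(collection, where, ret):
    """Lazily evaluate a for/where/return pipeline over an iterable of JSON objects.

    `where` plays the role of a filter transformation and `ret` the role of a
    map transformation; a generator stands in for a distributed collection.
    """
    return (ret(obj) for obj in collection if where(obj))


# Hypothetical heterogeneous, nested input: fields may be missing or nested.
people = [
    {"name": "Ada", "age": 36},
    {"name": "Bob"},                                      # no "age" field
    {"name": {"first": "Cy", "last": "Lee"}, "age": 41},  # nested "name"
    {"name": "Dee", "age": 12},
]


def age_at_least_30(obj):
    # A missing "age" simply fails the predicate instead of raising an error.
    age = obj.get("age")
    return isinstance(age, (int, float)) and age >= 30


# Loosely corresponds to a query such as:
#   for $p in $people where $p.age ge 30 return $p.name
names = list(flwor(people, age_at_least_30, lambda obj: obj["name"]))
```

    The point of the sketch is only that objects with missing or differently shaped fields pass through the same pipeline without a fixed schema.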

    Real-Time Data Processing With Lambda Architecture

    Data has evolved immensely in recent years, in type, volume and velocity. There are several frameworks for handling big data applications. The project focuses on the Lambda Architecture (LA) proposed by Marz and its application to real-time data processing. The architecture is a solution that unites the benefits of the batch and stream processing techniques. Data can be historically processed with high precision and involved algorithms without loss of short-term information, alerts and insights. The Lambda Architecture can serve a wide range of use cases and workloads while withstanding hardware failures and human mistakes. The layered architecture enhances loose coupling and flexibility in the system. This is a huge benefit that allows understanding the trade-offs and application of various tools and technologies across the layers. There has been an advancement in the approach of building the LA due to improvements in the underlying tools. The project demonstrates a simplified architecture for the LA that is maintainable
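    The combination of batch and stream processing described above can be sketched as a toy, single-process example. It is a minimal illustration under stated assumptions, not the project's architecture: the class `LambdaCounter` and its methods are hypothetical names, and a real deployment would use separate distributed systems for each layer.

```python
from collections import Counter


class LambdaCounter:
    """Toy Lambda Architecture that counts events per key."""

    def __init__(self):
        self.master_log = []         # append-only master dataset
        self.batch_view = Counter()  # batch layer output, recomputed from scratch
        self.speed_view = Counter()  # speed layer: events since the last batch run

    def ingest(self, key):
        # New events go to the master dataset and to the speed layer.
        self.master_log.append(key)
        self.speed_view[key] += 1

    def run_batch(self):
        # The batch layer recomputes its view from the full master dataset,
        # after which the incremental speed view can be discarded.
        self.batch_view = Counter(self.master_log)
        self.speed_view.clear()

    def query(self, key):
        # The serving layer merges the batch view with the real-time view.
        return self.batch_view[key] + self.speed_view[key]


la = LambdaCounter()
for event in ["click", "view", "click"]:
    la.ingest(event)
before_batch = la.query("click")  # answered from the speed view only
la.run_batch()
la.ingest("click")
after_batch = la.query("click")   # batch view plus one recent event
```

    Because the batch view is always rebuilt from the master dataset, a bug in the speed layer only affects results until the next batch run, which is one way to read the abstract's claim about tolerating human mistakes.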

    Semantics, Modelling, and the Problem of Representation of Meaning -- a Brief Survey of Recent Literature

    Over the past 50 years many have debated what representation should be used to capture the meaning of natural language utterances. Recently, new needs for such representations have been raised in research. Here I survey some of the interesting representations suggested to answer these new needs. Comment: 15 pages, no figures

    PlinyCompute: A Platform for High-Performance, Distributed, Data-Intensive Tool Development

    This paper describes PlinyCompute, a system for development of high-performance, data-intensive, distributed computing tools and libraries. In the large, PlinyCompute presents the programmer with a very high-level, declarative interface, relying on automatic, relational-database style optimization to figure out how to stage distributed computations. However, in the small, PlinyCompute presents the capable systems programmer with a persistent object data model and API (the "PC object model") and associated memory management system that has been designed from the ground up for high-performance, distributed, data-intensive computing. This contrasts with most other Big Data systems, which are constructed on top of the Java Virtual Machine (JVM), and hence must at least partially cede performance-critical concerns such as memory management (including layout and de/allocation) and virtual method/function dispatch to the JVM. This hybrid approach (declarative in the large, trusting the programmer's ability to utilize the PC object model efficiently in the small) results in a system that is ideal for the development of reusable, data-intensive tools and libraries. Through extensive benchmarking, we show that implementing complex object manipulation and non-trivial, library-style computations on top of PlinyCompute can result in a speedup of 2x to more than 50x compared to equivalent implementations on Spark. Comment: 48 pages, including references and Appendix

    The SDSS Galaxy Angular Two-Point Correlation Function

    We present the galaxy two-point angular correlation function for galaxies selected from the seventh data release of the Sloan Digital Sky Survey. The galaxy sample was selected with $r$-band apparent magnitudes between 17 and 21; and we measure the correlation function for the full sample as well as for the four magnitude ranges: 17-18, 18-19, 19-20, and 20-21. We update the flag criteria to select a clean galaxy catalog and detail specific tests that we perform to characterize systematic effects, including the effects of seeing, Galactic extinction, and the overall survey uniformity. Notably, we find that optimally we can use observed regions with seeing < 1.5 arcsec, and $r$-band extinction < 0.13 magnitudes, smaller than previously published results. Furthermore, we confirm that the uniformity of the SDSS photometry is minimally affected by the stripe geometry. We find that, overall, the two-point angular correlation function can be described by a power law, $\omega(\theta) = A_\omega \theta^{(1-\gamma)}$ with $\gamma \simeq 1.72$, over the range 0.005° to 10°. We also find similar relationships for the four magnitude subsamples, but the amplitude within the same angular interval for the four subsamples is found to decrease with fainter magnitudes, in agreement with previous results. We find that the systematic signals are well below the galaxy angular correlation function for angles less than approximately 5°, which limits the modeling of galaxy angular correlations on larger scales. Finally, we present our custom, highly parallelized two-point correlation code that we used in this analysis. Comment: 22 pages, 17 figures, accepted by MNRAS
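    The power-law form quoted above, $\omega(\theta) = A_\omega \theta^{(1-\gamma)}$, is a straight line in log-log space with slope $1-\gamma$, so both parameters can be recovered by a linear least-squares fit. The sketch below shows that relationship only; it is not the authors' correlation code, the function names are hypothetical, and the amplitude 0.01 is an arbitrary illustrative value rather than a result from the paper.

```python
import math


def omega(theta_deg, amplitude, gamma=1.72):
    """Power-law angular correlation: A * theta^(1 - gamma)."""
    return amplitude * theta_deg ** (1.0 - gamma)


def fit_power_law(thetas, omegas):
    """Least-squares line in log-log space; returns (amplitude, gamma)."""
    xs = [math.log(t) for t in thetas]
    ys = [math.log(w) for w in omegas]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    # log(omega) = log(A) + (1 - gamma) * log(theta)
    return math.exp(intercept), 1.0 - slope


# Synthetic, noise-free points across the angular range quoted in the abstract.
thetas = [0.005, 0.05, 0.5, 5.0]
omegas = [omega(t, amplitude=0.01) for t in thetas]  # 0.01 is hypothetical
fitted_amplitude, fitted_gamma = fit_power_law(thetas, omegas)
```

    With real measurements the points would carry correlated errors, so an unweighted fit like this one is only a first look.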

    Code Generation for Efficient Query Processing in Managed Runtimes

    In this paper we examine opportunities arising from the convergence of two trends in data management: in-memory database systems (IMDBs), which have received renewed attention following the availability of affordable, very large main memory systems; and language-integrated query, which transparently integrates database queries with programming languages (thus addressing the famous 'impedance mismatch' problem). Language-integrated query not only gives application developers a more convenient way to query external data sources like IMDBs, but also to use the same querying language to query an application's in-memory collections. The latter offers further transparency to developers as the query language and all data is represented in the data model of the host programming language. However, compared to IMDBs, this additional freedom comes at a higher cost for query evaluation. Our vision is to improve in-memory query processing of application objects by introducing database technologies to managed runtimes. We focus on querying and we leverage query compilation to improve query processing on application objects. We explore different query compilation strategies and study how they improve the performance of query processing over application data. We take C# as the host programming language as it supports language-integrated query through the LINQ framework. Our techniques deliver significant performance improvements over the default LINQ implementation. Our work makes important first steps towards a future where data processing applications will commonly run on machines that can store their entire datasets in-memory, and will be written in a single programming language employing language-integrated query and IMDB-inspired runtimes to provide transparent and highly efficient querying.
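    The general idea of query compilation mentioned above, generating specialised code for a query once and then running it over in-memory objects, can be sketched briefly. The paper works with C# and LINQ; this sketch uses Python instead, and `compile_filter` and the `Order` class are hypothetical names invented for illustration, not the paper's techniques.

```python
from dataclasses import dataclass

ALLOWED_OPS = {"<", "<=", "==", "!=", ">=", ">"}


def compile_filter(field, op, value):
    """Generate Python source for a specialised filter and compile it once."""
    if op not in ALLOWED_OPS:
        raise ValueError(f"unsupported operator: {op}")
    if not field.isidentifier():
        raise ValueError(f"invalid field name: {field}")
    # The field access and comparison are baked into the generated code,
    # so no per-item interpretation of the query description is needed.
    source = (
        "def _query(items):\n"
        f"    return [x for x in items if x.{field} {op} {value!r}]\n"
    )
    namespace = {}
    exec(compile(source, "<generated-query>", "exec"), namespace)
    return namespace["_query"]


@dataclass
class Order:
    order_id: int
    total: float


orders = [Order(1, 40.0), Order(2, 150.0), Order(3, 99.5), Order(4, 300.0)]
large_orders = compile_filter("total", ">", 100)
large_ids = [o.order_id for o in large_orders(orders)]
```

    The compiled function can be reused across many collections, which is where the one-off cost of code generation would be amortised.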