
    Low-latency, query-driven analytics over voluminous multidimensional, spatiotemporal datasets

    2017 Summer. Includes bibliographical references. Ubiquitous data collection from sources such as remote sensing equipment, networked observational devices, location-based services, and sales tracking has led to the accumulation of voluminous datasets; IDC projects that by 2020 we will generate 40 zettabytes of data per year, while Gartner and ABI estimate 20-35 billion new devices will be connected to the Internet in the same time frame. The storage and processing requirements of these datasets far exceed the capabilities of modern computing hardware, which has led to the development of distributed storage frameworks that can scale out by assimilating more computing resources as necessary. While challenging in its own right, storing and managing voluminous datasets is only the precursor to a broader field of study: extracting knowledge, insights, and relationships from the underlying datasets. The basic building block of this knowledge discovery process is analytic queries, encompassing both query instrumentation and evaluation. This dissertation is centered around query-driven exploratory and predictive analytics over voluminous, multidimensional datasets. Both of these types of analysis represent a higher-level abstraction over classical query models; rather than indexing every discrete value for subsequent retrieval, our framework autonomously learns the relationships and interactions between dimensions in the dataset (including time series and geospatial aspects), and makes the information readily available to users. This functionality includes statistical synopses, correlation analysis, hypothesis testing, probabilistic structures, and predictive models that not only enable the discovery of nuanced relationships between dimensions, but also allow future events and trends to be predicted. This requires specialized data structures and partitioning algorithms, along with adaptive reductions in the search space and management of the inherent trade-off between timeliness and accuracy. The algorithms presented in this dissertation were evaluated empirically on real-world geospatial time-series datasets in a production environment, and are broadly applicable across other storage frameworks.

    GeoLens: enabling interactive visual analytics over large-scale, multidimensional geospatial datasets

    2015 Spring. Includes bibliographical references. With the rapid increase of scientific data volumes, interactive tools that enable effective visual representation for scientists are needed. This is critical when scientists are manipulating voluminous datasets and especially when they need to explore datasets interactively to develop their hypotheses. In this paper, we present an interactive visual analytics framework, GeoLens. GeoLens provides fast and expressive interactions with voluminous geospatial datasets. We provide an expressive visual query evaluation scheme to support advanced interactive visual analytics techniques, such as brushing and linking. To achieve this, we designed and developed a geohash-based image tile generation algorithm that automatically adjusts the range of data to access based on the minimum acceptable size of the image tile. In addition, we have also designed an autonomous histogram generation algorithm that generates histograms of user-defined data subsets that do not have pre-computed data properties. Using our approach, applications can generate histograms of datasets containing millions of data points with sub-second latency. The work builds on our visual query coordinating scheme that evaluates geospatial queries and orchestrates data aggregation in a distributed storage environment while preserving data locality and minimizing data movement. This paper includes empirical benchmarks of our framework encompassing a billion-file dataset published by the National Climatic Data Center.
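
    The abstract above relies on geohashes without describing them. As background only, the following is a minimal sketch of the standard public geohash encoding, not the GeoLens tile generation algorithm itself; the function name `geohash_encode` and its parameters are hypothetical and chosen for illustration.

```python
# Standard geohash base-32 alphabet (omits the letters a, i, l, o).
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string.

    Illustrative sketch of the standard encoding: bits alternate between
    longitude and latitude (longitude first), each bit halving the
    remaining interval, and every 5 bits become one base-32 character.
    """
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0        # bits accumulated for the current character
    bit_count = 0   # number of bits accumulated so far (0..4)
    use_lon = True  # the first bit refines longitude

    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

    Because each additional character only refines the cell named by the preceding characters, truncating a geohash yields the enclosing, coarser cell. That prefix property is what makes geohashes convenient for choosing a spatial resolution, although how GeoLens maps cells to image tiles is not specified in the abstract.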

    On the evaluation of exact-match and range queries over multidimensional data in distributed hash tables

    2012 Fall. Includes bibliographical references. The quantity and precision of geospatial and time series observational data being collected have increased alongside the steady expansion of processing and storage capabilities in modern computing hardware. The storage requirements for this information are vastly greater than the capabilities of a single computer, and are primarily met in a distributed manner. However, distributed solutions often impose strict constraints on retrieval semantics. In this thesis, we investigate the factors that influence storage and retrieval operations on large datasets in a cloud setting, and propose a lightweight data partitioning and indexing scheme to facilitate these operations. Our solution provides expressive retrieval support through range-based and exact-match queries and can be applied over massive quantities of multidimensional data. We provide benchmarks to illustrate the relative advantage of using our solution over a general-purpose cloud storage engine in a distributed network of heterogeneous computing resources.

    bdbms -- A Database Management System for Biological Data

    Biologists are increasingly using databases for storing and managing their data. Biological databases typically consist of a mixture of raw data, metadata, sequences, annotations, and related data obtained from various sources. Current database technology lacks several functionalities that are needed by biological databases. In this paper, we introduce bdbms, an extensible prototype database management system for supporting biological data. bdbms extends the functionalities of current DBMSs to include: (1) Annotation and provenance management including storage, indexing, manipulation, and querying of annotation and provenance as first class objects in bdbms, (2) Local dependency tracking to track the dependencies and derivations among data items, (3) Update authorization to support data curation via content-based authorization, in contrast to identity-based authorization, and (4) New access methods and their supporting operators that support pattern matching on various types of compressed biological data types. This paper presents the design of bdbms along with the techniques proposed to support these functionalities including an extension to SQL. We also outline some open issues in building bdbms. Comment: This article is published under a Creative Commons License Agreement (http://creativecommons.org/licenses/by/2.5/). You may copy, distribute, display, and perform the work, make derivative works and make commercial use of the work, but you must attribute the work to the author and CIDR 2007. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), January 7-10, 2007, Asilomar, California, US

    Dataplane Specialization for High-performance OpenFlow Software Switching

    OpenFlow is an amazingly expressive dataplane programming language, but this expressiveness comes at a severe performance price as switches must do excessive packet classification in the fast path. The prevalent OpenFlow software switch architecture is therefore built on flow caching, but this imposes intricate limitations on the workloads that can be supported efficiently and may even open the door to malicious cache overflow attacks. In this paper we argue that instead of enforcing the same universal flow cache semantics on all OpenFlow applications and optimizing for the common case, a switch should rather automatically specialize its dataplane piecemeal with respect to the configured workload. We introduce ESWITCH, a novel switch architecture that uses on-the-fly template-based code generation to compile any OpenFlow pipeline into efficient machine code, which can then be readily used as the fast path. We present a proof-of-concept prototype and we demonstrate on illustrative use cases that ESWITCH yields a simpler architecture, superior packet processing speed, improved latency and CPU scalability, and predictable performance. Our prototype can easily scale beyond 100 Gbps on a single Intel blade even with complex OpenFlow pipelines.