Search CORE

66 research outputs found

Reducing Query Latency for Information Retrieval

Author: Kamble Swapnil Satish
Publication venue: SJSU ScholarWorks
Publication date: 22/05/2017
Field of study

As the world is moving towards Big Data, NoSQL (Not only SQL) databases are gaining much more popularity. Among the other advantages of NoSQL databases, one of their key advantage is that they facilitate faster retrieval for huge volumes of data, as compared to traditional relational databases. This project deals with one such popular NoSQL database, Apache HBase. It performs quite efficiently in cases of retrieving information using the rowkey (similar to a primary key in a SQL database). But, in cases where one needs to get information based on non-rowkey columns, the response latency is higher than what we observe in the previous case. This project discusses an approach which aims towards decreasing this latency. It also compares the performance of the existing approach and the proposed approach for various scenarios

SJSU ScholarWorks

Physical Design for Non-relational Data Systems

Author: Mior Michael
Publication venue: 'University of Waterloo'
Publication date: 03/08/2018
Field of study

Decades of research have gone into the optimization of physical designs, query execution, and related tools for relational databases. These techniques and tools make it possible for non-expert users to make effective use of relational database management systems. However, the drive for flexible data models and increased scalability has spawned a new generation of data management systems which largely eschew the relational model. These include systems such as NoSQL databases and distributed analytics frameworks such as Apache Spark which make use of a diverse set of data models. Optimization techniques and tools developed for relational data do not directly apply in this setting. This leaves developers making use of these systems with the need to become intimately familiar with system details to obtain good performance. We present techniques and tools for physical design for non-relational data systems. We explore two settings: NoSQL database systems and distributed analytics frameworks. While NoSQL databases often avoid explicit schema definitions, many choices on how to structure data remain. These choices can have a significant impact on application performance. The data structuring process normally requires expert knowledge of the underlying database. We present the NoSQL Schema Evaluator (NoSE). Given a target workload, NoSE provides an optimized physical design for NoSQL database applications which compares favourably to schemas designed by expert users. To enable existing applications to benefit from conceptual modeling, we also present an algorithm to recover a logical model from a denormalized database instance. Our second setting is distributed analytics frameworks such as Apache Spark. As is the case for NoSQL databases, expert knowledge of Spark is often required to construct efficient data pipelines. In NoSQL systems, a key challenge is how to structure stored data, while in Spark, a key challenge is how to cache intermediate results. We examine a particularly common scenario in Spark which involves performing iterative analysis on an input dataset. We show that jobs written in an intuitive manner using existing Spark APIs can have poor performance. We propose ReSpark, which automates caching decisions for iterative Spark analyses. Like NoSE, ReSpark makes it possible for non-expert users to obtain good performance from a non-relational data system

University of Waterloo's Institutional Repository

Recommended from our members

A Paradigm for Scalable, Transactional, and Efficient Spatial Indexes

Author: Gao Ning
Publication venue: University of Colorado Boulder
Publication date: 01/01/2018
Field of study

With large volumes of geo-tagged data collected in various applications, spatial query pro- cessing becomes essential. Query engines depend on efficient indexes to expedite processing. There are three main challenges: scaling out to accommodate large volumes of spatial data, support- ing transactional primitives for strong consistency guarantees, and adapting to highly dynamic workloads. This thesis proposes a paradigm for scalable, transactional, and efficient spatial indexes to significantly reduce development efforts in designing and comparing multiple spatial indexes.This thesis first introduces a distributed and transactional key value store called DTranx to persist the spatial indexes. DTranx follows the SEDA architecture to exploit high concurrency in multi-core environments and it adopts a hybrid of optimistic concurrency control and two-phase commit protocols to narrow down the critical sections of distributed locking during transaction com- mits. Moreover, DTranx integrates a persistent memory based write-ahead log to reduce durability overhead and combines a garbage collection mechanism without affecting normal transactions. To maintain high throughput for search workloads when databases are constantly updated, snapshot transactions are introduced.Then, a paradigm is presented with a set of intuitive APIs and a Mempool runtime to re- duce development efforts. Mempool transparently synchronizes local states of data structures with DTranx and it handles two critical tasks: address translation and transparent server synchroniza- tion, of which the latter includes transaction construction and data synchronization. Furthermore, a dynamic partitioning strategy is integrated into DTranx to generate partitioning and replication plans that reduce inter-server communications and balance resource usage.Lastly, single-threaded data structures BTree and RTree are converted into distributed versions within two weeks. The BTree and RTree achieve 253.07 kops/sec and 77.83 kops/sec through- put respectively for pure search operations in a 25-server cluster

CU Scholar Institutional Repository

Methods to Improve Applicability and Efficiency of Distributed Data-Centric Compute Frameworks

Author: Kambatla Karthik Shashank
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2016
Field of study

The success of modern applications depends on the insights they collect from their data repositories. Data repositories for such applications currently exceed exabytes and are rapidly increasing in size, as they collect data from varied sources - web applications, mobile phones, sensors and other connected devices. Distributed storage and data-centric compute frameworks have been invented to store and analyze these large datasets. This dissertation focuses on extending the applicability and improving the efficiency of distributed data-centric compute frameworks

Purdue E-Pubs

Artificial Life as a Vehicle for Anomalies Detection on Industrial Control System: The Behaviour of Bird Swarms and How it can be Applied in ICS

Author: Okeke Michael
Publication venue
Publication date: 01/03/2018
Field of study

University of South Wales Research Explorer

WRITE-INTENSIVE DATA MANAGEMENT IN LOG-STRUCTURED STORAGE

Author: WANG SHENG
Publication venue
Publication date: 19/04/2016
Field of study

Ph.DDOCTOR OF PHILOSOPH

ScholarBank@NUS

The Family of MapReduce and Large Scale Data Processing Systems

Author: Anna Liu
Ayman G. Fayoumi
King Abdulaziz
See Profile
Sherif Sakr
Sherif Sakr
South Wales
South Wales
Publication venue
Publication date: 12/02/2013
Field of study

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program such as issues on data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several followup works after its introduction. This article provides a comprehensive survey for a family of approaches and mechanisms of large scale data processing mechanisms that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of introduced systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author

arXiv.org e-Print Archive

CiteSeerX

Contributions to Edge Computing

Author: Bumgardner Vernon K.
Publication venue: UKnowledge
Publication date: 01/01/2017
Field of study

Efforts related to Internet of Things (IoT), Cyber-Physical Systems (CPS), Machine to Machine (M2M) technologies, Industrial Internet, and Smart Cities aim to improve society through the coordination of distributed devices and analysis of resulting data. By the year 2020 there will be an estimated 50 billion network connected devices globally and 43 trillion gigabytes of electronic data. Current practices of moving data directly from end-devices to remote and potentially distant cloud computing services will not be sufficient to manage future device and data growth. Edge Computing is the migration of computational functionality to sources of data generation. The importance of edge computing increases with the size and complexity of devices and resulting data. In addition, the coordination of global edge-to-edge communications, shared resources, high-level application scheduling, monitoring, measurement, and Quality of Service (QoS) enforcement will be critical to address the rapid growth of connected devices and associated data. We present a new distributed agent-based framework designed to address the challenges of edge computing. This actor-model framework implementation is designed to manage large numbers of geographically distributed services, comprised from heterogeneous resources and communication protocols, in support of low-latency real-time streaming applications. As part of this framework, an application description language was developed and implemented. Using the application description language a number of high-order management modules were implemented including solutions for resource and workload comparison, performance observation, scheduling, and provisioning. A number of hypothetical and real-world use cases are described to support the framework implementation

University of Kentucky