
    Benchmarking Scalability of NoSQL Databases for Geospatial Queries

    NoSQL databases provide an edge when it comes to dealing with big unstructured data. The flexibility, agility, and scalability offered by NoSQL databases become increasingly essential when dealing with geospatial data. The proliferation of geospatial applications has tremendously increased the variety, velocity, and volume of data that data stores must manage. These characteristics of big spatial data surpass the capability and anticipated use cases of relational databases. Because we can choose from an extensive collection of NoSQL databases these days, it becomes vital for organizations to make an informed decision. NoSQL database benchmarks provide system architects, who shoulder a considerable burden in selecting the right technology for their data stores, with a vital starting point and source of information. The major utility of these benchmarks is reproducing experiments on similar experimental data, which can verify and optimize the process of selecting an optimal tool for data management needs in the early phases of development. The goal of this research is to develop a benchmark that can compare the performance of NoSQL databases for querying complex geospatial data. We have analyzed the throughput, latency, and runtime of MongoDB and Couchbase to identify the correct fit for our use case. In doing so, we have also demonstrated a systematic process that can be followed to make an optimal choice of data store. This benchmark can be extended easily to any NoSQL database that supports geospatial querying.
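
    As an illustration of the kind of geospatial query such a benchmark exercises, here is a minimal sketch in Python using pymongo against MongoDB; the database/collection names, coordinates, and distance are hypothetical, and Couchbase would need its own equivalent.

        from pymongo import MongoClient, GEOSPHERE

        client = MongoClient("mongodb://localhost:27017")  # assumed local instance
        places = client["benchmark_db"]["places"]          # hypothetical collection

        # A 2dsphere index is required for spherical geometry queries.
        places.create_index([("location", GEOSPHERE)])

        places.insert_one({
            "name": "sample point",
            "location": {"type": "Point", "coordinates": [-86.52, 39.17]},  # [lon, lat]
        })

        # Find documents within 1 km of a query point, nearest first.
        nearby = places.find({
            "location": {
                "$near": {
                    "$geometry": {"type": "Point", "coordinates": [-86.53, 39.16]},
                    "$maxDistance": 1000,  # meters
                }
            }
        })
        for doc in nearby:
            print(doc["name"])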

    Survey of NoSQL Database Engines for Big Data

    Cloud computing is a paradigm shift that provides computing over the Internet. With the growing reach of the Internet into people's lives, a large volume of data is generated every day from sources such as cellphones, electronic gadgets, e-commerce transactions, social media, and sensors. The size of the generated data is so large that it is referred to as Big Data. Companies harvesting business opportunities in the digital world need to invest budget and time in scaling their IT infrastructure to expand their businesses. Traditional relational databases have limitations in scaling to large Internet-scale distributed systems. To store rapidly expanding, high-volume Big Data efficiently, NoSQL data stores have been developed as an alternative to relational databases. The purpose of this thesis is to provide a holistic overview of different NoSQL data stores. We cover the fundamental principles supporting NoSQL data store development. Many NoSQL data stores have specific and exclusive features and properties, and they differ in their architecture, data model, storage system, and fault tolerance abilities. This thesis describes different aspects of several NoSQL data stores in detail. The thesis also covers experiments to evaluate and compare the performance of different NoSQL data stores on a distributed cluster. In the scope of this thesis, HBase, Cassandra, MongoDB, and Riak are the four NoSQL data stores selected for the benchmarking experiments.
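
    Benchmarking experiments of this kind typically follow a YCSB-like pattern: load a keyspace, then run a timed mix of reads and updates. The sketch below is a generic, store-agnostic illustration of that loop (not the thesis's harness); the read/update callables and workload parameters are assumptions to be filled in with driver-specific code for HBase, Cassandra, MongoDB, or Riak.

        import random
        import time

        def run_workload(read_op, update_op, keys, ops=10_000, read_fraction=0.95):
            """YCSB-style read/update mix; returns (throughput, mean latency)."""
            latencies = []
            start = time.perf_counter()
            for _ in range(ops):
                key = random.choice(keys)  # uniform choice; YCSB also offers zipfian
                t0 = time.perf_counter()
                if random.random() < read_fraction:
                    read_op(key)
                else:
                    update_op(key, {"field0": "x" * 100})
                latencies.append(time.perf_counter() - t0)
            elapsed = time.perf_counter() - start
            return ops / elapsed, sum(latencies) / len(latencies)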

    Performance evaluation of various deployment scenarios of the 3-replicated Cassandra NoSQL cluster on AWS

    A concept of distributed replicated NoSQL data storages such as Cassandra, HBase, and MongoDB has been proposed to effectively manage Big Data sets whose volume, velocity, and variability are difficult to handle with traditional Relational Database Management Systems. Trade-offs between consistency, availability, partition tolerance, and latency are intrinsic to such systems. Although relations between these properties have been identified in qualitative terms by the well-known CAP and PACELC theorems, it is still necessary to quantify how different consistency settings, deployment patterns, and other properties affect system performance. This experience report analyses the performance of a Cassandra NoSQL database cluster and studies the trade-off between data consistency guarantees and performance in distributed data storages. The primary focus is on investigating the quantitative interplay between Cassandra's response time, throughput, and consistency settings across different single- and multi-region deployment scenarios. The study uses the YCSB benchmarking framework and reports the results of read and write performance tests of a three-replicated Cassandra cluster deployed on Amazon AWS. We also put forward a notation that can be used to formally describe the distributed deployment of a Cassandra cluster and its nodes relative to each other and to a client application. We present quantitative results showing how different consistency settings and deployment patterns affect Cassandra's performance under different workloads. In particular, our experiments show that strong consistency costs up to 22 % of performance for a centralized Cassandra cluster deployment and can increase read/write request latency by up to 600 % when Cassandra replicas and their clients are globally distributed across different AWS Regions.
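
    For reference, Cassandra consistency is chosen per request, which is what makes this kind of tuning possible. A minimal sketch with the Python cassandra-driver is shown below; the contact points and keyspace are hypothetical, and the table follows the YCSB-style usertable schema.

        from cassandra import ConsistencyLevel
        from cassandra.cluster import Cluster
        from cassandra.query import SimpleStatement

        cluster = Cluster(["10.0.0.1", "10.0.0.2", "10.0.0.3"])  # hypothetical nodes
        session = cluster.connect("benchmark_ks")                # hypothetical keyspace

        # Weak consistency: the coordinator waits for a single replica.
        fast_read = SimpleStatement(
            "SELECT field0 FROM usertable WHERE y_id = %s",
            consistency_level=ConsistencyLevel.ONE,
        )

        # Strong consistency: a majority (2 of 3) replicas must respond.
        safe_read = SimpleStatement(
            "SELECT field0 FROM usertable WHERE y_id = %s",
            consistency_level=ConsistencyLevel.QUORUM,
        )

        row = session.execute(safe_read, ("user1",)).one()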

    Exploring Timeout as a Performance and Availability Factor of Distributed Replicated Database Systems

    A concept of distributed replicated data storages like Cassandra, HBase, and MongoDB has been proposed to effectively manage Big Data sets whose volume, velocity, and variability are difficult to handle with traditional Relational Database Management Systems. Trade-offs between consistency, availability, partition tolerance, and latency are intrinsic to such systems. Although relations between these properties have been identified in qualitative terms by the well-known CAP theorem, it is still necessary to quantify how different consistency and timeout settings affect system latency. The paper reports the results of Cassandra's performance evaluation using the YCSB benchmark and experimentally demonstrates how read latency depends on the consistency settings and the current database workload. These results clearly show that stronger data consistency increases system latency, in line with the qualitative implication of the CAP theorem. Moreover, Cassandra's latency and its variation depend considerably on the system workload. The distributed nature of such a system does not always guarantee that the client receives a response from the database within a finite time. When this happens, it causes so-called timing failures: the response is received too late or not at all. We also consider the role of the application timeout, which is a fundamental part of all distributed fault tolerance mechanisms working over the Internet and serves as the main error detection mechanism here. The paper examines the role of the application timeout as the main determinant in the interplay between system availability and responsiveness. It is quantitatively shown how different timeout settings affect system availability and the average servicing and waiting time. Although many modern distributed systems, including Cassandra, use static timeouts, we show that the most promising approach is to set timeouts dynamically at run time to balance performance and availability and to improve the efficiency of fault-tolerance mechanisms.
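
    A dynamic timeout of the kind the paper advocates can be approximated client-side by tracking recent latencies and deriving the timeout from a high percentile. The sketch below is a generic illustration, not the paper's algorithm; the window size, floor, and safety factor are assumptions.

        import statistics
        from collections import deque

        class AdaptiveTimeout:
            """Derive a request timeout from recently observed latencies."""

            def __init__(self, window=200, floor_s=0.05, factor=3.0):
                self.samples = deque(maxlen=window)  # sliding window of latencies (s)
                self.floor_s = floor_s               # never drop below this value
                self.factor = factor                 # safety margin over typical latency

            def record(self, latency_s):
                self.samples.append(latency_s)

            def current(self):
                if len(self.samples) < 10:
                    return 1.0  # conservative default until enough data is seen
                # quantiles(n=100) yields the 1st..99th percentiles; take p95.
                p95 = statistics.quantiles(self.samples, n=100)[94]
                return max(self.floor_s, self.factor * p95)

    A client would call record() after each successful request and pass current() as the per-request timeout (e.g., the timeout argument of session.execute() in the Python Cassandra driver), so slow periods widen the timeout and fast periods tighten error detection.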

    Interplaying Cassandra NoSQL Consistency and Performance: A Benchmarking Approach

    This experience report analyses the performance of the Cassandra NoSQL database and studies the fundamental trade-off between data consistency and delays in distributed data storages. The primary focus is on investigating the interplay between Cassandra's performance (response time) and its consistency settings. The paper reports the results of read and write performance benchmarking for a replicated Cassandra cluster deployed in the Amazon EC2 cloud. We present quantitative results showing how different consistency settings affect Cassandra's performance under different workloads. One of our main findings is that it is possible to minimize Cassandra's delays and still guarantee strong data consistency by optimally coordinating the consistency settings for both read and write requests. Our experiments show that (i) strong consistency costs up to 25% of performance and (ii) the best setting for strong consistency depends on the ratio of read and write operations. Finally, we generalize our experience by proposing a benchmarking-based methodology for run-time optimization of consistency settings to achieve maximum Cassandra performance while still guaranteeing strong data consistency under mixed workloads.
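
    The coordination the authors describe rests on the standard quorum condition: with replication factor N, reading R replicas and writing W replicas guarantees overlap (and hence strong consistency) whenever R + W > N. A small illustrative check, enumerating the common Cassandra levels for RF = 3:

        REPLICATION_FACTOR = 3

        # Replicas contacted by each Cassandra consistency level (for RF = 3).
        REPLICAS = {"ONE": 1, "QUORUM": 2, "ALL": 3}

        def is_strongly_consistent(read_cl, write_cl, rf=REPLICATION_FACTOR):
            """True when read and write replica sets must overlap (R + W > N)."""
            return REPLICAS[read_cl] + REPLICAS[write_cl] > rf

        for r in REPLICAS:
            for w in REPLICAS:
                if is_strongly_consistent(r, w):
                    print(f"read={r:6} write={w:6} -> strong")

    This also explains why the best strong-consistency setting depends on the read/write ratio: a read-heavy workload is cheapest with (read=ONE, write=ALL), a write-heavy one with (read=ALL, write=ONE), and (QUORUM, QUORUM) sits in between.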

    Dependability Evaluation of Middleware Technology for Large-scale Distributed Caching

    Distributed caching systems (e.g., Memcached) are widely used by service providers to satisfy accesses by millions of concurrent clients. Given their large scale, modern distributed systems rely on a middleware layer to manage caching nodes, to make applications easier to develop, and to apply load balancing and replication strategies. In this work, we performed a dependability evaluation of three popular middleware platforms, namely Twemproxy by Twitter, Mcrouter by Facebook, and Dynomite by Netflix, to assess availability and performance under faults, including failures of Memcached nodes and congestion due to unbalanced workloads and network link bandwidth bottlenecks. We point out the different availability and performance trade-offs achieved by the three platforms, and scenarios in which a few faulty components cause cascading failures of the whole distributed system. (2020 IEEE 31st International Symposium on Software Reliability Engineering, ISSRE 2020.)
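
    To a client, such middleware is largely transparent: Twemproxy, for instance, speaks the memcached protocol, so an application simply points its client at the proxy instead of an individual cache node. A minimal sketch with the Python pymemcache library follows; the address is an assumption (22121 is the port used in Twemproxy's example configuration).

        from pymemcache.client.base import Client

        # Connect to the Twemproxy endpoint rather than a Memcached node;
        # the proxy shards keys across the backing cache nodes.
        cache = Client(("127.0.0.1", 22121), connect_timeout=1.0, timeout=0.5)

        cache.set("session:42", b"payload", expire=300)
        value = cache.get("session:42")  # served by whichever node owns the key
        print(value)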

    GeoYCSB: A Benchmark Framework for the Performance and Scalability Evaluation of Geospatial NoSQL Databases

    The proliferation of geospatial applications has tremendously increased the variety, velocity, and volume of spatial data that data stores have to manage. Traditional relational databases reveal limitations in handling such big geospatial data, mainly due to their rigid schema requirements and limited scalability. Numerous NoSQL databases have emerged and actively serve as alternative data stores for big spatial data. This study presents a framework, called GeoYCSB, developed for benchmarking NoSQL databases with geospatial workloads. To develop GeoYCSB, we extend YCSB, the de facto benchmark framework for NoSQL systems, by integrating into its design architecture the new components necessary to support geospatial workloads. GeoYCSB supports both microbenchmarks and macrobenchmarks and facilitates the use of real datasets in both. It is extensible to evaluate any NoSQL database that supports spatial queries, using geospatial workloads performed on datasets of any geometric complexity. We use GeoYCSB to benchmark two leading document stores, MongoDB and Couchbase, and present the experimental results and analysis. Finally, we demonstrate the extensibility of GeoYCSB by including a new dataset consisting of complex geometries and using it to benchmark a system with a wide variety of geospatial queries: Apache Accumulo, a wide-column store, with the GeoMesa framework applied on top.
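
    The "complex geometries" case differs from simple proximity search in that queries take polygons rather than points. Continuing the earlier pymongo sketch (same hypothetical places collection and 2dsphere index), a region query might look like this:

        # Assumes the 'places' collection and 2dsphere index from the earlier sketch.
        region = {
            "type": "Polygon",
            "coordinates": [[  # one outer ring, closed (first == last vertex)
                [-86.60, 39.10], [-86.40, 39.10],
                [-86.40, 39.25], [-86.60, 39.25],
                [-86.60, 39.10],
            ]],
        }

        # Documents lying entirely inside the polygon.
        inside = places.find({"location": {"$geoWithin": {"$geometry": region}}})

        # Documents that merely touch or cross it (useful for lines/polygons).
        touching = places.find({"location": {"$geoIntersects": {"$geometry": region}}})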

    Performance Analysis of Scalable SQL and NoSQL Databases: A Quantitative Approach

    Benchmarking is a common method for evaluating and choosing a NoSQL database, and many benchmarking reports are already available on the Internet and in research papers. Most of these reports measure database performance only by overall throughput and latency. This is an adequate performance analysis, but it need not be the end. We define some new perspectives that also need to be considered during NoSQL performance analysis. We demonstrate this approach by benchmarking HBase, MongoDB, and sharded MySQL using YCSB. Based on the results, we observe that NoSQL databases do not consider the capability of individual data nodes when assigning data to them. These databases' performance is seriously affected by bottleneck nodes, and the databases do not attempt to resolve such bottleneck situations automatically.
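
    One way to surface such bottleneck nodes, beyond the overall throughput and latency a benchmark reports, is to break latency down per node. The sketch below aggregates hypothetical (node, latency) samples, e.g. parsed from client logs, and flags outliers; the 2x-median threshold is an assumed heuristic, not a figure from the paper.

        import statistics
        from collections import defaultdict

        def find_bottlenecks(samples, threshold=2.0):
            """samples: iterable of (node_id, latency_s) pairs.

            Flags nodes whose median latency exceeds `threshold` times
            the cluster-wide median, a simple outlier heuristic.
            """
            per_node = defaultdict(list)
            for node, latency in samples:
                per_node[node].append(latency)
            cluster_median = statistics.median(
                l for ls in per_node.values() for l in ls
            )
            return {
                node: statistics.median(ls)
                for node, ls in per_node.items()
                if statistics.median(ls) > threshold * cluster_median
            }

        # Example: node "n3" is roughly 5x slower than the others.
        data = [("n1", 0.010), ("n2", 0.011), ("n3", 0.052),
                ("n1", 0.009), ("n2", 0.012), ("n3", 0.048)]
        print(find_bottlenecks(data))  # -> {'n3': 0.05}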