565 research outputs found
Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions involved in big data systems design
include choosing appropriate storage and computing infrastructures. In this age
of heterogeneous systems that integrate different technologies for optimized
solution to a specific real world problem, big data system are not an exception
to any such rule. As far as the storage aspect of any big data system is
concerned, the primary facet in this regard is a storage infrastructure and
NoSQL seems to be the right technology that fulfills its requirements. However,
every big data application has variable data characteristics and thus, the
corresponding data fits into a different data model. This paper presents
feature and use case analysis and comparison of the four main data models
namely document oriented, key value, graph and wide column. Moreover, a feature
analysis of 80 NoSQL solutions has been provided, elaborating on the criteria
and points that a developer must consider while making a possible choice.
Typically, big data storage needs to communicate with the execution engine and
other processing and visualization technologies to create a comprehensive
solution. This brings forth second facet of big data storage, big data file
formats, into picture. The second half of the research paper compares the
advantages, shortcomings and possible use cases of available big data file
formats for Hadoop, which is the foundation for most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage and its challenges and future prospects have
also been discussed
Graphulo Implementation of Server-Side Sparse Matrix Multiply in the Accumulo Database
The Apache Accumulo database excels at distributed storage and indexing and
is ideally suited for storing graph data. Many big data analytics compute on
graph data and persist their results back to the database. These graph
calculations are often best performed inside the database server. The GraphBLAS
standard provides a compact and efficient basis for a wide range of graph
applications through a small number of sparse matrix operations. In this
article, we implement GraphBLAS sparse matrix multiplication server-side by
leveraging Accumulo's native, high-performance iterators. We compare the
mathematics and performance of inner and outer product implementations, and
show how an outer product implementation achieves optimal performance near
Accumulo's peak write rate. We offer our work as a core component to the
Graphulo library that will deliver matrix math primitives for graph analytics
within Accumulo.Comment: To be presented at IEEE HPEC 2015: http://www.ieee-hpec.org
Creating a Relational Distributed Object Store
In and of itself, data storage has apparent business utility. But when we can
convert data to information, the utility of stored data increases dramatically.
It is the layering of relation atop the data mass that is the engine for such
conversion. Frank relation amongst discrete objects sporadically ingested is
rare, making the process of synthesizing such relation all the more
challenging, but the challenge must be met if we are ever to see an equivalent
business value for unstructured data as we already have with structured data.
This paper describes a novel construct, referred to as a relational distributed
object store (RDOS), that seeks to solve the twin problems of how to
persistently and reliably store petabytes of unstructured data while
simultaneously creating and persisting relations amongst billions of objects.Comment: 12 pages, 5 figure
Performance Optimizations of NoSQL Databases in Distributed Systems
Databases store information about a system and provide a mechanism for data to be accessed and manipulated. While advancements in the 1970s provided a relational database model that has persisted to this day, web-scale era mass data needs surfacing in the 1990s and the early 2000s revealed limitations in the scalability of the relational model. As systems grew and transitioned into distributed architectures to support mass data storage and parallel processing, a complete overhaul of distributed computing technologies evolved that fundamentally departed from the relational data model in favor of the NoSQL data model. The course of this research details the scaling problems encountered by relational databases and the NoSQL solutions that made web-scale systems possible
SiteWit Corporation: SQL or NoSQL that is the Question
This teaching case focuses on a start-up company in the Web analytics and online advertising space, which faces a database scaling challenge. The case covers the rapidly emerging NoSQL database products that can be used to implement very large distributed databases. These are exciting times in the database marketplace, with a flock of new companies offering scalable database systems for the cloud. These products challenge the existing relational database vendors that have come to dominate the market. The case outlines four potential solutions and asks students to make a choice or suggest a different alternative
On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark
Querying very large RDF data sets in an efficient manner requires a
sophisticated distribution strategy. Several innovative solutions have recently
been proposed for optimizing data distribution with predefined query workloads.
This paper presents an in-depth analysis and experimental comparison of five
representative and complementary distribution approaches. For achieving fair
experimental results, we are using Apache Spark as a common parallel computing
framework by rewriting the concerned algorithms using the Spark API. Spark
provides guarantees in terms of fault tolerance, high availability and
scalability which are essential in such systems. Our different implementations
aim to highlight the fundamental implementation-independent characteristics of
each approach in terms of data preparation, load balancing, data replication
and to some extent to query answering cost and performance. The presented
measures are obtained by testing each system on one synthetic and one
real-world data set over query workloads with differing characteristics and
different partitioning constraints.Comment: 16 pages, 3 figure
NewSQL Monitoring System
NewSQL is the new breed of databases that combines the best of RDBMS and NoSQL databases. They provide full ACID compliance like RDBMS and are highly scalable and fault-tolerant similar to NoSQL databases. Thus, NewSQL databases are ideal candidates for supporting big data and applications, particularly financial transaction and fraud detection systems, requiring ACID guarantees. Since NewSQL databases can scale to thousands of nodes, it becomes tedious to monitor the entire cluster and each node. Hence, we are building a NewSQL monitoring system using open-source tools. We will consider VoltDB, a popular open-source NewSQL database, as the database to be monitored. Although a monitoring dashboard exists for VoltDB, it only provides the bird’s eye view of the cluster and the nodes and focuses on CPU usages and security aspects. Therefore, several components of a monitoring system have to be considered and have to be open source to be readily available and congruent with the scalability and fault tolerance of VoltDB. Databases like Cassandra (NoSQL), YugabyteDB (NewSQL), and InfluxDB (Time Series) will be used based on their read/write performances and scalability, fault tolerance to store the monitoring data. We will also consider the role of Amazon Kinesis, a popular queueing, messaging, and streaming engine, since it provides fault-tolerant streaming and batching data pipelines between application and system. This project is implemented using Python and Java
- …