Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of application areas and domains that this technology promises to serve. Typically, fundamental design decisions in big data systems include choosing appropriate storage and computing infrastructures. In this age of heterogeneous systems that integrate different technologies into an optimized solution for a specific real-world problem, big data systems are no exception. As far as the storage aspect of any big data system is concerned, the primary facet is the storage infrastructure, and NoSQL appears to be the right technology to fulfill its requirements. However, every big data application has different data characteristics, and thus its data fits a different data model. This paper presents a feature and use-case analysis and comparison of the four main data models, namely document-oriented, key-value, graph, and wide-column. Moreover, a feature analysis of 80 NoSQL solutions is provided, elaborating on the criteria and points that a developer must consider while making a choice. Typically, big data storage needs to communicate with the execution engine and other processing and visualization technologies to create a comprehensive solution. This brings the second facet of big data storage, big data file formats, into the picture. The second half of the paper compares the advantages, shortcomings, and possible use cases of the big data file formats available for Hadoop, which is the foundation for most big data computing technologies. Decentralized storage and blockchain are seen as the next generation of big data storage, and their challenges and future prospects are also discussed.
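As an illustration of how the four data models differ (a sketch of ours, not drawn from the paper), the following shows the same user record expressed in each paradigm; all names and values are hypothetical:

```python
# Illustrative only: one user record in each of the four NoSQL data models.

# Document-oriented (e.g. MongoDB): a self-contained, nested document.
document = {
    "_id": "user:42",
    "name": "Ada",
    "orders": [{"sku": "B-101", "qty": 2}],   # related data embedded inline
}

# Key-value (e.g. Redis): opaque values addressed by key; the application
# is responsible for any joins or cross-references.
key_value = {
    "user:42:name": "Ada",
    "user:42:orders": '[{"sku": "B-101", "qty": 2}]',  # serialized blob
}

# Wide-column (e.g. Cassandra): rows grouped into column families, with a
# sparse, per-row set of columns.
wide_column = {
    ("users", "42"): {"profile:name": "Ada", "orders:B-101": 2},
}

# Graph (e.g. Neo4j): entities as nodes, relationships as first-class edges.
graph_nodes = {"user:42": {"name": "Ada"}, "item:B-101": {"sku": "B-101"}}
graph_edges = [("user:42", "ORDERED", "item:B-101", {"qty": 2})]
```

Which representation fits depends on the access pattern: embedded documents favor whole-record reads, key-value favors point lookups, wide-column favors sparse high-volume writes, and graphs favor relationship traversal.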
FactorBase: SQL for Learning A Multi-Relational Graphical Model
We describe FactorBase, a new SQL-based framework that leverages a relational
database management system to support multi-relational model discovery. A
multi-relational statistical model provides an integrated analysis of the
heterogeneous and interdependent data resources in the database. We adopt the
BayesStore design philosophy: statistical models are stored and managed as
first-class citizens inside a database. Whereas previous systems like
BayesStore support multi-relational inference, FactorBase supports
multi-relational learning. A case study on six benchmark databases evaluates
how our system supports a challenging machine learning application, namely
learning a first-order Bayesian network model for an entire database. Model
learning in this setting has to examine a large number of potential statistical
associations across data tables. Our implementation shows how the SQL
constructs in FactorBase facilitate the fast, modular, and reliable development
of highly scalable model learning systems.

Comment: 14 pages, 10 figures, 10 tables. Published in the 2015 IEEE International Conference on Data Science and Advanced Analytics (IEEE DSAA'2015), Oct 19-21, 2015, Paris, France.
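The following sketch illustrates the "first-class citizen" idea in miniature, assuming nothing about FactorBase's actual schema (table and column names here are ours): sufficient statistics for a candidate cross-table association are materialized as an ordinary SQL table and read back with plain queries.

```python
# Hedged sketch of statistics-as-tables; not FactorBase's real schema.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE student(sid INTEGER PRIMARY KEY, intelligence TEXT);
    CREATE TABLE registration(sid INTEGER, course TEXT, grade TEXT);
    INSERT INTO student VALUES (1,'hi'),(2,'lo'),(3,'hi');
    INSERT INTO registration VALUES (1,'db','A'),(2,'db','C'),(3,'ml','A');

    -- Contingency table for (intelligence, grade), stored as ordinary data
    -- so model search can query it like any other relation.
    CREATE TABLE ct_intel_grade AS
    SELECT s.intelligence, r.grade, COUNT(*) AS n
    FROM student s JOIN registration r ON s.sid = r.sid
    GROUP BY s.intelligence, r.grade;
""")

# Model learning reads the statistics back with plain SQL.
for row in con.execute("SELECT * FROM ct_intel_grade ORDER BY n DESC"):
    print(row)
```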
The MADlib Analytics Library or MAD Skills, the SQL
MADlib is a free, open source library of in-database analytic methods. It
provides an evolving suite of SQL-based algorithms for machine learning, data
mining and statistics that run at scale within a database engine, with no need
for data import/export to other tools. The goal is for MADlib to eventually
serve a role for scalable database systems that is similar to the CRAN library
for R: a community repository of statistical methods, this time written with
scale and parallelism in mind. In this paper we introduce the MADlib project,
including the background that led to its beginnings, and the motivation for its
open source nature. We provide an overview of the library's architecture and
design patterns, and provide a description of various statistical methods in
that context. We include performance and speedup results of a core design
pattern from one of those methods over the Greenplum parallel DBMS on a
modest-sized test cluster. We then report on two initial efforts at
incorporating academic research into MADlib, which is one of the project's
goals. MADlib is freely available at http://madlib.net, and the project is open for contributions of both new methods and ports to additional database platforms.

Comment: VLDB2012
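As a minimal sketch of the in-database pattern MADlib follows (not its actual API, which ships statistical methods as SQL functions), the following fits a least-squares line using nothing but SQL aggregates, so no data ever leaves the engine:

```python
# Toy stand-in for in-database analytics: closed-form simple linear
# regression y = a + b*x computed entirely inside the database.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE points(x REAL, y REAL)")
con.executemany("INSERT INTO points VALUES (?, ?)",
                [(1, 2.1), (2, 3.9), (3, 6.2), (4, 8.0)])

# Slope from the normal equations, as a single aggregate query.
(b,) = con.execute("""
    SELECT (COUNT(*) * SUM(x*y) - SUM(x) * SUM(y))
         / (COUNT(*) * SUM(x*x) - SUM(x) * SUM(x))
    FROM points
""").fetchone()
# Intercept from the means, again computed in-engine.
(a,) = con.execute("SELECT AVG(y) - ? * AVG(x) FROM points", (b,)).fetchone()
print(f"y = {a:.3f} + {b:.3f}*x")
```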
SkyQuery: An Implementation of a Parallel Probabilistic Join Engine for Cross-Identification of Multiple Astronomical Databases
Multi-wavelength astronomical studies require cross-identification of
detections of the same celestial objects in multiple catalogs based on
spherical coordinates and other properties. Because of the large data volumes
and spherical geometry, the symmetric N-way association of astronomical
detections is a computationally intensive problem, even when sophisticated
indexing schemes are used to exclude obviously false candidates. Legacy
astronomical catalogs already contain detections of more than a hundred million objects, while ongoing and future surveys will produce catalogs of billions
of objects with multiple detections of each at different times. The varying
statistical error of position measurements, moving and extended objects, and
other physical properties make it necessary to perform the cross-identification
using a mathematically correct, proper Bayesian probabilistic algorithm,
capable of including various priors. One-time, pairwise cross-identification
of these large catalogs is not sufficient for many astronomical scenarios.
Consequently, a novel system is necessary that can cross-identify multiple
catalogs on-demand, efficiently and reliably. In this paper, we present our
solution based on a cluster of commodity servers and ordinary relational
databases. The cross-identification problems are formulated in a language based
on SQL, but extended with special clauses. These special queries are
partitioned spatially by coordinate ranges and compiled into a complex workflow
of ordinary SQL queries. Workflows are then executed in a parallel framework
using a cluster of servers hosting identical mirrors of the same data sets.
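A toy version of the spatial-partitioning idea (our own simplification, not SkyQuery's implementation) is sketched below: detections are bucketed by declination zone so each object is compared only against nearby candidates rather than the whole catalog.

```python
# Hedged sketch of zone-based pairwise cross-matching.
from math import radians, degrees, sin, cos, acos

ZONE_HEIGHT = 0.5  # declination degrees per zone

def zone(dec_deg):
    return int(dec_deg // ZONE_HEIGHT)

def ang_sep_deg(ra1, dec1, ra2, dec2):
    """Angular separation on the sphere, in degrees."""
    c = (sin(radians(dec1)) * sin(radians(dec2))
         + cos(radians(dec1)) * cos(radians(dec2)) * cos(radians(ra1 - ra2)))
    return degrees(acos(min(1.0, max(-1.0, c))))  # clamp rounding error

def cross_match(cat_a, cat_b, radius_deg=0.001):
    by_zone = {}
    for det in cat_b:                       # index catalog B by zone
        by_zone.setdefault(zone(det[1]), []).append(det)
    matches = []
    for ra, dec in cat_a:                   # probe only adjacent zones
        for z in (zone(dec) - 1, zone(dec), zone(dec) + 1):
            for ra2, dec2 in by_zone.get(z, []):
                if ang_sep_deg(ra, dec, ra2, dec2) <= radius_deg:
                    matches.append(((ra, dec), (ra2, dec2)))
    return matches

print(cross_match([(10.0, 41.0)], [(10.0004, 41.0003), (10.5, 41.0)]))
```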
On Big Data Benchmarking
Big data systems address the challenges of capturing, storing, managing,
analyzing, and visualizing big data. Within this context, developing benchmarks
to evaluate and compare big data systems has become an active topic for both
research and industry communities. To date, most of the state-of-the-art big
data benchmarks are designed for specific types of systems. Based on our
experience, however, we argue that, considering the complexity, diversity, and rapid evolution of big data systems, big data benchmarks must, for the sake of fairness, include a diversity of data and workloads. Given this motivation,
in this paper, we first propose the key requirements and challenges in
developing big data benchmarks from the perspectives of generating data with 4V
properties (i.e. volume, velocity, variety and veracity) of big data, as well
as generating tests with comprehensive workloads for big data systems. We then
present the methodology on big data benchmarking designed to address these
challenges. Next, the state of the art is summarized and compared, followed by our vision for future research directions.

Comment: 7 pages, 4 figures, 2 tables. Accepted in BPOE-04 (http://prof.ict.ac.cn/bpoe_4_asplos/).
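To make the 4V requirements concrete, here is a hedged sketch (ours, not the paper's methodology) of a generator that replays seed data at scale for volume, paces emission for velocity, mixes schemas for variety, and injects dirty values for veracity:

```python
# Illustrative 4V data generator; parameters and schemas are hypothetical.
import random, time

SEED_ROWS = [{"user": "u1", "clicks": 3}, {"user": "u2", "clicks": 7}]

def generate(rows_total, rows_per_sec, error_rate=0.05):
    for i in range(rows_total):
        row = dict(random.choice(SEED_ROWS))      # volume: scale up seed data
        if i % 2:                                 # variety: alternate schemas
            row = {"uid": row["user"], "events": [row["clicks"]]}
        if random.random() < error_rate:          # veracity: inject dirty values
            row["clicks" if "clicks" in row else "events"] = None
        yield row
        time.sleep(1.0 / rows_per_sec)            # velocity: pace the stream

for record in generate(rows_total=6, rows_per_sec=100):
    print(record)
```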
Benchmarking Big Data Systems: State-of-the-Art and Future Directions
The great prosperity of big data systems such as Hadoop in recent years has made the benchmarking of these systems crucial for both research and industry communities. The complexity, diversity, and rapid evolution of big data systems give rise to various new challenges in how we design generators to produce data with the 4V properties (i.e. volume, velocity, variety and veracity), as well as how we implement application-specific but still comprehensive workloads. However, most of the existing big data benchmarks can be described as attempts to solve specific problems in benchmarking systems. This article investigates the state of the art in benchmarking big data systems, along with the future challenges to be addressed to realize a successful and efficient benchmark.

Comment: 9 pages, 2 figures. arXiv admin note: substantial text overlap with arXiv:1402.5194.
Development of grid frameworks for clinical trials and epidemiological studies
E-Health initiatives such as electronic clinical trials and epidemiological studies require access to and usage of a range of both clinical and other data sets. Such data sets are typically only available over many heterogeneous domains where a plethora of often legacy-based or in-house/bespoke IT solutions exist. Considerable efforts and investments are being made across the UK to upgrade the IT infrastructures across the National Health Service (NHS), such as the National Programme for IT in the NHS (NPfIT) [1]. However, it is the case that currently independent and largely non-interoperable IT solutions exist across hospitals, trusts, disease registries and GP practices – this includes security as well as more general compute and data infrastructures. Grid technology allows issues of distribution and heterogeneity to be overcome; however, the clinical trials domain places special demands on security and data which the Grid community has hitherto not satisfactorily addressed. These challenges are often common across many studies and trials; hence, the development of a re-usable framework for the creation and subsequent management of such infrastructures is highly desirable. In this paper we present the challenges in developing such a framework and outline initial scenarios and prototypes developed within the MRC-funded Virtual Organisations for Trials and Epidemiological Studies (VOTES) project [2].
Big Data: Understanding Big Data
Steve Jobs, one of the greatest visionaries of our time, was quoted in 1996 as saying "a lot of times, people do not know what they want until you show it to them" [38], indicating that he advocated developing products based on human intuition rather than research. With the advancements of mobile devices, social networks and the Internet of Things, enormous amounts of complex data, both structured and unstructured, are being captured in the hope of allowing organizations to make better business decisions, as data is now vital for an organization's success. These enormous amounts of data are referred to as Big Data, which enables a competitive advantage over rivals when processed and analyzed appropriately. However, Big Data analytics has a few concerns, including Management of the Data Lifecycle, Privacy & Security, and Data Representation. This paper reviews the fundamental concept of Big Data, the Data Storage domain, and the MapReduce programming paradigm used in processing these large datasets, and focuses on two case studies showing the effectiveness of Big Data analytics and presenting how it could be of greater good in the future if handled appropriately.

Comment: 8 pages. Big Data Analytics, Data Storage, MapReduce, Knowledge-Space, Big Data Inconsistencies.
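For readers unfamiliar with the paradigm the paper reviews, a minimal word-count sketch of MapReduce follows (a standard textbook illustration, not code from the paper); real frameworks distribute these three phases across a cluster:

```python
# Map emits (word, 1) pairs, shuffle groups them by key, reduce sums counts.
from collections import defaultdict

def map_phase(doc):
    for word in doc.lower().split():
        yield word, 1

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data needs big storage", "data is vital"]
pairs = (pair for doc in docs for pair in map_phase(doc))
print(reduce_phase(shuffle(pairs)))
```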
Fault diversity among off-the-shelf SQL database servers
Fault tolerance is often the only viable way of obtaining the required system dependability from systems built out of "off-the-shelf" (OTS) products. We have studied a sample of bug reports from four off-the-shelf SQL servers so as to estimate the possible advantages of software fault tolerance - in the form of modular redundancy with diversity - in complex off-the-shelf software. We checked whether these bugs would cause coincident failures in more than one of the servers. We found that very few bugs affected two of the four servers, and none caused failures in more than two. We also found that only four of these bugs would cause identical, undetectable failures in two servers. Therefore, a fault-tolerant server, built with diverse off-the-shelf servers, seems to have a good chance of delivering improvements in availability and failure rates compared with the individual off-the-shelf servers or their replicated, non-diverse configurations.
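A minimal sketch of the modular-redundancy-with-diversity idea the study motivates is shown below; two in-memory SQLite connections stand in for genuinely diverse servers. Comparing results catches the non-identical failures, which the study found to be the vast majority; only identical failures in both servers would slip through.

```python
# Hedged sketch of a diverse-redundancy adjudicator over two SQL "servers".
import sqlite3

def make_server():
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t(v INTEGER)")
    con.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])
    return con

servers = [make_server(), make_server()]

def fault_tolerant_query(sql):
    # Issue the same query to both servers and compare normalized results.
    results = [sorted(s.execute(sql).fetchall()) for s in servers]
    if results[0] != results[1]:
        # Non-identical failure detected: refuse to return bad data.
        raise RuntimeError(f"servers disagree: {results}")
    return results[0]

print(fault_tolerant_query("SELECT SUM(v) FROM t"))
```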
A Comparative Taxonomy and Survey of Public Cloud Infrastructure Vendors
An increasing number of technology enterprises are adopting cloud-native
architectures to offer their web-based products, by moving away from
privately-owned data-centers and relying exclusively on cloud service
providers. As a result, the number of cloud vendors has lately increased, along with the estimated annual revenue they share. However, in the process of selecting a provider's cloud service over the competition, we observe a lack of universal common ground in terms of terminology, functionality of services, and billing models. This is an important gap, especially under the new reality of the industry, where each cloud provider has moved towards its own service taxonomy while the number of specialized services has grown exponentially. This work discusses the cloud services offered by four cloud vendors that are dominant in terms of current market share. We provide a taxonomy of their services and sub-services that designates major service families, namely computing, storage, databases, analytics, data pipelines, machine learning, and networking. The aim of such clustering is to indicate similarities, common design approaches, and functional differences among the offered services. The outcomes are essential both for individual researchers and for larger enterprises in their attempt to identify the set of cloud services that will fully meet their needs without compromises. While we acknowledge that this is a dynamic industry, where new services arise constantly and old ones receive important updates, this study paints a solid image of the current offerings and gives prominence to the directions that cloud service providers are following.
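One possible way to encode such a taxonomy programmatically (a hypothetical sketch of ours; vendor and service names below are placeholders, not the survey's actual mapping) is a dictionary from service families to per-vendor offerings:

```python
# Hypothetical taxonomy encoding: service family -> {vendor: offering}.
taxonomy = {
    "computing": {"vendor_a": "managed VMs", "vendor_b": "managed VMs"},
    "storage":   {"vendor_a": "object store", "vendor_b": "blob store"},
    "databases": {"vendor_a": "managed SQL", "vendor_b": "managed NoSQL"},
    "analytics": {"vendor_a": "query service", "vendor_b": "data warehouse"},
}

def offerings(family):
    """All vendor offerings in one service family, for side-by-side review."""
    return sorted(taxonomy.get(family, {}).items())

print(offerings("storage"))
```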