Storage Solutions for Big Data Systems: A Qualitative Study and Comparison
Big data systems development is full of challenges in view of the variety of
application areas and domains that this technology promises to serve.
Typically, fundamental design decisions involved in big data systems design
include choosing appropriate storage and computing infrastructures. In this age
of heterogeneous systems that integrate different technologies into an optimized
solution to a specific real-world problem, big data systems are no exception.
For the storage aspect of a big data system, the primary facet is the storage
infrastructure, and NoSQL seems to be the right technology to fulfill its
requirements. However,
every big data application has variable data characteristics and thus, the
corresponding data fits into a different data model. This paper presents a
feature and use-case analysis and comparison of the four main data models,
namely document-oriented, key-value, graph and wide-column. Moreover, a feature
analysis of 80 NoSQL solutions is provided, elaborating on the criteria and
points that a developer must consider when choosing among them.
Typically, big data storage needs to communicate with the execution engine and
other processing and visualization technologies to create a comprehensive
solution. This brings the second facet of big data storage, big data file
formats, into the picture. The second half of the paper compares the
advantages, shortcomings and possible use cases of available big data file
formats for Hadoop, which is the foundation for most big data computing
technologies. Decentralized storage and blockchain are seen as the next
generation of big data storage, and their challenges and future prospects have
also been discussed
Big Data Meets Telcos: A Proactive Caching Perspective
Mobile cellular networks are becoming increasingly complex to manage while
classical deployment/optimization techniques and current solutions (e.g., cell
densification and acquiring more spectrum) are cost-ineffective and thus
seen as stopgaps. This calls for the development of novel approaches that
leverage recent advances in storage/memory, context-awareness and edge/cloud
computing, and that fall into the framework of big data. However, big data is
itself yet another complex phenomenon to handle and comes with its notorious
4Vs: velocity, veracity, volume and variety. In this work, we address these issues in
optimization of 5G wireless networks via the notion of proactive caching at the
base stations. In particular, we investigate the gains of proactive caching in
terms of backhaul offloading and request satisfaction, while tackling the
large amount of available data for content popularity estimation. In order to
estimate the content popularity, we first collect users' mobile traffic data
from several base stations of a Turkish telecom operator over time intervals
of hours. Then, an analysis is carried out locally on a big data platform and
the gains of proactive caching at the base stations are investigated via
numerical simulations. It turns out that several gains are possible depending
on the level of available information and storage size. For instance, with 10%
of content ratings and 15.4 Gbyte of storage size (87% of total catalog size),
proactive caching achieves 100% of request satisfaction and offloads 98% of the
backhaul when considering 16 base stations.
Comment: 8 pages, 5 figures
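The two metrics named in this abstract, request satisfaction and backhaul offloading, can be illustrated with a minimal sketch. It is not the authors' method: it assumes popularity is estimated by simple request counts on a training trace, that the cache is filled greedily by popularity, and that offloading is measured as the fraction of requested bytes served from the cache. The function names `proactive_cache` and `evaluate` and the toy data are hypothetical.

```python
from collections import Counter


def proactive_cache(train_requests, sizes, capacity):
    """Greedily cache the most requested items that still fit in `capacity`.

    Assumption: popularity is estimated by raw request counts.
    """
    popularity = Counter(train_requests)
    cached, used = set(), 0
    for item, _count in popularity.most_common():
        if used + sizes[item] <= capacity:
            cached.add(item)
            used += sizes[item]
    return cached


def evaluate(cached, test_requests, sizes):
    """Return (request satisfaction, backhaul offload) for a request trace.

    Satisfaction: fraction of requests served from the cache.
    Offload: fraction of requested bytes served from the cache.
    """
    hits = sum(1 for r in test_requests if r in cached)
    total_bytes = sum(sizes[r] for r in test_requests)
    hit_bytes = sum(sizes[r] for r in test_requests if r in cached)
    return hits / len(test_requests), hit_bytes / total_bytes


# Toy data (hypothetical): three contents, cache capacity of 4 units.
sizes = {"a": 2, "b": 2, "c": 1}
cached = proactive_cache(["a", "a", "a", "b", "b", "c"], sizes, capacity=4)
satisfaction, offload = evaluate(cached, ["a", "b", "c", "a"], sizes)
```

A larger cache or a better popularity estimate raises both ratios, which is the trade-off the abstract reports in terms of storage size and available information.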
An Experiment on Bare-Metal BigData Provisioning
Many BigData customers use on-demand platforms in the cloud, where they can get a dedicated virtual cluster in a couple of minutes and pay only for the time they use. Increasingly, there is a demand for bare-metal BigData solutions for applications that cannot tolerate the unpredictability and performance degradation of virtualized systems. Existing bare-metal solutions can introduce delays of tens of minutes to provision a cluster by installing operating systems and applications on the local disks of servers. This has motivated recent research developing sophisticated mechanisms to optimize this installation. These approaches assume that using network-mounted boot disks incurs unacceptable run-time overhead. Our analysis suggests that while this assumption is true for application data, it is incorrect for operating systems and applications: network mounting the boot disk and applications results in negligible run-time impact while leading to faster provisioning time.
This research was supported in part by the MassTech
Collaborative Research Matching Grant Program, NSF
awards 1347525 and 1414119 and several commercial
partners of the Massachusetts Open Cloud who may be
found at http://www.massopencloud.or
i2MapReduce: Incremental MapReduce for Mining Evolving Big Data
As new data and updates are constantly arriving, the results of data mining
applications become stale and obsolete over time. Incremental processing is a
promising approach to refreshing mining results. It utilizes previously saved
states to avoid the expense of re-computation from scratch.
In this paper, we propose i2MapReduce, a novel incremental processing
extension to MapReduce, the most widely used framework for mining big data.
Compared with the state-of-the-art work on Incoop, i2MapReduce (i) performs
key-value pair level incremental processing rather than task level
re-computation, (ii) supports not only one-step computation but also more
sophisticated iterative computation, which is widely used in data mining
applications, and (iii) incorporates a set of novel techniques to reduce I/O
overhead for accessing preserved fine-grain computation states. We evaluate
i2MapReduce using a one-step algorithm and three iterative algorithms with
diverse computation characteristics. Experimental results on Amazon EC2 show
significant performance improvements of i2MapReduce compared to both plain and
iterative MapReduce performing re-computation
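The idea of key-value pair level incremental processing can be sketched for a one-step word count. This is only an illustration under stated assumptions, not the i2MapReduce implementation: the preserved state is held in an in-memory `Counter`, the delta input is given as explicit lists of added and removed records, and the names `map_record`, `initial_run` and `incremental_run` are hypothetical. Iterative computation and the I/O optimizations mentioned in the abstract are not covered.

```python
from collections import Counter


def map_record(record):
    """Map step: emit a (word, 1) pair for each word in the record."""
    return [(word, 1) for word in record.split()]


def initial_run(records):
    """Full computation from scratch; the returned Counter is the saved state."""
    state = Counter()
    for record in records:
        for key, value in map_record(record):
            state[key] += value
    return state


def incremental_run(state, added=(), removed=()):
    """Update the saved state in place using only the delta input.

    Only keys emitted by added or removed records are touched; every other
    key keeps its previously computed result. Returns the set of touched keys.
    """
    touched = set()
    for record in added:
        for key, value in map_record(record):
            state[key] += value
            touched.add(key)
    for record in removed:
        for key, value in map_record(record):
            state[key] -= value
            touched.add(key)
            if state[key] == 0:
                del state[key]
    return touched


state = initial_run(["a b", "b c"])
touched = incremental_run(state, added=["c d"], removed=["a b"])
```

After the update, `state` matches what a full re-computation over the new input `["b c", "c d"]` would produce, while the work done is proportional to the size of the delta rather than the whole data set.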