Experimental Performance Evaluation of Cloud-Based Analytics-as-a-Service
An increasing number of Analytics-as-a-Service solutions have recently
appeared in the landscape of cloud-based services. These services allow
flexible composition of compute and storage components that create powerful
data ingestion and processing pipelines. This work is a first attempt at an
experimental evaluation of the performance of analytic applications executed
across a wide range of storage service configurations. We present an intuitive
notion of data locality, which we use as a proxy to rank different service
compositions in terms of expected performance. Through an empirical analysis,
we dissect the performance achieved by analytic workloads and unveil problems
due to the impedance mismatch that arises in some configurations. Our work
paves the way to a better understanding of modern cloud-based analytic services
and their performance, both for their end-users and for their providers.
Comment: Longer version of the paper in Submission at IEEE CLOUD'1
Any Data, Any Time, Anywhere: Global Data Access for Science
Data access is key to science driven by distributed high-throughput computing
(DHTC), an essential technology for many major research projects such as High
Energy Physics (HEP) experiments. However, achieving efficient data access
becomes quite difficult when many independent storage sites are involved
because users are burdened with learning the intricacies of accessing each
system and keeping careful track of data location. We present an alternate
approach: the Any Data, Any Time, Anywhere infrastructure. Combining several
existing software products, AAA presents a global, unified view of storage
systems - a "data federation," a global filesystem for software delivery, and a
workflow management system. We describe how one HEP experiment, the Compact Muon
Solenoid (CMS), is using the AAA infrastructure and report some simple
performance metrics.
Comment: 9 pages, 6 figures, submitted to 2nd IEEE/ACM International Symposium
on Big Data Computing (BDC) 201
Overview of Caching Mechanisms to Improve Hadoop Performance
In today's distributed computing environments, large amounts of data are
generated from different resources with a high velocity, rendering the data
difficult to capture, manage, and process within existing relational databases.
Hadoop is a tool to store and process large datasets in a parallel manner
across a cluster of machines in a distributed environment. Hadoop brings many
benefits like flexibility, scalability, and high fault tolerance; however, it
faces some challenges in terms of data access time, I/O operations, and
duplicate computations, resulting in extra overhead, resource wastage, and poor
performance. Many researchers have utilized caching mechanisms to tackle these
challenges. For example, they have presented approaches to improve data access
time, enhance data locality rate, remove repetitive calculations, reduce the
number of I/O operations, decrease the job execution time, and increase
resource efficiency. In the current study, we provide a comprehensive overview
of caching strategies to improve Hadoop performance. Additionally, a novel
classification is introduced based on cache utilization. Using this
classification, we analyze the impact on Hadoop performance and discuss the
advantages and disadvantages of each group. Finally, a novel hybrid approach
called Hybrid Intelligent Cache (HIC) that combines the benefits of two methods
from different groups, H-SVM-LRU and CLQLMRS, is presented. Experimental
results show that our hybrid method achieves an average improvement of 31.2% in
job execution time.
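The abstract above names LRU-derived methods but does not describe HIC, H-SVM-LRU, or CLQLMRS in detail, so none of them is reproduced here. As a baseline for what a block cache does, the following is a minimal sketch of a plain LRU cache that tracks its hit ratio. The class name `LRUBlockCache`, the `load_block` callback, and the block IDs are all hypothetical and are not taken from Hadoop or from the paper.

```python
from collections import OrderedDict


class LRUBlockCache:
    """Plain LRU cache keyed by block ID (illustrative baseline, not HIC)."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._blocks = OrderedDict()  # oldest entry first, newest last
        self.hits = 0
        self.misses = 0

    def read(self, block_id, load_block):
        """Return the block, loading it through load_block on a miss."""
        if block_id in self._blocks:
            # Hit: mark the block as most recently used.
            self._blocks.move_to_end(block_id)
            self.hits += 1
            return self._blocks[block_id]
        # Miss: load from the slow source, then evict the oldest if full.
        self.misses += 1
        data = load_block(block_id)
        self._blocks[block_id] = data
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)
        return data

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def slow_load(block_id):
    # Hypothetical stand-in for a disk or network read.
    return f"contents of {block_id}"


cache = LRUBlockCache(capacity=2)
for block in ["b1", "b2", "b1", "b3", "b2"]:
    cache.read(block, slow_load)
```

With a capacity of two blocks, only the second read of `b1` is served from the cache in this access pattern. Strategies like those surveyed above aim to improve on this kind of baseline by choosing more carefully which blocks to keep.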
Big Data Meets Telcos: A Proactive Caching Perspective
Mobile cellular networks are becoming increasingly complex to manage, while
classical deployment/optimization techniques and current solutions (e.g., cell
densification and acquiring more spectrum) are cost-ineffective and thus
seen as stopgaps. This calls for the development of novel approaches that
leverage recent advances in storage/memory, context-awareness, and edge/cloud
computing, and that fall within the framework of big data. However, big data is
itself yet another complex phenomenon to handle and comes with its notorious
4V: velocity, veracity, volume, and variety. In this work, we address these
issues in the optimization of 5G wireless networks via the notion of proactive
caching at the base stations. In particular, we investigate the gains of
proactive caching in terms of backhaul offloading and request satisfaction,
while tackling the large amount of available data for content popularity
estimation. In order to estimate the content popularity, we first collect
users' mobile traffic data from several base stations of a Turkish telecom
operator over a time interval spanning hours. Then, an analysis is carried out
locally on a big data platform and
the gains of proactive caching at the base stations are investigated via
numerical simulations. It turns out that several gains are possible depending
on the level of available information and storage size. For instance, with 10%
of content ratings and 15.4 Gbyte of storage size (87% of total catalog size),
proactive caching achieves 100% of request satisfaction and offloads 98% of the
backhaul when considering 16 base stations.
Comment: 8 pages, 5 figures
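The idea of proactive caching can be sketched as follows: estimate content popularity from past requests, place the most popular items in the base-station cache, and measure what fraction of later requests is served locally. This is a simplified illustration only. It assumes unit-size content items and uses raw request counts as the popularity estimate, whereas the abstract refers to estimation from partial content ratings without giving details. The function names and the sample traces are hypothetical.

```python
from collections import Counter


def choose_cache(training_requests, cache_slots):
    """Pick the cache_slots most requested items (ties broken by name)."""
    counts = Counter(training_requests)
    ranked = sorted(counts, key=lambda item: (-counts[item], item))
    return set(ranked[:cache_slots])


def satisfaction_ratio(test_requests, cached):
    """Fraction of requests that can be served from the local cache."""
    if not test_requests:
        return 0.0
    served = sum(1 for item in test_requests if item in cached)
    return served / len(test_requests)


# Hypothetical request traces: one to estimate popularity, one to evaluate.
training = ["a", "b", "a", "c", "a", "b", "d"]
test = ["a", "b", "c", "a", "e"]

cached = choose_cache(training, cache_slots=2)
ratio = satisfaction_ratio(test, cached)
```

Here the cache holds `a` and `b`, and three of the five test requests are served without using the backhaul. In this unit-size setting, every locally served request is also one request offloaded from the backhaul.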
Gaining insight from large data volumes with ease
Efficient handling of large data volumes has become a necessity in today's
world. It is driven by the desire to get more insight from the data and to gain
a better understanding of user trends, which can be transformed into economic
incentives (profits, cost reduction, and various optimizations of data
workflows and pipelines). In this paper, we discuss how modern technologies are
transforming well-established patterns in HEP communities. New insight into
data can be achieved by embracing Big Data tools for a variety of use cases,
from analytics and monitoring to training Machine Learning models on a terabyte
scale. We provide concrete examples within the context of the CMS experiment
where Big Data tools are already playing or would play a significant role in
daily operations.
MapReduce analysis for cloud-archived data
Public storage clouds have become a popular choice for archiving certain classes of enterprise data, for example application and infrastructure logs. These logs contain sensitive information such as IP addresses or user logins, so regulatory and security requirements often demand that the data be encrypted before it is moved to the cloud. In order to leverage such data for any business value, analytics systems (e.g. Hadoop/MapReduce) first download the data from these public clouds, decrypt it, and then process it at the secure enterprise site. We propose VNCache: an efficient solution for MapReduce analysis of such cloud-archived log data without requiring an a priori data transfer and loading into the local Hadoop cluster. VNCache dynamically integrates cloud-archived data into a virtual namespace at the enterprise Hadoop cluster. Through a seamless data streaming and prefetching model, Hadoop jobs can begin execution as soon as they are launched, without requiring any a priori downloading. With VNCache's accurate prefetching and caching, jobs often run on a locally cached copy of the data block, significantly improving performance. When no longer needed, data is safely evicted from the enterprise cluster, reducing the total storage footprint. Uniquely, VNCache is implemented with no changes to the Hadoop application stack. © 2014 IEEE
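The streaming-and-prefetching idea can be illustrated with a small read-ahead cache. This is a sketch under stated assumptions, not VNCache's actual design: the class `PrefetchingReader`, its `lookahead` parameter, and the `fetch_remote` callback are hypothetical. Prefetching is done synchronously here for simplicity, whereas a real system would stream blocks in the background, and decryption is omitted.

```python
class PrefetchingReader:
    """Read-ahead cache over remote blocks (illustrative, not VNCache)."""

    def __init__(self, fetch_remote, num_blocks, lookahead=2):
        self.fetch_remote = fetch_remote  # callable: block index -> data
        self.num_blocks = num_blocks
        self.lookahead = lookahead
        self.local = {}                   # locally cached blocks
        self.remote_fetches = 0
        self.local_hits = 0

    def _fetch(self, index):
        # Download a block only if it is not already cached locally.
        if index not in self.local:
            self.local[index] = self.fetch_remote(index)
            self.remote_fetches += 1

    def read(self, index):
        if not 0 <= index < self.num_blocks:
            raise IndexError(f"block {index} out of range")
        if index in self.local:
            self.local_hits += 1
        else:
            self._fetch(index)
        data = self.local[index]
        # Read ahead: pull the next few blocks before they are requested.
        last = min(index + 1 + self.lookahead, self.num_blocks)
        for nxt in range(index + 1, last):
            self._fetch(nxt)
        return data

    def evict(self, index):
        # Drop a block that is no longer needed to limit local storage.
        self.local.pop(index, None)


def fake_cloud_fetch(index):
    # Hypothetical stand-in for downloading a block from cloud storage.
    return f"block-{index}"


reader = PrefetchingReader(fake_cloud_fetch, num_blocks=5, lookahead=2)
blocks = [reader.read(i) for i in range(5)]
```

For a sequential scan of five blocks, only the first read waits on the remote store. The remaining four are served from the local copy, while each block is still downloaded exactly once.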