Search CORE

62,046 research outputs found

Observations on Factors Affecting Performance of MapReduce based Apriori on Hadoop Cluster

Author: Garg Rakhi
Mishra P. K.
Singh Sudhakar
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 21/01/2017
Field of study

Designing fast and scalable algorithm for mining frequent itemsets is always being a most eminent and promising problem of data mining. Apriori is one of the most broadly used and popular algorithm of frequent itemset mining. Designing efficient algorithms on MapReduce framework to process and analyze big datasets is contemporary research nowadays. In this paper, we have focused on the performance of MapReduce based Apriori on homogeneous as well as on heterogeneous Hadoop cluster. We have investigated a number of factors that significantly affects the execution time of MapReduce based Apriori running on homogeneous and heterogeneous Hadoop Cluster. Factors are specific to both algorithmic and non-algorithmic improvements. Considered factors specific to algorithmic improvements are filtered transactions and data structures. Experimental results show that how an appropriate data structure and filtered transactions technique drastically reduce the execution time. The non-algorithmic factors include speculative execution, nodes with poor performance, data locality & distribution of data blocks, and parallelism control with input split size. We have applied strategies against these factors and fine tuned the relevant parameters in our particular application. Experimental results show that if cluster specific parameters are taken care of then there is a significant reduction in execution time. Also we have discussed the issues regarding MapReduce implementation of Apriori which may significantly influence the performance.Comment: 8 pages, 8 figures, International Conference on Computing, Communication and Automation (ICCCA2016

arXiv.org e-Print Archive

Crossref

CRAID: Online RAID upgrades using dynamic hot data reorganization

Author: Cortés Toni
Miranda Bueno Alberto
Publication venue: USENIX Association
Publication date: 01/01/2014
Field of study

Current algorithms used to upgrade RAID arrays typically require large amounts of data to be migrated, even those that move only the minimum amount of data required to keep a balanced data load. This paper presents CRAID, a self-optimizing RAID array that performs an online block reorganization of frequently used, long-term accessed data in order to reduce this migration even further. To achieve this objective, CRAID tracks frequently used, long-term data blocks and copies them to a dedicated partition spread across all the disks in the array. When new disks are added, CRAID only needs to extend this process to the new devices to redistribute this partition, thus greatly reducing the overhead of the upgrade process. In addition, the reorganized access patterns within this partition improve the array’s performance, amortizing the copy overhead and allowing CRAID to offer a performance competitive with traditional RAIDs. We describe CRAID’s motivation and design and we evaluate it by replaying seven real-world workloads including a file server, a web server and a user share. Our experiments show that CRAID can successfully detect hot data variations and begin using new disks as soon as they are added to the array. Also, the usage of a dedicated partition improves the sequentiality of relevant data access, which amortizes the cost of reorganizations. Finally, we prove that a full-HDD CRAID array with a small distributed partition (<1.28% per disk) can compete in performance with an ideally restriped RAID-5 and a hybrid RAID-5 with a small SSD cache.Peer ReviewedPostprint (published version

UPCommons. Portal del coneixement obert de la UPC

State of The Art and Hot Aspects in Cloud Data Storage Security

Author: Gehrmann Christian
Morenius Fredric
Paladi Nicolae
Publication venue: Swedish Institute of Computer Science
Publication date: 01/01/2013
Field of study

Along with the evolution of cloud computing and cloud storage towards matu- rity, researchers have analyzed an increasing range of cloud computing security aspects, data security being an important topic in this area. In this paper, we examine the state of the art in cloud storage security through an overview of selected peer reviewed publications. We address the question of defining cloud storage security and its different aspects, as well as enumerate the main vec- tors of attack on cloud storage. The reviewed papers present techniques for key management and controlled disclosure of encrypted data in cloud storage, while novel ideas regarding secure operations on encrypted data and methods for pro- tection of data in fully virtualized environments provide a glimpse of the toolbox available for securing cloud storage. Finally, new challenges such as emergent government regulation call for solutions to problems that did not receive enough attention in earlier stages of cloud computing, such as for example geographical location of data. The methods presented in the papers selected for this review represent only a small fraction of the wide research effort within cloud storage security. Nevertheless, they serve as an indication of the diversity of problems that are being addressed

CiteSeerX

RISE – Research Institutes of Sweden

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Swedish Institute of Computer Science Publications Database

Software institutes' Online Digital Archive

LERC: Coordinated Cache Management for Data-Parallel Systems

Author: Letaief Khaled B.
Wang Wei
Yu Yinghao
Zhang Jun
Publication venue
Publication date: 26/08/2017
Field of study

Memory caches are being aggressively used in today's data-parallel frameworks such as Spark, Tez and Storm. By caching input and intermediate data in memory, compute tasks can witness speedup by orders of magnitude. To maximize the chance of in-memory data access, existing cache algorithms, be it recency- or frequency-based, settle on cache hit ratio as the optimization objective. However, unlike the conventional belief, we show in this paper that simply pursuing a higher cache hit ratio of individual data blocks does not necessarily translate into faster task completion in data-parallel environments. A data-parallel task typically depends on multiple input data blocks. Unless all of these blocks are cached in memory, no speedup will result. To capture this all-or-nothing property, we propose a more relevant metric, called effective cache hit ratio. Specifically, a cache hit of a data block is said to be effective if it can speed up a compute task. In order to optimize the effective cache hit ratio, we propose the Least Effective Reference Count (LERC) policy that persists the dependent blocks of a compute task as a whole in memory. We have implemented the LERC policy as a memory manager in Spark and evaluated its performance through Amazon EC2 deployment. Evaluation results demonstrate that LERC helps speed up data-parallel jobs by up to 37% compared with the widely employed least-recently-used (LRU) policy

arXiv.org e-Print Archive

Crossref