
    Observations on Factors Affecting Performance of MapReduce based Apriori on Hadoop Cluster

    Full text link
    Designing fast and scalable algorithms for mining frequent itemsets has long been a prominent problem in data mining, and Apriori is one of the most widely used frequent itemset mining algorithms. Designing efficient algorithms on the MapReduce framework to process and analyze big datasets is an active area of research. In this paper, we focus on the performance of MapReduce based Apriori on homogeneous as well as heterogeneous Hadoop clusters. We investigate a number of factors that significantly affect the execution time of MapReduce based Apriori running on such clusters; the factors cover both algorithmic and non-algorithmic improvements. The algorithmic factors considered are filtered transactions and data structures, and experimental results show how an appropriate data structure and the filtered-transactions technique drastically reduce execution time. The non-algorithmic factors include speculative execution, nodes with poor performance, data locality and distribution of data blocks, and parallelism control via the input split size. We apply strategies against these factors and fine-tune the relevant parameters for our particular application; experimental results show that taking care of cluster-specific parameters yields a significant reduction in execution time. We also discuss issues in the MapReduce implementation of Apriori that may significantly influence performance.
    Comment: 8 pages, 8 figures, International Conference on Computing, Communication and Automation (ICCCA2016)
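
    As a hedged illustration of the non-algorithmic tuning described above, the sketch below sets the standard Hadoop job properties for speculative execution and input split size through the MapReduce Job API. The property keys are standard Hadoop names, but the class name and the concrete values are placeholders rather than the settings reported in the paper.

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.mapreduce.Job;

        // Minimal sketch (illustrative values, not the paper's settings).
        public class AprioriJobTuning {
            public static Job configure() throws Exception {
                Configuration conf = new Configuration();
                // Disable speculative execution when redundant attempts on slow
                // nodes waste cluster capacity instead of hiding stragglers.
                conf.setBoolean("mapreduce.map.speculative", false);
                conf.setBoolean("mapreduce.reduce.speculative", false);
                // Smaller splits mean more map tasks; tune parallelism per cluster.
                conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L << 20);
                conf.setLong("mapreduce.input.fileinputformat.split.minsize", 32L << 20);
                return Job.getInstance(conf, "mapreduce-apriori-pass-k");
            }
        }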

    HadoopSec: Sensitivity-aware Secure Data Placement Strategy for Big Data/Hadoop Platform using Prescriptive Analytics

    Get PDF
    Hadoop has become one of the key players in offering data analytics and data processing support for any organization that handles different shades of data management. Considering the current security offerings of Hadoop, companies are concerned about building a single large cluster and onboarding multiple projects onto the same common Hadoop cluster. Security vulnerabilities and privacy invasion due to malicious attackers or inner users are the main points of contention in any Hadoop implementation. In particular, various types of security vulnerability arise from the way data is placed in a Hadoop cluster: when sensitive information is accessed by an unauthorized user, or misused by an authorized one, privacy can be compromised. In this paper, we address the placement of data across distributed DataNodes in a secure way by considering the sensitivity and security of the underlying data. Our data placement strategy aims to adaptively distribute the data across the cluster using advanced machine learning techniques to realize a more secure data infrastructure. The data placement strategy discussed in this paper is highly extensible and scalable to suit different sorts of sensitivity/security requirements.
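
    The paper's machine-learning-based placement strategy is not spelled out in the abstract. The minimal sketch below only illustrates the general idea of sensitivity-aware node selection, with a hypothetical DataNode type and scoring scheme; it does not use HDFS's actual block placement API and is not the paper's algorithm.

        import java.util.Comparator;
        import java.util.List;
        import java.util.stream.Collectors;

        // Hypothetical illustration of sensitivity-aware placement.
        public class SensitivityAwarePlacement {
            // Assumed per-node metadata: a security score and free capacity.
            public record DataNode(String host, double securityScore, double freeCapacity) {}

            // Pick `replicas` nodes whose security score meets the block's
            // sensitivity level, preferring nodes with more free capacity.
            public static List<DataNode> choose(List<DataNode> nodes, double sensitivity, int replicas) {
                return nodes.stream()
                        .filter(n -> n.securityScore() >= sensitivity)
                        .sorted(Comparator.comparingDouble(DataNode::freeCapacity).reversed())
                        .limit(replicas)
                        .collect(Collectors.toList());
            }
        }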

    Improving the Performance of Heterogeneous Hadoop Clusters Using Map Reduce

    Get PDF
    The key issue arising from the tremendous growth of connectivity among devices and systems is that data is generated at an exponential rate, and finding a feasible way to process it is becoming harder by the day. Consequently, building a platform for this level of data processing requires both hardware and software improvements to keep pace with such large volumes of data. To improve the efficiency of Hadoop clusters in storing and analyzing big data, we propose an algorithmic approach that caters to the needs of heterogeneous data stored on Hadoop clusters and improves both performance and efficiency. The paper aims to establish the effectiveness of the new algorithm, provide comparisons and recommendations, and take a competitive approach to finding the best solution for improving the big data scenario. Hadoop's MapReduce model helps keep a close watch over unstructured or heterogeneous Hadoop clusters, with insights into the results produced by the algorithm. In this paper we propose a new algorithm to tackle these issues for commercial as well as non-commercial uses, which can help the development of the community. The proposed algorithm can help improve data indexing with MapReduce in heterogeneous Hadoop clusters. The dissertation work and experiments conducted under this work produced very promising outcomes, among them the selection of schedulers for job scheduling, the arrangement of data in a similarity matrix, clustering before scheduling queries, and iterative mapping and reducing with the inner conditions bound together to avoid query stalling and long execution times. The experiments also establish that if a procedure is defined to handle the different use-case scenarios, the cost of processing can generally be reduced, and distributed systems can be relied on for fast execution.
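
    For readers unfamiliar with how indexing maps onto MapReduce, the generic inverted-index mapper and reducer below sketch the kind of data indexing job the abstract refers to. This is standard Hadoop MapReduce code, not the algorithm proposed in the paper.

        import java.io.IOException;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;

        // Generic inverted-index job: map emits (term, location), reduce
        // concatenates the postings for each term.
        public class InvertedIndex {
            public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
                private final Text term = new Text();
                private final Text location = new Text();

                @Override
                protected void map(LongWritable offset, Text line, Context ctx)
                        throws IOException, InterruptedException {
                    // Use the input split description (or any record id) as the posting.
                    location.set(ctx.getInputSplit().toString());
                    for (String token : line.toString().split("\\s+")) {
                        if (token.isEmpty()) continue;
                        term.set(token.toLowerCase());
                        ctx.write(term, location);
                    }
                }
            }

            public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
                @Override
                protected void reduce(Text term, Iterable<Text> locations, Context ctx)
                        throws IOException, InterruptedException {
                    StringBuilder postings = new StringBuilder();
                    for (Text loc : locations) {
                        if (postings.length() > 0) postings.append(',');
                        postings.append(loc);
                    }
                    ctx.write(term, new Text(postings.toString()));
                }
            }
        }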

    Experimental Performance Evaluation of Cloud-Based Analytics-as-a-Service

    Full text link
    An increasing number of Analytics-as-a-Service solutions have recently seen the light in the landscape of cloud-based services. These services allow flexible composition of compute and storage components that create powerful data ingestion and processing pipelines. This work is a first attempt at an experimental evaluation of analytic application performance executed using a wide range of storage service configurations. We present an intuitive notion of data locality, which we use as a proxy to rank different service compositions in terms of expected performance. Through an empirical analysis, we dissect the performance achieved by analytic workloads and unveil problems, due to impedance mismatch, that arise in some configurations. Our work paves the way to a better understanding of modern cloud-based analytic services and their performance, both for their end users and their providers.
    Comment: Longer version of the paper in submission at IEEE CLOUD'1

    Earlier stage for straggler detection and handling using combined CPU test and LATE methodology

    Get PDF
    Using MapReduce in Hadoop helps lower execution time and power consumption for large-scale data. However, job processing can be delayed when tasks are assigned to slow or congested machines, producing so-called "straggler tasks"; these increase execution time and power consumption, and therefore raise costs and lead to poor performance of the computing system. This research proposes a hybrid MapReduce framework referred to as the combinatory late-machine (CLM) framework. Implementing this framework facilitates early and timely detection and identification of stragglers, enabling prompt, appropriate and effective actions.
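
    The CLM framework builds on the LATE methodology. The sketch below shows only the standard LATE estimate of a task's remaining time, used here to rank running tasks as candidate stragglers; it is an assumed simplification for illustration, not the paper's combined CPU-test logic.

        import java.util.Comparator;
        import java.util.List;

        // LATE-style ranking: estimated time left = (1 - progress) / progressRate,
        // where progressRate = progress / elapsed time.
        public class LateStragglerRanker {
            public record TaskStats(String taskId, double progress, long elapsedMillis) {}

            static double estimatedTimeLeft(TaskStats t) {
                double rate = t.progress() / Math.max(t.elapsedMillis(), 1);
                return (1.0 - t.progress()) / Math.max(rate, 1e-9);
            }

            // Tasks with the longest estimated remaining time come first.
            public static List<TaskStats> rankStragglers(List<TaskStats> running) {
                return running.stream()
                        .sorted(Comparator.comparingDouble(LateStragglerRanker::estimatedTimeLeft).reversed())
                        .toList();
            }
        }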

    Stocator: A High Performance Object Store Connector for Spark

    Full text link
    We present Stocator, a high performance object store connector for Apache Spark that takes advantage of object store semantics. Previous connectors have assumed file system semantics; in particular, they achieve fault tolerance and allow speculative execution by creating temporary files, to avoid interference between worker threads executing the same task, and then renaming these files. Rename is not a native object store operation; not only is it not atomic, but it is implemented using a costly copy operation and a delete. Our connector instead leverages the inherent atomicity of object creation, and by avoiding the rename paradigm it greatly decreases the number of operations on the object store and enables a much simpler approach to dealing with the eventually consistent semantics typical of object stores. We have implemented Stocator and shared it in open source. Performance testing shows that it is as much as 18 times faster for write-intensive workloads and performs as much as 30 times fewer operations on the object store than the legacy Hadoop connectors, reducing costs both for the client and for the object storage service provider.
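
    To make the rename-avoidance argument concrete, the sketch below contrasts the temporary-file/rename commit assumed by file-system connectors with a direct, uniquely named object write in the spirit of Stocator. The interfaces, paths and attempt ids are hypothetical and do not reflect Stocator's actual code.

        // Illustrative contrast between the two commit strategies.
        public class CommitStrategies {
            // Legacy pattern: write to a temporary name, then rename on commit.
            // On an object store, "rename" is a costly copy followed by a delete.
            static void renameBasedCommit(FileSystemLike fs, byte[] part) throws Exception {
                fs.write("/output/_temporary/attempt_0001/part-0000", part);
                fs.rename("/output/_temporary/attempt_0001/part-0000", "/output/part-0000");
            }

            // Stocator-style pattern: each task attempt writes its output directly
            // under a name embedding the attempt id, relying on atomic object
            // creation; failed or speculative attempts are simply ignored later.
            static void directObjectWrite(ObjectStoreLike store, byte[] part) throws Exception {
                store.put("output/part-0000-attempt_0001", part);
            }

            interface FileSystemLike {
                void write(String path, byte[] data) throws Exception;
                void rename(String from, String to) throws Exception;
            }

            interface ObjectStoreLike {
                void put(String key, byte[] data) throws Exception;
            }
        }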

    Cloud solutions for high performance computing: oxymoron or realm?

    Get PDF
    In recent years the HPC (High Performance Computing) community has shown strong interest in the cloud computing paradigm. There are many apparent benefits of doing HPC in a cloud, the most important being better utilization of computational resources, efficient charge-back of used resources and applications, on-demand and dynamic reallocation of computational resources between users and HPC applications, and automatic bursting of additional resources when needed. Despite these obvious advantages, the share of HPC applications running in clouds compared with those running in traditional HPC environments is still negligible. Some of the reasons for this are evident, some less so. For example, traditional HPC vendors are still trying to exploit their existing investments as much as possible, thus favoring the traditional way of doing HPC. Furthermore, although virtualization technology is developing at an ever increasing rate, there are still open questions, such as the scaling of HPC applications in virtual environments, the interplay between physical and virtual resources, and the support of standard HPC tools for provisioning virtualized environments. Last but not least, the advent of heterogeneous HPC computing, which combines standard CPUs with special-purpose GPUs and accelerators, raises challenging questions of its own, such as scalability, meaningful Linpack benchmarking, and the development of HPC applications in a heterogeneous CPU/GPU environment. In addition, although many HPC vendors claim to already have fully functional HPC cloud solutions, further questions need answering: Are the promises of virtualization fully met for all system components, virtual as well as physical? Which types of clouds are supported: private, public, or their combination, i.e. hybrid clouds? How well does a given HPC application scale on a particular cloud solution? In an attempt to answer these questions, this paper gives an overview of current HPC cloud solutions; it is certainly not complete and reflects solely the authors' view, but it is intended as a helpful compass for anyone ready to move from standard HPC solutions to large computations in cloud environments. Finally, to make the choice of an HPC cloud solution as easy as possible, the solutions are classified into three categories, each reflecting the areas and modes of application of the individual solutions, the degree of virtualization support, and, since performance results are mostly not publicly available, the HPC expertise the vendors have demonstrated so far.