
    MOON: MapReduce On Opportunistic eNvironments

MapReduce offers a flexible programming model for processing and generating large data sets on dedicated resources, where only a small fraction of such resources is ever unavailable at any given time. In contrast, when MapReduce is run on volunteer computing systems, which opportunistically harness idle desktop computers via frameworks like Condor, performance suffers due to the volatility of the resources, in particular the high rate of node unavailability. Specifically, the data and task replication scheme adopted by existing MapReduce implementations is woefully inadequate for resources with high unavailability. To address this, we propose MOON, short for MapReduce On Opportunistic eNvironments. MOON extends Hadoop, an open-source implementation of MapReduce, with adaptive task and data scheduling algorithms in order to offer reliable MapReduce services on a hybrid resource architecture, where volunteer computing systems are supplemented by a small set of dedicated nodes. The adaptive task and data scheduling algorithms in MOON distinguish between (1) different types of MapReduce data and (2) different types of node outages in order to strategically place tasks and data on both volatile and dedicated nodes. Our tests demonstrate that MOON can deliver a 3-fold performance improvement over Hadoop in volatile, volunteer computing environments.
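
To make the flavor of such placement decisions concrete, the hedged Python sketch below shows one way data could be replicated differently on dedicated versus volatile nodes. This is not MOON's actual algorithm; all names, numbers, and the availability model are hypothetical.

```python
# Hypothetical sketch, not MOON's actual algorithm: keep one copy of
# critical data on a dedicated node and size the number of volatile
# replicas from the observed node unavailability.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    dedicated: bool        # member of the small set of reliable nodes
    unavailability: float  # probability the node is offline at any moment

def volatile_replicas_needed(nodes, target_availability=0.99):
    """Smallest r such that the chance of all r volatile replicas being
    offline at once stays below 1 - target_availability."""
    p_down = max(n.unavailability for n in nodes if not n.dedicated)
    r = 1
    while p_down ** r > 1 - target_availability:
        r += 1
    return r

def place_block(nodes, critical):
    """Critical data (e.g., results not yet persisted) gets one copy on a
    dedicated node; the rest relies on extra volatile replicas."""
    dedicated = [n for n in nodes if n.dedicated]
    volatile = [n for n in nodes if not n.dedicated]
    r = volatile_replicas_needed(nodes)
    if critical and dedicated:
        return [dedicated[0]] + volatile[: r - 1]
    return volatile[:r]

nodes = [Node("d0", True, 0.01)] + [Node(f"v{i}", False, 0.4) for i in range(8)]
print([n.name for n in place_block(nodes, critical=True)])
```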

    Performance Characterization of In-Memory Data Analytics on a Modern Cloud Server

In the last decade, data analytics has rapidly progressed from traditional disk-based processing to modern in-memory processing. However, little effort has been devoted to enhancing performance at the micro-architecture level. This paper characterizes the performance of in-memory data analytics using the Apache Spark framework. We use a single-node NUMA machine and identify the bottlenecks hampering the scalability of workloads. We also quantify the inefficiencies at the micro-architecture level for various data analysis workloads. Through empirical evaluation, we show that Spark workloads do not scale linearly beyond twelve threads, due to work time inflation and thread-level load imbalance. Further, at the micro-architecture level, we observe memory-bound latency to be the major cause of work time inflation. Comment: Accepted to the 5th IEEE International Conference on Big Data and Cloud Computing (BDCloud 2015).
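
As an illustration of the scaling methodology (not the paper's actual benchmark suite), the following hedged PySpark sketch times one shuffle-heavy aggregation while varying the number of local threads; the workload, row count, and thread counts are made up.

```python
# Hedged sketch: measure how a simple Spark aggregation scales with the
# thread count of a local[N] master on a single machine. The workload is
# illustrative, not one of the paper's benchmarks.
import time
from pyspark.sql import SparkSession

def run_once(threads, n_rows=5_000_000):
    spark = (SparkSession.builder
             .master(f"local[{threads}]")
             .appName(f"scaling-{threads}")
             .getOrCreate())
    start = time.perf_counter()
    # A shuffle-heavy group-by; sublinear speedup past some thread count
    # would hint at work time inflation or load imbalance.
    (spark.range(n_rows)
          .selectExpr("id % 1024 as k", "id as v")
          .groupBy("k").sum("v")
          .count())
    elapsed = time.perf_counter() - start
    spark.stop()
    return elapsed

for t in (1, 2, 4, 8, 12, 16, 24):
    print(f"{t:>2} threads: {run_once(t):6.2f} s")
```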

    MaxHadoop: An Efficient Scalable Emulation Tool to Test SDN Protocols in Emulated Hadoop Environments

This paper presents MaxHadoop, a flexible and scalable emulation tool, which allows the efficient and accurate emulation of Hadoop environments over Software Defined Networks (SDNs). Hadoop has been designed to manage endless data streams over networks, making it a natural candidate to support the new class of network services belonging to Big Data. The development of Hadoop is contemporary with the evolution of networks towards the new Software-Defined architectures. To create our emulation environment, tailored to SDNs, we employ MaxiNet, given its capability of emulating large-scale SDNs. We make it possible to emulate realistic Hadoop scenarios on large-scale SDNs using low-cost commodity hardware, by resolving a few key limitations of MaxiNet through appropriate configuration settings. We validate the MaxHadoop emulator by executing two benchmarks, namely WordCount and TeraSort, to evaluate a set of Key Performance Indicators. The test outcomes show that MaxHadoop outperforms other existing emulation tools running over commodity hardware. Finally, we show the potential of MaxHadoop by using it to compare SDN-based network protocols.
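
As a hedged illustration of the kind of topology such an emulator consumes, the sketch below defines two emulated racks of Hadoop workers behind a core switch. This is plain Mininet code (MaxiNet distributes Mininet-style topologies across physical hosts), not MaxHadoop or MaxiNet itself, and all parameter values are invented.

```python
# Illustrative only: a Mininet-style tree topology of the sort a
# MaxiNet-based emulator could spread over commodity hardware.
# Bandwidths and node counts are hypothetical, not from the paper.
from mininet.topo import Topo

class HadoopRackTopo(Topo):
    """Two emulated racks of Hadoop workers behind a core switch."""
    def build(self, racks=2, hosts_per_rack=4):
        core = self.addSwitch("s0")
        for r in range(racks):
            tor = self.addSwitch(f"s{r + 1}")      # top-of-rack switch
            self.addLink(core, tor, bw=100)        # core uplink (Mb/s, needs --link tc)
            for h in range(hosts_per_rack):
                host = self.addHost(f"h{r}_{h}")
                self.addLink(tor, host, bw=10)     # worker link

# Usable with: mn --custom thisfile.py --topo hadooprack --link tc
topos = {"hadooprack": HadoopRackTopo}
```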

    A study on performance measures for auto-scaling CPU-intensive containerized applications

Autoscaling of containers can leverage performance measures from the different layers of the computational stack. This paper investigates the problem of selecting the most appropriate performance measure to activate auto-scaling actions aimed at guaranteeing QoS constraints. First, the correlation between absolute and relative usage measures, and how a resource allocation decision can be influenced by them, is analyzed in different workload scenarios. Absolute and relative measures can assume quite different values: the former account for the actual utilization of resources in the host system, while the latter account for the share that each container has of the resources used. Then, the performance of a variant of Kubernetes' auto-scaling algorithm, which transparently uses the absolute usage measures to scale in/out containers, is evaluated through a wide set of experiments. Finally, a detailed analysis of the state of the art is presented.
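
The trade-off can be illustrated with the standard Kubernetes HPA scaling rule, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). In the sketch below the workload numbers are invented; the same rule is fed once with a relative measure (usage over the container's CPU request) and once with an absolute one (usage over host CPU), yielding very different decisions.

```python
# Sketch contrasting relative vs. absolute usage measures under the
# documented Kubernetes HPA proportional scaling rule. All workload
# numbers are made up for illustration.
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    # Standard HPA rule: scale proportionally to the metric ratio.
    return math.ceil(current_replicas * current_metric / target_metric)

replicas = 3
cpu_used_cores = 2.4           # total cores consumed by all replicas
cpu_request_per_replica = 0.5  # each container requests 500m
host_cores = 16

relative = cpu_used_cores / (replicas * cpu_request_per_replica)  # ~1.6
absolute = cpu_used_cores / host_cores                            # ~0.15

print("relative measure ->", desired_replicas(replicas, relative, target_metric=0.8))  # 6
print("absolute measure ->", desired_replicas(replicas, absolute, target_metric=0.8))  # 1
```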

    INPUT SPLITS DESIGN TECHNIQUES FOR NETWORK INTRUSION DETECTION ON HADOOP CLUSTER

An intrusion detection system (IDS) is one of the most important components used to monitor networks for possible cyber-attacks. However, the amount of data that should be inspected imposes a great challenge to IDSs. With the recent emergence of various big data technologies, there are ways of overcoming the problem of the increased amount of data. Nevertheless, some of these technologies inherit data distribution techniques that can be a problem when splitting sensitive data, such as network data frames, across cluster nodes. The goal of this paper is the design and implementation of a Hadoop-based IDS. We propose different input split techniques suitable for distributing network data across cloud nodes and test the performance of their Apache Hadoop implementation. Four different data split techniques are proposed, described in detail, and analysed. The system is evaluated on an Apache Hadoop cluster with 17 slave nodes. We show that processing speed can differ by more than 30% depending on the chosen input split design strategy. Additionally, we show that a malicious level of network traffic can slow down processing time, in our case by nearly 20%. The scalability of the system is also discussed.
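
The following sketch is not one of the paper's four techniques, just an illustration of why split design matters for network data: a split boundary that falls mid-frame would hand a mapper an unparseable record. It assumes a classic little-endian pcap capture file; the function name and split size are illustrative.

```python
# Hypothetical illustration: compute input split offsets for a pcap file
# so every split starts exactly on a packet-record boundary. Assumes a
# classic little-endian pcap (24-byte global header, 16-byte per-record
# header with the captured length at byte offset 8).
import struct

PCAP_GLOBAL_HDR = 24
PCAP_REC_HDR = 16

def frame_aligned_splits(path, target_split_bytes):
    """Return byte offsets where input splits may safely begin."""
    splits, offset = [PCAP_GLOBAL_HDR], PCAP_GLOBAL_HDR
    next_cut = PCAP_GLOBAL_HDR + target_split_bytes
    with open(path, "rb") as f:
        f.seek(PCAP_GLOBAL_HDR)
        while True:
            hdr = f.read(PCAP_REC_HDR)
            if len(hdr) < PCAP_REC_HDR:
                break
            incl_len = struct.unpack("<I", hdr[8:12])[0]  # captured frame size
            offset += PCAP_REC_HDR + incl_len
            if offset >= next_cut:        # close this split at a frame boundary
                splits.append(offset)
                next_cut = offset + target_split_bytes
            f.seek(incl_len, 1)           # skip the packet payload
    return splits

# e.g. frame_aligned_splits("capture.pcap", 64 * 2**20) for ~64 MB splits
```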

    Implementation of Parallel K-Means Algorithm to Estimate Adhesion Failure in Warm Mix Asphalt

Warm Mix Asphalt (WMA) is prepared at lower temperatures than Hot Mix Asphalt (HMA), making it more susceptible to moisture damage, which eventually leads to stripping due to adhesion failure. Moreover, the assessment of adhesion failure depends on the investigator's subjective visual assessment skills. Nowadays, image processing has gained popularity as a way to address the inaccuracy of visual assessment. Minimizing the loss of pixels plays an essential role in attaining high accuracy from image processing algorithms, but high-quality image samples take more execution time to process due to their greater resolution. Therefore, the execution time of an image processing algorithm is also an essential aspect of quality. This manuscript proposes a parallel k-means for image processing (PKIP) algorithm using multiprocessing and distributed computing to assess the adhesion failure in WMA and HMA samples subjected to three different moisture sensitivity tests (dry, one, and three freeze-thaw cycles) and fractured by the indirect tensile test. For the proposed experiment, the number of clusters was chosen as ten (k = 10), and the cost of the k-means function was computed to analyse the adhesion failure. The results showed that the PKIP algorithm decreases execution time by 30% to 46% compared with the sequential k-means algorithm when implemented using multiprocessing and distributed computing. In terms of adhesion failure, the WMA specimens subjected to a higher degree of moisture effect showed relatively lower adhesion failure compared to the HMA samples when subjected to different levels of moisture sensitivity.
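
A minimal sketch of one parallelization strategy consistent with the abstract follows: each worker labels a chunk of pixels with its nearest centroid and returns partial sums, from which the parent updates the centroids. Only k = 10 comes from the paper; the helper names, data shapes, and iteration counts are illustrative, not the PKIP implementation.

```python
# Hedged sketch of a parallel k-means step over image pixels using Python
# multiprocessing; not the actual PKIP code.
import numpy as np
from multiprocessing import Pool

def assign_chunk(args):
    """Label each pixel in a chunk and return per-cluster sums/counts."""
    pixels, centroids = args
    d = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    k = centroids.shape[0]
    sums, counts = np.zeros_like(centroids), np.zeros(k)
    for j in range(k):
        mask = labels == j
        sums[j] = pixels[mask].sum(axis=0)
        counts[j] = mask.sum()
    return sums, counts

def parallel_kmeans(pixels, k=10, iters=20, workers=4):
    rng = np.random.default_rng(0)
    centroids = pixels[rng.choice(len(pixels), k, replace=False)]
    chunks = np.array_split(pixels, workers)
    with Pool(workers) as pool:
        for _ in range(iters):
            parts = pool.map(assign_chunk, [(c, centroids) for c in chunks])
            sums = sum(p[0] for p in parts)
            counts = sum(p[1] for p in parts)
            centroids = sums / np.maximum(counts, 1)[:, None]  # avoid /0
    return centroids

# e.g. pixels = image.reshape(-1, 3).astype(float) from an RGB fracture photo
```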