
    Benchmarking and performance modelling of MapReduce communication pattern

    Understanding and predicting the performance of big data applications running in the cloud or on-premises can help minimise the overall cost of operations and help identify performance bottlenecks. The complexity of the low-level internals of big data frameworks and the sheer number of application and workload configuration parameters make comprehensive performance modelling challenging and expensive. In this paper, instead of focusing on a wide range of configurable parameters, we study the low-level internals of the MapReduce communication pattern and use a minimal set of performance drivers to develop a set of phase-level parametric models for approximating the execution time of a given application on a given cluster. The models can be used to infer the performance of unseen applications and to approximate their performance when an arbitrary dataset is used as input. Our approach is validated by empirical experiments in two setups; on average, the prediction error in both setups is within ±10% of the measured values.
    Funding: UK EPSRC EP/R010528/1 and IsDB.
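
    As a concrete illustration of the phase-level parametric idea described above, the following is a minimal sketch, not the authors' actual models: each phase's cost is assumed to be an affine function of the data volume, fitted by least squares from a handful of profiling runs and summed to predict total runtime. The phase split and all timings are illustrative assumptions.

```python
# A minimal sketch of a phase-level parametric model: total job time is
# approximated as the sum of per-phase costs, each modelled as an affine
# function t(s) = a + b*s of the data volume s flowing through that phase.
# The phase split and all timings are illustrative assumptions.
import numpy as np

PHASES = ["read", "map", "shuffle", "reduce", "write"]  # assumed phase split

def fit_phase_model(sizes, times):
    """Fit t(s) = a + b*s by least squares from profiling runs."""
    A = np.vstack([np.ones_like(sizes), sizes]).T
    coeffs, *_ = np.linalg.lstsq(A, times, rcond=None)
    return coeffs  # (a, b)

def predict_runtime(models, size):
    """Sum the fitted per-phase costs for an unseen input size."""
    return sum(a + b * size for a, b in models.values())

# Three synthetic profiling runs at 1, 2 and 4 GB of input.
sizes = np.array([1.0, 2.0, 4.0])
measured = {
    "read":    np.array([3.1, 5.9, 12.2]),
    "map":     np.array([8.0, 15.8, 31.5]),
    "shuffle": np.array([4.2, 8.5, 17.1]),
    "reduce":  np.array([6.1, 11.9, 24.3]),
    "write":   np.array([2.0, 3.9, 8.1]),
}
models = {p: fit_phase_model(sizes, measured[p]) for p in PHASES}
print(f"predicted runtime at 8 GB: {predict_runtime(models, 8.0):.1f} s")
```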

    Performance modelling, analysis and prediction of Spark jobs in Hadoop cluster : a thesis by publications presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science, School of Mathematical & Computational Sciences, Massey University, Auckland, New Zealand

    Big Data frameworks have received tremendous attention from industry and academic research over the past decade. Distributed computing frameworks such as Hadoop MapReduce and Spark offer an efficient solution for analysing large-scale datasets running on a Hadoop cluster. Spark has become one of the most popular large-scale data processing engines because of its speed, low-latency in-memory computation, and advanced analytics. Spark's computational performance depends heavily on the selection of suitable parameters, and configuring these parameters is a challenging task. Although Spark provides default parameters and can deploy applications without much effort, a significant drawback of the defaults is that they are not always best for cluster performance. A major limitation of existing models for Spark performance prediction is that they require either large input data or time-consuming system configuration. An analytical model could therefore be a better solution for predicting performance and establishing appropriate job configurations. This thesis proposes two distinct parallelisation models for performance prediction: the 2D-Plate model and the Fully-Connected Node model. Both models were constructed from serial boundaries for a given arrangement of executors and size of the data. To evaluate cluster performance, various HiBench workloads were used, and the workloads' empirical data were fitted to the models for performance prediction analysis. The developed models were benchmarked against existing models such as Amdahl's law, Gustafson's law, ERNEST, and machine-learning approaches. Our experimental results show that the two proposed models can quickly and accurately predict performance in terms of runtime, and that they outperform the accuracy of machine learning models when extrapolating predictions.
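
    Since the thesis benchmarks its models against Amdahl's law among others, a hedged sketch of that baseline may clarify the approach: fit T(n) = T1(s + (1 - s)/n) to a few measured runtimes and extrapolate to a larger executor count. The runtime numbers below are synthetic, and the proposed 2D-Plate and Fully-Connected Node models are not reproduced here.

```python
# A hedged sketch of one baseline the thesis benchmarks against: fitting
# Amdahl's law T(n) = T1 * (s + (1 - s) / n) to measured runtimes and
# extrapolating to a larger executor count. Runtimes are synthetic; the
# 2D-Plate and Fully-Connected Node models are not reproduced here.
import numpy as np
from scipy.optimize import curve_fit

def amdahl(n, t1, s):
    """Runtime on n executors given serial fraction s and 1-executor time t1."""
    return t1 * (s + (1.0 - s) / n)

executors = np.array([1, 2, 4, 8])
runtimes = np.array([400.0, 230.0, 150.0, 110.0])  # synthetic measurements

(t1, s), _ = curve_fit(amdahl, executors, runtimes, p0=[400.0, 0.1])
print(f"fitted serial fraction: {s:.2f}")
print(f"extrapolated runtime on 16 executors: {amdahl(16, t1, s):.0f} s")
```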

    Benchmarking business analytics techniques in Big Data

    Technological developments and the growing dependence of organizations and society on the internet have led to growth in the volume and variety of data. This growth and variety have become a challenge for traditional Business Analytics techniques. In this project, we conducted a benchmarking process aimed at assessing the performance of Data Mining tools, such as RapidMiner, in a Big Data environment. First, we analysed a study in which a group of Data Mining tools was evaluated to determine the best tool according to the evaluation criteria. Next, the two best tools from that study were analysed regarding their ability to analyse data in a Big Data environment. Finally, the RapidMiner and KNIME tools were evaluated for their performance in the Big Data environment.
    Funding: This work has been supported by national funds through FCT - Fundação para a Ciência e Tecnologia within the Project Scope UID/CEC/00319/2019 and Deus ex Machina (DEM): Symbiotic technology for societal efficiency gains - NORTE-01-0145-FEDER-000026.

    Implementing Parallel Differential Evolution on Spark

    Metaheuristics are gaining increased attention as an efficient way of solving hard global optimization problems. Differential Evolution (DE) is one of the most popular algorithms in that class. However, its application to realistic problems results in excessive computation times. Therefore, several parallel DE schemes have been proposed, most of them focused on traditional parallel programming interfaces and infrastructures. With the emergence of Cloud Computing, however, new programming models such as Spark have appeared that suit large-scale data processing on clouds. In this paper we investigate the applicability of Spark to developing parallel DE schemes for execution in a distributed environment. Both the master-slave and the island-based DE schemes usually found in the literature have been implemented using Spark. The speedup and efficiency of all the implementations were evaluated on the Amazon Web Services (AWS) public cloud, concluding that the island-based solution is the best suited to the distributed nature of Spark. It achieves a good speedup versus the serial implementation and shows decent scalability as the number of nodes grows.
    Funding: Ministerio de Economía y Competitividad, DPI2014-55276-C5-2-R; Xunta de Galicia, GRC2013/055; Xunta de Galicia, R2014/04.
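
    To make the island-based scheme concrete, here is a minimal, hypothetical sketch, not the paper's code: each RDD partition acts as an island that evolves independently for a few DE/rand/1/bin generations via mapPartitions, and after every epoch the global best individual migrates into each island. The sphere objective, population sizes, and DE rates are illustrative assumptions.

```python
# A minimal sketch of island-based Differential Evolution on Spark: each RDD
# partition is an island evolved independently with mapPartitions, and the
# global best migrates into every island after each epoch. All parameters
# and the sphere objective are illustrative assumptions.
import random
from pyspark import SparkContext

DIM, NP_PER_ISLAND, ISLANDS = 5, 20, 4
F, CR, LOCAL_GENS, EPOCHS = 0.8, 0.9, 10, 5

def sphere(x):
    """Toy objective to minimise."""
    return sum(v * v for v in x)

def evolve_island(individuals):
    """Run LOCAL_GENS generations of DE/rand/1/bin inside one partition."""
    pop = list(individuals)
    for _ in range(LOCAL_GENS):
        for i, target in enumerate(pop):
            a, b, c = random.sample([p for j, p in enumerate(pop) if j != i], 3)
            j_rand = random.randrange(DIM)
            trial = [a[d] + F * (b[d] - c[d])
                     if random.random() < CR or d == j_rand else target[d]
                     for d in range(DIM)]
            if sphere(trial) <= sphere(target):  # greedy selection
                pop[i] = trial
    return pop

sc = SparkContext(appName="island-de-sketch")
pop = [[random.uniform(-5, 5) for _ in range(DIM)]
       for _ in range(ISLANDS * NP_PER_ISLAND)]
for epoch in range(EPOCHS):
    # Each partition is one island; evolve islands independently in parallel.
    pop = sc.parallelize(pop, numSlices=ISLANDS) \
            .mapPartitions(evolve_island).collect()
    best = min(pop, key=sphere)
    print(f"epoch {epoch}: best fitness {sphere(best):.4f}")
    # Migration: seed one random slot in every island with the global best.
    for isl in range(ISLANDS):
        pop[isl * NP_PER_ISLAND + random.randrange(NP_PER_ISLAND)] = list(best)
sc.stop()
```

    Collecting the population to the driver between epochs keeps the sketch simple and is cheap for typical DE population sizes; a production version would instead keep the population distributed and migrate via broadcasts.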

    RootPath: Root Cause and Critical Path Analysis to Ensure Sustainable and Resilient Consumer-Centric Big Data Processing under Fault Scenarios

    The exponential growth of consumer-centric big data has led to increased concerns regarding the sustainability and resilience of data processing systems, particularly in the face of fault scenarios. This paper presents an innovative approach integrating Root Cause Analysis (RCA) and Critical Path Analysis (CPA) to address these challenges and ensure sustainable, resilient consumer-centric big data processing. The proposed methodology identifies the root causes behind system faults probabilistically using Bayesian networks. Furthermore, an Artificial Neural Network (ANN)-based critical path method is employed to identify the critical path that causes high makespan in MapReduce workflows, in order to enhance fault tolerance and optimize resource allocation. To evaluate the effectiveness of the proposed methodology, we conduct a series of fault injection experiments, simulating various real-world fault scenarios commonly encountered in operational environments. The experimental results show that both models perform very well, with accuracies of 95% and 98%, respectively, enabling the development of more robust and reliable consumer-centric systems.
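
    To illustrate the probabilistic root-cause idea, the following library-free sketch ranks candidate causes of an observed symptom (a slow MapReduce stage) by posterior probability via Bayes' rule, treating each cause in isolation. The cause names and all probabilities are invented for illustration; the paper's Bayesian-network and ANN models are not reproduced.

```python
# A hedged sketch of Bayesian root-cause ranking: given prior fault
# probabilities and the probability of observing the symptom with and
# without each cause, rank candidate root causes by posterior probability.
# All names and numbers below are illustrative assumptions.

PRIOR = {"disk_contention": 0.05, "network_congestion": 0.03, "data_skew": 0.10}
# P(stage_slow | cause present), P(stage_slow | cause absent)
P_SLOW_GIVEN = {"disk_contention": (0.70, 0.08),
                "network_congestion": (0.60, 0.08),
                "data_skew": (0.80, 0.08)}

def posterior(cause):
    """P(cause | stage_slow) by Bayes' rule, each cause taken in isolation."""
    p_c = PRIOR[cause]
    p_s_c, p_s_not_c = P_SLOW_GIVEN[cause]
    p_slow = p_s_c * p_c + p_s_not_c * (1.0 - p_c)  # marginal P(stage_slow)
    return p_s_c * p_c / p_slow

for cause in sorted(PRIOR, key=posterior, reverse=True):
    print(f"{cause}: P(cause | stage slow) = {posterior(cause):.2f}")
```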

    Big Data Analysis

    The value of big data is predicated on the ability to detect trends and patterns and, more generally, to make sense of large volumes of data that often comprise a heterogeneous mix of formats, structures, and semantics. Big data analysis is the component of the big data value chain that focuses on transforming raw acquired data into a coherent, usable resource suitable for analysis. Drawing on a range of interviews with key stakeholders in small and large companies and in academia, this chapter outlines key insights, the state of the art, emerging trends, future requirements, and sectorial case studies for data analysis.

    Why High-Performance Modelling and Simulation for Big Data Applications Matters

    Modelling and Simulation (M&S) offer adequate abstractions to manage the complexity of analysing big data in scientific and engineering domains. Unfortunately, big data problems are often not easily amenable to efficient and effective use of High Performance Computing (HPC) facilities and technologies. Furthermore, M&S communities typically lack the detailed expertise required to exploit the full potential of HPC solutions, while HPC specialists may not be fully aware of specific modelling and simulation requirements and applications. The COST Action IC1406 High-Performance Modelling and Simulation for Big Data Applications has created a strategic framework to foster interaction between M&S experts from various application domains on the one hand and HPC experts on the other, in order to develop effective solutions for big data applications. One of the tangible outcomes of the COST Action is a collection of case studies from various computing domains. Each case study brought together HPC and M&S experts, bearing witness to the effective cross-pollination facilitated by the COST Action. In this introductory article we argue why joining forces between the M&S and HPC communities is both timely in the big data era and crucial for success in many application domains. Moreover, we provide an overview of the state of the art in the various research areas concerned.