38 research outputs found

    Fluid Petri Nets for the Performance Evaluation of MapReduce Applications

    Big Data applications make it possible to analyze large amounts of not necessarily structured data, but at the same time they present new challenges. For example, predicting the performance of frameworks such as Hadoop can be a costly task, hence the need for models that can provide valuable support to designers and developers. This paper contributes a novel modeling approach based on fluid Petri nets to predict the execution time of MapReduce jobs. The experiments we performed at CINECA, the Italian supercomputing center, show that the achieved accuracy is within 16% of the actual measurements on average.
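    The fluid idea can be illustrated with a much simpler deterministic sketch than the paper's fluid Petri nets: treat each phase's task backlog as a continuous quantity drained at a rate proportional to the available slots. All parameter values below are invented for illustration.

```python
def fluid_job_time(map_tasks, reduce_tasks, map_slots, reduce_slots,
                   map_rate, reduce_rate, dt=0.01):
    """Fluid view of a MapReduce job: each phase drains its task
    backlog continuously at slots * rate tasks per second, and the
    reduce phase starts only when the map phase completes."""
    t = 0.0
    remaining = float(map_tasks)
    while remaining > 0:
        remaining -= map_slots * map_rate * dt
        t += dt
    remaining = float(reduce_tasks)
    while remaining > 0:
        remaining -= reduce_slots * reduce_rate * dt
        t += dt
    return t

# 100 map tasks on 10 slots at 0.5 tasks/s, then 20 reduce tasks on
# 5 slots at 0.25 tasks/s: roughly 100/5 + 20/1.25 = 36 seconds
estimate = fluid_job_time(100, 20, 10, 5, 0.5, 0.25)
```

    A real fluid Petri net additionally captures contention, synchronization and stochastic rates; this sketch only conveys the continuous-approximation intuition.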

    Towards the Performance Analysis of Apache Tez Applications

    Apache Tez is an application framework for processing large amounts of data using interactive queries. When Tez developers face the fulfillment of performance requirements, they need to configure and optimize the Tez application for specific execution contexts. These are not easy tasks, yet the Apache Tez configuration significantly impacts the performance of the application. Therefore, we propose some steps towards the modeling and simulation of Apache Tez applications that can help in the performance assessment of Tez designs. For the modeling, we propose a UML profile for Apache Tez. For the simulation, we propose to transform the stereotypes of the profile into stochastic Petri nets, which can eventually be used for computing performance metrics.

    Performance Prediction of Cloud-Based Big Data Applications

    Big data analytics have become widespread as a means to extract knowledge from large datasets. Yet, the heterogeneity and irregularity usually associated with big data applications often overwhelm existing software and hardware infrastructures. In this context, the flexibility and elasticity provided by the cloud computing paradigm offer a natural approach to cost-effectively adapting the allocated resources to the application's current needs. However, these same characteristics impose extra challenges on predicting the performance of cloud-based big data applications, a key step towards proper management and planning. This paper explores three modeling approaches for performance prediction of cloud-based big data applications. We evaluate two queuing-based analytical models and a novel fast ad hoc simulator in various scenarios based on different applications and infrastructure setups. The three approaches are compared in terms of prediction accuracy, finding that our best approaches can predict average application execution times with 26% relative error in the very worst case and about 7% on average.
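    As a flavor of what a queueing-based analytical model computes, the standard M/M/c (Erlang C) formulas give the mean response time of jobs arriving at a pool of identical servers. This is textbook queueing theory, not the specific models evaluated in the paper.

```python
from math import factorial

def erlang_c(c, a):
    """Probability that an arriving job must wait in an M/M/c queue
    with offered load a = lam/mu (requires a < c for stability)."""
    rho = a / c
    top = (a ** c) / factorial(c) / (1 - rho)
    bottom = sum((a ** k) / factorial(k) for k in range(c)) + top
    return top / bottom

def mmc_response_time(lam, mu, c):
    """Mean response time (waiting + service) in an M/M/c queue with
    arrival rate lam, per-server service rate mu, and c servers."""
    a = lam / mu
    wq = erlang_c(c, a) / (c * mu - lam)
    return wq + 1.0 / mu
```

    With c = 1 this reduces to the familiar M/M/1 result 1/(mu - lam); real big data applications need far richer models (multi-class, fork-join phases), which is precisely what motivates the paper's comparison.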

    Analytical composite performance models for Big Data applications

    In the era of Big Data, with the digital industry facing massive growth in data size and the development of data-intensive software, more and more companies are moving to new frameworks and paradigms capable of handling data at scale. The MapReduce (MR) paradigm and its implementation framework, Hadoop, are among the most widely referenced, and the basis for later and more advanced frameworks like Tez and Spark. Accurate prediction of the execution time of a Big Data application helps improve design-time decisions, reduces over-allocation charges, and assists budget management. In this regard, we propose analytical models based on Stochastic Activity Networks (SANs) to accurately model the execution of MR, Tez and Spark applications in Hadoop environments governed by the YARN Capacity Scheduler. We evaluate the accuracy of the proposed models on the TPC-DS industry benchmark across different configurations. Results obtained by numerically solving the analytical SAN models show an average error of 6% in estimating the execution time of an application compared to the data gathered from experiments; moreover, the model evaluation time is lower than the simulation time of state-of-the-art solutions.

    Quantitative Analysis of Apache Storm Applications: The NewsAsset Case Study

    The development of Information Systems today faces the era of Big Data. Large volumes of information need to be processed in real-time, for example, for Facebook or Twitter analysis. This paper addresses the redesign of NewsAsset, a commercial product that helps journalists by providing services that analyze millions of media items from social networks in real-time. Technologies like Apache Storm can help enormously in this context. We have quantitatively analyzed the new design of NewsAsset to assess whether the introduction of Apache Storm can meet the demanding performance requirements of this media product. Our assessment approach, guided by the Unified Modeling Language (UML), takes advantage, for performance analysis, of the software designs already used for development. In addition, we extended UML with a domain-specific modeling language (DSML) for Apache Storm, thus creating a profile for Storm. Later, we transformed this DSML into an appropriate language for performance evaluation, specifically, stochastic Petri nets. The assessment ended with a successful software design that met the scalability requirements of NewsAsset.
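    Before building a full Petri net, a back-of-the-envelope utilization check per Storm bolt already reveals scalability bottlenecks: a bolt whose tuple arrival rate exceeds its aggregate service capacity cannot keep up. The topology, bolt names and rates below are hypothetical, not taken from the NewsAsset case study.

```python
# Hypothetical Storm topology: for each bolt, the input tuple rate,
# the per-executor service rate, and the number of executors.
topology = {
    "analyze": (500.0, 120.0, 5),  # 500 tuples/s into 5 x 120 tuples/s
    "index":   (500.0, 200.0, 2),  # 500 tuples/s into 2 x 200 tuples/s
}

def utilizations(topology):
    """Utilization of each bolt: arrival rate over total capacity."""
    return {b: rate / (mu * ex) for b, (rate, mu, ex) in topology.items()}

util = utilizations(topology)
# A bolt with utilization >= 1 is saturated and will build up queues.
bottlenecks = [b for b, u in util.items() if u >= 1.0]
```

    Here the hypothetical "index" bolt is saturated (500/400 = 1.25), suggesting more executors are needed; the stochastic Petri net in the paper captures the same phenomenon with queueing and contention made explicit.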

    RootPath: Root Cause and Critical Path Analysis to Ensure Sustainable and Resilient Consumer-Centric Big Data Processing under Fault Scenarios

    The exponential growth of consumer-centric big data has led to increased concerns regarding the sustainability and resilience of data processing systems, particularly in the face of fault scenarios. This paper presents an innovative approach integrating Root Cause Analysis (RCA) and Critical Path Analysis (CPA) to address these challenges and ensure sustainable, resilient consumer-centric big data processing. The proposed methodology enables the probabilistic identification of the root causes behind system faults by implementing Bayesian networks. Furthermore, an Artificial Neural Network (ANN)-based critical path method is employed to identify the critical path that causes high makespan in MapReduce workflows, in order to enhance fault tolerance and optimize resource allocation. To evaluate the effectiveness of the proposed methodology, we conduct a series of fault injection experiments, simulating various real-world fault scenarios commonly encountered in operational environments. The experimental results show that both models perform very well, with high accuracies of 95% and 98%, respectively, enabling the development of more robust and reliable consumer-centric systems.
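    The Bayesian flavor of the RCA step can be sketched in a few lines: given an observed symptom, rank candidate faults by posterior probability via Bayes' rule. This toy version assumes a single fault is active and uses made-up priors and likelihoods; a real Bayesian network models many interacting variables.

```python
# Made-up prior fault probabilities and symptom likelihoods.
priors = {"disk_failure": 0.05, "network_congestion": 0.15, "oom": 0.10}
# P(observed symptom "slow_job" | fault) for each candidate fault.
likelihood = {"disk_failure": 0.6, "network_congestion": 0.9, "oom": 0.7}

def posterior(priors, likelihood):
    """Posterior over faults given the symptom, by Bayes' rule:
    P(fault | symptom) proportional to P(fault) * P(symptom | fault)."""
    joint = {f: priors[f] * likelihood[f] for f in priors}
    z = sum(joint.values())
    return {f: p / z for f, p in joint.items()}

post = posterior(priors, likelihood)
root_cause = max(post, key=post.get)  # most probable root cause
```

    With these invented numbers, network congestion comes out as the most probable root cause because it dominates both the prior and the likelihood.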

    A Combined Analytical Modeling Machine Learning Approach for Performance Prediction of MapReduce Jobs in Hadoop Clusters

    Nowadays MapReduce and its open-source implementation, Apache Hadoop, are the most widespread solutions for handling massive datasets on clusters of commodity hardware. At the expense of somewhat reduced performance in comparison to HPC technologies, the MapReduce framework provides fault tolerance and automatic parallelization without any effort from developers. Since in many cases Hadoop is adopted to support business-critical activities, it is often important to predict with fair confidence the execution time of submitted jobs, for instance when SLAs are established with end-users. In this work, we propose and validate a hybrid approach exploiting both queuing networks and support vector regression, in order to achieve good accuracy without too many costly experiments on a real setup. The experimental results show that the proposed approach attains a 21% improvement in accuracy over applying machine learning techniques without any support from analytical models.
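    The hybrid idea can be sketched as follows: an analytical baseline predicts job time from tasks and slots, and a regression fitted on a few measured runs corrects its systematic error. A one-coefficient least-squares fit stands in for the paper's support vector regression, and all numbers are synthetic.

```python
def analytical_estimate(n_tasks, n_slots, avg_task_time):
    """Naive analytical baseline: waves of tasks over the slots."""
    return (n_tasks / n_slots) * avg_task_time

# Synthetic measured runs: (n_tasks, n_slots, measured_time_seconds).
runs = [(100, 10, 215.0), (200, 10, 425.0), (400, 20, 430.0)]
avg_task_time = 20.0  # assumed per-task service time, seconds

# Fit measured ~= k * analytical by least squares (single coefficient),
# a stand-in for the SVR correction learned in the paper.
num = sum(analytical_estimate(t, s, avg_task_time) * m for t, s, m in runs)
den = sum(analytical_estimate(t, s, avg_task_time) ** 2 for t, s, m in runs)
k = num / den

def hybrid_predict(n_tasks, n_slots):
    """Analytical baseline rescaled by the learned correction."""
    return k * analytical_estimate(n_tasks, n_slots, avg_task_time)
```

    The point of the hybrid design is data efficiency: the analytical model supplies the shape of the relationship, so only a few costly cluster experiments are needed to calibrate the correction.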

    Context-aware Data Quality Assessment for Big Data

    Big data changed the way in which we collect and analyze data. In particular, the amount of available information is constantly growing, and organizations rely more and more on data analysis in order to achieve their competitive advantage. However, such an amount of data can create real value only if combined with quality: good decisions and actions are the result of correct, reliable and complete data. In such a scenario, methods and techniques for data quality assessment can support the identification of suitable data to process. While numerous assessment methods have been proposed for traditional databases, in the big data scenario new algorithms have to be designed in order to deal with novel requirements related to variety, volume and velocity. In particular, in this paper we highlight that dealing with heterogeneous sources requires an adaptive approach able to trigger the suitable quality assessment methods on the basis of the data type and the context in which the data are to be used. Furthermore, we show that in some situations it is not possible to evaluate the quality of the entire dataset due to performance and time constraints. For this reason, we suggest focusing the data quality assessment on only a portion of the dataset and accounting for the consequent loss of accuracy by introducing a confidence factor as a measure of the reliability of the quality assessment procedure. We propose a methodology to build a data quality adapter module, which selects the best configuration for the data quality assessment based on the user's main requirements: time minimization, confidence maximization, and budget minimization. Experiments are performed on real data gathered from a smart city case study.
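    A minimal sketch of sampled quality assessment with a confidence factor, assuming completeness (share of non-null values) as the quality dimension and a normal-approximation interval as the reliability measure; the paper's adapter module is considerably richer than this.

```python
import math
import random

def sampled_completeness(dataset, sample_frac, seed=42):
    """Estimate completeness (fraction of non-null values) from a
    random sample, plus a confidence factor derived from the 95%
    normal-approximation interval half-width: smaller samples give
    wider intervals and hence lower confidence."""
    rng = random.Random(seed)
    n = max(1, int(len(dataset) * sample_frac))
    sample = rng.sample(dataset, n)
    p = sum(1 for v in sample if v is not None) / n
    half_width = 1.96 * math.sqrt(max(p * (1 - p), 1e-12) / n)
    confidence = max(0.0, 1.0 - half_width)  # 1.0 = fully reliable
    return p, confidence

data = [1] * 900 + [None] * 100   # 90% complete synthetic column
est, conf = sampled_completeness(data, 0.2)
```

    Sampling 20% of the rows trades a small loss of confidence for a fivefold reduction in assessment work, which is exactly the time/confidence/budget trade-off the adapter module is meant to navigate.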

    An optimization framework for the capacity allocation and admission control of MapReduce jobs in cloud systems

    Nowadays, we live in a Big Data world, and many sectors of our economy are guided by data-driven decision processes. Big Data and Business Intelligence applications are facilitated by the MapReduce programming model, while, at the infrastructural layer, cloud computing provides flexible and cost-effective solutions for allocating large clusters on demand. Capacity allocation in such systems, i.e., the problem of providing computational power to support concurrent MapReduce applications in a cost-effective fashion, represents a challenge of paramount importance. In this paper we lay the foundation for a solution implementing admission control and capacity allocation for MapReduce jobs with a priori deadline guarantees. In particular, shared Hadoop 2.x clusters supporting batch and/or interactive jobs are targeted. We formulate a linear programming model able to minimize cloud resource costs and rejection penalties for the execution of jobs belonging to multiple classes with deadline guarantees. Scalability analyses demonstrate that the proposed method is able to determine the global optimal solution of the linear problem for systems including up to 10,000 classes in less than 1 s.
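    The optimization trade-off can be illustrated with a toy two-class instance: each class is either rejected (paying a penalty) or given enough VMs to meet its deadline, minimizing total cost. Brute-force enumeration over a tiny discrete space stands in for the paper's linear program; every number below is invented.

```python
import itertools
import math

classes = [
    # Hypothetical job classes: total work, deadline, rejection penalty.
    {"work": 1000, "deadline": 100, "penalty": 50.0},
    {"work": 400,  "deadline": 200, "penalty": 20.0},
]
VM_RATE = 2.0   # work units per second per VM
VM_COST = 1.0   # cost per VM
MAX_VMS = 20    # per-class cap

def min_vms(cls):
    """Smallest VM count meeting the deadline: work/(rate*vms) <= deadline."""
    return math.ceil(cls["work"] / (VM_RATE * cls["deadline"]))

def best_allocation(classes):
    """Minimize VM cost plus rejection penalties: each class is either
    rejected (0 VMs, pay its penalty) or admitted with >= min_vms."""
    best = (float("inf"), None)
    options = [[0] + list(range(min_vms(c), MAX_VMS + 1)) for c in classes]
    for alloc in itertools.product(*options):
        cost = sum(v * VM_COST if v else classes[i]["penalty"]
                   for i, v in enumerate(alloc))
        if cost < best[0]:
            best = (cost, alloc)
    return best

cost, alloc = best_allocation(classes)
```

    Here admitting both classes (5 and 1 VMs) is cheaper than paying either penalty; an LP formulation makes the same decision structure scale to thousands of classes, which enumeration obviously cannot.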

    Workflow models for heterogeneous distributed systems

    The role of data in modern scientific workflows is becoming more and more crucial. The unprecedented amount of data available in the digital era, combined with recent advancements in Machine Learning and High-Performance Computing (HPC), has let computers surpass human performance in a wide range of fields, such as Computer Vision, Natural Language Processing and Bioinformatics. However, a solid data management strategy is crucial for key aspects like performance optimisation, privacy preservation and security. Most modern programming paradigms for Big Data analysis adhere to the principle of data locality: moving computation closer to the data to remove transfer-related overheads and risks. Still, there are scenarios in which it is worthwhile, or even unavoidable, to transfer data between different steps of a complex workflow. The contribution of this dissertation is twofold. First, it defines a novel methodology for distributed modular applications, allowing topology-aware scheduling and data management while separating business logic, data dependencies, parallel patterns and execution environments. In addition, it introduces computational notebooks as a high-level and user-friendly interface to this new kind of workflow, aiming to flatten the learning curve and improve the adoption of such a methodology. Each of these contributions is accompanied by a full-fledged, open-source implementation, which has been used for evaluation purposes and allows the interested reader to experience the related methodology first-hand. The validity of the proposed approaches has been demonstrated on a total of five real scientific applications in the domains of Deep Learning, Bioinformatics and Molecular Dynamics Simulation, executing them on large-scale mixed cloud-HPC infrastructures.