
    A Big Data Analyzer for Large Trace Logs

    The current generation of Internet-based services is typically hosted on large data centers that take the form of warehouse-size structures housing tens of thousands of servers. Continued availability of a modern data center is the result of a complex orchestration among many internal and external actors, including computing hardware, multiple layers of intricate software, networking and storage devices, and electrical power and cooling plants. During the course of their operation, many of these components produce large amounts of data in the form of event and error logs that are essential not only for identifying and resolving problems but also for improving data center efficiency and management. Most of these activities would benefit significantly from data analytics techniques that exploit hidden statistical patterns and correlations present in the data. The sheer volume of data to be analyzed makes uncovering these correlations and patterns a challenging task. This paper presents BiDAl, a prototype Java tool for log-data analysis that incorporates several Big Data technologies in order to simplify the task of extracting information from data traces produced by large clusters and server farms. BiDAl provides the user with several analysis languages (SQL, R and Hadoop MapReduce) and storage backends (HDFS and SQLite) that can be freely mixed and matched so that a custom tool for a specific task can be easily constructed. BiDAl has a modular architecture, so it can be extended with other backends and analysis languages in the future. In this paper we present the design of BiDAl and describe our experience using it to analyze publicly available traces from Google data clusters, with the goal of building a realistic model of a complex data center. (26 pages, 10 figures)
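
    As a rough illustration of the "mix and match" modular design described in this abstract, the Java sketch below shows how pluggable storage backends and analysis languages could sit behind two small interfaces. The names (StorageBackend, AnalysisEngine, LogAnalyzer) and the in-memory example are assumptions made for this sketch, not BiDAl's actual API.

        // Illustrative sketch, not BiDAl's actual API: two small interfaces let any
        // storage backend be combined with any analysis language.
        import java.util.ArrayList;
        import java.util.List;

        interface StorageBackend {
            // Load a named trace table (e.g. from HDFS or SQLite) as rows of string fields.
            List<String[]> loadTable(String tableName);
        }

        interface AnalysisEngine {
            // Run an analysis script (SQL, R, or a MapReduce driver) over the loaded rows.
            List<String[]> run(String script, List<String[]> rows);
        }

        final class LogAnalyzer {
            private final StorageBackend backend;
            private final AnalysisEngine engine;

            LogAnalyzer(StorageBackend backend, AnalysisEngine engine) {
                this.backend = backend;
                this.engine = engine;
            }

            // Any backend/engine pair can be swapped in without touching this code.
            List<String[]> analyze(String tableName, String script) {
                return engine.run(script, backend.loadTable(tableName));
            }

            public static void main(String[] args) {
                StorageBackend inMemory = tableName -> {
                    List<String[]> rows = new ArrayList<>();
                    rows.add(new String[] { "job-42", "FAIL" });   // toy trace record
                    return rows;
                };
                AnalysisEngine passThrough = (script, rows) -> rows; // stand-in for SQL/R/MapReduce
                LogAnalyzer analyzer = new LogAnalyzer(inMemory, passThrough);
                System.out.println("rows analyzed: " + analyzer.analyze("task_events", "SELECT *").size());
            }
        }

    A concrete setup would pair, say, an SQLite-backed StorageBackend with an SQL AnalysisEngine, or an HDFS backend with a MapReduce engine, without changing LogAnalyzer.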

    Predicting Scheduling Failures in the Cloud

    Cloud computing has emerged as a key technology for delivering and managing computing, platform, and software services over the Internet. Task scheduling algorithms play an important role in the efficiency of cloud computing services, as they aim to reduce the turnaround time of tasks and improve resource utilization. Several task scheduling algorithms have been proposed in the literature for cloud computing systems, the majority relying on the computational complexity of tasks and the distribution of resources. However, many tasks scheduled following these algorithms still fail because of unforeseen changes in the cloud environment. In this paper, using task execution and resource utilization data extracted from the execution traces of real-world applications at Google, we explore the possibility of predicting the scheduling outcome of a task using statistical models. If we can successfully predict task failures, we may be able to reduce the execution time of jobs by rescheduling failed tasks earlier (i.e., before their actual failing time). Our results show that statistical models can predict task failures with a precision of up to 97.4% and a recall of up to 96.2%. We simulate the potential benefits of such predictions using the GloudSim toolkit and find that they can increase the number of finished tasks by up to 40%. We also perform a case study using the Hadoop framework of Amazon Elastic MapReduce (EMR) and the jobs of a gene expression correlation analysis study from breast cancer research. We find that when the Hadoop scheduler is extended with our predictive models, the percentage of failed jobs can be reduced by up to 45%, with an overhead of less than 5 minutes.
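
    To make the prediction-and-reschedule idea concrete, the Java sketch below scores a task with a hand-written logistic function over a few hypothetical features and flags it for early rescheduling when the estimated failure probability is high. The features, coefficients, and threshold are illustrative assumptions; the paper's statistical models are learned from the Google traces, not hard-coded like this.

        // Illustrative sketch only (not the paper's learned model): a logistic scorer
        // over hypothetical task features, used to reschedule likely-to-fail tasks early.
        final class TaskFailurePredictor {
            // Hypothetical coefficients; in practice these would be fitted to trace data.
            private static final double W_CPU = 2.1, W_MEM = 1.7, W_RETRIES = 0.9, BIAS = -3.0;

            static double failureProbability(double cpuUsage, double memUsage, int priorRetries) {
                double z = BIAS + W_CPU * cpuUsage + W_MEM * memUsage + W_RETRIES * priorRetries;
                return 1.0 / (1.0 + Math.exp(-z));   // logistic function
            }

            public static void main(String[] args) {
                double p = failureProbability(0.92, 0.85, 2);
                if (p > 0.5) {
                    System.out.printf("Predicted failure (p=%.2f): reschedule task early%n", p);
                } else {
                    System.out.printf("Predicted success (p=%.2f): keep current schedule%n", p);
                }
            }
        }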

    Testing Data Transformations in MapReduce Programs

    MapReduce is a parallel data processing paradigm oriented toward processing large volumes of information in data-intensive applications, such as Big Data environments. A characteristic of these applications is that they can have different data sources and data formats. For these reasons, the inputs could contain some poor-quality data that could produce a failure if the program functionality does not properly handle the variety of input data. The output of these programs is obtained from a number of input transformations that represent the program logic. This paper proposes a testing technique called MRFlow that is based on data-flow test criteria and oriented toward the analysis of transformations between the input and the output in order to detect defects in MapReduce programs. MRFlow is applied to several MapReduce programs and detects several defects.
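
    The plain-Java sketch below illustrates the general idea of testing an input-to-output transformation against both well-formed and poor-quality records. The mapper logic and test records are hypothetical examples and are not MRFlow itself, which derives its test requirements from data-flow criteria rather than hand-picked cases.

        // Illustrative sketch only: a unit check of a mapper-style transformation that
        // must tolerate malformed input records instead of failing.
        // Run with: java -ea WordLengthMapper (assertions enabled).
        import java.util.Optional;

        final class WordLengthMapper {
            // Transformation under test: "word,count" -> "word\tlength"; malformed rows are skipped.
            static Optional<String> map(String record) {
                String[] fields = record.split(",");
                if (fields.length != 2 || fields[0].isEmpty()) {
                    return Optional.empty();          // poor-quality input: skip instead of crashing
                }
                return Optional.of(fields[0] + "\t" + fields[0].length());
            }

            public static void main(String[] args) {
                // One check per path: valid record, missing field, empty key.
                assert map("hadoop,3").equals(Optional.of("hadoop\t6"));
                assert map("hadoop").isEmpty();
                assert map(",5").isEmpty();
                System.out.println("All transformation checks passed");
            }
        }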

    Cost-minimizing preemptive scheduling of mapreduce workloads on hybrid clouds

    MapReduce has become the dominant programming model for processing massive amounts of data on cloud platforms. More and more enterprises are now utilizing hybrid clouds, consisting of private infrastructure owned by themselves and public clouds such as Amazon EC2, to process their spiky MapReduce workloads, fully utilizing their own on-premise resources while outsourcing tasks only when needed. With the disparate workloads of different MapReduce tasks, an efficient scheduling mechanism is needed to enable efficient utilization of the on-premise resources and to minimize the task outsourcing cost, while also meeting the task completion time requirements. In this paper, a fine-grained model is described to characterize the scheduling of heterogeneous MapReduce workloads, and an online algorithm is proposed for joint task admission control into the private cloud, task outsourcing to the public cloud, and VM allocation to execute the admitted tasks on the private cloud, such that the time-averaged task outsourcing cost is minimized over the long run. The online algorithm features preemptive scheduling of tasks, where a task executed partially on the on-premise infrastructure can be paused and scheduled to run later. It also achieves desirable properties such as meeting a pre-set task admission ratio and bounding the worst-case task completion time, as proven by our rigorous theoretical analysis. © 2013 IEEE.
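
    As a loose illustration of the three decisions involved (run on the private cloud, outsource to the public cloud, or pause a task and resume it later), the Java sketch below applies a simple threshold rule. The capacity limit, deadline flag, and dispatch policy are assumptions made for the example and do not reproduce the paper's cost-minimizing online algorithm.

        // Illustrative sketch only: a simplified hybrid-cloud dispatch rule, not the
        // paper's online optimization.
        import java.util.ArrayDeque;
        import java.util.Deque;

        final class HybridCloudDispatcher {
            private final int privateVmCapacity;
            private int privateVmsInUse = 0;
            private final Deque<String> pausedTasks = new ArrayDeque<>();  // preempted tasks resume later

            HybridCloudDispatcher(int privateVmCapacity) {
                this.privateVmCapacity = privateVmCapacity;
            }

            // Prefer on-premise VMs; outsource only when capacity is full and the deadline is tight.
            String dispatch(String taskId, boolean nearDeadline) {
                if (privateVmsInUse < privateVmCapacity) {
                    privateVmsInUse++;
                    return taskId + " -> private cloud";
                }
                if (nearDeadline) {
                    return taskId + " -> public cloud (outsourced to meet completion time)";
                }
                pausedTasks.addLast(taskId);          // pause now, resume on-premise when capacity frees up
                return taskId + " -> paused, resume later";
            }

            public static void main(String[] args) {
                HybridCloudDispatcher d = new HybridCloudDispatcher(2);
                System.out.println(d.dispatch("map-1", false));
                System.out.println(d.dispatch("map-2", false));
                System.out.println(d.dispatch("map-3", true));   // capacity full, deadline tight
                System.out.println(d.dispatch("map-4", false));  // capacity full, can wait
            }
        }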

    Applying Control to Guarantee the Performance of Big Data Systems

    We are on the verge of an enormous explosion of data, and the amount that enterprises must process keeps growing. To face this challenge, Google developed MapReduce, a parallel programming model that is becoming the de facto tool for analyzing Big Data systems. Although its use is already widespread in industry, guaranteeing the performance of such a complex system poses major problems, and its management requires a high level of expertise. This article addresses these challenges by proposing the first autonomic system that guarantees response-time constraints for a concurrent MapReduce workload. We develop the first dynamic model of a MapReduce cluster. In addition, a closed-loop controller is designed and implemented to guarantee a given response time. A feedforward controller is also added to improve the system's response in the presence of disturbances, namely variations in the number of clients. The approach is validated online on a MapReduce cluster with 40 nodes running an intensive Business Intelligence workload. Our experiments show that the resulting controller can guarantee the response-time constraints.
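
    A minimal sketch of the combined feedback/feedforward idea is given below in Java: a PI term reacts to the measured response-time error, while a feedforward term reacts to changes in the number of clients before they show up in the latency. The gains and the node-count actuator are assumptions made for the example, not the controller model identified in the paper.

        // Illustrative sketch only: PI feedback plus a feedforward term on client count,
        // resizing a MapReduce cluster toward a response-time setpoint.
        final class ClusterSizeController {
            private static final double KP = 0.05, KI = 0.01;   // hypothetical PI gains
            private static final double KF = 0.5;               // hypothetical feedforward gain (nodes per added client)
            private double integral = 0.0;

            // error > 0 means the cluster is too slow, so nodes should be added.
            int nodesToAdd(double measuredResponseTime, double targetResponseTime, int deltaClients) {
                double error = measuredResponseTime - targetResponseTime;
                integral += error;
                double feedback = KP * error + KI * integral;
                double feedforward = KF * deltaClients;          // react to load changes before latency degrades
                return (int) Math.round(feedback + feedforward);
            }

            public static void main(String[] args) {
                ClusterSizeController c = new ClusterSizeController();
                System.out.println("Nodes to add: " + c.nodesToAdd(12.0, 10.0, 5));
            }
        }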