141 research outputs found

    An evaluation of parallel job scheduling for ASCI Blue-Pacific

    Full text link
    In this paper we analyze the behavior of a gang-scheduling strategy that we are developing for the ASCI Blue-Pacific machines. Using actual job logs for one of the ASCI machines we generate a statistical model of the current workload with hyper Erlang distributions. We then vary the parameters of those distributions to generate various workloads, representative of different operating points of the machine. Through simulation we obtain performance parameters for three different scheduling strategies: (i) first-come first-serve, (ii) gang-scheduling, and (iii) backfilling. Our results show that backfilling, can be very effective for the common operating points in the 60-70% utilization range. However, for higher utilization rates, time-sharing techniques such as gang-scheduling offer much better performance

    Failure analysis and reliability -aware resource allocation of parallel applications in High Performance Computing systems

    Get PDF
    The demand for more computational power to solve complex scientific problems has been driving the physical size of High Performance Computing (HPC) systems to hundreds and thousands of nodes. Uninterrupted execution of large scale parallel applications naturally becomes a major challenge because a single node failure interrupts the entire application, and the reliability of a job completion decreases with increasing the number of nodes. Accurate reliability knowledge of a HPC system enables runtime systems such as resource management and applications to minimize performance loss due to random failures while also providing better Quality Of Service (QOS) for computational users. This dissertation makes three major contributions for reliability evaluation and resource management in HPC systems. First we study the failure properties of HPC systems and observe that Times To Failure (TTF\u27s) of individual compute nodes follow a time-varying failure rate based distribution like Weibull distribution. We then propose a model for the TTF distribution of a system of k independent nodes when individual nodes exhibit time varying failure rates. Based on the reliability of the proposed TTF model, we develop reliability-aware resource allocation algorithms and evaluated them on actual parallel workloads and failure data of a HPC system. Our observations indicate that applying time varying failure rate-based reliability function combined with some heuristics reduce the performance loss due to unexpected failures by as much as 30 to 53 percent. Finally, we also study the effect of reliability with respect to the number of nodes and propose reliability-aware optimal k node allocation algorithm for large scale parallel applications. Our simulation results of comparing the optimal k node algorithm indicate that choosing the number of nodes for large scale parallel applications based on the reliability of compute nodes can reduce the overall completion time and waste time when the k may be smaller than the total number of nodes in the system

    Modeling Coordinated Checkpointing for Large-Scale Supercomputers,”

    Get PDF

    G-LOMARC-TS: Lookahead group matchmaking for time/space sharing on multi-core parallel machines

    Get PDF
    Parallel machines with multi-core nodes are becoming increasingly popular. The performances of applications running on these machines are improved gradually due to the resource competition in each node. Researches have found that coscheduling different applications with complementary resource characteristics on the same set of nodes (semi time sharing) may improve the performance. We propose a scheduling algorithm G-LOMARC-TS which incorporates both space and semi time sharing scheduling methods and matches groups of jobs if possible for coscheduling. Since matchmaking may select jobs further down the waiting queue and the jobs in front of the queue may be delayed subsequently, fairness for each individual job will be watched and the delay will be kept within a limited bound. Several heuristics are used to solve the NP-complete problem of forming groups. Our experiment results show both utilization gain and average relative response time improvements of G-LOMARC-TS over other several scheduling policies

    Inter-Operating Grids Through Delegated MatchMaking

    Get PDF

    Impact of Noise on Scaling of Collectives: An Empirical Evaluation

    Full text link
    Abstract. It is increasingly becoming evident that operating system interference in the form of daemon activity and interrupts contribute significantly to perfor-mance degradation of parallel applications in large clusters. An earlier theoretical study has evaluated the impact of system noise on application performance for different noise distributions [1]. Our work complements the theoretical analysis by presenting an empirical study of noise in production clusters. We designed a parallel benchmark that was used on large clusters at SanDeigo Supercomputing Center for collecting noise related data. This data was fed to a simulator that pre-dicts the performance of collective operations using the model of [1]. We report our comparison of the predicted and the observed performance. Additionally, the tools developed in the process have been instrumental in identifying anomalous nodes that could potentially be affecting application performance if undetected.

    A Process Model for the Integrated Reasoning about Quantitative IT Infrastructure Attributes

    Get PDF
    IT infrastructures can be quantitatively described by attributes, like performance or energy efficiency. Ever-changing user demands and economic attempts require varying short-term and long-term decisions regarding the alignment of an IT infrastructure and particularly its attributes to this dynamic surrounding. Potentially conflicting attribute goals and the central role of IT infrastructures presuppose decision making based upon reasoning, the process of forming inferences from facts or premises. The focus on specific IT infrastructure parts or a fixed (small) attribute set disqualify existing reasoning approaches for this intent, as they neither cover the (complex) interplay of all IT infrastructure components simultaneously, nor do they address inter- and intra-attribute correlations sufficiently. This thesis presents a process model for the integrated reasoning about quantitative IT infrastructure attributes. The process model’s main idea is to formalize the compilation of an individual reasoning function, a mathematical mapping of parametric influencing factors and modifications on an attribute vector. Compilation bases upon model integration to benefit from the multitude of existing specialized, elaborated, and well-established attribute models. The achieved reasoning function consumes an individual tuple of IT infrastructure components, attributes, and external influencing factors to expose a broad applicability. The process model formalizes a reasoning intent in three phases. First, reasoning goals and parameters are collected in a reasoning suite, and formalized in a reasoning function skeleton. Second, the skeleton is iteratively refined, guided by the reasoning suite. Third, the achieved reasoning function is employed for What-if analyses, optimization, or descriptive statistics to conduct the concrete reasoning. The process model provides five template classes that collectively formalize all phases in order to foster reproducibility and to reduce error-proneness. Process model validation is threefold. A controlled experiment reasons about a Raspberry Pi cluster’s performance and energy efficiency to illustrate feasibility. Besides, a requirements analysis on a world-class supercomputer and on the European-wide execution of hydro meteorology simulations as well as a related work examination disclose the process model’s level of innovation. Potential future work employs prepared automation capabilities, integrates human factors, and uses reasoning results for the automatic generation of modification recommendations.IT-Infrastrukturen können mit Attributen, wie Leistung und Energieeffizienz, quantitativ beschrieben werden. Nutzungsbedarfsänderungen und ökonomische Bestrebungen erfordern Kurz- und Langfristentscheidungen zur Anpassung einer IT-Infrastruktur und insbesondere ihre Attribute an dieses dynamische Umfeld. Potentielle Attribut-Zielkonflikte sowie die zentrale Rolle von IT-Infrastrukturen erfordern eine Entscheidungsfindung mittels Reasoning, einem Prozess, der Rückschlüsse (rein) aus Fakten und Prämissen zieht. Die Fokussierung auf spezifische Teile einer IT-Infrastruktur sowie die Beschränkung auf (sehr) wenige Attribute disqualifizieren bestehende Reasoning-Ansätze für dieses Vorhaben, da sie weder das komplexe Zusammenspiel von IT-Infrastruktur-Komponenten, noch Abhängigkeiten zwischen und innerhalb einzelner Attribute ausreichend berücksichtigen können. Diese Arbeit präsentiert ein Prozessmodell für das integrierte Reasoning über quantitative IT-Infrastruktur-Attribute. Die grundlegende Idee des Prozessmodells ist die Herleitung einer individuellen Reasoning-Funktion, einer mathematischen Abbildung von Einfluss- und Modifikationsparametern auf einen Attributvektor. Die Herleitung basiert auf der Integration bestehender (Attribut-)Modelle, um von deren Spezialisierung, Reife und Verbreitung profitieren zu können. Die erzielte Reasoning-Funktion verarbeitet ein individuelles Tupel aus IT-Infrastruktur-Komponenten, Attributen und externen Einflussfaktoren, um eine breite Anwendbarkeit zu gewährleisten. Das Prozessmodell formalisiert ein Reasoning-Vorhaben in drei Phasen. Zunächst werden die Reasoning-Ziele und -Parameter in einer Reasoning-Suite gesammelt und in einem Reasoning-Funktions-Gerüst formalisiert. Anschließend wird das Gerüst entsprechend den Vorgaben der Reasoning-Suite iterativ verfeinert. Abschließend wird die hergeleitete Reasoning-Funktion verwendet, um mittels “What-if”–Analysen, Optimierungsverfahren oder deskriptiver Statistik das Reasoning durchzuführen. Das Prozessmodell enthält fünf Template-Klassen, die den Prozess formalisieren, um Reproduzierbarkeit zu gewährleisten und Fehleranfälligkeit zu reduzieren. Das Prozessmodell wird auf drei Arten validiert. Ein kontrolliertes Experiment zeigt die Durchführbarkeit des Prozessmodells anhand des Reasonings zur Leistung und Energieeffizienz eines Raspberry Pi Clusters. Eine Anforderungsanalyse an einem Superrechner und an der europaweiten Ausführung von Hydro-Meteorologie-Modellen erläutert gemeinsam mit der Betrachtung verwandter Arbeiten den Innovationsgrad des Prozessmodells. Potentielle Erweiterungen nutzen die vorbereiteten Automatisierungsansätze, integrieren menschliche Faktoren, und generieren Modifikationsempfehlungen basierend auf Reasoning-Ergebnissen
    • …
    corecore