
    Failure prediction for high-performance computing systems

    The failure rate in high-performance computing (HPC) systems continues to escalate as the number of components in these systems increases. This affects the scalability and performance of parallel applications in large-scale HPC systems. Fault tolerance (FT) mechanisms help mitigate the impact of failures on parallel applications, but utilizing such mechanisms incurs additional overhead, and overusing them results in unnecessarily large overhead in the parallel applications. Knowing when and where failures will occur can greatly reduce this excessive overhead; as such, failure prediction is critical for utilizing FT mechanisms effectively. It also helps in system administration and management, as a predicted failure can be handled beforehand with limited impact on the running systems. This dissertation proposes new proficiency metrics for failure prediction based on failure impact in the HPC environment, which the existing proficiency metrics are unable to reflect. Furthermore, an efficient log message clustering algorithm is proposed for preprocessing and analyzing system event log data. Then, two novel association rule mining approaches are introduced and employed for HPC failure prediction. Finally, the performance of the existing and the proposed association rule mining methods is compared and analyzed.
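    The abstract does not spell out the mining step, but the general idea of association rule mining over event logs can be sketched briefly. The following minimal Python example is an illustration only: the window contents, event names, and thresholds are assumptions for demonstration, not the dissertation's data or algorithm. It counts co-occurring event types per log window and emits rules that clear support and confidence thresholds.

```python
from collections import Counter
from itertools import combinations

# Hypothetical preprocessed input: each window is the set of event-type
# IDs observed in one time slice of the system event log.
windows = [
    {"mem_ecc", "node_warn"},
    {"mem_ecc", "node_warn", "node_fail"},
    {"net_drop"},
    {"mem_ecc", "node_fail"},
]

def mine_rules(windows, min_support=0.25, min_confidence=0.6):
    """Naive Apriori-style pass: count itemsets of size 1-2 and emit
    rules A -> B whose support and confidence clear the thresholds."""
    n = len(windows)
    counts = Counter()
    for w in windows:
        for item in w:
            counts[frozenset([item])] += 1
        for pair in combinations(sorted(w), 2):
            counts[frozenset(pair)] += 1
    rules = []
    for itemset, c in counts.items():
        if len(itemset) != 2 or c / n < min_support:
            continue
        a, b = sorted(itemset)
        for lhs, rhs in ((a, b), (b, a)):
            conf = c / counts[frozenset([lhs])]
            if conf >= min_confidence:
                rules.append((lhs, rhs, c / n, conf))
    return rules

for lhs, rhs, sup, conf in mine_rules(windows):
    print(f"{lhs} -> {rhs}  support={sup:.2f} confidence={conf:.2f}")
```

    A rule such as mem_ecc -> node_fail would then be read as a candidate precursor relationship that a failure predictor could act on.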

    Research and Education in Computational Science and Engineering

    Over the past two decades the field of computational science and engineering (CSE) has penetrated both basic and applied research in academia, industry, and laboratories to advance discovery, optimize systems, support decision-makers, and educate the scientific and engineering workforce. Informed by centuries of theory and experiment, CSE performs computational experiments to answer questions that neither theory nor experiment alone is equipped to answer. CSE provides scientists and engineers of all persuasions with algorithmic inventions and software systems that transcend disciplines and scales. Carried on a wave of digital technology, CSE brings the power of parallelism to bear on troves of data. Mathematics-based advanced computing has become a prevalent means of discovery and innovation in essentially all areas of science, engineering, technology, and society, and the CSE community is at the core of this transformation. However, a combination of disruptive developments---including the architectural complexity of extreme-scale computing, the data revolution that engulfs the planet, and the specialization required to follow the applications to new frontiers---is redefining the scope and reach of the CSE endeavor. This report describes the rapid expansion of CSE and the challenges to sustaining its bold advances. The report also presents strategies and directions for CSE research and education for the next decade. (Comment: major revision, to appear in SIAM Review.)

    Developing sustainability pathways for social simulation tools and services

    The use of cloud technologies to teach agent-based modelling and simulation (ABMS) is an interesting application of a nascent technological paradigm that has received very little attention in the literature. This report fills that gap and aims to help instructors, teachers and demonstrators understand why and how cloud services are appropriate solutions to common problems they face in delivering their study programmes, as well as outlining the many cloud options available. The report first introduces social simulation and considers how social simulation is taught. Following this, factors affecting the implementation of agent-based models are explored, with attention focused primarily on the modelling and execution platforms currently available, the challenges associated with implementing agent-based models, and the technical architectures that can be used to support the modelling, simulation and teaching process. This sets the context for an extended discussion on cloud computing, including service and deployment models, accessing cloud resources, the financial implications of adopting the cloud, and an introduction to the evaluation of cloud services within the context of developing, executing and teaching agent-based models.
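    For readers new to the subject, the kind of model being taught here can be conveyed in a few lines. The sketch below is a minimal, self-contained agent-based model in Python; the grid size, agent count, and contagion rule are arbitrary teaching assumptions unrelated to any platform discussed in the report. Agents random-walk on a toroidal grid and adopt an "informed" state from agents sharing their cell.

```python
import random

random.seed(42)
SIZE, STEPS, N = 20, 50, 30

# One seed agent starts "informed"; everyone else starts naive.
agents = [{"x": random.randrange(SIZE), "y": random.randrange(SIZE),
           "informed": i == 0} for i in range(N)]

for _ in range(STEPS):
    # Movement phase: each agent takes one random step on the torus.
    for a in agents:
        a["x"] = (a["x"] + random.choice((-1, 0, 1))) % SIZE
        a["y"] = (a["y"] + random.choice((-1, 0, 1))) % SIZE
    # Interaction phase: adopt the state if sharing a cell with an
    # informed agent (a simple word-of-mouth contagion rule).
    for a in agents:
        if not a["informed"]:
            a["informed"] = any(b["informed"] and b["x"] == a["x"]
                                and b["y"] == a["y"] for b in agents)

print(sum(a["informed"] for a in agents), "of", N, "agents informed")
```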

    Research and Education in Computational Science and Engineering

    This report presents challenges, opportunities, and directions for computational science and engineering (CSE) research and education for the next decade. Over the past two decades the field of CSE has penetrated both basic and applied research in academia, industry, and laboratories to advance discovery, optimize systems, support decision-makers, and educate the scientific and engineering workforce. Informed by centuries of theory and experiment, CSE performs computational experiments to answer questions that neither theory nor experiment alone is equipped to answer. CSE provides scientists and engineers with algorithmic inventions and software systems that transcend disciplines and scales. CSE brings the power of parallelism to bear on troves of data. Mathematics-based advanced computing has become a prevalent means of discovery and innovation in essentially all areas of science, engineering, technology, and society, and the CSE community is at the core of this transformation. However, a combination of disruptive developments---including the architectural complexity of extreme-scale computing, the data revolution and increased attention to data-driven discovery, and the specialization required to follow the applications to new frontiers---is redefining the scope and reach of the CSE endeavor. With these many current and expanding opportunities for the CSE field, there is a growing demand for CSE graduates and a need to expand CSE educational offerings. This need includes CSE programs at both the undergraduate and graduate levels, as well as continuing education and professional development programs, exploiting the synergy between computational science and data science. Yet, as institutions consider new and evolving educational programs, it is essential to consider the broader research challenges and opportunities that provide the context for CSE education and workforce development.

    Using Object Detection to Navigate a Game Playfield

    Perhaps the crown jewel of AI is the self-navigating agent: a system that takes many sources of data as input and uses them to traverse complex and varied areas while mitigating risk and damage to the vehicle being controlled. Visual object detection is a key part of this technology suite. While much effort is being put toward real-world applications, for example self-driving cars, healthcare, and automated manufacturing, we apply object detection in a different way: the automation of movement across a video game playfield. We take the TensorFlow Object Detection API and use it to craft an avoidance system in conjunction with a Java front end that allows fire-and-forget movement to augment normal play.
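    The abstract names the TensorFlow Object Detection API but not the integration details. A minimal inference sketch might look like the following; the model path, score threshold, and steering rule are hypothetical assumptions, and the paper's actual avoidance system and Java front end are not described here. A saved detection model scores a captured game frame, and a toy rule steers away from the side with more detections.

```python
import tensorflow as tf

# Hypothetical path to a model exported with the TF Object Detection API.
detect_fn = tf.saved_model.load("exported_model/saved_model")

def obstacles_ahead(frame_rgb, score_threshold=0.5):
    """Run one detection pass on a game-screen capture (H x W x 3 uint8)
    and return the boxes of confident detections."""
    input_tensor = tf.convert_to_tensor(frame_rgb)[tf.newaxis, ...]
    detections = detect_fn(input_tensor)
    scores = detections["detection_scores"][0].numpy()
    # Boxes are [ymin, xmin, ymax, xmax], normalized to [0, 1].
    boxes = detections["detection_boxes"][0].numpy()
    return boxes[scores >= score_threshold]

def steer(frame_rgb):
    """Toy avoidance rule: move away from the side with more obstacles."""
    boxes = obstacles_ahead(frame_rgb)
    left = sum(1 for b in boxes if (b[1] + b[3]) / 2 < 0.5)
    right = len(boxes) - left
    return "LEFT" if right > left else "RIGHT"
```

    In a setup like the paper's, a front end would capture frames, call something like steer(), and issue the corresponding movement commands to the game.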

    The terminator: an AI-based framework to handle dependability threats in large-scale distributed systems

    With the advent of resource-hungry applications such as scientific simulations and artificial intelligence (AI), the need for high-performance computing (HPC) infrastructure is becoming more pressing. HPC systems are typically characterised by the scale of the resources they possess, containing a large number of sophisticated hardware components that are tightly integrated. This scale and design complexity inherently contribute to sources of uncertainty, i.e., dependability threats that perturb the system during application execution. During execution, these HPC systems generate a massive amount of log messages that capture the health status of the various components. Several previous works have leveraged these system logs for dependability purposes, such as failure prediction, with varying results. In this work, three novel AI-based techniques are proposed to address two major dependability problems: (i) error detection and (ii) failure prediction. The proposed error detection technique leverages the sentiments embedded in log messages in a novel way, making the approach HPC system-independent, i.e., the technique can be used to detect errors in any HPC system. In addition, two novel self-supervised transformer neural networks are developed for failure prediction, thereby obviating the need for labels, which are notoriously difficult to obtain in HPC systems. The first transformer technique, called Clairvoyant, accurately predicts the location of the failure, while the second technique, called Time Machine, extends Clairvoyant by also accurately predicting the lead time to failure (LTTF). Time Machine recasts the typical regression problem of LTTF as a novel multi-class classification problem, using a novel oversampling method for online time-based task training. Results from six real-world HPC clusters' datasets show that our approaches significantly outperform the state-of-the-art methods on various metrics.
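    The thesis's sentiment-based detector itself is not reproduced in the abstract; to make the underlying idea concrete, here is a deliberately simple lexicon-based sketch in Python. The word lists and threshold are illustrative assumptions, and a real system would learn such cues rather than hard-code them. Log lines with predominantly negative cue words are flagged as error candidates.

```python
import re

# Illustrative cue-word lexicons; these are assumptions, not the
# thesis's vocabulary.
NEGATIVE = {"error", "fail", "failed", "fatal", "panic", "timeout",
            "unreachable", "corrupt", "denied"}
POSITIVE = {"ok", "success", "completed", "online", "recovered", "healthy"}

def sentiment_score(message):
    """Return (#negative - #positive) cue words in one log line."""
    tokens = re.findall(r"[a-z]+", message.lower())
    return sum(t in NEGATIVE for t in tokens) - sum(t in POSITIVE for t in tokens)

def flag_errors(log_lines, threshold=1):
    """Flag lines whose net negative sentiment clears the threshold."""
    return [line for line in log_lines if sentiment_score(line) >= threshold]

logs = [
    "node c12-03: link recovered, state healthy",
    "node c12-07: fatal ECC error, memory controller timeout",
]
print(flag_errors(logs))
```

    Because the scoring depends only on the wording of the messages, not on any system-specific log schema, a detector in this spirit is portable across HPC systems, which is the property the abstract highlights.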

    FY18 Annual ITACS Accountability Report


    Novelty Detection And Cluster Analysis In Time Series Data Using Variational Autoencoder Feature Maps

    The identification of atypical events and anomalies in complex data systems is an essential yet challenging task. The dynamic nature of these systems produces huge volumes of data that are often heterogeneous, and failure to account for this heterogeneity impedes the detection of anomalies. Time series data encompass these issues, and their high-dimensional nature intensifies these challenges. This research presents a framework for the identification of anomalies in temporal data. A comparative analysis of centroid-, density- and neural-network-based clustering techniques was performed and their scalability was assessed. This facilitated the development of a new algorithm called the Variational Autoencoder Feature Map (VAEFM), an ensemble method based on Kohonen's Self-Organizing Maps (SOM) and variational autoencoders. The VAEFM is an unsupervised learning algorithm that models the distribution of temporal data without making a priori assumptions. It incorporates principles of novelty detection to enhance the representational capacity of the SOM's neurons, which improves their ability to generalize to novel data. The VAEFM technique was demonstrated on a dataset of accumulated aircraft sensor recordings to detect atypical events that transpired in the approach phase of flight. This is a proactive means of accident prevention and is therefore advantageous to the aviation industry. Furthermore, accumulated aircraft data presents big data challenges that require scalable analytical solutions. The results indicated that VAEFM successfully identified temporal dependencies in the flight data and produced several clusters and outliers. It analyzed over 2500 flights in under 5 minutes and identified 12 clusters, two of which contained stabilized approaches. The remaining clusters comprised aborted approaches, excessively high or fast descent patterns, and other contributory factors for unstabilized approaches. Outliers were detected that revealed oscillations in aircraft trajectories, some of which would have a lower detection rate using traditional flight safety analytical techniques. The results further indicated that VAEFM facilitates large-scale analysis; its scaling efficiency was demonstrated on a high-performance computing system, where using an increased number of processors achieved an average speedup of 70%.
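    As a rough illustration of the self-organizing-map half of the VAEFM (the variational-autoencoder feature extraction is omitted, and the grid size, dimensionality, and schedules are illustrative assumptions rather than the dissertation's settings), the following NumPy sketch trains a small SOM: each sample pulls its best-matching neuron, and a Gaussian neighbourhood of the map, toward itself.

```python
import numpy as np

rng = np.random.default_rng(0)
GRID, DIM, EPOCHS = (8, 8), 4, 200
weights = rng.normal(size=GRID + (DIM,))   # 8x8 map of 4-d neurons
coords = np.stack(np.meshgrid(*map(np.arange, GRID), indexing="ij"), axis=-1)

def train(data, lr0=0.5, sigma0=3.0):
    global weights
    for t in range(EPOCHS):
        # Learning rate and neighbourhood width decay over training.
        lr = lr0 * (1 - t / EPOCHS)
        sigma = sigma0 * (1 - t / EPOCHS) + 0.5
        for x in data:
            # Best-matching unit = neuron closest to the sample.
            bmu = np.unravel_index(
                np.argmin(((weights - x) ** 2).sum(-1)), GRID)
            # Gaussian neighbourhood pulls nearby neurons toward x.
            d2 = ((coords - np.array(bmu)) ** 2).sum(-1)
            h = np.exp(-d2 / (2 * sigma ** 2))[..., None]
            weights += lr * h * (x - weights)

data = rng.normal(size=(100, DIM))
train(data)
```

    In the VAEFM setting, the inputs to such a map would be latent features produced by the variational autoencoder rather than raw sensor values, and samples that map poorly onto any neuron become novelty or outlier candidates.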

    Proceedings, MSVSCC 2017

    Proceedings of the 11th Annual Modeling, Simulation & Visualization Student Capstone Conference held on April 20, 2017 at VMASC in Suffolk, Virginia. 211 pp.