20 research outputs found

    Novel Representation Learning Technique using Graphs for Performance Analytics

    The performance analytics domain in High Performance Computing (HPC) uses tabular data to solve regression problems, such as predicting the execution time. Existing Machine Learning (ML) techniques leverage the correlations among features in tabular datasets but do not directly leverage the relationships between samples. Moreover, since high-quality embeddings from raw features improve the fidelity of the downstream predictive models, existing methods rely on extensive feature engineering and pre-processing steps, costing time and manual effort. To fill these two gaps, we propose a novel idea of transforming tabular performance data into graphs to leverage the advancement of Graph Neural Network-based (GNN) techniques in capturing complex relationships between features and samples. In contrast to other ML application domains, such as social networks, the graph is not given; instead, we need to build it. To address this gap, we propose graph-building methods where nodes represent samples, and the edges are automatically inferred iteratively based on the similarity between the features in the samples. We evaluate the effectiveness of the generated embeddings from GNNs based on how well they make even a simple feed-forward neural network perform for regression tasks compared to other state-of-the-art representation learning techniques. Our evaluation demonstrates that even with up to 25% random missing values for each dataset, our method outperforms commonly used graph and Deep Neural Network (DNN)-based approaches and achieves up to 61.67% and 78.56% improvement in MSE loss over the DNN baseline for the HPC dataset and the machine learning datasets, respectively. Comment: This paper has been accepted at the 22nd International Conference on Machine Learning and Applications (ICMLA 2023).
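
The graph-building idea in this abstract (samples as nodes, edges inferred from feature similarity) can be illustrated with a minimal sketch. This is not the authors' method: the abstract describes an iterative edge-inference procedure whose details are not given here, so the sketch below substitutes a single-pass k-nearest-neighbour rule over cosine similarity. The function name `knn_edges` and the parameter `k` are assumptions made for illustration.

```python
import numpy as np

def knn_edges(X, k=2):
    """Return directed edges (i, j) linking each row to its k most similar rows.

    Illustrative stand-in only: one pass of cosine-similarity k-NN,
    not the iterative procedure described in the paper.
    """
    X = np.asarray(X, dtype=float)
    # Standardize columns so no single feature dominates the similarity.
    std = X.std(axis=0)
    std[std == 0] = 1.0
    Z = (X - X.mean(axis=0)) / std
    # Cosine similarity between every pair of samples.
    norms = np.linalg.norm(Z, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    U = Z / norms
    sim = U @ U.T
    np.fill_diagonal(sim, -np.inf)  # exclude self-loops
    # Indices of the k largest similarities in each row.
    nbrs = np.argsort(-sim, axis=1)[:, :k]
    return [(i, int(j)) for i in range(len(X)) for j in nbrs[i]]

if __name__ == "__main__":
    # Four hypothetical samples with two features each.
    table = [[1.0, 10.0], [1.1, 10.5], [5.0, 50.0], [5.2, 51.0]]
    print(knn_edges(table, k=1))
```

The resulting edge list could then be passed to a GNN library as the graph over which node embeddings are learned; which library and model to use is left open here.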

    Novel proposals for FAIR, automated, recommendable, and robust workflows

    Funding: This work is partly funded by NSF award OAC-1839900. This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contract number DE-AC02-06CH11357. libEnsemble was developed as part of the Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This research used resources of the OLCF at ORNL, which is supported by the Office of Science of the U.S. DOE under Contract No. DE-AC05-00OR22725. Lightning talks of the Workflows in Support of Large-Scale Science (WORKS) workshop are a venue where the workflow community (researchers, developers, and users) can discuss work in progress, emerging technologies and frameworks, and training and education materials. This paper summarizes the WORKS 2022 lightning talks, which cover five broad topics: data integrity of scientific workflows; a machine learning-based recommendation system; a Python toolkit for running dynamic ensembles of simulations; a cross-platform, high-performance computing utility for processing shell commands; and a meta(data) framework for reproducing hybrid workflows. Postprint

    An Eight-year Study Report on Arsenic Contamination in Groundwater and Health Effects in Eruani Village, Bangladesh and an Approach for Its Mitigation

    Based on several surveys during 1997-2005 and visits of a medical team to Eruani village, Laksham upazila, Comilla district, Bangladesh, the arsenic contamination situation and consequent clinical manifestations of arsenicosis among the villagers, including dermatology, neuropathy, and obstetric outcome, are reported here. Analysis of biological samples from patients and non-patients showed high body burden of arsenic. Even after eight years of known exposure, village children were still drinking arsenic-contaminated water, and many of them had arsenical skin lesions. There were social problems due to the symptoms of arsenicosis. The last survey established that there is a lack of proper awareness among villagers about different aspects of arsenic toxicity. The viability of different options of safe water, such as dugwells, deep tubewells, rainwater harvesting, and surface water with watershed management in the village, was studied. Finally, based on 19 years of field experience, it was felt that, for any successful mitigation programme, emphasis should be given to creating awareness among villagers about the arsenic problem, role of arsenic-free water, better nutrition from local fruits and vegetables, and, above all, active participation of women along with others in the struggle against the arsenic menace.

    Reliable and scalable checkpointing systems for distributed computing environments

    By leveraging enormous computational capabilities, scientists today are able to make significant progress in solving problems ranging from finding a cure for cancer to using fusion to solve the world's clean energy crisis. The number of computational components in extreme scale computing environments is growing exponentially. Since the failure rate of each component starts factoring in, the reliability of overall systems decreases proportionately. Hence, in spite of having enormous computational capabilities, these groundbreaking simulations may never run to completion. The only way to ensure their timely completion is by making these systems reliable, so that no failure can hinder the progress of science. On such systems, long running scientific applications periodically store their execution states in checkpoint files on stable storage, and recover from a failure by restarting from the last saved checkpoint file. Resilient high-throughput and high-performance systems enable applications to simulate scientific problems at granularities finer than ever thought possible. Unfortunately, this explosion in scientific computing capabilities generates large amounts of state. As a result, today's checkpointing systems crumble under the increased amount of checkpoint data. Additionally, the network I/O bandwidth is not growing nearly as fast as the compute cycles. These two factors have caused scalability challenges for checkpointing systems. The focus of this thesis is to develop scalable checkpointing systems for two different execution environments: high-throughput grids and high-performance clusters. In a grid environment, machine owners voluntarily share their idle CPU cycles with other users of the system, as long as the performance degradation of host processes remains under a certain threshold. 
The challenge of such an environment is to ensure end-to-end application performance given the high rate of machine unavailability and of guest-job eviction. Today's systems often use expensive, high-performance dedicated checkpoint servers. In this thesis, we present a system, FALCON, which uses available disk resources of the grid machines as shared checkpoint repositories. However, an unavailable storage host may lead to loss of checkpoint data. Therefore, we model the failures of storage hosts and predict the availability of checkpoint repositories. Experiments run on a production high-throughput system, DiaGrid, show that FALCON improves the overall performance of benchmark applications that write gigabytes of checkpoint data by between 11% and 44% compared to the widely used Condor checkpointing solutions. In high-performance computing (HPC) systems, applications store their states in checkpoints on a parallel file system (PFS). As applications scale up, checkpoint-restart incurs high overheads due to contention for PFS resources. The high overheads force large-scale applications to reduce checkpoint frequency, which means more compute time is lost in the event of failure. We alleviate this problem by developing a scalable checkpoint-restart system, MCRENGINE. MCRENGINE aggregates checkpoints from multiple application processes with knowledge of the data semantics available through widely used I/O libraries, e.g., HDF5 and netCDF, and compresses them. Our novel scheme improves compressibility of checkpoints by up to 115% over simple concatenation and compression. Our evaluation with large-scale application checkpoints shows that MCRENGINE reduces checkpointing overhead by up to 87% and restart overhead by up to 62% over a baseline with no aggregation or compression. 
We believe that the contributions made in this thesis serve as a good foundation for further research in improving scalability of checkpointing systems in large-scale, distributed computing environments.
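
The basic checkpoint-restart cycle this abstract builds on (periodically save state to stable storage, restart from the last saved checkpoint after a failure) can be sketched as follows. This is a generic single-process illustration, not FALCON or MCRENGINE; the function names, the `fail_at` parameter used to simulate a crash, and the use of `pickle` for serialization are all assumptions made for the example.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write state to a temporary file, then atomically rename it over path."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    # A crash during the write leaves the previous checkpoint untouched.
    os.replace(tmp, path)

def load_checkpoint(path, default):
    """Return the last saved state, or default when no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)

def run(path, total_steps, interval, fail_at=None):
    """Sum 0..total_steps-1, checkpointing every `interval` steps.

    `fail_at` (hypothetical) simulates a crash just before that step runs.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]

if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "ckpt.pkl")
    try:
        run(path, 10, 3, fail_at=7)   # crashes after the step-6 checkpoint
    except RuntimeError:
        pass
    print(run(path, 10, 3))           # resumes from step 6 and prints 45
```

In this run the work done at step 6 is repeated after the restart, which is the lost compute time the abstract refers to: a longer checkpoint interval lowers checkpointing overhead but increases the work redone after a failure.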

    Challenges for Implementing FAIR Digital Objects with High Performance Workflows

    New types of workflows are being used in science that couple traditional distributed and high-performance computing (HPC) with data-intensive approaches, and orchestrate ensembles of numerical simulations and artificial intelligence (AI) models. Such workflows may use AI models to supplement computation where numerical simulations may be too computationally expensive, to automate trivial yet time-consuming operations, or to perform preliminary selections among intractable numbers of combinations in domains as diverse as protein binding, fine-grid climate simulations, and drug discovery. They offer renewed opportunities for scientific research but exhibit high computational, storage and communications requirements [Goble et al. 2020, Al-Saadi et al. 2021, da Silva et al. 2021]. These workflows can be orchestrated by workflow management systems (WMS) and built upon composable blocks that facilitate task placement and resource allocation for parallel executions on high performance systems [Lee et al. 2021, Merzky et al. 2021]. The scientific computing communities running these kinds of workflows have been slow to adopt Findable, Accessible, Interoperable, and Re-usable (FAIR) principles, in part due to the complexity of workflow life cycles, the numerous WMS, and the specificity of HPC systems with rapidly evolving architectures and software stacks, and execution modes that require resource managers and batch schedulers [Plale et al. 2021]. FAIR Digital Objects (FDO) that encapsulate bit sequences of data, metadata, types and persistent identifiers (PID) can help promote the adoption of FAIR, enable knowledge extraction and dissemination, and contribute to re-use [De Smedt et al. 2020]. As workflows typically use data and software during planning and execution, FDOs are particularly adapted to enable re-use [Wittenburg et al. 2020]. 
But the benefits of FDOs, such as automating data processing and actionable DO collections, cannot be realized without the main components of FAIR, rich metadata and clear identifiers, being universally adopted in the community. These components are still elusive for HPC digital objects. Some metadata are added after results have been produced, are not described by controlled vocabularies, and are typically left unconstrained, resulting in inefficient processes and loss of knowledge. Persistent identifiers are added at the time of publication to data supporting conclusions, so only a very small amount of data is shared outside a small community of researchers “in the know”. In this conceptual work, one can distinguish several kinds of FDOs for HPC workflows that present both common and specific challenges to the development of canonical DO infrastructure and the implementation of FDO workflows that we discuss below: result FDOs, which represent computational results obtained when program execution completes; performance FDOs, which contain performance measures and results from code optimization on parallel, heterogeneous architectures; and intermediate FDOs, which capture intermediate states of workflow execution obtained from HPC checkpointing. All these FDOs for HPC workflows should include the computing environment and system specifications on which code was executed for metadata rich enough to enable re-usability [Pouchard et al. 2019]. Containers are often used to capture dependencies between underlying libraries and versions in the execution environment for the installation and re-use of software code [Lofstead et al. 2015, Olaya et al. 2020]. But containers published in code repositories are made available without identifiers registered with resolvers. For instance, to attribute a Digital Object Identifier to software shared on GitHub, one must perform the additional step of registering the code in Zenodo. 
FDOs extracted and built in the context of a canonical workflow framework including collections will help with the attribution of persistent identifiers and the linking of execution environment with data and workflow. Computational results may include machine learning predictions resulting from stochastic training of non-deterministic models. Neural networks and deep learning models present specific challenges to result FDOs related to provenance and the selection of quantities needed to include in an FDO for the re-use of results. What information needs to be included in a FAIR Digital Object encapsulating deep learning results to make it persistent and re-usable? The description of method, data and experiment recommended in [Gundersen and Kjensmo 2018] can be instantiated in an FDO collection. To make it re-usable, it should include the model architecture, the machine learning platform and its version, a submission script that contains hyperparameters, the loss function, batch size and number of epochs [Pouchard et al. 2020]. Challenges specific to digital objects containing performance measures for HPC workflows are those related to size, selection and reduction. Performance data at scale tends to be very large, thus a principled approach to selection is needed to determine which execution counters must be included in FDOs for performance reproducibility of an application [Patki et al. 2019]. Performance FDOs should include the variables selected to show their impact on performance and the methods used for selection: do such variables represent outliers in performance metrics? What methods and thresholds are used to qualify as outliers, and what impact do these outliers have on overall performance of an execution? A key contributor to the failure to capture important information in HPC workflows is that metadata and provenance capture is often “bolted on” after the fact and in a piecemeal, cumbersome, inefficient manner that impedes further analysis. 
An FDO approach including DO collections at the appropriate level of abstraction and rich metadata is needed. Capturing metadata automatically must take into account the appropriate granularity level for re-use across system layers and abstraction levels. Intermediate FDOs capture and fuse metadata across multiple sources during the planning and execution stages [Nicolae 2022]. Some tools already exist: Darshan is a scalable tool summarizing Input/Output file characteristics [Dai et al. 2019], and Radical Cybertools [Merzky et al. 2021] can produce the provenance task graph of an execution. Such tools could be included in a canonical workflow framework as they present a path forward for composable services for HPC and would guarantee a level of encapsulation into DOs favorable to re-use.

    FALCON: a system for reliable checkpoint recovery in shared grid environments

    In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as their performance degradation is tolerable. However, unpredictable evictions of guest jobs lead to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such failures. Today's FGCS systems often use expensive, high-performance dedicated checkpoint servers. However, in geographically distributed clusters, this may incur high checkpoint transfer latencies. In this paper we present a system called Falcon that uses available disk resources of the FGCS machines as shared checkpoint repositories. However, an unavailable storage host may lead to loss of checkpoint data. Therefore, we model failures of storage hosts and develop a prediction algorithm for choosing reliable checkpoint repositories. We experiment with Falcon in the university-wide Condor testbed at Purdue and show improved and consistent performance for guest jobs in the presence of irregular resource availability.

    EUNOMIA: A Fast and Collaborative Interference Avoidance Technique for Wireless Medical Devices

    Interference has long been a serious problem in wireless communication systems, causing packet loss and degradation in communication quality. The problem becomes life-threatening when it happens with medical devices. The need to address this problem is exacerbated by the increasing use of wireless embedded devices in a range of medical applications. This paper presents an interference avoidance protocol, called Eunomia, that is especially suited for wireless medical devices. Eunomia is based on the notion of dynamic channel switching upon detection of interference. There are three novel features of Eunomia. The first is proactive monitoring of one or more backup channels, and channel switching to a known available backup channel, which results in minimal downtime during channel switching, a necessity for medical devices. The second is exchange of priority information between interfering wireless devices, wherein low-priority devices defer to safety-critical devices and switch channels. The third novelty is the ability to exploit the availability of additional radios on a device to alleviate the overhead associated with interference avoidance. We have implemented and evaluated Eunomia on the TelosB and MicaZ wireless sensor nodes and present experimental results that demonstrate its efficacy.
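
Two of the ideas named in this abstract, switching to a pre-monitored backup channel and having the lower-priority device defer, can be shown in a toy model. This is not the Eunomia protocol: the `Device` class, the `resolve` function, the convention that a larger `priority` value means more safety-critical, and the tie-breaking rule are all assumptions made for illustration, and no radio behaviour is modelled.

```python
from dataclasses import dataclass, field

@dataclass
class Device:
    name: str
    priority: int   # assumed convention: higher means more safety-critical
    channel: int
    backups: list = field(default_factory=list)  # backup channels being monitored

    def switch(self, busy):
        """Move to the first monitored backup channel not known to be busy."""
        for ch in self.backups:
            if ch not in busy:
                self.channel = ch
                return True
        return False  # no free backup channel is known

def resolve(a, b, busy):
    """If a and b share a channel, the lower-priority device switches away.

    Returns the name of the device that moved, or None if nothing changed.
    On equal priority, b is the one asked to move (arbitrary assumption).
    """
    if a.channel != b.channel:
        return None
    loser = a if a.priority < b.priority else b
    moved = loser.switch(set(busy) | {a.channel})
    return loser.name if moved else None

if __name__ == "__main__":
    pump = Device("pump", priority=2, channel=11, backups=[15, 20])
    sensor = Device("sensor", priority=1, channel=11, backups=[15, 20])
    print(resolve(pump, sensor, busy={15}), sensor.channel, pump.channel)
```

Because the backup list is assumed to be kept current by ongoing monitoring, the switch itself is a lookup rather than a scan, which mirrors the abstract's point about minimal downtime during channel switching.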