1,786 research outputs found

    Many-Task Computing and Blue Waters

    Full text link
    This report discusses many-task computing (MTC) generically and in the context of the proposed Blue Waters systems, which is planned to be the largest NSF-funded supercomputer when it begins production use in 2012. The aim of this report is to inform the BW project about MTC, including understanding aspects of MTC applications that can be used to characterize the domain and understanding the implications of these aspects to middleware and policies. Many MTC applications do not neatly fit the stereotypes of high-performance computing (HPC) or high-throughput computing (HTC) applications. Like HTC applications, by definition MTC applications are structured as graphs of discrete tasks, with explicit input and output dependencies forming the graph edges. However, MTC applications have significant features that distinguish them from typical HTC applications. In particular, different engineering constraints for hardware and software must be met in order to support these applications. HTC applications have traditionally run on platforms such as grids and clusters, through either workflow systems or parallel programming systems. MTC applications, in contrast, will often demand a short time to solution, may be communication intensive or data intensive, and may comprise very short tasks. Therefore, hardware and software for MTC must be engineered to support the additional communication and I/O and must minimize task dispatch overheads. The hardware of large-scale HPC systems, with its high degree of parallelism and support for intensive communication, is well suited for MTC applications. However, HPC systems often lack a dynamic resource-provisioning feature, are not ideal for task communication via the file system, and have an I/O system that is not optimized for MTC-style applications. Hence, additional software support is likely to be required to gain full benefit from the HPC hardware

    Couplers for linking environmental models: scoping study and potential next steps

    Get PDF
    This report scopes out what couplers there are available in the hydrology and atmospheric modelling fields. The work reported here examines both dynamic runtime and one way file based coupling. Based on a review of the peer-reviewed literature and other open sources, there are a plethora of coupling technologies and standards relating to file formats. The available approaches have been evaluated against criteria developed as part of the DREAM project. Based on these investigations, the following recommendations are made: • The most promising dynamic coupling technologies for use within BGS are OpenMI 2.0 and CSDMS (either 1.0 or 2.0) • Investigate the use of workflow engines: Trident and Pyxis, the latter as part of the TSB/AHRC project “Confluence” • There is a need to include database standards CSW and GDAL and use data formats from the climate community NetCDF and CF standards. • Development of a “standard” composition which will consist of two process models and a 3D geological model all linked to data stored in the BGS corporate database and flat file format. Web Feature Services should be included in these compositions. There is also a need to investigate other approaches in different disciplines: The Loss Modelling Framework, OASIS-LMF is the best candidate

    Helmholtz Portfolio Theme Large-Scale Data Management and Analysis (LSDMA)

    Get PDF
    The Helmholtz Association funded the "Large-Scale Data Management and Analysis" portfolio theme from 2012-2016. Four Helmholtz centres, six universities and another research institution in Germany joined to enable data-intensive science by optimising data life cycles in selected scientific communities. In our Data Life cycle Labs, data experts performed joint R&D together with scientific communities. The Data Services Integration Team focused on generic solutions applied by several communities

    An Introduction to Programming for Bioscientists: A Python-based Primer

    Full text link
    Computing has revolutionized the biological sciences over the past several decades, such that virtually all contemporary research in the biosciences utilizes computer programs. The computational advances have come on many fronts, spurred by fundamental developments in hardware, software, and algorithms. These advances have influenced, and even engendered, a phenomenal array of bioscience fields, including molecular evolution and bioinformatics; genome-, proteome-, transcriptome- and metabolome-wide experimental studies; structural genomics; and atomistic simulations of cellular-scale molecular assemblies as large as ribosomes and intact viruses. In short, much of post-genomic biology is increasingly becoming a form of computational biology. The ability to design and write computer programs is among the most indispensable skills that a modern researcher can cultivate. Python has become a popular programming language in the biosciences, largely because (i) its straightforward semantics and clean syntax make it a readily accessible first language; (ii) it is expressive and well-suited to object-oriented programming, as well as other modern paradigms; and (iii) the many available libraries and third-party toolkits extend the functionality of the core language into virtually every biological domain (sequence and structure analyses, phylogenomics, workflow management systems, etc.). This primer offers a basic introduction to coding, via Python, and it includes concrete examples and exercises to illustrate the language's usage and capabilities; the main text culminates with a final project in structural bioinformatics. A suite of Supplemental Chapters is also provided. Starting with basic concepts, such as that of a 'variable', the Chapters methodically advance the reader to the point of writing a graphical user interface to compute the Hamming distance between two DNA sequences.Comment: 65 pages total, including 45 pages text, 3 figures, 4 tables, numerous exercises, and 19 pages of Supporting Information; currently in press at PLOS Computational Biolog

    Scaling full seismic waveform inversions

    Get PDF
    The main goal of this research study is to scale full seismic waveform inversions using the adjoint-state method to the data volumes that are nowadays available in seismology. Practical issues hinder the routine application of this, to a certain extent theoretically well understood, method. To a large part this comes down to outdated or flat out missing tools and ways to automate the highly iterative procedure in a reliable way. This thesis tackles these issues in three successive stages. It first introduces a modern and properly designed data processing framework sitting at the very core of all the consecutive developments. The ObsPy toolkit is a Python library providing a bridge for seismology into the scientific Python ecosystem and bestowing seismologists with effortless I/O and a powerful signal processing library, amongst other things. The following chapter deals with a framework designed to handle the specific data management and organization issues arising in full seismic waveform inversions, the Large-scale Seismic Inversion Framework. It has been created to orchestrate the various pieces of data accruing in the course of an iterative waveform inversion. Then, the Adaptable Seismic Data Format, a new, self-describing, and scalable data format for seismology is introduced along with the rationale why it is needed for full waveform inversions in particular and seismology in general. Finally, these developments are put into service to construct a novel full seismic waveform inversion model for elastic subsurface structure beneath the North American continent and the Northern Atlantic well into Europe. The spectral element method is used for the forward and adjoint simulations coupled with windowed time-frequency phase misfit measurements. Later iterations use 72 events, all happening after the USArray project has commenced, resulting in approximately 150`000 three components recordings that are inverted for. 20 L-BFGS iterations yield a model that can produce complete seismograms at a period range between 30 and 120 seconds while comparing favorably to observed data

    Evaluating and Enabling Scalable High Performance Computing Workloads on Commercial Clouds

    Get PDF
    Performance, usability, and accessibility are critical components of high performance computing (HPC). Usability and performance are especially important to academic researchers as they generally have little time to learn a new technology and demand a certain type of performance in order to ensure the quality and quantity of their research results. We have observed that while not all workloads run well in the cloud, some workloads perform well. We have also observed that although commercial cloud adoption by industry has been growing at a rapid pace, its use by academic researchers has not grown as quickly. We aim to help close this gap and enable researchers to utilize the commercial cloud more efficiently and effectively. We present our results on architecting and benchmarking an HPC environment on Amazon Web Services (AWS) where we observe that there are particular types of applications that are and are not suited for the commercial cloud. Then, we present our results on architecting and building a provisioning and workflow management tool (PAW), where we developed an application that enables a user to launch an HPC environment in the cloud, execute a customizable workflow, and after the workflow has completed delete the HPC environment automatically. We then present our results on the scalability of PAW and the commercial cloud for compute intensive workloads by deploying a 1.1 million vCPU cluster. We then discuss our research into the feasibility of utilizing commercial cloud infrastructure to help tackle the large spikes and data-intensive characteristics of Transportation Cyberphysical Systems (TCPS) workloads. Then, we present our research in utilizing the commercial cloud for urgent HPC applications by deploying a 1.5 million vCPU cluster to process 211TB of traffic video data to be utilized by first responders during an evacuation situation. Lastly, we present the contributions and conclusions drawn from this work

    Raster Time Series: Learning and Processing

    Get PDF
    As the amount of remote sensing data is increasing at a high rate, due to great improvements in sensor technology, efficient processing capabilities are of utmost importance. Remote sensing data from satellites is crucial in many scientific domains, like biodiversity and climate research. Because weather and climate are of particular interest for almost all living organisms on earth, the efficient classification of clouds is one of the most important problems. Geostationary satellites such as Meteosat Second Generation (MSG) offer the only possibility to generate long-term cloud data sets with high spatial and temporal resolution. This work, therefore, addresses research problems on efficient and parallel processing of MSG data to enable new applications and insights. First, we address the lack of a suitable processing chain to generate a long-term Fog and Low Stratus (FLS) time series. We present an efficient MSG data processing chain that processes multiple tasks simultaneously, and raster data in parallel using the Open Computing Language (OpenCL). The processing chain delivers a uniform FLS classification that combines day and night approaches in a single method. As a result, it is possible to calculate a year of FLS rasters quite easy. The second topic presents the application of Convolutional Neural Networks (CNN) for cloud classification. Conventional approaches to cloud detection often only classify single pixels and ignore the fact that clouds are highly dynamic and spatially continuous entities. Therefore, we propose a new method based on deep learning. Using a CNN image segmentation architecture, the presented Cloud Segmentation CNN (CS-CNN) classifies all pixels of a scene simultaneously. We show that CS-CNN is capable of processing multispectral satellite data to identify continuous phenomena such as highly dynamic clouds. The proposed approach provides excellent results on MSG satellite data in terms of quality, robustness, and runtime, in comparison to Random Forest (RF), another widely used machine learning method. Finally, we present the processing of raster time series with a system for Visualization, Transformation, and Analysis (VAT) of spatio-temporal data. It enables data-driven research with explorative workflows and uses time as an integral dimension. The combination of various raster and vector data time series enables new applications and insights. We present an application that combines weather information and aircraft trajectories to identify patterns in bad weather situations
    • …
    corecore