39 research outputs found

    Cross-scale efficient tensor contractions for coupled cluster computations through multiple programming model backends

    Full text link
    Coupled-cluster methods provide highly accurate models of molecular structure through explicit numerical calculation of tensors representing the correlation between electrons. These calculations are dominated by a sequence of tensor contractions, motivating the development of numerical libraries for such operations. While based on matrix–matrix multiplication, these libraries are specialized to exploit symmetries in the molecular structure and in electronic interactions, and thus reduce the size of the tensor representation and the complexity of contractions. The resulting algorithms are irregular and their parallelization has been previously achieved via the use of dynamic scheduling or specialized data decompositions. We introduce our efforts to extend the Libtensor framework to work in the distributed memory environment in a scalable and energy-efficient manner. We achieve up to 240Ă— speedup compared with the optimized shared memory implementation of Libtensor. We attain scalability to hundreds of thousands of compute cores on three distributed-memory architectures (Cray XC30 and XC40, and IBM Blue Gene/Q), and on a heterogeneous GPU-CPU system (Cray XK7). As the bottlenecks shift from being compute-bound DGEMM's to communication-bound collectives as the size of the molecular system scales, we adopt two radically different parallelization approaches for handling load-imbalance, tasking and bulk synchronous models. Nevertheless, we preserve a unified interface to both programming models to maintain the productivity of computational quantum chemists

    Task-based Runtime Optimizations Towards High Performance Computing Applications

    Get PDF
    The last decades have witnessed a rapid improvement of computational capabilities in high-performance computing (HPC) platforms thanks to hardware technology scaling. HPC architectures benefit from mainstream advances on the hardware with many-core systems, deep hierarchical memory subsystem, non-uniform memory access, and an ever-increasing gap between computational power and memory bandwidth. This has necessitated continuous adaptations across the software stack to maintain high hardware utilization. In this HPC landscape of potentially million-way parallelism, task-based programming models associated with dynamic runtime systems are becoming more popular, which fosters developers’ productivity at extreme scale by abstracting the underlying hardware complexity. In this context, this dissertation highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by HPC applications., i.e., data redistribution, geospatial modeling and 3D unstructured mesh deformation here. Data redistribution aims to reshuffle data to optimize some objective for an algorithm, whose objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing the efficiency and therefore reducing the time-to-solution for the algorithm. Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Meshing the deformable contour of moving 3D bodies is an expensive operation that can cause huge computational challenges in fluid-structure interaction (FSI) applications. Therefore, in this dissertation, Redistribute-PaRSEC, ExaGeoStat-PaRSEC and HiCMA-PaRSEC are proposed to efficiently tackle these HPC applications respectively at extreme scale, and they are evaluated on multiple HPC clusters, including AMD-based, Intel-based, Arm-based CPU systems and IBM-based multi-GPU system. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware system for servicing the next-generation scientific applications

    An MPI-based 2D data redistribution library for dense arrays

    Get PDF
    In HPC, data redistributions (reorganizations) are used in parallel applications to improve performance and/or provide data-locality compatibility with sequences of parallel operations. Data reorganization refers to changing the logical and physical arrangement of data (such as dense arrays or matrices distributed over peer processes in a parallel program). This operation can be achieved by applying transformations such as transpositions or rotations or changing how data is mapped across the process grid P by Q, all of which are accomplished either with message passing or distributed shared memory operations. In this project, we restrict ourselves to a distributed memory model and message passing, not a shared-memory model, nor do we use distributed shared memory APIs. Our primary goal is to generate a library capable of diverse data reorganizations. We aim to develop a high-level Application Programming Interface (API) that directly works with the Message Passing Interface (MPI) to accomplish data redistributions in data-parallel applications and libraries, such as the polymath library, a library of many algorithms all of which implement parallel dense matrix multiplication algorithms with flexible data layouts. Using the reorganization mode of process grid shapes with constant total processes, we plan to observe the performance trends of the polymath dense parallel matrix-multiplication library (which is related research) based on different grid shapes, problem sizes, numbers of processes and decide whether it is more efficient to redistribute data or not to achieve the highest performance. We will test other redistributions for some of the process shapes used with the polymath library to identify how redistribution impacts performance. These tests will provide us with the information to determine if redistribution is worthwhile for non-optimal process layout (as established with the polymath system). Besides changing the shape of the data in terms of its layout on a logical process topology, we also plan to study and demonstrate data transpose algorithms in this library, another useful redistribution mechanism for dense linear algebra in distributed-memory parallel computing

    GPU Accelerated protocol analysis for large and long-term traffic traces

    Get PDF
    This thesis describes the design and implementation of GPF+, a complete general packet classification system developed using Nvidia CUDA for Compute Capability 3.5+ GPUs. This system was developed with the aim of accelerating the analysis of arbitrary network protocols within network traffic traces using inexpensive, massively parallel commodity hardware. GPF+ and its supporting components are specifically intended to support the processing of large, long-term network packet traces such as those produced by network telescopes, which are currently difficult and time consuming to analyse. The GPF+ classifier is based on prior research in the field, which produced a prototype classifier called GPF, targeted at Compute Capability 1.3 GPUs. GPF+ greatly extends the GPF model, improving runtime flexibility and scalability, whilst maintaining high execution efficiency. GPF+ incorporates a compact, lightweight registerbased state machine that supports massively-parallel, multi-match filter predicate evaluation, as well as efficient arbitrary field extraction. GPF+ tracks packet composition during execution, and adjusts processing at runtime to avoid redundant memory transactions and unnecessary computation through warp-voting. GPF+ additionally incorporates a 128-bit in-thread cache, accelerated through register shuffling, to accelerate access to packet data in slow GPU global memory. GPF+ uses a high-level DSL to simplify protocol and filter creation, whilst better facilitating protocol reuse. The system is supported by a pipeline of multi-threaded high-performance host components, which communicate asynchronously through 0MQ messaging middleware to buffer, index, and dispatch packet data on the host system. The system was evaluated using high-end Kepler (Nvidia GTX Titan) and entry level Maxwell (Nvidia GTX 750) GPUs. The results of this evaluation showed high system performance, limited only by device side IO (600MBps) in all tests. GPF+ maintained high occupancy and device utilisation in all tests, without significant serialisation, and showed improved scaling to more complex filter sets. Results were used to visualise captures of up to 160 GB in seconds, and to extract and pre-filter captures small enough to be easily analysed in applications such as Wireshark

    The Monge array--an abstraction and its applications

    Get PDF
    Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1991.Includes bibliographical references (p. 211-219).by James KIimbrough Park.Ph.D

    How are we doing? A self-assessment of the quality of services and systems at NERSC (2001)

    Full text link

    How Are We Doing? A Self-Assessment of the Quality of Services andSystems at NERSC, 2005-2006

    Full text link

    BASEBAND RADIO MODEM DESIGN USING GRAPHICS PROCESSING UNITS

    Get PDF
    A modern radio or wireless communications transceiver is programmed via software and firmware to change its functionalities at the baseband. However, the actual implementation of the radio circuits relies on dedicated hardware, and the design and implementation of such devices are time consuming and challenging. Due to the need for real-time operation, dedicated hardware is preferred in order to meet stringent requirements on throughput and latency. With increasing need for higher throughput and shorter latency, while supporting increasing bandwidth across a fragmented spectrum, dedicated subsystems are developed in order to service individual frequency bands and specifications. Such a dedicated-hardware-intensive approach leads to high resource costs, including costs due to multiple instantiations of mixers, filters, and samplers. Such increases in hardware requirements in turn increases device size, power consumption, weight, and financial cost. If it can meet the required real-time constraints, a more flexible and reconfigurable design approach, such as a software-based solution, is often more desirable over a dedicated hardware solution. However, significant challenges must be overcome in order to meet constraints on throughput and latency while servicing different frequency bands and bandwidths. Graphics processing unit (GPU) technology provides a promising class of platforms for addressing these challenges. GPUs, which were originally designed for rendering images and video sequences, have been adapted as general purpose high-throughput computation engines for a wide variety of application areas beyond their original target domains. Linear algebra and signal processing acceleration are examples of such application areas. In this thesis, we apply GPUs as software-based, baseband radios and demonstrate novel, software-based implementations of key subsystems in modern wireless transceivers. In our work, we develop novel implementation techniques that allow communication system designers to use GPUs as accelerators for baseband processing functions, including real-time filtering and signal transformations. More specifically, we apply GPUs to accelerate several computationally-intensive, frontend radio subsystems, including filtering, signal mixing, sample rate conversion, and synchronization. These are critical subsystems that must operate in real-time to reliably receive waveforms. The contributions of this thesis can be broadly organized into 3 major areas: (1) channelization, (2) arbitrary resampling, and (3) synchronization. 1. Channelization: a wideband signal is shared between different users and channels, and a channelizer is used to separate the components of the shared signal in the different channels. A channelizer is often used as a pre-processing step in selecting a specific channel-of-interest. A typical channelization process involves signal conversion, resampling, and filtering to reject adjacent channels. We investigate GPU acceleration for a particularly efficient form of channelizer called a polyphase filterbank channelizer, and demonstrate a real-time implementation of our novel channelizer design. 2. Arbitrary resampling: following a channelization process, a signal is often resampled to at least twice the data rate in order to further condition the signal. Since different communication standards require different resampling ratios, it is desirable for a resampling subsystem to support a variety of different ratios. We investigate optimized, GPU-based methods for resampling using polyphase filter structures that are mapped efficiently into GPU hardware. We investigate these GPU implementation techniques in the context of interpolation (integer-factor increases in sampling rate), decimation (integer-factor decreases in sampling rate), and rational resampling. Finally, we demonstrate an efficient implementation of arbitrary resampling using GPUs. This implementation exploits specialized hardware units within the GPU to enable efficient and accurate resampling processes involving arbitrary changes in sample rate. 3. Synchronization: incoming signals in a wireless communications transceiver must be synchronized in order to recover the transmitted data properly from complex channel effects such as thermal noise, fading, and multipath propagation. We investigate timing recovery in GPUs to accelerate the most computationally intensive part of the synchronization process, and correctly align the incoming data symbols in the receiver. Furthermore, we implement fully-parallel timing error detection to accelerate maximum likelihood estimation

    Project and development of hardware accelerators for fast computing in multimedia processing

    Get PDF
    2017 - 2018The main aim of the present research work is to project and develop very large scale electronic integrated circuits, with particular attention to the ones devoted to image processing applications and the related topics. In particular, the candidate has mainly investigated four topics, detailed in the following. First, the candidate has developed a novel multiplier circuit capable of obtaining floating point (FP32) results, given as inputs an integer value from a fixed integer range and a set of fixed point (FI) values. The result has been accomplished exploiting a series of theorems and results on a number theory problem, known as Bachet’s problem, which allows the development of a new Distributed Arithmetic (DA) based on 3’s partitions. This kind of application results very fit for filtering applications working on an integer fixed input range, such in image processing applications, in which the pixels are coded on 8 bits per channel. In fact, in these applications the main problem is related to the high area and power consumption due to the presence of many Multiply and Accumulate (MAC) units, also compromising real-time requirements due to the complexity of FP32 operations. For these reasons, FI implementations are usually preferred, at the cost of lower accuracies. The results for the single multiplier and for a filter of dimensions 3x3 show respectively delay of 2.456 ns and 4.7 ns on FPGA platform and 2.18 ns and 4.426 ns on 90nm std_cell TSMC 90 nm implementation. Comparisons with state-of-the-art FP32 multipliers show a speed increase of up to 94.7% and an area reduction of 69.3% on FPGA platform. ... [edited by Author]XXXI cicl
    corecore