21 research outputs found

    Domain decomposition, irregular applications, and parallel computers.

    Full text link
    Many large-scale computational problems are based on irregular (unstructured) domains. Some examples are finite element methods in structural analysis, finite volume methods in fluid dynamics, and circuit simulation for VLSI design. Domain decomposition is a common technique for distributing the data and work of irregular scientific applications across a distributed memory parallel machine. To obtain efficiency, subdomains must be constructed such that the work is divided with a reasonable balance among the processors while the communication-causing subdomain boundary is kept small. Application- and machine-specific information can be used in conjunction with domain decomposition to achieve a level of performance not possible with traditional domain decomposition methods. Application profiling characterizes the performance of an application on a specific machine. We present a method that uses curve-fitting of application profile data to calculate vertex and edge weights for use with weighted graph decomposition algorithms. We demonstrate its potential on two routines from a production finite element application running on the IBM SP2. Our method combined with a multilevel spectral algorithm reduced load imbalance from 52% to less than 10% for one routine in our study. Many irregular applications have several phases that must be load balanced individually to achieve high overall application performance. We propose finding one decomposition that can be used effectively for each phase of the application, and introduce a decomposition algorithm which load balances according to two vertex weight sets for use on two-phase applications. We show that this dual weight algorithm can be as successful at load balancing two individual routines together as the traditional single weight algorithm is at load balancing each routine independently.
    Domain decomposition algorithms take a simplistic view of multiprocessor communication. Higher performance can be achieved by considering the communication characteristics of the target multiprocessor in conjunction with decomposition techniques. We provide a methodology for tuning an application for a shared-address space multiprocessor by using intelligent layout of the application data to reduce coherence traffic and employing latency hiding mechanisms to overlap communication with useful work. These techniques have been applied to a finite element radar application running on the Kendall Square KSR1.
    Ph.D. Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/104873/1/9610252.pdf
    Description of 9610252.pdf: Restricted to UM users only
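The curve-fitting step described in the abstract — turning application profile data into vertex weights for a weighted graph partitioner — can be sketched roughly as follows. The profile numbers, the linear cost model, and the `vertex_weight` function are all hypothetical illustrations, not taken from the thesis:

```python
import numpy as np

# Hypothetical profile data: per-vertex work measured on the target
# machine at several sizes (elements touching a vertex vs. time in µs).
elements_per_vertex = np.array([2, 4, 6, 8, 12, 16])
measured_time_us = np.array([5.1, 9.8, 15.2, 20.5, 31.0, 41.7])

# Fit a low-order polynomial to the profile data (degree 1 here,
# assuming roughly linear cost per element on this machine).
coeffs = np.polyfit(elements_per_vertex, measured_time_us, deg=1)
cost_model = np.poly1d(coeffs)

# Evaluate the fitted model to assign an integer vertex weight for a
# weighted graph partitioner (e.g. a multilevel spectral method).
def vertex_weight(n_elements):
    return max(1, int(round(cost_model(n_elements))))

weights = [vertex_weight(n) for n in [3, 7, 10]]
print(weights)
```

Edge weights could be derived the same way from a fitted communication-cost model; the partitioner then balances predicted time rather than raw vertex counts.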

    Challenges to Deploying a Software Ecosystem for Science

    No full text
    Whitepaper for CSESSP'15 workshop. Karen Tomko and Scott Brozell.

    ABSTRACT Domain Decomposition, Irregular Applications, and Parallel Computers

    No full text
    Many large-scale computational problems are based on irregular (unstructured) domains. Some examples are finite element methods in structural analysis, finite volume methods in fluid dynamics, and circuit simulation for VLSI design. Domain decomposition is a common technique for distributing the data and work of irregular scientific applications across a distributed memory parallel machine. To obtain efficiency, subdomains must be constructed such that the work is divided with a reasonable balance among the processors while the communication-causing subdomain boundary is kept small. Application- and machine-specific information can be used in conjunction with domain decomposition to achieve a level of performance not possible with traditional domain decomposition methods. Application profiling characterizes the performance of an application on a specific machine. We present a method that uses curve-fitting of application profile data to calculate vertex and edge weights for use with weighted graph decomposition algorithms. We demonstrate its potential on two routines from a production finite element application running on the IBM SP2. Our method combined with a multilevel spectral algorithm reduced load imbalance from 52% to less than 10% for one routine in our study. Many irregular applications have several phases that must be load balanced individually to achieve high overall application performance. We propose finding one decomposition that can be used effectively for each phase of the application, and introduce a decomposition algorithm which load balances according to two vertex weight sets for use on two-phase applications. We show that this dual weight algorithm can be as successful at load balancing two individual routines together as the traditional single weight algorithm is at load balancing each routine independently.

    Partitioning Regular Applications for Cache-Coherent Multiprocessors

    No full text
    In all massively parallel systems (MPPs), whether message-passing or shared-address space, the memory is physically distributed for scalability and the latency of accessing remote data is orders of magnitude higher than the processor cycle time. Therefore, the programmer/compiler must not only identify parallelism but also specify the distribution of data among the processor memories in order to obtain reasonable efficiency. Shared-address MPPs provide an easier paradigm for programmers than message passing systems since the communication is automatically handled by the hardware and/or operating system. However, it is just as important to optimize the communication in shared-address systems if high performance is to be achieved. Since communication is implied by the data layout and data reference pattern of the application, the data layout scheme and data access pattern must be controlled by the compiler in order to optimize communication. Machine-specific parameters, such as cache size…

    Data and Program Restructuring of Irregular Applications for Cache-Coherent Multiprocessors

    No full text
    Applications with irregular data structures such as sparse matrices or finite element meshes account for a large fraction of engineering and scientific applications. Domain decomposition techniques are commonly used to partition these applications to reduce interprocessor communication on message passing parallel systems. Our work investigates the use of domain decomposition techniques on cache-coherent parallel systems. Many good domain decomposition algorithms are now available. We show that further application improvements are attainable using data and program restructuring in conjunction with domain decomposition. We give techniques for data layout to reduce communication, blocking with subdomains to improve uniprocessor cache behavior, and insertion of prefetches to hide the latency of interprocessor communication. This paper details our restructuring techniques and provides experimental results on the KSR1 multiprocessor for a sparse matrix application. The experimental results ..
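One of the restructuring techniques named above, data layout driven by a domain decomposition, amounts to permuting per-vertex arrays so that each subdomain's data is stored contiguously. A minimal sketch, with a made-up partition and value array (the function names are illustrative, not from the paper):

```python
# Minimal sketch: given a partition assigning each vertex to a
# subdomain, build a permutation that stores each subdomain's vertices
# contiguously, so a processor's working set occupies contiguous
# cache lines and false sharing across subdomains is reduced.
def subdomain_ordering(partition):
    # partition[i] = subdomain id of vertex i
    return sorted(range(len(partition)), key=lambda v: partition[v])

def permute(data, order):
    # Reorder per-vertex data to match the new vertex numbering.
    return [data[v] for v in order]

partition = [1, 0, 1, 0, 2, 1]        # vertex -> subdomain (hypothetical)
values    = [10, 11, 12, 13, 14, 15]  # per-vertex data (hypothetical)

order = subdomain_ordering(partition)
print(order)                   # vertices grouped by subdomain
print(permute(values, order))  # subdomain 0's data first, then 1's, then 2's
```

Blocking within a subdomain and inserting prefetches would build on the same reordered layout; they are not shown here.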

    Data Buffering and Allocation in Mapping Generalized Template Matching on Reconfigurable Systems

    No full text
    Image processing algorithms for 2D digital filtering, morphologic operations, motion estimation, and template matching involve massively parallel computations that can benefit from using reconfigurable systems with massive field programmable gate array (FPGA) hardware resources. In addition, each algorithm can be considered a special case of a "generalized template matching" (GTM) operation. Application performance on reconfigurable computer systems is often limited by the bandwidth to host or off-chip memory. This paper describes the GTM operation and characterizes the data allocation and buffering strategies for GTM operation on reconfigurable computers. Several mechanisms that support different levels of parallelism are proposed and summarized in the paper. Keywords: Template Matching, Configurable Computing, Field Programmable Gate Array (FPGA), Reconfiguration 1 Introduction Computing systems that use co-processor boards based on field programmable gate array (FPGA) chips may a..
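The unifying idea that filtering, morphology, and template matching are all special cases of one sliding-window operation can be sketched in a few lines. This is an illustrative software model of the GTM view, not the paper's FPGA implementation; the function name and reduction operators are assumptions:

```python
# Sketch of the "generalized template matching" (GTM) view: slide a
# k-by-k window over the image and reduce each window with some
# operation. Choosing the operation recovers each special case:
# max -> morphologic dilation, weighted sum -> 2D filtering,
# distance to a template -> template matching.
def gtm(image, k, op):
    h, w = len(image), len(image[0])
    out = []
    for i in range(h - k + 1):
        row = []
        for j in range(w - k + 1):
            window = [image[i + di][j + dj]
                      for di in range(k) for dj in range(k)]
            row.append(op(window))
        out.append(row)
    return out

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]

print(gtm(img, 2, max))                    # morphologic dilation
print(gtm(img, 2, lambda ws: sum(ws) // 4))  # 2x2 box (mean) filter
```

On a reconfigurable system the window buffer is what the paper's data allocation and buffering strategies manage, since every output reuses k*k overlapping inputs.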