23,773 research outputs found
Efficient algorithms for reconfiguration in VLSI/WSI arrays
The issue of developing efficient algorithms for reconfiguring processor arrays in the presence of faulty processors and fixed hardware resources is discussed. The models discussed consist of a set of identical processors embedded in a flexible interconnection structure that is configured in the form of a rectangular grid. An array grid model based on single-track switches is considered. An efficient polynomial-time algorithm is proposed for determining feasible reconfigurations for an array with a given distribution of faulty processors. In the process, it is shown that the set of conditions in the reconfigurability theorem is not necessary. A polynomial-time algorithm is developed for finding feasible reconfigurations in an augmented single-track model and in array grid models with multiple-track switches.
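The abstract does not spell out the single-track algorithm, but the feasibility question it answers can be illustrated with a classical, generic formulation: decide whether every faulty processor can be assigned a distinct spare by maximum bipartite matching. The sketch below is hypothetical (the names assign_spares, faulty, and candidates are invented) and is not the paper's algorithm.

    def assign_spares(faulty, candidates):
        """candidates[f] lists the spares reachable from faulty processor f.
        Returns True if every faulty processor gets its own spare."""
        match = {}                       # spare -> faulty processor using it

        def augment(f, seen):
            # Augmenting-path search (Kuhn's algorithm).
            for s in candidates[f]:
                if s in seen:
                    continue
                seen.add(s)
                # Take a free spare, or evict and re-route its current user.
                if s not in match or augment(match[s], seen):
                    match[s] = f
                    return True
            return False

        return all(augment(f, set()) for f in faulty)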
Rectilinear partitioning of irregular data parallel computations
New mapping algorithms are described for domain-oriented data-parallel computations where the workload is distributed irregularly throughout the domain but exhibits localized communication patterns. Researchers consider the problem of partitioning the domain for parallel processing in such a way that the workload on the most heavily loaded processor is minimized, subject to the constraint that the partition be perfectly rectilinear. Rectilinear partitions are useful on architectures that have a fast local mesh network. Discussed here are an improved algorithm for finding the optimal partitioning in one dimension, new algorithms for partitioning in two dimensions, and optimal partitioning in three dimensions. The application of these algorithms to real problems is discussed.
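As one concrete illustration of the one-dimensional case, a standard bottleneck formulation (not necessarily the authors' improved algorithm) is a parametric search: binary-search the maximum allowed block load and greedily test whether p contiguous blocks suffice. The function names and the integer-workload assumption below are mine.

    def min_bottleneck_partition(loads, p):
        """Minimal achievable max block load for p contiguous blocks,
        assuming integer per-column workloads."""
        def feasible(cap):
            # Greedy: pack columns left to right, cutting when cap is hit.
            blocks, current = 1, 0
            for w in loads:
                if current + w > cap:
                    blocks += 1
                    current = 0
                current += w
            return blocks <= p

        lo, hi = max(loads), sum(loads)
        while lo < hi:
            mid = (lo + hi) // 2
            if feasible(mid):
                hi = mid        # cap works; try a tighter bound
            else:
                lo = mid + 1    # cap too small
        return lo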
The finite element machine: An experiment in parallel processing
The finite element machine is a prototype computer designed to support parallel solutions to structural analysis problems. The hardware architecture and support software for the machine, initial solution algorithms and test applications, and preliminary results are described.
A Parallel Adaptive P3M code with Hierarchical Particle Reordering
We discuss the design and implementation of HYDRA_OMP, a parallel implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M) code HYDRA. The code is designed primarily for conducting cosmological hydrodynamic simulations and is written in Fortran77+OpenMP. A number of optimizations for RISC processors and SMP-NUMA architectures have been implemented, the most important being the hierarchical reordering of particles within chaining cells, which greatly improves data locality and thereby removes the cache misses typically associated with linked lists. Parallel scaling is good, with a minimum parallel scaling efficiency of 73% achieved on 32 nodes across a variety of modern SMP architectures. We give performance data in terms of the number of particle updates per second, which is a more useful performance metric than raw MFlops. A basic version of the code will be made available to the community in the near future. (Comment: 34 pages, 12 figures; accepted for publication in Computer Physics Communications.)
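A minimal sketch of the reordering idea: sort particles by a flattened chaining-cell index so that members of the same cell sit contiguously in memory and can be traversed sequentially instead of through a linked list. This is an illustration, not the Fortran77+OpenMP code; all names are invented.

    import numpy as np

    def reorder_by_cell(positions, cell_size, grid_dims):
        """Sort an (N, 3) array of particle positions by chaining cell."""
        nx, ny, nz = grid_dims
        ix = (positions[:, 0] // cell_size).astype(int)
        iy = (positions[:, 1] // cell_size).astype(int)
        iz = (positions[:, 2] // cell_size).astype(int)
        cell_id = (ix * ny + iy) * nz + iz      # flattened cell index
        order = np.argsort(cell_id, kind="stable")
        return positions[order], cell_id[order]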
Verified AIG Algorithms in ACL2
And-Inverter Graphs (AIGs) are a popular way to represent Boolean functions (like circuits). AIG simplification algorithms can dramatically reduce an AIG, and play an important role in modern hardware verification tools like equivalence checkers. In practice, these tricky algorithms are implemented with optimized C or C++ routines with no guarantee of correctness. Meanwhile, many interactive theorem provers can now employ SAT or SMT solvers to automatically solve finite goals, but no theorem prover makes use of these advanced, AIG-based approaches.

We have developed two ways to represent AIGs within the ACL2 theorem prover. One representation, Hons-AIGs, is especially convenient to use and reason about. The other, Aignet, is the opposite; it is styled after modern AIG packages and allows for efficient algorithms. We have implemented functions for converting between these representations, random vector simulation, conversion to CNF, etc., and developed reasoning strategies for verifying these algorithms.

Aside from these contributions towards verifying AIG algorithms, this work has an immediate, practical benefit for ACL2 users who are using GL to bit-blast finite ACL2 theorems: they can now optionally trust an off-the-shelf SAT solver to carry out the proof, instead of using the built-in BDD package. Looking to the future, it is a first step toward implementing verified AIG simplification algorithms that might further improve GL performance. (Comment: In Proceedings ACL2 2013, arXiv:1304.712)
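To make the representation concrete, here is a toy hash-consed AIG in the general style of modern AIG packages: literals are integers whose low bit marks complementation, and structurally identical AND nodes share one entry. This is only an illustration and does not mirror the Hons-AIG or Aignet definitions; the class and method names are invented.

    class Aig:
        def __init__(self):
            self.nodes = [(0, 0)]   # node 0 reserved for constant false
            self.strash = {}        # structural-hashing table: fanins -> literal

        @staticmethod
        def neg(lit):
            return lit ^ 1          # toggle the complement bit

        def new_input(self):
            self.nodes.append(None)             # inputs have no fanins
            return (len(self.nodes) - 1) << 1   # positive literal

        def new_and(self, a, b):
            if a > b:
                a, b = b, a         # canonical fanin order
            key = (a, b)
            if key in self.strash:
                return self.strash[key]         # reuse existing node
            self.nodes.append(key)
            lit = (len(self.nodes) - 1) << 1
            self.strash[key] = lit
            return lit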
Integration of tools for the Design and Assessment of High-Performance, Highly Reliable Computing Systems (DAHPHRS), phase 1
Systems for Space Defense Initiative (SDI) space applications typically require both high performance and very high reliability. These requirements present the systems engineer evaluating such systems with the extremely difficult problem of conducting performance and reliability trade-offs over large design spaces. A controlled development process supported by appropriate automated tools must be used to assure that the system will meet design objectives. This report describes an investigation of methods, tools, and techniques necessary to support performance and reliability modeling for SDI systems development. Models of the JPL Hypercubes, the Encore Multimax, and the C.S. Draper Lab Fault-Tolerant Parallel Processor (FTPP) parallel-computing architectures, using candidate SDI weapons-to-target assignment algorithms as workloads, were built and analyzed as a means of identifying the necessary system models, how the models interact, and what experiments and analyses should be performed. As a result of this effort, weaknesses in the existing methods and tools were revealed and capabilities that will be required for both individual tools and an integrated toolset were identified.
Distributed memory compiler methods for irregular problems: Data copy reuse and runtime partitioning
Outlined here are two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on an iPSC/860 to demonstrate the usefulness of our methods.
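The copy-reuse idea can be illustrated with a hedged sketch: cache gathered off-processor values keyed by the requesting index set, and reuse them across loops until owned data changes. All names below are invented, and a simple version counter stands in for whatever validity tracking the actual compiler performs.

    class GhostCache:
        def __init__(self):
            self.copies = {}     # index-set key -> (version, gathered values)
            self.version = 0     # bumped whenever owned data is modified

        def gather(self, key, fetch):
            entry = self.copies.get(key)
            if entry is not None and entry[0] == self.version:
                return entry[1]          # reuse the stored off-processor copy
            values = fetch()             # actual communication happens here
            self.copies[key] = (self.version, values)
            return values

        def invalidate(self):
            self.version += 1            # all stored copies are now stale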
Executing matrix multiply on a process oriented data flow machine
The Process-Oriented Dataflow System (PODS) is an execution model that combines the von Neumann and dataflow models of computation to gain the benefits of each. Central to PODS is the concept of array distribution and its effects on the partitioning and mapping of processes.

In PODS, arrays are partitioned by simply assigning consecutive elements to each processing element (PE) equally. Since PODS uses single assignment, there is only one producer of each element. This producing PE owns that element and performs the computations necessary to assign it. Using this approach, the filling loop is distributed across the PEs. This simple partitioning and mapping scheme provides excellent results for executing scientific code on MIMD machines. In this way PODS allows MIMD machines to exploit vector and data parallelism easily while still providing the flexibility of MIMD over SIMD for multi-user systems.

In this paper, the classic matrix multiply algorithm, with 1024 data points, is executed on a PODS simulator and the results are presented and discussed. Matrix multiply is a good example because it has several interesting properties: there are multiple code-blocks; a new array must be dynamically allocated and distributed; there is a loop-carried dependency in the innermost loop; the two input arrays have different access patterns; and the sizes of the input arrays are not known at compile time. Matrix multiply also forms the basis for many important scientific algorithms such as LU decomposition, convolution, and the Fast Fourier Transform.

The results show that PODS is comparable to both Iannucci's Hybrid Architecture and MIT's TTDA in terms of overhead and instruction power. They also show that PODS easily distributes the workload evenly across the PEs. The key result is that PODS can scale matrix multiply in a near-linear fashion until there is little or no work to be performed for each PE; beyond that point, overhead and message passing become a major component of the execution time. With larger problems (e.g., 16k or more data points) this limit would be reached at around 256 PEs.
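A hedged sketch of the ownership rule described above, assuming a contiguous block distribution of result rows (owner-computes): each PE owns an equal block of consecutive rows and computes exactly those. The helper names are invented; this is not the PODS simulator.

    def owned_rows(n_rows, pe, n_pes):
        """Contiguous block of row indices owned by processing element pe."""
        block = (n_rows + n_pes - 1) // n_pes
        return range(pe * block, min((pe + 1) * block, n_rows))

    def local_matmul(a, b, pe, n_pes):
        """Partial matrix product: only the rows of C that pe owns."""
        n, m, k = len(a), len(b[0]), len(b)
        c = {}
        for i in owned_rows(n, pe, n_pes):
            c[i] = [sum(a[i][x] * b[x][j] for x in range(k))
                    for j in range(m)]
        return c        # this PE's contribution to the result array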