517 research outputs found
Theory and design of portable parallel programs for heterogeneous computing systems and networks
A recurring problem with high-performance computing is that advanced architectures generally achieve only a small fraction of their peak performance on many portions of real applications sets. The Amdahl\u27s law corollary of this is that such architectures often spend most of their time on tasks (codes/algorithms and the data sets upon which they operate) for which they are unsuited. Heterogeneous Computing (HC) is needed in the mid 90\u27s and beyond due to ever increasing super-speed requirements and the number of projects with these requirements. HC is defined as a special form of parallel and distributed computing that performs computations using a single autonomous computer operating in both SIMD and MIMD modes, or using a number of connected autonomous computers. Physical implementation of a heterogeneous network or system is currently possible due to the existing technological advances in networking and supercomputing. Unfortunately, software solutions for heterogeneous computing are still in their infancy. Theoretical models, software tools, and intelligent resource-management schemes need to be developed to support heterogeneous computing efficiently. In this thesis, we present a heterogeneous model of computation which encapsulates all the essential parameters for designing efficient software and hardware for HC. We also study a portable parallel programming tool, called Cluster-M, which implements this model. Furthermore, we study and analyze the hardware and software requirements of HC and show that, Cluster-M satisfies the requirements of HC environments
Customizing the Computation Capabilities of Microprocessors.
Designers of microprocessor-based systems must constantly improve
performance and increase computational efficiency in their designs to
create value. To this end, it is increasingly common to see
computation accelerators in general-purpose processor
designs. Computation accelerators collapse portions of an
application's dataflow graph, reducing the critical path of
computations, easing the burden on processor resources, and reducing
energy consumption in systems. There are many problems associated with
adding accelerators to microprocessors, though. Design of
accelerators, architectural integration, and software support all
present major challenges.
This dissertation tackles these challenges in the context of
accelerators targeting acyclic and cyclic patterns of
computation. First, a technique to identify critical computation
subgraphs within an application set is presented. This technique is
hardware-cognizant and effectively generates a set of instruction set
extensions given a domain of target applications. Next, several
general-purpose accelerator structures are quantitatively designed
using critical subgraph analysis for a broad application set.
The next challenge is architectural integration of
accelerators. Traditionally, software invokes accelerators by
statically encoding new instructions into the application binary. This
is incredibly costly, though, requiring many portions of hardware and
software to be redesigned. This dissertation develops strategies to
utilize accelerators, without changing the instruction set. In the
proposed approach, the microarchitecture translates applications at
run-time, replacing computation subgraphs with microcode to utilize
accelerators. We explore the tradeoffs in performing difficult aspects
of the translation at compile-time, while retaining run-time
replacement. This culminates in a simple microarchitectural interface
that supports a plug-and-play model for integrating accelerators into
a pre-designed microprocessor.
Software support is the last challenge in dealing with computation
accelerators. The primary issue is difficulty in generating
high-quality code utilizing accelerators. Hand-written assembly code
is standard in industry, and if compiler support does exist, simple
greedy algorithms are common. In this work, we investigate more
thorough techniques for compiling for computation accelerators. Where
greedy heuristics only explore one possible solution, the techniques
in this dissertation explore the entire design space, when
possible. Intelligent pruning methods ensure that compilation is both
tractable and scalable.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/57633/2/ntclark_1.pd
Compiler and Architecture Design for Coarse-Grained Programmable Accelerators
abstract: The holy grail of computer hardware across all market segments has been to sustain performance improvement at the same pace as silicon technology scales. As the technology scales and the size of transistors shrinks, the power consumption and energy usage per transistor decrease. On the other hand, the transistor density increases significantly by technology scaling. Due to technology factors, the reduction in power consumption per transistor is not sufficient to offset the increase in power consumption per unit area. Therefore, to improve performance, increasing energy-efficiency must be addressed at all design levels from circuit level to application and algorithm levels.
At architectural level, one promising approach is to populate the system with hardware accelerators each optimized for a specific task. One drawback of hardware accelerators is that they are not programmable. Therefore, their utilization can be low as they perform one specific function. Using software programmable accelerators is an alternative approach to achieve high energy-efficiency and programmability. Due to intrinsic characteristics of software accelerators, they can exploit both instruction level parallelism and data level parallelism.
Coarse-Grained Reconfigurable Architecture (CGRA) is a software programmable accelerator consists of a number of word-level functional units. Motivated by promising characteristics of software programmable accelerators, the potentials of CGRAs in future computing platforms is studied and an end-to-end CGRA research framework is developed. This framework consists of three different aspects: CGRA architectural design, integration in a computing system, and CGRA compiler. First, the design and implementation of a CGRA and its instruction set is presented. This design is then modeled in a cycle accurate system simulator. The simulation platform enables us to investigate several problems associated with a CGRA when it is deployed as an accelerator in a computing system. Next, the problem of mapping a compute intensive region of a program to CGRAs is formulated. From this formulation, several efficient algorithms are developed which effectively utilize CGRA scarce resources very well to minimize the running time of input applications. Finally, these mapping algorithms are integrated in a compiler framework to construct a compiler for CGRADissertation/ThesisDoctoral Dissertation Computer Science 201
Application of hypergraphs in decomposition of discrete systems
seria: Lecture Notes in Control and Computer Science ; vol. 23
Scotch and libScotch 5.1 User's Guide
127 pagesUser's manualThis document describes the capabilities and operations of Scotch and libScotch, a software package and a software library devoted to static mapping, partitioning, and sparse matrix block ordering of graphs and meshes/hypergraphs. It gives brief descriptions of the algorithms, details the input/output formats, instructions for use, installation procedures, and provides a number of examples. Scotch is distributed as free/libre software, and has been designed such that new partitioning or ordering methods can be added in a straightforward manner. It can therefore be used as a testbed for the easy and quick coding and testing of such new methods, and may also be redistributed, as a library, along with third-party software that makes use of it, either in its original or in updated forms
Weighted graph matching approaches to structure comparison and alignment and their application to biological problems
In pattern recognition and machine learning, comparing and contrasting are the most fundamental operations: from similarities we derive common rules encoded in the systems, while from difference we infer what makes each system unique. The biological sciences are not an exception to these operations and, in fact, rely heavily on their use. More recently, the emergence of high-throughput measurement technologies has highlighted the need for novel approaches capable of enhancing our ability to understand complex relationships in these data sets. Often, these relationships can be best represented using graphs (or networks), where nodes are biochemical components such as genes, RNAs, proteins or metabolites, and edges indicate the types (and often quality) of relationship. Comparison of relationships is generally performed by aligning the networks of interest. For example, for protein-protein interaction (PPI) networks, the goal of network alignment is to find mappings between nodes (proteins) which are highly useful in identifying signaling pathways or protein complexes and to annotate genes of unknown functionality from subnetworks conserved across multiple species. Phylogenetic trees are also graph structures that describe evolutionary relationship among groups of organisms and their hypothetical ancestors. As it has been shown in a large volume of previous work, comparison of trees also opens the possibility of supporting or building new evolutionary hypotheses: for example, the detection of host-parasite symbiosis, gene coevolution as a signal of physical interactions among genes, or nonstandard events such as horizontal gene transfer. The goal of this thesis is to develop and implement a flexible set of algorithms and methodologies that can be used for the alignment of trees and/or networks having various sizes and properties. We first define a new relaxed model of graph isomorphism in which the shortest path lengths are preserved between corresponding intra-node pairs. Then, based on Google's PageRank model, we present a new tree matching approach, phyloAligner, which resolves several weakness of previous approaches. We further generalize this tree matching algorithm to a broader flexible framework, MCS-Finder, as a scalable and error-tolerant approximation for identifying the maximum common substructure between weighted graphs or distance matrices of different sizes. For phylogenetic trees with weighted edges and strictly-labeled nodes, multidimensional scaling-based methods, xCEED, can effectively evaluate the structural similarity and identify which regions are congruent/incongruent. These methods successfully detected coevolutionary signals as well as nonstandard evolutionary events such as horizontal gene transfer, and recovered interaction specificity between multigene families
On the Feasibility and Limitations of Just-in-Time Instruction Set Extension for FPGA-Based Reconfigurable Processors
Reconfigurable instruction set processors provide the possibility of tailor the instruction set of a CPU to a particular application. While this customization process could be performed during runtime in order to adapt the CPU to the currently executed workload, this use case has been hardly investigated. In this paper, we study the feasibility of moving the customization process to runtime and evaluate the relation of the expected speedups and the associated overheads. To this end, we present a tool flow that is tailored to the requirements of this just-in-time ASIP specialization scenario. We evaluate our methods by targeting our previously introduced Woolcano reconfigurable ASIP architecture for a set of applications from the SPEC2006, SPEC2000, MiBench, and SciMark2 benchmark suites. Our results show that just-in-time ASIP specialization is promising for embedded computing applications, where average speedups of 5x can be achieved by spending 50 minutes for custom instruction identification and hardware generation. These overheads will be compensated if the applications execute for more than 2 hours. For the scientific computing benchmarks, the achievable speedup is only 1.2x, which requires significant execution times in the order of days to amortize the overheads
- …