1,746 research outputs found
Exploiting Fine-Grain Concurrency Analytical Insights in Superscalar Processor Design
This dissertation develops analytical models to provide insight into various design issues associated with superscalar-type processors, i.e., the processors capable of executing multiple instructions per cycle. A survey of the existing machines and literature has been completed with a proposed classification of various approaches for exploiting fine-grain concurrency. Optimization of a single pipeline is discussed based on an analytical model. The model-predicted performance curves are found to be in close proximity to published results using simulation techniques. A model is also developed for comparing different branch strategies for single-pipeline processors in terms of their effectiveness in reducing branch delay. The additional instruction fetch traffic generated by certain branch strategies is also studied and is shown to be a useful criterion for choosing between equally well performing strategies. Next, processors with multiple pipelines are modelled to study the tradeoffs associated with deeper pipelines versus multiple pipelines. The model developed can reveal the cause of performance bottleneck: insufficient resources to exploit discovered parallelism, insufficient instruction stream parallelism, or insufficient scope of concurrency detection. The cost associated with speculative (i.e., beyond basic block) execution is examined via probability distributions that characterize the inherent parallelism in the instruction stream. The throughput prediction of the analytic model is shown, using a variety of benchmarks, to be close to the measured static throughput of the compiler output, under resource and scope constraints. Further experiments provide misprediction delay estimates for these benchmarks under scope constraints, assuming beyond-basic-block, out-of-order execution and run-time scheduling. These results were derived using traces generated by the Multiflow TRACE SCHEDULINGâ„¢(*) compacting C and FORTRAN 77 compilers. A simplified extension to the model to include multiprocessors is also proposed. The extended model is used to analyze combined systems, such as superpipelined multiprocessors and superscalar multiprocessors, both with shared memory. It is shown that the number of pipelines (or processors) at which the maximum throughput is obtained is increasingly sensitive to the ratio of memory access time to network access delay, as memory access time increases. Further, as a function of inter-iteration dependency distance, optimum throughput is shown to vary nonlinearly, whereas the corresponding Optimum number of processors varies linearly. The predictions from the analytical model agree with published results based on simulations. (*)TRACE SCHEDULING is a trademark of Multiflow Computer, Inc
System-level design of energy-efficient sensor-based human activity recognition systems: a model-based approach
This thesis contributes an evaluation of state-of-the-art dataflow models of computation regarding their suitability for a model-based design and analysis of human activity recognition systems, in terms of expressiveness and analyzability, as well as model accuracy. Different aspects of state-of-the-art human activity recognition systems have been modeled and analyzed. Based on existing methods, novel analysis approaches have been developed to acquire extra-functional properties like processor utilization, data communication rates, and finally energy consumption of the system
The "MIND" Scalable PIM Architecture
MIND (Memory, Intelligence, and Network Device) is an advanced parallel computer architecture for high performance computing and scalable embedded processing. It is a
Processor-in-Memory (PIM) architecture integrating both DRAM bit cells and CMOS logic devices on the same silicon die. MIND is multicore with multiple memory/processor nodes on
each chip and supports global shared memory across systems of MIND components. MIND is distinguished from other PIM architectures in that it incorporates mechanisms for efficient support of a global parallel execution model based on the semantics of message-driven multithreaded split-transaction processing. MIND is designed to operate either in conjunction with other conventional microprocessors or in standalone arrays of like devices. It also incorporates mechanisms for fault tolerance, real time execution, and active power management. This paper describes the major elements and operational methods of the MIND
architecture
Power Bounded Computing on Current & Emerging HPC Systems
Power has become a critical constraint for the evolution of large scale High Performance Computing (HPC) systems and commercial data centers. This constraint spans almost every level of computing technologies, from IC chips all the way up to data centers due to physical, technical, and economic reasons. To cope with this reality, it is necessary to understand how available or permissible power impacts the design and performance of emergent computer systems. For this reason, we propose power bounded computing and corresponding technologies to optimize performance on HPC systems with limited power budgets.
We have multiple research objectives in this dissertation. They center on the understanding of the interaction between performance, power bounds, and a hierarchical power management strategy. First, we develop heuristics and application aware power allocation methods to improve application performance on a single node. Second, we develop algorithms to coordinate power across nodes and components based on application characteristic and power budget on a cluster. Third, we investigate performance interference induced by hardware and power contentions, and propose a contention aware job scheduling to maximize system throughput under given power budgets for node sharing system. Fourth, we extend to GPU-accelerated systems and workloads and develop an online dynamic performance & power approach to meet both performance requirement and power efficiency.
Power bounded computing improves performance scalability and power efficiency and decreases operation costs of HPC systems and data centers. This dissertation opens up several new ways for research in power bounded computing to address the power challenges in HPC systems. The proposed power and resource management techniques provide new directions and guidelines to green exscale computing and other computing systems
GPGPU microbenchmarking for irregular application optimization
Irregular applications, such as unstructured mesh operations, do not easily map onto the typical GPU programming paradigms endorsed by GPU manufacturers, which mostly focus on maximizing concurrency for latency hiding. In this work, we show how alternative techniques focused on latency amortization can be used to control overall latency while requiring less concurrency. We used a custom-built microbenchmarking framework to test several GPU kernels and show how the GPU behaves under relevant workloads. We demonstrate that coalescing is not required for efficacious performance; an uncoalesced access pattern can achieve high bandwidth - even over 80% of the theoretical global memory bandwidth in certain circumstances. We also make other further observations on specific relevant behaviors of GPUs. We hope that this study opens the door for further investigation into techniques that can exploit latency amortization when latency hiding does not achieve sufficient performance
Three Highly Parallel Computer Architectures and Their Suitability for Three Representative Artificial Intelligence Problems
Virtually all current Artificial Intelligence (AI) applications are designed to run on sequential (von Neumann) computer architectures. As a result, current systems do not scale up. As knowledge is added to these systems, a point is reached where their performance quickly degrades. The performance of a von Neumann machine is limited by the bandwidth between memory and processor (the von Neumann bottleneck). The bottleneck is avoided by distributing the processing power across the memory of the computer. In this scheme the memory becomes the processor (a smart memory ).
This paper highlights the relationship between three representative AI application domains, namely knowledge representation, rule-based expert systems, and vision, and their parallel hardware realizations. Three machines, covering a wide range of fundamental properties of parallel processors, namely module granularity, concurrency control, and communication geometry, are reviewed: the Connection Machine (a fine-grained SIMD hypercube), DADO (a medium-grained MIMD/SIMD/MSIMD tree-machine), and the Butterfly (a coarse-grained MIMD Butterflyswitch machine)
The generation of concurrent code for declarative languages
PhD ThesisThis thesis presents an approach to the implementation of declarative languages
on a simple, general purpose concurrent architecture. The safe exploitation of
the available concurrency is managed by relatively sophisticated code generation
techniques to transform programs into an intermediate concurrent machine
code. Compilation techniques are discussed for 1'-HYBRID, a strongly typed
applicative language, and for 'c-HYBRID, a concurrent, nondeterministic logic
language.
An approach is presented for 1'- HYBRID whereby the style of programming
influences the concurrency utilised when a program executes. Code transformation
techniques are presented which generalise tail-recursion optimisation,
allowing many recursive functions to be modelled by perpetual processes. A
scheme is also presented to allow parallelism to be increased by the use of local
declarations, and constrained by the use of special forms of identity function.
In order to preserve determinism in the language, a novel fault handling
mechanism is used, whereby exceptions generated at run-time are treated as a
special class of values within the language.
A description is given of ,C-HYBRID, a dialect of the nondeterministic logic
language Concurrent Prolog. The language is embedded within the applicative
language 1'-HYBRID, yielding a combined applicative and logic programming
language. Various cross-calling techniques are described, including the use of
applicative scoping rules to allow local logical assertions.
A description is given of a polymorphic typechecking algorithm for logic
programs, which allows different instances of clauses to unify objects of different
types. The concept of a method is derived to allow unification Information to
be passed as an implicit argument to clauses which require it. In addition,
the typechecking algorithm permits higher-order objects such as functions to be
passed within arguments to clauses.
Using Concurrent Prolog's model of concurrency, techniques are described
which permit compilation of 'c-HYBRID programs to abstract machine code derived
from that used for the applicative language. The use of methods allows
polymorphic logic programs to execute without the need for run-time type information
in data structures.The Science and Engineering Research Council
- …