Exploiting tightly-coupled cores
As we move steadily through the multicore era, and the number of processing cores on each chip continues to rise, parallel computation becomes increasingly important. However, parallelising an application is often difficult because of dependencies between different regions of code which require cores to communicate. Communication is usually slow compared to computation, and so restricts the opportunities for profitable parallelisation. In this work, I explore the opportunities provided when communication between cores has a very low latency and low energy cost. I observe that there are many different ways in which multiple cores can be used to execute a program, allowing more parallelism to be exploited in more situations, and also providing energy savings in some cases. Individual cores can be made very simple and efficient because they do not need to exploit parallelism internally. The communication patterns between cores can be updated frequently to reflect the parallelism available at the time, allowing better utilisation than specialised hardware which is used infrequently.
In this dissertation I introduce Loki: a homogeneous, tiled architecture made up of many simple, tightly-coupled cores. I demonstrate the benefits in both performance and energy consumption which can be achieved with this arrangement, and observe that it is also likely to have lower design and validation costs and be easier to optimise. I then determine exactly where the performance bottlenecks of the design are and where the energy is consumed, and look into some more advanced optimisations which can make parallelism even more profitable.
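To make the communication-cost argument concrete, here is a toy Python cost model (my own illustration with assumed cycle counts, not figures from the dissertation) showing how inter-core latency bounds the profit of splitting work across cores:

    # Toy model: splitting `compute` cycles of work across `cores` cores, where
    # each core pays `comm` cycles to exchange data with its neighbours.
    def speedup(compute, comm, cores):
        parallel_time = compute / cores + comm
        return compute / parallel_time

    # Assumed latencies: 200 cycles for conventional shared-memory communication,
    # 2 cycles for tightly-coupled cores of the kind the dissertation explores.
    for comm in (200, 2):
        print(f"comm = {comm:>3} cycles -> {speedup(1000, comm, 8):.1f}x on 8 cores")

With slow communication the same unit of work is barely worth distributing (about 3.1x here); with cheap communication the speedup approaches the core count (about 7.9x), which is the regime Loki targets.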
Architectural Implications of Automatic Parallelization With HELIX-RC
As classic Dennard process scaling fades into the past, power density concerns have driven modern CPU designs to de-emphasize the pursuit of single-thread performance, focusing instead on increasing the number of cores in a chip. Computing throughput on a modern chip continues to improve, since multiple programs can run in parallel, but the performance of single programs improves only incrementally. Many compilers have been designed to automatically parallelize sequentially written programs by leveraging multiple cores for the same task, thereby enabling continued single-thread performance gains. One such compiler is HELIX, which can increase the performance of a mixture of SPECfp and SPECint benchmarks by 2X on a 6-core Nehalem CPU.
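The difficulty such a compiler confronts can be seen in a loop with a cross-iteration dependence. The sketch below is a generic Python illustration of mine, not code from the SPEC benchmarks:

    # Iteration i needs `running` from iteration i-1 (a loop-carried dependence),
    # so iterations cannot simply be split across cores: each core would have to
    # wait for its predecessor's value. The rest of the body is independent work.
    def sequential(data):
        running, out = 0, []
        for x in data:
            running += x             # must be communicated between iterations
            out.append(running * x)  # independent work, parallelizable
        return out

A parallelizing compiler can distribute the iterations across cores and forward only the small dependent value between them; how quickly that value travels then dictates the achievable speedup, which motivates the ring cache described next.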
Previous approaches to automatically parallelizing irregular programs have focused on removing apparent dependences through thread-level speculation, which limits the type of code that can be targeted. In contrast, this dissertation increases the amount of code that can be parallelized by addressing the specific communication demands of that code. The dissertation proposes a special-purpose extension of the cache hierarchy, called ring cache, to greatly reduce the perceived communication latency between cores running an automatically parallelized program. This co-design of ring cache and the HELIX compiler, called HELIX-RC, increases the speedup of 10 SPEC benchmarks running on 16 simulated in-order cores from an average of 2X to an average of over 8X. Speedups are slightly reduced to 7X on out-of-order cores, which extract instruction-level parallelism on their own. A fully synthesized Verilog implementation of ring cache is evaluated and shown to consume less than 25 mW of power with an area of less than 0.275 square millimeters.
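As a rough intuition for why a ring helps, compare the latency of passing a loop-carried value through a shared cache level with forwarding it one neighbour per cycle around a ring. The numbers below are assumptions for illustration, not measurements from the dissertation:

    CORES = 16
    SHARED_CACHE_ROUND_TRIP = 40   # assumed: cycles to write then re-read a shared level
    RING_HOP = 1                   # assumed: cycles per neighbour-to-neighbour forward

    def shared_cache_latency(src, dst):
        return SHARED_CACHE_ROUND_TRIP           # every transfer pays the full round trip

    def ring_latency(src, dst):
        return ((dst - src) % CORES) * RING_HOP  # value is forwarded hop by hop

    print(shared_cache_latency(0, 1), ring_latency(0, 1))  # 40 vs 1 cycle to the next core

Because successive loop iterations tend to run on adjacent cores, most transfers in this model complete in a hop or two rather than a full trip through the shared memory hierarchy.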
This dissertation also includes a study comparing single-program-per-core multiprogramming with HELIX-RC. Counterintuitively, some HELIX-RC-parallelized benchmarks not only surpass simple multiprogramming in single-program performance, but also beat it in total multicore throughput, by reducing the effective per-core working set of a program.
With communication bottlenecks removed by ring cache, automatic parallelization with HELIX-RC restores a decade of lost single-thread performance improvements.
The exploitation of parallelism on shared memory multiprocessors
With the arrival of many general-purpose shared memory multiple processor (multiprocessor) computers into the commercial arena during the mid-1980s, a rift has opened between the raw processing power offered by the emerging hardware and the relative inability of its operating software to deliver this power effectively to potential users. This rift stems from the fact that, currently, no computational model with the capability to elegantly express parallel activity is mature enough to be universally accepted and used as the basis for programming languages to exploit the parallelism that multiprocessors offer. Added to this, there is a lack of software tools to assist programmers in designing and debugging parallel programs.
Although much research has been done in the field of programming languages, no undisputed candidate for the most appropriate language for programming shared memory multiprocessors has yet been found. This thesis examines why this state of affairs has arisen and proposes programming language constructs, together with a programming methodology and environment, to close the ever-widening hardware-to-software gap.
The novel programming constructs described in this thesis are intended for use in imperative languages, even though they draw on the synchronisation inherent in the dataflow model: they apply the semantics of single assignment when operating on shared data, giving rise to the term shared values. As there are several distinct parallel programming paradigms, matching flavours of shared value are developed to permit the concise expression of each paradigm.
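A minimal sketch of the idea in Python (my illustration, not the thesis's actual constructs): a shared value behaves like a write-once variable whose readers block until the single assignment happens, providing the dataflow-style synchronisation described above.

    import threading

    class SharedValue:
        """Write-once cell: readers block until the single assignment occurs."""
        def __init__(self):
            self._assigned = threading.Event()
            self._lock = threading.Lock()
            self._value = None

        def assign(self, value):
            with self._lock:                  # guard against racing writers
                if self._assigned.is_set():
                    raise RuntimeError("shared value may only be assigned once")
                self._value = value
                self._assigned.set()          # release every blocked reader

        def read(self):
            self._assigned.wait()             # dataflow-style synchronisation
            return self._value

    sv = SharedValue()
    threading.Thread(target=sv.assign, args=(42,)).start()
    print(sv.read())                          # blocks until the producer assigns: 42

The "flavours" mentioned in the abstract would then correspond to variants of such a cell adapted to different paradigms; the specific variants here are hypothetical, as the thesis defines its own set.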
HPCCP/CAS Workshop Proceedings 1998
This publication is a collection of extended abstracts of presentations given at the HPCCP/CAS (High Performance Computing and Communications Program/Computational Aerosciences Project) Workshop held on August 24-26, 1998, at NASA Ames Research Center, Moffett Field, California. The objective of the Workshop was to bring together the aerospace high performance computing community, consisting of airframe and propulsion companies, independent software vendors, university researchers, and government scientists and engineers. The Workshop was sponsored by the HPCCP Office at NASA Ames Research Center. The Workshop consisted of over 40 presentations, including an overview of NASA's High Performance Computing and Communications Program and the Computational Aerosciences Project; ten sessions of papers representative of the high performance computing research conducted within the Program by the aerospace industry, academia, NASA, and other government laboratories; two panel sessions; and a special presentation by Mr. James Bailey.