24,978 research outputs found
Optimizing for a Many-Core Architecture without Compromising Ease-of-Programming
Faced with nearly stagnant clock speed advances, chip manufacturers have turned to parallelism as the source for continuing performance improvements. But even though numerous parallel architectures have already been brought to market, a universally accepted methodology for programming them for general purpose applications has yet to emerge. Existing solutions tend to be hardware-specific, rendering them difficult to use for the majority of application programmers and domain experts, and not providing scalability guarantees for future generations of the hardware.
This dissertation advances the validation of the following thesis: it is possible to develop efficient general-purpose programs for a many-core platform using a model recognized for its simplicity. To prove this thesis, we refer to the eXplicit Multi-Threading (XMT) architecture designed and built at the University of Maryland. XMT is an attempt at re-inventing parallel computing with a solid theoretical foundation and an aggressive scalable design. Algorithmically, XMT is inspired by the PRAM (Parallel Random Access Machine) model and the architecture design is focused on reducing inter-task communication and synchronization overheads and providing an easy-to-program parallel model.
This thesis builds upon the existing XMT infrastructure to improve support for efficient execution with a focus on ease-of-programming. Our contributions aim at reducing the programmer's effort in developing XMT applications and improving the overall performance. More concretely, we: (1) present a work-flow guiding programmers to produce efficient parallel solutions starting from a high-level problem; (2) introduce an analytical performance model for XMT programs and provide a methodology to project running time from an implementation; (3) propose and evaluate RAP -- an improved resource-aware compiler loop prefetching algorithm targeted at fine-grained many-core architectures; we demonstrate performance improvements of up to 34.79% on average over the GCC loop prefetching implementation and up to 24.61% on average over a simple hardware prefetching scheme; and (4) implement a number of parallel benchmarks and evaluate the overall performance of XMT relative to existing serial and parallel solutions, showing speedups of up to 13.89x vs.~ a serial processor and 8.10x vs.~parallel code optimized for an existing many-core (GPU). We also discuss the implementation and optimization of the Max-Flow algorithm on XMT, a problem which is among the more advanced in terms of complexity, benchmarking and research interest in the parallel algorithms community. We demonstrate better speed-ups compared to a best serial solution than previous attempts on other parallel platforms
Tree Parity Machine Rekeying Architectures
The necessity to secure the communication between hardware components in
embedded systems becomes increasingly important with regard to the secrecy of
data and particularly its commercial use. We suggest a low-cost (i.e. small
logic-area) solution for flexible security levels and short key lifetimes. The
basis is an approach for symmetric key exchange using the synchronisation of
Tree Parity Machines. Fast successive key generation enables a key exchange
within a few milliseconds, given realistic communication channels with a
limited bandwidth. For demonstration we evaluate characteristics of a
standard-cell ASIC design realisation as IP-core in 0.18-micrometer
CMOS-technology
Serialized Asynchronous Links for NoC
This paper proposes an asynchronous serialized link for NoC that can achieve the same levels of performance in terms of flits per second as a synchronous link but with a reduced number of wires in the point to point switch links and reduced power consumption. This is achieved by employing serialization in the asynchronous domain as opposed to synchronous to facilitate the removal of global clocking on the serial links. Based on transistor level simulations using 0.12 ?m foundry models it has been shown that it is possible to achieve the same level of performance as synchronous but with 75% reduction in wires and 65% reduction in power for a 300 MFlit/s link with 8 buffers with a switch clock speed of 300 MHz. Furthermore the paper presents the design requirements arising from interfacing switches of synchronous NoC and asynchronous serial links
Exploring performance and power properties of modern multicore chips via simple machine models
Modern multicore chips show complex behavior with respect to performance and
power. Starting with the Intel Sandy Bridge processor, it has become possible
to directly measure the power dissipation of a CPU chip and correlate this data
with the performance properties of the running code. Going beyond a simple
bottleneck analysis, we employ the recently published Execution-Cache-Memory
(ECM) model to describe the single- and multi-core performance of streaming
kernels. The model refines the well-known roofline model, since it can predict
the scaling and the saturation behavior of bandwidth-limited loop kernels on a
multicore chip. The saturation point is especially relevant for considerations
of energy consumption. From power dissipation measurements of benchmark
programs with vastly different requirements to the hardware, we derive a
simple, phenomenological power model for the Sandy Bridge processor. Together
with the ECM model, we are able to explain many peculiarities in the
performance and power behavior of multicore processors, and derive guidelines
for energy-efficient execution of parallel programs. Finally, we show that the
ECM and power models can be successfully used to describe the scaling and power
behavior of a lattice-Boltzmann flow solver code.Comment: 23 pages, 10 figures. Typos corrected, DOI adde
Communications for Next Generation single chip computers
It is the thesis of this report that much of what is presently thought to require specialized VLSI functions might instead be achieved by combinations of fast general purpose single chip computers with upgraded communication facilities. To this end, the characteristics of applications
of this nature are first surveyed briefly and some working principles established. In the light of these, three different chip philosophies are explored in some detail. This study shows that some upgrading of typical
single chip I/O will definitely be necessary, but that this upgrading does not have to be complex and that true multiprocessor-multibus operation could be achieved without excessive cost
- …