Characterization of voltage noise in big, small and single-ISA heterogeneous systems
Sensitivity of the microprocessor to voltage fluctuations is becoming a major concern with the growing emphasis on designing power-efficient microprocessors. Voltage fluctuations that exceed a certain threshold cause "emergencies" that can lead to timing errors in the processor, thus risking reliability. To guarantee correctness under such conditions, large voltage guardbands are employed, at the cost of reduced performance and wasted power. Trends in microprocessor technology indicate that worst-case operating voltage margins are not sustainable. Since voltage emergencies occur only infrequently, resilient architectures with aggressive guardbands are needed. However, to enable the exploration of the design space of resilient processors, it is important to have a deep understanding of the characteristics of voltage noise in different system configurations. Prior research in this area has mostly focused on systems with very few cores. Given the increasing relevance of large multi-core systems, this thesis presents a detailed characterization of voltage noise on chip multi-processors (CMPs) consisting of a large number of cores. The data indicates that while the worst-case voltage droop increases with the number of cores, the frequency of occurrence of the droops is not greatly impacted, emphasizing the feasibility of employing resilient microarchitectures with aggressive voltage margins. The thesis also presents a comparative study of voltage noise in CMPs consisting of either high-performance out-of-order cores or power-efficient in-order cores. The study highlights that the out-of-order cores experience much larger voltage variations than the in-order cores, but offer a clear advantage in terms of performance. Experiments indicate that in-order configurations that offer performance equivalent to the out-of-order cores result in a large energy-delay product, indicating the trade-offs involved in designing for performance, power and reliability.
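The energy-delay product mentioned above is simply energy multiplied by execution time, with lower values being better. The following is a minimal sketch of that comparison; the helper name `energy_delay_product` and all numbers are hypothetical illustrations, not measurements from the thesis.

```python
def energy_delay_product(energy_joules: float, delay_seconds: float) -> float:
    """Return the energy-delay product (EDP); lower is better."""
    return energy_joules * delay_seconds


# Hypothetical numbers chosen only for illustration (not from the thesis).
ooo_edp = energy_delay_product(energy_joules=10.0, delay_seconds=2.0)

# Assumed scenario: an in-order configuration scaled up until it matches
# the out-of-order delay, at the price of higher total energy.
inorder_edp = energy_delay_product(energy_joules=14.0, delay_seconds=2.0)

print(f"out-of-order EDP: {ooo_edp}, in-order EDP: {inorder_edp}")
```

With equal delay, the configuration that spends more energy has the larger EDP, which is the kind of trade-off the abstract describes.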
The thesis also presents a study of voltage noise in single-ISA heterogeneous configurations, to highlight the benefits of such systems in lowering the worst-case voltage margins, which improves both performance and power. The experimental results indicate that the worst-case voltage droop in such heterogeneous systems lies between those of the out-of-order and in-order cores, and that these systems provide reasonable power efficiency and performance. Further, the work highlights the importance of exploring the design space of heterogeneous systems with reliability as an important design criterion.
Computer Science
The Teleportation Design Pattern for Hardware Transactional Memory
We identify a design pattern for concurrent data structures, called teleportation, that uses best-effort hardware transactional memory to speed up certain kinds of legacy concurrent data structures. Teleportation unifies and explains several existing data structure designs, and it serves as the basis for novel approaches to reducing the memory traffic associated with fine-grained locking, and with hazard pointer management for memory reclamation.
HopliteBuf FPGA Network-on-Chip: Architecture and Analysis
We can prove occupancy bounds of stall-free FIFOs used in deflection-free, low-cost, and high-speed FPGA overlay Network-on-chips (NoCs). In our work, we build on top of the HopliteRT livelock-free overlay NoC with an FPGA-friendly 2D unidirectional torus topology to propose the novel HopliteBuf NoC. In our new NoC, we strategically introduce stall-free FIFOs in the network and support these FIFOs with static analysis based on network calculus to compute FIFO occupancy, latency, and bandwidth bounds. The microarchitecture of HopliteBuf combines the performance benefits of conventional buffered NoCs (high throughput, low latency) with the cost advantages of deflection-routed NoCs (low FPGA area, high clock frequencies).
Specifically, we look at two design variants of the HopliteBuf NoC: (1) Single corner-turn FIFO (W to S), and (2) Dual corner-turn FIFO (W to S+N). The single corner-turn (W to S) design is simpler and only introduces a buffering requirement for packets changing dimension from the X ring to the downhill Y ring (or West to South). The dual corner-turn variant requires two FIFOs for turning packets going downhill (W to S) as well as uphill (W to N). The dual corner-turn design overcomes the mathematical analysis challenges associated with single corner-turn designs for communication workloads with cyclic dependencies between flow traversal paths, at the expense of a small increase in resource cost. Essentially, we resolve an analysis challenge with extra hardware resources.
Across a range of 100 synthetically-generated workloads on a 5 x 5 NoC, HopliteBuf outperforms HopliteRT by 1.2-2x in terms of latency, 10% in terms of injection rate, and 30-60% in terms of flowset feasibility. These advantages come at the cost of 3-4x higher FPGA resource requirement for buffers and muxes. Our analysis also delivers latency bounds that are not only better than HopliteRT in absolute terms but also tighter by 2-3x, allowing us to provision less hardware to meet our specifications.
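To illustrate how network calculus can yield FIFO occupancy and latency bounds, the sketch below applies the textbook result for a token-bucket arrival curve (burst `b`, rate `r`) served by a rate-latency service curve (rate `R`, latency `T`): the backlog is at most `b + r*T` and the delay at most `T + b/R`, provided `r <= R`. The abstract does not state which curves HopliteBuf's analysis uses, so the curve shapes, function names, and numbers here are assumptions for illustration only.

```python
import math


def backlog_bound(burst: float, rate: float, service_rate: float, latency: float) -> float:
    """Worst-case queue occupancy for token-bucket arrivals and rate-latency service."""
    if rate > service_rate:
        raise ValueError("arrival rate exceeds service rate: backlog is unbounded")
    return burst + rate * latency


def delay_bound(burst: float, rate: float, service_rate: float, latency: float) -> float:
    """Worst-case delay for token-bucket arrivals and rate-latency service."""
    if rate > service_rate:
        raise ValueError("arrival rate exceeds service rate: delay is unbounded")
    return latency + burst / service_rate


# Hypothetical flow: bursts of 4 packets, 0.25 packets/cycle on average,
# served at 1 packet/cycle after at most 8 cycles of initial latency.
occupancy = backlog_bound(burst=4.0, rate=0.25, service_rate=1.0, latency=8.0)
worst_delay = delay_bound(burst=4.0, rate=0.25, service_rate=1.0, latency=8.0)
fifo_depth = math.ceil(occupancy)  # smallest FIFO depth that never overflows

print(occupancy, worst_delay, fifo_depth)
```

A bound of this kind is what lets a designer size a FIFO statically instead of over-provisioning it.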
Comparative modelling and verification of Pthreads and Dthreads
The POSIX threads (Pthreads) library is a thread API for C/C++ to control parallel threads and spawn concurrent process flows. Programming in Pthreads usually suffers from undesirable deadlock, data race, and race condition problems due to the potentially nondeterministic execution behaviors of parallel threads. Dthreads, another multithreading model that re-implements Pthreads, was proposed by Liu et al. for efficient deterministic multithreading. They found that, under specific test cases, Dthreads can effectively prevent data races. However, no comparison test has been made with Pthreads. To perform a formal comparison between Pthreads and Dthreads over deadlocks, data races, and race conditions, in this paper we adopt CSP (communicating sequential processes) as a formal model for specifying part of the API functions in Pthreads and Dthreads, and illustrate the model construction using 4 classical example programs. By feeding the models into the model checker PAT (process analysis toolkit), we have verified that deadlocks and data races exist in Pthreads, but do not exist in Dthreads, for the considered programs. We have also found that neither of them can prevent race conditions. Our comparative modelling and verification of Pthreads and Dthreads show that, although Dthreads cannot prevent all deadlock situations (as shown by the verification results of another 2 example programs), Dthreads is better than Pthreads at eliminating data races and preventing deadlocks. Considering the limited scalability of Dthreads, we have introduced a new programming model to support coarse granularity in bank transfer. Our modelling is also extended to cover the synchronization operations in Liu et al.'s work.
Cautiously Optimistic Program Analyses for Secure and Reliable Software
Modern computer systems still have various security and reliability vulnerabilities. Well-known dynamic analysis solutions can mitigate them using runtime monitors that serve as lifeguards. But the additional work of enforcing these security and safety properties incurs exorbitant performance costs, so such tools are rarely used in practice. Our work addresses this problem with a novel technique: Cautiously Optimistic Program Analysis (COPA).
COPA is optimistic: it infers likely program invariants from dynamic observations and assumes them in its static reasoning to precisely identify and elide wasteful runtime monitors. The resulting system is fast, but also ensures soundness by recovering to a conservatively optimized analysis in the rare case that a likely invariant fails at runtime. COPA is also cautious: by carefully restricting optimizations to only safe elisions, it greatly simplifies recovery. It avoids unbounded rollbacks upon recovery, thereby enabling analysis for live production software.
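The optimistic-then-recover idea described above can be sketched in a few lines. This is a heavily simplified, runtime-only illustration and not COPA's actual design (which relies on static analysis); the class `OptimisticMonitor`, its fields, and the example invariant are all hypothetical. The sketch assumes the likely invariant implies the full check, so the expensive monitor can be skipped while the invariant holds.

```python
class OptimisticMonitor:
    """Skip an expensive check while a cheap likely invariant holds."""

    def __init__(self, likely_invariant, full_check):
        self.likely_invariant = likely_invariant  # cheap; assumed to imply full_check
        self.full_check = full_check              # expensive, always-sound monitor
        self.conservative = False                 # set once the invariant has failed
        self.full_checks_run = 0

    def observe(self, value) -> bool:
        if not self.conservative and self.likely_invariant(value):
            return True  # monitor elided: the invariant already implies safety
        # Invariant failed (now or earlier): recover to the conservative path
        # and stay there, so no rollback of earlier work is needed.
        self.conservative = True
        self.full_checks_run += 1
        return self.full_check(value)


BUFFER_LEN = 16
monitor = OptimisticMonitor(
    likely_invariant=lambda i: 0 <= i < 8,        # hypothetical profiled invariant
    full_check=lambda i: 0 <= i < BUFFER_LEN,     # the real bounds check
)
results = [monitor.observe(i) for i in [1, 3, 7, 12, 2, 99]]
print(results, monitor.full_checks_run)
```

In this run the first three accesses skip the full check; the access at index 12 violates the likely invariant, after which every access is checked conservatively.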
We demonstrate the effectiveness of Cautiously Optimistic Program Analyses in three areas:
Information-Flow Tracking (IFT) can help prevent security breaches and information leaks. But it is rarely used in practice due to its high performance overhead (>500% for web/email servers). COPA dramatically reduces this cost by eliding wasteful IFT monitors to make it practical (9% overhead, 4x speedup).
Automatic Garbage Collection (GC) in managed languages (e.g. Java) simplifies programming tasks while ensuring memory safety. However, there is no correct GC for weakly-typed languages (e.g. C/C++), and manual memory management is prone to errors that have been exploited in high-profile attacks. We develop the first sound GC for C/C++, and use COPA to optimize its performance (16% overhead).
Sequential Consistency (SC) provides intuitive semantics for concurrent programs, which simplifies reasoning about their correctness. However, ensuring SC behavior on commodity hardware remains expensive. We use COPA to ensure SC for Java efficiently at the language level, and significantly reduce its cost (from 24% down to 5% on x86).
COPA provides a way to realize strong software security, reliability and semantic guarantees at practical costs.
PHD
Computer Science & Engineering
University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/170027/1/subarno_1.pd
Multi-tasking scheduling for heterogeneous systems
Heterogeneous platforms play an increasingly important role in modern computer
systems. They combine high performance with low power consumption. From mobiles
to supercomputers, we see an increasing number of computer systems that are
heterogeneous.
The best-known heterogeneous systems, CPU+GPU platforms, have been widely
used in recent years. As they become more mainstream, serving multiple tasks from
multiple users is an emerging challenge. A good scheduler can greatly improve performance.
However, indiscriminately allocating tasks based on availability leads to poor
performance. As modern GPUs have a large number of hardware resources, most tasks
cannot efficiently utilize all of them. Concurrent task execution on the GPU is a promising
solution; however, indiscriminately running tasks in parallel causes a slowdown.
This thesis focuses on scheduling OpenCL kernels. A runtime framework is developed
to determine where to schedule OpenCL kernels. It predicts the best-fit device by
using a machine learning-based classifier, then schedules the kernels accordingly to either
CPU or GPU. To improve GPU utilization, a kernel merging approach is proposed.
Kernels are merged if their predicted co-execution can provide better performance than
sequential execution. A machine learning based classifier is developed to find the best
kernel pairs for co-execution on GPU. Finally, a runtime framework is developed to
schedule kernels separately on either CPU or GPU, and run kernels in pairs if their
co-execution can improve performance. The approaches developed in this thesis significantly
improve system performance and outperform all existing techniques.
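The merging rule described above (pair two kernels only when their predicted co-execution beats running them back to back) can be sketched as follows. The thesis uses a machine learning-based classifier for the prediction; here a plain dictionary of predicted times stands in for it. The function `pick_pairs`, the greedy pairing strategy, and all timings are assumptions made for illustration.

```python
from itertools import combinations


def pick_pairs(solo_time, co_time):
    """Greedily pair kernels whose predicted co-execution beats sequential execution.

    solo_time: kernel name -> predicted standalone GPU time
    co_time:   (name_a, name_b) with name_a < name_b -> predicted co-execution time
    """
    gains = []
    for a, b in combinations(sorted(solo_time), 2):
        predicted = co_time.get((a, b))
        if predicted is None:
            continue  # no prediction available for this pair
        gain = solo_time[a] + solo_time[b] - predicted
        if gain > 0:  # merge only if co-execution is predicted to be faster
            gains.append((gain, a, b))

    gains.sort(reverse=True)  # most beneficial pairs first
    pairs, used = [], set()
    for _, a, b in gains:
        if a not in used and b not in used:
            pairs.append((a, b))
            used.update((a, b))
    singles = [k for k in sorted(solo_time) if k not in used]
    return pairs, singles


# Hypothetical predicted times in milliseconds.
solo = {"k1": 10.0, "k2": 8.0, "k3": 5.0}
co = {("k1", "k2"): 12.0, ("k1", "k3"): 16.0, ("k2", "k3"): 11.0}
pairs, singles = pick_pairs(solo, co)
print(pairs, singles)
```

Here `k1` and `k2` are merged because their predicted co-execution saves the most time, while `k3` runs on its own.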
Mapping parallel programs to heterogeneous multi-core systems
Heterogeneous computer systems are ubiquitous in all areas of computing, from mobile
to high-performance computing. They promise to deliver increased performance
at lower energy cost than purely homogeneous, CPU-based systems. In recent years
GPU-based heterogeneous systems have become increasingly popular. They combine
a programmable GPU with a multi-core CPU. GPUs have become flexible enough
to not only handle graphics workloads but also various kinds of general-purpose
algorithms. They are thus used as a coprocessor or accelerator alongside the CPU.
Developing applications for GPU-based heterogeneous systems involves several
challenges. Firstly, not all algorithms are equally suited for GPU computing. It is thus
important to carefully map the tasks of an application to the most suitable processor
in a system. Secondly, current frameworks for heterogeneous computing, such as
OpenCL, are low-level, requiring a thorough understanding of the hardware by the
programmer. This high barrier to entry could be lowered by automatically generating
and tuning this code from a high-level and thus more user-friendly programming
language. Both challenges are addressed in this thesis.
For the task mapping problem a machine learning-based approach is presented in
this thesis. It combines static features of the program code with runtime information
on input sizes to predict the optimal mapping of OpenCL kernels. This approach is
further extended to also take contention on the GPU into account. Both methods are
able to outperform competing mapping approaches by a significant margin.
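As a rough illustration of combining a static code feature with the runtime input size to choose a device, the sketch below uses a 1-nearest-neighbour lookup. The abstract does not say which model or features the thesis uses, so the model choice, the feature (`ops_per_byte`), the training points, and the function `predict_device` are all invented for this example.

```python
import math

# Invented training points: (static ops-per-byte, log2 of input size) -> best device.
TRAINING = [
    ((0.5, 10.0), "CPU"),
    ((1.0, 12.0), "CPU"),
    ((8.0, 20.0), "GPU"),
    ((16.0, 24.0), "GPU"),
]


def predict_device(ops_per_byte: float, input_size: int) -> str:
    """Return the device label of the closest training point (1-nearest neighbour)."""
    query = (ops_per_byte, math.log2(input_size))
    _, label = min(TRAINING, key=lambda item: math.dist(item[0], query))
    return label


# A compute-heavy kernel on a large input versus a light kernel on a small input.
big = predict_device(ops_per_byte=10.0, input_size=2**21)
small = predict_device(ops_per_byte=0.8, input_size=2**11)
print(big, small)
```

The point of the sketch is only that the same kernel can map to different devices as its input size changes, which is why runtime information is combined with static features.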
Furthermore, this thesis develops a method for targeting GPU-based heterogeneous
systems from OpenMP, a directive-based framework for parallel computing.
OpenMP programs are translated to OpenCL and optimized for GPU performance.
At runtime a predictive model decides whether to execute the original OpenMP code
on the CPU or the generated OpenCL code on the GPU. This approach is shown to
outperform both a competing approach and hand-tuned code.