    Feedback Driven Annotation and Refactoring of Parallel Programs

    The embedded system space is characterized by a rapid evolution in the complexity and functionality of applications. In addition, the short time-to-market nature of the business motivates the use of programmable devices capable of meeting the conflicting constraints of low energy, high performance, and short design times. The keys to achieving these conflicting constraints are specialization and maximally extracting available application parallelism. General-purpose processors are flexible but are either too power hungry or lack the necessary performance. Application-specific integrated circuits (ASICs) efficiently meet the performance and power needs but are inflexible. Programmable domain-specific architectures (DSAs) are an attractive middle ground, but their design requires significant time, resources, and expertise in a variety of specialties, ranging from application algorithms to architecture and, ultimately, circuit design. This dissertation presents CoGenE, a design framework that automates the design of energy-performance-optimal DSAs for embedded systems. For a given application domain and a user-chosen initial architectural specification, CoGenE consists of a Compiler to generate the execution binary, a simulator Generator to collect performance/energy statistics, and an Explorer that modifies the current architecture to improve energy-performance-area characteristics. This process repeats automatically until the user-specified constraints are achieved, removing or reducing the time needed to understand the application, manually design the DSA, and generate object code for the DSA. Thus, CoGenE is a new design methodology that delivers significant improvements in performance, energy dissipation, design time, and resources. This dissertation employs the face recognition domain to showcase a flexible architectural design methodology that creates "ASIC-like" DSAs. The DSAs are instruction set architecture (ISA)-independent and achieve good energy-performance characteristics by coscheduling the often conflicting constraints of data access, data movement, and computation through a flexible interconnect. This flexibility, however, significantly increases programming complexity and code generation time. To address this problem, the CoGenE compiler employs integer linear programming (ILP)-based 'interconnect-aware' scheduling techniques for automatic code generation. The CoGenE explorer employs an iterative technique to search the complete design space and select a set of energy-performance-optimal candidates. When compared to manual designs, results demonstrate that CoGenE produces superior designs for three application domains: face recognition, speech recognition, and wireless telephony. While CoGenE is well suited to applications that exhibit streaming behavior, multithreaded applications like ray tracing present a different but important challenge. To demonstrate its generality, CoGenE is evaluated in designing a novel multicore N-wide SIMD architecture, known as StreamRay, for the ray tracing domain. CoGenE is used to synthesize the SIMD execution cores, the compiler that generates the application binary, and the interconnection subsystem. Further, separating address and data computations in space reduces data movement and contention for resources, thereby significantly improving performance compared to existing ray tracing approaches.
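
    To make the compile-simulate-explore flow described above concrete, the following is a minimal sketch of such an iterative design-space exploration loop. All names and the toy cost model (ArchSpec, Stats, Constraints, mutate) are illustrative assumptions, not CoGenE's actual interfaces.

        // Hypothetical sketch of an iterative design-space exploration loop:
        // compile for the current architecture, simulate to collect statistics,
        // then let the explorer perturb the architecture until constraints are met.
        #include <iostream>

        struct ArchSpec    { int lanes = 4; int sram_kb = 64; };   // placeholder design knobs
        struct Stats       { double energy_mj; double cycles; };
        struct Constraints { double max_energy_mj; double max_cycles; };

        // Stand-ins for the Compiler, Generator (simulator), and Explorer stages.
        Stats simulate(const ArchSpec& a) {
            // Toy model: more lanes -> fewer cycles but more energy.
            return { 10.0 + 0.5 * a.lanes, 1e6 / a.lanes };
        }
        ArchSpec mutate(ArchSpec a, const Stats& s, const Constraints& c) {
            if (s.cycles > c.max_cycles)            a.lanes *= 2;    // need more performance
            else if (s.energy_mj > c.max_energy_mj) a.sram_kb /= 2;  // trim energy
            return a;
        }
        bool satisfied(const Stats& s, const Constraints& c) {
            return s.cycles <= c.max_cycles && s.energy_mj <= c.max_energy_mj;
        }

        int main() {
            ArchSpec arch;                      // user-chosen initial specification
            Constraints goal{14.0, 2.0e5};
            for (int iter = 0; iter < 16; ++iter) {
                Stats s = simulate(arch);       // Compiler + Generator stages, abstracted
                std::cout << "iter " << iter << ": lanes=" << arch.lanes
                          << " cycles=" << s.cycles << " energy=" << s.energy_mj << "\n";
                if (satisfied(s, goal)) break;  // user-specified constraints achieved
                arch = mutate(arch, s, goal);   // Explorer proposes the next design point
            }
        }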

    Domain Specific Computing in Tightly-Coupled Heterogeneous Systems

    Over the past several decades, researchers and programmers across many disciplines have relied on Moore's law and Dennard scaling for increases in compute capability in modern processors. However, recent data suggest that the number of transistors per square inch on integrated circuits is losing pace with the projection of Moore's law due to the breakdown of Dennard scaling at smaller semiconductor process nodes. This has signaled the beginning of a new “golden age in computer architecture” in which the paradigm will be shifted from improving traditional processor performance for general tasks to architecting hardware that executes a class of applications in a high-performing manner. This shift will be paved, in part, by making compute systems more heterogeneous and investigating domain specific architectures. However, the notion of domain specific architectures raises many research questions. Specifically, what constitutes a domain? How does one architect hardware for a specific domain? In this dissertation, we present our work towards domain specific computing. We start by constructing a guiding definition for our target domain and then creating a benchmark suite of applications based on our domain definition. We then use quantitative metrics from the literature to characterize our domain in order to gain insights regarding what would be most beneficial in hardware targeted specifically for the domain. From the characterization, we learn that data movement is a particularly salient aspect of our domain. Motivated by this fact, we evaluate our target platform, the Intel HARPv2 CPU+FPGA system, for architecting domain specific hardware through a portability and performance evaluation. To guide the creation of domain specific hardware for this platform, we create a novel tool to quantify spatial and temporal locality. We apply this tool to our benchmark suite and use the generated outputs as features to an unsupervised clustering algorithm. We posit that the resulting clusters represent sub-domains within our originally specified domain; specifically, these clusters inform whether a kernel of computation should be designed as a widely vectorized or deeply pipelined compute unit. Using the lessons learned from the domain characterization and hardware platform evaluation, we outline our process of designing hardware for our domain, and empirically verify that our prediction regarding a wide or deep kernel implementation is correct.
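
    As an illustration of how spatial and temporal locality might be reduced to numeric features for clustering, the sketch below derives two simple metrics from a memory-address trace. The metrics, the 64-byte line-size assumption, and the function names are assumptions for illustration only; the dissertation's actual tool is more involved.

        // Hypothetical sketch: derive simple temporal/spatial locality features
        // from an address trace; such per-kernel feature vectors could then be
        // clustered (e.g. with k-means) to suggest a widely vectorized vs.
        // deeply pipelined compute-unit design.
        #include <cstdint>
        #include <iostream>
        #include <unordered_map>
        #include <vector>

        struct LocalityFeatures { double temporal; double spatial; };

        LocalityFeatures characterize(const std::vector<uint64_t>& trace) {
            std::unordered_map<uint64_t, size_t> last_use;  // address -> last access index
            double reuse_sum = 0; size_t reuses = 0;        // temporal: how soon addresses recur
            size_t near_strides = 0;                        // spatial: consecutive accesses close by
            for (size_t i = 0; i < trace.size(); ++i) {
                auto it = last_use.find(trace[i]);
                if (it != last_use.end()) { reuse_sum += i - it->second; ++reuses; }
                last_use[trace[i]] = i;
                if (i > 0) {
                    uint64_t d = trace[i] > trace[i - 1] ? trace[i] - trace[i - 1]
                                                         : trace[i - 1] - trace[i];
                    if (d <= 64) ++near_strides;            // within one cache line (assumed 64 B)
                }
            }
            double temporal = reuses ? 1.0 / (reuse_sum / reuses) : 0.0;  // shorter reuse = higher
            double spatial  = trace.size() > 1
                            ? double(near_strides) / (trace.size() - 1) : 0.0;
            return {temporal, spatial};
        }

        int main() {
            std::vector<uint64_t> streaming;                // unit-stride trace: high spatial locality
            for (uint64_t a = 0; a < 1024; a += 8) streaming.push_back(a);
            LocalityFeatures f = characterize(streaming);
            std::cout << "temporal=" << f.temporal << " spatial=" << f.spatial << "\n";
        }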

    Computing the fast Fourier transform on SIMD microprocessors

    This thesis describes how to compute the fast Fourier transform (FFT) of a power-of-two length signal on single-instruction, multiple-data (SIMD) microprocessors faster than, or very close to, the speed of state-of-the-art libraries such as FFTW (“Fastest Fourier Transform in the West”), SPIRAL, and Intel Integrated Performance Primitives (IPP). The conjugate-pair algorithm has advantages in terms of memory bandwidth, and three implementations of this algorithm, which incorporate latency and spatial locality optimizations, are automatically vectorized at the algorithm level of abstraction. Performance results on 2-way, 4-way, and 8-way SIMD machines show that the performance scales much better than FFTW or SPIRAL. The implementations presented in this thesis are compiled into a high-performance FFT library called SFFT (“Streaming Fast Fourier Transform”), and benchmarked against FFTW, SPIRAL, Intel IPP, and Apple Accelerate on sixteen x86 machines and two ARM NEON machines, and shown to be, in many cases, faster than these state-of-the-art libraries, but without having to perform extensive machine-specific calibration, thus demonstrating that there are good heuristics for predicting the performance of the FFT on SIMD microprocessors (i.e., the need for empirical optimization may be overstated).
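
    For reference, a plain recursive radix-2 Cooley-Tukey FFT for power-of-two lengths is sketched below. This is only a baseline illustration of the transform being computed; it is not the conjugate-pair, SIMD-vectorized SFFT implementation the thesis develops.

        // Plain recursive radix-2 decimation-in-time FFT (power-of-two lengths only).
        #include <cmath>
        #include <complex>
        #include <iostream>
        #include <vector>

        using cd = std::complex<double>;
        const double PI = std::acos(-1.0);

        void fft(std::vector<cd>& a) {
            const size_t n = a.size();
            if (n <= 1) return;
            std::vector<cd> even(n / 2), odd(n / 2);
            for (size_t i = 0; i < n / 2; ++i) { even[i] = a[2 * i]; odd[i] = a[2 * i + 1]; }
            fft(even); fft(odd);                            // recurse on even/odd halves
            for (size_t k = 0; k < n / 2; ++k) {
                cd t = std::polar(1.0, -2.0 * PI * k / n) * odd[k];  // twiddle factor
                a[k]         = even[k] + t;
                a[k + n / 2] = even[k] - t;
            }
        }

        int main() {
            std::vector<cd> x = {1.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0};
            fft(x);
            for (const cd& v : x) std::cout << v << "\n";
        }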

    Analysis and transformation of legacy code

    Hardware evolves faster than software. While a hardware system might need replacement every one to five years, the average lifespan of a software system is a decade, with some instances living up to several decades. Inevitably, code outlives the platform it was developed for and may become legacy: development of the software stops, but maintenance has to continue to keep up with the evolving ecosystem. No new features are added, but the software is still used to fulfil its original purpose. Even in the cases where it is still functional (which discourages its replacement), legacy code is inefficient, costly to maintain, and a risk to security. This thesis proposes methods to leverage the expertise put into the development of legacy code and to extend its useful lifespan, rather than to throw it away. A novel methodology is proposed for automatically exploiting platform-specific optimisations when retargeting a program to another platform. The key idea is to leverage the optimisation information embedded in vector processing intrinsic functions. The performance of the resulting code is shown to be close to the performance of manually retargeted programs, but with the human labour removed. Building on top of that, the question of discovering optimisation information when there are no hints in the form of intrinsics or annotations is investigated. This thesis postulates that such information can potentially be extracted from profiling the data flow during executions of the program. A context-aware data dependence profiling system is described, detailing previously overlooked aspects in related research. The system is shown to be essential in surpassing the information that can be inferred statically, in particular about loop iterators. Loop iterators are the controlling part of a loop. This thesis describes and evaluates a system for extracting the loop iterators in a program. It is found to significantly outperform previously known techniques and further increases the amount of information about the structure of a program that is available to a compiler. Combining this system with data dependence profiling improves its results even further. Loop iterator recognition enables other code modernising techniques, such as source code rejuvenation and commutativity analysis. The former increases the use of idiomatic code and as a result increases the maintainability of the program. The latter can potentially drive parallelisation and thus dramatically improve runtime performance.
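
    The kind of optimisation information carried by vector intrinsics can be illustrated with a small hand-written example: the SSE loop below encodes "four independent single-precision additions per iteration", which a retargeting pass can preserve when emitting portable (or NEON/AVX) code. The functions shown are ordinary SSE intrinsics and a hand-written portable rewrite, not the thesis's actual toolchain.

        // Legacy, platform-specific kernel using SSE intrinsics, plus the portable
        // rewrite a retargeting pass might emit once the vector structure is recovered.
        #include <cstddef>

        #if defined(__SSE__)
        #include <xmmintrin.h>
        // 4-wide SSE single-precision add; the intrinsics pin this code to x86.
        void add_arrays_sse(const float* a, const float* b, float* out, std::size_t n) {
            std::size_t i = 0;
            for (; i + 4 <= n; i += 4) {
                __m128 va = _mm_loadu_ps(a + i);
                __m128 vb = _mm_loadu_ps(b + i);
                _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
            }
            for (; i < n; ++i) out[i] = a[i] + b[i];   // scalar tail
        }
        #endif

        // Portable rewrite: the independent-iteration structure recovered from the
        // intrinsics is kept as a simple loop the target compiler can re-vectorise.
        void add_arrays_portable(const float* a, const float* b, float* out, std::size_t n) {
            for (std::size_t i = 0; i < n; ++i) out[i] = a[i] + b[i];
        }

        int main() {
            float a[8] = {1, 2, 3, 4, 5, 6, 7, 8}, b[8] = {8, 7, 6, 5, 4, 3, 2, 1}, out[8];
            add_arrays_portable(a, b, out, 8);
            return out[0] == 9.0f ? 0 : 1;
        }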

    Video post processing architectures


    Software/Hardware Co-Design to Improve Productivity, Portability, and Performance of Loop-Task Parallel Applications

    Computer architects are increasingly turning to programmable accelerators tailored for narrower classes of applications in order to achieve high performance and energy efficiency. A continuing challenge with accelerators is enabling the programmer to easily extract maximum performance without intimate knowledge of the underlying microarchitecture. It is important to consider productivity and portability, in addition to performance, as first-class metrics when developing and evaluating modern computing platforms. Software-centric approaches to achieving 3P computing platforms are compelling, but sacrifice efficiency and flexibility by hiding parallel abstractions from hardware and limiting the scope of the application domain. This thesis proposes a new software/hardware co-design approach to achieving 3P platforms, called the loop-task accelerator (LTA) platform, that provides high productivity and portability without sacrificing performance or efficiency across a wide range of applications. The LTA platform addresses the weaknesses of existing approaches that are identified through detailed experimentation with and analysis of modern application development. Discussion of an early attempt at a hardware-centric approach to achieving 3P platforms provides insight into area-efficient accelerator designs and highlights the need for innovations in both software and hardware. The LTA platform focuses on exploiting loop-task parallelism by exposing loop-tasks as a common parallel abstraction at the programming API, runtime, ISA, and microarchitectural levels. The LTA programming API uses the parallel_for construct to express loop-tasks that can be exploited both across cores and within a core, the LTA runtime distributes loop-tasks across cores, and a new xpfor instruction explicitly encodes loop-tasks as functions applied to a range of loop iterations. This thesis introduces a novel task-coupling taxonomy that captures how tasks can be coupled in both space and time. The LTA engine template can be configured at design time with variable spatial and temporal task coupling to accelerate the execution of both regular and irregular loop-tasks within a core. The LTA platform is evaluated with respect to the 3Ps using a vertically integrated research methodology. Compared to an in-order multi-core baseline, the LTA platform yields average improvements of 5.5× in raw performance, 2.5× in performance per area, and 1.2× in energy efficiency, while offering high productivity and portability.
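
    The shape of the loop-task abstraction, a function applied to a range of loop iterations via a parallel_for construct, can be sketched generically with standard threads. The template below is only an illustration of that abstraction; the actual LTA API, runtime, and xpfor instruction encode and schedule loop-tasks quite differently.

        // Generic parallel_for: a loop-task (body) applied to an iteration range,
        // distributed in chunks across worker threads.
        #include <algorithm>
        #include <cstddef>
        #include <iostream>
        #include <thread>
        #include <vector>

        template <typename Body>
        void parallel_for(std::size_t begin, std::size_t end, Body body,
                          unsigned nthreads = std::thread::hardware_concurrency()) {
            nthreads = std::max(1u, nthreads);
            std::size_t chunk = (end - begin + nthreads - 1) / nthreads;
            std::vector<std::thread> workers;
            for (unsigned t = 0; t < nthreads; ++t) {
                std::size_t lo = begin + t * chunk;
                std::size_t hi = std::min(end, lo + chunk);
                if (lo >= hi) break;
                // Each worker executes the loop-task over its iteration sub-range.
                workers.emplace_back([=] { for (std::size_t i = lo; i < hi; ++i) body(i); });
            }
            for (std::thread& w : workers) w.join();
        }

        int main() {
            std::vector<double> y(1 << 20), x(1 << 20, 1.0);
            double alpha = 2.0;
            // Loop-task: a function applied to a range of loop iterations (a scal-like body).
            parallel_for(0, x.size(), [&](std::size_t i) { y[i] = alpha * x[i]; });
            std::cout << y[0] << " " << y.back() << "\n";
        }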

    Exploiting Multi-Level Parallelism in Streaming Applications for Heterogeneous Platforms with GPUs

    Heterogeneous computing platforms support the traditional types of parallelism, such as instruction-level, data, task, and pipeline parallelism, and provide the opportunity to exploit a combination of different types of parallelism at different platform levels. The architectural diversity of platform components makes tapping into the platform potential a challenging programming task. This thesis makes an important step in this direction by introducing a novel methodology for automatic generation of structured, multi-level parallel programs from sequential applications. We introduce a novel hierarchical intermediate program representation (HiPRDG) that captures the notions of structure and hierarchy in the polyhedral model used for compile-time program transformation and code generation. Using the HiPRDG as the starting point, we present a novel method for generation of multi-level programs (MLPs) featuring different types of parallelism, such as task, data, and pipeline parallelism. Moreover, we introduce concepts and techniques for data parallelism identification, GPU code generation, and asynchronous data-driven execution on heterogeneous platforms with efficient overlapping of host-accelerator communication and computation. By enabling the modular, hybrid parallelization of program model components via HiPRDG, this thesis opens the door for highly efficient, tailor-made parallel program generation and auto-tuning for next generations of multi-level heterogeneous platforms with diverse accelerators.
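
    The overlap of host-accelerator communication with computation can be illustrated with a generic double-buffering loop: while chunk i is processed, chunk i+1 is already being fetched. The sketch below uses std::async as a stand-in for asynchronous transfers and is purely illustrative; the thesis generates such overlapped code for real GPU platforms rather than with host threads.

        // Generic double-buffering: fetch of the next chunk (simulated transfer)
        // overlaps processing of the current chunk (simulated computation).
        #include <future>
        #include <iostream>
        #include <numeric>
        #include <vector>

        using Chunk = std::vector<float>;

        Chunk fetch_chunk(int i) {                  // simulated host-to-accelerator transfer
            return Chunk(1024, float(i));
        }
        float process_chunk(const Chunk& c) {       // simulated on-accelerator computation
            return std::accumulate(c.begin(), c.end(), 0.0f);
        }

        int main() {
            const int nchunks = 8;
            float total = 0.0f;
            std::future<Chunk> next = std::async(std::launch::async, fetch_chunk, 0);
            for (int i = 0; i < nchunks; ++i) {
                Chunk current = next.get();                       // wait for chunk i's transfer
                if (i + 1 < nchunks)                              // start transferring chunk i+1 ...
                    next = std::async(std::launch::async, fetch_chunk, i + 1);
                total += process_chunk(current);                  // ... while chunk i is processed
            }
            std::cout << "total = " << total << "\n";
        }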