    CRAUL: Compiler and Run-Time Integration for Adaptation under Load

    Performance and Memory Space Optimizations for Embedded Systems

    Embedded systems share three common constraints: real-time performance, low power consumption, and low cost (limited hardware). Embedded computers increasingly use chip multiprocessors (CMPs) to meet these expectations. However, one of the major problems is the lack of efficient software support for CMPs; in particular, automated code parallelizers are needed. The aim of this study is to explore various ways to increase performance and to reduce resource usage and energy consumption for embedded systems. We use code restructuring, loop scheduling, data transformation, code and data placement, and scratch-pad memory (SPM) management as our tools in different embedded system scenarios. The majority of our work is focused on loop scheduling. The main contributions of our work are as follows. We propose a memory-saving strategy that exploits the value locality in array data by storing arrays in a compressed form. Based on the compressed forms of the input arrays, our approach automatically determines the compressed forms of the output arrays and also automatically restructures the code. We propose and evaluate a compiler-directed code scheduling scheme that considers both parallelism and data locality. It analyzes the code using a locality-parallelism graph representation and assigns the nodes of this graph to processors. We also introduce an Integer Linear Programming based formulation of the scheduling problem. We propose a compiler-based SPM-conscious loop scheduling strategy for array/loop-based embedded applications. The method distributes loop iterations across parallel processors in an SPM-conscious manner: the compiler identifies potential SPM hits and misses, and distributes loop iterations such that the processors have close execution times (a sketch of this idea follows below). We present an SPM management technique using Markov chain based data access prediction. We propose a compiler-directed integrated code and data placement scheme for 2-D mesh-based CMP architectures. Using a Code-Data Affinity Graph (CDAG) to represent the relationship between loop iterations and array data, it assigns sets of loop iterations to processing cores and sets of data blocks to on-chip memories. We present a memory-bank-aware dynamic loop scheduling scheme for array-intensive applications, whose goal is to minimize the number of memory banks needed for executing each group of loop iterations.
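
    As a rough illustration of the SPM-conscious loop scheduling idea above, the C sketch below partitions loop iterations across processors using per-iteration cost estimates derived from predicted scratch-pad hits and misses. The spm_hit predicate, cost constants, and sizes are illustrative assumptions, not the thesis's actual compiler analysis.

```c
#include <stdio.h>

#define N_ITERS   1024
#define N_PROCS   4
#define HIT_COST  1     /* cycles for a scratch-pad (SPM) hit    */
#define MISS_COST 10    /* cycles for an off-chip memory access  */

/* Stand-in for the compiler's analysis: does iteration i's data
   reside in the SPM?  Here we simply assume elements 0..255 were
   placed there. */
static int spm_hit(int i) { return i < 256; }

int main(void) {
    /* Per-iteration cost estimates from predicted SPM hits/misses. */
    int cost[N_ITERS], total = 0;
    for (int i = 0; i < N_ITERS; i++) {
        cost[i] = spm_hit(i) ? HIT_COST : MISS_COST;
        total += cost[i];
    }

    /* Greedy contiguous partitioning: grow each processor's chunk
       until its predicted cost reaches total/N_PROCS, so processors
       finish at roughly the same time despite non-uniform costs. */
    int target = total / N_PROCS, i = 0;
    for (int p = 0; p < N_PROCS; p++) {
        int start = i, acc = 0;
        while (i < N_ITERS && (acc < target || p == N_PROCS - 1))
            acc += cost[i++];
        printf("proc %d: iterations [%d, %d), predicted cost %d\n",
               p, start, i, acc);
    }
    return 0;
}
```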

    Runtime Support for In-Core and Out-of-Core Data-Parallel Programs

    Distributed memory parallel computers or distributed computer systems are widely recognized as the only cost-effective means of achieving teraflops performance in the near future. However, the fact remains that they are difficult to program, and advances in software for these machines have not kept pace with advances in hardware. This thesis addresses several issues in providing runtime support for in-core as well as out-of-core programs on distributed memory parallel computers. This runtime support can be used directly in application programs for greater efficiency, portability, and ease of programming. It can also be used together with a compiler to translate programs written in a high-level data-parallel language like High Performance Fortran (HPF) to node programs for distributed memory machines. In distributed memory programs, it is often necessary to change the distribution of arrays during program execution. This thesis presents efficient and portable algorithms for runtime array redistribution. The algorithms have been implemented on the Intel Touchstone Delta and are found to scale well with the number of processors and array size. This thesis also presents algorithms for all-to-all collective communication on fat-tree and two-dimensional mesh interconnection topologies. The performance of these algorithms on the CM-5 and Touchstone Delta is studied extensively. A model for estimating the time taken by these algorithms on the basis of system parameters is developed and validated by comparison with experimental results. A number of applications deal with very large data sets that cannot fit in main memory and hence have to be stored in files on disks, resulting in out-of-core programs. This thesis also describes the design and implementation of efficient runtime support for out-of-core computations. Several optimizations for accessing out-of-core data are presented. An extended Two-Phase Method is proposed for accessing sections of out-of-core arrays efficiently. This method uses collective I/O, and the I/O workload is divided among processors dynamically, depending on the access requests (a sketch of the two-phase pattern follows below). Performance results obtained using this runtime support for out-of-core programs on the Touchstone Delta are presented.
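
    The extended Two-Phase Method builds on the classic collective-I/O pattern: conforming contiguous reads first, then an in-memory redistribution. Below is a minimal C sketch of that base pattern, expressed with modern MPI-IO calls rather than the Touchstone Delta-era interfaces the thesis actually used; the file name, sizes, and the all-to-all stand-in for the general index exchange are illustrative assumptions.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long n = 1L << 20;           /* global array elements */
    int chunk = (int)(n / nprocs);     /* contiguous file chunk */
    double *buf  = malloc(chunk * sizeof *buf);
    double *mine = malloc(chunk * sizeof *mine);

    /* Phase 1 (I/O): each process reads one large conforming,
       contiguous block -- the access pattern disks handle best. */
    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "bigarray.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_read_at_all(fh, (MPI_Offset)rank * chunk * sizeof(double),
                         buf, chunk, MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    /* Phase 2 (communication): redistribute in memory to the layout
       the computation wants; a real implementation exchanges exactly
       the requested sections and balances this work dynamically. */
    MPI_Alltoall(buf,  chunk / nprocs, MPI_DOUBLE,
                 mine, chunk / nprocs, MPI_DOUBLE, MPI_COMM_WORLD);

    free(buf); free(mine);
    MPI_Finalize();
    return 0;
}
```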

    RICH: implementing reductions in the cache hierarchy

    Reductions constitute a frequent algorithmic pattern in high-performance and scientific computing. Sophisticated techniques are needed to ensure their correct and scalable concurrent execution on modern processors. Reductions on large arrays represent the most demanding case, where traditional approaches are not always applicable due to low performance scalability. To address these challenges, we propose RICH, a runtime-assisted solution that relies on architectural and parallel programming model extensions. RICH updates the reduction variable directly in the cache hierarchy with the help of added in-cache functional units. Our programming model extensions fit the most relevant parallel programming solutions for shared-memory environments, like OpenMP. RICH does not modify the ISA, which allows the use of algorithms with reductions from pre-compiled external libraries. Experiments show that our solution achieves performance improvements of 11.2% on average compared to state-of-the-art hardware-based approaches, while introducing 2.4% area and 3.8% power overheads.

    This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), and by the Generalitat de Catalunya (contracts 2017-SGR-1414 and 2017-SGR-1328). V. Dimić has been partially supported by the Agency for Management of University and Research Grants (AGAUR) of the Government of Catalonia under the Ajuts per a la contractació de personal investigador novell fellowship number 2017 FI_B 00855. M. Moretó has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramón y Cajal fellowship number RYC-2016-21104. M. Casas has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramón y Cajal fellowship number RYC-2017-23269. This manuscript has been co-authored by National Technology & Engineering Solutions of Sandia, LLC under Contract No. DE-NA0003525 with the U.S. Department of Energy/National Nuclear Security Administration.
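
    For context, below is the kind of large-array reduction the abstract calls the most demanding case, written as a standard OpenMP array-section reduction in C. Conventional runtimes typically implement this by privatizing the array per thread, which scales poorly as the array and thread count grow; RICH instead performs the updates in the cache hierarchy. The sizes and hash here are illustrative.

```c
#include <stdio.h>

#define BINS (1 << 16)    /* large reduction array (histogram bins) */
#define N    (1L << 24)   /* elements to accumulate                 */

int main(void) {
    static unsigned hist[BINS];

    /* Array-section reduction (OpenMP 4.5+).  The usual software
       implementation gives every thread a private BINS-sized copy
       and combines them afterwards -- exactly the privatization
       cost that limits scalability for large arrays. */
    #pragma omp parallel for reduction(+: hist[0:BINS])
    for (long i = 0; i < N; i++) {
        unsigned bin = (unsigned)(i * 2654435761u) % BINS;
        hist[bin]++;
    }

    printf("hist[0] = %u\n", hist[0]);
    return 0;
}
```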

    High-Performance Computing and Four-Dimensional Data Assimilation: The Impact on Future and Current Problems

    This is the final technical report for the project entitled "High-Performance Computing and Four-Dimensional Data Assimilation: The Impact on Future and Current Problems", funded at NPAC by the DAO at NASA/GSFC. First, the motivation for the project is given in the introductory section, followed by an executive summary of major accomplishments and the list of project-related publications. Detailed analysis and description of the research results are given in subsequent chapters and in the Appendix.

    A metadata-enhanced framework for high performance visual effects

    This thesis is devoted to reducing the interactive latency of image processing computations in visual effects. Film and television graphic artists depend upon low-latency feedback to receive a visual response to changes in effect parameters. We tackle latency with a domain-specific optimising compiler which leverages high-level program metadata to guide key computational and memory hierarchy optimisations. This metadata encodes static and dynamic information about data dependence and patterns of memory access in the algorithms constituting a visual effect – features that are typically difficult to extract through program analysis – and presents it to the compiler in an explicit form. By using domain-specific information as a substitute for program analysis, our compiler is able to target a set of complex source-level optimisations that a vendor compiler does not attempt, before passing the optimised source to the vendor compiler for lower-level optimisation.

    Three key metadata-supported optimisations are presented. The first is an adaptation of space and schedule optimisation – based upon well-known compositions of the loop fusion and array contraction transformations (illustrated in the sketch after this abstract) – to the dynamic working sets and schedules of a runtime-parameterised visual effect. This adaptation sidesteps the costly solution of runtime code generation by specialising static parameters in an offline process and exploiting dynamic metadata to adapt the schedule and contracted working sets at runtime to user-tunable parameters. The second optimisation comprises a set of transformations to generate SIMD ISA-augmented source code. Our approach differs from autovectorisation by using static metadata to identify parallelism, in place of data dependence analysis, and runtime metadata to tune the data layout to user-tunable parameters for optimal aligned memory access. The third optimisation comprises a related set of transformations to generate code for SIMT architectures, such as GPUs. Static dependence metadata is exploited to guide large-scale parallelisation for tens of thousands of in-flight threads. Optimal use of the alignment-sensitive, explicitly managed memory hierarchy is achieved by identifying inter-thread and intra-core data sharing opportunities in memory access metadata.

    A detailed performance analysis of these optimisations is presented for two industrially developed visual effects. In our evaluation we demonstrate up to 8.1x speed-ups on Intel and AMD multicore CPUs and up to 6.6x speed-ups on NVIDIA GPUs over our best hand-written implementations of these two effects. Programmability is enhanced by automating the generation of SIMD and SIMT implementations from a single programmer-managed scalar representation.
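
    As a rough illustration of the loop fusion plus array contraction composition the first optimisation builds on, the C sketch below shows two pipeline stages before and after the transformation. The stage bodies and sizes are illustrative, not the industrial effects evaluated in the thesis.

```c
#include <stddef.h>

#define N 4096

/* Before: two separate loops materialise a full N-element
   intermediate array between the stages. */
void before(const float *in, float *out, float *tmp) {
    for (size_t i = 0; i < N; i++) tmp[i] = in[i] * 0.5f;   /* stage 1 */
    for (size_t i = 0; i < N; i++) out[i] = tmp[i] + 1.0f;  /* stage 2 */
}

/* After: the loops are fused and the N-element temporary is
   contracted to a single scalar, removing its memory footprint
   and the traffic of writing and re-reading it. */
void after(const float *in, float *out) {
    for (size_t i = 0; i < N; i++) {
        float t = in[i] * 0.5f;   /* contracted intermediate */
        out[i]  = t + 1.0f;
    }
}
```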

    GPUfs: The Case for Operating System Services on GPUs

    Due to their impressive price/performance and performance/watt curves, GPUs have become the processor of choice for many types of intensively parallel computations, from data mining to molecular dynamics simulations. Unfortunately, GPU programming models are still almost entirely lacking core system abstractions, such as files and sockets, that CPU programmers have taken for granted for decades. Today's GPU is capable of amazing computational feats when fed with the right data and managed by application code on the host CPU, but it is incapable of initiating basic system interactions for itself, such as reading an input file from a disk. Because core system abstractions are unavailable to GPU code, GPU programmers today face many of the same challenges CPU application developers did a half-century ago, particularly the constant reimplementation of system abstractions such as data movement and management operations. We feel the time has come to provide GPU programs with the useful system services that CPU code already enjoys. This goal emerges from a broader trend to integrate GPUs more cleanly with operating systems (OS), as exemplified by recent work to support pipeline composition of GPU tasks.

    Two key GPU characteristics make developing OS abstractions for GPUs challenging: data parallelism and an independent memory system. GPUs are optimized for Single Program Multiple Data (SPMD) parallelism, where the same program is used to concurrently process many different parts of the input data. GPU programs typically use tens of thousands of lightweight threads running similar or identical code with little control-flow variation. Conventional OS services, such as the POSIX file system API, were not built with such an execution environment in mind. In GPUfs, we had to adapt both the API semantics and its implementation to support such massive parallelism, allowing thousands of threads to efficiently invoke open, close, read, or write calls simultaneously.

    To feed their voracious appetites for data, high-end GPUs usually have their own DRAM storage. A massively parallel memory interface to this DRAM offers high bandwidth for local access by GPU code, but GPU access to the CPU's system memory is an order of magnitude slower, because it requires communication over the bandwidth-constrained, higher-latency PCI Express bus. In the increasingly common case of systems with multiple discrete GPUs (standard in Apple's new Mac Pro, for example), each GPU has its own local memory, and accessing a GPU's own memory can be an order of magnitude more efficient than accessing a sibling GPU's memory. GPUs thus exhibit a particularly extreme non-uniform memory access (NUMA) property, making it performance-critical for the OS to optimize for access locality in data placement and reuse across CPU and GPU memories. GPUfs, for example, distributes its buffer cache across all CPU and GPU memories to enable idioms like process pipelines that read and write files from the same or different processors.

    To highlight the benefits of bringing core OS abstractions such as files to GPUs, we show the use of GPUfs in a self-contained GPU application for string search on NVIDIA GPUs. This application illustrates how GPUfs can efficiently support irregular workloads, in which parallel threads open and access dynamically selected files of varying size and composition, and the output size might be arbitrarily large and determined at runtime. Our version is about seven times faster than a parallel 8-core CPU run on the full Linux kernel source stored in about 33,000 small files. While our current prototype benchmarks represent only a few simple application data points for a single OS abstraction, they suggest that OS services for GPU code are not only hypothetically desirable, but feasible and efficient in practice.
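
    To make the programming style concrete, here is a hedged CUDA-C sketch of a GPUfs-style string-search kernel in which each thread block opens and scans a dynamically selected file from device code. The gopen/gread/gclose names follow the GPUfs papers, but their exact signatures, the make_filename helper, and the single-thread call guards are illustrative simplifications (GPUfs actually issues file-system calls collectively across a thread block).

```c
__global__ void grep_kernel(const char *needle, int nlen,
                            unsigned long long *hits)
{
    __shared__ char   buf[4096];
    __shared__ int    fd;
    __shared__ size_t got;

    /* Each block processes one dynamically selected file. */
    char name[32];
    make_filename(name, blockIdx.x);         /* hypothetical helper */

    if (threadIdx.x == 0)
        fd = gopen(name, 0 /* read-only */); /* assumed signature */
    __syncthreads();

    size_t off = 0;
    for (;;) {
        if (threadIdx.x == 0)
            got = gread(fd, off, sizeof buf, buf);  /* assumed */
        __syncthreads();
        if (got == 0) break;

        /* Threads of the block scan the buffer cooperatively. */
        for (size_t i = threadIdx.x; i + nlen <= got; i += blockDim.x) {
            int match = 1;
            for (int j = 0; j < nlen; j++)
                if (buf[i + j] != needle[j]) { match = 0; break; }
            if (match) atomicAdd(hits, 1ULL);
        }
        __syncthreads();   /* finish scanning buf before reuse */
        off += got;
    }
    if (threadIdx.x == 0) gclose(fd);
}
```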

    Dynamic load balancing via thread migration

    Light-weight threads are becoming increasingly useful for parallel processing. This is particularly true for threads running in a distributed memory environment. Light-weight threads can be used to support latency-hiding techniques, communication and computation overlap, and functional parallelism. Additionally, dynamic migration of light-weight threads supports both data locality and load balancing. Designing a thread migration mechanism presents some very unique and interesting challenges. One such challenge is maintaining communication between mobile threads. A potentially more difficult challenge involves maintaining the correctness of pointers within mobile threads. Since traditional pointers have no concept of address space, moving threads from processor to processor has a strong impact on the use of pointers. Options for dealing with pointers include restricting their use, adding a layer of software to support pointers referencing non-local data, and binding data to threads such that referenced data is always local to the thread (the sketch below illustrates this last option). This dissertation presents the design and implementation of Chant, an efficient light-weight threads package that runs in a distributed memory environment. Chant was designed and implemented as a runtime system using MPI-like and Pthreads-like calls. Chant supports point-to-point message passing between threads executing in distributed address spaces. We focus on the use of Chant as a framework to support dynamic load balancing based on thread migration. We explore many of the issues which arise when designing and implementing a thread migration mechanism, as well as the issues which arise when considering the use of thread migration as a means of performing dynamic load balancing. This load balancing framework uses both system state information, including communication history, and user input. One of its basic functionalities is the ability of the user to customize load balancing to fit particular classes of problems. This dissertation provides implementation details as well as discussion and justification of design choices. We go on to show that the overhead associated with our approach is within an acceptable range, and that significant performance gains can be achieved through the use of thread migration as a means of performing dynamic load balancing.
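
    The pointer-correctness challenge can be illustrated in a few lines of C: an absolute pointer into thread-bound data breaks once the thread's state is copied into a different address space, while a self-relative offset survives. This corresponds to the "binding data to threads" option the abstract mentions; the types and helpers are illustrative, not Chant's API.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    char      data[256];   /* thread-bound data, migrates with thread */
    ptrdiff_t cursor_off;  /* self-relative "pointer" into data       */
} mobile_thread_state;

/* Convert between a usable pointer and a migration-safe offset. */
static char *cursor(mobile_thread_state *t) {
    return t->data + t->cursor_off;
}
static void set_cursor(mobile_thread_state *t, char *p) {
    t->cursor_off = p - t->data;  /* valid in any address space */
}

/* "Migration": the state is byte-copied to a new address (on a real
   system, shipped in a message to another node).  The offset remains
   meaningful at the destination; an absolute pointer would not. */
static void migrate(const mobile_thread_state *src,
                    mobile_thread_state *dst) {
    memcpy(dst, src, sizeof *dst);
}
```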