Efficient optimization of memory accesses in parallel programs
The power, frequency, and memory wall problems have caused a major shift in mainstream computing by introducing processors that contain multiple low power cores. As multi-core processors are becoming ubiquitous, software trends in both parallel programming languages and dynamic compilation have added new challenges to program compilation for multi-core processors. This thesis proposes a combination of high-level and low-level compiler optimizations to address these challenges.
The high-level optimizations introduced in this thesis include new approaches to May-Happen-in-Parallel analysis and Side-Effect analysis for parallel programs, and a novel parallelism-aware Scalar Replacement for Load Elimination transformation. A new Isolation Consistency (IC) memory model is described that permits more scalar replacement opportunities than many existing memory models.
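To make the transformation concrete, the following is a minimal sketch of a load-elimination pass in the spirit described above. The tuple-based op encoding, the helper name eliminate_loads, and the coarse "kill everything at a call" rule are hypothetical simplifications; the thesis's actual pass would consult its May-Happen-in-Parallel and Side-Effect analyses to kill far fewer available loads.

```python
def eliminate_loads(ops):
    """Forward pass over a toy IR: replace a redundant getfield load with a
    register copy when the same (object, field) value is already available.
    Aliasing and parallelism are ignored here; a real pass would use
    side-effect and MHP analysis to refine the kill sets."""
    available = {}  # (obj, field) -> register already holding the value
    out = []
    for op in ops:
        kind = op[0]
        if kind == "getfield":            # ("getfield", dst, obj, field)
            _, dst, obj, field = op
            key = (obj, field)
            if key in available:
                out.append(("copy", dst, available[key]))  # scalar-replaced
            else:
                out.append(op)
                available[key] = dst
        elif kind == "putfield":          # ("putfield", obj, field, src)
            _, obj, field, src = op
            available[(obj, field)] = src  # stored value is now available
            out.append(op)
        else:                              # calls, barriers: kill everything
            available.clear()
            out.append(op)
    return out
```

Running it on two consecutive loads of the same field turns the second into a register copy, which is exactly the dynamic-getfield reduction the evaluation measures.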
The low-level optimizations include a novel approach to register allocation that retains the compile time and space efficiency of Linear Scan, while delivering runtime performance superior to both Linear Scan and Graph Coloring. The allocation phase is modeled as an optimization problem on a Bipartite Liveness Graph (BLG) data structure. The assignment phase focuses on reducing the number of spill instructions by using register-to-register move and exchange instructions wherever possible.
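As a rough illustration of allocating over live ranges rather than coloring a full interference graph, here is a greedy sketch that processes intervals in decreasing spill-cost order. The function name, interval encoding, and greedy policy are illustrative assumptions; the thesis models the BLG allocation phase as a genuine optimization problem rather than this one-pass heuristic.

```python
def allocate(intervals, num_regs):
    """Greedy allocation on a bipartite (interval, register) view:
    visit intervals in decreasing spill-cost order and give each one the
    first register whose already-assigned intervals it does not overlap.
    intervals: list of (name, start, end, spill_cost) with [start, end)."""
    assigned = {r: [] for r in range(num_regs)}  # reg -> live intervals
    alloc, spilled = {}, []

    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    for name, start, end, cost in sorted(intervals, key=lambda t: -t[3]):
        for r in range(num_regs):
            if all(not overlaps((start, end), iv) for iv in assigned[r]):
                assigned[r].append((start, end))
                alloc[name] = r
                break
        else:
            spilled.append(name)  # no conflict-free register: spill
    return alloc, spilled
```

Prioritizing by spill cost is one simple way to bias spills toward cheap variables; the separate assignment phase described above would then try to patch remaining conflicts with register-to-register moves and exchanges instead of memory spills.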
Experimental evaluations of our scalar replacement for load elimination transformation in the Jikes RVM dynamic compiler show decreases in dynamic counts for getfield operations of up to 99.99%, and performance improvements of up to 1.76x on 1 core and 1.39x on 16 cores, when compared with the load elimination algorithm available in Jikes RVM. A prototype implementation of our BLG register allocator in Jikes RVM demonstrates runtime performance improvements of up to 3.52x relative to Linear Scan on an x86 processor. When compared to the Graph Coloring register allocator in the GCC compiler framework, our allocator achieved an execution time improvement of up to 5.8%, with an average improvement of 2.3%, on a POWER5 processor.
Combined with the foundations presented in this thesis, these experimental evaluations suggest that the proposed high-level and low-level optimizations address some of the new challenges in optimizing parallel programs for multi-core architectures.
A GPU Register File using Static Data Compression
GPUs rely on large register files to unlock thread-level parallelism for high
throughput. Unfortunately, large register files are power hungry, making it
important to seek new approaches to improve their utilization.
This paper introduces a new register file organization for efficient
register-packing of narrow integer and floating-point operands, designed to
leverage advances in static analysis. We show that the hardware/software
co-designed register file organization yields a performance improvement of up
to 79% (18.6% on average) at a modest output-quality degradation.
Comment: Accepted to ICPP'2
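The core packing idea can be sketched in a few lines: two statically-proven-narrow operands share one physical register slot. The helper names, the 16-bit sub-word width, and the unsigned encoding are assumptions for illustration; the paper's organization handles this in hardware, guided by compiler static analysis.

```python
def pack(lo, hi, width=16):
    """Pack two unsigned narrow operands into one 32-bit register slot.
    Static analysis would have proven both values fit in `width` bits."""
    assert 0 <= lo < (1 << width) and 0 <= hi < (1 << width)
    return (hi << width) | lo

def unpack(word, width=16):
    """Recover the two narrow operands from a packed register value."""
    mask = (1 << width) - 1
    return word & mask, (word >> width) & mask
```

Halving the physical registers touched per pair of narrow values is where the utilization (and hence power) benefit comes from.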
Preprint: Open Source Compiling for V1Model RMT Switch: Making Data Center Networking Innovation Accessible
Very few of the innovations in data center networking have seen data-center-scale
implementation, because the data center network's extreme-scale performance
requires hardware implementation, which is accessible only to a few. However,
the emergence of reconfigurable match-action table (RMT) paradigm-based
switches has finally opened up the development life cycle of data plane
devices. The P4 language is the dominant choice for programming these
devices. Network operators can now implement desired features over white-box
RMT switches. The process involves an innovator writing new algorithms in
the P4 language and getting them compiled for the target hardware. However,
there is still a roadblock: after an algorithm is designed, the technology for
compiling the P4 program is not fully open source. It is therefore very
difficult for an average researcher to gain deep insight into the performance
of their innovation when executed at the silicon level. No open-source compiler
backend is available for this purpose. Proprietary compiler backends from
different hardware vendors exist, but they are closed source and do not provide
access to their internal mapping mechanisms, which inhibits experimentation
with new mapping algorithms and innovative instruction sets for the
reconfigurable match-action table architecture. This paper
describes our work toward an open-source compiler backend for compiling P4_16
programs targeting V1Model-architecture-based programmable switches.
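One of the "internal mapping mechanisms" such a backend must implement is placing logical match-action tables into physical pipeline stages. Here is a hedged, greatly simplified sketch of one greedy strategy; the function name, the single scalar capacity per stage, and the topological-order input are all assumptions, not the paper's actual algorithm (real RMT mapping must juggle TCAM/SRAM budgets, action units, and header-field constraints).

```python
def map_tables(tables, deps, stage_capacity, num_stages):
    """Greedy table-to-stage mapping: place each table (visited in
    topological order) into the earliest stage that comes after every
    stage holding one of its match/action dependencies and that still
    has capacity left.
    tables: list of (name, size); deps: name -> list of predecessor names."""
    stage_of, load = {}, [0] * num_stages
    for name, size in tables:
        earliest = max((stage_of[d] + 1 for d in deps.get(name, [])),
                       default=0)
        for s in range(earliest, num_stages):
            if load[s] + size <= stage_capacity:
                stage_of[name] = s
                load[s] += size
                break
        else:
            raise ValueError(f"cannot place table {name}")
    return stage_of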
Fault Tolerant Integer Data Computations: Algorithms and Applications
As computing units move to higher transistor integration densities and computing clusters become highly heterogeneous, studies begin to predict that, rather than being exceptions, data corruptions in memory and processor failures are likely to become more prevalent. It has therefore become imperative to improve the reliability of systems in the face of increasing soft error probabilities in memory and computing logic units of silicon CMOS integrated chips. This thesis introduces a new class of algorithms for fault tolerance in compute-intensive linear and sesquilinear ("one-and-a-half-linear") data computations on integer data inputs within high-performance computing systems. The key difference between the proposed algorithms and existing fault tolerance methods is the elimination of the traditional requirement for additional hardware resources for system reliability. The first contribution of this thesis is in the detection of hardware-induced errors in integer matrix products. The proposed method of numerical packing for detecting a single error within a quadruple of matrix outputs is described in Chapter 2. The chapter includes analytic calculations of the proposed method's computational complexity and reliability. Experimental results show that the proposed algorithm incurs execution time overhead comparable to existing algorithms for the detection and correction of a limited number of errors within generic matrix multiplication (GEMM) outputs. On the other hand, numerical packing becomes substantially more efficient in the mitigation of multiple errors. The achieved execution time gain of numerical packing is further analyzed with respect to its energy-saving equivalent, thus paving the way for a new class of silent data corruption (SDC) mitigation methods for integer matrix products that are fast, energy efficient, and highly reliable.
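To convey the flavor of packing-based error detection, here is a deliberately simplified sketch for a single integer dot product (the building block of GEMM). It packs a shifted duplicate of each operand into the same machine word so that one arithmetic stream computes the result twice, and a mismatch between the two halves flags a soft error. The K-bit layout, the function name, and the duplicate-and-compare scheme are illustrative assumptions; the thesis's actual construction packs quadruples of matrix outputs and is analyzed for complexity and reliability in Chapter 2.

```python
K = 20  # bit offset; toy assumption: the true dot product fits in K bits

def packed_dot(a, b):
    """Compute an integer dot product twice inside one arithmetic stream:
    each a_i carries a shifted duplicate (a_i + a_i * 2^K), so the packed
    sum equals d + d * 2^K. Disagreement between the low and high halves
    signals a single soft error in the computation."""
    packed = sum((ai + (ai << K)) * bi for ai, bi in zip(a, b))
    low = packed & ((1 << K) - 1)
    high = packed >> K
    if low != high:
        raise ArithmeticError("soft error detected in dot product")
    return low
```

Because no second processor or redundant execution pass is needed, this is the sense in which the approach avoids "additional hardware resources for system reliability."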
A further advancement of the proposed numerical packing approach for the mitigation of core/processor failures in computing clusters (a.k.a. fail-stop failures) is described in Chapter 3. The key advantage of this new packing approach is the ability to tolerate processor failures for all classes of sum-of-product computations. Because multimedia applications running on cloud computing platforms are now required to mitigate an increasing number of failures and outages at runtime, we analyze the efficiency of numerical packing within an image retrieval framework deployed over a cluster of AWS EC2 spot (i.e., low-cost albeit terminable) instances. Our results show that a cost reduction of more than 70% can be achieved in comparison to conventional failure-intolerant processing based on AWS EC2 on-demand (i.e., higher-cost albeit guaranteed) instances. Finally, beyond numerical packing, we present a second approach for reliability in the case of linear and sesquilinear integer data computations by generalizing the recently-proposed concept of numerical entanglement. The proposed approach is capable of recovering from multiple fail-stop failures in a parallel/distributed computing environment. We present theoretical analysis of the computational and bit-width requirements of the proposed method in comparison to existing methods of checksum generation and processing. Our experiments with integer matrix products show that the proposed approach incurs a 1.72%−37.23% reduction in processing throughput in comparison to failure-intolerant processing, while allowing for the mitigation of multiple fail-stop failures without the use of additional computing resources.
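For contrast with the checksum-based methods the thesis compares against, here is a minimal sketch of how a single fail-stop loss is recovered from a sum checksum. The function name and the "one missing partial result" restriction are assumptions of this toy; numerical entanglement generalizes to multiple simultaneous fail-stop failures without a dedicated checksum worker.

```python
def recover(results, checksum):
    """Checksum-style fail-stop recovery: given per-worker partial results
    (with None marking a failed worker) and the precomputed sum checksum,
    reconstruct a single missing result by subtraction."""
    missing = [i for i, r in enumerate(results) if r is None]
    if not missing:
        return results
    assert len(missing) == 1, "this toy recovers only one fail-stop failure"
    i = missing[0]
    results[i] = checksum - sum(r for r in results if r is not None)
    return results
```

The bit-width analysis mentioned above matters precisely here: both checksums and packed/entangled representations consume dynamic range of the integer word, and the thesis quantifies that trade-off.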
Implicit-explicit Integrated Representations for Multi-view Video Compression
With the increasing consumption of 3D displays and virtual reality,
multi-view video has become a promising format. However, its high resolution
and multi-camera shooting result in a substantial increase in data volume,
making storage and transmission a challenging task. To tackle these
difficulties, we propose an implicit-explicit integrated representation for
multi-view video compression. Specifically, we first use the explicit
representation-based 2D video codec to encode one of the source views.
Subsequently, we propose employing the implicit neural representation
(INR)-based codec to encode the remaining views. The implicit codec takes the
time and view index of multi-view video as coordinate inputs and generates the
corresponding implicit reconstruction frames. To enhance the compressibility, we
introduce a multi-level feature grid embedding and a fully convolutional
architecture into the implicit codec. These components facilitate
coordinate-feature and feature-RGB mapping, respectively. To further enhance
the reconstruction quality from the INR codec, we leverage the high-quality
reconstructed frames from the explicit codec to achieve inter-view
compensation. Finally, the compensated results are fused with the implicit
reconstructions from the INR to obtain the final reconstructed frames. Our
proposed framework combines the strengths of both implicit neural
representation and explicit 2D codec. Extensive experiments conducted on public
datasets demonstrate that the proposed framework can achieve comparable or even
superior performance to the latest multi-view video compression standard MIV
and other INR-based schemes in terms of view compression and scene modeling.
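The coordinate-to-feature step of the implicit codec can be sketched as follows: each level of the multi-level grid is bilinearly interpolated at the normalized (time, view) coordinate and the per-level features are concatenated before the fully convolutional decoder maps them toward RGB. The function name, the scalar-per-cell grids, and the two-level example are simplifying assumptions; the paper's grids hold learned feature vectors.

```python
def grid_embed(t, v, grids):
    """Multi-level feature-grid embedding (toy version): for each level,
    bilinearly interpolate a small learned 2-D grid at the normalized
    (time, view) coordinate in [0, 1]^2 and concatenate the results."""
    feats = []
    for grid in grids:                      # grid: 2-D list of floats
        h, w = len(grid) - 1, len(grid[0]) - 1
        x, y = t * h, v * w
        x0, y0 = min(int(x), h - 1), min(int(y), w - 1)
        dx, dy = x - x0, y - y0
        f = (grid[x0][y0]     * (1 - dx) * (1 - dy)
             + grid[x0 + 1][y0]     * dx * (1 - dy)
             + grid[x0][y0 + 1]     * (1 - dx) * dy
             + grid[x0 + 1][y0 + 1] * dx * dy)
        feats.append(f)
    return feats
```

Coarse levels capture slow variation across time and views while fine levels add detail, which is what makes the embedding more compressible than a raw per-frame representation.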
Survey on Combinatorial Register Allocation and Instruction Scheduling
Register allocation (mapping variables to processor registers or memory) and
instruction scheduling (reordering instructions to increase instruction-level
parallelism) are essential tasks for generating efficient assembly code in a
compiler. In the last three decades, combinatorial optimization has emerged as
an alternative to traditional, heuristic algorithms for these two tasks.
Combinatorial optimization approaches can deliver optimal solutions according
to a model, can precisely capture trade-offs between conflicting decisions, and
are more flexible at the expense of increased compilation time.
This paper provides an exhaustive literature review and a classification of
combinatorial optimization approaches to register allocation and instruction
scheduling, with a focus on the techniques that are most applied in this
context: integer programming, constraint programming, partitioned Boolean
quadratic programming, and enumeration. Researchers in compilers and
combinatorial optimization can benefit from identifying developments, trends,
and challenges in the area; compiler practitioners may discern opportunities
and grasp the potential benefit of applying combinatorial optimization.
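To show what "optimal according to a model" means at its simplest, here is an enumeration-based register allocator for a tiny interference graph: it exhaustively tries every assignment of each variable to a register or to memory and keeps the cheapest feasible one. The function name, the spill-cost model, and the brute-force search are illustrative assumptions; the surveyed approaches use integer programming, constraint programming, or PBQP solvers to scale far beyond what enumeration can handle.

```python
from itertools import product

def optimal_alloc(vars_cost, conflicts, regs):
    """Enumeration-based combinatorial register allocation: search every
    assignment of variables to registers (or 'spill'), reject assignments
    where two interfering variables share a register, and return the
    (total_spill_cost, assignment) pair with minimum cost."""
    names = list(vars_cost)
    best = None
    for choice in product(list(regs) + ["spill"], repeat=len(names)):
        assign = dict(zip(names, choice))
        if any(assign[a] == assign[b] != "spill" for a, b in conflicts):
            continue  # interference constraint violated
        cost = sum(vars_cost[v] for v in names if assign[v] == "spill")
        if best is None or cost < best[0]:
            best = (cost, assign)
    return best
```

Unlike a heuristic, this search provably minimizes the modeled spill cost, which is exactly the guarantee (and the compile-time expense) that the surveyed combinatorial approaches trade for.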