588 research outputs found

    A polyhedral compilation framework for loops with dynamic data-dependent bounds

    Get PDF
    International audienceWe study the parallelizing compilation and loop nest optimization of an important class of programs where counted loops have a dynamic data-dependent upper bound. Such loops are amenable to a wider set of transformations than general while loops with inductively defined termination conditions: for example, the substitution of closed forms for induction variables remains applicable, removing the loop-carried data dependences induced by termination conditions. We propose an automatic compilation approach to parallelize and optimize dynamic counted loops. Our approach relies on affine relations only, as implemented in state-of-the-art polyhedral libraries. Revisiting a state-of-the-art framework to parallelize arbitrary while loops, we introduce additional control dependences on data-dependent predicates. Our method goes beyond the state of the art in fully automating the process, specializing the code generation algorithm to the case of dynamic counted loops and avoiding the introduction of spurious loop-carried dependences. We conduct experiments on representative irregular computations, from dynamic programming, computer vision and finite element methods to sparse matrix linear algebra. We validate that the method is applicable to general affine transformations for locality optimization, vectorization and parallelization

    The Potential of Synergistic Static, Dynamic and Speculative Loop Nest Optimizations for Automatic Parallelization

    Get PDF
    Research in automatic parallelization of loop-centric programs started with static analysis, then broadened its arsenal to include dynamic inspection-execution and speculative execution, the best results involving hybrid static-dynamic schemes. Beyond the detection of parallelism in a sequential program, scalable parallelization on many-core processors involves hard and interesting parallelism adaptation and mapping challenges. These challenges include tailoring data locality to the memory hierarchy, structuring independent tasks hierarchically to exploit multiple levels of parallelism, tuning the synchronization grain, balancing the execution load, decoupling the execution into thread-level pipelines, and leveraging heterogeneous hardware with specialized accelerators. The polyhedral framework allows to model, construct and apply very complex loop nest transformations addressing most of the parallelism adaptation and mapping challenges. But apart from hardware-specific, back-end oriented transformations (if-conversion, trace scheduling, value prediction), loop nest optimization has essentially ignored dynamic and speculative techniques. Research in polyhedral compilation recently reached a significant milestone towards the support of dynamic, data-dependent control flow. This opens a large avenue for blending dynamic analyses and speculative techniques with advanced loop nest optimizations. Selecting real-world examples from SPEC benchmarks and numerical kernels, we make a case for the design of synergistic static, dynamic and speculative loop transformation techniques. We also sketch the embedding of dynamic information, including speculative assumptions, in the heart of affine transformation search spaces

    Mira: A Framework for Static Performance Analysis

    Full text link
    The performance model of an application can pro- vide understanding about its runtime behavior on particular hardware. Such information can be analyzed by developers for performance tuning. However, model building and analyzing is frequently ignored during software development until perfor- mance problems arise because they require significant expertise and can involve many time-consuming application runs. In this paper, we propose a fast, accurate, flexible and user-friendly tool, Mira, for generating performance models by applying static program analysis, targeting scientific applications running on supercomputers. We parse both the source code and binary to estimate performance attributes with better accuracy than considering just source or just binary code. Because our analysis is static, the target program does not need to be executed on the target architecture, which enables users to perform analysis on available machines instead of conducting expensive exper- iments on potentially expensive resources. Moreover, statically generated models enable performance prediction on non-existent or unavailable architectures. In addition to flexibility, because model generation time is significantly reduced compared to dynamic analysis approaches, our method is suitable for rapid application performance analysis and improvement. We present several scientific application validation results to demonstrate the current capabilities of our approach on small benchmarks and a mini application

    Polly's Polyhedral Scheduling in the Presence of Reductions

    Full text link
    The polyhedral model provides a powerful mathematical abstraction to enable effective optimization of loop nests with respect to a given optimization goal, e.g., exploiting parallelism. Unexploited reduction properties are a frequent reason for polyhedral optimizers to assume parallelism prohibiting dependences. To our knowledge, no polyhedral loop optimizer available in any production compiler provides support for reductions. In this paper, we show that leveraging the parallelism of reductions can lead to a significant performance increase. We give a precise, dependence based, definition of reductions and discuss ways to extend polyhedral optimization to exploit the associativity and commutativity of reduction computations. We have implemented a reduction-enabled scheduling approach in the Polly polyhedral optimizer and evaluate it on the standard Polybench 3.2 benchmark suite. We were able to detect and model all 52 arithmetic reductions and achieve speedups up to 2.21Ă—\times on a quad core machine by exploiting the multidimensional reduction in the BiCG benchmark.Comment: Presented at the IMPACT15 worksho

    Polyhedral-based dynamic loop pipelining for high-level synthesis

    Get PDF
    Loop pipelining is one of the most important optimization methods in high-level synthesis (HLS) for increasing loop parallelism. There has been considerable work on improving loop pipelining, which mainly focuses on optimizing static operation scheduling and parallel memory accesses. Nonetheless, when loops contain complex memory dependencies, current techniques cannot generate high performance pipelines. In this paper, we extend the capability of loop pipelining in HLS to handle loops with uncertain dependencies (i.e., parameterized by an undetermined variable) and/or nonuniform dependencies (i.e., varying between loop iterations). Our optimization allows a pipeline to be statically scheduled without the aforementioned memory dependencies, but an associated controller will change the execution speed of loop iterations at runtime. This allows the augmented pipeline to process each loop iteration as fast as possible without violating memory dependencies. We use a parametric polyhedral analysis to generate the control logic for when to safely run all loop iterations in the pipeline and when to break the pipeline execution to resolve memory conflicts. Our techniques have been prototyped in an automated source-to-source code transformation framework, with Xilinx Vivado HLS, a leading HLS tool, as the RTL generation backend. Over a suite of benchmarks, experiments show that our optimization can implement optimized pipelines at almost the same clock speed as without our transformations, running approximately 3.7-10Ă— faster, with a reasonable resource overhead

    Polyhedral+Dataflow Graphs

    Get PDF
    This research presents an intermediate compiler representation that is designed for optimization, and emphasizes the temporary storage requirements and execution schedule of a given computation to guide optimization decisions. The representation is expressed as a dataflow graph that describes computational statements and data mappings within the polyhedral compilation model. The targeted applications include both the regular and irregular scientific domains. The intermediate representation can be integrated into existing compiler infrastructures. A specification language implemented as a domain specific language in C++ describes the graph components and the transformations that can be applied. The visual representation allows users to reason about optimizations. Graph variants can be translated into source code or other representation. The language, intermediate representation, and associated transformations have been applied to improve the performance of differential equation solvers, or sparse matrix operations, tensor decomposition, and structured multigrid methods

    Effective Cache Apportioning for Performance Isolation Under Compiler Guidance

    Full text link
    With a growing number of cores in modern high-performance servers, effective sharing of the last level cache (LLC) is more critical than ever. The primary agenda of such systems is to maximize performance by efficiently supporting multi-tenancy of diverse workloads. However, this could be particularly challenging to achieve in practice, because modern workloads exhibit dynamic phase behaviour, which causes their cache requirements & sensitivities to vary at finer granularities during execution. Unfortunately, existing systems are oblivious to the application phase behavior, and are unable to detect and react quickly enough to these rapidly changing cache requirements, often incurring significant performance degradation. In this paper, we propose Com-CAS, a new apportioning system that provides dynamic cache allocations for co-executing applications. Com-CAS differs from the existing cache partitioning systems by adapting to the dynamic cache requirements of applications just-in-time, as opposed to reacting, without any hardware modifications. The front-end of Com-CAS consists of compiler-analysis equipped with machine learning mechanisms to predict cache requirements, while the back-end consists of proactive scheduler that dynamically apportions LLC amongst co-executing applications leveraging Intel Cache Allocation Technology (CAT). Com-CAS's partitioning scheme utilizes the compiler-generated information across finer granularities to predict the rapidly changing dynamic application behaviors, while simultaneously maintaining data locality. Our experiments show that Com-CAS improves average weighted throughput by 15% over unpartitioned cache system, and outperforms state-of-the-art partitioning system KPart by 20%, while maintaining the worst individual application completion time degradation to meet various Service-Level Agreement (SLA) requirements

    Transformations of High-Level Synthesis Codes for High-Performance Computing

    Full text link
    Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increases data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS
    • …
    corecore