242 research outputs found

    Finding Sources of Synchronization-free Slices in Perfectly Nested Loops

    Algorithms for finding the sources of synchronization-free slices in perfectly nested uniform and non-uniform loops are presented. The extracted sources are used to create synchronization-free slices that can be executed independently while preserving the lexicographic order of iterations within each slice. The approach requires exact dependence analysis and is based on operations on relations and sets. To describe and implement the algorithms, the dependence analysis of Pugh and Wonnacott was chosen, in which dependences are expressed as tuple relations. The proposed algorithms have been implemented and verified using the Omega project software.

    Extracting Synchronization-free Slices in Perfectly Nested Loops

    An algorithm is presented that extracts the iterations belonging to synchronization-free slices and generates code enumerating the sources of such slices and the iterations of each slice in lexicographic order. Synchronization-free slices can be executed independently while preserving the lexicographic order of iterations within each slice. The approach requires exact dependence analysis and is based on operations on relations and sets. To describe and implement the algorithm, the dependence analysis of Pugh and Wonnacott was chosen, in which dependences are expressed as tuple relations. The proposed algorithm has been implemented and verified using the Omega project software. Limitations of Presburger arithmetic are discussed, experimental results are presented, and tasks for future research are outlined.
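As a rough illustration of the slice idea described in the two abstracts above, the following Python sketch works on a small, explicitly enumerated iteration space rather than on parametric tuple relations. It is not the Omega-based algorithm from the papers. The function name `sync_free_slices`, the example dependence (i, j) -> (i+1, j+1), and the choice of the lexicographically smallest iteration of each slice as its source are all assumptions made for the example.

```python
from collections import defaultdict


def sync_free_slices(iterations, dependences):
    """Group iterations into weakly connected components of the dependence graph.

    Returns a dict mapping each slice's source (assumed here to be its
    lexicographically smallest iteration) to the slice's iterations in
    lexicographic order.
    """
    parent = {it: it for it in iterations}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for src, dst in dependences:
        a, b = find(src), find(dst)
        if a != b:
            # Keep the lexicographically smaller root as representative.
            parent[max(a, b)] = min(a, b)

    groups = defaultdict(list)
    for it in iterations:
        groups[find(it)].append(it)
    return {min(members): sorted(members) for members in groups.values()}


# Hypothetical 4 x 4 nest in which iteration (i, j) feeds iteration (i+1, j+1).
N = 4
iterations = [(i, j) for i in range(1, N + 1) for j in range(1, N + 1)]
dependences = [((i, j), (i + 1, j + 1))
               for (i, j) in iterations if i < N and j < N]

slices = sync_free_slices(iterations, dependences)
for source in sorted(slices):
    print(source, slices[source])
```

No dependence crosses from one slice to another, so each slice could be handed to a separate thread without synchronization, provided the iterations inside a slice keep their lexicographic order.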

    Parallel Tiled Code Generation with Loop Permutation within Tiles

    An approach to generating tiled code with an arbitrary order of loops within tiles is presented. It is based on the transitive closure of the program dependence graph and is derived by combining the Polyhedral and Iteration Space Slicing frameworks. The approach is explained by means of a working example. Details of its implementation in the TRACO compiler are outlined. The performance gain of tiled programs due to loop permutation within tiles is illustrated on real-life programs from the NAS Parallel Benchmark suite. An analysis of the speed-up and scalability of parallel tiled code with loop permutation is presented.
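To make "loop permutation within tiles" concrete, here is a minimal Python sketch for a rectangular two-dimensional nest. It only shows how the visiting order changes. It does not check legality against dependences and does not reproduce the transitive-closure-based method from the abstract. The function `tiled_order` and its parameters are hypothetical names introduced for the example.

```python
def tiled_order(n, m, tile, intra_order="ij"):
    """Yield the iterations of an n x m loop nest tile by tile.

    intra_order selects the loop order inside each tile:
    "ij" keeps the original order, "ji" interchanges the two loops.
    """
    for ti in range(0, n, tile):            # loops over tiles
        for tj in range(0, m, tile):
            i_range = range(ti, min(ti + tile, n))
            j_range = range(tj, min(tj + tile, m))
            if intra_order == "ij":
                for i in i_range:
                    for j in j_range:
                        yield (i, j)
            elif intra_order == "ji":
                for j in j_range:
                    for i in i_range:
                        yield (i, j)
            else:
                raise ValueError("intra_order must be 'ij' or 'ji'")


original = list(tiled_order(4, 4, 2, "ij"))
permuted = list(tiled_order(4, 4, 2, "ji"))
print(original[:4])
print(permuted[:4])
```

Both orders visit the same set of iterations. Whether the permuted order is valid for a given program depends on its dependences, which is the question the approach in the abstract addresses.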

    TRACO: Source-to-Source Parallelizing Compiler

    The paper presents a source-to-source compiler, TRACO, for automatic extraction of both coarse- and fine-grained parallelism available in C/C++ loops. The parallelization techniques implemented in TRACO are based on the transitive closure of a relation describing all the dependences in a loop. Coarse- and fine-grained parallelism is represented with synchronization-free slices (space partitions) and a legal loop statement instance schedule (time partitions), respectively. TRACO also supports scalar and array variable privatization as well as parallel reduction. As output, TRACO produces compilable parallel OpenMP C/C++ and/or OpenACC C/C++ code. The effectiveness of TRACO, the efficiency of the parallel code it produces, and the time needed to produce that code are evaluated by means of the NAS Parallel Benchmark and Polyhedral Benchmark suites. These features of TRACO are compared with closely related compilers such as ICC, Pluto, Par4All, and Cetus. Future work is outlined.

    Transitive Closure of Infinite Graphs and its Applications

    Integer tuple relations can concisely summarize many types of information gathered from analysis of scientific codes. For example, they can be used to precisely describe which iterations of a statement are data dependent on which other iterations. It is generally not possible to represent these tuple relations by enumerating the related pairs of tuples. For example, it is impossible to enumerate the related pairs of tuples in the relation {[i] -> [i+2] | 1 <= i <= n-2}. Even when it is possible to enumerate the related pairs of tuples, such as for the relation {[i,j] -> [i',j'] | 1 <= i,j,i',j' <= 100}, it is often not practical to do so. We instead use a closed form description by specifying a predicate consisting of affine constraints on the related pairs of tuples. As we just saw, these affine constraints can be parameterized, so what we are really describing are infinite families of relations (or graphs). Many of our applications of tuple relations rely heavily on an operation called transitive closure. Computing the transitive closure of these "infinite graphs" is very different from the traditional problem of computing the transitive closure of a graph whose edges can be enumerated. For example, the transitive closure of the first relation above is the relation {[i] -> [i'] | exists beta s.t. i'-i = 2beta and 1 <= i <= i' <= n}. As we will prove, transitive closure is not computable in the general case. We have developed algorithms that produce exact results in most commonly occurring cases and produce upper or lower bounds (as necessary) in the other cases. This paper will describe our algorithms for computing transitive closure and some of its applications, such as determining which inter-processor synchronizations are redundant. (Also cross-referenced as UMIACS-TR-95-48)
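The example relation in the abstract above can be checked for one fixed value of n with a few lines of Python. This is only a sanity check by enumeration, which is exactly what the abstract says is impossible when n is left symbolic. It is not one of the paper's algorithms. The helper names are hypothetical. The sketch computes the positive closure (pairs connected by one or more steps), so it compares against a closed form with strict i < i'.

```python
def transitive_closure(pairs):
    """Positive transitive closure of a finite relation given as a set of pairs."""
    closure = set(pairs)
    while True:
        # Compose the current closure with itself to find longer paths.
        new_pairs = {(a, d)
                     for (a, b) in closure
                     for (c, d) in closure
                     if b == c}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs


def instance(n):
    """The relation {[i] -> [i+2] | 1 <= i <= n-2} for a fixed n."""
    return {(i, i + 2) for i in range(1, n - 1)}


def closed_form(n):
    """Pairs with i' - i = 2*beta for some beta >= 1 and 1 <= i < i' <= n."""
    return {(i, ip)
            for i in range(1, n + 1)
            for ip in range(i + 1, n + 1)
            if (ip - i) % 2 == 0}


n = 10
print(transitive_closure(instance(n)) == closed_form(n))
```

The closed form stays the same size on paper as n grows, whereas the enumerated set grows quadratically, which is why a symbolic description is preferred.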

    Dynamic Tile Free Scheduling for Code with Acyclic Inter-Tile Dependence Graphs

    Free scheduling is a task ordering technique under which instructions are executed as soon as their operands become available. Coarsening the grain of computations under the free schedule, by using groups of loop nest statement instances (tiles) in place of single statement instances, increases the locality of data accesses and reduces the number of synchronization events, and as a consequence improves program performance. The paper presents an approach to code generation that allows the free schedule for tiles of arbitrarily nested affine loops at run-time. The scope of applicability of the introduced algorithms is limited to tiled loop nests whose inter-tile dependence graphs are cycle-free. The approach is based on the Polyhedral Model. Results of experiments with the PolyBench benchmark suite, demonstrating significant tiled code speed-up, are discussed.
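The free schedule described above can be sketched for an explicitly given acyclic inter-tile dependence graph: each tile is assigned the earliest time step at which all of its predecessors have finished. The Python below is an illustrative static computation, not the run-time code generation from the paper. The function `free_schedule` and the 3 x 3 wavefront example are assumptions made for the example.

```python
from collections import defaultdict, deque


def free_schedule(tasks, edges):
    """Map each task to the earliest step at which all its predecessors are done.

    Raises ValueError if the dependence graph contains a cycle.
    """
    successors = defaultdict(list)
    pending = {t: 0 for t in tasks}          # number of unfinished predecessors
    for u, v in edges:
        successors[u].append(v)
        pending[v] += 1

    step = {t: 0 for t in tasks}
    ready = deque(t for t in tasks if pending[t] == 0)
    processed = 0
    while ready:
        u = ready.popleft()
        processed += 1
        for v in successors[u]:
            step[v] = max(step[v], step[u] + 1)
            pending[v] -= 1
            if pending[v] == 0:
                ready.append(v)

    if processed != len(tasks):
        raise ValueError("dependence graph has a cycle")
    return step


# Hypothetical 3 x 3 grid of tiles; tile (a, b) feeds (a+1, b) and (a, b+1).
tiles = [(a, b) for a in range(3) for b in range(3)]
edges = ([((a, b), (a + 1, b)) for (a, b) in tiles if a < 2] +
         [((a, b), (a, b + 1)) for (a, b) in tiles if b < 2])

schedule = free_schedule(tiles, edges)
for tile in tiles:
    print(tile, schedule[tile])
```

Tiles that share a time step have no dependences between them and could run in parallel. The cycle check mirrors the restriction stated in the abstract to cycle-free inter-tile dependence graphs.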

    Beyond shared memory loop parallelism in the polyhedral model

    2013 Spring. Includes bibliographical references. With the introduction of multi-core processors, motivated by power and energy concerns, parallel processing has become mainstream. Parallel programming is much more difficult than sequential programming because of its non-deterministic nature and the bugs that arise from non-determinacy. One solution is automatic parallelization, where it is entirely up to the compiler to efficiently parallelize sequential programs. However, automatic parallelization is very difficult, and only a handful of successful techniques are available, even after decades of research. Automatic parallelization for distributed memory architectures is even more problematic in that it requires explicit handling of data partitioning and communication. Since data must be partitioned among multiple nodes that do not share memory, the original memory allocation of sequential programs cannot be directly used. One of the main contributions of this dissertation is the development of techniques for generating distributed memory parallel code with parametric tiling. Our approach builds on important contributions to the polyhedral model, a mathematical framework for reasoning about program transformations. We show that many affine control programs can be uniformized with only simple techniques. Being able to assume uniform dependences significantly simplifies distributed memory code generation, and also enables parametric tiling. Our approach is implemented in the AlphaZ system, a system for prototyping analyses, transformations, and code generators in the polyhedral model. The key features of AlphaZ are memory re-allocation and explicit representation of reductions. We evaluate our approach on a collection of polyhedral kernels from the PolyBench suite, and show that our approach scales as well as PLuTo, a state-of-the-art shared memory automatic parallelizer using the polyhedral model.
Automatic parallelization is only one approach to dealing with the non-deterministic nature of parallel programming, one that leaves the difficulty entirely to the compiler. Another approach is to develop novel parallel programming languages. These languages, such as X10, aim to provide a highly productive parallel programming environment by including parallelism in the language design. However, even in these languages, parallel bugs remain an important issue that hinders programmer productivity. Another contribution of this dissertation is to extend array dataflow analysis to handle a subset of X10 programs. We apply the result of dataflow analysis to statically guarantee determinism. Providing static guarantees can significantly increase programmer productivity by catching questionable implementations at compile-time, or even while programming.

    Adaptive streaming applications : analysis and implementation models

    This thesis presents a highly automated design framework, called DaedalusRT, and several novel techniques. As the foundation of the DaedalusRT design framework, two types of dataflow Models-of-Computation (MoC) are used, one as the timing analysis model and the other as the implementation model. The timing analysis model is used to formally reason about the timing behavior of an application. In the context of DaedalusRT, the Mode-Aware Data Flow (MADF) MoC has been developed as the timing analysis model for adaptive streaming applications using different static modes. A novel mode transition protocol is devised to allow efficient reasoning about timing behavior during mode transitions. Based on the transition protocol, a hard real-time scheduling approach is proposed. The implementation model, in turn, is used for efficient code generation of parallel computation, communication, and synchronization. In this thesis, the Parameterized Polyhedral Process Network (P3N) MoC has been developed to model adaptive streaming applications with parameter reconfiguration. An approach to verifying the functional property of the P3N MoC has been devised. Finally, an implementation of the P3N MoC on an MPSoC platform has shown that the run-time performance penalty due to parameter reconfiguration is negligible. Technology Foundation STW. Computer Systems, Imagery and Medi

    Engineering formal systems in constructive type theory

    This thesis presents a practical methodology for formalizing the meta-theory of formal systems with binders and coinductive relations in constructive type theory. While constructive type theory offers support for reasoning about formal systems built out of inductive definitions, support for syntax with binders and coinductive relations is lacking. We provide this support. We implement syntax with binders using well-scoped de Bruijn terms and parallel substitutions. We solve substitution lemmas automatically using the rewriting theory of the σ-calculus. We present the Autosubst library to automate our approach in the proof assistant Coq. Our approach to coinductive relations is based on an inductive tower construction, which is a type-theoretic form of transfinite induction. The tower construction allows us to reduce coinduction to induction. This leads to a symmetric treatment of induction and coinduction and allows us to give a novel construction of the companion of a monotone function on a complete lattice. We demonstrate our methods with a series of case studies. In particular, we present a proof of type preservation for CCω, a proof of weak and strong normalization for System F, a proof that systems of weakly guarded equations have unique solutions in CCS, and a compiler verification for a compiler from a non-deterministic language into a deterministic language. All technical results in the thesis are formalized in Coq.

    Code-Optimierung im Polyedermodell - Effizienzsteigerung von parallelen Schleifensätzen

    A safe basis for automatic loop parallelization is the polyhedron model, which represents the iteration domain of a loop nest as a polyhedron in $\mathbb{Z}^n$. However, turning the parallel loop program in the model into efficient code meets with several obstacles, due to which performance may deteriorate seriously, especially on distributed memory architectures. We introduce a fine-grained model of the computation performed and show how this model can be applied to create efficient code.