322 research outputs found

    Guided rewriting and constraint satisfaction for parallel GPU code generation

    Get PDF
    Graphics Processing Units (GPUs) are notoriously hard to optimise for manually due to their scheduling and memory hierarchies. What is needed are good automatic code generators and optimisers for such parallel hardware. Functional approaches such as Accelerate, Futhark and LIFT leverage a high-level algorithmic Intermediate Representation (IR) to expose parallelism and abstract the implementation details away from the user. However, producing efficient code for a given accelerator remains challenging. Existing code generators depend on the user input to choose a subset of hard-coded optimizations or automated exploration of implementation search space. The former suffers from the lack of extensibility, while the latter is too costly due to the size of the search space. A hybrid approach is needed, where a space of valid implementations is built automatically and explored with the aid of human expertise. This thesis presents a solution combining user-guided rewriting and automatically generated constraints to produce high-performance code. The first contribution is an automatic tuning technique to find a balance between performance and memory consumption. Leveraging its functional patterns, the LIFT compiler is empowered to infer tuning constraints and limit the search to valid tuning combinations only. Next, the thesis reframes parallelisation as a constraint satisfaction problem. Parallelisation constraints are extracted automatically from the input expression, and a solver is used to identify valid rewriting. The constraints truncate the search space to valid parallel mappings only by capturing the scheduling restrictions of the GPU in the context of a given program. A synchronisation barrier insertion technique is proposed to prevent data races and improve the efficiency of the generated parallel mappings. The final contribution of this thesis is the guided rewriting method, where the user encodes a design space of structural transformations using high-level IR nodes called rewrite points. These strongly typed pragmas express macro rewrites and expose design choices as explorable parameters. The thesis proposes a small set of reusable rewrite points to achieve tiling, cache locality, data reuse and memory optimisation. A comparison with the vendor-provided handwritten kernel ARM Compute Library and the TVM code generator demonstrates the effectiveness of this thesis' contributions. With convolution as a use case, LIFT-generated direct and GEMM-based convolution implementations are shown to perform on par with the state-of-the-art solutions on a mobile GPU. Overall, this thesis demonstrates that a functional IR yields well to user-guided and automatic rewriting for high-performance code generation

    Benchmarking the continuous behavior of stream processing techniques on the problem of enumerating weakly connected components

    Get PDF
    The problem of enumerating subgraphs satisfying some properties (e.g., patterns and mo- tifs) corresponds to listing such subgraphs. In general, because of the large/huge graphs re- lating objects in different domains (biology, mobile networks, social networks, genetics, etc.), waiting until achieving the whole enumeration of subgraphs satisfying some properties can be too expensive and unmanageable for some families of problems. Thus, the possibility of incrementally enumerating these subgraphs (i.e., a subgraph is returned as soon as they are identified) rises as a promising approach for some families of problems (e.g., network analysis and motif identification). In effect, following a search approach over the listed subgraphs allows stopping the enumeration when it is convenient. The implementation of incremental enumer- ation solutions suits well to stream processing paradigms and frameworks. In this work, as a proof of concept, we conduct a comparative study of the continuous behavior of two stream processing-based solutions for the problem of incrementally enumerating weakly connected components of (bounded) undirected graphs. One of these solutions is implemented using the Go programming language following the Dynamic Pipeline Paradigm, and the other solution is implemented on top of the DataSet API provided by the Apache Flink Framework. Finally, we analyze and report experiments against large graphs having up to 2, 746 connected compo- nents. Obtained results satisfy our expectations. To assess the incremental delivery of results, we measure the the diefficiency metrics, i.e., the continuous efficiency of the implementation of the algorithm for generating incremental results.El problema de l'enumeració dels subgrafs que satisfan determinades propietats (per exem- ple patrons i disenys) correspon a la mostra d'aquests subgrafs. En general, a causa dels grafs grans/enormes que relacionen objectes en diferents dominis (biologia, xarxes mòbils, xarxes socials, genètica, etc.), esperar fins a aconseguir l'enumeració sencera dels subgrafs que satisfacin certes propietats pot ser massa costós i immanejable per a algunes famílies de problemes. Per tant, la possibilitat d'enumerar incrementalment aquests subgrafs (es dir, enumerar els com- ponents a mesura que es calculen o s'identifiquen) sorgeix com una alternativa prometedora per a algunes famílies de problemes (per exemple, en l'an'alisi de xarxes i en l'identificació de patrons). En efecte, seguir un enfocament de cerca sobre els subgrafs enumerats permet aturar l'enumeració quan sigui convenient. La implementacióde solucions d'enumeració incremental s'adapta bé als frameworks i paradigmes de processament de fluxos de dades. En aquest tre- ball, com a prova de concepte, realitzem un estudi comparatiu del comportament continu de dues solucions basades en el processament de fluxos de dades per al problema de l'enumeració incremental de components d'ebilment connexes de grafs (acotats) no dirigits. Una d'aques- tes solucions s'implementa utilitzant el llenguatge de programació Go seguint el Paradigma de Dynamic Pipeline, i l'altra solució s'implementa sobre l'API de DataSet proporcionada per l'Apache Flink Framework. Finalment, analitzem i informem sobre experiments realitzats amb grafs grans que contenen fins a 2746 components connexes. Els resultats obtinguts satisfan les nostres expectatives. Per avaluar el lliurament incremental de resultats, mesurem les mètriques de dieficiència, es a dir, l'eficiència continua de la implementació de l'algorisme per generar resultats incrementals

    Algorithms for Triangles, Cones & Peaks

    Get PDF
    Three different geometric objects are at the center of this dissertation: triangles, cones and peaks. In computational geometry, triangles are the most basic shape for planar subdivisions. Particularly, Delaunay triangulations are a widely used for manifold applications in engineering, geographic information systems, telecommunication networks, etc. We present two novel parallel algorithms to construct the Delaunay triangulation of a given point set. Yao graphs are geometric spanners that connect each point of a given set to its nearest neighbor in each of kk cones drawn around it. They are used to aid the construction of Euclidean minimum spanning trees or in wireless networks for topology control and routing. We present the first implementation of an optimal O(nlogn)\mathcal{O}(n \log n)-time sweepline algorithm to construct Yao graphs. One metric to quantify the importance of a mountain peak is its isolation. Isolation measures the distance between a peak and the closest point of higher elevation. Computing this metric from high-resolution digital elevation models (DEMs) requires efficient algorithms. We present a novel sweep-plane algorithm that can calculate the isolation of all peaks on Earth in mere minutes

    Automated cache optimisations of stencil computations for partial differential equations

    Get PDF
    This thesis focuses on numerical methods that solve partial differential equations. Our focal point is the finite difference method, which solves partial differential equations by approximating derivatives with explicit finite differences. These partial differential equation solvers consist of stencil computations on structured grids. Stencils for computing real-world practical applications are patterns often characterised by many memory accesses and non-trivial arithmetic expressions that lead to high computational costs compared to simple stencils used in much prior proof-of-concept work. In addition, the loop nests to express stencils on structured grids may often be complicated. This work is highly motivated by a specific domain of stencil computations where one of the challenges is non-aligned to the structured grid ("off-the-grid") operations. These operations update neighbouring grid points through scatter and gather operations via non-affine memory accesses, such as {A[B[i]]}. In addition to this challenge, these practical stencils often include many computation fields (need to store multiple grid copies), complex data dependencies and imperfect loop nests. In this work, we aim to increase the performance of stencil kernel execution. We study automated cache-memory-dependent optimisations for stencil computations. This work consists of two core parts with their respective contributions.The first part of our work tries to reduce the data movement in stencil computations of practical interest. Data movement is a dominant factor affecting the performance of high-performance computing applications. It has long been a target of optimisations due to its impact on execution time and energy consumption. This thesis tries to relieve this cost by applying temporal blocking optimisations, also known as time-tiling, to stencil computations. Temporal blocking is a well-known technique to enhance data reuse in stencil computations. However, it is rarely used in practical applications but rather in theoretical examples to prove its efficacy. Applying temporal blocking to scientific simulations is more complex. More specifically, in this work, we focus on the application context of seismic and medical imaging. In this area, we often encounter scatter and gather operations due to signal sources and receivers at arbitrary locations in the computational domain. These operations make the application of temporal blocking challenging. We present an approach to overcome this challenge and successfully apply temporal blocking.In the second part of our work, we extend the first part as an automated approach targeting a wide range of simulations modelled with partial differential equations. Since temporal blocking is error-prone, tedious to apply by hand and highly complex to assimilate theoretically and practically, we are motivated to automate its application and automatically generate code that benefits from it. We discuss algorithmic approaches and present a generalised compiler pipeline to automate the application of temporal blocking. These passes are written in the Devito compiler. They are used to accelerate the computation of stencil kernels in areas such as seismic and medical imaging, computational fluid dynamics and machine learning. \href{www.devitoproject.org}{Devito} is a Python package to implement optimised stencil computation (e.g., finite differences, image processing, machine learning) from high-level symbolic problem definitions. Devito builds on \href{www.sympy.org}{SymPy} and employs automated code generation and just-in-time compilation to execute optimised computational kernels on several computer platforms, including CPUs, GPUs, and clusters thereof. We show how we automate temporal blocking code generation without user intervention and often achieve better time-to-solution. We enable domain-specific optimisation through compiler passes and offer temporal blocking gains from a high-level symbolic abstraction. These automated optimisations benefit various computational kernels for solving real-world application problems.Open Acces

    LIPIcs, Volume 274, ESA 2023, Complete Volume

    Get PDF
    LIPIcs, Volume 274, ESA 2023, Complete Volum

    2023 Astrophotonics Roadmap: pathways to realizing multi-functional integrated astrophotonic instruments

    Get PDF
    This is the final version. Available on open access from IOP Publishing via the DOI in this recordData availability statement: The data that support the findings of this study are available upon reasonable request from the authors.Photonic technologies offer numerous functionalities that can be used to realize astrophotonic instruments. The most spectacular example to date is the ESO Gravity instrument at the Very Large Telescope in Chile that combines the light-gathering power of four 8 m telescopes through a complex photonic interferometer. Fully integrated astrophotonic devices stand to offer critical advantages for instrument development, including extreme miniaturization when operating at the diffraction-limit, as well as integration, superior thermal and mechanical stabilization owing to the small footprint, and high replicability offering significant cost savings. Numerous astrophotonic technologies have been developed to address shortcomings of conventional instruments to date, including for example the development of photonic lanterns to convert from multimode inputs to single mode outputs, complex aperiodic fiber Bragg gratings to filter OH emission from the atmosphere, complex beam combiners to enable long baseline interferometry with for example, ESO Gravity, and laser frequency combs for high precision spectral calibration of spectrometers. Despite these successes, the facility implementation of photonic solutions in astronomical instrumentation is currently limited because of (1) low throughputs from coupling to fibers, coupling fibers to chips, propagation and bend losses, device losses, etc, (2) difficulties with scaling to large channel count devices needed for large bandwidths and high resolutions, and (3) efficient integration of photonics with detectors, to name a few. In this roadmap, we identify 24 key areas that need further development. We outline the challenges and advances needed across those areas covering design tools, simulation capabilities, fabrication processes, the need for entirely new components, integration and hybridization and the characterization of devices. To realize these advances the astrophotonics community will have to work cooperatively with industrial partners who have more advanced manufacturing capabilities. With the advances described herein, multi-functional integrated instruments will be realized leading to novel observing capabilities for both ground and space based platforms, enabling new scientific studies and discoveries.National Science Foundation (NSF)NAS

    Task-based Runtime Optimizations Towards High Performance Computing Applications

    Get PDF
    The last decades have witnessed a rapid improvement of computational capabilities in high-performance computing (HPC) platforms thanks to hardware technology scaling. HPC architectures benefit from mainstream advances on the hardware with many-core systems, deep hierarchical memory subsystem, non-uniform memory access, and an ever-increasing gap between computational power and memory bandwidth. This has necessitated continuous adaptations across the software stack to maintain high hardware utilization. In this HPC landscape of potentially million-way parallelism, task-based programming models associated with dynamic runtime systems are becoming more popular, which fosters developers’ productivity at extreme scale by abstracting the underlying hardware complexity. In this context, this dissertation highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by HPC applications., i.e., data redistribution, geospatial modeling and 3D unstructured mesh deformation here. Data redistribution aims to reshuffle data to optimize some objective for an algorithm, whose objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing the efficiency and therefore reducing the time-to-solution for the algorithm. Geostatistical modeling, one of the prime motivating applications for exascale computing, is a technique for predicting desired quantities from geographically distributed data, based on statistical models and optimization of parameters. Meshing the deformable contour of moving 3D bodies is an expensive operation that can cause huge computational challenges in fluid-structure interaction (FSI) applications. Therefore, in this dissertation, Redistribute-PaRSEC, ExaGeoStat-PaRSEC and HiCMA-PaRSEC are proposed to efficiently tackle these HPC applications respectively at extreme scale, and they are evaluated on multiple HPC clusters, including AMD-based, Intel-based, Arm-based CPU systems and IBM-based multi-GPU system. This multidisciplinary work emphasizes the need for runtime systems to go beyond their primary responsibility of task scheduling on massively parallel hardware system for servicing the next-generation scientific applications

    Practical synthesis from real-world oracles

    Get PDF
    As software systems become increasingly heterogeneous, the ability of compilers to reason about an entire system has decreased. When components of a system are not implemented as traditional programs, but rather as specialised hardware, optimised architecture-specific libraries, or network services, the compiler is unable to cross these abstraction barriers and analyse the system as a whole. If these components could be modelled or understood as programs, then the compiler would be able to reason about their behaviour without concern for their internal implementation details: a homogeneous view of the entire system would be afforded. However, it is not often the case that such components ever corresponded to an original program. This means that to facilitate this homogenenous analysis, programmatic models of component behaviour must be learned or constructed automatically. Constructing these models is an inductive program synthesis problem, albeit a challenging one that is largely beyond the ability of existing implementations. In order for the problem to be made tractable, information provided by the underlying context (i.e. the real component behaviour to be matched) must be integrated. This thesis presents three program synthesis approaches that integrate contextual information to synthesise programmatic models for real, existing components. The first, Annote, exploits informally-encoded information about a component's interface (e.g. from documentation) by weaving that information into an extended type-and-attribute system for component interfaces. The second, Presyn, learns a pair of cooperating probabilistic models from prior syntheses, that aim to predict likely program structure based on a component's interface. Finally, Haze uses observations of common side-effects of component executions to bias the search for programs. These approaches are each evaluated against comparable synthesisers from the literature, on a set of benchmark problems derived from real components. Learning models for component behaviour is only a partial solution; the compiler must also have some mechanism to use those models for program analysis and transformation. This thesis additionally proposes a novel mechanism for context-sensitive automatic API migration based on synthesised programmatic models, and evaluates the effectiveness of doing so on real application code. In summary, this thesis proposes a new framing for program synthesis problems that target the behaviour of real components, and demonstrates three different potential approaches to synthesis in this spirit. The success of these approaches is evaluated against implementations from the literature, and their results used to drive a novel API migration technique

    Computer Aided Verification

    Get PDF
    This open access two-volume set LNCS 13371 and 13372 constitutes the refereed proceedings of the 34rd International Conference on Computer Aided Verification, CAV 2022, which was held in Haifa, Israel, in August 2022. The 40 full papers presented together with 9 tool papers and 2 case studies were carefully reviewed and selected from 209 submissions. The papers were organized in the following topical sections: Part I: Invited papers; formal methods for probabilistic programs; formal methods for neural networks; software Verification and model checking; hyperproperties and security; formal methods for hardware, cyber-physical, and hybrid systems. Part II: Probabilistic techniques; automata and logic; deductive verification and decision procedures; machine learning; synthesis and concurrency. This is an open access book

    Evolutionary computation for software testing

    Get PDF
    A variety of products undergo a transformation from a pure mechanical design to more and more software and electronic components. A polarized example are watches. Several decades ago they have been purely mechanical. Modern smart watches are almost completely electronic devices which heavily rely on software. Further, a smart watch offers a lot more features than just the information about the current time. This change had a crucial impact on how software is being developed. A first attempt to control the rising complexity was to move to agile development practices such as extreme programming or scrum. This rise in complexity is not only affecting the development process but also quality assurance and software testing. If a product contains more and more features then this leads to a higher number of tests necessary to ensure quality standards. Furthermore agile development practices work in an iterative manner which leads to repetitive testing that puts more effort on the testing team. We aimed within the thesis to ease the pain of testing. Thereby we examined a series of subproblems that arise. A key complexity is the number of test cases. We intended to reduce the number of test cases before they are executed manually or implemented as automated tests. Thereby we examined the test specification and based on the requirements coverage of the individual tests, we were able to identify redundant tests. We relied on a novel metaheuristic called GCAIS which we improved upon iteratively. Another task is to control the remaining complexity. Testing is often time crucial and an appropriate subset of the available tests must be chosen in order to get a quick insight into the status of the device under test. We examined this challenge in two different testing scenarios. The first scenario is located in semi-automated testing where engineers execute a set of automated tests locally and closely observe the behaviour of the system under test. We extended GCAIS to compute test suites that satisfy different criteria if provided with sufficient search time. The second use case is located in fully automated testing in a continuous integration (CI) setting. CI focuses on frequent software build cycles which also include testing. These builds contain a testing stage which greatly emphasizes speed. Thus there we also have to compute crucial tests. However, due to the nature of the process we have to continuously recompute a test suite for each build as the software and maybe even the test cases at hand have changed. Hence it is hard to compute the test suite ahead of time and these tests have to be determined as part of the CI execution. Thus we switched to a computational lightweight learning classifier system (LCS) to prioritize and select test cases. We integrated a series of innovations we made into an LCS known as XCSF such as continuous priorities, experience replay and transfer learning. This enabled us to outperform a state of the art artificial neural network which is used by companies such as Netflix. We further investigated how LCS can be made faster using parallelism. We developed generic approaches which may run on any multicore computing device. This is of interest for our CI use case as the build server's architecture is unknown. However, the methods are also independent of the concrete LCS and are not linked to our testing problem. We identified that many of the challenges that need to be faced in the CI use case have been tackled by Organic Computing (OC), for example the need to adapt to an ever changing environment. Hence we relied on OC design principles to create a system architecture which wraps the LCS developed and integrates it into existing CI processes. The final system is robust and highly autonomous. A side-effect of the high degree of autonomy is a high level of automatization which fits CI well. We also gave insight on the usability and delivery of the full system to our industrial partner. Test engineers can easily integrate it with a few lines of code and need no knowledge about LCS and OC in order to use it. Another implication of the developed system is that OC's ideas and design principles can also be employed outside the field of embedded systems. This shows that OC has a greater level of generality. The process of testing and correcting found errors is still only partially automated. We make a first step into automating the entire process and thereby take an analogy to the concept of self-healing of OC. As a first proof of concept of this school of thought we take a look at touch interfaces. There we can automatically manipulate the software to fulfill the specified behaviour. Thus only a minimalistic amount of manual work is required
    corecore