273 research outputs found

    Optimisation of computational fluid dynamics applications on multicore and manycore architectures

    Get PDF
    This thesis presents a number of optimisations used for mapping the underlying computational patterns of finite volume CFD applications onto the architectural features of modern multicore and manycore processors. Their effectiveness and impact is demonstrated in a block-structured and an unstructured code of representative size to industrial applications and across a variety of processor architectures that make up contemporary high-performance computing systems. The importance of vectorization and the ways through which this can be achieved is demonstrated in both structured and unstructured solvers together with the impact that the underlying data layout can have on performance. The utility of auto-tuning for ensuring performance portability across multiple architectures is demonstrated and used for selecting optimal parameters such as prefetch distances for software prefetching or tile sizes for strip mining/loop tiling. On the manycore architectures, running more than one thread per physical core is found to be crucial for good performance on processors with in-order core designs but not required on out-of-order architectures. For architectures with high-bandwidth memory packages, their exploitation, whether explicitly or implicitly, is shown to be imperative for best performance. The implementation of all of these optimisations led to application speed-ups ranging between 2.7X and 3X on the multicore CPUs and 5.7X to 24X on the manycore processors.Open Acces

    IMPROVING MULTIBANK MEMORY ACCESS PARALLELISM ON SIMT ARCHITECTURES

    Get PDF
    Memory mapping has traditionally been an important optimization problem for high-performance parallel systems. Today, these issues are increasingly affecting a much wider range of platforms. Several techniques have been presented to solve bank conflicts and reduce memory access latency but none of them turns out to be generally applicable to different application contexts. One of the ambitious goals of this Thesis is to contribute to modelling the problem of the memory mapping in order to find an approach that generalizes on existing conflict-avoiding techniques, supporting a systematic exploration of feasible mapping schemes

    Near-Memory Address Translation

    Full text link
    Memory and logic integration on the same chip is becoming increasingly cost effective, creating the opportunity to offload data-intensive functionality to processing units placed inside memory chips. The introduction of memory-side processing units (MPUs) into conventional systems faces virtual memory as the first big showstopper: without efficient hardware support for address translation MPUs have highly limited applicability. Unfortunately, conventional translation mechanisms fall short of providing fast translations as contemporary memories exceed the reach of TLBs, making expensive page walks common. In this paper, we are the first to show that the historically important flexibility to map any virtual page to any page frame is unnecessary in today's servers. We find that while limiting the associativity of the virtual-to-physical mapping incurs no penalty, it can break the translate-then-fetch serialization if combined with careful data placement in the MPU's memory, allowing for translation and data fetch to proceed independently and in parallel. We propose the Distributed Inverted Page Table (DIPTA), a near-memory structure in which the smallest memory partition keeps the translation information for its data share, ensuring that the translation completes together with the data fetch. DIPTA completely eliminates the performance overhead of translation, achieving speedups of up to 3.81x and 2.13x over conventional translation using 4KB and 1GB pages respectively.Comment: 15 pages, 9 figure

    Power, Performance, and Energy Management of Heterogeneous Architectures

    Get PDF
    abstract: Many core modern multiprocessor systems-on-chip offers tremendous power and performance optimization opportunities by tuning thousands of potential voltage, frequency and core configurations. Applications running on these architectures are becoming increasingly complex. As the basic building blocks, which make up the application, change during runtime, different configurations may become optimal with respect to power, performance or other metrics. Identifying the optimal configuration at runtime is a daunting task due to a large number of workloads and configurations. Therefore, there is a strong need to evaluate the metrics of interest as a function of the supported configurations. This thesis focuses on two different types of modern multiprocessor systems-on-chip (SoC): Mobile heterogeneous systems and tile based Intel Xeon Phi architecture. For mobile heterogeneous systems, this thesis presents a novel methodology that can accurately instrument different types of applications with specific performance monitoring calls. These calls provide a rich set of performance statistics at a basic block level while the application runs on the target platform. The target architecture used for this work (Odroid XU3) is capable of running at 4940 different frequency and core combinations. With the help of instrumented application vast amount of characterization data is collected that provides details about performance, power and CPU state at every instrumented basic block across 19 different types of applications. The vast amount of data collected has enabled two runtime schemes. The first work provides a methodology to find optimal configurations in heterogeneous architecture using classifiers and demonstrates an average increase of 93%, 81% and 6% in performance per watt compared to the interactive, ondemand and powersave governors, respectively. The second work using same data shows a novel imitation learning framework for dynamically controlling the type, number, and the frequencies of active cores to achieve an average of 109% PPW improvement compared to the default governors. This work also presents how to accurately profile tile based Intel Xeon Phi architecture while training different types of neural networks using open image dataset on deep learning framework. The data collected allows deep exploratory analysis. It also showcases how different hardware parameters affect performance of Xeon Phi.Dissertation/ThesisMasters Thesis Engineering 201

    Automated CNN pipeline generation for heterogeneous architectures

    Get PDF
    Heterogeneity is a vital feature in emerging processor chip designing. Asymmetric multicore-clusters such as high-performance cluster and power efficient cluster are common in modern edge devices. One example is Intel\u27s Alder Lake featuring Golden Cove high-performance cores and Gracemont power-efficient cores. Chiplet-based technology allows organization of multi cores in form of multi-chip-modules, thus housing large number of cores in a processor. Interposer based packaging has enabled embedding High Bandwidth Memory (HBM) on chip and reduced transmission latency and energy consumption of chiplet-chiplet interconnect.\ua0For Instance Intel\u27s XeHPC Ponte Vecchio package integrates multi-chip GPU organization along with HBM modules.Since new devices feature heterogeneity at the level of cores, memory and on-chip interconnect, it has become important to steer optimization at application level in order to leverage the new heterogeneous, high-performing and power-efficient features of underlying computing platforms. An important high-performance application paradigm is Convolution Neural Networks (CNN). CNNs are widely used in many practical applications. The pipelined parallel implementation of CNN is favored for inference on edge devices. In this Licentiate thesis we present a novel scheme for automatic scheduling of CNN pipelines on heterogeneous devices. A pipeline schedule is a configuration that provides information on depth of pipeline, grouping of CNN layers into pipeline stages and mapping of pipeline stages onto computing units. We utilize simple compile-time hints which consists of workload information of individual CNN layers and performance hints of computing units.The proposed approach provides near optimal solution for a throughput maximizing pipeline. We model the problem as a design space exploration technique. We developed a time-efficient design space navigation through heuristics extracted from the knowledge of CNN structure and underlying computing platform. The proposed search scheme converges faster and utilizes real-time performance measurements as fitness values. The results demonstrate that the proposed scheme converges faster and can scale when used with larger networks and computing platforms. Since the scheme utilizes online performance measurements, one of the challenges is to avoid expensive configurations during online tuning. The results demonstrate that on average, ~80\% of the tested configurations are sub-optimal solutions.Another challenge is to reduce convergence time. The experiments show that proposed approach is 35x faster than stochastic optimization algorithms. Since the design space is large and complex, We show that the proposed scheme explores only ~0.1% of the total design space in case of large CNNs (having 50+ layers) and results in near-optimal solution
    • …
    corecore