
    Design and Implementation of 2D Convolution on x86/x64 Processors

    Workflow Simulation Aware and Multi-threading Effective Task Scheduling for Heterogeneous Computing

    Efficient application scheduling is critical for achieving high performance in heterogeneous computing systems. The problem has been proved NP-complete, driving research efforts toward low-complexity heuristics that produce good-quality schedules. Although the problem has been studied extensively, all related works assume that the computation costs of application tasks on processors are available a priori, ignoring the fact that the time needed to run/simulate all these tasks is orders of magnitude higher than the time needed to find a good-quality schedule, especially in heterogeneous systems. In this paper, we propose two new methods applicable to several task scheduling algorithms for heterogeneous computing systems. We showcase both methods using the well-known and popular HEFT algorithm, but they are applicable to other algorithms too, such as HCPT, HPS, PETS and CPOP. First, we propose a methodology that reduces the scheduling time of HEFT when the computation costs are unknown, without sacrificing the length of the output schedule (for monotonic computation costs); this is achieved by reducing the number of computation costs required by HEFT and, as a consequence, the number of simulations performed. Second, we give heuristics that determine which tasks are executed as single-thread and which as multi-thread CPU implementations, as well as the number of threads used. Experimental results on both random graphs and real-world applications show that extending HEFT with the two proposed methods achieves better schedule lengths while requiring from 4.5 up to 24 times fewer simulations.
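    For readers unfamiliar with HEFT, the sketch below shows the core list-scheduling loop the abstract builds on: rank tasks by upward rank, then map each to the processor with the earliest finish time. This is a minimal, simplified illustration (uniform communication cost, no insertion-based policy); the task graph, cost table and variable names are hypothetical, and the paper's simulation-reducing heuristics are not shown.

```python
# Minimal HEFT-style list scheduler (illustrative sketch, not the paper's code).
# Tasks form a DAG; cost[t][p] is the a-priori computation cost of task t on
# processor p -- exactly the table whose simulation cost the paper reduces.
from functools import lru_cache

succ = {0: [1, 2], 1: [3], 2: [3], 3: []}                   # hypothetical task DAG
cost = {0: [14, 16], 1: [13, 19], 2: [11, 13], 3: [7, 17]}  # task x processor
comm = 5                                                    # uniform edge cost (simplified)
n_procs = 2

@lru_cache(maxsize=None)
def upward_rank(t):
    # Mean computation cost plus the most expensive path to the exit task.
    mean = sum(cost[t]) / n_procs
    tail = max((comm + upward_rank(s) for s in succ[t]), default=0.0)
    return mean + tail

pred = {t: [u for u in succ if t in succ[u]] for t in succ}
ready_time = [0.0] * n_procs        # per-processor availability
finish = {}                         # task -> (processor, finish time)

for t in sorted(succ, key=upward_rank, reverse=True):
    best = None
    for p in range(n_procs):
        # Earliest start: processor is free AND all predecessor data has arrived.
        data = max((finish[u][1] + (0 if finish[u][0] == p else comm)
                    for u in pred[t]), default=0.0)
        eft = max(ready_time[p], data) + cost[t][p]
        if best is None or eft < best[1]:
            best = (p, eft)
    finish[t] = best
    ready_time[best[0]] = best[1]

print(finish)   # schedule: task -> (processor, finish time)
```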

    A methodology correlating code optimizations with data memory accesses, execution time and energy consumption

    The proliferation of data and electronic devices puts software with low execution time and energy consumption in the spotlight. The key to optimizing software is the correct choice, ordering and parameterization of optimization transformations, which has remained an open problem in compilation research for decades, for several reasons. First, most transformations are interdependent, so addressing them separately is not effective. Second, it is very hard to couple the transformation parameters to the processor architecture (e.g., cache size) and the algorithm characteristics (e.g., data reuse); therefore, compiler designers and researchers either do not take them into account at all or do so only partly. Third, the exploration space, i.e., the set of all optimization configurations that have to be explored, is huge, and thus searching is impractical. In this paper, the above problems are addressed for data-dominant affine loop kernels, delivering significant contributions. A novel methodology is presented that reduces the exploration space of six code optimizations by many orders of magnitude. The objective can be execution time (ET), energy consumption (E) or the number of L1, L2 and main memory accesses. The exploration space is reduced in two phases: first, by applying a novel register blocking algorithm and a novel loop tiling algorithm, and second, by computing the maximum and minimum ET/E values for each optimization set. The proposed methodology has been evaluated on both embedded and general-purpose CPUs and on seven well-known algorithms, achieving high memory access, speedup and energy consumption gains (from 1.17 up to 40) over the gcc compiler, hand-written optimized code and Polly. The exploration space from which the near-optimum parameters are selected is reduced by 17 up to 30 orders of magnitude.
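    To make loop tiling, one of the six optimizations, concrete, here is a minimal sketch on a simple affine kernel. The kernel (a transpose) and the tile size are hypothetical; the point is that TILE is the kind of scheduling parameter whose legal range the methodology bounds analytically from cache size and data reuse instead of scanning empirically.

```python
# Loop tiling on a simple affine kernel (illustrative sketch).
N = 1024
TILE = 64      # hypothetical tile size; normally a searched parameter
a = [[float(i + j) for j in range(N)] for i in range(N)]
b = [[0.0] * N for _ in range(N)]

# Untiled transpose: the column-wise walk of `b` misses in cache for large N.
for i in range(N):
    for j in range(N):
        b[j][i] = a[i][j]

# Tiled transpose: both arrays are touched in TILE x TILE blocks that fit in
# cache, so each loaded cache line is fully reused before being evicted.
for ii in range(0, N, TILE):
    for jj in range(0, N, TILE):
        for i in range(ii, min(ii + TILE, N)):
            for j in range(jj, min(jj + TILE, N)):
                b[j][i] = a[i][j]
```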

    Evaluation of language runtimes in open-source serverless platforms

    Serverless computing is revolutionising cloud application development, as it offers the ability to create modular, highly scalable, fault-tolerant applications with minimal operational management. To contribute to the widespread adoption of serverless platforms, the design and performance of the language runtimes available in Function-as-a-Service (FaaS) platforms is key. This paper investigates the performance impact of language runtimes in open-source serverless platforms deployable on local clusters. A suite of experiments is developed and deployed on two selected platforms: OpenWhisk and Fission. The results show a clear distinction between compiled and dynamic languages in cold starts, but closely matched overall performance in warm starts. Comparisons with similar evaluations of commercial platforms reveal that warm start performance is competitive for certain languages, while cold starts lag behind by a wide margin. Overall, the evaluation yielded usable results regarding the preferable choice of language runtime for each platform.
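    A hedged sketch of the kind of measurement such an evaluation involves: timing repeated invocations of an HTTP-triggered function and separating the first (cold) call from subsequent (warm) ones. The endpoint URL is a placeholder, not from the paper, and a real benchmark must idle the function between runs to force genuine cold starts.

```python
# Rough cold/warm start timing for an HTTP-triggered FaaS function
# (illustrative sketch; the URL is a hypothetical deployed function).
import time
import urllib.request

URL = "http://127.0.0.1:8080/function/hello"

def invoke(url):
    start = time.perf_counter()
    with urllib.request.urlopen(url) as resp:
        resp.read()
    return (time.perf_counter() - start) * 1000.0   # latency in ms

latencies = [invoke(URL) for _ in range(21)]
cold, warm = latencies[0], latencies[1:]
print(f"cold start : {cold:8.1f} ms")
print(f"warm median: {sorted(warm)[len(warm) // 2]:8.1f} ms")
```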

    A Hierarchical Profiler of Intermediate Representation Code based on LLVM

    Profiling-based techniques have gained much attention in the computer architecture and software analysis communities. The goal is to rely on one or more profiling tools to identify specific code pieces of interest, e.g., code pieces that slow down a given application. The extracted code pieces can then be further modified and optimized. In general, profiling tools can be classified as deterministic, statistical, or based on hardware performance counters. A common characteristic of the available profiling tools is that they typically analyze, or even manipulate (in the case of binary instrumentation tools), machine-level code. This approach comes with two main drawbacks. First, a lot of information (even GBytes of data) needs to be gathered, stored, post-processed, and visualized. Second, the analysis of the gathered data is platform-specific, and it is not straightforward to categorize the given applications/program phases/kernels into distinct categories that have the same or almost the same behavior (e.g., the same percentage of computational vs. control instructions). The latter stems from the fact that even small changes in the source code of an application might lead to significantly different machine-code implementations. Therefore, even when two program kernels exhibit the same behavior (e.g., they have the same number of instructions, but in a different order), it is very difficult for a machine-code-level profiling tool to assess their similarity, simply because the generated machine-level code might differ significantly, resulting in many missed opportunities for the available profiling tools. To address this issue, in this paper we present a new profiling tool that operates on the machine-independent intermediate representation (IR) level. The profiler (still in the development phase) relies on the LLVM API and is able to hierarchically (at various levels of the call stack) and recursively parse the IR code and extract various useful statistics. We showcase the practicality of our profiler by analyzing a subset of the PolyBench benchmarks, assuming (as pointed out by a recent study) that there is a strong correlation of LLVM IR code.
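    The flavour of machine-independent statistic the profiler extracts can be approximated in a few lines of llvmlite, as sketched below. This is only a flat opcode histogram over a toy IR module; the paper's tool is hierarchical (call-stack aware) and built directly on the LLVM API, so treat this as a hedged approximation rather than the authors' implementation.

```python
# Flat opcode histogram over LLVM IR using llvmlite (illustrative sketch).
from collections import Counter
import llvmlite.binding as llvm

llvm.initialize()
llvm.initialize_native_target()
llvm.initialize_native_asmprinter()

IR = r"""
define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add i32 %a, %b
  ret i32 %sum
}
"""

mod = llvm.parse_assembly(IR)
mod.verify()

histogram = Counter()
for func in mod.functions:
    for block in func.blocks:
        for instr in block.instructions:
            histogram[instr.opcode] += 1   # machine-independent statistic

print(dict(histogram))   # e.g. {'add': 1, 'ret': 1}
```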

    SDN-Based Routing Framework for Elephant and Mice Flows Using Unsupervised Machine Learning

    Software-defined networks (SDNs) have the capability to control the movement of data flows through a network to achieve sufficient flow management and effective usage of network resources. Currently, most data center networks (DCNs) suffer from the exploitation of network resources by large, long-lived flows (elephant flows) that can enter the network at any time and affect smaller flows (mice flows). Therefore, it is crucial to identify such flows and find appropriate routing paths for them in order to improve the network management system. This work proposes an SDN application that finds the best path based on the type of flow, using network performance metrics. These metrics are used to characterize and identify flows as elephant or mice by utilizing unsupervised machine learning (ML) and a thresholding method. A routing algorithm is proposed to select the path based on the type of flow. A validation test was performed by testing the proposed framework on different DCN topologies and comparing the performance of an SDN-Ryu controller with that of the proposed framework based on three factors: throughput, bandwidth, and data transfer rate. The results show that, 70% of the time, the proposed framework delivers higher performance for different types of flows.
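    A hedged sketch of the classification step: clustering per-flow statistics with k-means and labelling the heavier cluster as elephant flows. The feature choice and sample values are hypothetical; the paper combines such clustering with a thresholding method and a routing algorithm that are not shown here.

```python
# Elephant/mice flow labelling with k-means (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans

# Per-flow statistics: [total bytes, duration in s, mean rate in B/s]
flows = np.array([
    [1.2e9, 40.0, 3.0e7],    # bulk transfer
    [8.0e8, 35.0, 2.3e7],    # bulk transfer
    [4.0e4,  0.2, 2.0e5],    # short request/response
    [1.5e4,  0.1, 1.5e5],    # short request/response
    [6.0e4,  0.3, 2.0e5],    # short request/response
])

# Standardize features so raw byte counts do not dominate the distances.
scaled = (flows - flows.mean(axis=0)) / flows.std(axis=0)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

# The cluster with the larger mean byte count is the elephant cluster.
sizes = [flows[km.labels_ == c, 0].mean() for c in (0, 1)]
elephant = int(np.argmax(sizes))
labels = ["elephant" if l == elephant else "mice" for l in km.labels_]
print(labels)
```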

    Towards an Energy-Aware Framework for Application Development and Execution in Heterogeneous Parallel Architectures

    The goal of the Transparent heterogeneous hardware Architecture deployment for eNergy Gain in Operation (TANGO) project is to characterise the factors which affect power consumption in software development and operation for Heterogeneous Parallel Architecture (HPA) environments. Its main contribution is the combination of requirements engineering and design modelling for self-adaptive software systems with power consumption awareness in relation to these environments. Energy efficiency and application quality factors are integrated into the application lifecycle (design, implementation and operation). To support this, the key novelty of the project is a reference architecture and its implementation. Moreover, a programming model with built-in support for various hardware architectures, including heterogeneous clusters, heterogeneous chips and programmable logic devices, is provided. This leads to a new cross-layer programming approach for heterogeneous parallel hardware architectures featuring software and hardware modelling. Application power consumption and performance, data location and time-criticality optimization, as well as security and dependability requirements on the target hardware architecture, are supported by the architecture.

    A high-performance matrix-matrix multiplication methodology for CPU and GPU architectures

    Current compilers cannot generate code that competes with hand-tuned code in efficiency, even for a simple kernel like matrix-matrix multiplication (MMM). A key step in program optimization is the estimation of optimal values for parameters such as tile sizes and the number of levels of tiling. Selecting the scheduling parameter values is a very difficult and time-consuming task, since the parameter values depend on each other; this is why they are usually found by searching methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem, and not separately. In this paper, an MMM methodology is presented in which the optimum scheduling parameters are found by theoretically decreasing the search space, while the major scheduling sub-problems are addressed together as one problem, and not separately, according to the hardware architecture parameters and the input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and the hardware architecture parameters (e.g., data cache sizes and associativities), giving high-quality solutions and a smaller search space. The methodology applies to a wide range of CPU and GPU architectures.
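    To make the scheduling parameters concrete, below is a hedged sketch of one level of loop tiling for MMM. The tile sizes Ti, Tj and Tk stand for the parameters the methodology derives from cache sizes and data reuse rather than by empirical search; the values and helper name here are hypothetical, not the paper's implementation.

```python
# One level of loop tiling for MMM (illustrative sketch). Ti, Tj, Tk model
# the scheduling parameters derived from the cache hierarchy and data reuse.
import numpy as np

def mmm_tiled(A, B, Ti=64, Tj=64, Tk=64):
    n, k = A.shape
    k2, m = B.shape
    assert k == k2
    C = np.zeros((n, m))
    for ii in range(0, n, Ti):
        for kk in range(0, k, Tk):        # kk outside jj: the A tile is reused
            for jj in range(0, m, Tj):
                # Each tile triple fits in cache, so loaded elements of A and
                # B are reused across the whole tile before being evicted.
                C[ii:ii+Ti, jj:jj+Tj] += A[ii:ii+Ti, kk:kk+Tk] @ B[kk:kk+Tk, jj:jj+Tj]
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(mmm_tiled(A, B), A @ B)   # matches the untiled product
```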