
    FLASH 1.0: A Software Framework for Rapid Parallel Deployment and Enhancing Host Code Portability in Heterogeneous Computing

    In this paper, we present FLASH 1.0, a C++-based software framework for rapid parallel deployment and enhanced host code portability in heterogeneous computing. FLASH takes a novel approach to describing kernels and dynamically dispatching them in a hardware-agnostic manner. FLASH features truly hardware-agnostic frontend interfaces, which not only unify the compile-time control flow but also enforce a portability-optimized code organization that imposes a demarcation between computational (performance-critical) and functional (non-performance-critical) code, as well as a separation of hardware-specific and hardware-agnostic code in the host application. We use static code analysis to measure the hardware-independence ratio of popular HPC applications and show that up to 99.72% code portability can be achieved with FLASH. Similarly, we measure the complexity of state-of-the-art portable programming models and show that a code reduction of up to 2.2x can be achieved for two common HPC kernels while maintaining 100% code portability, with a normalized framework overhead between 1% and 13% of the total kernel runtime. The code is available at https://github.com/PSCLab-ASU/FLASH.
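    The abstract does not spell out FLASH's interfaces, but the general pattern it describes, keeping host code hardware-agnostic by registering hardware-specific kernel implementations behind a common dispatch point, can be sketched in a few lines. The following Python sketch is purely illustrative; FLASH itself is a C++ framework, and every name below is hypothetical:

```python
# Illustrative only: a registry-based dispatch pattern in Python.
# FLASH is a C++ framework; all names here are hypothetical.
_KERNELS = {}  # (kernel_name, backend) -> implementation

def register_kernel(name, backend):
    """Decorator used by the hardware-specific side of the code."""
    def wrap(fn):
        _KERNELS[(name, backend)] = fn
        return fn
    return wrap

# --- hardware-specific code, kept apart from the host logic ---
@register_kernel("saxpy", "cpu")
def saxpy_cpu(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

@register_kernel("saxpy", "gpu")
def saxpy_gpu(a, x, y):
    # A real backend would launch a device kernel here.
    return [a * xi + yi for xi, yi in zip(x, y)]

# --- hardware-agnostic host code: names the kernel, never the device ---
def dispatch(name, *args, backend):
    return _KERNELS[(name, backend)](*args)

print(dispatch("saxpy", 2.0, [1.0, 2.0], [3.0, 4.0], backend="gpu"))
# -> [5.0, 8.0]
```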

    Empowering parallel computing with field programmable gate arrays

    After more than 30 years, reconfigurable computing has grown from a concept into a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array (FPGA), a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumann-like architectures opens the way to eliminate the instruction overhead and to optimize execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing, and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support, as well as better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper, we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements.

    Co-Evaluation of Pattern Matching Algorithms on IoT Devices with Embedded GPUs

    Pattern matching is an important building block for many security applications, including Network Intrusion Detection Systems (NIDS). As NIDS grow in functionality and complexity, the time overhead and energy consumption of pattern matching become a significant consideration that limits the deployability of such systems, especially on resource-constrained devices. On the other hand, the emergence of new computing platforms, such as embedded devices with integrated, general-purpose Graphics Processing Units (GPUs), brings new, interesting challenges and opportunities for algorithm design in this setting: how to make use of new architectural features and how to evaluate their effect on algorithm performance. Up to now, work that focuses on pattern matching for such platforms has been limited to specific algorithms in isolation. In this work, we present a systematic and comprehensive benchmark that allows us to co-evaluate both existing and new pattern matching algorithms on heterogeneous devices equipped with embedded GPUs, suitable for medium- to high-level IoT deployments. We evaluate the algorithms on such a heterogeneous device, in close connection with the architectural features of the platform, and provide insights on how these features affect the algorithms' behavior. We find that, on our target embedded platform, GPU-based pattern matching algorithms have competitive performance compared to the CPU and consume half as much energy as the CPU-based variants. Based on these insights, we also propose HYBRID, a new pattern matching approach that efficiently combines techniques from existing approaches and outperforms them by 1.4x across a range of realistic and synthetic data sets. Our benchmark details the effect of various optimizations, thus providing a path forward to make existing security mechanisms such as NIDS deployable on IoT devices.
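    As a point of reference for readers unfamiliar with the building block being benchmarked: the classic multi-pattern matching algorithm used in NIDS engines is Aho-Corasick. The sketch below is a minimal, single-threaded Python version of that baseline, not the paper's HYBRID approach or its GPU variants:

```python
from collections import deque

def build_aho_corasick(patterns):
    """Build the goto/fail/output tables for multi-pattern matching."""
    goto = [{}]        # goto[state][char] -> next state (trie edges)
    fail = [0]         # failure links
    out = [set()]      # patterns recognized when reaching each state

    for pat in patterns:                    # 1) build the trie
        s = 0
        for ch in pat:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(pat)

    queue = deque(goto[0].values())         # 2) BFS to set failure links
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            out[s] |= out[fail[s]]          # inherit matches via fail link
    return goto, fail, out

def search(text, goto, fail, out):
    """Scan text once, reporting (start_index, pattern) for every match."""
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for pat in sorted(out[s]):
            hits.append((i - len(pat) + 1, pat))
    return hits

tables = build_aho_corasick(["he", "she", "his", "hers"])
print(search("ushers", *tables))  # [(2, 'he'), (1, 'she'), (2, 'hers')]
```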

    The Linear Algebra Mapping Problem

    We observe a disconnect between the developers and the end users of linear algebra libraries. On the one hand, the numerical linear algebra and high-performance computing communities invest significant effort in the development and optimization of highly sophisticated numerical kernels and libraries, aiming at the maximum exploitation of both the properties of the input matrices and the architectural features of the target computing platform. On the other hand, end users are progressively less likely to go through the error-prone and time-consuming process of directly using said libraries by writing their code in C or Fortran; instead, languages and libraries such as Matlab, Julia, Eigen and Armadillo, which offer a higher level of abstraction, are becoming more and more popular. Users are given the opportunity to code matrix computations with a syntax that closely resembles the mathematical description; it is then a compiler or an interpreter that internally maps the input program to lower-level kernels, as provided by libraries such as BLAS and LAPACK. Unfortunately, our experience suggests that, in terms of performance, this translation is typically vastly suboptimal. In this paper, we first introduce the Linear Algebra Mapping Problem, and then investigate how effectively a benchmark of test problems is solved by popular high-level programming languages. Specifically, we consider Matlab, Octave, Julia, R, Armadillo (C++), Eigen (C++), and NumPy (Python); the benchmark is meant to test both standard compiler optimizations, such as common subexpression elimination and loop-invariant code motion, and linear algebra-specific optimizations, such as optimal parenthesization of a matrix product and kernel selection for matrices with properties. The aim of this study is to give concrete guidelines for the development of languages and libraries that support linear algebra computations.
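    A concrete instance of one optimization the benchmark tests, optimal parenthesization of a matrix product, can be reproduced directly in NumPy, one of the systems studied. NumPy's `@` operator associates left to right, so a matrix chain ending in a vector is far cheaper when reassociated by hand (the sizes below are illustrative):

```python
import time
import numpy as np

n = 2000
A = np.random.rand(n, n)
B = np.random.rand(n, n)
x = np.random.rand(n)

# NumPy evaluates chained products left to right, so A @ B @ x first
# forms the n-by-n product A @ B: O(n^3) work.
t0 = time.perf_counter()
y1 = A @ B @ x
t1 = time.perf_counter()

# Reassociating as A @ (B @ x) needs only two matrix-vector products:
# O(n^2) work, and the same result up to rounding.
y2 = A @ (B @ x)
t2 = time.perf_counter()

print(f"left-to-right: {t1 - t0:.3f}s, reassociated: {t2 - t1:.3f}s")
print("max abs diff:", np.max(np.abs(y1 - y2)))
```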

    Evaluating the graphics processing unit for digital audio synthesis and the development of HyperModels

    The extraordinary growth in computation in single processors for almost half a century is becoming increasingly difficult to maintain. Future computational growth is expected from parallel processors, as seen in the increasing number of tightly coupled processors inside the conventional modern heterogeneous system. The graphics processing unit (GPU) is a massively parallel processing unit that can be used to accelerate particular digital audio processes; however, digital audio developers are cautious of adopting the GPU into their designs to avoid any complications the GPU architecture may introduce. For example, linear systems simulated using finite-difference-based physical model synthesis are highly suited to the GPU, but developers will be reluctant to use it without a complete evaluation of the GPU for digital audio. Previously limited by computation, the audio landscape could see future advancement through a comprehensive evaluation of the GPU in digital audio and the development of a framework for accelerating particular audio processes. This thesis is separated into two parts. Part one evaluates the utility of the GPU as a hardware accelerator for digital audio processing using bespoke performance benchmarking suites. The results suggest that the GPU is appropriate under particular conditions; for example, the sample buffer size dispatched to the GPU must be between 32 and 512 to meet real-time digital audio requirements. However, despite some constraints, the GPU could support linear finite-difference-based physical models at 4x higher resolution than the equivalent CPU version. These results suggest that the GPU is superior to the CPU for high-resolution physical models. Therefore, the second part of this thesis presents the design of the novel HyperModels framework to facilitate the development of real-time linear physical models for interaction and performance. HyperModels uses vector graphics to describe a model's geometry and a domain-specific language (DSL) to define the physics equations that operate in the physical model. An implementation of the HyperModels framework is then objectively evaluated by comparing its performance with manually written CPU and GPU equivalent versions. The automatically generated GPU programs from HyperModels were shown to outperform the CPU versions for resolutions of 64x64 and above, whilst maintaining similar performance to the manually written GPU versions. To conclude part two, the expressibility and usability of HyperModels are demonstrated by presenting two instruments built using the framework.
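    To make the kind of workload concrete: a linear finite-difference physical model is essentially a stencil update applied to a grid once per audio sample, which is why it maps so well onto the GPU. The following minimal NumPy sketch of a 2D wave-equation membrane is illustrative only; it does not use HyperModels or its DSL, and the grid size and Courant number are arbitrary choices:

```python
import numpy as np

# Minimal 2D wave-equation membrane: the kind of linear
# finite-difference physical model discussed above.
N = 64                      # grid resolution (N x N points)
c2 = 0.25                   # squared Courant number; <= 0.5 keeps the 2D scheme stable
u_prev = np.zeros((N, N))   # displacement at time step n-1
u = np.zeros((N, N))        # displacement at time step n
u[N // 2, N // 2] = 1.0     # strike the membrane in the middle

def step(u, u_prev):
    """One leapfrog update; boundaries stay clamped at zero."""
    lap = (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
           - 4.0 * u[1:-1, 1:-1])                 # 5-point Laplacian
    u_next = np.zeros_like(u)
    u_next[1:-1, 1:-1] = (2.0 * u[1:-1, 1:-1] - u_prev[1:-1, 1:-1]
                          + c2 * lap)
    return u_next, u

for _ in range(1000):                             # run 1000 samples
    u, u_prev = step(u, u_prev)
print("output sample:", u[N // 2, N // 4])        # tap a listening point
```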

    Scalable, Efficient and Precise Natural Language Processing in the Semantic Web

    The Internet of Things (IoT) is an emerging phenomenon in the public space. Users with accessibility needs could especially benefit from these “smart” devices if they were able to interact with them through speech. This thesis presents a compositional semantics and a framework for developing extensible and expressive Natural Language Query Interfaces to the Semantic Web, addressing privacy and auditability needs in the process. This could be particularly useful in healthcare or legal applications, where confidentiality of information is a key concern.
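    The abstract does not detail the compositional semantics, but the overall shape of a natural language query interface to the Semantic Web, mapping a parsed question onto a SPARQL query over an ontology, can be hinted at with a deliberately simplified sketch; every name and URI below is hypothetical, and a real system would parse compositionally rather than by template:

```python
import re

# Deliberately simplified: one question template mapped to one SPARQL
# query shape. The ontology URIs and the parser are hypothetical.
SPARQL_TEMPLATE = """\
SELECT ?device WHERE {{
  ?device a <http://example.org/iot#{cls}> ;
          <http://example.org/iot#location> "{room}" .
}}"""

def question_to_sparql(question):
    # Recognize one question shape: "which <class>(s) are in the <room>?"
    m = re.match(r"which (\w+?)s? are in the (\w+)\?", question.lower())
    if not m:
        raise ValueError("unsupported question shape")
    return SPARQL_TEMPLATE.format(cls=m.group(1).capitalize(),
                                  room=m.group(2))

print(question_to_sparql("Which lamps are in the kitchen?"))
```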

    Hardware-Aware Algorithm Designs for Efficient Parallel and Distributed Processing

    The introduction and widespread adoption of the Internet of Things, together with emerging new industrial applications, bring new requirements in data processing. Specifically, the need for timely processing of data that arrives at high rates creates a challenge for the traditional cloud computing paradigm, where data collected at various sources is sent to the cloud for processing. As an approach to this challenge, processing algorithms and infrastructure are distributed from the cloud to multiple tiers of computing, closer to the sources of data. This creates a wide range of devices for algorithms to be deployed on and software designs to adapt to.

    In this thesis, we investigate how hardware-aware algorithm designs on a variety of platforms lead to algorithm implementations that efficiently utilize the underlying resources. We design, implement and evaluate new techniques for representative applications that involve the whole spectrum of devices, from resource-constrained sensors in the field to highly parallel servers. At each tier of processing capability, we identify key architectural features that are relevant for applications and propose designs that make use of these features to achieve high-rate, timely and energy-efficient processing.

    In the first part of the thesis, we focus on high-end servers and utilize two main approaches to achieve high-throughput processing: vectorization and thread parallelism. We employ vectorization for the case of pattern matching algorithms used in security applications. We show that re-thinking the design of algorithms to better utilize the resources available on the platforms they are deployed on, such as vector processing units, can bring significant speedups in processing throughput. We then show how thread-aware data distribution and proper inter-thread synchronization allow scalability, especially for the problem of high-rate network traffic monitoring. We design a parallelization scheme for sketch-based algorithms that summarize traffic information, which allows them to handle incoming data at high rates and to answer queries on that data efficiently, without overheads.

    In the second part of the thesis, we target the intermediate tier of computing devices and focus on typical examples of the hardware found there. We show how single-board computers with embedded accelerators can be used to handle the computationally heavy part of applications, and showcase this specifically for pattern matching in security-related processing. We further identify key hardware features that affect the performance of pattern matching algorithms on such devices, present a co-evaluation framework to compare algorithms, and design a new algorithm that efficiently utilizes the hardware features.

    In the last part of the thesis, we shift the focus to the low-power, resource-constrained tier of processing devices. We target wireless sensor networks and study distributed data processing algorithms where the processing happens on the same devices that generate the data. Specifically, we focus on a continuous monitoring algorithm (geometric monitoring) that aims to minimize communication between nodes. By deploying that algorithm in action, under realistic conditions, we demonstrate that the interplay between the network protocol and the application plays an important role in this tier of devices. Based on that observation, we co-design a continuous monitoring application with a modern network stack and augment it further with an in-network aggregation technique. In this way, we show that awareness of the underlying network stack is important to realize the full potential of the continuous monitoring algorithm.

    The techniques and solutions presented in this thesis contribute to better utilization of hardware characteristics across a wide spectrum of platforms. We employ these techniques on problems that are representative examples of current and upcoming applications, and contribute an outlook of emerging possibilities that can build on the results of the thesis.
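    To make the sketch-based summarization mentioned above concrete: a standard example of such a data structure is the count-min sketch, which summarizes per-flow traffic counts in sublinear space. The single-threaded Python sketch below illustrates the data structure only; the thesis's thread-aware parallelization scheme is not reproduced here:

```python
import hashlib

class CountMinSketch:
    """Standard count-min sketch: a compact frequency summary of a
    data stream, of the kind used for high-rate traffic monitoring."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # One independent hash per row, derived from a keyed digest.
        h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
        return int.from_bytes(h.digest(), "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._hash(item, row)] += count

    def estimate(self, item):
        # True count <= estimate; overestimation is bounded by the
        # stream size divided by the sketch width, with high probability.
        return min(self.table[row][self._hash(item, row)]
                   for row in range(self.depth))

cms = CountMinSketch()
for flow in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    cms.add(flow)
print(cms.estimate("10.0.0.1"))  # 2 (possibly overestimated, never under)
```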