652 research outputs found

    Inclusive-PIM: Hardware-Software Co-design for Broad Acceleration on Commercial PIM Architectures

    Full text link
    Continual demand for memory bandwidth has made it worthwhile for memory vendors to reassess processing in memory (PIM), which enables higher bandwidth by placing compute units in/near-memory. As such, memory vendors have recently proposed commercially viable PIM designs. However, these proposals are largely driven by the needs of (a narrow set of) machine learning (ML) primitives. While such proposals are reasonable given the the growing importance of ML, as memory is a pervasive component, %in this work, we make there is a case for a more inclusive PIM design that can accelerate primitives across domains. In this work, we ascertain the capabilities of commercial PIM proposals to accelerate various primitives across domains. We first begin with outlining a set of characteristics, termed PIM-amenability-test, which aid in assessing if a given primitive is likely to be accelerated by PIM. Next, we apply this test to primitives under study to ascertain efficient data-placement and orchestration to map the primitives to underlying PIM architecture. We observe here that, even though primitives under study are largely PIM-amenable, existing commercial PIM proposals do not realize their performance potential for these primitives. To address this, we identify bottlenecks that arise in PIM execution and propose hardware and software optimizations which stand to broaden the acceleration reach of commercial PIM designs (improving average PIM speedups from 1.12x to 2.49x relative to a GPU baseline). Overall, while we believe emerging commercial PIM proposals add a necessary and complementary design point in the application acceleration space, hardware-software co-design is necessary to deliver their benefits broadly

    On the automated compilation of UML notation to a VLIW chip multiprocessor

    Get PDF
    With the availability of more and more cores within architectures the process of extracting implicit and explicit parallelism in applications to fully utilise these cores is becoming complex. Implicit parallelism extraction is performed through the inclusion of intelligent software and hardware sections of tool chains although these reach their theoretical limit rather quickly. Due to this the concept of a method of allowing explicit parallelism to be performed as fast a possible has been investigated. This method enables application developers to perform creation and synchronisation of parallel sections of an application at a finer-grained level than previously possible, resulting in smaller sections of code being executed in parallel while still reducing overall execution time. Alongside explicit parallelism, a concept of high level design of applications destined for multicore systems was also investigated. As systems are getting larger it is becoming more difficult to design and track the full life-cycle of development. One method used to ease this process is to use a graphical design process to visualise the high level designs of such systems. One drawback in graphical design is the explicit nature in which systems are required to be generated, this was investigated, and using concepts already in use in text based programming languages, the generation of platform-independent models which are able to be specialised to multiple hardware architectures was developed. The explicit parallelism was performed using hardware elements to perform thread management, this resulted in speed ups of over 13 times when compared to threading libraries executed in software on commercially available processors. This allowed applications with large data dependent sections to be parallelised in small sections within the code resulting in a decrease of overall execution time. The modelling concepts resulted in the saving of between 40-50% of the time and effort required to generate platform-specific models while only incurring an overhead of up to 15% the execution cycles of these models designed for specific architectures

    Modelling of Information Flow and Resource Utilization in the EDGE Distributed Web System

    Get PDF
    The adoption of Distributed Web Systems (DWS) into modern engineering design process has dramatically increased in recent years. The Engineering Design Guide and Environment (EDGE) is one such DWS, intended to provide an integrated set of tools for use in the development of new products and services. Previous attempts to improve the efficiency and scalability of DWS focused largely on hardware utilization (i.e. multithreading and virtualization) and software scalability (i.e. load balancing and cloud services). However, these techniques are often limited to analysis of the computational complexity of the algorithms implemented. This work seeks to improve the understanding of efficiency and scalability of DWS by modelling the dynamics of information flow and resource utilization by characterizing DWS workloads through historical usage data (i.e. request type, frequency, access time). The design and implementation of EDGE is described. A DWS model of an EDGE system is developed and validated against theoretical limiting cases. The DWS model is used to predict the throughput of an EDGE system given a resource allocation and workflow. Results of the simulation suggest that proposed DWS designs can be evaluated according to the usage requirements of an engineering firm, ultimately guiding an informed decision for the selection and deployment of a DWS in an enterprise environment. Recommendations for future work related to the continued development of EDGE, DWS modelling of EDGE installation environments, and the extension of DWS modelling to new product development processes are presented

    PiCo: A Domain-Specific Language for Data Analytics Pipelines

    Get PDF
    In the world of Big Data analytics, there is a series of tools aiming at simplifying programming applications to be executed on clusters. Although each tool claims to provide better programming, data and execution models—for which only informal (and often confusing) semantics is generally provided—all share a common under- lying model, namely, the Dataflow model. Using this model as a starting point, it is possible to categorize and analyze almost all aspects about Big Data analytics tools from a high level perspective. This analysis can be considered as a first step toward a formal model to be exploited in the design of a (new) framework for Big Data analytics. By putting clear separations between all levels of abstraction (i.e., from the runtime to the user API), it is easier for a programmer or software designer to avoid mixing low level with high level aspects, as we are often used to see in state-of-the-art Big Data analytics frameworks. From the user-level perspective, we think that a clearer and simple semantics is preferable, together with a strong separation of concerns. For this reason, we use the Dataflow model as a starting point to build a programming environment with a simplified programming model implemented as a Domain-Specific Language, that is on top of a stack of layers that build a prototypical framework for Big Data analytics. The contribution of this thesis is twofold: first, we show that the proposed model is (at least) as general as existing batch and streaming frameworks (e.g., Spark, Flink, Storm, Google Dataflow), thus making it easier to understand high-level data-processing applications written in such frameworks. As result of this analysis, we provide a layered model that can represent tools and applications following the Dataflow paradigm and we show how the analyzed tools fit in each level. Second, we propose a programming environment based on such layered model in the form of a Domain-Specific Language (DSL) for processing data collections, called PiCo (Pipeline Composition). The main entity of this programming model is the Pipeline, basically a DAG-composition of processing elements. This model is intended to give the user an unique interface for both stream and batch processing, hiding completely data management and focusing only on operations, which are represented by Pipeline stages. Our DSL will be built on top of the FastFlow library, exploiting both shared and distributed parallelism, and implemented in C++11/14 with the aim of porting C++ into the Big Data world

    Energy-Efficient Hardware-Accelerated Synchronization for Shared-L1-Memory Multiprocessor Clusters

    Get PDF
    The steeply growing performance demands for highly power- and energy-constrained processing systems such as end-nodes of the Internet-of-Things (IoT) have led to parallel near-threshold computing (NTC), joining the energy-efficiency benefits of low-voltage operation with the performance typical of parallel systems. Shared-L1-memory multiprocessor clusters are a promising architecture, delivering performance in the order of GOPS and over 100 GOPS/W of energy-efficiency. However, this level of computational efficiency can only be reached by maximizing the effective utilization of the processing elements (PEs) available in the clusters. Along with this effort, the optimization of PE-to-PE synchronization and communication is a critical factor for performance. In this article, we describe a light-weight hardware-accelerated synchronization and communication unit (SCU) for tightly-coupled clusters of processors. We detail the architecture, which enables fine-grain per-PE power management, and its integration into an eight-core cluster of RISC-V processors. To validate the effectiveness of the proposed solution, we implemented the eight-core cluster in advanced 22 nm FDX technology and evaluated performance and energy-efficiency with tunable microbenchmarks and a set of rea-life applications and kernels. The proposed solution allows synchronization-free regions as small as 42 cycles, over 41 smaller than the baseline implementation based on fast test-and-set access to L1 memory when constraining the microbenchmarks to 10 percent synchronization overhead. When evaluated on the real-life DSP-applications, the proposed SCU improves performance by up to 92 and 23 percent on average and energy efficiency by up to 98 and 39 percent on average

    FPGA based technical solutions for high throughput data processing and encryption for 5G communication: A review

    Get PDF
    The field programmable gate array (FPGA) devices are ideal solutions for high-speed processing applications, given their flexibility, parallel processing capability, and power efficiency. In this review paper, at first, an overview of the key applications of FPGA-based platforms in 5G networks/systems is presented, exploiting the improved performances offered by such devices. FPGA-based implementations of cloud radio access network (C-RAN) accelerators, network function virtualization (NFV)-based network slicers, cognitive radio systems, and multiple input multiple output (MIMO) channel characterizers are the main considered applications that can benefit from the high processing rate, power efficiency and flexibility of FPGAs. Furthermore, the implementations of encryption/decryption algorithms by employing the Xilinx Zynq Ultrascale+MPSoC ZCU102 FPGA platform are discussed, and then we introduce our high-speed and lightweight implementation of the well-known AES-128 algorithm, developed on the same FPGA platform, and comparing it with similar solutions already published in the literature. The comparison results indicate that our AES-128 implementation enables efficient hardware usage for a given data-rate (up to 28.16 Gbit/s), resulting in higher efficiency (8.64 Mbps/slice) than other considered solutions. Finally, the applications of the ZCU102 platform for high-speed processing are explored, such as image and signal processing, visual recognition, and hardware resource management

    MPEG-4 Software Video Encoding

    Get PDF
    A Thesis submitted in fulfillment of the requirements of the degree of doctor of Philosophy in the University of LondonThis thesis presents a software model that allows a parallel decomposition of the MPEG-4 video encoder onto shared memory architectures, in order to reduce its total video encoding time. Since a video sequence consists of video objects each of which is likely to have different encoding requirements, the model incorporates a scheduler which (a) always selects the most appropriate video object for encoding and, (b) employs a mechanism for dynamically allocating video objects allocation onto the system processors, based on video object size information. Further spatial video object parallelism is exploited by applying the single program multiple data (SPMD) paradigm within the different modules of the MPEG-4 video encoder. Due to the fact that not all macroblocks have the same processing requirements, the model also introduces a data partition scheme that generates tiles with identical processing requirements. Since, macroblock data dependencies preclude data parallelism at the shape encoder the model also introduces a new mechanism that allows parallelism using a circular pipeline macroblock technique The encoding time depends partly on an encoder’s computational complexity. This thesis also addresses the problem of the motion estimation, as its complexity has a significant impact on the encoder’s complexity. In particular, two fast motion estimation algorithms have been developed for the model which reduce the computational complexity significantly. The thesis includes experimental results on a four processor shared memory platform, Origin200
    • …
    corecore