
    Vector coprocessor sharing techniques for multicores: performance and energy gains

    Vector Processors (VPs) created the breakthroughs needed for the emergence of computational science many years ago, and all commercial computing architectures on the market today contain some form of vector or SIMD processing. Yet many high-performance and embedded applications, often dealing with streams of data, cannot efficiently utilize dedicated vector processors for various reasons: a limited percentage of sustained vector code due to substantial flow control; inherently small parallelism or the frequent involvement of operating system tasks; varying vector lengths across applications or within a single application; and data dependencies within short sequences of instructions, a problem further exacerbated without loop unrolling or other compiler optimizations. Additionally, existing rigid SIMD architectures cannot efficiently tolerate dynamic application environments with many cores that may require the runtime adjustment of assigned vector resources in order to operate at desired energy/performance levels. To simultaneously alleviate these drawbacks of rigid lane-based VP architectures, while also releasing on-chip real estate for other important design choices, the first part of this research proposes three architectural contexts for the implementation of a shared vector coprocessor in multicore processors; sharing an expensive resource among multiple cores increases the utilization of its functional units and the overall system throughput. The second part of the dissertation evaluates and characterizes the three proposed shared vector architectures from the performance and power perspectives on an FPGA (Field-Programmable Gate Array) prototype. The third part introduces performance and power estimation models based on observations deduced from the experimental results. The results show the opportunity to adaptively adjust the number of vector lanes assigned to individual cores or processing threads in order to minimize various energy-performance metrics on modern vector-capable multicore processors that run applications with dynamic workloads. Therefore, the fourth part of this research develops a fine-to-coarse-grain power management technique and a relevant adaptive hardware/software infrastructure which dynamically adjusts the assigned VP resources (number of vector lanes) to minimize the energy consumption of applications with dynamic workloads. Finally, to remove the inherent limitations imposed by FPGA technologies, the fifth part of this work implements an ASIC (Application-Specific Integrated Circuit) version of the shared VP, enabling precise performance-energy studies of high-performance vector processing in multicore environments.
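
    The adaptive lane assignment described above can be illustrated with a toy model. The following is a minimal sketch, not the dissertation's actual estimation model: it assumes an Amdahl-style runtime that shrinks with the number of lanes and a power draw that grows linearly with active lanes (all constants are illustrative placeholders), and it picks the lane count that minimizes the energy-delay product (EDP).

        /* Minimal sketch (not the dissertation's estimation model): pick the
         * number of vector lanes that minimizes the energy-delay product for
         * a kernel, assuming an Amdahl-style runtime with vectorizable
         * fraction f and a power draw that grows linearly with active lanes.
         * All constants are illustrative placeholders. */
        #include <stdio.h>

        int main(void) {
            const double t_scalar = 1.0;   /* normalized scalar runtime     */
            const double f        = 0.85;  /* assumed vectorizable fraction */
            const double p_base   = 1.0;   /* core + control power (a.u.)   */
            const double p_lane   = 0.25;  /* assumed power per active lane */
            int best_lanes = 1;
            double best_edp = 1e30;

            for (int lanes = 1; lanes <= 32; lanes *= 2) {
                double t   = t_scalar * ((1.0 - f) + f / lanes);
                double p   = p_base + p_lane * lanes;
                double edp = (p * t) * t;             /* energy x delay */
                printf("%2d lanes: t=%.3f P=%.2f EDP=%.3f\n", lanes, t, p, edp);
                if (edp < best_edp) { best_edp = edp; best_lanes = lanes; }
            }
            printf("assign %d lanes for minimum EDP\n", best_lanes);
            return 0;
        }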

    Design and evaluation of an adaptable vector coprocessor for multicores

    Future applications for multi-core processor systems will require increased signal processing power along with increased resource utilization and decreased power consumption. Conserving power will be of paramount importance, primarily for battery-powered portable multi-core platforms (e.g., advanced cell phones, tablet computers, etc.). This thesis investigates the robustness, efficiency, and effectiveness of vector coprocessor sharing policies in multi-core environments. Vector coprocessor sharing is based on an innovative design for a vector lane that forms the building block for the creation of larger vector coprocessors. This lane design contains a floating-point multiply unit, a floating-point add/subtract unit, a miscellaneous function unit, a load/store unit, and a vector register file. The design was prototyped and benchmarked on a field-programmable gate array (FPGA) for a multitude of configurations to evaluate performance and power consumption. The configurations included one or two host processors and two, four, eight, sixteen, or thirty-two lanes. The sample applications used in benchmarking were the fast Fourier transform, a finite impulse response filter, matrix multiplication, and LU matrix decomposition. As an additional experiment, a reconfigurable unit was added to the lane and configured as either a combined floating-point multiply/add or a floating-point divide unit to better match the needs of specific applications. The results show the versatility of the design towards high performance and controllable power consumption.
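
    For reference, one of the benchmarked kernels can be stated in a few lines of plain C. The sketch below is a generic scalar FIR filter, the kind of multiply-accumulate loop that the lane design spreads across its floating-point multiply and add/subtract units; it is a textbook formulation, not the thesis's VHDL.

        /* Scalar reference for one benchmark kernel (FIR filter): the kind of
         * loop a lane-based vector coprocessor spreads across its multiply
         * and add/subtract units. Generic textbook form, no vendor code. */
        #include <stddef.h>

        void fir(const float *x, const float *h, float *y,
                 size_t n, size_t taps) {
            for (size_t i = 0; i + taps <= n; i++) {   /* output samples   */
                float acc = 0.0f;
                for (size_t k = 0; k < taps; k++)      /* multiply-accum.  */
                    acc += x[i + k] * h[taps - 1 - k];
                y[i] = acc;
            }
        }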

    Vicuna: A Timing-Predictable RISC-V Vector Coprocessor for Scalable Parallel Computation

    High throughput spatial convolution filters on FPGAs

    Digital signal processing (DSP) on field-programmable gate arrays (FPGAs) has long been appealing because of the inherent parallelism in these computations, which can be easily exploited to accelerate such algorithms. FPGAs have evolved significantly to further enhance the mapping of these algorithms, including additional hard blocks such as the DSP blocks found in modern FPGAs. Although these DSP blocks can offer more efficient mapping of DSP computations, they are primarily designed for 1-D filter structures. We present a study of spatial convolution filter implementations on FPGAs, optimized around the structure of the DSP blocks to offer high throughput while maintaining the coefficient flexibility that other published architectures usually sacrifice. We show that it is possible to implement large filters for large 4K-resolution image frames at frame rates of 30–60 FPS while maintaining functional flexibility.
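
    To make the throughput target concrete: a 4K frame is 3840 × 2160 ≈ 8.3 Mpixels, so 60 FPS is roughly 498 Mpixels per second, with every output pixel requiring k × k multiply-accumulates. The sketch below is a plain scalar 2-D convolution, intended only as a functional reference for what the DSP-block pipeline computes, not the paper's FPGA architecture.

        /* Functional reference for a spatial (2-D) convolution with an odd
         * k x k kernel; border pixels are left untouched for brevity. */
        #include <stddef.h>

        void conv2d(const float *img, const float *kern, float *out,
                    size_t w, size_t h, size_t k) {     /* k is odd */
            size_t r = k / 2;
            for (size_t y = r; y < h - r; y++)
                for (size_t x = r; x < w - r; x++) {
                    float acc = 0.0f;
                    for (size_t ky = 0; ky < k; ky++)
                        for (size_t kx = 0; kx < k; kx++)
                            acc += img[(y + ky - r) * w + (x + kx - r)]
                                 * kern[ky * k + kx];
                    out[y * w + x] = acc;
                }
        }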

    Using the Xeon Phi platform to run speculatively-parallelized codes

    Intel Xeon Phi accelerators are among the newest devices used in the field of parallel computing. However, there are comparatively few studies of their performance when using most of the existing parallelization techniques. One of these is thread-level speculation, a technique that optimistically tries to extract parallelism from loops without the need for a compile-time analysis that guarantees that the loop can be executed in parallel. In this article we evaluate the performance delivered by an Intel Xeon Phi coprocessor when using a state-of-the-art software thread-level speculative parallelization library in the execution of well-known benchmarks. We describe both the internal characteristics of the Xeon Phi platform and the particularities of the thread-level speculation library used. Our results show that, although the Xeon Phi delivers a relatively good speedup in comparison with a shared-memory architecture in terms of scalability, the relatively low computing power of its computational units, when specific vectorization and SIMD instructions are not fully exploited, makes this first generation of Xeon Phi architectures uncompetitive (in terms of absolute performance) with conventional multicore systems for the execution of speculatively parallelized code. Funding: Castilla-Leon Regional Government (VA172A12-2); MICINN (Spain) and the European Union FEDER (MOGECOPP project TIN2011-25639, HomProg-HetSys project TIN2014-58876-P, CAPAP-H5 network TIN2014-53522-REDT).
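
    The core idea of thread-level speculation can be shown with a deliberately simplified, single-threaded simulation (not the library evaluated in the article): iterations execute optimistically against the old state, and at in-order commit time any iteration that read a location written by an earlier iteration is squashed and re-executed with up-to-date data.

        /* Toy simulation of thread-level speculation for the loop
         * a[i] = a[src[i]] + 1, which has cross-iteration dependences
         * whenever src[i] < i. Illustrative only. */
        #include <stdio.h>

        #define N 8
        static int a[N]   = {0};
        static int src[N] = {0, 0, 1, 1, 2, 5, 3, 6};

        int main(void) {
            int read_of[N], val[N];

            /* 1. speculative phase: all iterations run "in parallel"
             *    against the initial state, recording what they read.  */
            for (int i = 0; i < N; i++) {
                read_of[i] = src[i];
                val[i]     = a[src[i]] + 1;
            }

            /* 2. commit phase, in iteration order: iteration i is valid
             *    only if no earlier iteration wrote the location it read;
             *    otherwise it is squashed and re-executed.              */
            for (int i = 0; i < N; i++) {
                int violated = read_of[i] < i;
                if (violated)
                    val[i] = a[src[i]] + 1;   /* squash + re-execute */
                a[i] = val[i];
                printf("iter %d %s -> a[%d]=%d\n",
                       i, violated ? "squashed" : "committed", i, a[i]);
            }
            return 0;
        }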

    Energy Efficient Parallel K-Means Clustering for an Intel Hybrid Multi-Chip Package

    FPGA devices have proven to be good candidates for accelerating applications from different research topics. For instance, machine learning applications such as K-Means clustering usually rely on large amounts of data to be processed, and, despite the performance offered by other architectures, FPGAs can offer better energy efficiency. With that in mind, Intel® has launched a platform that integrates a multicore CPU and an FPGA in the same package, enabling low-latency and coherent fine-grained data offload. In this paper, we present a parallel implementation of the K-Means clustering algorithm for this novel platform, using the OpenCL language, and compare it against other platforms. We found that the CPU+FPGA platform was more energy efficient than the CPU-only approach by 70.71% to 85.92% with the Standard and Tiny input sizes, respectively, and up to a 68.21% performance improvement was obtained with the Tiny input size. Furthermore, it was up to 7.2× more energy efficient than an Intel® Xeon Phi™, 21.5× more than a cluster of Raspberry Pi boards, and 3.8× more than the low-power MPPA-256 architecture when the Standard input size was used.
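
    As a functional reference for the offloaded kernel, the sketch below shows one K-Means iteration (point assignment plus centroid update) in plain scalar C; the paper's implementation targets the FPGA via OpenCL, so the names and structure here are illustrative only.

        /* One K-Means iteration: assign each point to its nearest centroid
         * (squared Euclidean distance), then recompute each centroid as
         * the mean of its assigned points. Scalar reference sketch. */
        #include <float.h>
        #include <stddef.h>

        void kmeans_step(const float *pts, float *cent, int *label,
                         size_t n, size_t k, size_t d) {
            for (size_t i = 0; i < n; i++) {          /* assignment */
                float best = FLT_MAX; int arg = 0;
                for (size_t c = 0; c < k; c++) {
                    float dist = 0.0f;
                    for (size_t j = 0; j < d; j++) {
                        float t = pts[i*d + j] - cent[c*d + j];
                        dist += t * t;
                    }
                    if (dist < best) { best = dist; arg = (int)c; }
                }
                label[i] = arg;
            }
            for (size_t c = 0; c < k; c++) {          /* update */
                size_t cnt = 0;
                for (size_t j = 0; j < d; j++) cent[c*d + j] = 0.0f;
                for (size_t i = 0; i < n; i++) {
                    if ((size_t)label[i] != c) continue;
                    cnt++;
                    for (size_t j = 0; j < d; j++) cent[c*d + j] += pts[i*d + j];
                }
                if (cnt)
                    for (size_t j = 0; j < d; j++) cent[c*d + j] /= (float)cnt;
            }
        }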

    Evaluation of real-time LBP computing in multiple architectures

    Local Binary Pattern (LBP) is a texture operator used in several different computer vision applications, in many cases requiring real-time operation on multiple computing platforms. The arrival of new video standards has increased typical resolutions and frame rates, which demand considerable computational performance. Since LBP is essentially a pixel operator that scales with image size, typical straightforward implementations are usually insufficient to meet these requirements. To identify the solutions that maximize the performance of real-time LBP extraction, we compare a series of different implementations in terms of computational performance and energy efficiency, while analyzing the optimizations that can be made to reach real-time performance on multiple platforms with their different available computing resources. Our contribution includes an extensive survey of the LBP implementations on different platforms that can be found in the literature. To provide a more complete evaluation, we have implemented the LBP algorithms on several platforms, such as Graphics Processing Units, mobile processors, and an image coprocessor with a hybrid programming model, and we have extended the evaluation of some of the solutions found in previous work. In addition, we publish the source code of our implementations.
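
    For readers unfamiliar with the operator, the basic 8-neighbour LBP is compact enough to state directly. The sketch below is a generic scalar reference, not one of the paper's optimized implementations: each pixel is replaced by an 8-bit code with one bit per neighbour whose value is greater than or equal to the centre pixel.

        /* Basic 8-neighbour LBP: each interior pixel becomes an 8-bit code,
         * one bit per neighbour that is >= the centre pixel. */
        #include <stddef.h>
        #include <stdint.h>

        void lbp8(const uint8_t *img, uint8_t *out, size_t w, size_t h) {
            /* clockwise from top-left: (dy, dx) offsets of the neighbours */
            static const int dy[8] = {-1,-1,-1, 0, 1, 1, 1, 0};
            static const int dx[8] = {-1, 0, 1, 1, 1, 0,-1,-1};
            for (size_t y = 1; y + 1 < h; y++)
                for (size_t x = 1; x + 1 < w; x++) {
                    uint8_t c = img[y*w + x], code = 0;
                    for (int b = 0; b < 8; b++)
                        if (img[(y + dy[b])*w + (x + dx[b])] >= c)
                            code |= (uint8_t)(1u << b);
                    out[y*w + x] = code;
                }
        }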

    Vector processor virtualization: distributed memory hierarchy and simultaneous multithreading

    Taking advantage of DLP (Data-Level Parallelism) is indispensable in most data-streaming and multimedia applications. Several architectures have been proposed to improve both the performance and the energy consumption of such applications. Superscalar and VLIW (Very Long Instruction Word) processors, along with SIMD (Single-Instruction Multiple-Data) and vector processor (VP) accelerators, are among the options available to designers to meet their requirements. On the other hand, these choices are large consumers of resources and energy, and they are not always used efficiently due to data dependencies among instructions and the limited portion of vectorizable code in the applications that deploy them. This dissertation proposes an innovative architecture for a multithreaded VP which separates the path for performing data shuffles and memory-indexed accesses from the data path for executing other vector instructions that access the memory. This separation speeds up the most common memory access operations by avoiding extra delays and unnecessary stalls. In this multilane-based VP design, each vector lane uses its own private memory to avoid stalls during memory access instructions. More importantly, the proposed VP has an innovative multithreaded architecture which makes it highly suitable for concurrent sharing in multicore environments. To this end, the VP, which is developed in VHDL and prototyped on an FPGA (Field-Programmable Gate Array), serves as a coprocessor for one or more scalar cores in the various system architectures presented in the dissertation. In the first system architecture, the VP is allocated exclusively to a single scalar core. Benchmarking shows that the VP can achieve very high performance, and the inclusion of distributed data shuffle engines across the vector lanes has a spectacular impact on the execution time, primarily for applications like the FFT (Fast Fourier Transform) that require large amounts of data shuffling. In the second system architecture, a VP virtualization technique is presented which, when applied, enables the multithreaded VP to simultaneously execute many threads of various vector lengths. The threads compete simultaneously for the VP resources, with the goal of improving aggregate VP utilization; this approach yields high VP utilization even when the utilization of individual threads is low. A vector register file (VRF) virtualization technique dynamically allocates physical vector registers to running threads. The technique is implemented for a multicore processor embedded in an FPGA. Under the dynamic creation of threads, benchmarking demonstrates large VP speedups and drastic energy savings when compared to the first system architecture. In the last system architecture, further improvements focus on VP virtualization relying exclusively on hardware, and a pipelined data shuffle network replaces the non-pipelined shuffle engines. The VP can then take advantage of identical instruction flows that may be present in different vector applications by running in a fused instruction mode that increases its utilization. A power dissipation model is introduced, as well as two optimization policies that minimize either the consumed energy or the product of the energy and the runtime for a given application. Benchmarking shows the positive impact of these optimizations.
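
    The VRF virtualization idea, dynamically mapping each thread's architectural vector registers onto a shared pool of physical registers, can be sketched in software as follows. This is a hypothetical illustration of the bookkeeping, not the dissertation's VHDL; the register counts and the behaviour when the pool is exhausted are assumptions.

        /* Hypothetical sketch of VRF virtualization: each thread's
         * architectural vector registers are mapped on demand to physical
         * registers drawn from a shared pool, so concurrent threads can
         * share one vector register file. */
        #include <stdio.h>

        #define PHYS_REGS 16
        #define ARCH_REGS 8
        #define THREADS   4

        static int map[THREADS][ARCH_REGS];   /* -1 = unmapped */
        static int free_list[PHYS_REGS], free_top = PHYS_REGS;

        static void init(void) {
            for (int t = 0; t < THREADS; t++)
                for (int r = 0; r < ARCH_REGS; r++) map[t][r] = -1;
            for (int p = 0; p < PHYS_REGS; p++) free_list[p] = p;
        }

        /* Translate thread t's architectural register r, allocating a
         * physical register on first use; returns -1 if the pool is empty
         * (a real design would stall the requesting thread). */
        static int translate(int t, int r) {
            if (map[t][r] < 0) {
                if (free_top == 0) return -1;
                map[t][r] = free_list[--free_top];
            }
            return map[t][r];
        }

        int main(void) {
            init();
            printf("T0 v2 -> p%d\n", translate(0, 2));
            printf("T1 v2 -> p%d\n", translate(1, 2)); /* own phys reg   */
            printf("T0 v2 -> p%d\n", translate(0, 2)); /* stable mapping */
            return 0;
        }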