    EASY: efficient arbiter SYnthesis from multi-threaded code

    High-Level Synthesis (HLS) tools automatically transform a high-level specification of a circuit into a low-level RTL description. Traditionally, HLS tools have operated on sequential code; however, in recent years there has been a drive to synthesize multi-threaded code. A major challenge facing HLS tools in this context is how to automatically partition memory amongst parallel threads to fully exploit the bandwidth available on an FPGA device and avoid memory contention. Current automatic memory partitioning techniques have inefficient arbitration due to conservative assumptions regarding which threads may access a given memory bank. In this paper, we address this problem through formal verification techniques, permitting a less conservative, yet provably correct, circuit to be generated. We perform a static analysis on the code to determine which memory banks are shared by which threads. This analysis enables us to optimize the arbitration efficiency of the generated circuit. We apply our approach to the LegUp HLS tool and show that for a set of typical application benchmarks we can achieve up to 87% area savings and a 39% execution time improvement, with little additional compilation time.
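    To make the arbitration problem concrete, here is a small illustration of our own (not code from the paper): two PThreads whose accesses provably fall into disjoint memory banks, the kind of disjointness the paper's static analysis is described as proving so that the generated arbiter need not serialize the threads.

```c
/* Hypothetical illustration (not from the paper): each thread writes
 * only to the bank selected by its own id, so a static analysis can
 * prove bank 0 is touched only by thread 0 and bank 1 only by
 * thread 1, removing the need for conservative arbitration. */
#include <pthread.h>
#include <stdio.h>

#define BANK_SIZE 256
static int banks[2][BANK_SIZE];   /* two independent memory banks */

static void *worker(void *arg) {
    int id = (int)(long)arg;      /* thread id selects its bank */
    for (int i = 0; i < BANK_SIZE; i++)
        banks[id][i] = id * BANK_SIZE + i;  /* bank index == thread id */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long id = 0; id < 2; id++)
        pthread_create(&t[id], NULL, worker, (void *)id);
    for (int id = 0; id < 2; id++)
        pthread_join(t[id], NULL);
    printf("banks[0][0]=%d banks[1][0]=%d\n", banks[0][0], banks[1][0]);
    return 0;
}
```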

    VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads

    This paper was accepted for publication in the journal Microprocessors and Microsystems; the definitive published version is available at http://dx.doi.org/10.1016/j.micpro.2016.07.010. We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory thread support. VThreads supports Instruction-Level Parallelism via static multiple issue and Thread-Level Parallelism via hardware-assisted POSIX Threads, along with extensive customization. It allows the instantiation of tightly-coupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable Gate Arrays. It was evaluated against a PThreads-based multiprocessor based on the SPARC V8 ISA. On a 65 nm ASIC implementation, VThreads achieves up to a 7.2x performance increase on synthetic benchmarks, 5x on a parallel Mandelbrot implementation, 66% better performance on a threaded JPEG implementation, 79% better on an edge-detection benchmark, and ~13% improvement on DES, all compared to the Leon3MP CMP. In the range of 2 to 8 cores, VThreads demonstrates a post-route (statistical) power reduction of 57% to 65% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility, hardware support for low-latency PThreads, power efficiency, and area makes the processor an attractive proposition for low-power, deeply-embedded applications requiring minimal OS support.
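    As a point of reference, the programming model that VThreads accelerates is the standard POSIX Threads API. The minimal sketch below (ours, not code from the paper) shows the kind of shared-memory threading, with thread creation, joining, and mutex synchronization, whose latency the hardware assistance targets.

```c
/* A minimal PThreads sketch (our illustration, not code from the
 * paper): the standard shared-memory API whose create/join and
 * synchronization latency VThreads supports in hardware. */
#include <pthread.h>
#include <stdio.h>

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        pthread_mutex_lock(&lock);    /* lock/unlock are among the    */
        counter++;                    /* primitives a hardware-       */
        pthread_mutex_unlock(&lock);  /* assisted runtime speeds up   */
    }
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);   /* expect 4000 */
    return 0;
}
```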

    Vector processor virtualization: distributed memory hierarchy and simultaneous multithreading

    Taking advantage of DLP (Data-Level Parallelism) is indispensable in most data-streaming and multimedia applications. Several architectures have been proposed to improve both the performance and energy consumption of such applications. Superscalar and VLIW (Very Long Instruction Word) processors, along with SIMD (Single-Instruction Multiple-Data) and vector processor (VP) accelerators, are among the options available to designers. On the other hand, these choices turn out to be large resource and energy consumers, and they are not always used efficiently, due to data dependencies among instructions and the limited portion of vectorizable code in the applications that deploy them. This dissertation proposes an innovative architecture for a multithreaded VP which separates the path for performing data shuffles and memory-indexed accesses from the datapath for executing other vector instructions that access the memory. This separation speeds up the most common memory access operations by avoiding extra delays and unnecessary stalls. In this multilane-based VP design, each vector lane uses its own private memory to avoid any stalls during memory access instructions. More importantly, the proposed VP has an innovative multithreaded architecture which makes it highly suitable for concurrent sharing in multicore environments. To this end, the VP, which is developed in VHDL and prototyped on an FPGA (Field-Programmable Gate Array), serves as a coprocessor for one or more scalar cores in the various system architectures presented in the dissertation. In the first system architecture, the VP is allocated exclusively to a single scalar core. Benchmarking shows that the VP can achieve very high performance. The inclusion of distributed data shuffle engines across vector lanes has a substantial impact on execution time, primarily for applications like the FFT (Fast Fourier Transform) that require large amounts of data shuffling. In the second system architecture, a VP virtualization technique is presented which, when applied, enables the multithreaded VP to simultaneously execute many threads of various vector lengths. The threads compete simultaneously for the VP resources, with the goal of improving aggregate VP utilization. This approach yields high VP utilization even when utilization is low for the individual threads. A vector register file (VRF) virtualization technique dynamically allocates physical vector registers to running threads. The technique is implemented for a multi-core processor embedded in an FPGA. Under the dynamic creation of threads, benchmarking demonstrates large VP speedups and drastic energy savings when compared to the first system architecture. In the last system architecture, further improvements focus on VP virtualization relying exclusively on hardware. Moreover, a pipelined data shuffle network replaces the non-pipelined shuffle engines. The VP can then take advantage of identical instruction flows that may be present in different vector applications by running in a fused instruction mode that increases its utilization. A power dissipation model is introduced, as well as two optimization policies that minimize either the consumed energy or the product of energy and runtime for a given application. Benchmarking shows the positive impact of these optimizations.
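    The VRF virtualization idea can be illustrated with a small conceptual sketch (ours, not the dissertation's RTL): each thread names a private logical register space, and an allocator binds those names to free physical vector registers at run time, so competing threads never need static renaming.

```c
/* Conceptual sketch (ours, not the dissertation's hardware) of VRF
 * virtualization: each thread sees logical registers v0..v7, and a
 * small allocator maps those names to free physical registers. */
#include <stdio.h>

#define PHYS_REGS   32
#define LOG_REGS     8
#define MAX_THREADS  4

static int phys_free[PHYS_REGS];         /* 1 = available */
static int map[MAX_THREADS][LOG_REGS];   /* logical -> physical */

static void vrf_init(void) {
    for (int p = 0; p < PHYS_REGS; p++) phys_free[p] = 1;
    for (int t = 0; t < MAX_THREADS; t++)
        for (int r = 0; r < LOG_REGS; r++) map[t][r] = -1;
}

/* Resolve a thread's logical register, allocating on first use. */
static int vrf_resolve(int thread, int logical) {
    if (map[thread][logical] >= 0) return map[thread][logical];
    for (int p = 0; p < PHYS_REGS; p++) {
        if (phys_free[p]) {              /* grab a free register */
            phys_free[p] = 0;
            map[thread][logical] = p;
            return p;
        }
    }
    return -1;                           /* VRF exhausted: stall */
}

int main(void) {
    vrf_init();
    /* Two threads both use "v0" but receive distinct physical
       registers, so register name conflicts never arise. */
    printf("thread 0, v0 -> p%d\n", vrf_resolve(0, 0));
    printf("thread 1, v0 -> p%d\n", vrf_resolve(1, 0));
    return 0;
}
```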

    Architectural Enhancements for Data Transport in Datacenter Systems

    Datacenter systems run myriad applications, which frequently communicate with each other and/or with Input/Output (I/O) devices, including network adapters, storage devices, and accelerators. Due to the growing speed of I/O devices and the emergence of microservice-based programming models, I/O software stacks have become a critical factor in end-to-end communication performance. As such, I/O software stacks have been evolving rapidly in recent years. Datacenters rely on fast, efficient "Software Data Planes", which orchestrate data transfer between applications and I/O devices. The goal of this dissertation is to enhance the performance, efficiency, and scalability of software data planes by diagnosing their existing issues and addressing them through hardware-software solutions. In the first step, I characterize the challenges of modern software data planes, which bypass the operating system kernel to avoid the associated overheads. Since traditional interrupts and system calls cannot be delivered to user code without kernel assistance, kernel-bypass data planes use spinning cores on I/O queues to identify work/data arrival. Spin-polling obviously wastes CPU cycles on checking empty queues; however, I show that it entails even more drawbacks: (1) Full-tilt spinning cores perform more (useless) polling work when there is less work pending in the queues. (2) Spin-polling scales poorly with the number of polled queues due to processor cache capacity constraints, especially when traffic is unbalanced. (3) Spin-polling also scales poorly with the number of cores due to the overhead of polling and operation rate limits. (4) Whereas shared queues can mitigate load imbalance and head-of-line blocking, the synchronization overheads of spinning on them limit their potential benefits. Next, I propose a notification accelerator, dubbed HyperPlane, which replaces spin-polling in software data planes. The design principles of HyperPlane are: (1) not iterating over empty I/O queues to find work/data in ready ones, (2) blocking/halting when all queues are empty rather than spinning fruitlessly, and (3) allowing multiple cores to efficiently monitor a shared set of queues. These principles lead to queue scalability, work proportionality, and the theoretical merits of shared queues. HyperPlane is realized with a programming-model front-end and a hardware microarchitecture back-end. Evaluation of HyperPlane shows its significant advantage in throughput, average/tail latency, and energy efficiency over a state-of-the-art spin-polling-based software data plane, with very small power and area overheads. Finally, I focus on the data transfer aspect of software data planes. Cache misses incurred by accessing I/O data are a major bottleneck in software data planes. Despite considerable efforts put into delivering I/O data directly to the last-level cache, some access latency is still exposed. Cores cannot prefetch such data to nearer caches in today's systems because of the complex access pattern of data buffers and the lack of an appropriate notification mechanism that can trigger the prefetch operations. As such, I propose HyperData, a data transfer accelerator based on targeted prefetching. HyperData prefetches exact (rather than predicted) data buffers (or a required subset, to avoid cache pollution) to the L1 cache of the consumer core at the right time. Prefetching can be done for both core-peripheral and core-core communications. HyperData's prefetcher is programmable and supports various queue formats, namely direct (regular), indirect (Virtio), and multi-consumer queues. I show that, with a minor overhead, HyperData effectively hides data access latency in software data planes, thereby improving both application- and system-level performance and efficiency.
    PhD dissertation, Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/169826/1/hosseing_1.pd
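    For readers unfamiliar with the pattern, here is a minimal sketch (our illustration, with hypothetical queue structures) of the spin-polling loop that HyperPlane is described as replacing: a core iterates over every receive queue on every pass, burning cycles even when all the queues are empty.

```c
/* Sketch of kernel-bypass spin-polling (our illustration). The loop
 * checks every queue on every pass; when queues are empty, all of
 * this work is wasted, which is the inefficiency HyperPlane targets. */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NQUEUES 8
#define NSLOTS  256

struct queue {
    atomic_int head, tail;          /* producer/consumer indices */
    int slots[NSLOTS];
};

static bool queue_empty(struct queue *q) {
    return atomic_load(&q->head) == atomic_load(&q->tail);
}

static int queue_pop(struct queue *q) {
    int t = atomic_fetch_add(&q->tail, 1);
    return q->slots[t % NSLOTS];
}

int main(void) {
    static struct queue qs[NQUEUES];

    /* Full-tilt spinning, bounded here so the sketch terminates;
       a real data plane core spins like this forever. */
    for (int round = 0; round < 1000000; round++)
        for (int i = 0; i < NQUEUES; i++)
            if (!queue_empty(&qs[i]))
                printf("got item %d\n", queue_pop(&qs[i]));
    return 0;
}
```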

    Instruction fusion and vector processor virtualization for higher throughput simultaneous multithreaded processors

    The utilization wall, caused by the breakdown of threshold voltage scaling, hinders performance gains in new-generation microprocessors. To alleviate its impact, an instruction fusion technique is first proposed for multiscalar and many-core processors. With instruction fusion, similar copies of an instruction to be run on multiple pipelines or cores are merged into a single copy for simultaneous execution. Instruction fusion applied to vector code enables the processor to idle early pipeline stages and instruction caches at various times during program execution with minimal performance degradation, while reducing the program size and the required instruction memory bandwidth. Instruction fusion is applied to a MIPS-based dual-core that resembles an ideal multiscalar of degree two. Benchmarking using an FPGA prototype shows a 6-11% reduction in dynamic power dissipation as well as a 17-45% decrease in code size, with frequent performance improvements due to higher instruction cache hit rates. The second part of this dissertation deals with vector processors (VPs), which are commonly assigned exclusively to a single thread/core and are often not performance or energy efficient due to mismatches with the vector needs of individual applications. An easy-to-implement VP virtualization technology is presented to improve the VP's utilization and energy efficiency. The proposed VP virtualization technology, when applied, improves aggregate VP utilization by enabling the simultaneous execution of multiple threads of similar or disparate vector lengths on a multithreaded VP. With a vector register file (VRF) virtualization technique invented to dynamically allocate physical vector registers to threads, the virtualization approach improves programmer productivity by providing at run time a distinct physical register name space to each competing thread, thus eliminating the need to resolve register name conflicts statically. The virtualization technique is applied to a multithreaded VP prototyped on an FPGA; it supports VP sharing as well as power gating for better energy efficiency. A throughput-driven scheduler is proposed to optimize the virtualized VP's utilization in dynamic environments where diverse threads are created randomly. Simulations of various low-utilization benchmarks show that, with the proposed scheduler and power gating, the virtualized VP yields a more than 3-fold speedup while the reduction in total energy consumption approaches 40%, compared to the same VP running in single-threaded mode. The third part of this dissertation focuses on combining the two aforementioned technologies to create an improved VP prototype that is fully virtualized to support thread fusion and dynamic lane-based power gating (PG). The VP is capable of dynamically triggering thread fusion according to the availability of similar threads in the task queue. Once thread fusion is triggered, every vector instruction issued to the virtualized VP is interpreted as two similar instructions working in two independent virtual spaces, thus doubling the vector instruction issue rate. Based on an accurate power model of the VP prototype, two different policies are proposed to dynamically choose the optimal number of active VP lanes. With the combined effect of VP lane-based PG and thread fusion, benchmarking shows that, compared to a conventional VP without the two proposed capabilities, the new prototype yields up to a 33.8% energy reduction in addition to a 40% runtime improvement, or up to a 62.7% reduction in the product of energy and runtime.
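    The fusion condition itself can be shown with a toy model (ours, not the dissertation's hardware): when two threads fetch identical instruction words, a single copy is issued and tagged for both virtual spaces, halving the fetch and issue work for lockstep code.

```c
/* Toy model (our illustration) of the fusion check: identical
 * instruction words from two threads are merged into one issue slot
 * tagged for both virtual register spaces. */
#include <stdint.h>
#include <stdio.h>

struct issue_slot {
    uint32_t insn;        /* instruction word to issue */
    unsigned thread_mask; /* which threads execute it */
};

static struct issue_slot fuse_or_issue(uint32_t a, uint32_t b) {
    struct issue_slot s;
    if (a == b) {            /* identical flows: fuse into one copy */
        s.insn = a;
        s.thread_mask = 0x3; /* both thread 0 and thread 1 */
    } else {                 /* flows diverge: issue thread 0's copy */
        s.insn = a;
        s.thread_mask = 0x1; /* thread 1's copy issues separately */
    }
    return s;
}

int main(void) {
    struct issue_slot s = fuse_or_issue(0x00A30820u, 0x00A30820u);
    printf("insn=0x%08X mask=0x%X (fused if mask==0x3)\n",
           s.insn, s.thread_mask);
    return 0;
}
```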

    Accelerating the pace of protein functional annotation with intel xeon phi coprocessors

    Intel Xeon Phi is a new addition to the family of powerful parallel accelerators. The range of its potential applications in computationally driven research is broad; however, at present, the repository of scientific codes is still relatively limited. In this study, we describe the development and benchmarking of a parallel version of eFindSite, a structural bioinformatics algorithm for the prediction of ligand-binding sites in proteins. Implemented for the Intel Xeon Phi platform, the parallelization of the structure alignment portion of eFindSite using pragma-based OpenMP brings about the desired performance improvements, which scale well with the number of computing cores. Compared to a serial version, the parallel code runs 11.8 and 10.1 times faster on the CPU and the coprocessor, respectively; when both resources are utilized simultaneously, the speedup is 17.6. For example, ligand-binding predictions for 501 benchmarking proteins are completed in 2.1 hours on a single Stampede node equipped with the Intel Xeon Phi card, compared to 3.1 hours without the accelerator and 36.8 hours required by a serial version. In addition to the satisfactory parallel performance, porting existing scientific codes to the Intel Xeon Phi architecture is relatively straightforward, with a short development time due to the coprocessor's support for common parallel programming models. The parallel version of eFindSite is freely available to the academic community at www.brylinski.org/efindsite.
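    The abstract credits pragma-based OpenMP for the parallelization; the sketch below (illustrative only, not eFindSite source, with a placeholder align_one function) shows that pattern, where independent structure alignments are distributed across cores by a single pragma.

```c
/* Illustrative pragma-based OpenMP pattern (not eFindSite source):
 * independent template alignments parallelized with one pragma. */
#include <omp.h>
#include <stdio.h>

#define NTEMPLATES 501

/* Stand-in for one template-to-target structure alignment. */
static double align_one(int template_id) {
    return (double)template_id * 0.001;   /* placeholder score */
}

int main(void) {
    double scores[NTEMPLATES];

    /* Each alignment is independent, so one pragma suffices;
       dynamic scheduling helps when alignments have uneven cost. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < NTEMPLATES; i++)
        scores[i] = align_one(i);

    printf("ran %d alignments on up to %d threads\n",
           NTEMPLATES, omp_get_max_threads());
    return 0;
}
```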

    Collaborative aware CPU thread throttling and FPGA HLS versioning

    Warehouses and cloud servers have been adopting collaborative CPU-GPU and CPU-FPGA architectures as alternatives to provide extra acceleration for applications by partitioning thread/kernel execution across both devices. However, exploiting the benefits of these environments is challenging, since many factors may influence performance and energy consumption, such as the number of CPU threads, the workload balance, and optimization techniques such as FPGA HLS (High-Level Synthesis) versioning. This work shows that maximizing resource utilization by triggering the highest number of CPU threads does not always result in the best efficiency, for either CPU-GPU or CPU-FPGA architectures. Moreover, our experiments show that the amount of data distributed to each device (the workload balance) affects the required CPU processing power and, therefore, the number of active CPU threads for the application. To address these problems, we first propose ETCG (Energy-aware CPU Thread Throttling for CPU-GPU collaborative environments). ETCG transparently selects a near-optimal number of CPU threads to minimize the energy-delay product (EDP) of CPU-GPU applications. In the second study, we propose ETCF (Energy-Aware CPU Thread Throttling and Workload Balancing Framework for CPU-FPGA collaborative environments). ETCF automatically provides efficient CPU-FPGA execution by selecting the right workload balance and number of CPU threads for a given collaborative application. The ETCF framework offers different optimization goals: performance, energy, or EDP. Compared to the baseline (an equally balanced workload executing with the maximum number of CPU threads), ETCG and ETCF provide, on average, 73% and 93% EDP reduction, respectively. We also show that both ETCG and ETCF achieve near-optimal solutions, as compared to an exhaustive search, while taking at most 5% of its search time. Finally, in the third study, we consider a task-collaborative CPU-FPGA environment. We investigate the impact of collaboratively applying thread throttling on the CPU side and HLS versioning on the FPGA side. We use a multi-tenant cloud service as our object of study, where sequences of application requests with different priorities result in DAGs of application kernels that must be executed over the heterogeneous architecture. We show that synergistically applying thread throttling and HLS versioning to the incoming kernels may improve the EDP by up to 41x over the non-optimized execution.
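    The thread-throttling idea behind ETCG can be sketched schematically (our illustration with a synthetic cost model, not the ETCG implementation): probe candidate thread counts and keep the one that minimizes EDP = energy x runtime. A real system would read energy from hardware counters and use a smarter search than a linear scan.

```c
/* Schematic EDP-driven thread throttling (our sketch, not ETCG).
 * run_with_threads is a hypothetical probe standing in for real
 * energy/runtime measurements of the workload. */
#include <stdio.h>

struct sample { double energy_j; double runtime_s; };

/* Synthetic model: energy and runtime as functions of thread count,
 * just to make the sketch executable. */
static struct sample run_with_threads(int n) {
    struct sample s = { 100.0 / n + 10.0 * n, 60.0 / n + 1.0 * n };
    return s;
}

int main(void) {
    int best_n = 1;
    double best_edp = 1e300;
    for (int n = 1; n <= 16; n++) {
        struct sample s = run_with_threads(n);
        double edp = s.energy_j * s.runtime_s;  /* EDP = E * T */
        if (edp < best_edp) { best_edp = edp; best_n = n; }
    }
    printf("best thread count: %d (EDP = %.1f J*s)\n", best_n, best_edp);
    return 0;
}
```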