154 research outputs found

    Design of High Speed Memory-Based FFT Processor Using 90nm Technology

    Get PDF
    In order to enhance performance, the Fast Fourier Transformation is a important operation in Digital Signal Processing (DSP) systems had been extensively studied. State-of-the-art transmission technology uses Orthogonal frequency division multiplexing (OFDM), which primary operation is the Fast fourier transform (FFT). This analysis presents the design of a high-speed memory-based FFT processor using 90nm technology. The novel hybrid multiplier and hybrid adder is used in this analysis. The main objective of this method is to develop an efficient, memory-efficient FFT processor that requires less area.  Using 90nm CMOS (Complementary Metal Oxide Semiconductor) technology, the proposed FFT processor was created and implemented in process. With reduced processing time, this means that the proposed FFT processor performs better than the prior memory-based FFT processors in terms of performance and the number of LUTs required which reduces area and memory utilization

    VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads

    Get PDF
    We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory Thread support. VThreads supports Instruction Level Parallelism via static multiple-issue and Thread Level Parallelism via hardware-assisted POSIX Threads along with extensive customization. It allows the instantiation of tightlycoupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable gate arrays. It was evaluated against a PThreads-based multiprocessor based on the Sparc-V8 ISA. On a 65 nm ASIC implementation VThreads achieves up to x7.2 performance increase on synthetic benchmarks, x5 on a parallel Mandelbrot implementation, 66% better on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores VThreads demonstrates a post-route (statistical) power reduction between 65% to 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility, hardware support for low-latency PThreads, power efficiency and area make the processor an attractive proposition for low-power, deeply-embedded applications requiring minimum OS support

    Constructive Synthesis of Memory-Intensive Accelerators for FPGA From Nested Loop Kernels

    Get PDF

    Cross-layer system reliability assessment framework for hardware faults

    Get PDF
    System reliability estimation during early design phases facilitates informed decisions for the integration of effective protection mechanisms against different classes of hardware faults. When not all system abstraction layers (technology, circuit, microarchitecture, software) are factored in such an estimation model, the delivered reliability reports must be excessively pessimistic and thus lead to unacceptably expensive, over-designed systems. We propose a scalable, cross-layer methodology and supporting suite of tools for accurate but fast estimations of computing systems reliability. The backbone of the methodology is a component-based Bayesian model, which effectively calculates system reliability based on the masking probabilities of individual hardware and software components considering their complex interactions. Our detailed experimental evaluation for different technologies, microarchitectures, and benchmarks demonstrates that the proposed model delivers very accurate reliability estimations (FIT rates) compared to statistically significant but slow fault injection campaigns at the microarchitecture level.Peer ReviewedPostprint (author's final draft

    Application-Specific Memory Subsystems

    Get PDF
    The disparity in performance between processors and main memories has led computer architects to incorporate large cache hierarchies in modern computers. These cache hierarchies are designed to be general-purpose in that they strive to provide the best possible performance across a wide range of applications. However, such a memory subsystem does not necessarily provide the best possible performance for a particular application. Although general-purpose memory subsystems are desirable when the work-load is unknown and the memory subsystem must remain fixed, when this is not the case a custom memory subsystem may be beneficial. For example, in an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) designed to run a particular application, a custom memory subsystem optimized for that application would be desirable. In addition, when there are tunable parameters in the memory subsystem, it may make sense to change these parameters depending on the application being run. Such a situation arises today with FPGAs and, to a lesser extent, GPUs, and it is plausible that general-purpose computers will begin to support greater flexibility in the memory subsystem in the future. In this dissertation, we first show that it is possible to create application-specific memory subsystems that provide much better performance than a general-purpose memory subsystem. In addition, we show a way to discover such memory subsystems automatically using a superoptimization technique on memory address traces gathered from applications. This allows one to generate a custom memory subsystem with little effort. We next show that our memory subsystem superoptimization technique can be used to optimize for objectives other than performance. As an example, we show that it is possible to reduce the number of writes to the main memory, which can be useful for main memories with limited write durability, such as flash or Phase-Change Memory (PCM). Finally, we show how to superoptimize memory subsystems for streaming applications, which are a class of parallel applications. In particular, we show that, through the use of ScalaPipe, we can author and deploy streaming applications targeting FPGAs with superoptimized memory subsystems. ScalaPipe is a domain-specific language (DSL) embedded in the Scala programming language for generating streaming applications that can be implemented on CPUs and FPGAs. Using the ScalaPipe implementation, we are able to demonstrate actual performance improvements using the superoptimized memory subsystem with applications implemented in hardware

    VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads

    Get PDF
    This paper was accepted for publication in the journal Microprocessors and Microsystems and the definitive published version is available at http://dx.doi.org/10.1016/j.micpro.2016.07.010.We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory Thread support. VThreads supports Instruction Level Parallelism via static multiple-issue and Thread Level Parallelism via hardware-assisted POSIX Threads along with extensive customization. It allows the instantiation of tightlycoupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable gate arrays. It was evaluated against a PThreads-based multiprocessor based on the Sparc-V8 ISA. On a 65 nm ASIC implementation VThreads achieves up to x7.2 performance increase on synthetic benchmarks, x5 on a parallel Mandelbrot implementation, 66% better on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores VThreads demonstrates a post-route (statistical) power reduction between 65% to 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility, hardware support for low-latency PThreads, power efficiency and area make the processor an attractive proposition for low-power, deeply-embedded applications requiring minimum OS support

    Datacenter Design for Future Cloud Radio Access Network.

    Full text link
    Cloud radio access network (C-RAN), an emerging cloud service that combines the traditional radio access network (RAN) with cloud computing technology, has been proposed as a solution to handle the growing energy consumption and cost of the traditional RAN. Through aggregating baseband units (BBUs) in a centralized cloud datacenter, C-RAN reduces energy and cost, and improves wireless throughput and quality of service. However, designing a datacenter for C-RAN has not yet been studied. In this dissertation, I investigate how a datacenter for C-RAN BBUs should be built on commodity servers. I first design WiBench, an open-source benchmark suite containing the key signal processing kernels of many mainstream wireless protocols, and study its characteristics. The characterization study shows that there is abundant data level parallelism (DLP) and thread level parallelism (TLP). Based on this result, I then develop high performance software implementations of C-RAN BBU kernels in C++ and CUDA for both CPUs and GPUs. In addition, I generalize the GPU parallelization techniques of the Turbo decoder to the trellis algorithms, an important family of algorithms that are widely used in data compression and channel coding. Then I evaluate the performance of commodity CPU servers and GPU servers. The study shows that the datacenter with GPU servers can meet the LTE standard throughput with 4× to 16× fewer machines than with CPU servers. A further energy and cost analysis show that GPU servers can save on average 13× more energy and 6× more cost. Thus, I propose the C-RAN datacenter be built using GPUs as a server platform. Next I study resource management techniques to handle the temporal and spatial traffic imbalance in a C-RAN datacenter. I propose a “hill-climbing” power management that combines powering-off GPUs and DVFS to match the temporal C-RAN traffic pattern. Under a practical traffic model, this technique saves 40% of the BBU energy in a GPU-based C-RAN datacenter. For spatial traffic imbalance, I propose three workload distribution techniques to improve load balance and throughput. Among all three techniques, pipelining packets has the most throughput improvement at 10% and 16% for balanced and unbalanced loads, respectively.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120825/1/qizheng_1.pd

    Architectural exploration of Si-IF many-die processors

    Get PDF
    Monolithic, single-die processors dominate today’s computing landscape. High performance systems achieve massive throughput by connecting large numbers of discrete chips – CPUs, GPUs, FPGAs – through high latency, low bandwidth interconnects. However, such systems provide limited performance scaling due to high communication costs between the discrete chips. This thesis proposes an alternate path for performance scaling: integrating many dies onto a single chip using a novel assembly technology – Silicon Interconnect Fabric (Si-IF). Many-die processors have both a technical and an economic advantage over their monolithic counterparts. We demonstrate potential benefits of a many-die approach using two approaches: efficient workload coverage design space exploration using many dies and evaluating a many-die wafer-scale GPU design.Ope
    corecore