Search CORE

154 research outputs found

Design of High Speed Memory-Based FFT Processor Using 90nm Technology

Author: T. Prasada Babu
Publication venue: Auricle Global Society of Education and Research
Publication date: 04/05/2024
Field of study

In order to enhance performance, the Fast Fourier Transformation is a important operation in Digital Signal Processing (DSP) systems had been extensively studied. State-of-the-art transmission technology uses Orthogonal frequency division multiplexing (OFDM), which primary operation is the Fast fourier transform (FFT). This analysis presents the design of a high-speed memory-based FFT processor using 90nm technology. The novel hybrid multiplier and hybrid adder is used in this analysis. The main objective of this method is to develop an efficient, memory-efficient FFT processor that requires less area.  Using 90nm CMOS (Complementary Metal Oxide Semiconductor) technology, the proposed FFT processor was created and implemented in process. With reduced processing time, this means that the proposed FFT processor performs better than the prior memory-based FFT processors in terms of performance and the number of LUTs required which reduces area and memory utilization

International Journal on Recent and Innovation Trends in Computing and Communication

VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads

Author: David Stevens (4254274)
Vassilios Chouliaras (1251600)
Vincent Dwyer (1251447)
Publication venue
Publication date: 01/01/2016
Field of study

We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory Thread support. VThreads supports Instruction Level Parallelism via static multiple-issue and Thread Level Parallelism via hardware-assisted POSIX Threads along with extensive customization. It allows the instantiation of tightlycoupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable gate arrays. It was evaluated against a PThreads-based multiprocessor based on the Sparc-V8 ISA. On a 65 nm ASIC implementation VThreads achieves up to x7.2 performance increase on synthetic benchmarks, x5 on a parallel Mandelbrot implementation, 66% better on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores VThreads demonstrates a post-route (statistical) power reduction between 65% to 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility, hardware support for low-latency PThreads, power efficiency and area make the processor an attractive proposition for low-power, deeply-embedded applications requiring minimum OS support

Loughborough University Institutional Repository

Constructive Synthesis of Memory-Intensive Accelerators for FPGA From Nested Loop Kernels

Author: McAllister John
Milford Matthew
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 15/08/2016
Field of study

Queen's University Belfast Research Portal

Crossref

Cross-layer system reliability assessment framework for hardware faults

Author: Bosio Alberto
Canal Corretger Ramon
Chatzidimitriou Athanansios
Di Carlo Stefano
Di Natale Giorgio
Gizipoulos Dimitris
González Colás Antonio María
Kaliorakis Manolis
Kooli Maha
Politano Gianfranco
Riera Villanueva Marc
Savino Alessandro
Tselonis Sotiris
Vallero Alessandro
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

System reliability estimation during early design phases facilitates informed decisions for the integration of effective protection mechanisms against different classes of hardware faults. When not all system abstraction layers (technology, circuit, microarchitecture, software) are factored in such an estimation model, the delivered reliability reports must be excessively pessimistic and thus lead to unacceptably expensive, over-designed systems. We propose a scalable, cross-layer methodology and supporting suite of tools for accurate but fast estimations of computing systems reliability. The backbone of the methodology is a component-based Bayesian model, which effectively calculates system reliability based on the masking probabilities of individual hardware and software components considering their complex interactions. Our detailed experimental evaluation for different technologies, microarchitectures, and benchmarks demonstrates that the proposed model delivers very accurate reliability estimations (FIT rates) compared to statistically significant but slow fault injection campaigns at the microarchitecture level.Peer ReviewedPostprint (author's final draft

Crossref

UPCommons. Portal del coneixement obert de la UPC

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

Application-Specific Memory Subsystems

Author: Wingbermuehle Joseph George
Publication venue: Washington University Open Scholarship
Publication date: 15/05/2015
Field of study

The disparity in performance between processors and main memories has led computer architects to incorporate large cache hierarchies in modern computers. These cache hierarchies are designed to be general-purpose in that they strive to provide the best possible performance across a wide range of applications. However, such a memory subsystem does not necessarily provide the best possible performance for a particular application. Although general-purpose memory subsystems are desirable when the work-load is unknown and the memory subsystem must remain fixed, when this is not the case a custom memory subsystem may be beneficial. For example, in an application-specific integrated circuit (ASIC) or a field-programmable gate array (FPGA) designed to run a particular application, a custom memory subsystem optimized for that application would be desirable. In addition, when there are tunable parameters in the memory subsystem, it may make sense to change these parameters depending on the application being run. Such a situation arises today with FPGAs and, to a lesser extent, GPUs, and it is plausible that general-purpose computers will begin to support greater flexibility in the memory subsystem in the future. In this dissertation, we first show that it is possible to create application-specific memory subsystems that provide much better performance than a general-purpose memory subsystem. In addition, we show a way to discover such memory subsystems automatically using a superoptimization technique on memory address traces gathered from applications. This allows one to generate a custom memory subsystem with little effort. We next show that our memory subsystem superoptimization technique can be used to optimize for objectives other than performance. As an example, we show that it is possible to reduce the number of writes to the main memory, which can be useful for main memories with limited write durability, such as flash or Phase-Change Memory (PCM). Finally, we show how to superoptimize memory subsystems for streaming applications, which are a class of parallel applications. In particular, we show that, through the use of ScalaPipe, we can author and deploy streaming applications targeting FPGAs with superoptimized memory subsystems. ScalaPipe is a domain-specific language (DSL) embedded in the Scala programming language for generating streaming applications that can be implemented on CPUs and FPGAs. Using the ScalaPipe implementation, we are able to demonstrate actual performance improvements using the superoptimized memory subsystem with applications implemented in hardware

Washington University St. Louis: Open Scholarship

VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads

Author: Agron
Andrews
Arvind
Brodersen
Chouliaras
Chouliaras
Chouliaras
Chouliaras
Colwell
Cong
D. Stevens
de Dinechin
De Micheli
Faraboschi
Gupta
Hubener
Kathail
Lin
Lin
Lübbers
Mandelbrot
Milward
Muck
Oliveira
Owaida
Papakonstantinou
Robert Thomson
Rooholamin
Schlansker
Stevens
Stevens
Thomson
Tullsen
V.A. Chouliaras
V.M. Dwyer
Villarreal
Watson
Windh
Ziavras
Publication venue: 'Elsevier BV'
Publication date: 01/01/2016
Field of study

This paper was accepted for publication in the journal Microprocessors and Microsystems and the definitive published version is available at http://dx.doi.org/10.1016/j.micpro.2016.07.010.We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory Thread support. VThreads supports Instruction Level Parallelism via static multiple-issue and Thread Level Parallelism via hardware-assisted POSIX Threads along with extensive customization. It allows the instantiation of tightlycoupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable gate arrays. It was evaluated against a PThreads-based multiprocessor based on the Sparc-V8 ISA. On a 65 nm ASIC implementation VThreads achieves up to x7.2 performance increase on synthetic benchmarks, x5 on a parallel Mandelbrot implementation, 66% better on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores VThreads demonstrates a post-route (statistical) power reduction between 65% to 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility, hardware support for low-latency PThreads, power efficiency and area make the processor an attractive proposition for low-power, deeply-embedded applications requiring minimum OS support

Crossref

Loughborough University Institutional Repository

Recommended from our members

Active timing margin management to improve microprocessor power efficiency

Author: Zu Yazhou
Publication venue
Publication date: 11/04/2019
Field of study

Improving power/performance efficiency is critical for today’s micro- processors. From edge devices to datacenters, lower power or higher performance always produces better systems, measured by lower cost of ownership or longer battery time. This thesis studies improving microprocessor power/performance efficiency by optimizing the pipeline timing margin. In particular, this thesis focuses on improving the efficacy of Active Timing Margin, a young technology that dynamically adjusts the margin. Active timing margin trims down the pipeline timing margin with a control loop that adjusts voltage and frequency based on real-time chip environment monitoring. The key insight of this thesis is that in order to maximize active timing margin’s efficiency enhancement benefits, synergistic management from processor architecture design and system software scheduling are needed. To that end, this thesis covers the major consumers of pipeline timing margin, including temperature, voltage, and process variation. For temperature variation, the thesis proposes a table-lookup based active timing margin mechanism, and an associated temperature management scheme to minimize power consumption. For voltage variation, the thesis characterizes the limiting factors of adaptive clocking’s power saving and proposes application scheduling to maximize total system power reduction. For process variation, the thesis proposes core-level adaptive clocking reconfiguration to automatically expose inter-core variation and discusses workload scheduling and throttling management to control critical application performance. The author believes the optimization presented in this thesis can potentially benefit a variety of processor architectures as the conclusions are based on the solid measurement on state-of-the-art processors, and the research objective, active timing margin, already has wide applicability in the latest microprocessors by the time this thesis is written.Electrical and Computer Engineerin

Texas ScholarWorks

コウセイカヘンプロセッサノタメノコードサイテキカシュホウ

Author: Hieda Takuji
ヒエダタクジ
稗田拓路
Publication venue
Publication date
Field of study

Osaka University Knowledge Archive

Datacenter Design for Future Cloud Radio Access Network.

Author: Zheng Qi
Publication venue
Publication date: 01/01/2015
Field of study

Cloud radio access network (C-RAN), an emerging cloud service that combines the traditional radio access network (RAN) with cloud computing technology, has been proposed as a solution to handle the growing energy consumption and cost of the traditional RAN. Through aggregating baseband units (BBUs) in a centralized cloud datacenter, C-RAN reduces energy and cost, and improves wireless throughput and quality of service. However, designing a datacenter for C-RAN has not yet been studied. In this dissertation, I investigate how a datacenter for C-RAN BBUs should be built on commodity servers. I first design WiBench, an open-source benchmark suite containing the key signal processing kernels of many mainstream wireless protocols, and study its characteristics. The characterization study shows that there is abundant data level parallelism (DLP) and thread level parallelism (TLP). Based on this result, I then develop high performance software implementations of C-RAN BBU kernels in C++ and CUDA for both CPUs and GPUs. In addition, I generalize the GPU parallelization techniques of the Turbo decoder to the trellis algorithms, an important family of algorithms that are widely used in data compression and channel coding. Then I evaluate the performance of commodity CPU servers and GPU servers. The study shows that the datacenter with GPU servers can meet the LTE standard throughput with 4× to 16× fewer machines than with CPU servers. A further energy and cost analysis show that GPU servers can save on average 13× more energy and 6× more cost. Thus, I propose the C-RAN datacenter be built using GPUs as a server platform. Next I study resource management techniques to handle the temporal and spatial traffic imbalance in a C-RAN datacenter. I propose a “hill-climbing” power management that combines powering-off GPUs and DVFS to match the temporal C-RAN traffic pattern. Under a practical traffic model, this technique saves 40% of the BBU energy in a GPU-based C-RAN datacenter. For spatial traffic imbalance, I propose three workload distribution techniques to improve load balance and throughput. Among all three techniques, pipelining packets has the most throughput improvement at 10% and 16% for balanced and unbalanced loads, respectively.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120825/1/qizheng_1.pd

Deep Blue Documents at the University of Michigan

Architectural exploration of Si-IF many-die processors

Author: Petrisko Daniel
Publication venue
Publication date: 01/05/2018
Field of study

Monolithic, single-die processors dominate today’s computing landscape. High performance systems achieve massive throughput by connecting large numbers of discrete chips – CPUs, GPUs, FPGAs – through high latency, low bandwidth interconnects. However, such systems provide limited performance scaling due to high communication costs between the discrete chips. This thesis proposes an alternate path for performance scaling: integrating many dies onto a single chip using a novel assembly technology – Silicon Interconnect Fabric (Si-IF). Many-die processors have both a technical and an economic advantage over their monolithic counterparts. We demonstrate potential benefits of a many-die approach using two approaches: efficient workload coverage design space exploration using many dies and evaluating a many-die wafer-scale GPU design.Ope

Illinois Digital Environment for Access to Learning and Scholarship Repository