Search CORE

526 research outputs found

A system-on-chip vector multiprocessor for transmission line modelling acceleration

Author: James Flint (1251738)
Jose L. Nunez-Yanez (7202684)
Vassilios Chouliaras (1251600)
Yibin Li (1526803)
Publication venue
Publication date: 01/01/2005
Field of study

We discuss a configurable, System-on-Chip vector multiprocessor for accelerating the Transmission Line Modeling (TLM) algorithm with an architecture capable of exploiting the two primary forms of parallelism in the code, thread and data level parallelism. Theoretical results demonstrate an order of magnitude reduction in the dynamic instruction count for a scalar-processor/vector-coprocessor configuration at a vector length of sixteen 32-bit singleprecision elements. Furthermore, a multi-vector SoC architecture consisting of ten such vector accelerators provides a near-linear theoretical performance benefit of the order of 88% in three out of four benchmark configurations which is orthogonal to the benefit realized by vectorization alone. We discuss in detail this potent architecture and present implementation data for the 2-way multi-processor VLSI macrocell

Loughborough University Institutional Repository

Asynchrony in parallel computing: from dataflow to multithreading

Author: Robic B.
Silc J.
Ungerer Theo
Publication venue
Publication date: 02/08/2007
Field of study

KITopen

Design and Performance of Scalable High-Performance Programmable Routers - Doctoral Dissertation, August 2002

Author: Wolf Tilman
Publication venue: Washington University Open Scholarship
Publication date: 03/05/2002
Field of study

The flexibility to adapt to new services and protocols without changes in the underlying hardware is and will increasingly be a key requirement for advanced networks. Introducing a processing component into the data path of routers and implementing packet processing in software provides this ability. In such a programmable router, a powerful processing infrastructure is necessary to achieve to level of performance that is comparable to custom silicon-based routers and to demonstrate the feasibility of this approach. This work aims at the general design of such programmable routers and, specifically, at the design and performance analysis of the processing subsystem. The necessity of programmable routers is motivated, and a router design is proposed. Based on the design, a general performance model is developed and quantitatively evaluated using a new network processor benchmark. Operational challenges, like scheduling of packets to processing engines, are addressed, and novel algorithms are presented. The results of this work give qualitative and quantitative insights into this new domain that combines issues from networking, computer architecture, and system design

Washington University St. Louis: Open Scholarship

Ravel-XL: a hardware accelerator for assigned-delay compiled-code logic gate simulation

Author: Brown R. B.
Marques-Silva J. P.
Riepe M. A.
Sakallah K. A.
Publication venue
Publication date: 01/03/1996
Field of study

Southampton (e-Prints Soton)

Context flow architecture

Author: Lees Timothy
Publication venue: The University of Edinburgh
Publication date: 01/01/1990
Field of study

Edinburgh Research Archive

Non-power-of-Two FFTs: Exploring the Flexibility of the Montium TP

Author: Hauck S.A.
Smit Gerardus Johannes Maria
van de Burgwal M.D.
Wolkotte P.T.
Publication venue: Hindawi Publishing Corporation
Publication date: 01/01/2009
Field of study

Coarse-grain reconfigurable architectures, like the Montium TP, have proven to be a very successful approach for low-power and high-performance computation of regular digital signal processing algorithms. This paper presents the implementation of a class of non-power-of-two FFTs to discover the limitations and Flexibility of the Montium TP for less regular algorithms. A non-power-of-two FFT is less regular compared to a traditional power-of-two FFT. The results of the implementation show the processing time, accuracy, energy consumption and Flexibility of the implementation

CiteSeerX

Crossref

Directory of Open Access Journals

University of Twente Research Information

Wire management for coherence traffic in chip multiprocessors

Author: Balasubramonian Rajeev
Cheng Liqun
Publication venue: Workshop on Complexity-Effective Design
Publication date: 01/01/2005
Field of study

Journal ArticleImprovements in semiconductor technology have made it possible to include multiple processor cores on a single die. Chip Multi-Processors (CMP) are an attractive choice for future billion transistor architectures due to their low design complexity, high clock frequency, and high throughput. In a typical CMP architecture, the L2 cache is shared by multiple cores and data coherence is maintained among private L1s. Coherence operations entail frequent communication over global on-chip wires. In future technologies, communication between different L1s will have a significant impact on overall processor performance and power consumption. On-chip wires can be designed to have different latency, bandwidth, and energy properties. Likewise, coherence protocol messages have different latency and bandwidth needs. We propose an interconnect comprised of wires with varying latency, bandwidth, and energy characteristics, and advocate intelligently mapping coherence operations to the appropriate wires. In this paper, we present a comprehensive list of techniques that allow coherence protocols to exploit a heterogeneous interconnect and present preliminary data that indicates the potential of these techniques to significantly improve performance and reduce power consumption. We further demonstrate that most of these techniques can be implemented at a minimum complexity overhead

The University of Utah: J. Willard Marriott Digital Library

Low power digital signal processing

Author: Paker Ozgun
Publication venue: Technical University of Denmark
Publication date: 01/01/2003
Field of study

Online Research Database In Technology

Memory system support for image processing

Author: Hsieh Wilson C.
Zhang Lixin
Publication venue: University of Utah
Publication date: 01/01/1999
Field of study

Journal ArticleProcessor speeds are increasing rapidly, but memory speeds are not keeping pace. Image processing is an important application domain that is particularly impacted by this growing performance gap. Image processing algorithms tend to have poor memory locality because they access their data in a non-sequential fashion and reuse that data infrequently. As a result, they often exhibit poor cache and TLB hit rates on conventional memory systems, which limits overall performance. Most current approaches to addressing the memory bottleneck focus on modifying cache organizations or introducing processor-based prefetching. The Impulse memory system takes a different approach: allowing application software to control how, when, and where data are loaded into a conventional processor cache. Impulse does this by letting software configure how the memory controller interprets the physical addresses exported by a processor. Introducing an extra level of address translation in the memory. Data that is sparse in memory can be accessed densely, which improves both cache and TLB utilization, and Impulse hides memory latency by prefectching data within the memory controller. We describe how Impulse improves the performance of three image processing algorithms: an Impulse memory system yields speedups of 40% to 226% over an otherwise identical machine with a conventional memory system

The University of Utah: J. Willard Marriott Digital Library