526 research outputs found
A system-on-chip vector multiprocessor for transmission line modelling acceleration
We discuss a configurable, System-on-Chip vector
multiprocessor for accelerating the Transmission Line
Modeling (TLM) algorithm with an architecture capable of
exploiting the two primary forms of parallelism in the code,
thread and data level parallelism. Theoretical results
demonstrate an order of magnitude reduction in the dynamic
instruction count for a scalar-processor/vector-coprocessor
configuration at a vector length of sixteen 32-bit singleprecision
elements. Furthermore, a multi-vector SoC
architecture consisting of ten such vector accelerators provides
a near-linear theoretical performance benefit of the order of
88% in three out of four benchmark configurations which is
orthogonal to the benefit realized by vectorization alone. We
discuss in detail this potent architecture and present
implementation data for the 2-way multi-processor VLSI
macrocell
Design and Performance of Scalable High-Performance Programmable Routers - Doctoral Dissertation, August 2002
The flexibility to adapt to new services and protocols without changes in the underlying hardware is and will increasingly be a key requirement for advanced networks. Introducing a processing component into the data path of routers and implementing packet processing in software provides this ability. In such a programmable router, a powerful processing infrastructure is necessary to achieve to level of performance that is comparable to custom silicon-based routers and to demonstrate the feasibility of this approach. This work aims at the general design of such programmable routers and, specifically, at the design and performance analysis of the processing subsystem. The necessity of programmable routers is motivated, and a router design is proposed. Based on the design, a general performance model is developed and quantitatively evaluated using a new network processor benchmark. Operational challenges, like scheduling of packets to processing engines, are addressed, and novel algorithms are presented. The results of this work give qualitative and quantitative insights into this new domain that combines issues from networking, computer architecture, and system design
Non-power-of-Two FFTs: Exploring the Flexibility of the Montium TP
Coarse-grain reconfigurable architectures, like the Montium TP, have proven to be a very successful approach for low-power and high-performance computation of regular digital signal processing algorithms. This paper presents the implementation of a class of non-power-of-two FFTs to discover the limitations and Flexibility of the Montium TP for less regular algorithms. A non-power-of-two FFT is less regular compared to a traditional power-of-two FFT. The results of the implementation show the processing time, accuracy, energy consumption and Flexibility of the implementation
Wire management for coherence traffic in chip multiprocessors
Journal ArticleImprovements in semiconductor technology have made it possible to include multiple processor cores on a single die. Chip Multi-Processors (CMP) are an attractive choice for future billion transistor architectures due to their low design complexity, high clock frequency, and high throughput. In a typical CMP architecture, the L2 cache is shared by multiple cores and data coherence is maintained among private L1s. Coherence operations entail frequent communication over global on-chip wires. In future technologies, communication between different L1s will have a significant impact on overall processor performance and power consumption. On-chip wires can be designed to have different latency, bandwidth, and energy properties. Likewise, coherence protocol messages have different latency and bandwidth needs. We propose an interconnect comprised of wires with varying latency, bandwidth, and energy characteristics, and advocate intelligently mapping coherence operations to the appropriate wires. In this paper, we present a comprehensive list of techniques that allow coherence protocols to exploit a heterogeneous interconnect and present preliminary data that indicates the potential of these techniques to significantly improve performance and reduce power consumption. We further demonstrate that most of these techniques can be implemented at a minimum complexity overhead
Memory system support for image processing
Journal ArticleProcessor speeds are increasing rapidly, but memory speeds are not keeping pace. Image processing is an important application domain that is particularly impacted by this growing performance gap. Image processing algorithms tend to have poor memory locality because they access their data in a non-sequential fashion and reuse that data infrequently. As a result, they often exhibit poor cache and TLB hit rates on conventional memory systems, which limits overall performance. Most current approaches to addressing the memory bottleneck focus on modifying cache organizations or introducing processor-based prefetching. The Impulse memory system takes a different approach: allowing application software to control how, when, and where data are loaded into a conventional processor cache. Impulse does this by letting software configure how the memory controller interprets the physical addresses exported by a processor. Introducing an extra level of address translation in the memory. Data that is sparse in memory can be accessed densely, which improves both cache and TLB utilization, and Impulse hides memory latency by prefectching data within the memory controller. We describe how Impulse improves the performance of three image processing algorithms: an Impulse memory system yields speedups of 40% to 226% over an otherwise identical machine with a conventional memory system
- âŠ