910 research outputs found
Automatic design of domain-specific instructions for low-power processors
This paper explores hardware specialization of low power processors to improve performance and energy efficiency. Our main contribution is an automated framework that analyzes instruction sequences of applications within a domain at the loop body level and identifies exactly and partially-matching sequences across applications that can become custom instructions. Our framework transforms sequences to a new code abstraction, a Merging Diagram, that improves similarity identification, clusters alike groups of potential custom instructions to effectively reduce the search space, and selects merged custom instructions to efficiently exploit the available customizable area. For a set of 11 media applications, our fast framework generates instructions that significantly improve the energy-delay product and speed up, achieving more than double the savings as compared to a technique analyzing sequences within basic blocks. This paper shows that partially-matched custom instructions, which do not significantly increase design time, are crucial to achieving higher energy efficiency at limited hardware areas
A system-on-chip vector multiprocessor for transmission line modelling acceleration
We discuss a configurable, System-on-Chip vector
multiprocessor for accelerating the Transmission Line
Modeling (TLM) algorithm with an architecture capable of
exploiting the two primary forms of parallelism in the code,
thread and data level parallelism. Theoretical results
demonstrate an order of magnitude reduction in the dynamic
instruction count for a scalar-processor/vector-coprocessor
configuration at a vector length of sixteen 32-bit singleprecision
elements. Furthermore, a multi-vector SoC
architecture consisting of ten such vector accelerators provides
a near-linear theoretical performance benefit of the order of
88% in three out of four benchmark configurations which is
orthogonal to the benefit realized by vectorization alone. We
discuss in detail this potent architecture and present
implementation data for the 2-way multi-processor VLSI
macrocell
Automatic generation and testing of application specific hardware accelerators on a new reconfigurable OpenSPARC platform
Specific hardware customization for scientific applications has shown a big potential to address the current holy grail in computer architecture: reducing power consumption while increasing performance. In particular, the automatic generation of domain-specific accelerators for General-
Purpose Processors (GPPs) is an active field of research to the point that different leading hardware design companies (e.g. Intel, ARM) are announcing commercial platforms that integrate GPPs and FPGAs. In this paper we present a new framework with a holistic approach that addresses the challenge of design exploration of specific application accelerators. Our work focuses on a target platform consisting of a GPP with a reconfigurable functional unit. The framework includes a reconfigurable 1-core 1-thread OpenSPARC with a new programmable specific purpose unit (SPU) inside the OpenSPARC core. In order to program the SPU we have developed an automatic toolchain that profiles an application and discovers its main computing bottlenecks. With that information our toolchain is able to both
design hardware specific accelerators that can be automatically mapped in the aforementioned SPU, and generate the binary code necessary to run the application using those accelerators. The OpenSPARC with the new specific application accelerators, defined in a Hardware Description Language,
can then be executed and measured. Still awaiting further development, nowadays our framework is a proof-of-concept that shows that this kind of systems can be developed and programmed as easily as a GPP. In a
near future it would be the source of very interesting information about the capabilities and drawbacks of those mixed GPP-FPGA systems.Postprint (published version
VThreads: A novel VLIW chip multiprocessor with hardware-assisted PThreads
We discuss VThreads, a novel VLIW CMP with hardware-assisted shared-memory Thread support. VThreads supports Instruction Level Parallelism via static multiple-issue and Thread Level Parallelism via hardware-assisted POSIX Threads along with extensive customization. It allows the instantiation of tightlycoupled streaming accelerators and supports up to 7-address Multiple-Input, Multiple-Output instruction extensions. VThreads is designed in technology-independent Register-Transfer-Level VHDL and prototyped on 40 nm and 28 nm Field-Programmable gate arrays. It was evaluated against a PThreads-based multiprocessor
based on the Sparc-V8 ISA. On a 65 nm ASIC implementation VThreads achieves up to x7.2
performance increase on synthetic benchmarks, x5 on a parallel Mandelbrot implementation, 66% better on a threaded JPEG implementation, 79% better on an edge-detection benchmark and ~13% improvement on DES compared to the Leon3MP CMP. In the range of 2 to 8 cores VThreads demonstrates a post-route (statistical) power reduction between 65% to 57% at an area increase of 1.2%-10% for 1-8 cores, compared to a similarly-configured Leon3MP CMP. This combination of micro-architectural features, scalability, extensibility,
hardware support for low-latency PThreads, power efficiency and area make the processor an attractive proposition for low-power, deeply-embedded applications requiring minimum OS support
A configurable vector processor for accelerating speech coding algorithms
The growing demand for voice-over-packer (VoIP) services and multimedia-rich
applications has made increasingly important the efficient, real-time implementation of
low-bit rates speech coders on embedded VLSI platforms. Such speech coders are
designed to substantially reduce the bandwidth requirements thus enabling dense multichannel
gateways in small form factor. This however comes at a high computational cost
which mandates the use of very high performance embedded processors.
This thesis investigates the potential acceleration of two major ITU-T speech coding
algorithms, namely G.729A and G.723.1, through their efficient implementation on a
configurable extensible vector embedded CPU architecture. New scalar and vector ISAs
were introduced which resulted in up to 80% reduction in the dynamic instruction count
of both workloads. These instructions were subsequently encapsulated into a parametric,
hybrid SISD (scalar processor)–SIMD (vector) processor. This work presents the research
and implementation of the vector datapath of this vector coprocessor which is tightly-coupled
to a Sparc-V8 compliant CPU, the optimization and simulation methodologies
employed and the use of Electronic System Level (ESL) techniques to rapidly design
SIMD datapaths
Parallelism and the software-hardware interface in embedded systems
This thesis by publications addresses issues in the architecture and microarchitecture of next generation, high performance streaming Systems-on-Chip through quantifying the most important forms of parallelism in current and emerging embedded system workloads. The work consists of three major research tracks, relating to data level parallelism, thread level parallelism and the software-hardware interface which together reflect the research interests of the author as they have been formed in the last nine years. Published works confirm that parallelism at the data level is widely accepted as the most important performance leverage for the efficient execution of embedded media and telecom applications and has been exploited via a number of approaches the most efficient being vectorlSIMD architectures. A further, complementary and substantial form of parallelism exists at the thread level but this has not been researched to the same extent in the context of embedded workloads. For the efficient execution of such applications, exploitation of both forms of parallelism is of paramount importance. This calls for a new architectural approach in the software-hardware interface as its rigidity, manifested in all desktop-based and the majority of embedded CPU's, directly affects the performance ofvectorized, threaded codes. The author advocates a holistic, mature approach where parallelism is extracted via automatic means while at the same time, the traditionally rigid hardware-software interface is optimized to match the temporal and spatial behaviour of the embedded workload. This ultimate goal calls for the precise study of these forms of parallelism for a number of applications executing on theoretical models such as instruction set simulators and parallel RAM machines as well as the development of highly parametric microarchitectural frameworks to encapSUlate that functionality.EThOS - Electronic Theses Online ServiceGBUnited Kingdo
Building Programmable Wireless Networks: An Architectural Survey
In recent times, there have been a lot of efforts for improving the ossified
Internet architecture in a bid to sustain unstinted growth and innovation. A
major reason for the perceived architectural ossification is the lack of
ability to program the network as a system. This situation has resulted partly
from historical decisions in the original Internet design which emphasized
decentralized network operations through co-located data and control planes on
each network device. The situation for wireless networks is no different
resulting in a lot of complexity and a plethora of largely incompatible
wireless technologies. The emergence of "programmable wireless networks", that
allow greater flexibility, ease of management and configurability, is a step in
the right direction to overcome the aforementioned shortcomings of the wireless
networks. In this paper, we provide a broad overview of the architectures
proposed in literature for building programmable wireless networks focusing
primarily on three popular techniques, i.e., software defined networks,
cognitive radio networks, and virtualized networks. This survey is a
self-contained tutorial on these techniques and its applications. We also
discuss the opportunities and challenges in building next-generation
programmable wireless networks and identify open research issues and future
research directions.Comment: 19 page
- …