18 research outputs found
Application specific instruction set processor design for embedded application using the coware tool
An Application Specific Instruction Set Processor (ASIP) is widely used as a System on a Chip (SoC) component. ASIPs possess an instruction set that is tailored to benefit a specific application. Such specialization allows ASIPs to serve as an intermediate between two dominant processor design styles: ASICs, which have high processing abilities at the cost of limited programmability, and programmable solutions such as FPGAs, which provide programming flexibility at the cost of lower energy efficiency. In this dissertation the goal is to design an ASIP with a temperature sensor system in mind. The platform used for processor design is the LISA 2.0 description language and the processor design environment from CoWare. CoWare Processor Designer allows the processor architecture to be defined at an abstract level and automatically generates a chain of software tools, such as an assembler, linker and simulator, for functional verification, followed by an RTL-level description. The RTL-level description is used to generate a synthesis report of the design using RTL Compiler, and finally the layout is created using Cadence Encounter.
The implementation of an LDPC decoder in a Network on Chip environment
The proposed project originates from a cooperation initiative named NEWCOM++ among
research groups to develop a 3G wireless mobile system. This work, in particular, focuses on
the communication errors arising on a message signal transmitted under the WiMAX
802.16e standard. It will be shown how this latest-generation wireless protocol needs specific,
flexible instrumentation and why an LDPC error correction code is suitable for meeting the
quality restrictions. A chapter is dedicated to describing, without a mathematical treatment,
the theory of the LDPC algorithm and how it can be represented graphically to better organize the
decoding process.
The main objective of this work is to validate the PHAL concept when addressing a
complex and computationally intensive design like the LDPC encoder/decoder. The expected results
are twofold: first, conceptual, identifying the shortcomings of the PHAL concept when addressing a real
problem; and second, determining the overhead introduced by PHAL in the implementation of an
LDPC decoder.
The mission is to build a NoC (Network on Chip) able to perform the same task as a general
purpose processor, but in less time and with better efficiency in terms of component flexibility and
throughput. The single element of the network is a basic processor element (PE) formed by the
union of two separate components: a special-purpose processor (ASIP), responsible for LDPC decoding
of the input data, and the router component PHAL, which checks incoming data packets and
sets the timing of task execution.
With the support of a dedicated programming tool, the ASIP has been designed in full, from the
architecture resources to the instruction set, in a C-like language. Written in SystemC
and converted to VHDL, it has been synthesized to fit onto an FPGA of the Xilinx
Virtex-5 family. Although the main purpose is to make the application as flexible as
possible, a WiMAX-oriented LDPC implemented on an FPGA saves space and resources when
the device that best suits the project synthesis is chosen. This matters because encoders and decoders must fit
into communication equipment (e.g. modems) as compactly as possible.
The whole network scenario has been set up through a Linux application acting as the
master element. The entire environment requires the use of VPI libraries and components able to
manage the communication protocols and interfacing mechanisms.
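The graphical representation of an LDPC code mentioned in the abstract above is commonly a Tanner graph, in which each parity check is connected to the bits it covers. The following is a minimal sketch under stated assumptions: the matrix `H` is a hypothetical toy example (not a WiMAX 802.16e matrix), and the hard-decision bit-flipping routine is a simple stand-in chosen for illustration, not the decoder implemented in the thesis. All function names are illustrative.

```python
# Toy parity-check matrix H (hypothetical; NOT a WiMAX 802.16e matrix).
# Rows are check nodes, columns are variable (bit) nodes of the Tanner graph.
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0, 1],
]


def tanner_edges(h):
    """Return the Tanner graph as a list of (check, variable) edges."""
    return [(c, v) for c, row in enumerate(h) for v, bit in enumerate(row) if bit]


def syndrome(h, word):
    """Evaluate every parity check; an all-zero syndrome means a valid codeword."""
    return [sum(b & w for b, w in zip(row, word)) % 2 for row in h]


def bit_flip_decode(h, word, max_iters=10):
    """Hard-decision bit flipping: flip the bit involved in the most failed checks.

    Ties are broken by lowest bit index, so this toy decoder is not
    guaranteed to correct every error pattern.
    """
    word = list(word)
    for _ in range(max_iters):
        s = syndrome(h, word)
        if not any(s):
            break  # all parity checks satisfied
        # For each bit, count the unsatisfied checks it participates in.
        votes = [sum(s[c] for c in range(len(h)) if h[c][v])
                 for v in range(len(word))]
        word[votes.index(max(votes))] ^= 1
    return word
```

For instance, `[1, 0, 0, 1, 0, 1]` satisfies all three checks of this toy `H`; if its first bit is flipped, `bit_flip_decode` restores it because that bit is the only one taking part in both failed checks.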
High-Level Design Space and Flexibility Exploration for Adaptive, Energy-Efficient WCDMA Channel Estimation Architectures
Due to the fast changing wireless communication standards coupled with strict performance constraints, the demand for flexible yet high-performance architectures is increasing. To tackle the flexibility requirement, software-defined radio (SDR) is emerging as an obvious solution, where the underlying hardware implementation is tuned via software layers to the varied standards depending on power-performance and quality requirements leading to adaptable, cognitive radio. In this paper, we conduct a case study for representatives of two complexity classes of WCDMA channel estimation algorithms and explore the effect of flexibility on energy efficiency using different implementation options. Furthermore, we propose new design guidelines for both highly specialized architectures and highly flexible architectures using high-level synthesis, to enable the required performance and flexibility to support multiple applications. Our experiments with various design points show that the resulting architectures meet the performance constraints of WCDMA and a wide range of options are offered for tuning such architectures depending on power/performance/area constraints of SDR
An automated OpenCL FPGA compilation framework targeting a configurable, VLIW chip multiprocessor
Modern system-on-chips augment their baseline CPU with coprocessors and accelerators to increase overall computational capacity and power efficiency, and thus have evolved into heterogeneous systems. Several languages have been developed to enable this paradigm shift, including CUDA and OpenCL. This thesis discusses a unified compilation environment to enable heterogeneous system design through the use of OpenCL and a customised VLIW chip multiprocessor (CMP) architecture, known as the LE1. An LLVM compilation framework was researched and a prototype developed to enable the execution of OpenCL applications on the LE1 CPU. The framework fully automates the compilation flow and supports work-item coalescing to better utilise the CPU cores and alleviate the effects of thread divergence. This thesis discusses in detail both the software stack and target hardware architecture and evaluates the scalability of the proposed framework on a highly precise cycle-accurate simulator. This is achieved through the execution of 12 benchmarks across 240 different machine configurations, as well as further results utilising an incomplete development branch of the compiler. It is shown that the problems generally scale well with the LE1 architecture up to eight cores, at which point the memory system becomes a serious bottleneck. Results demonstrate superlinear performance on certain benchmarks (x9 for the bitonic sort benchmark with 8 dual-issue cores) with further improvements from compiler optimisations (x14 for bitonic with the same configuration).
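Work-item coalescing, as named in the abstract above, generally means that the body of a kernel written for a single work-item is wrapped in a loop so that one core executes an entire work-group sequentially. The sketch below illustrates only that general idea in Python; the function names and the vector-add kernel are hypothetical, it ignores barriers (which require splitting the loop), and it is not the LE1 compiler's actual transformation.

```python
def vector_add_kernel(global_id, a, b, out):
    """Body of a kernel as written for ONE work-item (hypothetical example)."""
    out[global_id] = a[global_id] + b[global_id]


def run_work_group_coalesced(kernel, group_id, local_size, *args):
    """Run a whole work-group on one core by looping over its work-items."""
    for local_id in range(local_size):
        kernel(group_id * local_size + local_id, *args)


def launch(kernel, global_size, local_size, *args):
    """Dispatch work-groups one after another.

    On a multi-core target each work-group could be handed to a different
    core; here they simply run in sequence.
    """
    if global_size % local_size != 0:
        raise ValueError("global_size must be a multiple of local_size")
    for group_id in range(global_size // local_size):
        run_work_group_coalesced(kernel, group_id, local_size, *args)
```

Calling `launch(vector_add_kernel, 4, 2, a, b, out)` with four-element lists runs two work-groups of two work-items each and fills `out` with the element-wise sums.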
Design methodologies for instruction-set extensible processors
Ph.D (Doctor of Philosophy)
Extensible microprocessor without interlocked pipeline stages (emips), the reconfigurable microprocessor
In this thesis we propose to realize the performance benefits of application-specific hardware optimizations in a general-purpose, multi-user system environment
using a dynamically extensible microprocessor architecture. We have called our
dynamically extensible microprocessor design the Extensible Microprocessor without
Interlocked Pipeline Stages, or eMIPS.
The eMIPS architecture uses the interaction of fixed and configurable logic
available in modern Field Programmable Gate Arrays (FPGAs). This interaction is used to
address the limitations of current microprocessor architectures based solely on
Application Specific Integrated Circuits (ASIC). These limitations include inflexibility,
size, and application specific performance optimization. The eMIPS system allows
multiple secure extensions to load dynamically and to plug into the stages of a pipelined
central processing unit (CPU) data path, thereby extending the core instruction set of the
microprocessor. Extensions can also be used to realize on-chip peripherals, and if area
permits, even multiple cores. Extension instructions dramatically reduce the execution
time of frequently executed instruction patterns. The new functionality we have developed can be exploited by patching the binaries of existing applications, without any
changes to the compilers.
An FPGA-based workstation prototype and a flexible simulation system
implementing this design demonstrate speedups of 2x-3x on a set of applications that
include video games, real-time programs and the SPEC2000 integer benchmarks. eMIPS
is the first realized workstation based entirely on a dynamically extensible
microprocessor that is safe for general purpose, multi-user applications. By exposing the
individual stages of the data path, eMIPS allows optimizations not previously possible.
This includes permitting safe and coherent accesses to memory from within an extension,
optimizing multi-branched blocks, and throwing precise and restartable exceptions from
within an extension.
This work describes a simplified implementation of an extensible microprocessor
architecture based on the Microprocessor without Interlocked Pipeline Stages (MIPS)
Reduced Instruction Set Computer (RISC) architecture. The concepts and methods
contained within this thesis may be applied to other similar architectures. Given this
simplified prototype, we look ahead and propose how this architecture will be expanded
as it matures.
Run-time management for future MPSoC platforms
In recent years, we have been witnessing the dawning of the Multi-Processor System-on-Chip (MPSoC) era. In essence, this era is triggered by the need to handle more complex applications while reducing the overall cost of embedded (handheld) devices. This cost will mainly be determined by the cost of the hardware platform and the cost of designing applications for that platform. The cost of a hardware platform will partly depend on its production volume. In turn, this means that flexible, (easily) programmable multi-purpose platforms will exhibit a lower cost. A multi-purpose platform not only requires flexibility, but should also combine high performance with low power consumption. To this end, MPSoC devices integrate computer architectural properties of various computing domains. Just like large-scale parallel and distributed systems, they contain multiple heterogeneous processing elements interconnected by a scalable, network-like structure. This helps in achieving scalable high performance. As in most mobile or portable embedded systems, there is a need for low-power operation and real-time behavior.
The cost of designing applications is equally important. Indeed, the actual value of future MPSoC devices is not contained within the embedded multiprocessor IC, but in their capability to provide the user of the device with a number of services or experiences. So from an application viewpoint, MPSoCs are designed to efficiently process multimedia content in applications like video players, video conferencing, 3D gaming, augmented reality, etc. Such applications typically require a lot of processing power and a significant amount of memory. To keep up with ever-evolving user needs and with new application standards appearing at a fast pace, MPSoC platforms need to be easily programmable. Application scalability, i.e. the ability to use just enough platform resources according to the user requirements and with respect to the device capabilities, is also an important factor. Hence scalability, flexibility, real-time behavior, high performance, low power consumption and, finally, programmability are key components in realizing the success of MPSoC platforms.
The run-time manager is logically located between the application layer and the platform layer. It has a crucial role in realizing these MPSoC requirements. As it abstracts the platform hardware, it improves platform programmability. By deciding on resource assignment at run-time, based on the performance requirements of the user, the needs of the application and the capabilities of the platform, it contributes to flexibility, scalability and low-power operation. As it has an arbiter function between different applications, it enables real-time behavior. This thesis details the key components of such an MPSoC run-time manager and provides a proof-of-concept implementation. These key components include application quality management algorithms linked to MPSoC resource management mechanisms and policies, adapted to the provided MPSoC platform services.
First, we describe the role, the responsibilities and the boundary conditions of an MPSoC run-time manager in a generic way. This includes a definition of the multiprocessor run-time management design space, a description of the run-time manager design trade-offs and a brief discussion on how these trade-offs affect the key MPSoC requirements. This design space definition and the trade-offs are illustrated based on ongoing research and on existing commercial and academic multiprocessor run-time management solutions.
Subsequently, we introduce a fast and efficient resource allocation heuristic that considers FPGA fabric properties such as fragmentation. In addition, this thesis introduces a novel task assignment algorithm for handling soft IP cores, denoted as hierarchical configuration. Hierarchical configuration managed by the run-time manager enables easier application design and increases the run-time spatial mapping freedom. In turn, this improves the performance of the resource assignment algorithm. Furthermore, we introduce run-time task migration components. We detail a new run-time task migration policy closely coupled to the run-time resource assignment algorithm. In addition to detailing a design-environment supported mechanism that enables moving tasks between an ISP and fine-grained reconfigurable hardware, we also propose two novel task migration mechanisms tailored to the Network-on-Chip environment. Finally, we propose a novel mechanism for task migration initiation, based on reusing debug registers in modern embedded microprocessors.
We propose a reactive on-chip communication management mechanism. We show that by exploiting an injection rate control mechanism it is possible to provide a communication management system capable of providing a soft (reactive) QoS in a NoC. We introduce a novel, platform-independent run-time algorithm to perform quality management, i.e. to select an application quality operating point at run-time based on the user requirements and the available platform resources, as reported by the resource manager. This contribution also proposes a novel way to manage the interaction between the quality manager and the resource manager. In order to have a realistic, reproducible and flexible run-time manager testbench with respect to applications with multiple quality levels and implementation trade-offs, we have created an input data generation tool denoted Pareto Surfaces For Free (PSFF). The PSFF tool is, to the best of our knowledge, the first tool that generates multiple realistic application operating points either based on profiling information of a real-life application or based on a designer-controlled random generator. Finally, we provide a proof-of-concept demonstrator that combines these concepts and shows how these mechanisms and policies can operate for real-life situations. In addition, we show that the proposed solutions can be integrated into existing platform operating systems.
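The quality management step described in the abstract above, selecting an application operating point at run-time from the user requirements and the available platform resources, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `OperatingPoint` fields, the single scalar resource count, and the selection rule (highest quality among feasible Pareto-optimal points) are hypothetical and are not the algorithm of the thesis.

```python
from typing import NamedTuple


class OperatingPoint(NamedTuple):
    name: str
    quality: float   # higher is better (hypothetical unit)
    resources: int   # e.g. processing elements required (hypothetical)


def pareto_front(points):
    """Keep only points not dominated in (quality, resources).

    A point is dominated if another point offers at least the same quality
    for at most the same resources, and is strictly better in one of them.
    """
    front = []
    for p in points:
        dominated = any(
            q.quality >= p.quality
            and q.resources <= p.resources
            and (q.quality > p.quality or q.resources < p.resources)
            for q in points
        )
        if not dominated:
            front.append(p)
    return front


def select_point(points, available, min_quality=0.0):
    """Pick the highest-quality Pareto point that fits the available resources
    and meets the user's minimum quality; return None if nothing fits."""
    feasible = [
        p for p in pareto_front(points)
        if p.resources <= available and p.quality >= min_quality
    ]
    return max(feasible, key=lambda p: p.quality, default=None)
```

In this sketch the resource manager's report is reduced to the single number `available`; a real run-time manager would likely track several resource types and renegotiate when no point is feasible.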