376 research outputs found

    Efficient Connection Allocator in Network-on-Chip

    Get PDF
    As semiconductor technologies develop, a System-on-Chip (SoC) that integrates all semiconductor intellectual property (IP) cores is suggested and widely used for various applications. A traditional bus interconnection does not support transmitting data between IP cores for high performance. Because of this reason, a Network-on-Chip (NoC) has been suggested to provide an efficient and scalable solution to interconnect among all IP cores. High throughput and low latency have recently become the main important factors of NoC for achieving hard guaranteed real-time systems. In order to guarantee these factors and provide real-time service (i.e., Guaranteed Service, GS), the circuit switching (CS) approach has been widely utilized. The CS approach allocates mutually exclusive paths to transmitting data between different sources and destinations using dedicated NoC resources. However, the exclusive occupancy of the allocated path reduces the efficiency of the overall use of NoC resources. In order to solve this problem, Space-Division-Multiplexing (SDM) and Time-Division-Multiplexing (TDM) techniques have been suggested. SDM implements a circuit switching technique by assigning physically different NoC-links between different connections. Path connections of the SDM technique based on spatial resources assignment do not provide high scalability. In contrast to this, using virtual time slots for a path connection, the TDM technique can share physical links between exclusively established connections, thereby improving NoC path diversity. For all of these mentioned techniques, the factor that significantly impacts the system efficiency or performance scaling is how the path is allocated. In recent years, a dynamic connection allocation approach that can cope with highly dynamic workloads has been gaining attention due to the sudden and diverse demands of applications in real-time systems. There are two groups in the dynamic connection allocation approach. One is a distributed allocation technique, and the other is a centralized allocation technique. While distributed allocation exploits additional logic integrated into the NoC-routers for path search and allocation, the centralized approach makes use of a central unit to manage the path allocation problem. There are several algorithms for the centralized allocation technique. Trellis search-based allocation approach shows the best performance among them. Many algorithms related to centralized connection allocators have been studied extensively during the past decade. However, relatively little attention was paid to methodology in analyzing and evaluating the centralized connection allocation algorithms. In order to further develop the algorithms, it is necessary to understand and evaluate the centralized connection allocator by establishing a new analysis methodology. Thus, this thesis presents a performance analysis methodology for the trellis search-based allocation approach. Firstly, this thesis proposes a system model for analysis. Secondly, performance metrics are defined. Finally, the analysis results of each performance metric related to the trellis search-based allocation approach are presented. Through this analysis, the performance of the trellis search-based allocation approach can be accurately analyzed. Although a simulation is not performed, the upper limit of performance of the trellis search-based allocation approach can also be predicted through the analysis metrics. Additionally, we introduce the general formulation of the trellis search-based path allocation algorithm. The weight values among available paths through the branch metric and path metric are proposed to enable higher performance path connection. Furthermore, according to network size, topology, TDM, interface load delivery, and router internal storage, the performance of trellis search-based path allocation algorithms is also described. In the end, the Application Specific Instruction Processor (ASIP) hardware platform customized for the trellis search-based path allocation algorithm is presented. The shortest available and lowest-cost (SALC) path search algorithm is proposed to improve the success rate of path connection in the ASIP hardware platform. We evaluate the algorithm performance and implementation synthesis results. In order to realize the dynamic connection approach, a short execution cycle of ASIP time is essential. We develop several algorithms to achieve this short execution cycle. The first one is a rectangular region of search algorithm that allows adapting the size and form of path search region according to the particular source-destination positions and considers actual operational constraints. The average execution cycles for searching an optimum path are decreased because the unnecessary region for path-search is excluded. The second one is a path-spreading search algorithm that separates between involved routers and uninvolved routers in path search. The involved routers are selected and spread out from source to destination at each intermediate trellis-search process. The path-search overhead is considerably reduced due to the router involvements. The third one is a three-directional path-spreading search algorithm that eliminates one direction movement among four spreading movements. Because of this reason, the trellis search-based path connection algorithm, which omits the back-tracing process, can be implemented in the ASIP platform. Thus, the whole algorithm execution time can be halved. The last one is a moving regional path search algorithm that significantly reduces computation complexity by selecting a constant dimensional path-search region that affects performance and moving the region from source to destination. The moving regional path search algorithm achieves a considerable decrement of computational complexity.:1 Introduction 1 1.1 NoC-interconnect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.2 Thesis outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4 2 Connection allocation in a Network-on-Chip 7 2.1 Circuit Switching NoCs . . . . . . . . . . . . . . . . . . . . . . . . . . 7 2.1.1 Guaranteed Service in NoCs . . . . . . . . . . . . . . . . . . . 7 2.1.2 Spatial-Division-Multiplexing technique . . . . . . . . . . . . 8 2.1.3 Time-Division-Multiplexing technique . . . . . . . . . . . . . 10 2.2 System architectures employing circuit switching NoCs . . . . . . . . 11 2.2.1 Static and dynamic connection allocation . . . . . . . . . . . 12 2.2.2 Distributed connection allocation technique . . . . . . . . . . 14 2.2.3 Centralized connection allocation technique . . . . . . . . . . 16 2.2.4 Algorithms for centralized connection allocation . . . . . . . . 17 2.2.4.1 Software based run-time path allocation approach . 18 2.2.4.2 Trellis search-based allocation approach . . . . . . . 19 3 Performance analysis methodology for a centralized connection allocator 23 3.1 System model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23 3.2 Performance metrics and analysis methodology . . . . . . . . . . . . 25 3.3 System simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33 4 Trellis search-based path allocation algorithm 45 4.1 General formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 45 4.1.1 Trellis graph structure . . . . . . . . . . . . . . . . . . . . . . 45 4.1.2 Survivor path selection criterion . . . . . . . . . . . . . . . . . 52 ix 4.1.2.1 Branch metric and path metric . . . . . . . . . . . . 52 4.1.2.2 The shortest-available and lowest-cost path selection criterion . . . . . . . . . . . . . . . . . . . . . . . . . 53 4.2 Algorithm Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 4.2.1 Network topology . . . . . . . . . . . . . . . . . . . . . . . . 55 4.2.2 Network size . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 4.2.3 Time-Division-Multiplexing . . . . . . . . . . . . . . . . . . . 61 4.2.4 NoC interface load diversity . . . . . . . . . . . . . . . . . . . 63 4.2.5 The internal storage of the router . . . . . . . . . . . . . . . . 66 5 ASIP approach for Trellis search-based connection allocation 73 5.1 System model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73 5.1.1 Trellis search-based ASIP platform architecture . . . . . . . . 74 5.2 Algorithm for improving success rates of path connection . . . . . . . 81 5.2.1 SALC algorithm for Trellis search-based ASIP platform . . . . 81 5.2.2 Performance evaluation of the SALC algorithm . . . . . . . . 88 5.2.2.1 Simulation results . . . . . . . . . . . . . . . . . . . 88 5.2.2.2 Synthesis results . . . . . . . . . . . . . . . . . . . . 91 5.3 Algorithm for reducing path-search time . . . . . . . . . . . . . . . . 93 5.3.1 Rectangular regional path search algorithm . . . . . . . . . . 93 5.3.2 Path-spreading search algorithm . . . . . . . . . . . . . . . . 99 5.3.3 Three directional path-spreading search algorithm . . . . . . 108 5.3.4 Moving regional path search algorithm . . . . . . . . . . . . . 114 5.3.5 Performance evaluation . . . . . . . . . . . . . . . . . . . . . 123 5.3.5.1 Simulation results . . . . . . . . . . . . . . . . . . . 123 5.3.5.2 Synthesis results . . . . . . . . . . . . . . . . . . . . 126 6 Conclusion and Future work 131 6.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 Bibliography 13

    Computer aided design of cluster-based ASIPs

    Get PDF

    VLSI implementation of a multi-mode turbo/LDPC decoder architecture

    Get PDF
    Flexible and reconfigurable architectures have gained wide popularity in the communications field. In particular, reconfigurable architectures for the physical layer are an attractive solution not only to switch among different coding modes but also to achieve interoperability. This work concentrates on the design of a reconfigurable architecture for both turbo and LDPC codes decoding. The novel contributions of this paper are: i) tackling the reconfiguration issue introducing a formal and systematic treatment that, to the best of our knowledge, was not previously addressed; ii) proposing a reconfigurable NoCbased turbo/LDPC decoder architecture and showing that wide flexibility can be achieved with a small complexity overhead. Obtained results show that dynamic switching between most of considered communication standards is possible without pausing the decoding activity. Moreover, post-layout results show that tailoring the proposed architecture to the WiMAX standard leads to an area occupation of 2.75 mm2 and a power consumption of 101.5 mW in the worst case

    Custom Integrated Circuits

    Get PDF
    Contains table of contents for Part III, table of contents for Section 1 and reports on eleven research projects.IBM CorporationMIT School of EngineeringNational Science Foundation Grant MIP 94-23221Defense Advanced Research Projects Agency/U.S. Army Intelligence Center Contract DABT63-94-C-0053Mitsubishi CorporationNational Science Foundation Young Investigator Award Fellowship MIP 92-58376Joint Industry Program on Offshore Structure AnalysisAnalog DevicesDefense Advanced Research Projects AgencyCadence Design SystemsMAFET ConsortiumConsortium for Superconducting ElectronicsNational Defense Science and Engineering Graduate FellowshipDigital Equipment CorporationMIT Lincoln LaboratorySemiconductor Research CorporationMultiuniversity Research IntiativeNational Science Foundatio

    Instruction-set architecture synthesis for VLIW processors

    Get PDF

    Energy analysis and optimisation techniques for automatically synthesised coprocessors

    Get PDF
    The primary outcome of this research project is the development of a methodology enabling fast automated early-stage power and energy analysis of configurable processors for system-on-chip platforms. Such capability is essential to the process of selecting energy efficient processors during design-space exploration, when potential savings are highest. This has been achieved by developing dynamic and static energy consumption models for the constituent blocks within the processors. Several optimisations have been identified, specifically targeting the most significant blocks in terms of energy consumption. Instruction encoding mechanism reduces both the energy and area requirements of the instruction cache; modifications to the multiplier unit reduce energy consumption during inactive cycles. Both techniques are demonstrated to offer substantial energy savings. The aforementioned techniques have undergone detailed evaluation and, based on the positive outcomes obtained, have been incorporated into Cascade, a system-on-chip coprocessor synthesis tool developed by Critical Blue, to provide automated analysis and optimisation of processor energy requirements. This thesis details the process of identifying and examining each method, along with the results obtained. Finally, a case study demonstrates the benefits of the developed functionality, from the perspective of someone using Cascade to automate the creation of an energy-efficient configurable processor for system-on-chip platforms

    Application specific instruction set processor design for embedded application using the coware tool

    Get PDF
    An Application Specific Instruction Set Processor (ASIP) is widely used as a System on a Chip(SoC) Component. ASIPs possess an instruction set which is tai-lored to benefit a specific application. Such specialization allows ASIPs to serve as an intermediate between two dominant processor design styles- ASICs which has high processing abilities at the cost of limited programmability and Programmable solu-tions such as FPGAs that provide programming exibility at the cost of less energy eficiency. In this dissertation the goal is to design ASIP, keeping in mind a temper-ature sensor system. The platform used for processor design is LISA 2.0 description language and processor designing environment from CoWare. Coware processor de-signer allows processor architecture to be defined at an abstract level and automatic generation of chain of software tools like assembler, linker and simulator for functional verification followed by RTL level description. RTL level description is used to gen-erate synthesized report of the design using RTL compiler and finally the layout is created using Cadence encounter

    KAVUAKA: a low-power application-specific processor architecture for digital hearing aids

    Get PDF
    The power consumption of digital hearing aids is very restricted due to their small physical size and the available hardware resources for signal processing are limited. However, there is a demand for more processing performance to make future hearing aids more useful and smarter. Future hearing aids should be able to detect, localize, and recognize target speakers in complex acoustic environments to further improve the speech intelligibility of the individual hearing aid user. Computationally intensive algorithms are required for this task. To maintain acceptable battery life, the hearing aid processing architecture must be highly optimized for extremely low-power consumption and high processing performance.The integration of application-specific instruction-set processors (ASIPs) into hearing aids enables a wide range of architectural customizations to meet the stringent power consumption and performance requirements. In this thesis, the application-specific hearing aid processor KAVUAKA is presented, which is customized and optimized with state-of-the-art hearing aid algorithms such as speaker localization, noise reduction, beamforming algorithms, and speech recognition. Specialized and application-specific instructions are designed and added to the baseline instruction set architecture (ISA). Among the major contributions are a multiply-accumulate (MAC) unit for real- and complex-valued numbers, architectures for power reduction during register accesses, co-processors and a low-latency audio interface. With the proposed MAC architecture, the KAVUAKA processor requires 16 % less cycles for the computation of a 128-point fast Fourier transform (FFT) compared to related programmable digital signal processors. The power consumption during register file accesses is decreased by 6 %to 17 % with isolation and by-pass techniques. The hardware-induced audio latency is 34 %lower compared to related audio interfaces for frame size of 64 samples.The final hearing aid system-on-chip (SoC) with four KAVUAKA processor cores and ten co-processors is integrated as an application-specific integrated circuit (ASIC) using a 40 nm low-power technology. The die size is 3.6 mm2. Each of the processors and co-processors contains individual customizations and hardware features with a varying datapath width between 24-bit to 64-bit. The core area of the 64-bit processor configuration is 0.134 mm2. The processors are organized in two clusters that share memory, an audio interface, co-processors and serial interfaces. The average power consumption at a clock speed of 10 MHz is 2.4 mW for SoC and 0.6 mW for the 64-bit processor.Case studies with four reference hearing aid algorithms are used to present and evaluate the proposed hardware architectures and optimizations. The program code for each processor and co-processor is generated and optimized with evolutionary algorithms for operation merging,instruction scheduling and register allocation. The KAVUAKA processor architecture is com-pared to related processor architectures in terms of processing performance, average power consumption, and silicon area requirements
    corecore