466 research outputs found

    Design and development from single core reconfigurable accelerators to a heterogeneous accelerator-rich platform

    Get PDF
    The performance of a platform is evaluated based on its ability to deal with the processing of multiple applications of different nature. In this context, the platform under evaluation can be of homogeneous, heterogeneous or of hybrid architecture. The selection of an architecture type is generally based on the set of different target applications and performance parameters, where the applications can be of serial or parallel nature. The evaluation is normally based on different performance metrics, e.g., resource/area utilization, execution time, power and energy consumption. This process can also include high-level performance metrics, e.g., Operations Per Second (OPS), OPS/Watt, OPS/Hz, Watt/Area etc. An example of architecture selection can be related to a wireless communication system where the processing of computationally-intensive signal-processing algorithms has strict execution-time constraints and in this case, a platform with special-purpose accelerators is relatively more suitable than a typical homogeneous platform. A couple of decades ago, it was expensive to plant many special-purpose accelerators on a chip as the cost per unit area was relatively higher than today. The utilization wall is also becoming a limiting factor in homogeneous multicore scaling which means that all the cores on a platform cannot be operated at their maximum frequency due to a possible thermal meltdown. In this case, some of the processing cores have to be turned-off or to be operated at very low frequencies making most of the part of the chip to stay underutilized. A possible solution lies in the use of heterogeneous multicore platforms where many application-specific cores operate at lower frequencies, therefore reducing power dissipation density and increasing other performance parameters. However, to achieve maximum flexibility in processing, a general-purpose flavor can also be introduced by adding a few Reduced Instruction-Set Computing (RISC) cores. A power class of heterogeneous multicore platforms is an accelerator-rich platform where many application-specific accelerators are loosely connected with each other for work load distribution or to execute the tasks independently. This research work spans from the design and development of three different types of template-based Coarse-Grain Reconfigurable Arrays (CGRAs), i.e., CREMA, AVATAR and SCREMA to a Heterogeneous Accelerator-Rich Platform (HARP). The accelerators generated from the three CGRAs could perform different lengths and types of Fast Fourier Transform (FFT), real and complex Matrix-Vector Multiplication (MVM) algorithms. CREMA and AVATAR were fixed CGRAs with eight and sixteen number of Processing Element (PE) columns, respectively. SCREMA could flex between four, eight, sixteen and thirty two number of PE columns. Many case studies were conducted to evaluate the performance of the reconfigurable accelerators generated from these CGRA templates. All of these CGRAs work in a processor/coprocessor model tightly integrated with a Direct Memory Access (DMA) device. Apart from these platforms, a reconfigurable Application-Specific Instruction-set Processor (rASIP) is also designed, tested for FFT execution under IEEE-802.11n timing constraints and evaluated against a processor/coprocessor model. It was designed by integrating AVATAR generated radix-(2, 4) FFT accelerator into the datapath of a RISC processor. The instruction set of the RISC processor was extended to perform additional operations related to AVATAR. As mentioned earlier, the underutilized part of the chip, now-a-days called Dark Silicon is posing many challenges for the designers. Apart from software optimizations, clock gating, dynamic voltage/frequency scaling and other high-level techniques, one way of dealing with this problem is to use many application-specific cores. In an effort to maximize the number of reconfigurable processing resources on a platform, the accelerator-rich architecture HARP was designed and evaluated in terms of different performance metrics. HARP is constructed on a Network-on-Chip (NoC) of 3x3 nodes where with every node, a CGRA of application-specific size is integrated other than the central node which is attached to a RISC processor. The RISC establishes synchronization between the nodes for data transfer and also performs the supervisory control. While using the NoC as the backbone of communication between the cores, it becomes possible for all the cores to address each other and also perform execution simultaneously and independently of each other. The performance of accelerators generated from CREMA, AVATAR and SCREMA templates were evaluated individually and also when attached to HARP's NoC nodes. The individual CGRAs show promising results in their own capacity but when integrated all together in the framework of HARP, interesting comparisons were established in terms of overall execution times, resource utilization, operating frequencies, power and energy consumption. In evaluating HARP, estimates and measurements were also made in some advanced performance metrics, e.g., in MOPS/mW and MOPS/MHz. The overall research work promotes the idea of heterogeneous accelerator-rich platform as a solution to current problems and future needs of industry and academia

    Transaction Generator - Tool for Network-on-Chip Benchmarking

    Get PDF
    This thesis focuses on benchmarking on-chip communication networks. Multiprocessor System-on-Chips (MP-SoC) utilizing the Network-on-Chip (NoC) paradigm are becoming more prominent. Required communication capabilities differ considerably among the diverse set of application categories. A standard and commonly used benchmarking methodology for networks is needed to ease finding a suitable network topology and its configuration parameters for those applications. This thesis presents a simulation based tool Transaction Generator (TG) for evaluating NoCs. TG conforms the Open Core Protocol - International Partnership (OCP-IP) NoC benchmarking group's proposed methodology. TG relies on abstract task graphs made after real Multiprocessor System-on-Chip (MP-SoC) applications or synthetic test cases. TG simulates the workload tasks on Processing Elements (PEs) and generates the network traffc accordingly and collects statistics. TG was initially introduced in 2003 and this thesis presents the current state of the tool and the modifications made. The work for this thesis includes refactoring the whole program from a TCL and C++ SystemC 1 based code generator to a C++ SystemC 2 based dynamic simulation kernel. In addition to the refactoring, new features were implemented, such as memory modeling with the Accurate Dynamic Random Access Memory (DRAM) Model (ADM) package, possibility of simulating Mobile Computing System Lab (MCSL) NoC Tra c Patterns workload models and diversity to modeling the workload. The current implementation of TG consists of 10k lines of code for the simulator core, the result of this thesis, and 50k lines of code for the support programs and example NoC models. Thesis presents 3 example use cases requiring around 100 simulations, which can be executed and analyzed in a work day with the TG

    Networks on Chips: Structure and Design Methodologies

    Get PDF

    Facilitating Internet of Things on the Edge

    Get PDF
    The evolution of electronics and wireless technologies has entered a new era, the Internet of Things (IoT). Presently, IoT technologies influence the global market, bringing benefits in many areas, including healthcare, manufacturing, transportation, and entertainment. Modern IoT devices serve as a thin client with data processing performed in a remote computing node, such as a cloud server or a mobile edge compute unit. These computing units own significant resources that allow prompt data processing. The user experience for such an approach relies drastically on the availability and quality of the internet connection. In this case, if the internet connection is unavailable, the resulting operations of IoT applications can be completely disrupted. It is worth noting that emerging IoT applications are even more throughput demanding and latency-sensitive which makes communication networks a practical bottleneck for the service provisioning. This thesis aims to eliminate the limitations of wireless access, via the improvement of connectivity and throughput between the devices on the edge, as well as their network identification, which is fundamentally important for IoT service management. The introduction begins with a discussion on the emerging IoT applications and their demands. Subsequent chapters introduce scenarios of interest, describe the proposed solutions and provide selected performance evaluation results. Specifically, we start with research on the use of degraded memory chips for network identification of IoT devices as an alternative to conventional methods, such as IMEI; these methods are not vulnerable to tampering and cloning. Further, we introduce our contributions for improving connectivity and throughput among IoT devices on the edge in a case where the mobile network infrastructure is limited or totally unavailable. Finally, we conclude the introduction with a summary of the results achieved

    Design of Intellectual Property-Based Hardware Blocks Integrable with Embedded RISC Processors

    Get PDF
    The main focus of this thesis is to research methods, architecture, and implementation of hardware acceleration for a Reduced Instruction Set Computer (RISC) platform. The target platform is a single-core general-purpose embedded processor (the COFFEE core) which was developed by our group at Tampere University of Technology. The COFFEE core alone cannot meet the requirements of the modern applications due to the lack of several components of which the Memory Management Unit (MMU) is one of the prominent ones. Since the MMU is one of the main requirements of today’s processors, COFFEE with no MMU was not able to run an operating system. In the design of the MMU, we employed two additional micro-Translation-Lookaside Buffers (TLBs) to speed up the translation process, as well as minimizing congestions of the data/instruction address translations with a unified TLB. The MMU is tightly-coupled with the COFFEE RISC core through the Peripheral Control Block (PCB) interface of the core. The hardware implementation, alongside some optimization techniques and post synthesis results are presented, as well.Another intention of this work is to prepare a reconfigurable platform to send and receive data packets of the next generation wireless communications. Hence, we will further discuss a recently emerged wireless modulation technique known as Non-Contiguous Orthogonal Frequency Division Multiplexing (NC-OFDM), a promising technique to alleviate spectrum scarcity problem. However, one of the primary concerns in such systems is the synchronization. To that end, we developed a reconfigurable hardware component to perform as a synchronizer. The developed module exploits Partial Reconfiguration (PR) feature in order to reconfigure itself. Eventually, we will come up with several architectural choices for systems with different limiting factors such as power consumption, operating frequency, and silicon area. The synchronizer can be loosely-coupled via one of the available co-processor slots of the target processor, the COFFEE RISC core.In addition, we are willing to improve the versatility of the COFFEE core even in industrial use cases. Hence, we developed a reconfigurable hardware component capable of operating in the Controller Area Network (CAN) protocol. In the first step of this implementation, we mainly concentrate on receiving, decoding, and extracting the data segment of a CAN-based packet. Moreover, this hardware block can reconfigure itself on-the-fly to operate on different data frames. More details regarding hardware implementation issues, as well as post synthesis results are also presented. The CAN module is loosely-coupled with the COFFEE RISC processor through one of the available co-processor block

    Power and Energy Aware Heterogeneous Computing Platform

    Get PDF
    During the last decade, wireless technologies have experienced significant development, most notably in the form of mobile cellular radio evolution from GSM to UMTS/HSPA and thereon to Long-Term Evolution (LTE) for increasing the capacity and speed of wireless data networks. Considering the real-time constraints of the new wireless standards and their demands for parallel processing, reconfigurable architectures and in particular, multicore platforms are part of the most successful platforms due to providing high computational parallelism and throughput. In addition to that, by moving toward Internet-of-Things (IoT), the number of wireless sensors and IP-based high throughput network routers is growing at a rapid pace. Despite all the progression in IoT, due to power and energy consumption, a single chip platform for providing multiple communication standards and a large processing bandwidth is still missing.The strong demand for performing different sets of operations by the embedded systems and increasing the computational performance has led to the use of heterogeneous multicore architectures with the help of accelerators for computationally-intensive data-parallel tasks acting as coprocessors. Currently, highly heterogeneous systems are the most power-area efficient solution for performing complex signal processing systems. Additionally, the importance of IoT has increased significantly the need for heterogeneous and reconfigurable platforms.On the other hand, subsequent to the breakdown of the Dennardian scaling and due to the enormous heat dissipation, the performance of a single chip was obstructed by the utilization wall since all cores cannot be clocked at their maximum operating frequency. Therefore, a thermal melt-down might be happened as a result of high instantaneous power dissipation. In this context, a large fraction of the chip, which is switched-off (Dark) or operated at a very low frequency (Dim) is called Dark Silicon. The Dark Silicon issue is a constraint for the performance of computers, especially when the up-coming IoT scenario will demand a very high performance level with high energy efficiency. Among the suggested solution to combat the problem of Dark-Silicon, the use of application-specific accelerators and in particular Coarse-Grained Reconfigurable Arrays (CGRAs) are the main motivation of this thesis work.This thesis deals with design and implementation of Software Defined Radio (SDR) as well as High Efficiency Video Coding (HEVC) application-specific accelerators for computationally intensive kernels and data-parallel tasks. One of the most important data transmission schemes in SDR due to its ability of providing high data rates is Orthogonal Frequency Division Multiplexing (OFDM). This research work focuses on the evaluation of Heterogeneous Accelerator-Rich Platform (HARP) by implementing OFDM receiver blocks as designs for proof-of-concept. The HARP template allows the designer to instantiate a heterogeneous reconfigurable platform with a very large amount of custom-tailored computational resources while delivering a high performance in terms of many high-level metrics. The availability of this platform lays an excellent foundation to investigate techniques and methods to replace the Dark or Dim part of chip with high-performance silicon dissipating very low power and energy. Furthermore, this research work is also addressing the power and energy issues of the embedded computing systems by tailoring the HARP for self-aware and energy-aware computing models. In this context, the instantaneous power dissipation and therefore the heat dissipation of HARP are mitigated on FPGA/ASIC by using Dynamic Voltage and Frequency Scaling (DVFS) to minimize the dark/dim part of the chip. Upgraded HARP for self-aware and energy-aware computing can be utilized as an energy-efficient general-purpose transceiver platform that is cognitive to many radio standards and can provide high throughput while consuming as little energy as possible. The evaluation of HARP has shown promising results, which makes it a suitable platform for avoiding Dark Silicon in embedded computing platforms and also for diverse needs of IoT communications.In this thesis, the author designed the blocks of OFDM receiver by crafting templatebased CGRA devices and then attached them to HARP’s Network-on-Chip (NoC) nodes. The performance of application-specific accelerators generated from templatebased CGRAs, the performance of the entire platform subsequent to integrating the CGRA nodes on HARP and the NoC traffic are recorded in terms of several highlevel performance metrics. In evaluating HARP on FPGA prototype, it delivers a performance of 0.012 GOPS/mW. Because of the scalability and regularity in HARP, the author considered its value as architectural constant. In addition to showing the gain and the benefits of maximizing the number of reconfigurable processing resources on a platform in comparison to the scaled performance of several state-of-the-art platforms, HARP’s architectural constant ensures application-independent figure of merit. HARP is further evaluated by implementing various sizes of Discrete Cosine transform (DCT) and Discrete Sine Transform (DST) dedicated for HEVC standard, which showed its ability to sustain Full HD 1080p format at 30 fps on FPGA. The author also integrated self-aware computing model in HARP to mitigate the power dissipation of an OFDM receiver. In the case of FPGA implementation, the total power dissipation of the platform showed 16.8% reduction due to employing the Feedback Control System (FCS) technique with Dynamic Frequency Scaling (DFS). Furthermore, by moving to ASIC technology and scaling both frequency and voltage simultaneously, significant dynamic power reduction (up to 82.98%) was achieved, which proved the DFS/DVFS techniques as one step forward to mitigate the Dark Silicon issue

    Revisiting the high-performance reconfigurable computing for future datacenters

    Get PDF
    Modern datacenters are reinforcing the computational power and energy efficiency by assimilating field programmable gate arrays (FPGAs). The sustainability of this large-scale integration depends on enabling multi-tenant FPGAs. This requisite amplifies the importance of communication architecture and virtualization method with the required features in order to meet the high-end objective. Consequently, in the last decade, academia and industry proposed several virtualization techniques and hardware architectures for addressing resource management, scheduling, adoptability, segregation, scalability, performance-overhead, availability, programmability, time-to-market, security, and mainly, multitenancy. This paper provides an extensive survey covering three important aspects-discussion on non-standard terms used in existing literature, network-on-chip evaluation choices as a mean to explore the communication architecture, and virtualization methods under latest classification. The purpose is to emphasize the importance of choosing appropriate communication architecture, virtualization technique and standard language to evolve the multi-tenant FPGAs in datacenters. None of the previous surveys encapsulated these aspects in one writing. Open problems are indicated for scientific community as well

    Wireless distributed intelligence in personal applications

    Get PDF
    Tietokoneet ovat historian kuluessa kehittyneet keskustietokoneista hajautettujen, langattomasti toimivien järjestelmien suuntaan. Elektroniikalla toteutetut automaattiset toiminnot ympärillämme lisääntyvät kiihtyvällä vauhdilla. Tällaiset sovellukset lisääntyvät tulevaisuudessa, mutta siihen soveltuva tekniikka on vielä kehityksen alla ja vaadittavia ominaisuuksia ei aina löydy. Nykyiset lyhyen kantaman langattoman tekniikan standardit ovat tarkoitettu lähinnä teollisuuden ja multimedian käyttöön, siksi ne ovat vain osittain soveltuvia uudenlaisiin ympäristöälykkäisiin käyttötarkoituksiin. Ympäristöälykkäät sovellukset palvelevat enimmäkseen jokapäiväistä elämäämme, kuten turvallisuutta, kulunvalvontaa ja elämyspalveluita. Ympäristöälykkäitä ratkaisuja tarvitaan myös hajautetussa automaatiossa ja kohteiden automaattisessa seurannassa. Tutkimuksen aikana Seinäjoen ammattikorkeakoulussa on tutkittu lyhyen kantaman langatonta tekniikkaa: suunniteltu ja kehitetty pienivirtaisia radionappeja, niitten ohjelmointiympäristöä sekä langattoman verkon synkronointia, tiedonkeruuta ja reititystä. Lisäksi on simuloitu eri reititystapoja, sisäpaikannusta ja kaivinkoneen kalibrointia soveltaen mm. neurolaskentaa. Tekniikkaa on testattu myös käytännön sovelluksissa. Ympäristöälykkäät sovellusalueet ovat ehkä nopeimmin kasvava lähitulevaisuuden ala tietotekniikassa. Tutkitulla tekniikalla on runsaasti uusia haasteita ihmisten hyvinvointia, terveyttä ja turvallisuutta lisäävissä sovelluksissa, kuten myös teollisuuden uusissa sovelluksissa, esimerkiksi älykkäässä energiansiirtoverkossa.The development of computing is moving from mainframe computers to distributed intelligence with wireless features. The automated functions around us, in the form of small electronic devices, are increasing and the pace is continuously accelerating. The number of these applications will increase in the future, but suitable features needed are lacking and suitable technology development is still ongoing. The existing wireless short-range standards are mostly suitable for use in industry and in multimedia applications, but they are only partly suitable for the new network feature demands of the ambient intelligence applications. The ambient intelligent applications will serve us in our daily lives: security, access control and exercise services. Ambient intelligence is also adopted by industry in distributed amorphous automation, in access monitoring and the control of machines and devices. During this research, at Seinäjoki University of Applied Sciences, we have researched, designed and developed short-range wireless technology: low-power radio buttons with a programming environment for them as well as synchronization, data collecting and routing features for the wireless network. We have simulated different routing methods, indoor positioning and excavator calibration using for example neurocomputing. In addition, we have tested the technology in practical applications. The ambient intelligent applications are perhaps the area growing the most in information technology in the future. There will be many new challenges to face to increase welfare, health, security, as well as industrial applications (for example, at factories and in smart grids) in the future.fi=vertaisarvioitu|en=peerReviewed

    Design and Implementation of MIMO OFDM IEEE802.11n Receiver Blocks on Heterogeneous Multicore Architecture

    Get PDF
    In this thesis, the performance of a heterogeneous multicore platform in terms of technical capability is evaluated. Therefore, the choice of architecture in general can be based on a set of diverse applications. Selected applications can be parallel or serial in nature. Applications evaluation are often based on various performance metrics including the resource utilization and execution time. The wireless communication systems are expanded to accelerate their functions execution in both software and hardware. The embedded systems which involve several types of communication systems perform a large number of computations which require short execution time and minimized power consumption. Also, there is a growing demand for application-specific accelerators aiding general-purpose. One feasible way is to use heterogeneous multi-core platforms. Furthermore, many application-specific accelerators are loosely connected with each other. In this study, the implementation of Multiple-Input Multiple-Output (MIMO) Orthogonal Frequency Division Multiplexing (OFDM) receiver is evaluated by applying a Heterogeneous Multicore Architecture (HMA). The MIMO OFDM receiver is composed of computationally intensive and general-purpose processing tasks and can serve maximum coverage for evaluation of the HMA. The receiver blocks are designed by crafting template-based Coarse-grained Reconfigurable Array (CGRA) devices. In this case study, four streams (antennas) are proposed in order to process the data over CGRAs simultaneously. HMA nodes will be reconfigured at run-time in different blocks of the receiver. In this experimental work, according to the performance of each CGRA, the collective performance of the entire platform as well as NoC traffic is recorded considering the number of clock cycles and also several high-level performance criteria. The implementation of OFDM receiver scaled CGRAs to various dimensions. The data can also be exchanged between diverse nodes on the NoC structure by utilizing direct memory access (DMA) devices independently
    corecore