1,422 research outputs found

    Hardware schemes for early register release

    Register files are becoming one of the critical components of current out-of-order processors in terms of delay and power consumption, since their potential to exploit instruction-level parallelism is closely tied to the size and number of ports of the register file. In conventional register renaming schemes, a register is released conservatively, only after the instruction that redefines the same register has committed. Instead, we propose a scheme that releases registers as soon as the processor knows that there will be no further use of them. We present two early-release hardware implementations with different performance/complexity trade-offs. Detailed cycle-level simulations show either a significant speedup for a given register file size, or a reduction in register file size for a given performance level.
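The difference between the two release policies can be illustrated with a toy model. The sketch below is not the paper's hardware; it assumes a straight-line, in-order commit trace with no branches, and the names `Instr`, `release_points`, and `program` are hypothetical. For each value it reports the instruction whose commit frees the register under the conventional policy (the next redefinition) and under an idealized early policy (the last reader before that redefinition).

```python
from typing import List, Optional, Tuple

# One instruction: (destination logical register or None, source logical registers).
Instr = Tuple[Optional[str], List[str]]


def release_points(program: List[Instr]) -> List[Tuple[int, Optional[int], Optional[int]]]:
    """Return (producer, conventional, early) triples of instruction indices.

    conventional: index of the next instruction that redefines the same logical
                  register; the old physical register is freed when it commits.
    early:        index of the last instruction that reads the value before that
                  redefinition (idealized; assumes the redefinition is already known).
    Both are None when the trace never redefines the register.
    """
    results = []
    for i, (dest, _) in enumerate(program):
        if dest is None:
            continue
        redefined_at = None
        last_use = i  # if nobody reads the value, the producer is its own last use
        for j in range(i + 1, len(program)):
            dest_j, srcs_j = program[j]
            if dest in srcs_j:
                last_use = j
            if dest_j == dest:
                redefined_at = j
                break
        early = last_use if redefined_at is not None else None
        results.append((i, redefined_at, early))
    return results


# Hypothetical five-instruction trace: r1 is read once, then redefined much later.
program: List[Instr] = [
    ("r1", []),      # 0: r1 = load
    ("r2", ["r1"]),  # 1: r2 = f(r1)   last use of the first r1
    ("r3", ["r2"]),  # 2
    ("r4", ["r3"]),  # 3
    ("r1", ["r4"]),  # 4: r1 redefined
]

for producer, conventional, early in release_points(program):
    print(producer, conventional, early)
```

In this trace the register holding the first `r1` is held until instruction 4 commits under the conventional policy, but could be freed once instruction 1 commits under the idealized early policy.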

    HERO: Heterogeneous Embedded Research Platform for Exploring RISC-V Manycore Accelerators on FPGA

    Heterogeneous embedded systems on chip (HESoCs) co-integrate a standard host processor with programmable manycore accelerators (PMCAs) to combine general-purpose computing with domain-specific, efficient processing capabilities. While leading companies successfully advance their HESoC products, research lags behind due to the challenges of building a prototyping platform that unites an industry-standard host processor with an open research PMCA architecture. In this work we introduce HERO, an FPGA-based research platform that combines a PMCA composed of clusters of RISC-V cores, implemented as soft cores on an FPGA fabric, with a hard ARM Cortex-A multicore host processor. The PMCA architecture mapped on the FPGA is silicon-proven, scalable, configurable, and fully modifiable. HERO includes a complete software stack that consists of a heterogeneous cross-compilation toolchain with support for OpenMP accelerator programming, a Linux driver, and runtime libraries for both host and PMCA. HERO is designed to facilitate rapid exploration on all software and hardware layers: run-time behavior can be accurately analyzed by tracing events, and modifications can be validated through fully automated hardware and software builds and executed tests. We demonstrate the usefulness of HERO by means of case studies from our research.

    A configurable vector processor for accelerating speech coding algorithms

    The growing demand for voice-over-packet (VoIP) services and multimedia-rich applications has made the efficient, real-time implementation of low-bit-rate speech coders on embedded VLSI platforms increasingly important. Such speech coders are designed to substantially reduce bandwidth requirements, thus enabling dense multichannel gateways in a small form factor. This, however, comes at a high computational cost, which mandates the use of very high-performance embedded processors. This thesis investigates the potential acceleration of two major ITU-T speech coding algorithms, namely G.729A and G.723.1, through their efficient implementation on a configurable, extensible vector embedded CPU architecture. New scalar and vector ISAs were introduced, which resulted in up to 80% reduction in the dynamic instruction count of both workloads. These instructions were subsequently encapsulated into a parametric, hybrid SISD (scalar processor)–SIMD (vector) processor. This work presents the research and implementation of the vector datapath of this vector coprocessor, which is tightly coupled to a Sparc-V8 compliant CPU, the optimization and simulation methodologies employed, and the use of Electronic System Level (ESL) techniques to rapidly design SIMD datapaths.

    Scheduling Analysis from Architectural Models of Embedded Multi-Processor Systems

    As embedded systems need more and more computing power, many products require hardware platforms based on multiple processors. For systems with real-time constraints, the use of scheduling analysis tools is mandatory to validate the design choices and to make better use of the processing capacity of the system. To this end, this paper presents the extension of the scheduling analysis tool Cheddar to deal with multi-processor scheduling. In a Model Driven Engineering approach, useful information about the scheduling of the application is extracted from a model expressed in an architectural language called AADL. We also define how the AADL model must be written to express the standard policies for multi-processor scheduling.

    Comparisons of some large scientific computers

    In 1975, the National Aeronautics and Space Administration (NASA) began studies to assess the technical and economic feasibility of developing a computer having a sustained computational speed of one billion floating point operations per second and a working memory of at least 240 million words. Such a powerful computer would allow computational aerodynamics to play a major role in aeronautical design and advanced fluid dynamics research. Based on favorable results from these studies, NASA proceeded with developmental plans. The computer was named the Numerical Aerodynamic Simulator (NAS). To help ensure that the estimated cost, schedule, and technical scope were realistic, a brief study was made of past large scientific computers. Large discrepancies between inception and operation in scope, cost, or schedule were studied so that they could be minimized with NASA's proposed new computer. The main computers studied were the ILLIAC IV, STAR 100, Parallel Element Processor Ensemble (PEPE), and Shuttle Mission Simulator (SMS) computer. Comparison data on memory and speed were also obtained on the IBM 650, 704, 7090, 360-50, 360-67, 360-91, and 370-195; the CDC 6400, 6600, 7600, CYBER 203, and CYBER 205; CRAY 1; and the Advanced Scientific Computer (ASC). A few lessons learned conclude the report.

    Architectural Verification of Four-instruction Superscalar Processor for MIPS I Instruction Set

    The study undertaken in this thesis tries to tackle this inefficiency by providing extra register locations, called pseudo-registers, in addition to the architectural registers, and by using a pointer scheme to reference both architectural and pseudo-registers. This scheme renames each logical destination register of an incoming instruction to a pseudo-register referenced by pointers called pseudo-pointers. Two separate lists of these pointers are maintained, one for all types of instructions and the other for only unspeculated instructions. When a branch instruction preceding the speculated instruction is evaluated and it is established that the prediction was correct, the machine state is altered by updating the pointer lists instead of moving the data. As the pointers are only 6 bits wide, the inefficiency is considerably reduced. This processor scheme is implemented using the Verilog hardware description language (HDL). The study provides architectural details of each component used in the processor, stressing issues involved in the implementation and the methods used to overcome them. It also discusses the verification methodology, documenting the steps involved in compiling a C program and loading it into the simulated instruction cache and data cache for simulation. Finally, simulation results are presented for a sample C program, verifying the design.
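The idea of changing machine state by updating pointer lists rather than moving data can be sketched in software. The following is a toy model under stated assumptions, not the thesis's Verilog design: the class `PointerRenamer` and its method names are hypothetical, the free list is simply recomputed after each branch resolution, and the default of 64 storage locations is chosen only because that is what a 6-bit pointer can address.

```python
class PointerRenamer:
    """Toy register file addressed through two pointer lists (hypothetical model)."""

    def __init__(self, num_logical: int = 32, num_physical: int = 64):
        # 64 locations can be addressed with 6-bit pointers.
        self.data = [0] * num_physical
        # Pointer list covering only unspeculated (committed) state.
        self.committed = list(range(num_logical))
        # Pointer list covering all instructions, speculative ones included.
        self.speculative = list(self.committed)
        self.free = list(range(num_logical, num_physical))

    def write(self, logical: int, value: int) -> int:
        """Rename the destination: store into a fresh location, update one pointer."""
        ptr = self.free.pop(0)  # a real machine would stall when no location is free
        self.data[ptr] = value
        self.speculative[logical] = ptr
        return ptr

    def read(self, logical: int) -> int:
        """Read the newest (possibly speculative) value of a logical register."""
        return self.data[self.speculative[logical]]

    def resolve(self, prediction_correct: bool) -> None:
        """Resolve a branch by copying pointers only; register data is never moved."""
        keep = self.speculative if prediction_correct else self.committed
        self.committed = list(keep)
        self.speculative = list(keep)
        # Simplification: every location not referenced by the kept list is free again.
        live = set(keep)
        self.free = [p for p in range(len(self.data)) if p not in live]


regs = PointerRenamer(num_logical=4, num_physical=8)
regs.write(1, 10)
regs.resolve(prediction_correct=True)   # the write becomes committed state
regs.write(1, 99)                       # speculative overwrite of logical register 1
regs.resolve(prediction_correct=False)  # misprediction: pointers are rolled back
print(regs.read(1))
```

After the misprediction, logical register 1 reads the committed value again, and the only state that changed during recovery was the short pointer lists.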

    Use of Architectural Simulation Tools in Education

    This paper presents the author’s experience in using architectural simulation tools in the instruction of computer architecture courses. In particular, we develop the notion of incrementally building a programmable, trace-driven “timer” tool for use as a learning vehicle. We show how the cycle-by-cycle simulation output of such timers can be used to illustrate performance bottlenecks, and how this and other output statistics can be interpreted to convey key design tuning issues. As part of the overall simulation toolkit, we also use available cache simulators, trace generators and other utilities in illustrating key performance determinants and architectural trade-off issues. Undergraduate or beginning graduate courses in computer architecture, such as those based on the well-known texts by Hennessy and Patterson [1, 2], often use a simple processor, e.g. DLX [1], as a running example to develop and illustrate key machine design concepts. Projects and assignments centered around the example processor are crafted to enable the student to grasp alternate design an

    On the Determinism of Multi-core Processors

    Hard real-time systems are evolving to respond to the increasing demand for complex functionality while taking advantage of newer hardware. Software development for safety-critical systems has to comply with strict requirements that facilitate the certification process. During this process, each part of the system is evaluated, and a certain level of assurance is required in order to provide confidence in the product. In particular, there must be a level of confidence that the system behaves deterministically, which may be based on functionality, resources and time. The success of system verification depends greatly on the capacity to determine its exact behavior. Nonetheless, hardware has evolved to maximize average computational throughput with little to no regard for determinism. As a result, modern architectural features of processors, such as pipelines, cache memories and co-processors, make it hard to verify that all the needed properties are respected. Multi-core processors are even more difficult to analyze, as the architecture employs mechanisms that compromise strong spatial and temporal partitioning when shared resources, such as shared caches or shared inputs/outputs, are used without rigorous access control. In this paper we identify and analyze the main sources of nondeterminism in multi-cores with regard to timing estimation. Precise determination of the worst-case execution time is a challenging task even in single-core architectures. The problems are accentuated in the multi-core context, mainly because resource sharing can lead to highly complex interactions or to nondeterminism. Most of the units that generate behaviors that are hard to take into account can be deactivated, but it is not always easy to predict the impact on performance. Nevertheless, some of the features cannot be disabled (such as out-of-order execution or some nondeterministic crossbar access policies), which invalidates the respective platform for applications with a high criticality level. We address the problematic units, propose configuration or architecture guidelines, and estimate their impact on the performance and determinism of the system.