72 research outputs found

    Vector processor virtualization: distributed memory hierarchy and simultaneous multithreading

    Exploiting DLP (Data-Level Parallelism) is indispensable in most data streaming and multimedia applications. Several architectures have been proposed to improve both the performance and the energy consumption of such applications. Superscalar and VLIW (Very Long Instruction Word) processors, along with SIMD (Single-Instruction Multiple-Data) and vector processor (VP) accelerators, are among the options available to designers for meeting their requirements. However, these choices turn out to be large resource and energy consumers, and they are not always used efficiently because of data dependencies among instructions and the limited portion of vectorizable code in the individual applications that deploy them. This dissertation proposes an innovative architecture for a multithreaded VP which separates the path for performing data shuffle and memory-indexed accesses from the data path for executing other vector instructions that access the memory. This separation speeds up the most common memory access operations by avoiding extra delays and unnecessary stalls. In this multilane-based VP design, each vector lane uses its own private memory to avoid any stalls during memory access instructions. More importantly, the proposed VP has an innovative multithreaded architecture which makes it highly suitable for concurrent sharing in multicore environments. To this end, the VP, which is developed in VHDL and prototyped on an FPGA (Field-Programmable Gate Array), serves as a coprocessor for one or more scalar cores in various system architectures presented in the dissertation. In the first system architecture, the VP is allocated exclusively to a single scalar core. Benchmarking shows that the VP can achieve very high performance. The inclusion of distributed data shuffle engines across vector lanes has a spectacular impact on the execution time, primarily for applications like FFT (Fast Fourier Transform) that require large amounts of data shuffling.
In the second system architecture, a VP virtualization technique is presented which, when applied, enables the multithreaded VP to simultaneously execute many threads of various vector lengths. The threads compete simultaneously for the VP resources, with the goal of improving aggregate VP utilization. This approach yields high VP utilization even under low utilization for the individual threads. A vector register file (VRF) virtualization technique dynamically allocates physical vector registers to running threads. The technique is implemented for a multi-core processor embedded in an FPGA. Under the dynamic creation of threads, benchmarking demonstrates large VP speedups and drastic energy savings when compared to the first system architecture. In the last system architecture, further improvements focus on VP virtualization relying exclusively on hardware. Moreover, a pipelined data shuffle network replaces the non-pipelined shuffle engines. The VP can then take advantage of identical instruction flows that may be present in different vector applications by running in a fused instruction mode that increases its utilization. A power dissipation model is introduced, along with two optimization policies that minimize either the consumed energy or the product of energy and runtime for a given application. Benchmarking shows the positive impact of these optimizations.
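The two optimization policies named at the end of this abstract (minimum energy versus minimum energy-runtime product) can pick different configurations. The Python sketch below illustrates that difference only. The `Config` type, the lane counts, and all power and runtime figures are hypothetical values chosen for illustration, and energy is assumed to be average power times runtime; the dissertation's actual power model is not described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    """A hypothetical VP configuration with assumed power and runtime."""
    name: str
    power_w: float    # assumed average power, in watts
    runtime_s: float  # assumed runtime, in seconds

    @property
    def energy_j(self) -> float:
        # Energy = average power x runtime (simplifying assumption).
        return self.power_w * self.runtime_s

    @property
    def edp(self) -> float:
        # Energy-delay product = energy x runtime.
        return self.energy_j * self.runtime_s


def pick_min_energy(configs):
    """Policy 1: choose the configuration with the lowest energy."""
    return min(configs, key=lambda c: c.energy_j)


def pick_min_edp(configs):
    """Policy 2: choose the configuration with the lowest energy-runtime product."""
    return min(configs, key=lambda c: c.edp)


# Made-up numbers, for illustration only.
CANDIDATES = [
    Config("2 lanes", power_w=1.0, runtime_s=4.0),  # E = 4.0 J, EDP = 16.0
    Config("4 lanes", power_w=2.5, runtime_s=2.0),  # E = 5.0 J, EDP = 10.0
    Config("8 lanes", power_w=6.0, runtime_s=1.5),  # E = 9.0 J, EDP = 13.5
]

if __name__ == "__main__":
    print("min energy:", pick_min_energy(CANDIDATES).name)
    print("min EDP:   ", pick_min_edp(CANDIDATES).name)
```

With these assumed figures the energy policy favors the slowest, lowest-power configuration, while the energy-runtime policy favors a faster one, because runtime enters the product twice.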

    Multicore and FPGA implementations of emotional-based agent architectures

    The final publication is available at Springer via http://dx.doi.org/10.1007/s11227-014-1307-6. Control architectures based on emotions are becoming promising solutions for the implementation of future robotic agents. The basic controllers of the architecture are the emotional processes that decide which behaviors of the robot must activate to fulfill the objectives. The number of emotional processes increases (hundreds of millions/s) with the complexity level of the application, reducing the processing capacity of the main processor to solve complex problems (millions of decisions in a given instant). However, the potential parallelism of the emotional processes permits their execution in parallel on FPGAs or multicores, thus freeing slack in the main processor to tackle more complex dynamic problems. In this paper, an emotional architecture for mobile robotic agents is presented. The workload of the emotional processes is evaluated. Then, the main processor is extended with FPGA co-processors through an Ethernet link. The FPGAs are in charge of executing the emotional processes in parallel. Different Stratix FPGAs are compared to analyze their suitability for the proposed mobile robotic agent applications. The applications are set up taking into account different environmental conditions, robot dynamics and emotional states. Moreover, the applications are also run on multicore processors to compare their performance with that of the FPGAs. Experimental results show that a Stratix IV FPGA increases the performance by about one order of magnitude over the main processor and solves all the considered problems. A Quad-Core increases the performance by a factor of 3.64, making it possible to tackle about 89% of the considered problems. A Quad-Core costs less than a Stratix IV, so it is the more suitable solution, except for the most complex application.
A Stratix III could be applied to solve problems with around double the requirements that the main processor could support. Finally, a Dual-Core provides slightly better performance than a Stratix III and is relatively cheaper. This work was supported in part under Spanish Grant PAID/2012/325 of "Programa de Apoyo a la Investigacion y Desarrollo. Proyectos multidisciplinares", Universitat Politecnica de Valencia, Spain. Domínguez Montagud, CP.; Hassan Mohamed, H.; Crespo, A.; Albaladejo Meroño, J. (2015). Multicore and FPGA implementations of emotional-based agent architectures. Journal of Supercomputing. 71(2):479-507. https://doi.org/10.1007/s11227-014-1307-6

    Superscalar RISC-V Processor with SIMD Vector Extension

    With the increasing number of digital products in the market, the need for robust and highly configurable processors rises. The demand is met by the stable and extensible open-source RISC-V instruction set architecture. RISC-V processors are becoming popular in many fields of application and research. This thesis presents a dual-issue superscalar RISC-V processor design with dynamic execution. The proposed design employs the global sharing scheme for branch prediction and the Tomasulo algorithm for out-of-order execution. The processor is capable of speculative execution with five checkpoints. Data flow in the instruction dispatch and commit stages is optimized to achieve higher instruction throughput. The superscalar processor is extended with a customized vector instruction set of single-instruction-multiple-data computations, specifically to improve performance on machine learning tasks. Following the definition of the proposed vector instruction set, the scratchpad memory and element-wise arithmetic units are implemented in the vector co-processor. Different test programs are evaluated on the fully tested superscalar processor. Compared to the reference work, the proposed design improves average instruction throughput by 18.9% and average prediction hit rate by 4.92%, with a 16.9% higher operating clock frequency when synthesized on the Intel Arria 10 FPGA board. The forward propagation of a convolutional neural network model is evaluated on the standalone superscalar processor and on the integration with the vector co-processor. The vector program with software-level optimizations achieves a 9.53× improvement in instruction throughput and a 10.18× improvement in real-time throughput. Moreover, the integration also provides 2.22× the energy efficiency of the superscalar processor alone.
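The "global sharing scheme" mentioned in this abstract is commonly known as gshare: the branch address is XORed with a global history register to index a table of 2-bit saturating counters. The Python sketch below shows that textbook scheme only. The table size, the counter initialization, and the `hit_rate` helper are assumptions made for illustration and are not taken from the thesis.

```python
class GsharePredictor:
    """Textbook gshare predictor: PC XOR global history indexes 2-bit counters."""

    def __init__(self, index_bits: int = 10):
        # index_bits is an assumed table size, not a value from the thesis.
        self.mask = (1 << index_bits) - 1
        self.history = 0                          # global branch history register
        self.counters = [1] * (1 << index_bits)   # start at "weakly not taken"

    def _index(self, pc: int) -> int:
        # Drop the two low bits (4-byte aligned instructions), then XOR with history.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc: int) -> bool:
        # Counter values 2 and 3 predict "taken".
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        # Shift the actual outcome into the global history.
        self.history = ((self.history << 1) | int(taken)) & self.mask


def hit_rate(predictor: GsharePredictor, trace) -> float:
    """Fraction of correct predictions over a list of (pc, taken) pairs."""
    hits = 0
    for pc, taken in trace:
        hits += predictor.predict(pc) == taken
        predictor.update(pc, taken)
    return hits / len(trace)


if __name__ == "__main__":
    # A loop branch that is always taken; the predictor warms up, then hits.
    trace = [(0x1000, True)] * 100
    print(f"hit rate: {hit_rate(GsharePredictor(index_bits=4), trace):.2f}")
```

The predictor misses while the history register is still filling, since each new history value selects a fresh counter, and it predicts correctly once the history settles.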

    Driving the Network-on-Chip Revolution to Remove the Interconnect Bottleneck in Nanoscale Multi-Processor Systems-on-Chip

    The sustained demand for faster, more powerful chips has been met by the availability of chip manufacturing processes allowing for the integration of increasing numbers of computation units onto a single die. The resulting outcome, especially in the embedded domain, has often been called SYSTEM-ON-CHIP (SoC) or MULTI-PROCESSOR SYSTEM-ON-CHIP (MP-SoC). MPSoC design brings to the foreground a large number of challenges, one of the most prominent of which is the design of the chip interconnection. With a number of on-chip blocks presently ranging in the tens, and quickly approaching the hundreds, the novel issue of how to best provide on-chip communication resources is clearly felt. NETWORKS-ON-CHIPS (NoCs) are the most comprehensive and scalable answer to this design concern. By bringing large-scale networking concepts to the on-chip domain, they guarantee a structured answer to present and future communication requirements. The point-to-point connection and packet switching paradigms they involve are also of great help in minimizing wiring overhead and physical routing issues. However, as with any technology of recent inception, NoC design is still an evolving discipline. Several main areas of interest require deep investigation for NoCs to become viable solutions:
    • The design of the NoC architecture needs to strike the best tradeoff among performance, features and the tight area and power constraints of the on-chip domain.
    • Simulation and verification infrastructure must be put in place to explore, validate and optimize the NoC performance.
    • NoCs offer a huge design space, thanks to their extreme customizability in terms of topology and architectural parameters. Design tools are needed to prune this space and pick the best solutions.
    • Even more so given their global, distributed nature, it is essential to study the physical implementation of NoCs to evaluate their suitability for next-generation designs and their area and power costs.
This dissertation performs a design space exploration of network-on-chip architectures, in order to point out the trade-offs associated with the design of each individual network building block and with the design of the network topology overall. The design space exploration is preceded by a comparative analysis of state-of-the-art interconnect fabrics with one another and with early network-on-chip prototypes. The ultimate objective is to point out the key advantages that NoC realizations provide with respect to state-of-the-art communication infrastructures and to point out the challenges that lie ahead in order to make this new interconnect technology come true. Among the latter, technology-related challenges are emerging that call for dedicated design techniques at all levels of the design hierarchy, in particular leakage power dissipation and the containment of process variations and of their effects. The achievement of the above objectives was enabled by means of a NoC simulation environment for cycle-accurate modelling and simulation and by means of a back-end facility for the study of NoC physical implementation effects. Overall, all the results provided by this work have been validated on actual silicon layout.

    Defining interfaces between hardware and software: Quality and performance

    One of the most important interfaces in a computer system is the interface between hardware and software. This interface is the contract between the hardware designer and the programmer that defines the functional behaviour of the hardware. This thesis examines two critical aspects of defining the hardware-software interface: quality and performance. The first aspect is creating a high quality specification of the interface as conventionally defined in an instruction set architecture. The majority of this thesis is concerned with creating a specification that covers the full scope of the interface; that is applicable to all current implementations of the architecture; and that can be trusted to accurately describe the behaviour of implementations of the architecture. We describe the development of a formal specification of the two major types of Arm processors: A-class (for mobile devices such as phones and tablets) and M-class (for micro-controllers). These specifications are unparalleled in their scope, applicability and trustworthiness. This thesis identifies and illustrates what we consider the key ingredient in achieving this goal: creating a specification that is used by many different user groups. Supporting many different groups leads to improved quality as each group finds different problems in the specification; and, by providing value to each different group, it helps justify the considerable effort required to create a high quality specification of a major processor architecture. 
The work described in this thesis led to a step change in Arm's ability to use formal verification techniques to detect errors in their processors; enabled extensive testing of the specification against Arm's official architecture conformance suite; improved the quality of Arm's architecture conformance suite based on measuring the architectural coverage of the tests; supported earlier, faster development of architecture extensions by enabling animation of changes as they are being made; and enabled early detection of problems created from architecture extensions by performing formal validation of the specification against semi-structured natural language specifications. As far as we are aware, no other mainstream processor architecture has this capability. The formal specifications are included in Arm's publicly released architecture reference manuals and the A-class specification is also released in machine-readable form. The second aspect is creating a high performance interface by defining the hardware-software interface of a software-defined radio subsystem using a programming language. That is, an interface that allows software to exploit the potential performance of the underlying hardware. While the hardware-software interface is normally defined in terms of machine code, peripheral control registers and memory maps, we define it using a programming language instead. This higher level interface provides the opportunity for compilers to hide some of the low-level differences between different systems from the programmer: a potentially very efficient way of providing a stable, portable interface without having to add hardware to provide portability between different hardware platforms. We describe the design and implementation of a set of extensions to the C programming language to support programming high performance, energy efficient, software defined radio systems. 
The language extensions enable the programmer to exploit the pipeline parallelism typically present in digital signal processing applications and to make efficient use of the asymmetric multiprocessor systems designed to support such applications. The extensions consist primarily of annotations that can be checked for consistency and that support annotation inference in order to reduce the number of annotations required. Reducing the number of annotations does not just save programmer effort; it also improves portability by reducing the number of annotations that need to be changed when porting an application from one platform to another. This work formed part of a project that developed a high-performance, energy-efficient, software-defined radio capable of implementing the physical layers of the 4G cellphone standard (LTE), 802.11a WiFi and Digital Video Broadcast (DVB) with a power and silicon area budget that was competitive with a conventional custom ASIC solution. The Arm architecture is the largest computer architecture by volume in the world. It behooves us to ensure that the interface it describes is appropriately defined.

    A configurable vector processor for accelerating speech coding algorithms

    The growing demand for voice-over-packet (VoIP) services and multimedia-rich applications has made the efficient, real-time implementation of low-bit-rate speech coders on embedded VLSI platforms increasingly important. Such speech coders are designed to substantially reduce bandwidth requirements, thus enabling dense multichannel gateways in a small form factor. This, however, comes at a high computational cost which mandates the use of very high performance embedded processors. This thesis investigates the potential acceleration of two major ITU-T speech coding algorithms, namely G.729A and G.723.1, through their efficient implementation on a configurable, extensible vector embedded CPU architecture. New scalar and vector ISAs were introduced which resulted in up to 80% reduction in the dynamic instruction count of both workloads. These instructions were subsequently encapsulated into a parametric, hybrid SISD (scalar processor)–SIMD (vector) processor. This work presents the research and implementation of the vector datapath of this vector coprocessor, which is tightly coupled to a Sparc-V8 compliant CPU, the optimization and simulation methodologies employed, and the use of Electronic System Level (ESL) techniques to rapidly design SIMD datapaths.