Search CORE

112 research outputs found

ACOTES project: Advanced compiler technologies for embedded streaming

Author: Albert Cohen
Alex Ramírez
Andrea Ornstein
Antoniu Pop
Ayal Zaks
Cupertino Miranda
Cédric Bastoul
David Ródenas
Dorit Nuzman
E. Blossom
E.A. Lee
Eduard Ayguadé
Erven Rohou
Harm Munk
Ira Rosen
J. Hoogerbrugge
Konrad Trifunović
Louis-Noël Pouchet
M. Gschwind
M. Wolfe
Marc Duranton
Marco Cornero
Menno Lindwer
Mohammed Fellahi
Paul Carpenter
Philippe Dumont
R. Allen
R.G. Scarborough
Razya Ladelsky
Roger Ferrer
S. Campanoni
Sebastian Pop
Uzi Shvadron
Xavier Martorell
Zbigniew Chamski
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011
Field of study

Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.Peer ReviewedPostprint (published version

HAL-CentraleSupelec

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

UPCommons. Portal del coneixement obert de la UPC

INRIA a CCSD electronic archive server

HAL-MINES ParisTech

The University of Manchester - Institutional Repository

HAL-Rennes 1

Design and Performance of Scalable High-Performance Programmable Routers - Doctoral Dissertation, August 2002

Author: Wolf Tilman
Publication venue: Washington University Open Scholarship
Publication date: 03/05/2002
Field of study

The flexibility to adapt to new services and protocols without changes in the underlying hardware is and will increasingly be a key requirement for advanced networks. Introducing a processing component into the data path of routers and implementing packet processing in software provides this ability. In such a programmable router, a powerful processing infrastructure is necessary to achieve to level of performance that is comparable to custom silicon-based routers and to demonstrate the feasibility of this approach. This work aims at the general design of such programmable routers and, specifically, at the design and performance analysis of the processing subsystem. The necessity of programmable routers is motivated, and a router design is proposed. Based on the design, a general performance model is developed and quantitatively evaluated using a new network processor benchmark. Operational challenges, like scheduling of packets to processing engines, are addressed, and novel algorithms are presented. The results of this work give qualitative and quantitative insights into this new domain that combines issues from networking, computer architecture, and system design

Washington University St. Louis: Open Scholarship

Connecting Palladio with multicore CPU simulators

Author: Graef Sebastian
Publication venue
Publication date: 01/01/2018
Field of study

In Software Engineering simulators are typically used for Software Performance En- gineering (SPE). It is important that the simulations are accurate in order to allow engineers to predict the performance in detail. Palladio is one of these approaches. Currently, Palladio only supports single-core CPU simulators, but there is also an auxiliary approach for multicore simulation. The main problem of this approach is the huge inaccuracy, which is about 74% with 16 cores. This bachelor thesis aims to investigate and improve Palladio’s performance in hardware CPU simulation and performance prediction. This work presents a new approach for connecting a multicore CPU simulator to Palladio to improve the simulation accuracy. The result of this thesis is a conceptual implemen- tation of an embedded multicore CPU Simulator in Palladio to enable more accurate multicore performance predictions. The presented approach enables Palladio to connect to a multicore simulator called MaxSim via a Java prototype, but the predictions aren’t more accurate in general. With a mean speedup deviation of 67.81% at 16 cores, the simulation is only slightly more accurate for the tested system.Softwareingenieure verwenden in der Regel Simulatoren für das Software Performance Engineering (SPE). Die Simulationsergebnisse müssen dabei genau sein, damit die Ingenieure die Leistung detailliert vorhersagen können. Palladio ist eines der Tools, welches für SPE eingesetzt wird. Aktuell unterstützt Palladio nur Single-Core CPU-Simulatoren, allerdings existiert auch ein BehilfsAnsatz für die Multicore-Simulation. Das Problem des Ansatzes ist die enorme Ungenauigkeit, welche bei 16 Cores rund 74, 48% beträgt. Diese Bachelorarbeit zielt darauf ab, die Leistungsfähigkeit von Palladio bei der Abbildung komplexer Architekturen auf Hardwaremodelle und die Genauigkeit der Leistungsprognosen zu untersuchen. In dieser Arbeit wird ein neuer Ansatz zur Anbindung eines Multicore-CPU-Simulators an eine bestehende Palladio-Komponente vorgestellt, um die Simulations-Genauigkeit für Multicore Leistungs-Prognosen zu verbessern. Der Vorgestellte Ansatz konnte mittels MaxSim und ProtoCom umgesetzt werden, jedoch sind die Vorhersagen im Allgemeinen nicht genauer. Die Leistungsprognose ist mit einer mittleren Abweichung der Beschleunigung von −67, 81% bei 16 Cores, für den getesteten Fall lediglich unwesentlich geringer

Prefetching techniques for client server object-oriented database systems

Author: Knafla Nils
Publication venue: The University of Edinburgh
Publication date: 01/01/1999
Field of study

The performance of many object-oriented database applications suffers from the page fetch latency which is determined by the expense of disk access. In this work we suggest several prefetching techniques to avoid, or at least to reduce, page fetch latency. In practice no prediction technique is perfect and no prefetching technique can entirely eliminate delay due to page fetch latency. Therefore we are interested in the trade-off between the level of accuracy required for obtaining good results in terms of elapsed time reduction and the processing overhead needed to achieve this level of accuracy. If prefetching accuracy is high then the total elapsed time of an application can be reduced significantly otherwise if the prefetching accuracy is low, many incorrect pages are prefetched and the extra load on the client, network, server and disks decreases the whole system performance. Access pattern of object-oriented databases are often complex and usually hard to predict accurately. The ..

CiteSeerX

Edinburgh Research Archive

Multi-core architectures with coarse-grained dynamically reconfigurable processors for broadband wireless access technologies

Author: Han Wei
Publication venue: The University of Edinburgh
Publication date: 01/01/2010
Field of study

Broadband Wireless Access technologies have significant market potential, especially the WiMAX protocol which can deliver data rates of tens of Mbps. Strong demand for high performance WiMAX solutions is forcing designers to seek help from multi-core processors that offer competitive advantages in terms of all performance metrics, such as speed, power and area. Through the provision of a degree of flexibility similar to that of a DSP and performance and power consumption advantages approaching that of an ASIC, coarse-grained dynamically reconfigurable processors are proving to be strong candidates for processing cores used in future high performance multi-core processor systems. This thesis investigates multi-core architectures with a newly emerging dynamically reconfigurable processor – RICA, targeting WiMAX physical layer applications. A novel master-slave multi-core architecture is proposed, using RICA processing cores. A SystemC based simulator, called MRPSIM, is devised to model this multi-core architecture. This simulator provides fast simulation speed and timing accuracy, offers flexible architectural options to configure the multi-core architecture, and enables the analysis and investigation of multi-core architectures. Meanwhile a profiling-driven mapping methodology is developed to partition the WiMAX application into multiple tasks as well as schedule and map these tasks onto the multi-core architecture, aiming to reduce the overall system execution time. Both the MRPSIM simulator and the mapping methodology are seamlessly integrated with the existing RICA tool flow. Based on the proposed master-slave multi-core architecture, a series of diverse homogeneous and heterogeneous multi-core solutions are designed for different fixed WiMAX physical layer profiles. Implemented in ANSI C and executed on the MRPSIM simulator, these multi-core solutions contain different numbers of cores, combine various memory architectures and task partitioning schemes, and deliver high throughputs at relatively low area costs. Meanwhile a design space exploration methodology is developed to search the design space for multi-core systems to find suitable solutions under certain system constraints. Finally, laying a foundation for future multithreading exploration on the proposed multi-core architecture, this thesis investigates the porting of a real-time operating system – Micro C/OS-II to a single RICA processor. A multitasking version of WiMAX is implemented on a single RICA processor with the operating system support

Edinburgh Research Archive

Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multi-Precision Floating Point Support in 22 nm FD-SOI

Author: Benini Luca
Cavalcante Matheus
Schaffner Michael
Schuiki Fabian
Zaruba Florian
Publication venue
Publication date: 27/10/2019
Field of study

In this paper, we present Ara, a 64-bit vector processor based on the version 0.5 draft of RISC-V's vector extension, implemented in GlobalFoundries 22FDX FD-SOI technology. Ara's microarchitecture is scalable, as it is composed of a set of identical lanes, each containing part of the processor's vector register file and functional units. It achieves up to 97% FPU utilization when running a 256 x 256 double precision matrix multiplication on sixteen lanes. Ara runs at more than 1 GHz in the typical corner (TT/0.80V/25 oC) achieving a performance up to 33 DP-GFLOPS. In terms of energy efficiency, Ara achieves up to 41 DP-GFLOPS/W under the same conditions, which is slightly superior to similar vector processors found in literature. An analysis on several vectorizable linear algebra computation kernels for a range of different matrix and vector sizes gives insight into performance limitations and bottlenecks for vector processors and outlines directions to maintain high energy efficiency even for small matrix sizes where the vector architecture achieves suboptimal utilization of the available FPUs.Comment: 13 pages. Accepted for publication in IEEE Transactions on Very Large Scale Integration System

arXiv.org e-Print Archive

Code Generation and Global Optimization Techniques for a Reconfigurable PRAM-NUMA Multicore Architecture

Author
Publication venue: 'Linkoping University Electronic Press'
Publication date
Field of study

Crossref

Programming techniques for efficient and interoperable software defined radios

Author: DI DIO MARIO
Publication venue: 'Pisa University Press'
Publication date: 29/04/2013
Field of study

Recently, Software-Dened Radios (SDRs) has became a hot research topic in wireless communications eld. This is jointly due to the increasing request of reconfigurable and interoperable multi-standard radio systems able to learn from their surrounding environment and efficiently exploit the available frequency spectrum resources, so realizing the cognitive radio paradigm, and to the availability of reprogrammable hardware architectures providing the computing power necessary to meet the tight real-time constraints typical of the state-of-art wideband communications standards. Most SDR implementations are based on mixed architectures in which Field Programmable Gate Arrays (FPGA), Digital Signal Processors (DSP) and General Purpose Processors (GPP) coexist. GPP-based solutions, even if providing the highest level of flexibility, are typically avoided because of their computational inefficiency and power consumption. Starting from these assumptions, this thesis tries to jointly face two of the main important issues in GPP-based SDR systems: the computational efficiency and the interoperability capacity. In the first part, this thesis presents the potential of a novel programming technique, named Memory Acceleration (MA), in which the memory resources typical of GPP-based systems are used to assist central processor in executing real-time signal processing operations. This technique, belonging to the classical computer-science optimization techniques known as Space-Time trade-offs, defines novel algorithmic methods to assist developers in designing their software-defined signal processing algorithms. In order to show its applicability some "real-world" case studies are presented together with the acceleration factor obtained. In the second part of the thesis, the interoperability issue in SDR systems is also considered. Existing software architectures, like the Software Communications Architecture (SCA), abstract the hardware/software components of a radio communications chain using a middleware like CORBA for providing full portability and interoperability to the implemented chain, called waveform in the SCA parlance. This feature is paid in terms of computational overhead introduced by the software communications middleware and this is one of the reasons why GPP-based architecture are generally discarded also for the implementation of narrow-band SCA-compliant communications standards. In this thesis we briefly analyse SCA architecture and an open-source SCA-compliant framework, ie. OSSIE, and provide guidelines to enable component-based multithreading programming and CPU affinity in that framework. We also detail the implementation of a real-time SCA-compliant waveform developed inside this modified framework, i.e. the VHF analogue aeronautical communications transceiver. Finally, we provide the proof of how it is possible to implement an efficient and interoperable real-time wideband SCA-compliant waveform, i.e. the AeroMACS waveform, on a GPP-based architecture by merging the acceleration factor provided by MA technique and the interoperability feature ensured by SCA architecture

Electronic Thesis and Dissertation Archive - Università di Pisa

Understanding the Performance of Managed Runtime Environments

Author: Hartley Timothy
Publication venue
Publication date: 31/12/2022
Field of study

The University of Manchester - Institutional Repository

Large-scale Wireless Local-area Network Measurement and Privacy Analysis

Author: Tan Keren
Publication venue: Dartmouth Digital Commons
Publication date: 01/08/2011
Field of study

The edge of the Internet is increasingly becoming wireless. Understanding the wireless edge is therefore important for understanding the performance and security aspects of the Internet experience. This need is especially necessary for enterprise-wide wireless local-area networks (WLANs) as organizations increasingly depend on WLANs for mission- critical tasks. To study a live production WLAN, especially a large-scale network, is a difficult undertaking. Two fundamental difficulties involved are (1) building a scalable network measurement infrastructure to collect traces from a large-scale production WLAN, and (2) preserving user privacy while sharing these collected traces to the network research community. In this dissertation, we present our experience in designing and implementing one of the largest distributed WLAN measurement systems in the United States, the Dartmouth Internet Security Testbed (DIST), with a particular focus on our solutions to the challenges of efficiency, scalability, and security. We also present an extensive evaluation of the DIST system. To understand the severity of some potential trace-sharing risks for an enterprise-wide large-scale wireless network, we conduct privacy analysis on one kind of wireless network traces, a user-association log, collected from a large-scale WLAN. We introduce a machine-learning based approach that can extract and quantify sensitive information from a user-association log, even though it is sanitized. Finally, we present a case study that evaluates the tradeoff between utility and privacy on WLAN trace sanitization

Dartmouth Digital Commons (Dartmouth College)