112 research outputs found
ACOTES project: Advanced compiler technologies for embedded streaming
Streaming applications are built of data-driven computational components that consume and produce unbounded data streams. Streaming-oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is challenging: the computation must be carefully partitioned and mapped to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of proposed solutions whose goal is to improve programmer productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. A more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of the Advanced Compiler Technologies that we developed to support Embedded Streaming.
Peer reviewed. Postprint (published version).
Design and Performance of Scalable High-Performance Programmable Routers - Doctoral Dissertation, August 2002
The flexibility to adapt to new services and protocols without changes in the underlying hardware is, and will increasingly be, a key requirement for advanced networks. Introducing a processing component into the data path of routers and implementing packet processing in software provides this ability. In such a programmable router, a powerful processing infrastructure is necessary to achieve a level of performance that is comparable to custom silicon-based routers and to demonstrate the feasibility of this approach. This work aims at the general design of such programmable routers and, specifically, at the design and performance analysis of the processing subsystem. The necessity of programmable routers is motivated, and a router design is proposed. Based on the design, a general performance model is developed and quantitatively evaluated using a new network processor benchmark. Operational challenges, like scheduling of packets to processing engines, are addressed, and novel algorithms are presented. The results of this work give qualitative and quantitative insights into this new domain that combines issues from networking, computer architecture, and system design
Connecting Palladio with multicore CPU simulators
In software engineering, simulators are typically used for Software Performance Engineering (SPE). It is important that the simulations are accurate in order to allow engineers to predict the performance in detail.
Palladio is one of these approaches. Currently, Palladio only supports single-core CPU simulators, but there is also an auxiliary approach for multicore simulation. The main problem of this approach is the huge inaccuracy, which is about 74% with 16 cores. This bachelor thesis aims to investigate and improve Palladio’s performance in hardware CPU simulation and performance prediction.
This work presents a new approach for connecting a multicore CPU simulator to Palladio to improve the simulation accuracy. The result of this thesis is a conceptual implementation of an embedded multicore CPU simulator in Palladio to enable more accurate multicore performance predictions.
The presented approach enables Palladio to connect to a multicore simulator called MaxSim via a Java prototype, but the predictions are not more accurate in general. With a mean speedup deviation of 67.81% at 16 cores, the simulation is only slightly more accurate for the tested system.
Software engineers typically use simulators for Software Performance Engineering (SPE). The simulation results must be accurate so that engineers can predict performance in detail.
Palladio is one of the tools used for SPE. Currently, Palladio supports only single-core CPU simulators, although a workaround approach for multicore simulation also exists. The problem with this approach is its enormous inaccuracy, which amounts to about 74.48% at 16 cores. This bachelor thesis aims to investigate Palladio's capability in mapping complex architectures onto hardware models and the accuracy of its performance predictions.
This work presents a new approach for connecting a multicore CPU simulator to an existing Palladio component in order to improve the simulation accuracy of multicore performance predictions.
The presented approach could be implemented using MaxSim and ProtoCom, but the predictions are not more accurate in general. With a mean speedup deviation of −67.81% at 16 cores, the deviation of the performance prediction is only marginally lower for the tested case
Prefetching techniques for client server object-oriented database systems
The performance of many object-oriented database applications suffers from page fetch latency, which is determined by the expense of disk access. In this work we suggest several prefetching techniques to avoid, or at least to reduce, page fetch latency. In practice no prediction technique is perfect and no prefetching technique can entirely eliminate delay due to page fetch latency. Therefore we are interested in the trade-off between the level of accuracy required for obtaining good results in terms of elapsed time reduction and the processing overhead needed to achieve this level of accuracy. If prefetching accuracy is high, the total elapsed time of an application can be reduced significantly; if prefetching accuracy is low, many incorrect pages are prefetched and the extra load on the client, network, server and disks decreases the performance of the whole system. Access patterns of object-oriented databases are often complex and usually hard to predict accurately. The ..
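The accuracy trade-off described in this abstract can be illustrated with a toy cost model. This is only a sketch under stated assumptions, not the model used in the work: the function names, the parameters (`wasted_cost`, `predict_cost`) and the linear cost structure are all hypothetical.

```python
def baseline_time(n_pages: int, fetch_latency: float) -> float:
    """Elapsed time without prefetching: every access pays the full latency."""
    return n_pages * fetch_latency


def elapsed_time(n_pages: int, fetch_latency: float, accuracy: float,
                 wasted_cost: float, predict_cost: float) -> float:
    """Toy model of elapsed time with prefetching (hypothetical, illustrative).

    accuracy     -- fraction of accesses whose page was correctly prefetched
    wasted_cost  -- assumed extra delay per wrong prefetch (client, network,
                    server and disk load)
    predict_cost -- assumed processing overhead per access for the prediction
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    miss_rate = 1.0 - accuracy
    # A miss pays the normal fetch latency plus the cost of the useless prefetch;
    # every access pays the prediction overhead.
    per_page = miss_rate * (fetch_latency + wasted_cost) + predict_cost
    return n_pages * per_page


if __name__ == "__main__":
    for acc in (0.2, 0.5, 0.9):
        t = elapsed_time(1000, 10.0, acc, wasted_cost=5.0, predict_cost=1.0)
        print(f"accuracy={acc:.1f}  time={t:.0f}  baseline={baseline_time(1000, 10.0):.0f}")
```

With these assumed numbers, high accuracy beats the no-prefetch baseline while low accuracy is worse than not prefetching at all, which is the qualitative behaviour the abstract describes.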
Multi-core architectures with coarse-grained dynamically reconfigurable processors for broadband wireless access technologies
Broadband Wireless Access technologies have significant market potential, especially the
WiMAX protocol which can deliver data rates of tens of Mbps. Strong demand for high
performance WiMAX solutions is forcing designers to seek help from multi-core processors
that offer competitive advantages in terms of all performance metrics, such as speed, power
and area. Through the provision of a degree of flexibility similar to that of a DSP and
performance and power consumption advantages approaching that of an ASIC,
coarse-grained dynamically reconfigurable processors are proving to be strong candidates
for processing cores used in future high performance multi-core processor systems.
This thesis investigates multi-core architectures with a newly emerging dynamically
reconfigurable processor – RICA, targeting WiMAX physical layer applications. A novel
master-slave multi-core architecture is proposed, using RICA processing cores. A SystemC
based simulator, called MRPSIM, is devised to model this multi-core architecture. This
simulator provides fast simulation speed and timing accuracy, offers flexible architectural
options to configure the multi-core architecture, and enables the analysis and investigation
of multi-core architectures. Meanwhile a profiling-driven mapping methodology is
developed to partition the WiMAX application into multiple tasks as well as schedule and
map these tasks onto the multi-core architecture, aiming to reduce the overall system
execution time. Both the MRPSIM simulator and the mapping methodology are seamlessly
integrated with the existing RICA tool flow.
Based on the proposed master-slave multi-core architecture, a series of diverse
homogeneous and heterogeneous multi-core solutions are designed for different fixed
WiMAX physical layer profiles. Implemented in ANSI C and executed on the MRPSIM
simulator, these multi-core solutions contain different numbers of cores, combine various memory architectures and task partitioning schemes, and deliver high throughputs at
relatively low area costs. Meanwhile a design space exploration methodology is developed
to search the design space for multi-core systems to find suitable solutions under certain
system constraints. Finally, laying a foundation for future multithreading exploration on the
proposed multi-core architecture, this thesis investigates the porting of a real-time operating
system – Micro C/OS-II to a single RICA processor. A multitasking version of WiMAX is
implemented on a single RICA processor with the operating system support
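The idea of mapping profiled tasks onto cores to reduce overall execution time can be sketched with a simple greedy heuristic. This is an assumption-laden illustration, not the thesis's mapping methodology: it uses longest-processing-time-first assignment, ignores task dependencies and communication costs, and the task names and cycle counts are hypothetical.

```python
import heapq


def map_tasks(task_cycles: dict, n_cores: int) -> list:
    """Greedy longest-task-first mapping of profiled tasks onto cores.

    task_cycles -- profiled cost per task (hypothetical profile data)
    Returns one list of task names per core.
    """
    if n_cores < 1:
        raise ValueError("need at least one core")
    # Min-heap of (accumulated load, core index): the least loaded core is on top.
    heap = [(0, core) for core in range(n_cores)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_cores)]
    # Place the most expensive tasks first, each on the least loaded core.
    for name, cycles in sorted(task_cycles.items(), key=lambda kv: kv[1], reverse=True):
        load, core = heapq.heappop(heap)
        assignment[core].append(name)
        heapq.heappush(heap, (load + cycles, core))
    return assignment


if __name__ == "__main__":
    profile = {"fft": 8, "demod": 5, "fec": 4, "crc": 3}  # made-up numbers
    print(map_tasks(profile, 2))
```

A realistic mapper would also have to respect the data flow between tasks and the memory architecture, which this sketch deliberately leaves out.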
Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multi-Precision Floating Point Support in 22 nm FD-SOI
In this paper, we present Ara, a 64-bit vector processor based on the version
0.5 draft of RISC-V's vector extension, implemented in GlobalFoundries 22FDX
FD-SOI technology. Ara's microarchitecture is scalable, as it is composed of a
set of identical lanes, each containing part of the processor's vector register
file and functional units. It achieves up to 97% FPU utilization when running a
256 x 256 double precision matrix multiplication on sixteen lanes. Ara runs at
more than 1 GHz in the typical corner (TT/0.80V/25 °C) achieving a performance
up to 33 DP-GFLOPS. In terms of energy efficiency, Ara achieves up to 41
DP-GFLOPS/W under the same conditions, which is slightly superior to similar
vector processors found in literature. An analysis on several vectorizable
linear algebra computation kernels for a range of different matrix and vector
sizes gives insight into performance limitations and bottlenecks for vector
processors and outlines directions to maintain high energy efficiency even for
small matrix sizes where the vector architecture achieves suboptimal
utilization of the available FPUs.
Comment: 13 pages. Accepted for publication in IEEE Transactions on Very Large
Scale Integration (VLSI) Systems
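The headline figures in this abstract imply an approximate power draw, since throughput divided by energy efficiency gives power. A back-of-the-envelope check, assuming both "up to" figures (33 DP-GFLOPS and 41 DP-GFLOPS/W) refer to the same operating point, as the abstract states; the function name is illustrative and the result is only an estimate:

```python
def implied_power_watts(gflops: float, gflops_per_watt: float) -> float:
    """Power implied by a throughput figure and an efficiency figure."""
    if gflops_per_watt <= 0:
        raise ValueError("efficiency must be positive")
    return gflops / gflops_per_watt


if __name__ == "__main__":
    # Figures quoted in the abstract; both are stated as upper bounds.
    power = implied_power_watts(33.0, 41.0)
    print(f"implied power: {power:.2f} W")
```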
Programming techniques for efficient and interoperable software defined radios
Recently, Software-Defined Radios (SDRs) have become a hot research topic in the wireless communications field. This is jointly due to the increasing demand for reconfigurable and interoperable multi-standard radio systems able to learn from their surrounding
environment and efficiently exploit the available frequency spectrum resources, thus realizing the cognitive radio paradigm, and to the availability of reprogrammable hardware architectures providing the computing power necessary to meet the tight
real-time constraints typical of state-of-the-art wideband communications standards.
Most SDR implementations are based on mixed architectures in which Field Programmable Gate Arrays (FPGA), Digital Signal Processors (DSP) and General Purpose Processors (GPP) coexist. GPP-based solutions, even if providing the highest
level of flexibility, are typically avoided because of their computational inefficiency
and power consumption.
Starting from these assumptions, this thesis jointly addresses two of the most important issues in GPP-based SDR systems: computational efficiency and interoperability. In the first part, this thesis presents the potential of a novel programming technique, named Memory Acceleration (MA), in which the memory resources typical of GPP-based systems are used to assist the central processor in executing real-time signal processing operations. This technique, belonging to the classical computer-science optimization techniques known as space-time trade-offs, defines
novel algorithmic methods to assist developers in designing their software-defined signal processing algorithms. In order to show its applicability some "real-world" case studies are presented together with the acceleration factor obtained. In the second part of the thesis, the interoperability issue in SDR systems is also considered. Existing software architectures, like the Software Communications Architecture
(SCA), abstract the hardware/software components of a radio communications chain using a middleware like CORBA to provide full portability and interoperability to the implemented chain, called a waveform in SCA parlance. This feature comes
at the cost of the computational overhead introduced by the software communications middleware, and this is one of the reasons why GPP-based architectures are generally discarded even for the implementation of narrow-band SCA-compliant communications standards. In this thesis we briefly analyse the SCA architecture and an
open-source SCA-compliant framework, i.e. OSSIE, and provide guidelines to enable component-based multithreaded programming and CPU affinity in that framework.
We also detail the implementation of a real-time SCA-compliant waveform developed inside this modified framework, i.e. the VHF analogue aeronautical communications transceiver. Finally, we demonstrate that it is possible to implement an efficient and interoperable real-time wideband SCA-compliant waveform, i.e. the AeroMACS
waveform, on a GPP-based architecture by combining the acceleration factor provided by the MA technique with the interoperability ensured by the SCA architecture
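The abstract places Memory Acceleration within the family of space-time trade-offs. The thesis's own algorithms are not given here, so the following is only a generic illustration of that family: a precomputed sine table trades memory for per-sample computation. The table size and function name are assumptions made for the example.

```python
import math

TABLE_SIZE = 1024  # assumed resolution; a larger table costs memory but reduces error
SINE_TABLE = [math.sin(2.0 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]


def table_sin(phase: float) -> float:
    """Approximate sin(phase) by nearest-entry table lookup (phase in radians)."""
    # Map the phase onto a table index; the modulo wraps it into one period.
    index = round(phase / (2.0 * math.pi) * TABLE_SIZE) % TABLE_SIZE
    return SINE_TABLE[index]


if __name__ == "__main__":
    for phase in (0.0, 0.5, 2.0, 7.0):
        print(f"{phase:4.1f}  lookup={table_sin(phase):+.4f}  exact={math.sin(phase):+.4f}")
```

With nearest-entry lookup the phase error is at most half a table step, so the approximation error is bounded by roughly π / TABLE_SIZE.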
Large-scale Wireless Local-area Network Measurement and Privacy Analysis
The edge of the Internet is increasingly becoming wireless. Understanding the wireless edge is therefore important for understanding the performance and security aspects of the Internet experience. This need is especially pressing for enterprise-wide wireless local-area networks (WLANs), as organizations increasingly depend on WLANs for mission-critical tasks. Studying a live production WLAN, especially a large-scale network, is a difficult undertaking. Two fundamental difficulties involved are (1) building a scalable network measurement infrastructure to collect traces from a large-scale production WLAN, and (2) preserving user privacy while sharing these collected traces with the network research community. In this dissertation, we present our experience in designing and implementing one of the largest distributed WLAN measurement systems in the United States, the Dartmouth Internet Security Testbed (DIST), with a particular focus on our solutions to the challenges of efficiency, scalability, and security. We also present an extensive evaluation of the DIST system. To understand the severity of some potential trace-sharing risks for an enterprise-wide large-scale wireless network, we conduct a privacy analysis on one kind of wireless network trace, a user-association log, collected from a large-scale WLAN. We introduce a machine-learning-based approach that can extract and quantify sensitive information from a user-association log, even though it is sanitized. Finally, we present a case study that evaluates the trade-off between utility and privacy in WLAN trace sanitization
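To make "sanitized user-association log" concrete, here is a minimal sketch of one common sanitization step, keyed-hash pseudonymization of device identifiers. It is not the DIST sanitization procedure: the record layout (timestamp, MAC address, access point), the function names and the truncation length are assumptions for illustration.

```python
import hashlib
import hmac


def pseudonymize(identifier: str, secret_key: bytes) -> str:
    """Replace an identifier with a keyed hash so it cannot be read back directly."""
    digest = hmac.new(secret_key, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]  # truncated for readability (assumed length)


def sanitize_log(records, secret_key: bytes):
    """Pseudonymize the MAC field of (timestamp, mac, access_point) records."""
    return [(ts, pseudonymize(mac, secret_key), ap) for ts, mac, ap in records]


if __name__ == "__main__":
    log = [
        (1, "aa:bb:cc:dd:ee:ff", "ap-library"),   # made-up records
        (2, "aa:bb:cc:dd:ee:ff", "ap-dorm"),
        (3, "11:22:33:44:55:66", "ap-library"),
    ]
    for row in sanitize_log(log, b"example-key"):
        print(row)
```

Note that the same device still maps to the same pseudonym and the association pattern is preserved, which is the kind of residual information the abstract says can remain extractable from a sanitized log.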