17 research outputs found
Recommended from our members
Securing Network Processors with Hardware Monitors
As an essential part of modern society, the Internet has fundamentally changed our lives during the last decade. Novel applications and technologies, such as online shopping, social networking, cloud computing, mobile networking, etc, have sprung up at an astonishing pace. These technologies not only influence modern life styles but also impact Internet infrastructure. Numerous new network applications and services require better programmability and flexibility for network devices, such as routers and switches. Since traditional fixed function network routers based on application specific integrated circuits (ASICs) have difficulty keeping pace with the growing demands of next-generation Internet applications, there is an ongoing shift in the industry toward implementing network devices using programmable network processors (NPs).
While network processors offer great benefits in terms of flexibility, their reprogrammable nature exposes potential security risks. Similar to network end-systems, such as general-purpose computers, software-based network processors have security vulnerabilities that can be attacked remotely. Recent research has shown that a new type of data plane attack is able to modify the functionality of a network processor and cause a denial-of-service (DoS) attack by sending a single malformed UDP packet. Since this attack relies solely on data plane access and does not need access to the control plane, it can be particularly difficult to control.
Hardware security monitors have been introduced to identify and eliminate these malicious packets before they can propagate and cause devastating effects in the network. However, previous work on hardware monitors only focus on single core systems with static (or very slowly changing) workloads. In network processors that use up to hundreds of parallel processor cores and have processing workloads that can change dynamically based on the network traffic, the realization of a complete multicore hardware monitoring system remains a critical challenge. Our research work in this thesis provides a comprehensive solution to this problem.
Our first contribution is the design and prototype implementation of a Scalable Hardware Monitoring Grid (SHMG). This scalable architecture balances area cost and performance overhead by using a clustered approach for multicore NP systems. In order to adapt to dynamically changing network traffic, a resource reallocation algorithm is designed to reassign the processing resources in SHMG to different network applications at runtime. An evaluation of the prototype SHMG on an Altera DE4 board demonstrates low resource and performance overheads. The functionality and performance of a runtime resource reallocation algorithm are tested using a simulation environment.
A second significant contribution of this work is a network system-level security solution for multicore network processors with hardware monitors. It addresses two key problems: (1) how to securely manage and reprogram processor cores and monitors in a deployed router in the network, and (2) how to prevent the large number of identical router devices in the network from an attack that can circumvent one specific monitoring system and lead to Internet-scale failures. A Secure Dynamic Multicore Hardware Monitoring System (SDMMon) is designed based on cryptographic principles and suitable key management to ensure the secure installation of processor binaries and monitor graphs. We present a Merkle tree based parameterizable high performance hash function that can be configured to perform a variety of functions in different devices via a 32-bit configuration parameter. A prototype system composed of both the SDMMon and the parameterizable hash is implemented and evaluated on an Altera DE4 board.
Finally, a fully-functional, comprehensive Multicore NP Security Platform, which integrates both the SHMG and the SDMMon security features, has been implemented on an Altera DE5 board
Software-based and regionally-oriented traffic management in Networks-on-Chip
Since the introduction of chip-multiprocessor systems, the number of integrated cores has been steady growing and workload applications have been adapted to exploit the increasing parallelism. This changed the importance of efficient on-chip communication significantly and the infrastructure has to keep step with these new requirements.
The work at hand makes significant contributions to the state-of-the-art of the latest generation of such solutions, called Networks-on-Chip, to improve the performance, reliability, and flexible management of these on-chip infrastructures
Methoden und Beschreibungssprachen zur Modellierung und Verifikation vonSchaltungen und Systemen: MBMV 2015 - Tagungsband, Chemnitz, 03. - 04. März 2015
Der Workshop Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV 2015) findet nun schon zum 18. mal statt. Ausrichter sind in diesem Jahr die Professur Schaltkreis- und Systementwurf der Technischen Universität Chemnitz und das Steinbeis-Forschungszentrum Systementwurf und Test.
Der Workshop hat es sich zum Ziel gesetzt, neueste Trends, Ergebnisse und aktuelle Probleme auf dem Gebiet der Methoden zur Modellierung und Verifikation sowie der Beschreibungssprachen digitaler, analoger und Mixed-Signal-Schaltungen zu diskutieren. Er soll somit ein Forum zum Ideenaustausch sein.
Weiterhin bietet der Workshop eine Plattform für den Austausch zwischen Forschung und Industrie sowie zur Pflege bestehender und zur Knüpfung neuer Kontakte. Jungen Wissenschaftlern erlaubt er, ihre Ideen und Ansätze einem breiten Publikum aus Wissenschaft und Wirtschaft zu präsentieren und im Rahmen der Veranstaltung auch fundiert zu diskutieren. Sein langjähriges Bestehen hat ihn zu einer festen Größe in vielen Veranstaltungskalendern gemacht. Traditionell sind auch die Treffen der ITGFachgruppen an den Workshop angegliedert.
In diesem Jahr nutzen zwei im Rahmen der InnoProfile-Transfer-Initiative durch das Bundesministerium für Bildung und Forschung geförderte Projekte den Workshop, um in zwei eigenen Tracks ihre Forschungsergebnisse einem breiten Publikum zu präsentieren. Vertreter der Projekte Generische Plattform für Systemzuverlässigkeit und Verifikation (GPZV) und GINKO - Generische Infrastruktur zur nahtlosen energetischen Kopplung von Elektrofahrzeugen stellen Teile ihrer gegenwärtigen Arbeiten vor. Dies bereichert denWorkshop durch zusätzliche Themenschwerpunkte und bietet eine wertvolle Ergänzung zu den Beiträgen der Autoren. [... aus dem Vorwort
Intelligent systems for efficiency and security
As computing becomes ubiquitous and personalized, resources like energy, storage and time are becoming increasingly scarce and, at the same time, computing systems must deliver in multiple dimensions, such as high performance, quality of service, reliability, security and low power. Building such computers is hard, particularly when the operating environment is becoming more dynamic, and systems are becoming heterogeneous and distributed.
Unfortunately, computers today manage resources with many ad hoc heuristics that are suboptimal, unsafe, and cannot be composed across the computer’s subsystems. Continuing this approach has severe consequences: underperforming systems, resource waste, information loss, and even life endangerment.
This dissertation research develops computing systems which, through intelligent adaptation, deliver efficiency along multiple dimensions. The key idea is to manage computers with principled methods from formal control. It is with these methods that the multiple subsystems of a computer sense their environment and configure themselves to meet system-wide goals.
To achieve the goal of intelligent systems, this dissertation makes a series of contributions, each building on the previous. First, it introduces the use of formal MIMO (Multiple Input Multiple Output) control for processors, to simultaneously optimize many goals like performance, power, and temperature. Second, it develops the Yukta control system, which uses coordinated formal controllers in different layers of the stack (hardware and operating system). Third, it uses robust control to develop a fast, globally coordinated and decentralized control framework called Tangram, for heterogeneous computers. Finally, it presents Maya, a defense against power side-channel attacks that uses formal control to reshape the power dissipated by a computer, confusing the attacker. The ideas in the dissertation have been demonstrated successfully with several prototypes, including one built along with AMD (Advanced Micro Devices, Inc.) engineers. These designs significantly outperformed the state of the art.
The research in this dissertation brought formal control closer to computer architecture and has been well-received in both domains. It has the first application of full-fledged MIMO control for processors, the first use of robust control in computer systems, and the first application of formal control for side-channel defense. It makes a significant stride towards intelligent systems that are efficient, secure and reliable
Distributed Smart City Services for Urban Ecosystems
A Smart City is a high-performance urban context, where citizens live independently and are more aware of the surrounding opportunities, thanks to forward-looking development of economy politics, governance, mobility
and environment. ICT infrastructures play a key-role in this new research field being also a mean for society to allow new ideas to prosper and new, more efficient approaches to be developed. The aim of this work is to research and develop novel solutions, here called smart services, in order to solve several upcoming problems and known issues in urban areas and more in general in the modern society context. A specific focus is posed on smart governance and on privacy issues which have been arisen in the cellular age
Power-constrained aware and latency-aware microarchitectural optimizations in many-core processors
As the transistor budgets outpace the power envelope (the power-wall issue), new architectural and microarchitectural techniques are needed to improve, or at least maintain, the power efficiency of next-generation processors. Run-time adaptation, including core, cache and DVFS adaptations, has recently emerged as a promising area to keep the pace for acceptable power efficiency.
However, none of the adaptation techniques proposed so far is able to provide good results when we consider the stringent power budgets that will be common in the next decades, so new techniques that attack the problem from several fronts using different specialized mechanisms are necessary. The combination of different power management mechanisms, however, bring extra levels of complexity, since other factors such as workload behavior and run-time conditions must also be considered to properly allocate power among cores and threads.
To address the power issue, this thesis first proposes Chrysso, an integrated and scalable model-driven power management that quickly selects the best combination of adaptation methods out of different core and uncore micro-architecture adaptations, per-core DVFS, or any combination thereof. Chrysso can quickly search the adaptation space by making performance/power projections to identify Pareto-optimal configurations, effectively pruning the search space. Chrysso achieves 1.9x better chip performance over core-level gating for multi-programmed workloads, and 1.5x higher performance for multi-threaded workloads.
Most existing power management schemes use a centralized approach to regulate power dissipation. Unfortunately, the complexity and overhead of centralized power management increases significantly with core count rendering it in-viable at fine-grain time slices. The work leverages a two-tier hierarchical power manager. This solution is highly scalable with low overhead on a tiled many-core architecture with shared LLC and per-tile DVFS at fine-grain time slices. The global power is first distributed across tiles using GPM and then within a tile (in parallel across all tiles). Additionally, this work also proposes DVFS and cache-aware thread migration (DCTM) to ensure optimum per-tile co-scheduling of compatible threads at runtime over the two-tier hierarchical power manager. DCTM outperforms existing solutions by up to 12% on adaptive many-core tile processor.
With the advancements in the core micro-architectural techniques and technology scaling, the performance gap between the computational component and memory component is increasing significantly (the memory-wall issue). To bridge this gap, the architecture community is pushing forward towards multi-core architecture with on-die near-memory DRAM cache memory (faster than conventional DRAM). Gigascale DRAM Caches poses a problem of how to efficiently manage the tags. The Tags-in-DRAM designs aims at efficiently co-locate tags with data, but it still suffer from high latency especially in multi-way associativity.
The thesis finally proposes Tag Cache mechanism, an on-chip distributed tag caching mechanism with limited space and latency overhead to bypass the tag read operation in multi-way DRAM Caches, thereby reducing hit latency. Each Tag Cache, stored in L2, stores tag information of the most recently used DRAM Cache ways. The Tag Cache is able to exploit temporal locality of the DRAM Cache, thereby contributing to on average 46% of the DRAM Cache hits.A mesura que el consum dels transistors supera el nivell de potència desitjable es necessiten noves tècniques arquitectòniques i microarquitectòniques per millorar, o almenys mantenir, l'eficiència energètica dels processadors de les pròximes generacions. L'adaptació en temps d'execució, tant de nuclis com de les cachés, així com també adaptacions DVFS són idees que han sorgit recentment que fan preveure que sigui un àrea prometedora per mantenir un ritme d'eficiència energètica acceptable. Tanmateix, cap de les tècniques d'adaptació proposades fins ara és capaç d'oferir bons resultats si tenim en compte les restriccions estrictes de potència que seran comuns a les pròximes dècades. És convenient definir noves tècniques que ataquin el problema des de diversos fronts utilitzant diferents mecanismes especialitzats. La combinació de diferents mecanismes de gestió d'energia porta aparellada nivells addicionals de complexitat, ja que altres factors com ara el comportament de la càrrega de treball així com condicions específiques de temps d'execució també han de ser considerats per assignar adequadament la potència entre els nuclis del sistema computador. Per tractar el tema de la potència, aquesta tesi proposa en primer lloc Chrysso, una administració d'energia integrada i escalable que selecciona ràpidament la millor combinació entre diferents adaptacions microarquitectòniques. Chrysso pot buscar ràpidament l'adaptació adequada al fer projeccions òptimes de rendiment i potència basades en configuracions de Pareto, permetent així reduir de manera efectiva l'espai de cerca. Chrysso arriba a un rendiment de 1,9 sobre tècniques convencionals d'inhibició de portes amb una càrrega d'aplicacions seqüencials; i un rendiment de 1,5 quan les aplicacions corresponen a programes parla·lels. La majoria dels sistemes de gestió d'energia existents utilitzen un enfocament centralitzat per regular la dissipació d'energia. Malauradament, la complexitat i el temps d'administració s'incrementen significativament amb una gran quantitat de nuclis. En aquest treball es defineix un gestor jeràrquic de potència basat en dos nivells. Aquesta solució és altament escalable amb baix cost operatiu en una arquitectura de múltiples nuclis integrats en clústers, amb memòria caché de darrer nivell compartida a nivell de cluster, i DVFS establert en intervals de temps de gra fi a nivell de clúster. La potència global es distribueix en primer lloc a través dels clústers utilitzant GPM i després es distribueix dins un clúster (en paral·lel si es consideren tots els clústers). A més, aquest treball també proposa DVFS i migració de fils conscient de la memòria caché (DCTM) que garanteix una òptima distribució de tasques entre els nuclis. DCTM supera les solucions existents fins a un 12%. Amb els avenços en la tecnologia i les tècniques de micro-arquitectura de nuclis, la diferència de rendiment entre el component computacional i la memòria està augmentant significativament. Per omplir aquest buit, s'està avançant cap a arquitectures de múltiples nuclis amb memòries caché integrades basades en DRAM. Aquestes memòries caché DRAM a gran escala plantegen el problema de com gestionar de forma eficaç les etiquetes. Els dissenys de cachés amb dades i etiquetes juntes són un primer pas, però encara pateixen per tenir una alta latència, especialment en cachés amb un grau alt d'associativitat. En aquesta tesi es proposa l'estudi d'una tècnica anomenada Tag Cache, un mecanisme distribuït d'emmagatzematge d'etiquetes, que redueix la latència de les operacions de lectura d'etiquetes en les memòries caché DRAM. Cada Tag Cache, que resideix a L2, emmagatzema la informació de les vies que s'han accedit recentment de les memòries caché DRAM. D'aquesta manera es pot aprofitar la localitat temporal d'una caché DRAM, fet que contribueix en promig en un 46% dels encerts en les caché DRAM
D1.1 Refined scenarios and requirements, consolidated use cases, and qualitative techno-economic feasibility assessment
This document describes scenarios, consolidated use cases and associated requirements for wireless access networks in the 2020-2030 timeframe. These are based on METIS project and also taking into account work done in other 5G projects and forums such as ITU-R and NGMN. The document introduces spectrum authorization modes and describes spectrum usages scenarios, spectrum bands and spectrum demand for 5G services. Finally, this document provides qualitative techno-economic feasibility assessment by analyzing main players involved in service delivery, from the radio access network point of view, and describing their mutual positions and relationships.Monserrat Del Río, JF.; Martín-Sacristán Gandía, D. (2016). D1.1 Refined scenarios and requirements, consolidated use cases, and qualitative techno-economic feasibility assessment. https://doi.org/10.13140/RG.2.2.32091.7760
Locality Awareness for Task Parallel Computation
The task parallel programming model allows programmers to express concurrency at a high level of abstraction and places the burden of scheduling parallel execution on the run time system. Efficient scheduling of tasks on multi-socket multicore shared memory systems requires careful consideration of an increasingly complex memory hierarchy, including shared caches and non-uniform memory access (NUMA) characteristics. In this dissertation, we study the performance impact of these issues and other performance factors that limit parallel speedup in task parallel program executions and propose new scheduling strategies to improve performance. Our performance model characterizes lost efficiency in terms of overhead time, idle time, and work time inflation due to increased data access costs. We introduce a hierarchical run time scheduler that combines the benefits of work stealing and parallel depth-first schedulers. Matching the scheduler design to the memory hierarchy of multicore NUMA systems limits costly remote data accesses while maintaining load balance and exploiting constructive data sharing among threads that share a cache. We also propose a locality- based scheduling framework based on locality domains and comprising an API for programmers to specify application locality and a scheduler that honors those specifications. Implementations of the hierarchical and locality-based schedulers in our OpenMP run time system exhibit performance improvements on several task parallel benchmark applications over existing scheduling strategies and production OpenMP run time systems.Doctor of Philosoph
On Design and Applications of Practical Concurrent Data Structures
The proliferation of multicore processors is having an enormous impact on software design and development. In order to exploit parallelism available in multicores, there is a need to design and implement abstractions that programmers can use for general purpose applications development. A common abstraction for coordinated access to memory is a concurrent data structure. Concurrent data structures are challenging to design and implement as they are required to be correct, scalable, and practical under various application constraints. In this thesis, we contribute to the design of efficient concurrent data structures, propose new design techniques and improvements to existing implementations. Additionally, we explore the utilization of concurrent data structures in demanding application contexts such as data stream processing.In the first part of the thesis, we focus on data structures that are difficult to parallelize due to inherent sequential bottlenecks. We present a lock-free vector design that efficiently addresses synchronization bottlenecks by utilizing the combining technique. Typical combining techniques are blocking. Our design introduces combining without sacrificing non-blocking progress guarantees. We extend the vector to present a concurrent lock-free unbounded binary heap that implements a priority queue with mutable priorities.In the second part of the thesis, we shift our focus to concurrent search data structures. In order to offer strong progress guarantee, typical implementations of non-blocking search data structures employ a "helping" mechanism. However, helping may result in performance degradation. We propose help-optimality, which expresses optimization in amortized step complexity of concurrent operations. To describe the concept, we revisit the lock-free designs of a linked-list and a binary search tree and present improved algorithms. We design the algorithms without using any language/platform specific constructs; we do not use bit-stealing or runtime type introspection of objects. Thus, our algorithms are portable. We further delve into multi-dimensional data and similarity search. We present the first lock-free multi-dimensional data structure and linearizable nearest neighbor search algorithm. Our algorithm for nearest neighbor search is generic and can be adapted to other data structures.In the last part of the thesis, we explore the utilization of concurrent data structures for deterministic stream processing. We propose solutions to two challenges prevalent in data stream processing: (1) efficient processing on cloud as well as edge devices and (2) deterministic data-parallel processing at high-throughput and low-latency. As a first step, we present a methodology for customization of streaming aggregation on low-power multicore embedded platforms. Then we introduce Viper, a communication module that can be integrated into stream processing engines for the coordination of threads analyzing data in parallel
Effiziente externe Beobachtung von CPU-Aktivitäten auf SoCs
Die umfassende Beobachtbarkeit von System‐on‐Chips (SoCs) ist eine wichtige Voraussetzung für das effiziente Testen und Debuggen eingebetteter Systeme. Ausgehend von einer Analyse verschiedener Anwendungsfälle ergibt sich ein Katalog von Anforderungen an die Beobachtbarkeit von SoCs. Ein wichtiges Kriterium ist hier die Vollständigkeit der Beobachtung und umfasst die Aktivitäten der CPU (ausgeführte Instruktionen, gelesene und geschriebene Daten, Verhalten des Caches, Ausführungszeiten), des Bussystems und von Umgebungsbedingungen.
Weitere Kriterien sind die Echtzeitfähigkeit und die Kontinuität der Beobachtung sowie die gleichzeitige Durchführung verschiedener Beobachtungsaufgaben. Dabei soll es zu einer möglichst geringen Beeinflussung des SoCs kommen. Weitere wichtige Aspekt sind die Kosten der Lösung, die Universalität, die Skalierbarkeit sowie die Latenz der Verfügbarkeit der Beobachtungsergebnisse.
Für viele Anwendungen, besonders in sicherheitskritischen Bereichen, muss zudem nachgewiesen werden, dass das Beobachtungsverfahren kein Fehlverhalten des SoCs bewirkt bzw. ein solches maskiert. Eine besondere Herausforderung stellen Multiprozessor‐SoCs (MPSoCs) dar, da hier die Kommunikation zwischen den einzelnen CPUs im Inneren des SoC stattfindet und entsprechend schwierig für einen externen Bobachter sichtbar zu machen ist.
Der Stand der Technik zur Beobachtung von SoCs wird im Wesentlichen durch zwei Verfahren dargestellt. Bei der Software‐Instrumentierung wird zum funktionalen Programmcode zusätzlicher Code hinzugefügt, welcher zur Beobachtung des Programms dient. Diese Methode ist einfach und universell anwendbar, erfüllt aber die genannten Kriterien nur sehr eingeschränkt. Nachteilig ist hier der Ressourcenverbrauch im Falle des Verbleibs der Instrumentierung im fertigen Produkt. Wird die Instrumentierung nur temporär dem Code hinzugefügt, muss sichergestellt werden, dass das Beobachtungsergebnis auch für den finalen Code anwendbar ist – was besonders bei ressourcen‐abhängigen Integrationstests nur schwierig erfüllbar ist.
Eine alternative Lösung stellt eine spezielle Hardware‐Unterstützung in SoCs („embedded Trace“) dar. Hier werden im SoC Zustandsinformationen (z.B. Taskwechsel, ausgeführte Instruktionen, Datentransfers) gesammelt und mittels Trace‐Nachrichten an den Beobachter übermittelt. Dabei stellt die Bandbreite, die zur Ausgabe der Trace‐Nachrichten vom SoC verfügbar ist, ein entscheidendes Nadelöhr dar ‐ im SoC sind viel mehr den Beobachter interessierende Informationen verfügbar als nach außen transferiert werden können.
Damit haben beide dem gegenwärtige Stand der Technik entsprechende Beobachtungsverfahren eine Reihe von Einschränkungen, die sich besonders bei der Vollständigkeit der Beobachtung, der Flexibilität, der Kontinuität und der Unterstützung von MPSoCs zeigen.
In dieser Arbeit wird nun ein neuer Ansatz vorgestellt, welcher gegenüber dem Stand der Technik in einigen Bereichen deutliche Verbesserungen bietet. Dabei werden die Trace‐Daten nicht vom zu beobachtenden SoC direkt, sondern aus einer parallel mitlaufenden Emulation gewonnen. Die Bandbreite der für die Synchronisation der Emulation erforderlichen Daten ist in vielen Fällen deutlich geringer als bei der Ausgabe von umfassenden Trace‐Nachrichten mittels „embedded Trace“‐Lösungen. Gleichzeitig ist eine vollständige, äußerst detaillierte Beobachtung der Vorgänge innerhalb des SoC möglich. Das neue Beobachtungsverfahren wurde mittels verschiedener FPGA-basierter Implementierungen evaluiert, hier konnte auch die Anwendbarkeit für MPSoCs gezeigt werden