576 research outputs found
Smart technologies for effective reconfiguration: the FASTER approach
Current and future computing systems increasingly require that their functionality stays flexible after the system is operational, in order to cope with changing user requirements and improvements in system features, i.e. changing protocols and data-coding standards, evolving demands for support of different user applications, and newly emerging applications in communication, computing and consumer electronics. Therefore, extending the functionality and the lifetime of products requires the addition of new functionality to track and satisfy the customers needs and market and technology trends. Many contemporary products along with the software part incorporate hardware accelerators for reasons of performance and power efficiency. While adaptivity of software is straightforward, adaptation of the hardware to changing requirements constitutes a challenging problem requiring delicate solutions. The FASTER (Facilitating Analysis and Synthesis Technologies for Effective Reconfiguration) project aims at introducing a complete methodology to allow designers to easily implement a system specification on a platform which includes a general purpose processor combined with multiple accelerators running on an FPGA, taking as input a high-level description and fully exploiting, both at design time and at run time, the capabilities of partial dynamic reconfiguration. The goal is that for selected application domains, the FASTER toolchain will be able to reduce the design and verification time of complex reconfigurable systems providing additional novel verification features that are not available in existing tool flows
Implicitly-multithreaded processors
This paper proposes the Implicitly-MultiThreaded (IMT) architecture to execute compiler-specified speculative threads on to a modified Simultaneous Multithreading pipeline. IMT reduces hardware complexity by relying on the compiler to select suitable thread spawning points and orchestrate inter-thread register communication. To enhance IMT's effectiveness, this paper proposes three novel microarchitectural mechanisms: (1) resource- and dependence-based fetch policy to fetch and execute suitable instructions, (2) context multiplexing to improve utilization and map as many threads to a single context as allowed by availability of resources, and (3) early thread-invocation to hide thread start-up overhead by overlapping one thread's invocation with other threads' execution. We use SPEC2K benchmarks and cycle-accurate simulation to show that an microarchitecture-optimized IMT improves performance on average by 24% and at best by 69% over an aggressive superscalar. We also compare IMT to two prior proposals, TME and DMT, for speculative threading on an SMT using hardware-extracted threads. Our best IMT design outperforms a comparable TME and DMT on average by 26% and 38% respectively
Recommended from our members
Overcoming the Intuition Wall: Measurement and Analysis in Computer Architecture
These are exciting times for computer architecture research. Today there is significant demand to improve the performance and energy-efficiency of emerging, transformative applications which are being hammered out by the hundreds for new computing platforms and usage models. This booming growth of applications and the variety of programming languages used to create them is challenging our ability as architects to rapidly and rigorously characterize these applications. Concurrently, hardware has become more complex with the emergence of accelerators, multicore systems, and heterogeneity caused by further divergence between processor market segments. No one architect can now understand all the complexities of many systems and reason about the full impact of changes or new applications.
To that end, this dissertation presents four case studies in quantitative methods. Each case study attacks a different application and proposes a new measurement or analytical technique. In each case study we find at least one surprising or unintuitive result which would likely not have been found without the application of our method
A Multi-core processor for hard real-time systems
The increasing demand for new functionalities in current and future hard real-time embedded systems, like the ones deployed in automotive and avionics industries, is driving an increment in the performance required in current embedded processors. Multi-core processors represent a good design solution to cope with such higher performance requirements due to their better performance-per-watt ratio while maintaining the core design simple. Moreover, multi-cores also allow executing mixed-criticality level workloads composed of tasks with and without hard real-time requirements, maximizing the utilization of the hardware resources while guaranteeing low cost and low power consumption.
Despite those benefits, current multi-core processors are less analyzable than single-core ones due to the interferences between different tasks when accessing hardware shared resources. As a result, estimating a meaningful Worst-Case Execution Time (WCET) estimation - i.e. to compute an upper bound of the application's execution time - becomes extremely difficult, if not even impossible, because the execution time of a task may change depending on the other threads running at the same time. This makes the WCET of a task dependent on the set of inter-task interferences introduced by the co-running tasks.
Providing a WCET estimation independent from the other tasks (time composability property) is a key requirement in hard real-time systems.
This thesis proposes a new multi-core processor design in which time composability is achieved, hence enabling the use of multi-cores in hard real-time systems. With our proposals the WCET estimation of a HRT is independent from the other co-running tasks. To that end, we design a multi-core processor in which the maximum delay a request from a Hard Real-time Task (HRT), accessing a hardware shared resource can suffer due to other tasks is bounded: our processor guarantees that a request to a shared resource cannot be delayed longer than a given Upper Bound Delay (UBD).
In addition, the UBD allows identifying the impact that different processor configurations may have on the WCET by determining the sensitivity of a HRT to different resource allocations. This thesis proposes an off-line task allocation algorithm (called IA3: Interference-Aware Allocation Algorithm), that allocates tasks in a task set based on the HRT's sensitivity to different resource allocations. As a result the hardware shared resources used by HRTs are minimized, by allowing Non Hard Real-time Tasks (NHRTs) to use the rest of resources. Overall, our proposals provide analyzability for the HRTs allowing NHRTs to be executed into the same chip without any effect on the HRTs.
The previous first two proposals of this thesis focused on supporting the execution of multi-programmed workloads with mixed-criticality levels (composed of HRTs and NHRTs).
Higher performance could be achieved by implementing multi-threaded applications. As a first step towards supporting hard real-time parallel applications, this thesis proposes a new hardware/software approach to guarantee a predictable execution of software pipelined parallel programs.
This thesis also investigates a solution to verify the timing correctness of HRTs without requiring any modification in the core design: we design a hardware unit which is interfaced with the processor and integrated into a functional-safety aware methodology. This unit monitors the execution time of a block of instructions and it detects if it exceeds the WCET. Concretely, we show how to handle timing faults on a real industrial automotive platform.La creciente demanda de nuevas funcionalidades en los sistemas empotrados de tiempo real actuales y futuros en
industrias como la automovilÃstica y la de aviación, está impulsando un incremento en el rendimiento necesario en los
actuales procesadores empotrados. Los procesadores multi-núcleo son una solución eficiente para obtener un mayor
rendimiento ya que aumentan el rendimiento por vatio, manteniendo el diseño del núcleo simple.
Por otra parte, los procesadores multi-núcleo también permiten ejecutar cargas de trabajo con niveles de tiempo real mixtas
(formadas por tareas de tiempo real duro y laxo asà como tareas sin requerimientos de tiempo real), maximizando asà la
utilización de los recursos de procesador y garantizando el bajo consumo de energÃa.
Sin embargo, a pesar los beneficios mencionados anteriormente, los actuales procesadores multi-núcleo son menos
analizables que los de un solo núcleo debido a las interferencias surgidas cuando múltiples tareas acceden
simultáneamente a los recursos compartidos del procesador.
Como resultado, la estimación del peor tiempo de ejecución (conocido como WCET) - es decir, una cota superior del tiempo
de ejecución de la aplicación - se convierte en extremadamente difÃcil, si no imposible, porque el tiempo de ejecución de
una tarea puede cambiar dependiendo de las otras tareas que se estén ejecutando concurrentemente. Determinar una
estimación del WCET independiente de las otras tareas es un requisito clave en los sistemas empotrados de tiempo real
duro. Esta tesis propone un nuevo diseño de procesador multi-núcleo en el que el tiempo de ejecución de las tareas se
puede componer, lo que permitirá el uso de procesadores multi-núcleo en los sistemas de tiempo real duro. Para ello,
diseñamos un procesador multi-núcleo en el que la máxima demora que puede sufrir una petición de una tarea de tiempo
real duro (HRT) para acceder a un recurso hardware compartido debido a otras tareas está acotado, tiene un lÃmite superior
(UBD).
Además, UBD permite identificar el impacto que las diferentes posibles configuraciones del procesador pueden tener en el
WCET, mediante la determinación de la sensibilidad en la variación del tiempo de ejecución de diferentes reservas de
recursos del procesador. Esta tesis propone un algoritmo estático de reserva de recursos (llamado IA3), que asigna tareas
a núcleos en función de dicha sensibilidad. Como resultado los recursos compartidos del procesador usados por tareas
HRT se reducen al mÃnimo, permitiendo que las tareas sin requerimiento de tiempo real (NHRTs) puedas beneficiarse del
resto de recursos.
Por lo tanto, las propuestas presentadas en esta tesis permiten el análisis del WCET para tareas HRT, permitiendo asÃ
mismo la ejecución de tareas NHRTs en el mismo procesador multi-núcleo, sin que estas tengan ningún efecto sobre las
tareas HRT.
Las propuestas presentadas anteriormente se centran en el soporte a la ejecución de múltiples cargas de trabajo con
diferentes niveles de tiempo real (HRT y NHRTs).
Sin embargo, un mayor rendimiento puede lograrse mediante la transformación una tarea en múltiples sub-tareas
paralelas. Esta tesis propone una nueva técnica, con soporte del procesador y del sistema operativo, que garantiza una
ejecución analizable del modelo de ejecución paralela software pipelining.
Esta tesis también investiga una solución para verificar la corrección del WCET de HRT sin necesidad de ninguna
modificación en el diseño de la base: un nuevo componente externo al procesador se conecta a este sin necesidad de
modificarlo. Esta nueva unidad monitorea el tiempo de ejecución de un bloque de instrucciones y detecta si se excede el
WCET. Esta unidad permite detectar fallos de sincronización en sistemas de computación utilizados en automóviles
OSS architecture for mixed-criticality systems – a dual view from a software and system engineering perspective
Computer-based automation in industrial appliances led to a growing number of
logically dependent, but physically separated embedded control units per
appliance. Many of those components are safety-critical systems, and require
adherence to safety standards, which is inconsonant with the relentless demand
for features in those appliances. Features lead to a growing amount of control
units per appliance, and to a increasing complexity of the overall software
stack, being unfavourable for safety certifications. Modern CPUs provide means
to revise traditional separation of concerns design primitives: the consolidation
of systems, which yields new engineering challenges that concern the entire
software and system stack.
Multi-core CPUs favour economic consolidation of formerly separated
systems with one efficient single hardware unit. Nonetheless, the system
architecture must provide means to guarantee the freedom from interference
between domains of different criticality. System consolidation demands for
architectural and engineering strategies to fulfil requirements (e.g., real-time
or certifiability criteria) in safety-critical environments.
In parallel, there is an ongoing trend to substitute ordinary proprietary base
platform software components by mature OSS variants for economic and
engineering reasons. There are fundamental differences of processual properties
in development processes of OSS and proprietary software. OSS in
safety-critical systems requires development process assessment techniques to
build an evidence-based fundament for certification efforts that is based upon
empirical software engineering methods.
In this thesis, I will approach from both sides: the software and system
engineering perspective. In the first part of this thesis, I focus on the
assessment of OSS components: I develop software engineering techniques
that allow to quantify characteristics of distributed OSS development
processes. I show that ex-post analyses of software development processes can
be used to serve as a foundation for certification efforts, as it is required
for safety-critical systems.
In the second part of this thesis, I present a system architecture based on
OSS components that allows for consolidation of mixed-criticality systems
on a single platform. Therefore, I exploit virtualisation extensions of modern
CPUs to strictly isolate domains of different criticality. The proposed
architecture shall eradicate any remaining hypervisor activity in order to
preserve real-time capabilities of the hardware by design, while
guaranteeing strict isolation across domains.Computergestützte Automatisierung industrieller Systeme führt zu einer
wachsenden Anzahl an logisch abhängigen, aber physisch voneinander getrennten
Steuergeräten pro System. Viele der Einzelgeräte sind sicherheitskritische
Systeme, welche die Einhaltung von Sicherheitsstandards erfordern, was durch
die unermüdliche Nachfrage an Funktionalitäten erschwert wird. Diese führt zu
einer wachsenden Gesamtzahl an Steuergeräten, einhergehend mit wachsender
Komplexität des gesamten Softwarekorpus, wodurch Zertifizierungsvorhaben
erschwert werden. Moderne Prozessoren stellen Mittel zur Verfügung, welche es
ermöglichen, das traditionelle >Trennung von Belangen< Designprinzip zu
erneuern: die Systemkonsolidierung. Sie stellt neue ingenieurstechnische
Herausforderungen, die den gesamten Software und Systemstapel betreffen.
Mehrkernprozessoren begünstigen die ökonomische und effiziente Konsolidierung
vormals getrennter Systemen zu einer effizienten Hardwareeinheit. Geeignete
Systemarchitekturen müssen jedoch die Rückwirkungsfreiheit zwischen Domänen
unterschiedlicher Kritikalität sicherstellen. Die Konsolidierung erfordert
architektonische, als auch ingenieurstechnische Strategien um die Anforderungen
(etwa Echtzeit- oder Zertifizierbarkeitskriterien) in sicherheitskritischen
Umgebungen erfüllen zu können.
Zunehmend werden herkömmliche proprietär entwickelte Basisplattformkomponenten
aus ökonomischen und technischen Gründen vermehrt durch ausgereifte OSS
Alternativen ersetzt. Jedoch hindern fundamentale Unterschiede bei prozessualen
Eigenschaften des Entwicklungsprozesses bei OSS den Einsatz in
sicherheitskritischen Systemen. Dieser erfordert Techniken, welche es erlauben
die Entwicklungsprozesse zu bewerten um ein evidenzbasiertes Fundament für
Zertifizierungsvorhaben basierend auf empirischen Methoden des Software
Engineerings zur Verfügung zu stellen.
In dieser Arbeit nähere ich mich von beiden Seiten: der Softwaretechnik, und
der Systemarchitektur. Im ersten Teil befasse ich mich mit der Beurteilung von
OSS Komponenten: Ich entwickle Softwareanalysetechniken, welche es
ermöglichen, prozessuale Charakteristika von verteilten OSS
Entwicklungsvorhaben zu quantifizieren. Ich zeige, dass rückschauende Analysen
des Entwicklungsprozess als Grundlage für Softwarezertifizierungsvorhaben
genutzt werden können.
Im zweiten Teil dieser Arbeit widme ich mich der Systemarchitektur. Ich stelle
eine OSS-basierte Systemarchitektur vor, welche die Konsolidierung von
Systemen gemischter Kritikalität auf einer alleinstehenden Plattform
ermöglicht. Dazu nutze ich Virtualisierungserweiterungen moderner Prozessoren
aus, um die Hardware in strikt voneinander isolierten Rechendomänen unterschiedlicher
Kritikalität unterteilen zu können. Die vorgeschlagene Architektur soll jegliche
Betriebsstörungen des Hypervisors beseitigen, um die Echtzeitfähigkeiten der
Hardware bauartbedingt aufrecht zu erhalten, während strikte Isolierung
zwischen Domänen stets sicher gestellt ist
Efficient Adaptive Hard Real-time Multi-processor Systems
Modern computing systems are based on multi-processor systems, i.e. multiple cores on the same chip. Hard real-time systems are required to perform particular tasks within certain amount of time; failure to do so characterises an unaccepted behavior. Hard real-time systems are found in safety-critical applications, e.g. airbag control software, flight control software, etc. In safety-critical applications, failure to meet the real-time constraints can have catastrophic effects. The safe and, at the same time, efficient deployment of applications, with hard real-time constraints, on multi-processors is a challenging task. Scheduling methods and Models of Computation, that provide safe deployments, require a realistic estimation of the Worst-Case Execution Time (WCET) of tasks. The simultaneous access of shared resources by parallel tasks, causes interference delays due to hardware arbitration. Interference delays can be accounted for, with the pessimistic assumption that all possible interference can happen. The resulting schedules would be exceedingly conservative, thus the benefits of multi-processor would be significantly negated. Producing less pessimistic schedules is challenging due to the inter-dependency between WCET estimation and deployment optimisation. Accurate estimation of interference delays -and thus estimation of task WCET- depends on the way an application is deployed; deployment is an optimisation problem that depends on the estimation of task WCET. Another efficiency gap, which is of consequence in several systems (e.g. airbag control), stems from the fact that rarely tasks execute with their WCET. Safe runtime adaptation based on the Actual Execution Times, can yield additional improvements in terms of latency (more responsive systems). To achieve efficiency and retain adaptability, we propose that interference analysis should be coupled with the deployment process. The proposed interference analysis method estimates the possible amount of interference, based on an architecture and an application model. As more information is provided, such as scheduling, memory mapping, etc, the per-task interference estimation becomes more accurate. Thus, the method computes interference-sensitive WCET estimations (isWCET). Based on the isWCET method, we propose a method to break the inter-dependency between WCET estimation and deployment optimisation. Initially, the isWCETs are over-approximated, by assuming worst-case interference, and a safe deployment is derived. Subsequently, the proposed method computes accurate isWCETs by spatio-temporal exclusion, i.e. excluding interferences from non-overlapping tasks that share resources (space). Based on accurate isWCETs, the deployment solution is improved to provide better latency guarantees. We also propose a distributed runtime adaptation technique, that aims to improve run-time latency. Using isWCET estimations restricts the possible adaptations, as an adaptation might increase the interference and violate the safety guarantees. The proposed technique introduces statically scheduling dependencies between tasks that prevent additional interference. At runtime, a self-timed scheduling policy that respects these dependencies, is applied, proven to be safe, and with minimal overhead. Experimental evaluation on Kalray MPPA-256 shows that our methods improve isWCET up to 36%, guaranteed latency up to 46%, runtime performance up to 42%, with a consolidated performance gain of 50%
Enhancing the efficiency and practicality of software transactional memory on massively multithreaded systems
Chip Multithreading (CMT) processors promise to deliver higher performance by running more than one stream of instructions in parallel. To exploit CMT's capabilities, programmers have to parallelize their applications, which is not a trivial task. Transactional Memory (TM) is one of parallel programming models that aims at simplifying synchronization by raising the level of abstraction between semantic atomicity and the means by which that atomicity is achieved. TM is a promising programming model but there are still important challenges that must be addressed to make it more practical and efficient in mainstream parallel programming.
The first challenge addressed in this dissertation is that of making the evaluation of TM proposals more solid with realistic TM benchmarks and being able to run the same benchmarks on different STM systems. We first introduce a benchmark suite, RMS-TM, a comprehensive benchmark suite to evaluate HTMs and STMs. RMS-TM consists of seven applications from the Recognition, Mining and Synthesis (RMS) domain that are representative of future workloads. RMS-TM features current TM research issues such as nesting and I/O inside transactions, while also providing various TM characteristics. Most STM systems are implemented as user-level libraries: the programmer is expected to manually instrument not only transaction boundaries, but also individual loads and stores within transactions. This library-based approach is increasingly tedious and error prone and also makes it difficult to make reliable performance comparisons. To enable an "apples-to-apples" performance comparison, we then develop a software layer that allows researchers to test the same applications with interchangeable STM back ends.
The second challenge addressed is that of enhancing performance and scalability of TM applications running on aggressive multi-core/multi-threaded processors. Performance and scalability of current TM designs, in particular STM desings, do not always meet the programmer's expectation, especially at scale. To overcome this limitation, we propose a new STM design, STM2, based on an assisted execution model in which time-consuming TM operations are offloaded to auxiliary threads while application threads optimistically perform computation. Surprisingly, our results show that STM2 provides, on average, speedups between 1.8x and 5.2x over state-of-the-art STM systems. On the other hand, we notice that assisted-execution systems may show low processor utilization. To alleviate this problem and to increase the efficiency of STM2, we enriched STM2 with a runtime mechanism that automatically and adaptively detects application and auxiliary threads' computing demands and dynamically partition hardware resources between the pair through the hardware thread prioritization mechanism implemented in POWER machines.
The third challenge is to define a notion of what it means for a TM program to be correctly synchronized. The current definition of transactional data race requires all transactions to be totally ordered "as if'' serialized by a global lock, which limits the scalability of TM designs. To remove this constraint, we first propose to relax the current definition of transactional data race to allow a higher level of concurrency. Based on this definition we propose the first practical race detection algorithm for C/C++ applications (TRADE) and implement the corresponding race detection tool. Then, we introduce a new definition of transactional data race that is more intuitive, transparent to the underlying TM implementation, can be used for a broad set of C/C++ TM programs. Based on this new definition, we proposed T-Rex, an efficient and scalable race detection tool for C/C++ TM applications. Using TRADE and T-Rex, we have discovered subtle transactional data races in widely-used STAMP applications which have not been reported in the past
- …