8 research outputs found
Doctor of Philosophy
dissertationThe embedded system space is characterized by a rapid evolution in the complexity and functionality of applications. In addition, the short time-to-market nature of the business motivates the use of programmable devices capable of meeting the conflicting constraints of low-energy, high-performance, and short design times. The keys to achieving these conflicting constraints are specialization and maximally extracting available application parallelism. General purpose processors are flexible but are either too power hungry or lack the necessary performance. Application-specific integrated circuits (ASICS) efficiently meet the performance and power needs but are inflexible. Programmable domain-specific architectures (DSAs) are an attractive middle ground, but their design requires significant time, resources, and expertise in a variety of specialties, which range from application algorithms to architecture and ultimately, circuit design. This dissertation presents CoGenE, a design framework that automates the design of energy-performance-optimal DSAs for embedded systems. For a given application domain and a user-chosen initial architectural specification, CoGenE consists of a a Compiler to generate execution binary, a simulator Generator to collect performance/energy statistics, and an Explorer that modifies the current architecture to improve energy-performance-area characteristics. The above process repeats automatically until the user-specified constraints are achieved. This removes or alleviates the time needed to understand the application, manually design the DSA, and generate object code for the DSA. Thus, CoGenE is a new design methodology that represents a significant improvement in performance, energy dissipation, design time, and resources. This dissertation employs the face recognition domain to showcase a flexible architectural design methodology that creates "ASIC-like" DSAs. The DSAs are instruction set architecture (ISA)-independent and achieve good energy-performance characteristics by coscheduling the often conflicting constraints of data access, data movement, and computation through a flexible interconnect. This represents a significant increase in programming complexity and code generation time. To address this problem, the CoGenE compiler employs integer linear programming (ILP)-based 'interconnect-aware' scheduling techniques for automatic code generation. The CoGenE explorer employs an iterative technique to search the complete design space and select a set of energy-performance-optimal candidates. When compared to manual designs, results demonstrate that CoGenE produces superior designs for three application domains: face recognition, speech recognition and wireless telephony. While CoGenE is well suited to applications that exhibit a streaming behavior, multithreaded applications like ray tracing present a different but important challenge. To demonstrate its generality, CoGenE is evaluated in designing a novel multicore N-wide SIMD architecture, known as StreamRay, for the ray tracing domain. CoGenE is used to synthesize the SIMD execution cores, the compiler that generates the application binary, and the interconnection subsystem. Further, separating address and data computations in space reduces data movement and contention for resources, thereby significantly improving performance compared to existing ray tracing approaches
Cooperative auto-tuning of parallel skeletons
Improving program performance through the use of multiple homogeneous processing
elements, or cores, is common-place. However, these architectures increase the
complexity required at the software level. Existing work is focused on optimising
programs that run in isolation on these systems, but ignores the fact that, in reality,
these systems run multiple parallel programs concurrently with programs competing
for system resources. In order to improve performance in this shared environment,
cooperative tuning of multiple, concurrently running parallel programs is required.
Moreover, the set of programs running on the system – the system workload – is dynamic
and rapidly changing. This makes cooperative tuning a challenge, as it must
react rapidly to changes in the system workload.
This thesis explores the scope for performance improvement from cooperatively
tuning skeleton parallel programs, and techniques that can be used to cooperatively
auto-tune parallel programs. Parallel skeletons provide a clear separation between
algorithm description and implementation, and provide tuning knobs that the system
can use to make high-level changes to a programs implementation. This work
is in three parts: (i) how many threads should be allocated to each program running
on the system, (ii) on which cores should a programs threads be executed and
(iii) what values should be chosen for high-level parameters of the parallel skeletons.
We demonstrate that significant performance improvements are available in each of
these areas, compared to the current state-of-the-art
Heterogeneity-awareness in multithreaded multicore processors
During the last decades, Computer Architecture has experienced a great series of revolutionary changes. The increasing transistor count on a single chip has led to some of the main milestones in the field, from the release of the first Superscalar (1965) to the state-of-the-art Multithreaded Multicore Architectures, like the Intel Core i7 (2009).Moore's Law has continued for almost half of a century and is not expected to stop for at least another decade, and perhaps much longer. Moore observed a trend in the process technology advances. So, the number of transistors that can be placed inexpensively on an integrated circuit has increased exponentially, doubling approximately every two years. Nevertheless, having more available transistors can not be always directly translated into having more performance.The complexity of state-of-the-art software has reached heights unthinkable in prior ages, both in terms of the amount of computation and the complexity involved. If we deeply analyze this complexity in software we would realize that software is comprised of smaller execution processes that, although maintaining certain spatial/temporal locality, imply an inherently heterogeneous behavior. That is, during execution time the hardware executes very different portions of software, with huge differences in terms of behavior and hardware requirements. This heterogeneity in the behaviour of the software is not specific of the latest videogame, but it is inherent to software programming itself, since the very beginning of Algorithmics.In this PhD dissertation we deeply analyze the inherent heterogeneity present in software behavior. We identify the main issues and sources of this heterogeneity, that hamper most of the state-of-the-art processor designs from obtaining their maximum potential. Hence, the heterogeneity in software turns most of the current processors, commonly called general-purpose processors, into overdesigned. That is, they have much more hardware resources than really needed to execute the software running on them. This fact would not represent a main problem if we were not concerned on the additional power consumption involved in software computation.The final goal of this PhD dissertation consists in assigning each portion of software exactly the amount of hardware resources really needed to fully exploit its maximal potential; without consuming more energy than the strictly needed. That is, obtaining complexity-effective executions using the inherent heterogeneity in software behavior as steering indicator. Thus, we start deeply analyzing the heterogenous behaviour of the software run on top of general-purpose processors and then matching it on top of a heterogeneously distributed hardware, which explicitly exploit heterogeneous hardware requirements. Only by being heterogeneity-aware in software, and appropriately matching this software heterogeneity on top of hardware heterogeneity, may we effectively obtain better processor designs.The PhD dissertation is comprised of four main contributions that cover both multithreaded single-core (hdSMT) and multicore (TCA Algorithm, hTCA Framework and MFLUSH) scenarios, deeply explained in their corresponding chapters in the PhD dissertation memory. Overall, these contributions cover a significant range of the Heterogeneity-Aware Processors' design space. Within this design space, we have focused on the state-of-the-art trend in processor design: Multithreaded Multicore (CMP+SMT) Processors.We make special emphasis on the MPsim simulation tool, specifically designed and developed for this PhD dissertation. This tool has already gone beyond this PhD dissertation, becoming a reference tool by an important group of researchers spread over the Computer Architecture Department (DAC) at the Polytechnic University of Catalonia (UPC), the Barcelona Supercomputing Center (BSC) and the University of Las Palmas de Gran Canaria (ULPGC)
Generalizing List Scheduling for Stochastic Soft Real-time Parallel Applications
Advanced architecture processors provide features such as caches and branch prediction that result in improved, but variable, execution time of software. Hard real-time systems require tasks to complete within timing constraints. Consequently, hard real-time systems are typically designed conservatively through the use of tasks? worst-case execution times (WCET) in order to compute deterministic schedules that guarantee task?s execution within giving time constraints. This use of pessimistic execution time assumptions provides real-time guarantees at the cost of decreased performance and resource utilization. In soft real-time systems, however, meeting deadlines is not an absolute requirement (i.e., missing a few deadlines does not severely degrade system performance or cause catastrophic failure). In such systems, a guaranteed minimum probability of completing by the deadline is sufficient. Therefore, there is considerable latitude in such systems for improving resource utilization and performance as compared with hard real-time systems, through the use of more realistic execution time assumptions. Given probability distribution functions (PDFs) representing tasks? execution time requirements, and tasks? communication and precedence requirements, represented as a directed acyclic graph (DAG), this dissertation proposes and investigates algorithms for constructing non-preemptive stochastic schedules. New PDF manipulation operators developed in this dissertation are used to compute tasks? start and completion time PDFs during schedule construction. PDFs of the schedules? completion times are also computed and used to systematically trade the probability of meeting end-to-end deadlines for schedule length and jitter in task completion times. Because of the NP-hard nature of the non-preemptive DAG scheduling problem, the new stochastic scheduling algorithms extend traditional heuristic list scheduling and genetic list scheduling algorithms for DAGs by using PDFs instead of fixed time values for task execution requirements. The stochastic scheduling algorithms also account for delays caused by communication contention, typically ignored in prior DAG scheduling research. Extensive experimental results are used to demonstrate the efficacy of the new algorithms in constructing stochastic schedules. Results also show that through the use of the techniques developed in this dissertation, the probability of meeting deadlines can be usefully traded for performance and jitter in soft real-time systems
A Survey of Timing Verification Techniques for Multi-Core Real-Time Systems
This survey provides an overview of the scientific literature on timing verification techniques for multi-core real-time systems. It reviews the key results in the field from its origins around 2006 to the latest research published up to the end of 2018. The survey highlights the key issues involved in providing guarantees of timing correctness for multi-core systems. A detailed review is provided covering four main categories: full integration, temporal isolation, integrating interference effects into schedulability analysis, and mapping and allocation. The survey concludes with a discussion of the advantages and disadvantages of these different approaches, identifying open issues, key challenges, and possible directions for future research
Anales del XIII Congreso Argentino de Ciencias de la Computación (CACIC)
Contenido:
Arquitecturas de computadoras
Sistemas embebidos
Arquitecturas orientadas a servicios (SOA)
Redes de comunicaciones
Redes heterogéneas
Redes de Avanzada
Redes inalámbricas
Redes móviles
Redes activas
Administración y monitoreo de redes y servicios
Calidad de Servicio (QoS, SLAs)
Seguridad informática y autenticación, privacidad
Infraestructura para firma digital y certificados digitales
Análisis y detección de vulnerabilidades
Sistemas operativos
Sistemas P2P
Middleware
Infraestructura para grid
Servicios de integración (Web Services o .Net)Red de Universidades con Carreras en Informática (RedUNCI
Anales del XIII Congreso Argentino de Ciencias de la Computación (CACIC)
Contenido:
Arquitecturas de computadoras
Sistemas embebidos
Arquitecturas orientadas a servicios (SOA)
Redes de comunicaciones
Redes heterogéneas
Redes de Avanzada
Redes inalámbricas
Redes móviles
Redes activas
Administración y monitoreo de redes y servicios
Calidad de Servicio (QoS, SLAs)
Seguridad informática y autenticación, privacidad
Infraestructura para firma digital y certificados digitales
Análisis y detección de vulnerabilidades
Sistemas operativos
Sistemas P2P
Middleware
Infraestructura para grid
Servicios de integración (Web Services o .Net)Red de Universidades con Carreras en Informática (RedUNCI