5 research outputs found

    Characterization and Avoidance of Critical Pipeline Structures in Aggressive Superscalar Processors

    In recent years, with only small fractions of modern processors now accessible in a single cycle, computer architects constantly fight signal-propagation delays across the die. Unfortunately this trend continues to shift inward, and now even the most internal features of the pipeline are designed around communication, not computation. To address the inward creep of this constraint, this work focuses on characterizing communication within the pipeline itself, on architectural techniques to avoid it when possible, and on layout co-design for early detection of problems. I present a novel detection tool for common-case operand movement that can rapidly characterize an application's dataflow patterns. The results are readily exploitable, as a small number of patterns describes a significant portion of modern applications. Work on dynamic dependence collapsing takes the observations from these patterns and shows how certain groups of operations can be grouped dynamically, avoiding unnecessary communication between individual instructions. This technique also amplifies the efficiency of pipeline data structures such as the reorder buffer, increasing both IPC and frequency. I also identify the same sets of collapsible instructions at compile time, producing the same benefits with minimal hardware complexity. This compile-time technique is backward compatible, as the groups are exposed by simply reordering the binary's instructions. I present aggressive pipelining approaches for these resources that avoid the critical timing often presumed necessary in aggressive superscalar processors. Because these structures are designed for the worst case, pipelining them can produce a greater frequency benefit than the IPC loss it incurs. I also use the observation that the dynamic issue order of instructions in aggressive superscalar processors is predictable. Thus, a hardware mechanism is introduced for efficiently caching the wakeup order for groups of instructions. These wakeup vectors are then used to speculatively schedule instructions, avoiding dynamic scheduling when it is not necessary. Finally, I present a novel approach to fast, high-quality chip layout. By allowing architects to quickly evaluate "what if" scenarios during early high-level design, chip designs are less likely to encounter implementation problems later in the process.
    Ph.D. Committee Chair: Scott Wills; Committee Member: David Schimmel; Committee Member: Gabriel Loh; Committee Member: Hsien-Hsin Lee; Committee Member: Yorai Ward
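
    The dependence-collapsing idea above can be illustrated with a small, self-contained sketch: scan an instruction trace and, whenever a result has exactly one consumer and that consumer is the very next instruction, fuse the pair into a single issue group so no operand needs to travel between them. This is my own minimal illustration, not the thesis's actual tool; the instruction format and the pairing rule are assumptions.

```python
# Illustrative sketch (not the thesis's actual mechanism): greedily fuse a
# producer with its sole consumer so the pair issues as one macro-op, removing
# the operand communication between them. Register redefinition is ignored
# for brevity.
from collections import namedtuple

Insn = namedtuple("Insn", ["op", "dest", "srcs"])

def collapse_dependent_pairs(trace):
    """Return a list of issue groups; each group holds one or two Insns."""
    # Record which later instructions read each produced register.
    consumers = {}
    for idx, insn in enumerate(trace):
        for src in insn.srcs:
            consumers.setdefault(src, []).append(idx)

    groups, used = [], set()
    for idx, insn in enumerate(trace):
        if idx in used:
            continue
        readers = consumers.get(insn.dest, [])
        # Collapse only when the value has exactly one reader and that reader
        # is the immediately following instruction.
        if len(readers) == 1 and readers[0] == idx + 1:
            groups.append([insn, trace[idx + 1]])
            used.add(idx + 1)
        else:
            groups.append([insn])
    return groups

if __name__ == "__main__":
    trace = [
        Insn("add", "r1", ("r2", "r3")),
        Insn("shl", "r4", ("r1",)),      # sole consumer of r1 -> collapsed
        Insn("mul", "r5", ("r4", "r6")),
    ]
    for group in collapse_dependent_pairs(trace):
        print(" + ".join(i.op for i in group))   # "add + shl", then "mul"
```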

    ADAM: A Decentralized Parallel Computer Architecture Featuring Fast Thread and Data Migration and a Uniform Hardware Abstraction

    The furious pace of Moore's Law is driving computer architecture into a realm where the speed of light is the dominant factor in system latencies. The number of clock cycles needed to span a chip is increasing, while the number of bits that can be accessed within a clock cycle is decreasing. Hence, it is becoming more difficult to hide latency. One alternative is to reduce latency by migrating threads and data, but the overhead of existing implementations has so far made migration an impractical solution. I present an architecture, implementation, and mechanisms that reduce the overhead of migration to the point where it is a viable supplement to other latency-hiding mechanisms, such as multithreading. The architecture is abstract, and presents programmers with a simple, uniform, fine-grained multithreaded parallel programming model with implicit memory management. In other words, the spatial nature and implementation details (such as the number of processors) of a parallel machine are entirely hidden from the programmer. Compiler writers are encouraged to devise programming languages for the machine that guide a programmer to express their ideas in terms of objects, since objects exhibit an inherent physical locality of data and code. The machine implementation can then leverage this locality to automatically distribute data and threads across the physical machine using a set of high-performance migration mechanisms. An implementation of this architecture could migrate a null thread in 66 cycles -- over a factor of 1000 improvement over previous work. Performance also scales well; the time required to move a typical thread is only 4 to 5 times that of a null thread. Data migration performance is similar, and scales linearly with data block size. Since the performance of the migration mechanism is on par with that of an L2 cache, the implementation simulated in my work has no data caches and instead relies on multithreading and the migration mechanism to hide and reduce access latencies.
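
    A back-of-the-envelope model helps show why migration at these costs becomes attractive. The sketch below is my own illustration, not ADAM's actual policy: it reuses the abstract's 66-cycle null-thread figure and the roughly 4-5x factor for a typical thread, and treats the remote and local access latencies as assumed parameters.

```python
# Break-even model (a sketch under stated assumptions, not ADAM's policy) for
# when migrating a thread toward its data beats paying remote-access latency.
NULL_THREAD_MIGRATION = 66      # cycles, figure quoted in the abstract
TYPICAL_THREAD_FACTOR = 4.5     # typical thread is ~4-5x a null thread

def migration_pays_off(remote_accesses, remote_latency, local_latency=2,
                       thread_factor=TYPICAL_THREAD_FACTOR):
    """True if moving the thread next to its data is cheaper than staying put."""
    migrate_cost = NULL_THREAD_MIGRATION * thread_factor
    stay_cost = remote_accesses * remote_latency
    go_cost = migrate_cost + remote_accesses * local_latency
    return go_cost < stay_cost

if __name__ == "__main__":
    # With ~30-cycle remote accesses, roughly a dozen accesses already justify
    # migrating under these assumptions.
    for n in (5, 10, 20, 50):
        print(n, migration_pays_off(n, remote_latency=30))
```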

    Low-cost and efficient fault detection and diagnosis schemes for modern cores

    Continuous improvements in transistor scaling together with microarchitectural advances have made possible the widespread adoption of high-performance processors across all market segments. However, the growing reliability threats induced by technology scaling and by the complexity of designs are challenging the production of cheap yet robust systems. Soft error trends are alarming, especially for combinational logic, and parity and ECC codes are becoming insufficient as combinational logic turns into the dominant source of soft errors. Furthermore, experts are warning about the need to also address intermittent and permanent faults during processor runtime, as increasing temperatures and device variations will accelerate inherent aging phenomena. These challenges especially threaten the commodity segments, which impose requirements that existing fault tolerance mechanisms cannot meet. Current techniques based on redundant execution were devised at a time when high penalties were accepted for the sake of high reliability levels. Novel lightweight techniques are therefore needed to enable fault protection in the mass-market segments. The complexity of designs is also making post-silicon validation extremely expensive. Validation costs exceed design costs, and the number of discovered bugs is growing, both during validation and once products hit the market. Fault localization and diagnosis are the biggest bottlenecks, magnified by huge detection latencies, limited internal observability, and costly server farms to generate test outputs. This thesis explores two directions to address some of the critical challenges introduced by unreliable technologies and by the limitations of current validation approaches. We first explore mechanisms for comprehensively detecting multiple sources of failures in modern processors during their lifetime (including transient, intermittent, and permanent faults, as well as design bugs). Our solutions embrace a paradigm where fault tolerance is built by exploiting high-level microarchitectural invariants that are reusable across designs, rather than relying on re-execution or ad hoc block-level protection. To do so, we decompose the basic functionalities of processors into high-level tasks and propose three novel runtime verification solutions that, combined, enable global error detection: a computation/register dataflow checker, a memory dataflow checker, and a control flow checker. The techniques use the concept of end-to-end signatures and allow designers to adjust the fault coverage to their needs by trading off area, power, and performance. Our fault injection studies reveal that our methods provide high coverage levels while causing significantly lower performance, power, and area costs than existing techniques. This thesis then extends the applicability of the proposed error detection schemes to the validation phases. We present a fault localization and diagnosis solution for the memory dataflow that combines our error detection mechanism, a new low-cost logging mechanism, and a diagnosis program. Selected internal activity is continuously traced and kept in a memory-resident log whose capacity can be expanded to suit validation needs. The solution can catch undiscovered bugs, reducing the dependence on simulation farms that compute golden outputs. Upon error detection, the diagnosis algorithm analyzes the log to automatically locate the bug and to determine its root cause.
Our evaluations show that very high localization coverage and diagnosis accuracy can be obtained at very low performance and area costs. The net result is a simplification of current debugging practices, which are extremely manual, time consuming, and cumbersome. Altogether, the integrated solutions proposed in this thesis enable the industry to deliver more reliable and correct processors as technology evolves into more complex designs and more vulnerable transistors.
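
    The end-to-end signature idea behind these checkers can be pictured with a toy sketch: the front end records which producer it expects to feed each source operand, the back end records which producer was actually read, and two running signatures are compared at retirement. This is my own simplified illustration under assumed names, not the thesis's exact scheme, and the CRC is just a stand-in for whatever compact signature the hardware would use.

```python
# Toy illustration (a sketch, not the thesis's exact checker) of end-to-end
# signatures for register dataflow checking.
import zlib

def signature(edges):
    """Fold a sequence of (consumer, operand_slot, producer) edges into a CRC."""
    sig = 0
    for consumer, slot, producer in edges:
        sig = zlib.crc32(f"{consumer}:{slot}:{producer}".encode(), sig)
    return sig

class DataflowChecker:
    def __init__(self):
        self.expected_edges = []   # set up at rename/dispatch
        self.observed_edges = []   # recorded when operands are actually read

    def on_rename(self, consumer, slot, producer):
        self.expected_edges.append((consumer, slot, producer))

    def on_operand_read(self, consumer, slot, producer):
        self.observed_edges.append((consumer, slot, producer))

    def check_at_retire(self):
        # A mismatch means some instruction consumed a value it should not have.
        return signature(self.expected_edges) == signature(self.observed_edges)

if __name__ == "__main__":
    chk = DataflowChecker()
    chk.on_rename("i7", 0, "i3")
    chk.on_operand_read("i7", 0, "i3")
    print(chk.check_at_retire())          # True: dataflow matched
    chk.on_rename("i8", 1, "i5")
    chk.on_operand_read("i8", 1, "i4")    # a corrupted bypass/selection
    print(chk.check_at_retire())          # False: mismatch flags an error
```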

    Dependable Embedded Systems

    This Open Access book introduces readers to many new techniques for enhancing and optimizing reliability in embedded systems, which have emerged particularly within the last five years. The book introduces the most prominent reliability concerns from today's points of view and roughly recapitulates the progress in the community so far. Unlike other books that focus on a single abstraction level, such as the circuit level or the system level alone, this book deals with reliability challenges across different levels, starting from the physical level all the way to the system level (cross-layer approaches). It aims to demonstrate how new hardware/software co-design solutions can be proposed to effectively mitigate reliability degradation such as transistor aging, process variation, temperature effects, and soft errors. Provides readers with the latest insights into novel, cross-layer methods and models with respect to the dependability of embedded systems; describes cross-layer approaches that can leverage reliability through techniques that are proactively designed with respect to techniques at other layers; explains run-time adaptation and concepts/means of self-organization in order to achieve error resiliency in complex, future many-core systems.

    Investigation of a superscalar operand stack using FO4 and ASIC wire-delay metrics

    Complexity in processor microarchitecture and the related issues of power density, hot spots, and wire delay are seen as a major concern for design migration into the low-nanometer technologies of the future. This paper evaluates the hardware cost of an alternative to register-file organization, the superscalar stack issue array (SSIA). We believe this is the first such reported study using discrete stack elements. Several possible implementations are evaluated, using a 90 nm standard cell library as a reference model, yielding delay data and FO4 metrics. The evaluation, including reference to ASIC layout, RC extraction, and timing simulation, suggests a 4-wide issue rate of at least four giga-ops/sec at 90 nm and opportunities for a twofold future improvement by using more advanced design approaches.
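
    FO4 normalization of the kind this paper reports can be sketched in a few lines: a critical-path delay from layout and RC extraction is divided by the FO4 inverter delay of the process so results can be compared across nodes. The rule-of-thumb constant and the example path delay below are my own assumptions, not the paper's measured data.

```python
# Sketch of FO4 normalization and peak-throughput estimation under assumed
# constants (not the paper's measured figures).
PS_PER_UM_FO4 = 360.0   # rule of thumb: FO4 delay ~= 360 ps x drawn gate length in um

def fo4_delay_ps(node_nm):
    """Approximate FO4 inverter delay for a given drawn gate length."""
    return PS_PER_UM_FO4 * (node_nm / 1000.0)

def path_in_fo4(path_delay_ps, node_nm):
    """Normalize a critical-path delay (e.g. from RC extraction) to FO4 units."""
    return path_delay_ps / fo4_delay_ps(node_nm)

def peak_ops_per_sec(path_delay_ps, issue_width):
    """Peak throughput if this path sets the cycle time."""
    return issue_width / (path_delay_ps * 1e-12)

if __name__ == "__main__":
    node = 90        # nm, matching the paper's reference library
    path = 900.0     # ps, a hypothetical extracted critical-path delay
    print(f"{path_in_fo4(path, node):.1f} FO4")            # ~27.8 FO4
    print(f"{peak_ops_per_sec(path, 4)/1e9:.1f} Gops/s")   # ~4.4 Gops/s, 4-wide
```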