
    Availability modeling and evaluation on high performance cluster computing systems

    Cluster computing has been attracting more and more attention from both industry and academia for its enormous computing power, cost effectiveness, and scalability. The Beowulf cluster, for example, is a typical High Performance Computing (HPC) cluster system. Availability, as a key attribute of such systems, needs to be considered at the design stage and monitored during mission time. Moreover, system monitoring is essential for identifying defects and ensuring that the system's availability requirement is met. In this study, novel solutions that provide availability modeling, model evaluation, and data analysis within a single framework have been investigated. The three key components of the investigation are availability modeling, model evaluation, and data analysis. General availability concepts and modeling techniques are briefly reviewed. The system's availability model is divided into submodels based on their functionalities. Furthermore, an object-oriented Markov model specification has been developed to facilitate availability modeling and runtime configuration. Numerical solutions for Markov models are examined, with emphasis on the uniformization method. Alternative implementations of the method are discussed, in particular the cost of an alternative solution for small state-space models and different ways of solving large sparse Markov models. The dissertation also presents a monitoring and data analysis framework responsible for failure analysis and availability reconfiguration. In addition, event logs provided by the Lawrence Livermore National Laboratory have been studied and used to validate the proposed techniques.
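    The uniformization method mentioned above has a compact standard form. As a minimal illustrative sketch (not the dissertation's implementation), the following Python code computes the transient state probabilities of a small continuous-time Markov availability model; the two-state failure/repair rates are invented for the example, and the simple series truncation assumes a modest uniformization rate so the leading Poisson weight does not underflow.

```python
import numpy as np

def uniformization(Q, pi0, t, tol=1e-10):
    """Transient distribution pi(t) of a CTMC with generator Q via uniformization:
    pi(t) = sum_n e^{-Lam*t} (Lam*t)^n / n! * pi0 @ P^n, with P = I + Q/Lam."""
    Lam = max(-np.diag(Q))               # uniformization rate >= max exit rate
    P = np.eye(Q.shape[0]) + Q / Lam     # DTMC sampled at Poisson event times
    poisson_w = np.exp(-Lam * t)         # Poisson(Lam*t) weight for n = 0
    v = pi0.copy()                       # holds pi0 @ P^n, updated in place
    pi_t = poisson_w * v
    n, acc = 0, poisson_w
    while 1.0 - acc > tol:               # stop once the Poisson tail mass < tol
        n += 1
        v = v @ P
        poisson_w *= Lam * t / n
        pi_t += poisson_w * v
        acc += poisson_w
    return pi_t

# Two-state availability model (rates invented): up (0) -> down (1) at failure
# rate f, down -> up at repair rate r; availability at time t is pi_t[0].
f, r = 1e-3, 1e-1
Q = np.array([[-f, f], [r, -r]])
print(uniformization(Q, np.array([1.0, 0.0]), t=100.0))
```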

    A fault-tolerant multiprocessor architecture for aircraft, volume 1

    A fault-tolerant multiprocessor architecture is reported. This architecture, together with a comprehensive information system architecture, has important potential for future aircraft applications. A preliminary definition and assessment of a suitable multiprocessor architecture for such applications is developed.

    Innovative Techniques for Testing and Diagnosing SoCs

    We rely on the continued functioning of many electronic devices for our everyday welfare, most of them embedding integrated circuits that keep becoming cheaper and smaller while offering improved features. Nowadays, microelectronics can integrate a working computer with CPU, memories, and even GPUs on a single die, namely a System-on-Chip (SoC). SoCs are also employed in automotive safety-critical applications, but they need to be tested thoroughly to comply with reliability standards, in particular ISO 26262 functional safety for road vehicles. The goal of this PhD thesis is to improve SoC reliability by proposing innovative techniques for testing and diagnosing its internal modules: CPUs, memories, peripherals, and GPUs. The proposed approaches, in the sequence in which they appear in this thesis, are as follows:
    1. Embedded Memory Diagnosis: Memories are dense and complex circuits that are susceptible to design and manufacturing errors, so it is important to understand fault occurrence in the memory array. In practice, the logical and physical array representations differ because of design optimizations that add enhancements to the device, namely scrambling. This part proposes an accurate memory diagnosis flow built around a software tool able to analyze test results, unscramble the memory array, map failing syndromes to cell locations, elaborate cumulative analyses, and formulate a final fault model hypothesis. Several SRAM failing syndromes were analyzed as case studies, gathered on an industrial automotive 32-bit SoC developed by STMicroelectronics. The tool displayed defects virtually, and the results were confirmed by photographs taken through a microscope.
    2. Functional Test Pattern Generation: The key to a successful test is the pattern applied to the device. Patterns can be structural or functional; the former usually benefit from embedded test modules targeting manufacturing errors and are only effective before shipping the component to the client. The latter can be applied in mission mode with minimal impact on performance, but are penalized by high generation time. Functional test patterns may, however, serve different goals in functional mission mode. Part III of this thesis proposes three functional test pattern generation methods for CPU cores embedded in SoCs, each targeting a different test purpose:
    a. Functional Stress Patterns: suitable for maximizing functional stress during operational-life tests and burn-in screening, for an optimal device reliability characterization.
    b. Functional Power-Hungry Patterns: suitable for determining functional peak power, used to strictly limit the power of structural patterns during manufacturing tests, thus reducing premature device over-kill while delivering high test coverage.
    c. Software-Based Self-Test (SBST) Patterns: combine the potential of structural patterns with that of functional ones, allowing periodic execution during mission. In addition, an external hardware module communicating with a devised SBST was proposed; it increases fault coverage by 3% by testing critical Hardly Functionally Testable Faults not covered by conventional SBST patterns.
    An automatic functional test pattern generator exploiting an evolutionary algorithm that maximizes metrics related to stress, power, and fault coverage was employed in the above approaches to quickly generate the desired patterns. The approaches were evaluated on two industrial cases developed by STMicroelectronics: an 8051-based SoC and a 32-bit Power Architecture SoC. Results show that generation time was reduced by up to 75% compared with older methodologies while significantly increasing the desired metrics.
    3. Fault Injection in GPGPUs: Fault injection mechanisms in semiconductor devices are suitable for generating structural patterns, testing and activating mitigation techniques, and validating robust hardware and software applications. GPGPUs are known for fast parallel computation in high-performance computing and advanced driver assistance, where reliability is the key point. Moreover, GPGPU manufacturers do not publish design description code, for reasons of content secrecy, so commercial fault injectors relying on a GPGPU model are unfeasible, leaving costly radiation tests as the only available resource. In the last part of this thesis, we propose a software-implemented fault injector able to inject bit flips into memory elements of a real GPGPU. It exploits a software debugger tool and uses the C-CUDA grammar to select fault spots wisely and apply bit-flip operations to program variables. The goal is to validate robust parallel algorithms by studying fault propagation or activating any redundancy mechanisms they embed. The effectiveness of the tool was evaluated on two robust applications: redundant parallel matrix multiplication and a floating-point Fast Fourier Transform.
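    The thesis's injector drives a debugger on real GPGPU hardware; as a hedged, self-contained sketch of only the core operation, the Python code below flips a single bit of a 32-bit IEEE-754 program variable and reports the fault site. All names (`flip_bit`, `inject`, the variable dictionary) are hypothetical.

```python
import random
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a 32-bit IEEE-754 float and return the corrupted value."""
    (bits,) = struct.unpack("<I", struct.pack("<f", value))
    return struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))[0]

def inject(variables: dict, rng: random.Random):
    """Pick a random program variable and bit position, inject a single
    bit flip, and report the fault site so propagation can be traced."""
    name = rng.choice(list(variables))
    bit = rng.randrange(32)
    variables[name] = flip_bit(variables[name], bit)
    return name, bit

# Example: corrupt one element of a small matrix a robust kernel would check.
rng = random.Random(0)
state = {"a00": 1.0, "a01": 2.0, "a10": 3.0, "a11": 4.0}
site = inject(state, rng)
print("fault injected at", site, "->", state)
```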

    Testability and redundancy techniques for improved yield and reliability of CMOS VLSI circuits

    The research presented in this thesis is concerned with the design of fault-tolerant integrated circuits as a contribution to the design of fault-tolerant systems. The economical manufacture of very large area ICs will necessitate the incorporation of fault-tolerance features of the kind routinely employed in current high-density dynamic random access memories. Furthermore, the growing use of ICs in safety-critical applications and/or hostile environments, in addition to the prospect of single-chip systems, will mandate the use of fault tolerance for improved reliability. A fault-tolerant IC must be able to detect and correct all possible faults that may affect its operation. The ability of a chip to detect its own faults is not only necessary for fault tolerance, but is also regarded as the ultimate solution to the problem of testing. Off-line periodic testing is selected for this research because it achieves better coverage of physical faults and requires less extra hardware than on-line error detection techniques. Tests for CMOS stuck-open faults are shown to detect all other faults. Simple test sequence generation procedures for the detection of all faults are derived. The test sequences generated by these procedures produce a trivial output, thereby greatly simplifying the task of test response analysis. A further advantage of the proposed test generation procedures is that they do not require the enumeration of faults. The implementation of built-in self-test is considered, and it is shown that the hardware overhead is comparable to that of pseudo-random and pseudo-exhaustive techniques while achieving a much higher fault coverage through the use of the proposed test generation procedures. Consideration of the problem of testing the test circuitry led to the conclusion that complete test coverage may be achieved if separate chips cooperate in testing each other's untested parts. An alternative approach towards complete test coverage would be to design the test circuitry so that it is as distributed as possible and is tested as it performs its function. Fault correction relies on the provision of spare units and a means of reconfiguring the circuit so that faulty units are discarded. This raises the question of the optimum size of a unit. A mathematical model linking yield and reliability is therefore developed to answer this question and to study the effects of parameters such as the amount of redundancy, the size of the additional circuitry required for testing and reconfiguration, and the effect of periodic testing on reliability. The stringent requirement on the size of the reconfiguration logic is illustrated by applying the model to a typical example. Another important result concerns the effect of periodic testing on reliability: it is shown that periodic off-line testing can achieve approximately the same level of reliability as on-line testing, even when the time between tests is many hundreds of hours.
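    The abstract does not reproduce the thesis's yield/reliability model. As an illustrative stand-in only, the sketch below uses the classic Poisson defect-density yield model with spare units: a chip works if at least the required number of identical units are defect-free and the (non-redundant) test and reconfiguration overhead is also good. All areas, defect densities, and function names are invented.

```python
from math import comb, exp

def unit_yield(area_cm2: float, defect_density: float) -> float:
    """Poisson yield model: probability that a single unit is defect-free."""
    return exp(-area_cm2 * defect_density)

def chip_yield(n_required: int, n_spares: int, area_cm2: float,
               defect_density: float, overhead_yield: float = 1.0) -> float:
    """Chip works if at least n_required of n_required + n_spares identical
    units are good AND the non-redundant reconfiguration logic is good."""
    p = unit_yield(area_cm2, defect_density)
    n = n_required + n_spares
    good_enough = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                      for k in range(n_required, n + 1))
    return overhead_yield * good_enough

# Example: 16 required units of 0.05 cm^2 each at D = 0.5 defects/cm^2.
# Diminishing returns appear quickly, and a low overhead_yield (large
# reconfiguration logic) caps the achievable improvement.
for spares in (0, 1, 2, 4):
    print(spares, "spares ->", round(chip_yield(16, spares, 0.05, 0.5), 4))
```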

    A new emergency control method and a preventive mechanism against cascaded events to avoid large-scale blackouts

    Cascaded events may cause a major blackout leading to massive economic loss and even fatalities. Significant research efforts have been made to address the issue systematically: preventive mechanisms are designed to mitigate the impact of initiating events on power systems; emergency control methods are proposed to prevent power systems from entering an unstable state; and restorative control methods are developed to stop the propagation of power system instability and to prevent widespread blackouts. This work contributes new emergency control methods and preventive mechanisms. First, a new emergency control scheme is proposed for preventing power systems from losing synchronism. Traditional out-of-step relays may fail to predict losses of synchronism as the dynamics of power systems become more and more complex. In recent years, the installation of Phasor Measurement Units (PMUs) in power grids has increased significantly, so a large amount of real-time data is available for online monitoring of power system dynamics. This research proposes a PMU-based application for online monitoring of rotor angle stability. Lyapunov exponents are used to predict a loss of synchronism within large power systems: the relationship between rotor angle stability and the Maximal Lyapunov Exponent (MLE) is established, and a computational algorithm is developed for calculating the MLE in an operational environment. The effectiveness of the monitoring scheme is illustrated with a 3-machine system and a 200-bus system model. Then, a preventive mechanism against cyber attacks is developed. Cyber threats are a serious concern for power systems; for example, hackers may attack power control systems via the interconnected enterprise networks. This research proposes a risk assessment framework to enhance the resilience of power systems against cyber attacks. The Duality Element Relative Fuzzy Evaluation Method (DERFEM) quantitatively evaluates identified security vulnerabilities within the cyber systems of power systems; the attack graph is used to identify possible intrusion scenarios that exploit multiple vulnerabilities; and an Intrusion Response System (IRS) is developed to monitor the impact of intrusion scenarios on power system dynamics in real time. The IRS calculates Conditional Lyapunov Exponents (CLEs) online based on PMU data; power system stability is predicted from the values of the CLEs, and control actions based on the CLEs are suggested if instability is likely. A generic wind farm control system is used as a case study, and the effectiveness of the IRS is illustrated with the IEEE 39-bus system model.
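    The dissertation's operational MLE algorithm is not given in the abstract. As a hedged illustration of the general idea, the sketch below estimates the maximal Lyapunov exponent from a scalar measurement stream using a Rosenstein-style method (delay embedding, nearest-neighbour tracking, and a divergence-curve slope); a positive slope flags exponential divergence, i.e. likely instability. Parameters and the test signal are invented.

```python
import numpy as np

def mle_estimate(series, dim=4, lag=1, horizon=20, theiler=10):
    """Rosenstein-style maximal Lyapunov exponent estimate from a scalar
    time series (e.g. a PMU rotor-angle measurement stream)."""
    n = len(series) - (dim - 1) * lag
    # Delay-embed the series into dim-dimensional state vectors.
    emb = np.column_stack([series[i * lag: i * lag + n] for i in range(dim)])
    m = n - horizon
    div = np.zeros(horizon)
    counts = np.zeros(horizon)
    for i in range(m):
        d = np.linalg.norm(emb[:m] - emb[i], axis=1)
        d[max(0, i - theiler): i + theiler + 1] = np.inf  # Theiler exclusion
        j = int(np.argmin(d))             # nearest neighbour in state space
        for k in range(1, horizon + 1):   # track divergence of the pair
            sep = np.linalg.norm(emb[i + k] - emb[j + k])
            if sep > 0:
                div[k - 1] += np.log(sep)
                counts[k - 1] += 1
    curve = div / np.maximum(counts, 1)
    k = np.arange(1, horizon + 1)
    return np.polyfit(k, curve, 1)[0]     # slope ~ MLE per sample interval

# Synthetic stable signal: the estimate should be near or below zero.
t = np.arange(5000) * 0.01
noise = 0.01 * np.random.default_rng(0).standard_normal(t.size)
print(mle_estimate(np.sin(t) + noise))
```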

    Fault diagnosis and planning optimization in an e-maintenance framework

    The main objective is to demonstrate the potential for improvement that techniques and methodologies related to prescriptive analytics can provide in industrial maintenance applications. The technologies developed can be grouped into three areas:
    - E-maintenance, fundamentally concerned with the development of collaborative and intelligent platforms that allow the integration of new sensors, communication systems, standards and protocols, concepts, and storage and analysis methods, which continually expand our range of possibilities and enable ongoing improvement in the optimization of assets and processes and in interoperability between systems.
    - Bayesian Networks (BNs), which, together with other information-gathering methodologies used in engineering, offer the possibility of automating the task of fault diagnosis and prediction.
    - The optimization of maintenance strategies, through failure simulations and cost-effectiveness analyses that support decision making when selecting a suitable maintenance strategy for an asset. In addition, optimization algorithms are used to improve maintenance scheduling, reducing the time and cost of performing the tasks across a fleet of assets.
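    The thesis does not specify its network structures in the abstract. As a minimal sketch of how a BN automates diagnosis, the Python example below computes a fault posterior from observed symptoms by full enumeration over a tiny noisy-OR network; every name and probability here is hypothetical.

```python
from itertools import product

# Hypothetical two-fault, two-symptom diagnosis network: each fault is an
# independent root; each symptom depends on the faults through a noisy-OR
# model, a common parameterization for diagnostic Bayesian networks.
P_FAULT = {"bearing_wear": 0.05, "misalignment": 0.02}   # prior fault rates
LEAK = {"vibration": 0.01, "overheat": 0.02}             # leak probabilities
STRENGTH = {                                   # P(symptom | single fault alone)
    ("vibration", "bearing_wear"): 0.8,
    ("vibration", "misalignment"): 0.6,
    ("overheat", "bearing_wear"): 0.3,
    ("overheat", "misalignment"): 0.7,
}

def p_symptom(symptom, faults_on):
    """Noisy-OR: the symptom is absent only if the leak and every active
    cause all fail to trigger it."""
    q = 1.0 - LEAK[symptom]
    for f in faults_on:
        q *= 1.0 - STRENGTH[(symptom, f)]
    return 1.0 - q

def posterior(target_fault, evidence):
    """P(target_fault = true | observed symptoms) by full enumeration."""
    num = den = 0.0
    for states in product([True, False], repeat=len(P_FAULT)):
        assign = dict(zip(P_FAULT, states))
        joint = 1.0
        for f, on in assign.items():
            joint *= P_FAULT[f] if on else 1.0 - P_FAULT[f]
        on_faults = [f for f, v in assign.items() if v]
        for s, observed in evidence.items():
            ps = p_symptom(s, on_faults)
            joint *= ps if observed else 1.0 - ps
        den += joint
        if assign[target_fault]:
            num += joint
    return num / den

print(posterior("bearing_wear", {"vibration": True, "overheat": False}))
```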

    Wireless Monitoring Systems for Long-Term Reliability Assessment of Bridge Structures based on Compressed Sensing and Data-Driven Interrogation Methods.

    The state of the nation’s highway bridges has garnered significant public attention due to large inventories of aging assets and insufficient funds for repair. Current management methods are based on visual inspections, which have many known limitations, including reliance on surface evidence of deterioration and subjectivity introduced by trained inspectors. To address the limitations of current inspection practice, structural health monitoring (SHM) systems can be used to provide quantitative measures of structural behavior and an objective basis for condition assessment. SHM systems are intended to be a cost-effective monitoring technology that also automates the processing of data to characterize damage and provide decision information to asset managers. Unfortunately, this realization of SHM systems does not currently exist. For SHM to be realized as a decision support tool for bridge owners engaged in performance- and risk-based asset management, technological hurdles must still be overcome. This thesis focuses on advancing wireless SHM systems. An innovative wireless monitoring system was designed for permanent deployment on bridges in cold northern climates, which pose an added challenge because the potential for solar harvesting is reduced and battery charging is slowed. First, energy-efficient usage strategies for wireless sensor networks (WSNs) were advanced. With WSN energy consumption proportional to the amount of data transmitted, data reduction strategies are prioritized. A novel data compression paradigm termed compressed sensing is advanced for embedding in a wireless sensor's microcontroller. In addition, fatigue monitoring algorithms are embedded for local data processing, leading to dramatic data reductions. In the second part of the thesis, a radical top-down design strategy (in contrast to global vibration strategies) for a monitoring system is explored to target specific damage concerns of bridge owners. Data-driven algorithmic approaches are created for statistical performance characterization of long-term bridge response. Statistical process control and reliability index monitoring are advanced as a scalable and autonomous means of transforming data into information relevant to bridge risk management. Validation of the wireless monitoring system architecture is made using the Telegraph Road Bridge (Monroe, Michigan), a multi-girder short-span highway bridge type that represents a major fraction of the U.S. national inventory.
    PhD, Civil Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/116749/1/ocosean_1.pd
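    Compressed sensing acquires a sparse signal through far fewer random projections than conventional samples and reconstructs it after transmission; the thesis embeds the encoding side in sensor firmware. As a hedged desktop-side illustration (not the deployed implementation), the sketch below compresses a synthetic sparse signal and recovers it with orthogonal matching pursuit. Real bridge responses are sparse in a transform basis (e.g. Fourier) rather than in the sample domain, a detail this toy example glosses over.

```python
import numpy as np

def omp(Phi, y, sparsity):
    """Orthogonal matching pursuit: greedily recover sparse x from y = Phi @ x."""
    residual = y.astype(float).copy()
    support = []
    coef = np.zeros(0)
    for _ in range(sparsity):
        # Pick the column most correlated with the current residual.
        support.append(int(np.argmax(np.abs(Phi.T @ residual))))
        coef, *_ = np.linalg.lstsq(Phi[:, support], y, rcond=None)
        residual = y - Phi[:, support] @ coef
    x_hat = np.zeros(Phi.shape[1])
    x_hat[support] = coef
    return x_hat

rng = np.random.default_rng(0)
n, m, k = 256, 64, 5                    # signal length, measurements, sparsity
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.standard_normal(k)   # sparse signal
Phi = rng.standard_normal((m, n)) / np.sqrt(m)  # random projections (node side)
y = Phi @ x                                     # only m << n values transmitted
print("recovery error:", np.linalg.norm(omp(Phi, y, k) - x))
```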

    Simheuristics to support efficient and sustainable freight transportation in smart city logistics

    Smart city logistics are a crucial factor in the creation of efficient and sustainable urban transportation systems. Among other factors, they focus on incorporating real-time data and creating collaborative business models in urban freight transportation concepts, whilst also considering rising urban population numbers, increasingly complex customer demands, and highly competitive markets. This allows transportation planners to minimize the monetary and environmental costs of freight transportation in metropolitan areas. Many decision-making problems faced in this context can be formulated as combinatorial optimization problems. While different exact solving approaches exist to find optimal solutions to such problems, their complexity and size, in addition to the need for instantaneous decision-making regarding vehicle routing, scheduling, or facility location, make such methodologies inapplicable in practice. Due to their ability to find pseudo-optimal solutions in almost real time, metaheuristic algorithms have received increasing attention from researchers and practitioners as efficient and reliable alternatives for solving numerous optimization problems in the creation of smart city logistics. Despite their success, traditional metaheuristic techniques fail to fully represent the complexity of most realistic systems. By assuming deterministic problem inputs and constraints, the uncertainty and dynamism experienced in urban transportation scenarios are left unaccounted for. Simheuristic frameworks try to overcome these drawbacks by integrating simulation of any type into metaheuristic-driven processes to account for the inherent uncertainty in most real-life applications. This thesis defines and investigates the use of simheuristics as a method of first resort for solving optimization problems arising in smart city logistics concepts. Simheuristic algorithms are applied to a range of complex problem settings, including urban waste collection, integrated supply chain design, and innovative transportation models related to horizontal collaboration among supply chain partners. In addition to methodological discussions and the comparison of the developed algorithms with state-of-the-art benchmarks from the academic literature, the applicability and efficiency of simheuristic frameworks are shown in several large-scale case studies.
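    A simheuristic couples a metaheuristic search with simulation-based evaluation of promising candidates. The Python skeleton below is a generic sketch under invented data (a small routing instance with lognormal travel-time noise), not one of the thesis's algorithms: a 2-swap local search ranks neighbours by deterministic cost, and only promising ones are passed to a Monte Carlo simulation that estimates expected cost under uncertainty.

```python
import random

def det_cost(tour, dist):
    """Deterministic tour cost used as a cheap filter."""
    return sum(dist[tour[i]][tour[(i + 1) % len(tour)]] for i in range(len(tour)))

def sim_cost(tour, dist, rng, runs=200):
    """Monte Carlo estimate of expected cost with lognormal travel-time noise."""
    total = 0.0
    for _ in range(runs):
        total += sum(dist[tour[i]][tour[(i + 1) % len(tour)]] *
                     rng.lognormvariate(0.0, 0.25) for i in range(len(tour)))
    return total / runs

def simheuristic(dist, iters=500, seed=1):
    rng = random.Random(seed)
    n = len(dist)
    best = list(range(n))
    best_sim = sim_cost(best, dist, rng)
    for _ in range(iters):
        cand = best[:]
        i, j = rng.sample(range(n), 2)            # 2-swap neighbourhood move
        cand[i], cand[j] = cand[j], cand[i]
        # Simulate only candidates that look promising deterministically.
        if det_cost(cand, dist) <= det_cost(best, dist) * 1.02:
            cand_sim = sim_cost(cand, dist, rng)
            if cand_sim < best_sim:               # accept on expected cost
                best, best_sim = cand, cand_sim
    return best, best_sim

rng = random.Random(0)
n = 8
dist = [[0 if i == j else rng.uniform(1, 10) for j in range(n)] for i in range(n)]
print(simheuristic(dist))
```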

    Autonomous Recovery Of Reconfigurable Logic Devices Using Priority Escalation Of Slack

    Field Programmable Gate Array (FPGA) devices offer a suitable platform for survivable hardware architectures in mission-critical systems. In this dissertation, active dynamic redundancy-based fault-handling techniques are proposed which exploit the dynamic partial reconfiguration capability of SRAM-based FPGAs. Self-adaptation is realized by employing reconfiguration in the detection, diagnosis, and recovery phases. To extend these concepts to semiconductor aging and process variation in the deep submicron era, resilient adaptable processing systems are sought that maintain quality and throughput requirements despite the vulnerabilities of the underlying computational devices. A new approach to autonomous fault handling which addresses these goals is developed using only a uniplex hardware arrangement. It operates by observing a health metric to achieve Fault Demotion using Reconfigurable Slack (FaDReS). Here, an autonomous fault isolation scheme is employed which neither requires test vectors nor suspends computational throughput, but instead observes the value of a health metric based on runtime input. The deterministic flow of the fault isolation scheme guarantees success in a bounded number of reconfigurations of the FPGA fabric. FaDReS is then extended to the Priority Using Resource Escalation (PURE) online redundancy scheme, which considers fault-isolation latency and throughput trade-offs under a dynamic spare arrangement. While deep-submicron designs introduce new challenges, the use of adaptive techniques is seen to provide several promising avenues for improving resilience. The schemes developed are demonstrated through the hardware design of various signal processing circuits and their implementation on a Xilinx Virtex-4 FPGA device. These include a Discrete Cosine Transform (DCT) core, a Motion Estimation (ME) engine, a Finite Impulse Response (FIR) filter, a Support Vector Machine (SVM), and Advanced Encryption Standard (AES) blocks, in addition to MCNC benchmark circuits. A significant reduction in power consumption is achieved, ranging from 83% for low-motion-activity scenes to 12.5% for high-motion-activity video scenes in a novel ME engine configuration. For a typical benchmark video sequence, PURE is shown to maintain a PSNR baseline near 32 dB. The diagnosability, reconfiguration latency, and resource overhead of each approach are analyzed. Compared to previous alternatives, PURE maintains a PSNR within 4.02 dB to 6.67 dB of the fault-free baseline by escalating healthy resources to higher-priority signal processing functions. The results indicate the benefits of priority-aware resiliency over conventional redundancy approaches in terms of fault recovery, power consumption, and resource-area requirements. Together, these provide a broad range of strategies to achieve autonomous recovery of reconfigurable logic devices under a variety of constraints, operating conditions, and optimization criteria.
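    The abstract describes the isolation flow only at a high level. As a simplified analogue (not the FaDReS procedure itself), the sketch below shows how a deterministic divide-and-conquer probe sequence, where each probe remaps the function onto a subset of regions and reads a pass/fail health metric, locates a faulty region in a bounded number of reconfigurations; regions, the hidden fault, and the health function are all simulated stand-ins.

```python
import math

def isolate_faulty_region(regions, health_when_mapped):
    """Deterministic divide-and-conquer fault isolation: each probe remaps the
    function onto a subset of regions and observes a pass/fail health metric,
    so one faulty region among N is located in at most ceil(log2(N)) probes,
    i.e. a bounded number of partial reconfigurations."""
    lo, hi = 0, len(regions) - 1
    probes = 0
    while lo < hi:
        mid = (lo + hi) // 2
        probes += 1                     # one partial reconfiguration + observation
        if health_when_mapped(regions[lo:mid + 1]):
            lo = mid + 1                # left half healthy: fault lies right
        else:
            hi = mid                    # health degraded: fault lies left
    return regions[lo], probes

# Simulated fabric with 8 reconfigurable regions and a hidden fault in R5.
REGIONS = [f"R{i}" for i in range(8)]
FAULTY = "R5"
healthy = lambda subset: FAULTY not in subset    # stand-in health metric
region, probes = isolate_faulty_region(REGIONS, healthy)
print(region, "isolated in", probes, "reconfigurations (bound:",
      math.ceil(math.log2(len(REGIONS))), ")")
```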