
    Report of the IEEE Workshop on Measurement and Modeling of Computer Dependability

    Coordinated Science Laboratory was formerly known as Control Systems Laboratory. NASA Langley Research Center / NASA NAG-1-602 and NASA NAG-1-613; ONR / N00014-85-K-000

    Determinism Enhancement and Reliability Assessment in Safety Critical AFDX Networks

    AFDX is an Ethernet-based technology that has been developed to meet the challenges arising from the growing number of data-intensive applications in modern Integrated Modular Avionics systems. This safety-critical technology has been standardized in ARINC 664 Part 7, whose purpose is to define a deterministic network providing predictable performance guarantees. In particular, AFDX is composed of two redundant networks, which provide the high reliability required to support its determinism. The determinism of AFDX is mainly achieved by the concept of the Virtual Link, which defines a logical unidirectional connection from one source End System to one or more destination End Systems. For Virtual Links, end-to-end delay upper bounds can be obtained using Network Calculus. However, such upper bounds have been shown to be pessimistic in many cases, which may lead to inefficient use of resources and aggravate network design complexity. Besides, due to asynchronism, there exists a source of non-determinism in AFDX networks, namely the uncertainty of frame arrival at a destination End System, which poses a problem for real-time fault detection. Furthermore, although a redundancy management mechanism is employed to enhance the reliability of AFDX networks, potential risks remain, as pointed out in ARINC 664 Part 7, which may defeat redundant transmission in some special cases. Therefore, the purpose of this thesis is to improve the performance and the reliability of AFDX networks. First, a mechanism based on frame insertion is proposed to enhance the determinism of frame arrival within AFDX networks. Because frame insertion increases the network load and the average bandwidth used by a Virtual Link, a Sub-Virtual Link aggregation strategy is introduced and formulated as a multi-objective optimization problem, and three algorithms are developed to solve it. Next, an approach is introduced to incorporate performance analysis into reliability assessment by treating delay violations as failures. This allows deriving tighter probabilistic upper bounds for Virtual Links that could be applied in AFDX network certification. To conduct the necessary reliability analysis, the well-known Fault-Tree Analysis technique is employed, and Stochastic Network Calculus is applied to compute the upper bounds under various probability limits.
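
    The abstract refers to Network Calculus delay bounds without stating them. For orientation only, here is the standard textbook bound (not a result specific to this thesis) for a flow constrained by a leaky-bucket arrival curve crossing a rate-latency server; the symbols b, r, R, and T below are generic parameters, not values from the thesis.

```latex
% Standard (min,plus) Network Calculus delay bound, stated for context only.
% Arrival curve:  \alpha(t) = b + r t          (leaky bucket: burst b, rate r)
% Service curve:  \beta(t)  = R\,(t - T)^{+}   (rate-latency server: rate R, latency T)
% For r \le R, the worst-case delay is the horizontal deviation between the curves:
\[
  D \;\le\; h(\alpha,\beta)
    \;=\; \sup_{t \ge 0}\,\inf\{\, d \ge 0 \;:\; \alpha(t) \le \beta(t+d) \,\}
    \;=\; T + \frac{b}{R}.
\]
```

    Deterministic bounds of this kind must hold for every admissible arrival pattern, which is one source of the pessimism discussed above; Stochastic Network Calculus, used later in the thesis, instead derives bounds that hold up to a chosen violation probability.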

    Information fusion architectures for security and resource management in cyber physical systems

    Data acquisition through sensors is crucial in determining the operability of the observed physical entity. Cyber Physical Systems (CPSs) are an example of distributed systems in which sensors embedded into the physical system are used for sensing and data acquisition. CPSs are a collaboration between the physical and the computational cyber components. The control decisions sent from the computational cyber components back to the actuators on the physical components close the feedback loop of the CPS. Since this feedback is based solely on the data collected through the embedded sensors, information acquisition from the data plays a vital role in determining the operational stability of the CPS. The data collection process may be hindered by disturbances such as system faults, noise, and security attacks. Hence, simple data acquisition techniques will not suffice, as an accurate system representation cannot be obtained. Therefore, more powerful methods of inferring information from collected data, such as Information Fusion, have to be used. Information fusion is analogous to the cognitive process humans use to continuously integrate data from their senses to make inferences about their environment. Data from the sensors is combined using techniques drawn from several disciplines such as Adaptive Filtering, Machine Learning, and Pattern Recognition. Decisions made from such combinations of data form the crux of information fusion and differentiate it from flat-structured data aggregation. In this dissertation, multi-layered information fusion models are used to develop automated decision-making architectures that serve security and resource management requirements in Cyber Physical Systems --Abstract, page iv
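
    As a purely illustrative aside, the basic idea of fusing redundant sensor readings into a single, lower-variance estimate can be shown with textbook inverse-variance weighting of independent Gaussian measurements. This is a minimal sketch, not the multi-layered fusion architectures developed in the dissertation; the `fuse` helper and the sample readings are hypothetical.

```python
# Minimal sketch of one information-fusion step: inverse-variance weighting of
# independent Gaussian sensor estimates. Illustrative only; not the dissertation's
# multi-layered fusion architecture.

def fuse(estimates):
    """Fuse (value, variance) pairs from several sensors into one estimate."""
    weights = [1.0 / var for _, var in estimates]             # inverse-variance weights
    total = sum(weights)
    value = sum(w * x for w, (x, _) in zip(weights, estimates)) / total
    return value, 1.0 / total                                 # fused value and its variance

if __name__ == "__main__":
    # Two hypothetical sensors observing the same physical quantity.
    print(fuse([(20.3, 0.5), (19.8, 0.2)]))                   # fused value near 19.94
```

    The fused variance is smaller than either input variance, which is what makes fusing several noisy sensors preferable to trusting any single one in isolation.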

    Dual protocol performance using WiFi and ZigBee for industrial WLAN

    The purpose of this thesis is to study the performance of a wireless networked control system (WNCS) based on IEEE 802.15.4 and IEEE 802.11 in meeting industrial requirements, as well as the extent of improvement at the network level in terms of latency and interference tolerance when the two protocols, WiFi and ZigBee, are used in parallel. The study evaluates the optimum performance of a WNCS that uses only the unmodified IEEE 802.15.4 protocol (on which ZigBee is based) as a low-cost, low-power alternative to other wireless technologies. The study also evaluates the optimum performance of a WNCS that uses only the unmodified IEEE 802.11 protocol (WiFi) as a high-bit-rate network. OMNeT++ simulations are used to measure the end-to-end delay and packet loss from the sensors to the controller and from the controller to the actuators. It is demonstrated that the measured delay of the proposed WNCS, including all types of transmission, encapsulation, de-capsulation, queuing, and propagation, meets real-time control network requirements while guaranteeing correct packet reception with no packet loss. Moreover, it is shown that the proposed WNCS operating redundantly on both networks in parallel performs significantly better, in terms of measured delay and interference tolerance, than a WNCS operating on either a purely ZigBee or a purely WiFi wireless network. The proposed WNCS thus combines the advantages of the unmodified IEEE 802.15.4 protocol (on which ZigBee is based), namely low cost and low power compared to other wireless technologies, with the advantages of the IEEE 802.11 protocol (WiFi), namely higher bit rate and higher immunity to interference. All results presented in this study are based on a 95% confidence analysis.
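
    As a hedged sketch of the 95% confidence analysis mentioned above (the actual OMNeT++ post-processing is not described in this abstract), a confidence interval for a mean end-to-end delay can be computed from simulation samples as follows; the function name and the sample values are hypothetical.

```python
# Sketch of a 95% confidence interval for a mean end-to-end delay, assuming
# approximately normally distributed samples. Illustrative only.
import math
import statistics

def confidence_interval_95(samples):
    """Return (mean, half-width) of an approximate 95% CI for the mean."""
    n = len(samples)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)             # sample standard deviation
    # 1.96 is the normal-approximation quantile; a Student-t quantile is more
    # accurate for small sample counts.
    half_width = 1.96 * stdev / math.sqrt(n)
    return mean, half_width

if __name__ == "__main__":
    delays_ms = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8, 4.1, 4.0]   # hypothetical delay samples
    m, h = confidence_interval_95(delays_ms)
    print(f"mean delay = {m:.2f} ms +/- {h:.2f} ms (95% CI)")
```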

    Dependence-driven techniques in system design

    Burstiness in workloads is often found in multi-tier architectures, storage systems, and communication networks. This feature is extremely important in system design because it can significantly degrade system performance and availability. This dissertation focuses on how to use knowledge of burstiness to develop new techniques and tools for performance prediction, scheduling, and resource allocation under bursty workload conditions.
    For multi-tier enterprise systems, burstiness in the service times is catastrophic for performance. Via detailed experimentation, we identify the persistent bottleneck switch among the various servers as the cause of the performance degradation. This results in an unstable behavior that cannot be captured by existing capacity planning models. Beyond identifying the cause and effects of bottleneck switch in multi-tier systems, this dissertation also proposes modifications to the classic TPC-W benchmark to emulate bursty arrivals in multi-tier systems.
    This dissertation also demonstrates how burstiness can be used to improve system performance. Two dependence-driven scheduling policies, SWAP and ALoC, are developed. These general scheduling policies counteract burstiness in workloads and maintain high availability by delaying selected requests that contribute to burstiness. Extensive experiments show that both SWAP and ALoC achieve good estimates of service times based on knowledge of burstiness in the service process. As a result, SWAP successfully approximates shortest-job-first (SJF) scheduling without requiring a priori information about job service times. ALoC adaptively controls system load by infinitely delaying only a small fraction of the incoming requests.
    The knowledge of burstiness can also be used to forecast the length of idle intervals in storage systems. In practice, background activities are scheduled during system idle times. The scheduling of background jobs is crucial with respect to the performance degradation of foreground jobs and the utilization of idle times. In this dissertation, new background scheduling schemes are designed to determine when and for how long idle times can be used for serving background jobs without violating predefined performance targets of foreground jobs. Extensive trace-driven simulation results illustrate that the proposed schemes are effective and robust under a wide range of system conditions. Furthermore, if there is burstiness within idle times, then maintenance features like disk scrubbing and intra-disk data redundancy can be successfully scheduled as background activities during idle times.
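
    Neither SWAP nor ALoC is specified in this abstract, so the toy sketch below only illustrates the general idea they build on: approximating shortest-job-first by deferring requests predicted to contribute to burstiness, without knowing exact service times. The two-level classification, the FIFO tie-breaking, and all names are assumptions made for illustration, not the dissertation's policies.

```python
# Toy illustration of burstiness-aware ordering: requests predicted to be long
# (burst-contributing) are deferred behind predicted-short ones, which mimics
# shortest-job-first without exact service times. Not the SWAP/ALoC algorithms.
import heapq

class BurstAwareQueue:
    def __init__(self):
        self._heap = []
        self._seq = 0                       # preserves FIFO order within a class

    def submit(self, job_id, predicted_class):
        """predicted_class: 0 = likely short, 1 = likely long (burst-contributing)."""
        heapq.heappush(self._heap, (predicted_class, self._seq, job_id))
        self._seq += 1

    def next_job(self):
        return heapq.heappop(self._heap)[2] if self._heap else None

if __name__ == "__main__":
    q = BurstAwareQueue()
    q.submit("req-1", 1)                    # predicted long (part of a burst)
    q.submit("req-2", 0)                    # predicted short
    q.submit("req-3", 0)
    print([q.next_job() for _ in range(3)]) # short requests served first
```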

    Fault-tolerant satellite computing with modern semiconductors

    Miniaturized satellites enable a variety of space missions that were in the past infeasible, impractical, or uneconomical with traditionally designed, heavier spacecraft. CubeSats in particular can be manufactured and launched rapidly and at low cost from commercial components, even in academic environments. However, due to their low reliability and brief lifetime, they are usually not considered suitable for life- and safety-critical services, complex multi-phased solar-system-exploration missions, and missions of longer duration. Commercial electronics are key to satellite miniaturization but are also responsible for their low reliability: until 2019, there existed no reliable or fault-tolerant computer architectures suitable for very small satellites. To overcome this deficit, a novel on-board-computer architecture is described in this thesis. Robustness is assured not by resorting to radiation hardening but through software measures implemented within a robust-by-design multiprocessor system-on-chip. This fault-tolerant architecture is component-wise simple and can dynamically adapt to changing performance requirements throughout a mission. It can support graceful aging by exploiting FPGA reconfiguration and mixed-criticality. Experimentally, we achieve 1.94 W power consumption at 300 MHz with a Xilinx Kintex UltraScale+ proof of concept, which is well within the power-budget range of current 2U CubeSats. To our knowledge, this is the first COTS-based, reproducible on-board-computer architecture that can offer strong fault coverage even for small CubeSats.
    European Space Agency
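
    The thesis's software fault-tolerance measures are only summarized in this abstract; as a minimal, hedged illustration of the underlying principle of replicated execution with majority voting (a generic technique, not the proposed architecture), consider the sketch below. The `majority_vote` helper and the sample replica outputs are hypothetical.

```python
# Minimal sketch of majority voting over replicated computation results, the basic
# principle behind software-implemented fault tolerance on a multicore SoC.
# Generic illustration only; not the thesis's on-board-computer architecture.
from collections import Counter

def majority_vote(results):
    """Return the value agreed on by a strict majority of replicas, else None."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) // 2 else None

if __name__ == "__main__":
    # Three hypothetical replica outputs; one replica was corrupted by a soft error.
    print(hex(majority_vote([0x5A5A, 0x5A5A, 0x1234])))   # -> 0x5a5a
```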

    Fehlertolerante Mehrkernprozessoren für gemischt-kritische Echtzeitsysteme (Fault-Tolerant Multicore Processors for Mixed-Criticality Real-Time Systems)

    Current and future computing systems must be appropriately designed to cope with random hardware faults in order to provide a dependable service and correct functionality. Dependability has many facets to be addressed when designing a system, and that is especially challenging in mixed-critical real-time systems, where safety standards play an important role and where responding in time can be as important as responding correctly, or even responding at all. This thesis addresses the dependability of mixed-critical real-time systems, considering three important requirements: integrity, resilience, and real-time behavior. More specifically, it looks into the architectural and performance aspects of achieving dependability, concentrating on error detection and handling in hardware -- more specifically in the Network-on-Chip (NoC), the backbone of modern MPSoCs -- and on the performance of error handling and recovery in software. The thesis starts by examining the impact of random hardware faults on the NoC and on the system, with special focus on soft errors. It then addresses the uncovered weaknesses in the NoC by proposing a resilient NoC for mixed-critical real-time systems that is able to provide a highly reliable service with transparent protection for the applications. A formal communication time analysis is provided, with common ARQ protocols modeled for NoCs, including a novel ARQ-based protocol optimized for DMAs. After addressing the efficient use of ARQ-based protocols in NoCs, the thesis proposes the Advanced Integrity Q-service (AIQ), a low-overhead mechanism to achieve integrity and real-time guarantees for NoC transactions on an End-to-End (E2E) basis. Inspired by transactions in distributed systems, the mechanism differs from the previous approach in that it does not provide error recovery in hardware but delegates that task to software, making use of existing functionality in cross-layer fault-tolerance solutions. Finally, the thesis addresses error handling in software as seen in cross-layer approaches, focusing on the performance of replicated software execution on many-core platforms. Replicated software execution protects the system against random hardware faults; it relies on hardware-supported error detection and on error handling in software. Replica-aware co-scheduling is proposed to achieve high performance with replicated execution, which is not possible with standard real-time schedulers.
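
    The ARQ protocols analysed in the thesis, including the DMA-optimized variant, are not reproduced in this abstract. The sketch below only illustrates the generic stop-and-wait retransmission principle with a CRC integrity check; the `send_with_arq` function, the `channel` callable, and the retry limit are illustrative assumptions.

```python
# Hedged sketch of stop-and-wait ARQ with a CRC check: retransmit until the
# receiver sees an uncorrupted frame or the retry budget is exhausted.
# Generic illustration; not the NoC protocols proposed in the thesis.
import zlib

def send_with_arq(payload: bytes, channel, max_retries: int = 3) -> bool:
    """Return True once an uncorrupted copy of `payload` gets through `channel`."""
    frame = payload + zlib.crc32(payload).to_bytes(4, "big")   # append 32-bit CRC
    for _ in range(max_retries + 1):
        received = channel(frame)                              # channel may corrupt bits
        data, crc = received[:-4], received[-4:]
        if zlib.crc32(data).to_bytes(4, "big") == crc:         # receiver-side check
            return True                                        # implicit ACK
    return False                                               # retries exhausted

if __name__ == "__main__":
    lossless = lambda frame: frame                             # hypothetical ideal channel
    print(send_with_arq(b"noc-payload", lossless))             # -> True
```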
    • 

    corecore