
    Engenharia de Resiliência (Resilience Engineering)

    This thesis presents a study of a new discipline called Chaos Engineering and its approaches, which help to verify the correct behavior of a system and to discover new information about it through chaos experiments such as shutting down a machine or simulating latency in the network connections between applications. The case study was carried out at the company Mindera, to verify and improve the resilience to failures of a client's project. Initially, the chaos maturity of the project within the Chaos Maturity Model was in the first levels, and it was necessary to increase its sophistication and adoption by conducting experiments to test and improve its resilience. The cloud environment that the project uses and its architecture are explained to contextualize the components that the experiments use and test. Different alternatives for testing disaster recovery plans are compared, as well as the differences between using a test environment and the production environment. The value of carrying out experiments for the client project is described, as well as the identification of its value proposition. Finally, the different chaos tools are analyzed using the TOPSIS method. The four experiments performed test the system's resilience to the failure of a database's primary node, the impact of latency in the network connections between different components, the system's reaction to the exhaustion of a machine's physical resources and, finally, the overall resilience of the system in the face of a server failure. After execution, the experiments were evaluated by company experts, and the conclusions about the work developed are presented. The experiments carried out were classified as important for the project. A problem was found in the latency injection experiment; after changing the application's code, the system's reaction was positive and the number of responses increased.
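
    The thesis ranks candidate chaos tools with the TOPSIS method. As a rough illustration of that procedure, the sketch below applies the standard TOPSIS steps to a purely hypothetical decision matrix: the tool names, criteria, scores and weights are illustrative assumptions, not data from the thesis.

```python
import numpy as np

def topsis(matrix, weights, benefit):
    """Rank alternatives with TOPSIS (higher closeness = better)."""
    m = np.asarray(matrix, dtype=float)
    # Vector-normalize each criterion column, then apply the criterion weights.
    v = (m / np.linalg.norm(m, axis=0)) * weights
    # Ideal / anti-ideal points depend on whether a criterion is a benefit or a cost.
    ideal = np.where(benefit, v.max(axis=0), v.min(axis=0))
    anti = np.where(benefit, v.min(axis=0), v.max(axis=0))
    # Closeness coefficient: distance to the anti-ideal over the total distance.
    d_plus = np.linalg.norm(v - ideal, axis=1)
    d_minus = np.linalg.norm(v - anti, axis=1)
    return d_minus / (d_plus + d_minus)

# Hypothetical scoring of three chaos tools on three criteria:
# community support and feature coverage (benefits), setup effort (cost).
tools = ["Tool A", "Tool B", "Tool C"]
scores = topsis(
    matrix=[[7, 6, 4], [8, 9, 6], [6, 7, 3]],
    weights=np.array([0.3, 0.4, 0.3]),
    benefit=np.array([True, True, False]),
)
for tool, score in sorted(zip(tools, scores), key=lambda t: -t[1]):
    print(f"{tool}: closeness {score:.3f}")
```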

    Methodologies synthesis

    This deliverable deals with the modelling and analysis of interdependencies between critical infrastructures, focussing attention on two interdependent infrastructures studied in the context of CRUTIAL: the electric power infrastructure and the information infrastructures supporting management, control and maintenance functionality. The main objectives are: 1) to investigate the main challenges to be addressed for the analysis and modelling of interdependencies, 2) to review the modelling methodologies and tools that can be used to address these challenges and support the evaluation of the impact of interdependencies on the dependability and resilience of the service delivered to the users, and 3) to present the preliminary directions investigated so far by the CRUTIAL consortium for describing and modelling interdependencies.

    Cross layer reliability estimation for digital systems

    Forthcoming manufacturing technologies hold the promise to increase the performance and functionality of multifunctional computing systems thanks to a remarkable growth of the device integration density. Despite the benefits introduced by these technology improvements, reliability is becoming a key challenge for the semiconductor industry. With transistor sizes reaching atomic dimensions, vulnerability to unavoidable fluctuations in the manufacturing process and to environmental stress rises dramatically. Failing to meet a reliability requirement may add excessive re-design cost and may have severe consequences on the success of a product. Worst-case design with large margins to guarantee reliable operation has been employed for a long time. However, it is reaching a limit that makes it economically unsustainable due to its performance, area, and power cost. One of the open challenges for future technologies is building "dependable" systems on top of unreliable components, which will degrade and even fail during the normal lifetime of the chip. Conventional design techniques are highly inefficient: they expend a significant amount of energy to tolerate device unpredictability by adding safety margins to a circuit's operating voltage, clock frequency or charge stored per bit. Unfortunately, the additional costs introduced to compensate for unreliability are rapidly becoming unacceptable in today's environment, where power consumption is often the limiting factor for integrated circuit performance and energy efficiency is a top concern. Attention should be paid to tailoring techniques that improve the reliability of a system on the basis of its requirements, ending up with cost-effective solutions favoring the success of the product on the market. Cross-layer reliability is one of the most promising approaches to achieve this goal. Cross-layer reliability techniques take into account the interactions between the layers composing a complex system (i.e., technology, hardware and software layers) to implement efficient cross-layer fault mitigation mechanisms. Fault tolerance mechanisms are implemented at different layers, starting from the technology up to the software layer, exploiting the inner capability of each layer to mask lower-level faults. For this purpose, cross-layer reliability design techniques need to be complemented with cross-layer reliability evaluation tools, able to precisely assess the reliability level of a selected design early in the design cycle. Accurate and early reliability estimates would enable the exploration of the system design space and the optimization of multiple constraints such as performance, power consumption, cost and reliability. This Ph.D. thesis is devoted to the development of new methodologies and tools to evaluate and optimize the reliability of complex digital systems during the early design stages. More specifically, techniques addressing hardware accelerators (i.e., FPGAs and GPUs), microprocessors and full systems are discussed. All developed methodologies are presented in conjunction with their application to real-world use cases belonging to different computational domains.
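
    The cross-layer idea above, each layer masking a fraction of the faults that reach it from below, lends itself to a back-of-the-envelope estimate. A minimal sketch, assuming independent per-layer masking probabilities and an exponential failure model; every rate and probability below is a hypothetical placeholder, not a result from the thesis.

```python
import math

# Technology-level raw fault rate (faults per hour) -- assumed value.
raw_fault_rate = 1e-4

# Fraction of incoming faults masked at each layer -- assumed values.
masking = {
    "circuit":      0.60,
    "microarch":    0.50,
    "architecture": 0.40,
    "software":     0.30,
}

# Faults that survive every layer become observable system failures.
effective_rate = raw_fault_rate
for layer, p_mask in masking.items():
    effective_rate *= (1.0 - p_mask)
    print(f"after {layer:12s} layer: {effective_rate:.3e} failures/hour")

# One-year reliability under an exponential (constant-rate) failure model.
mission_hours = 24 * 365
reliability = math.exp(-effective_rate * mission_hours)
print(f"estimated one-year reliability: {reliability:.6f}")
```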

    Data Systems Fault Coping for Real-time Big Data Analytics Required Architectural Crucibles

    This paper analyzes the properties and characteristics of unknown and unexpected faults introduced into information systems while processing Big Data in real time. The authors hypothesize that there are new faults and new requirements for fault handling, and propose an analytic model and architectural framework to assess and manage these faults, to mitigate the risks of correlating or integrating otherwise uncorrelated Big Data, and to ensure the source pedigree, quality, set integrity, freshness, and validity of the data being consumed. We argue that new architectures, methods, and tools for handling and analyzing Big Data systems functioning in real time must be designed to address and mitigate the faults arising from real-time streaming processes, while ensuring that variables such as synchronization, redundancy, and latency are accounted for. This paper concludes that, with improved designs, real-time Big Data systems may continuously deliver the value and benefits of streaming Big Data.
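
    As a rough illustration of the per-record safeguards the paper argues for (source pedigree, freshness, validity), the sketch below gates each streamed record before it enters the real-time analysis. The field names, staleness budget and trusted-source list are illustrative assumptions, not the paper's framework.

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    source: str        # pedigree: which producer emitted the record
    timestamp: float   # epoch seconds at which the record was produced
    payload: dict      # the actual event or measurement

MAX_STALENESS_S = 5.0                              # freshness budget (assumed)
TRUSTED_SOURCES = {"sensor-net", "edge-gateway"}   # pedigree whitelist (assumed)

def admit(record: Record, now: Optional[float] = None) -> bool:
    """Return True if the record may enter the real-time pipeline."""
    now = time.time() if now is None else now
    fresh = (now - record.timestamp) <= MAX_STALENESS_S
    trusted = record.source in TRUSTED_SOURCES
    valid = isinstance(record.payload, dict) and "value" in record.payload
    return fresh and trusted and valid

# Stale, untrusted or malformed records are dropped instead of silently
# skewing the real-time results downstream.
rec = Record(source="sensor-net", timestamp=time.time() - 2.0, payload={"value": 42})
print("admitted" if admit(rec) else "rejected")
```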

    Experimental analysis of computer system dependability

    This paper reviews an area which has evolved over the past 15 years: experimental analysis of computer system dependability. Methodologies and advances are discussed for three basic approaches used in the area: simulated fault injection, physical fault injection, and measurement-based analysis. The three approaches are suited, respectively, to dependability evaluation in the three phases of a system's life: the design phase, the prototype phase, and the operational phase. Before the discussion of these phases, several statistical techniques used in the area are introduced. For each phase, a classification of research methods or study topics is outlined, followed by a discussion of these methods or topics as well as representative studies. The statistical techniques introduced include the estimation of parameters and confidence intervals, probability distribution characterization, and several multivariate analysis methods. Importance sampling, a statistical technique used to accelerate Monte Carlo simulation, is also introduced. The discussion of simulated fault injection covers electrical-level, logic-level, and function-level fault injection methods as well as representative simulation environments such as FOCUS and DEPEND. The discussion of physical fault injection covers hardware, software, and radiation fault injection methods as well as several software and hybrid tools including FIAT, FERRARI, HYBRID, and FINE. The discussion of measurement-based analysis covers measurement and data processing techniques, basic error characterization, dependency analysis, Markov reward modeling, software dependability, and fault diagnosis. The discussion involves several important issues studied in the area, including fault models, fast simulation techniques, workload/failure dependency, correlated failures, and software fault tolerance.
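
    Two of the statistical techniques introduced in the survey, confidence-interval estimation and importance sampling to accelerate Monte Carlo simulation, can be shown in a few lines. A minimal sketch, assuming an exponential component-lifetime model; the failure rate, mission time and biasing parameter are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
lam, mission = 1e-3, 10.0             # failure rate (per hour) and mission time -- assumed
p_true = 1 - np.exp(-lam * mission)   # probability of failing before the mission ends
n = 10_000

# Naive Monte Carlo: sample lifetimes, count failures, normal-approximation 95% CI.
lifetimes = rng.exponential(1 / lam, size=n)
hits = (lifetimes <= mission).astype(float)
ci = 1.96 * hits.std(ddof=1) / np.sqrt(n)
print(f"naive MC:    {hits.mean():.5f} +/- {ci:.5f}   (true {p_true:.5f})")

# Importance sampling: draw from a biased exponential with a much larger rate so
# failures are frequent, then re-weight each sample by the likelihood ratio.
lam_bias = 0.2
samples = rng.exponential(1 / lam_bias, size=n)
weights = (lam / lam_bias) * np.exp(-(lam - lam_bias) * samples)
estimates = np.where(samples <= mission, weights, 0.0)
ci_is = 1.96 * estimates.std(ddof=1) / np.sqrt(n)
print(f"importance:  {estimates.mean():.5f} +/- {ci_is:.5f}")
```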