3 research outputs found

    Increasing autonomous fault-tolerant FPGA-based systems' lifetime

    Abstract: In this paper we propose an automated design flow for implementing autonomous fault-tolerant systems on SRAM-based FPGA platforms that can cope with both transient and permanent faults. The goal of the proposed methodology is to increase the system's lifetime by designing it to detect and mitigate the effects of soft errors, as well as of permanent, non-recoverable ones, by exploiting dynamic reconfiguration. The application of the hardening design flow to a real case study is reported to validate the methodology.

    Cross layer reliability estimation for digital systems

    Forthcoming manufacturing technologies hold the promise of increasing the performance and functionality of multifunctional computing systems thanks to a remarkable growth in device integration density. Despite the benefits introduced by these technology improvements, reliability is becoming a key challenge for the semiconductor industry. With transistor sizes reaching atomic dimensions, vulnerability to unavoidable fluctuations in the manufacturing process and to environmental stress rises dramatically. Failing to meet a reliability requirement may add excessive redesign cost and may have severe consequences for the success of a product. Worst-case design with large margins to guarantee reliable operation has been employed for a long time. However, it is reaching a limit that makes it economically unsustainable because of its performance, area, and power cost. One of the open challenges for future technologies is building "dependable" systems on top of unreliable components, which will degrade and even fail during the normal lifetime of the chip. Conventional design techniques are highly inefficient. They expend a significant amount of energy to tolerate device unpredictability by adding safety margins to a circuit's operating voltage, clock frequency, or charge stored per bit. Unfortunately, the additional costs introduced to compensate for unreliability are rapidly becoming unacceptable in today's environment, where power consumption is often the limiting factor for integrated circuit performance and energy efficiency is a top concern. Attention should be paid to tailoring reliability-improvement techniques to a system's requirements, yielding cost-effective solutions that favor the success of the product on the market. Cross-layer reliability is one of the most promising approaches to achieve this goal. 
Cross-layer reliability techniques take into account the interactions between the layers composing a complex system (i.e., the technology, hardware, and software layers) to implement efficient cross-layer fault mitigation mechanisms. Fault tolerance mechanisms are implemented at different layers, from the technology layer up to the software layer, to optimize the system by exploiting the inherent capability of each layer to mask lower-level faults. For this purpose, cross-layer reliability design techniques need to be complemented with cross-layer reliability evaluation tools able to precisely assess the reliability level of a selected design early in the design cycle. Accurate and early reliability estimates would enable the exploration of the system design space and the optimization of multiple constraints such as performance, power consumption, cost, and reliability. This Ph.D. thesis is devoted to the development of new methodologies and tools to evaluate and optimize the reliability of complex digital systems during the early design stages. More specifically, techniques addressing hardware accelerators (i.e., FPGAs and GPUs), microprocessors, and full systems are discussed. All developed methodologies are presented in conjunction with their application to real-world use cases belonging to different computational domains.

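    Instrumentação de FPGAs SRAM para recuperação e prevenção de faltas permanentes visando utilização em aplicações espaciais [Instrumentation of SRAM FPGAs for recovery from and prevention of permanent faults, targeting use in space applications]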

    Thesis (doctorate) - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Engenharia Elétrica, Florianópolis, 2016. Abstract: Field Programmable Gate Array (FPGA) devices, although built to be robust, are not everlasting, nor are they completely immune to faults, whether transient or permanent. Assuming that post-manufacturing test detects all faults due to the production process, under normal conditions at sea level, even with recent nanometric technologies, the occurrence of permanent faults in an FPGA during its expected life cycle is considered to be near zero. However, in hostile conditions, such as in space, where radiation levels are high (or in terrestrial environments such as nuclear power plants, nuclear physics research centers, particle accelerators, etc.), the occurrence of permanent faults in an FPGA cannot be neglected. In addition to radiation, an FPGA, being an electronic device, is also subject to aging. Negative Bias Temperature Instability (NBTI) and Positive Bias Temperature Instability (PBTI) are two sources of aging that, although they do not directly destroy the functionality of the FPGA resources, increase their propagation delays. This aging can therefore also lead to permanent faults from a certain point in the life cycle of a system implemented on an FPGA. The solution in such cases is to replace the FPGA or even the board that carries it. Although replacing the FPGA can be considered a simple task in many situations, in many others, such as aerospace environments where access is difficult and/or dangerous for whoever has to perform the replacement, this operation may be problematic or even impossible. In this context, this thesis proposes the development of solutions that allow a system implemented on an FPGA to recover autonomously from permanent faults (by avoiding the device resources affected by those faults) while, at the same time, slowing the rate of device aging due to NBTI (and possibly also PBTI). To that end, this work focuses on two main objectives: (1) the development of a hardware mechanism, based on FPGA Partial Reconfiguration, that supports the implementation of strategies for recovery from and prevention of permanent faults (minimizing the progression of aging caused by NBTI); (2) the planning and implementation of ways to recover from or prevent permanent faults (delay faults) using the developed mechanism. The presented mechanism relies on a new flow for generating partial bitstreams that can be relocated to multiple reconfigurable partitions, a flexibility that goes beyond what the manufacturer's dynamic reconfiguration tools provide. Of the implemented strategies, one allows a system implemented on an FPGA to recover from a permanent fault without having to exclude the entire partition. To mitigate device aging, another strategy cyclically changes the partitions in which the bitstreams are allocated, so that as many of the resources of those partitions as possible do not remain configured in the same way for a long period of time. A new performance sensor for FPGAs is also proposed, which may also make it possible to measure aging in each partition. With this sensor, a further strategy allocates modules (contained in the generated bitstreams) so as to even out aging and power dissipation across the partitions, as a function of the accumulated aging, the current temperature, and the power consumed by each module.