2,636 research outputs found
New techniques for functional testing of microprocessor based systems
Electronic devices may be affected by failures, for example due to physical defects. These defects may be introduced during the manufacturing process, as well as during the normal operating life of the device due to aging. How to detect all these defects is not a trivial task, especially in complex systems such as processor cores. Nevertheless, safety-critical applications do not tolerate failures, this is the reason why testing such devices is needed so to guarantee a correct behavior at any time. Moreover, testing is a key parameter for assessing the quality of a manufactured product.
Consolidated testing techniques are based on special Design for Testability (DfT) features added in the original design to facilitate test effectiveness. Design, integration, and usage of the available DfT for testing purposes are fully supported by commercial EDA tools, hence approaches based on DfT are the standard solutions adopted by silicon vendors for testing their devices.
Tests exploiting the available DfT such as scan-chains manipulate the internal state of the system, differently to the normal functional mode, passing through unreachable configurations. Alternative solutions that do not violate such functional mode are defined as functional tests.
In microprocessor based systems, functional testing techniques include software-based self-test (SBST), i.e., a piece of software (referred to as test program) which is uploaded in the system available memory and executed, with the purpose of exciting a specific part of the system and observing the effects of possible defects affecting it. SBST has been widely-studies by the research community for years, but its adoption by the industry is quite recent.
My research activities have been mainly focused on the industrial perspective of SBST. The problem of providing an effective development flow and guidelines for integrating SBST in the available operating systems have been tackled and results have been provided on microprocessor based systems for the automotive domain. Remarkably, new algorithms have been also introduced with respect to state-of-the-art approaches, which can be systematically implemented to enrich SBST suites of test programs for modern microprocessor based systems. The proposed development flow and algorithms are being currently employed in real electronic control units for automotive products.
Moreover, a special hardware infrastructure purposely embedded in modern devices for interconnecting the numerous on-board instruments has been interest of my research as well. This solution is known as reconfigurable scan networks (RSNs) and its practical adoption is growing fast as new standards have been created. Test and diagnosis methodologies have been proposed targeting specific RSN features, aimed at checking whether the reconfigurability of such networks has not been corrupted by defects and, in this case, at identifying the defective elements of the network. The contribution of my work in this field has also been included in the first suite of public-domain benchmark networks
Observation mechanisms for in-field software-based self-test
When electronic systems are used in safety critical applications, as in the space,
avionic, automotive or biomedical areas, it is required to maintain a very low
probability of failures due to faults of any kind. Standards and regulations play
a significant role, forcing companies to devise and adopt solutions able to achieve
predefined targets in terms of dependability. Different techniques can be used to
reduce fault occurrence or to minimize the probability that those faults produce
critical failures (e.g., by introducing redundancy).
Unfortunately, most of these techniques have a severe impact on the cost of
the resulting product and, in some cases, the probability of failures is too large
anyway. Hence, a solution commonly used in several scenarios lies on periodically
performing a test able to detect the occurrence of any fault before it produces
a failure (in-field test). This solution is normally based on forcing the processor
inside the Device Under Test to execute a properly written test program, which is
able to activate possible faults and to make their effects visible in some observable
locations. This approach is also called Software-Based Self-Test, or SBST.
If compared with testing in an end of manufacturing scenario, in-field testing
has strong limitations in terms of access to the system inputs and outputs
because Design for Testability structures and testing equipment are usually not
available. As a consequence there are reduced possibilities to activate the faults
and to observe their effects.
This reduced observability particularly affects the ability to detect performance
faults, i.e. faults that modify the timing but not the final value of computations.
This kind of faults are hard to detect by only observing the final content of
predefined memory locations, that is the usual test result observation method used
in-field.
Initially, the present work was focused on fault tolerance techniques against
transient faults induced by ionizing radiation, the so called Single Event Upsets
(SEUs). The main contribution of this early stage of the thesis lies in the experimental
validation of the feasibility of achieving a safe system by using an
architecture that combines task-level redundancy with already available IP cores,
thus minimizing the development time. Task execution is replicated and Memory
Protection is used to guarantee that any SEU may affect one and only one
of the replicas. A proof of concept implementation was developed and validated
using fault injection. Results outline the effectiveness of the architecture, and the
overhead analysis shows that the proposed architecture is effective in reducing the
resource occupation with respect to N-modular redundancy, at an affordable cost
in terms of application execution time.
The main part of the thesis is focused on in-field software-based self-test of
permanent faults. A set of observation methods exploiting existing or ad-hoc
hardware is proposed, aimed at obtaining a better coverage, in particular of performance
faults. An extensive quantitative evaluation of the proposed methods
is presented, including a comparison with the observation methods traditionally
used in end of manufacturing and in-field testing.
Results show that the proposed methods are a good complement to the traditionally
used final memory content observation. Moreover, they show that an
adequate combination of these complementary methods allows for achieving nearly
the same fault coverage achieved when continuously observing all the processor
outputs, which is an observation method commonly used for production test but
usually not available in-field.
A very interesting by-product of what is described above is a detailed description
of how to compute the fault coverage achieved by functional in-field tests
using a conventional fault simulator, a tool that is usually applied in an end of
manufacturing testing scenario.
Finally, another relevant result in the testing area is a method to detect permanent
faults inside the cache coherence logic integrated in each cache controller
of a multi-core system, based on the concurrent execution of a test program by
the different cores in a coordinated manner. By construction, the method achieves
full fault coverage of the static faults in the addressed logic.Cuando se utilizan sistemas electrónicos en aplicaciones críticas como en las áreas biomédica, aeroespacial o automotriz, se requiere mantener una muy baja probabilidad de malfuncionamientos debidos a cualquier tipo de fallas. Los estándares y normas juegan un papel importante, forzando a los desarrolladores a diseñar y adoptar soluciones que sean capaces de alcanzar objetivos predefinidos en cuanto a seguridad y confiabilidad. Pueden utilizarse diferentes técnicas para reducir la ocurrencia de fallas o para minimizar la probabilidad de que esas fallas produzcan mal funcionamientos críticos, por ejemplo a través de la incorporación de redundancia. Lamentablemente, muchas de esas técnicas afectan en gran medida el costo de los productos y, en algunos casos, la probabilidad de malfuncionamiento sigue siendo demasiado alta. En consecuencia, una solución usada a menudo en varios escenarios consiste en realizar periódicamente un test que sea capaz de detectar la ocurrencia de una falla antes de que esta produzca un mal funcionamiento (test en campo). En general, esta solución se basa en forzar a un procesador existente dentro del dispositivo bajo prueba a ejecutar un programa de test que sea capaz de activar las posibles fallas y de hacer que sus efectos sean visibles en puntos observables. A esta metodología también se la llama auto-test basado en software, o en inglés Software-Based Self-Test (SBST). Si se lo compara con un escenario de test de fin de fabricación, el test en campo tiene fuertes limitaciones en términos de posibilidad de acceso a las entradas y salidas del sistema, porque usualmente no se dispone de equipamiento de test ni de la infraestructura de Design for Testability. En consecuencia se tiene menos posibilidades de activar las fallas y de observar sus efectos. Esta observabilidad reducida afecta particularmente la habilidad para detectar fallas de performance, es decir fallas que modifican la temporización pero no el resultado final de los cálculos. Este tipo de fallas es difícil de detectar por la sola observación del contenido final de lugares de memoria, que es el método usual que se utiliza para observar los resultados de un test en campo. Inicialmente, el presente trabajo estuvo enfocado en técnicas para tolerar fallas transitorias inducidas por radiación ionizante, llamadas en inglés Single Event Upsets (SEUs). La principal contribución de esa etapa inicial de la tesis reside en la validación experimental de la viabilidad de obtener un sistema seguro, utilizando una arquitectura que combina redundancia a nivel de tareas con el uso de módulos hardware (IP cores) ya disponibles, que minimiza en consecuencia el tiempo de desarrollo. Se replica la ejecución de las tareas y se utiliza protección de memoria para garantizar que un SEU pueda afectar a lo sumo a una sola de las réplicas. Se desarrolló una implementación para prueba de concepto que fue validada mediante inyección de fallas. Los resultados muestran la efectividad de la arquitectura, y el análisis de los recursos utilizados muestra que la arquitectura propuesta es efectiva en reducir la ocupación con respecto a la redundancia modular con N réplicas, a un costo accesible en términos de tiempo de ejecución. La parte principal de esta tesis se enfoca en el área de auto-test en campo basado en software para la detección de fallas permanentes. Se propone un conjunto de métodos de observación utilizando hardware existente o ad-hoc, con el fin de obtener una mejor cobertura, en particular de las fallas de performance. Se presenta una extensa evaluación cuantitativa de los métodos propuestos, que incluye una comparación con los métodos tradicionalmente utilizados en tests de fin de fabricación y en campo. Los resultados muestran que los métodos propuestos son un buen complemento del método tradicionalmente usado que consiste en observar el valor final del contenido de memoria. Además muestran que una adecuada combinación de estos métodos complementarios permite alcanzar casi los mismos valores de cobertura de fallas que se obtienen mediante la observación continua de todas las salidas del procesador, método comúnmente usado en tests de fin de fabricación, pero que usualmente no está disponible en campo. Un subproducto muy interesante de lo arriba expuesto es la descripción detallada del procedimiento para calcular la cobertura de fallas lograda mediante tests funcionales en campo por medio de un simulador de fallas convencional, una herramienta que usualmente se aplica en escenarios de test de fin de fabricación. Finalmente, otro resultado relevante en el área de test es un método para detectar fallas permanentes dentro de la lógica de coherencia de cache que está integrada en el controlador de cache de cada procesador en un sistema multi procesador. El método está basado en la ejecución de un programa de test en forma coordinada por parte de los diferentes procesadores. Por construcción, el método cubre completamente las fallas de la lógica mencionad
Scan Test Coverage Improvement Via Automatic Test Pattern Generation (Atpg) Tool Configuration
The scan test coverage improvement by using automatic test pattern generation (ATPG) tool configuration was investigated. Improving the test coverage is essential in detecting manufacturing defects in semiconductor industry so that high quality products can be supplied to consumers. The ATPG tool used was Mentor Graphics Tessent TestKompress (version 2014.1). The study was done by setting up a few experiments of utilizing and modifying ATPG commands and switches, observing the test coverage improvement from the statistical reports provided during pattern generation process and providing relatable discussions. By modifying the ATPG commands, it can be expected to have some improvement in the test coverage. The scan test patterns generated were stuck-at test patterns. Based on the experiments done, comparison was made on the different coverage readings and the most optimized method and flow of ATPG were determined. The most optimized flow gave an improvement of 0.91% in test coverage which is acceptable since this method does not involve a change in design. The test patterns generated were converted and tested using automatic test equipment (ATE) to observe its performance on real silicon. The test coverage improvement using ATPG tool instead of the design-based method is important as a faster workaround for back-end engineers to provide high quality test contents in such a short product development duration
Automated Debugging Methodology for FPGA-based Systems
Electronic devices make up a vital part of our lives. These are seen from mobiles, laptops, computers, home automation, etc. to name a few. The modern designs constitute billions of transistors. However, with this evolution, ensuring that the devices fulfill the designer’s expectation under variable conditions has also become a great challenge. This requires a lot of design time and effort. Whenever an error is encountered, the process is re-started. Hence, it is desired to minimize the number of spins required to achieve an error-free product, as each spin results in loss of time and effort.
Software-based simulation systems present the main technique to ensure the verification of the design before fabrication. However, few design errors (bugs) are likely to escape the simulation process. Such bugs subsequently appear during the post-silicon phase. Finding such bugs is time-consuming due to inherent invisibility of the hardware. Instead of software simulation of the design in the pre-silicon phase, post-silicon techniques permit the designers to verify the functionality through the physical implementations of the design. The main benefit of the methodology is that the implemented design in the post-silicon phase runs many order-of-magnitude faster than its counterpart in pre-silicon. This allows the designers to validate their design more exhaustively.
This thesis presents five main contributions to enable a fast and automated debugging solution for reconfigurable hardware. During the research work, we used an obstacle avoidance system for robotic vehicles as a use case to illustrate how to apply the proposed debugging solution in practical environments.
The first contribution presents a debugging system capable of providing a lossless trace of debugging data which permits a cycle-accurate replay. This methodology ensures capturing permanent as well as intermittent errors in the implemented design. The contribution also describes a solution to enhance hardware observability. It is proposed to utilize processor-configurable concentration networks, employ debug data compression to transmit the data more efficiently, and partially reconfiguring the debugging system at run-time to save the time required for design re-compilation as well as preserve the timing closure.
The second contribution presents a solution for communication-centric designs. Furthermore, solutions for designs with multi-clock domains are also discussed.
The third contribution presents a priority-based signal selection methodology to identify the signals which can be more helpful during the debugging process. A connectivity generation tool is also presented which can map the identified signals to the debugging system.
The fourth contribution presents an automated error detection solution which can help in capturing the permanent as well as intermittent errors without continuous monitoring of debugging data. The proposed solution works for designs even in the absence of golden reference.
The fifth contribution proposes to use artificial intelligence for post-silicon debugging. We presented a novel idea of using a recurrent neural network for debugging when a golden reference is present for training the network. Furthermore, the idea was also extended to designs where golden reference is not present
Decompose and Conquer: Addressing Evasive Errors in Systems on Chip
Modern computer chips comprise many components, including microprocessor cores, memory modules, on-chip networks, and accelerators. Such system-on-chip (SoC) designs are deployed in a variety of computing devices: from internet-of-things, to smartphones, to personal computers, to data centers. In this dissertation, we discuss evasive errors in SoC designs and how these errors can be addressed efficiently. In particular, we focus on two types of errors: design bugs and permanent faults. Design bugs originate from the limited amount of time allowed for design verification and validation. Thus, they are often found in functional features that are rarely activated. Complete functional verification, which can eliminate design bugs, is extremely time-consuming, thus impractical in modern complex SoC designs. Permanent faults are caused by failures of fragile transistors in nano-scale semiconductor manufacturing processes. Indeed, weak transistors may wear out unexpectedly within the lifespan of the design. Hardware structures that reduce the occurrence of permanent faults incur significant silicon area or performance overheads, thus they are infeasible for most cost-sensitive SoC designs.
To tackle and overcome these evasive errors efficiently, we propose to leverage the principle of decomposition to lower the complexity of the software analysis or the hardware structures involved. To this end, we present several decomposition techniques, specific to major SoC components. We first focus on microprocessor cores, by presenting a lightweight bug-masking analysis that decomposes a program into individual instructions to identify if a design bug would be masked by the program's execution. We then move to memory subsystems: there, we offer an efficient memory consistency testing framework to detect buggy memory-ordering behaviors, which decomposes the memory-ordering graph into small components based on incremental differences. We also propose a microarchitectural patching solution for memory subsystem bugs, which augments each core node with a small distributed programmable logic, instead of including a global patching module. In the context of on-chip networks, we propose two routing reconfiguration algorithms that bypass faulty network resources. The first computes short-term routes in a distributed fashion, localized to the fault region. The second decomposes application-aware routing computation into simple routing rules so to quickly find deadlock-free, application-optimized routes in a fault-ridden network. Finally, we consider general accelerator modules in SoC designs. When a system includes many accelerators, there are a variety of interactions among them that must be verified to catch buggy interactions. To this end, we decompose such inter-module communication into basic interaction elements, which can be reassembled into new, interesting tests.
Overall, we show that the decomposition of complex software algorithms and hardware structures can significantly reduce overheads: up to three orders of magnitude in the bug-masking analysis and the application-aware routing, approximately 50 times in the routing reconfiguration latency, and 5 times on average in the memory-ordering graph checking. These overhead reductions come with losses in error coverage: 23% undetected bug-masking incidents, 39% non-patchable memory bugs, and occasionally we overlook rare patterns of multiple faults. In this dissertation, we discuss the ideas and their trade-offs, and present future research directions.PHDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttps://deepblue.lib.umich.edu/bitstream/2027.42/147637/1/doowon_1.pd
On the use of embedded debug features for permanent and transient fault resilience in microprocessors
Microprocessor-based systems are employed in an increasing number of applications where dependability is a major constraint. For this reason detecting faults arising during normal operation while introducing the least possible penalties is a main concern. Different forms of redundancy have been employed to ensure error-free behavior, while error detection mechanisms can be employed where some detection latency is tolerated. However, the high complexity and the low observability of microprocessors internal resources make the identification of adequate on-line error detection strategies a very challenging task, which can be tackled at circuit or system level. Concerning system-level strategies, a common limitation is in the mechanism used to monitor program execution and then detect errors as soon as possible, so as to reduce their impact on the application. In this work, an on-line error detection approach based on the reuse of available debugging infrastructures is proposed. The approach can be applied to different system architectures profiting from the debug trace port available in most of current microprocessors to observe possible misbehaviors. Two microprocessors have been used to study the applicability of the solution. LEON3 and ARM7TDMI. Results show that the presented fault detection technique enhances observability and thus error detection abilities in microprocessor-based systems without requiring modifications on the core architecture
Test Generation Based on CLP
Functional ATPGs based on simulation are fast,
but generally, they are unable to cover corner cases, and
they cannot prove untestability. On the contrary, functional
ATPGs exploiting formal methods, being exhaustive,
cover corner cases, but they tend to suffer of the state
explosion problem when adopted for verifying large designs.
In this context, we have defined a functional ATPG
that relies on the joint use of pseudo-deterministic simulation
and Constraint Logic Programming (CLP), to
generate high-quality test sequences for solving complex
problems. Thus, the advantages of both simulation-based
and static-based verification techniques are preserved, while
their respective drawbacks are limited. In particular, CLP,
a form of constraint programming in which logic programming
is extended to include concepts from constraint satisfaction,
is well-suited to be jointly used with simulation. In
fact, information learned during design exploration by simulation
can be effectively exploited for guiding the search of
a CLP solver towards DUV areas not covered yet. The test
generation procedure relies on constraint logic programming
(CLP) techniques in different phases of the test generation
procedure.
The ATPG framework is composed of three functional
ATPG engines working on three different models of the
same DUV: the hardware description language (HDL)
model of the DUV, a set of concurrent EFSMs extracted
from the HDL description, and a set of logic constraints
modeling the EFSMs. The EFSM paradigm has been selected
since it allows a compact representation of the DUV
state space that limits the state explosion problem typical
of more traditional FSMs. The first engine is randombased,
the second is transition-oriented, while the last is
fault-oriented.
The test generation is guided by means of transition coverage and fault coverage. In particular, 100% transition
coverage is desired as a necessary condition for fault
detection, while the bit coverage functional fault model
is used to evaluate the effectiveness of the generated test
patterns by measuring the related fault coverage.
A random engine is first used to explore the DUV state
space by performing a simulation-based random walk. This
allows us to quickly fire easy-to-traverse (ETT) transitions
and, consequently, to quickly cover easy-to-detect (ETD)
faults. However, the majority of hard-to-traverse (HTT)
transitions remain, generally, uncovered.
Thus, a transition-oriented engine is applied to
cover the remaining HTT transitions by exploiting a
learning/backjumping-based strategy.
The ATPG works on a special kind of EFSM, called
SSEFSM, whose transitions present the most uniformly
distributed probability of being activated and can be effectively
integrated to CLP, since it allows the ATPG to invoke
the constraint solver when moving between EFSM states.
A constraint logic programming-based (CLP) strategy is
adopted to deterministically generate test vectors that satisfy
the guard of the EFSM transitions selected to be traversed. Given a transition of the SSEFSM, the solver
is required to generate opportune values for PIs that enable
the SSEFSM to move across such a transition.
Moreover, backjumping, also known as nonchronological
backtracking, is a special kind of backtracking
strategy which rollbacks from an unsuccessful
situation directly to the cause of the failure. Thus,
the transition-oriented engine deterministically backjumps
to the source of failure when a transition, whose guard
depends on previously set registers, cannot be traversed.
Next it modifies the EFSM configuration to satisfy the
condition on registers and successfully comes back to the
target state to activate the transition.
The transition-oriented engine generally allows us to
achieve 100% transition coverage. However, 100% transition
coverage does not guarantee to explore all DUV corner
cases, thus some hard-to-detect (HTD) faults can escape
detection preventing the achievement of 100% fault coverage.
Therefore, the CLP-based fault-oriented engine is finally
applied to focus on the remaining HTD faults.
The CLP solver is used to deterministically search for
sequences that propagate the HTD faults observed, but not
detected, by the random and the transition-oriented engine.
The fault-oriented engine needs a CLP-based representation
of the DUV, and some searching functions to generate
test sequences. The CLP-based representation is automatically
derived from the S2EFSM models according to the
defined rules, which follow the syntax of the ECLiPSe CLP
solver. This is not a trivial task, since modeling the
evolution in time of an EFSM by using logic constraints
is really different with respect to model the same behavior
by means of a traditional HW description language. At
first, the concept of time steps is introduced, required to
model the SSEFSM evolution through the time via CLP.
Then, this study deals with modeling of logical variables
and constraints to represent enabling functions and update
functions of the SSEFSM.
Formal tools that exhaustively search for a solution frequently
run out of resources when the state space to be analyzed
is too large. The same happens for the CLP solver,
when it is asked to find a propagation sequence on large sequential
designs. Therefore we have defined a set of strategies
that allow to prune the search space and to manage the
complexity problem for the solver
Reliability in Power Electronics and Power Systems
L'abstract è presente nell'allegato / the abstract is in the attachmen
- …