599 research outputs found

    Fault and Defect Tolerant Computer Architectures: Reliable Computing With Unreliable Devices

    Get PDF
    This research addresses design of a reliable computer from unreliable device technologies. A system architecture is developed for a fault and defect tolerant (FDT) computer. Trade-offs between different techniques are studied and yield and hardware cost models are developed. Fault and defect tolerant designs are created for the processor and the cache memory. Simulation results for the content-addressable memory (CAM)-based cache show 90% yield with device failure probabilities of 3 x 10(-6), three orders of magnitude better than non fault tolerant caches of the same size. The entire processor achieves 70% yield with device failure probabilities exceeding 10(-6). The required hardware redundancy is approximately 15 times that of a non-fault tolerant design. While larger than current FT designs, this architecture allows the use of devices much more likely to fail than silicon CMOS. As part of model development, an improved model is derived for NAND Multiplexing. The model is the first accurate model for small and medium amounts of redundancy. Previous models are extended to account for dependence between the inputs and produce more accurate results

    Low-overhead fault-tolerant logic for field-programmable gate arrays

    Get PDF
    While allowing for the fabrication of increasingly complex and efficient circuitry, transistor shrinkage and count-per-device expansion have major downsides: chiefly increased variation, degradation and fault susceptibility. For this reason, design-time consideration of faults will have to be given to increasing numbers of electronic systems in the future to ensure yields, reliabilities and lifetimes remain acceptably high. Many mathematical operators commonly accelerated in hardware are suited to modification resulting in datapath error detection and correction capabilities with far lower area, performance and/or power consumption overheads than those incurred through the utilisation of more established, general-purpose fault tolerance methods such as modular redundancy. Field-programmable gate arrays are uniquely placed to allow further area savings to be made thanks to their dynamic reconfigurability. The majority of the technical work presented within this thesis is based upon a benchmark hardware accelerator---a matrix multiplier---that underwent several evolutions in order to detect and correct faults manifesting along its datapath at runtime. In the first instance, fault detectability in excess of 99% was achieved in return for 7.87% additional area and 45.5% extra latency. In the second, the ability to correct errors caused by those faults was added at the cost of 4.20% more area, while 50.7% of this---and 46.2% of the previously incurred latency overhead---was removed through the introduction of partial reconfiguration in the third. The fourth demonstrates further reductions in both area and performance overheads---of 16.7% and 8.27%, respectively---through systematic data width reduction by allowing errors of less than ±0.5% of the maximum output value to propagate.Open Acces

    Indexed dependence metadata and its applications in software performance optimisation

    No full text
    To achieve continued performance improvements, modern microprocessor design is tending to concentrate an increasing proportion of hardware on computation units with less automatic management of data movement and extraction of parallelism. As a result, architectures increasingly include multiple computation cores and complicated, software-managed memory hierarchies. Compilers have difficulty characterizing the behaviour of a kernel in a general enough manner to enable automatic generation of efficient code in any but the most straightforward of cases. We propose the concept of indexed dependence metadata to improve application development and mapping onto such architectures. The metadata represent both the iteration space of a kernel and the mapping of that iteration space from a given index to the set of data elements that iteration might use: thus the dependence metadata is indexed by the kernel’s iteration space. This explicit mapping allows the compiler or runtime to optimise the program more efficiently, and improves the program structure for the developer. We argue that this form of explicit interface specification reduces the need for premature, architecture-specific optimisation. It improves program portability, supports intercomponent optimisation and enables generation of efficient data movement code. We offer the following contributions: an introduction to the concept of indexed dependence metadata as a generalisation of stream programming, a demonstration of its advantages in a component programming system, the decoupled access/execute model for C++ programs, and how indexed dependence metadata might be used to improve the programming model for GPU-based designs. Our experimental results with prototype implementations show that indexed dependence metadata supports automatic synthesis of double-buffered data movement for the Cell processor and enables aggressive loop fusion optimisations in image processing, linear algebra and multigrid application case studies

    Design Disjunction for Resilient Reconfigurable Hardware

    Get PDF
    Contemporary reconfigurable hardware devices have the capability to achieve high performance, power efficiency, and adaptability required to meet a wide range of design goals. With scaling challenges facing current complementary metal oxide semiconductor (CMOS), new concepts and methodologies supporting efficient adaptation to handle reliability issues are becoming increasingly prominent. Reconfigurable hardware and their ability to realize self-organization features are expected to play a key role in designing future dependable hardware architectures. However, the exponential increase in density and complexity of current commercial SRAM-based field-programmable gate arrays (FPGAs) has escalated the overhead associated with dynamic runtime design adaptation. Traditionally, static modular redundancy techniques are considered to surmount this limitation; however, they can incur substantial overheads in both area and power requirements. To achieve a better trade-off among performance, area, power, and reliability, this research proposes design-time approaches that enable fine selection of redundancy level based on target reliability goals and autonomous adaptation to runtime demands. To achieve this goal, three studies were conducted: First, a graph and set theoretic approach, named Hypergraph-Cover Diversity (HCD), is introduced as a preemptive design technique to shift the dominant costs of resiliency to design-time. In particular, union-free hypergraphs are exploited to partition the reconfigurable resources pool into highly separable subsets of resources, each of which can be utilized by the same synthesized application netlist. The diverse implementations provide reconfiguration-based resilience throughout the system lifetime while avoiding the significant overheads associated with runtime placement and routing phases. Evaluation on a Motion-JPEG image compression core using a Xilinx 7-series-based FPGA hardware platform has demonstrated the potential of the proposed FT method to achieve 37.5% area saving and up to 66% reduction in power consumption compared to the frequently-used TMR scheme while providing superior fault tolerance. Second, Design Disjunction based on non-adaptive group testing is developed to realize a low-overhead fault tolerant system capable of handling self-testing and self-recovery using runtime partial reconfiguration. Reconfiguration is guided by resource grouping procedures which employ non-linear measurements given by the constructive property of f-disjunctness to extend runtime resilience to a large fault space and realize a favorable range of tradeoffs. Disjunct designs are created using the mosaic convergence algorithm developed such that at least one configuration in the library evades any occurrence of up to d resource faults, where d is lower-bounded by f. Experimental results for a set of MCNC and ISCAS benchmarks have demonstrated f-diagnosability at the individual slice level with average isolation resolution of 96.4% (94.4%) for f=1 (f=2) while incurring an average critical path delay impact of only 1.49% and area cost roughly comparable to conventional 2-MR approaches. Finally, the proposed Design Disjunction method is evaluated as a design-time method to improve timing yield in the presence of large random within-die (WID) process variations for application with a moderately high production capacity

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Get PDF
    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some applications paradigms. Software languages and systems such as NVIDIA's CUDA and Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very many core GPUs for data parallelism is necessary. We describe our use of hybrid applica- tions using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications level programmer and o er some suggested areas for language development and integration between coarse-grained and ne-grained multi-thread systems. We discuss results from three common simulation algorithmic areas including: partial di erential equations; graph cluster metric calculations and random number generation. We report on programming experiences and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs; a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scienti c applications developers

    Evolutionary algorithms for synthesis and optimisation of sequential logic circuits

    Get PDF
    Considerable progress has been made recently 1n the understanding of combinational logic optimization. Consequently a large number of university and industrial Electric Computing Aided Design (ECAD) programs are now available for optimal logic synthesis of combinational circuits. The progress with sequential logic synthesis and optimization, on the other hand, is considerably less mature. In recent years, evolutionary algorithms have been found to be remarkably effective way of using computers for solving difficult problems. This thesis is, in large part, a concentrated effort to apply this philosophy to the synthesis and optimization of sequential circuits. A state assignment based on the use of a Genetic Algorithm (GA) for the optimal synthesis of sequential circuits is presented. The state assignment determines the structure of the sequential circuit realizing the state machine and therefore its area and performances. The synthesis based on the GA approach produced designs with the smallest area to date. Test results on standard fmite state machine (FS:M) benchmarks show that the GA could generate state assignments, which required on average 15.44% fewer gates and 13.47% fewer literals compared with alternative techniques. Hardware evolution is performed through a succeSSlOn of changes/reconfigurations of elementary components, inter-connectivity and selection of the fittest configurations until the target functionality is reached. The thesis presents new approaches, which combine both genetic algorithm for state assignment and extrinsic Evolvable Hardware (EHW) to design sequential logic circuits. The implemented evolutionary algorithms are able to design logic circuits with size and complexity, which have not been demonstrated in published work. There are still plenty of opportunities to develop this new line of research for the synthesis, optimization and test of novel digital, analogue and mixed circuits. This should lead to a new generation of Electronic Design Automation tools.EThOS - Electronic Theses Online ServiceGBUnited Kingdo

    RTL Design Quality Checks for Soft IPs

    Get PDF
    Soft IPs are architectural modules which are delivered in the form of synthesizable RTL level codes written in some HDL (hardware descriptive language) like Verilog or VHDL or System Verilog. They are technology independent and offer high degree of modification flexibility. RTL is the complete abstraction of our design. Since SOC complexity is growing day by day with new technologies and requirement, it will be very much difficult to debug and fix issues after physical level. So to reduce effort and increase efficiency and accuracy it is necessary to fix most of the bugs in RTL level. Also if we are using soft IP, then our bug free IP can be used by third party. So early detection of bugs helps us not to go back to entire design and do all the process again and again. One of the important issue at RTL level of a design is the Clock Domain Crossing (CDC) problem. This is the issue which affects the performance at each and every stage of the design flow. Failure in fixing these issues at the earlier stage makes the design unreliable and design performance collapses. The main issue in real time clock designs are the metastability issue. Although we cannot check or see these issues using our simulator but we have to make preventions at RTL level. This is done by restructuring the design and adding required synchronizers. One more important area of consideration in VLSI design is power consumption. In modern low power designs low power is a key factor. So design consuming less power is preferred over design consuming more power. This decision should be made as early as possible. RTL quality check helps us on this aspect. Using different tools power estimation can be performed at RTL stage which saves lots of efforts in redesigning. This project aims at checking clock domain crossing faults at RTL stage and doing redesign of circuit to eliminate those faults. Also an effort is made to compare quality of two designs in terms of delay, power consumption and area
    corecore