
    Survey of VLSI Test Data Compression Methods

    Test data compression has become an essential need in the VLSI field and hence a major research topic over the last decade. Still, there is considerable need and scope for further reduction in test data volume. This reduction may be lossy for output-side test data but must be lossless for input-side test data. This paper summarizes the methods applied for lossless compression of input-side test data, ranging from simple code-based methods to combined/hybrid methods. The goal is to survey the current methodologies applied for test data compression and to provide a platform for further development in this area.

    An innovative two-stage data compression scheme using adaptive block merging technique

    Test data volume has increased enormously owing to the rising on-chip complexity of integrated circuits, which in turn increases test data transportation time and tester memory requirements. Uncorrelated test bits also aggravate test power. This paper presents a two-stage block-merging-based test data minimization scheme that reduces test bits, test time and test power. The test data is partitioned into blocks of fixed size, which are compressed using a two-stage encoding technique. In stage one, successive compatible blocks are merged to retain a representative block. In stage two, the retained pattern block is further encoded based on ten different subcases that can exist between the two sub-blocks formed by splitting it into halves. Non-compatible blocks are also split into two sub-blocks and, where possible, encoded using fewer bits. A decompression architecture to retrieve the original test data is presented. Simulation results on ISCAS'89 benchmark circuits reflect the scheme's effectiveness in achieving better compression.
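    To make the merging step concrete, below is a minimal Python sketch of the stage-one idea described above: partition the test set into fixed-size blocks and merge runs of successive compatible blocks (blocks are compatible if they agree wherever both specify a bit; 'x' marks a don't-care), keeping one representative block plus a run count per run. The block size, the 'x' convention, and all helper names are illustrative assumptions, not the paper's exact encoding.

```python
# Illustrative sketch of stage-one block merging for test-data compression.
# Assumptions (not from the paper): blocks are strings over {'0', '1', 'x'},
# where 'x' is an unspecified (don't-care) bit; the encoded output is a list
# of (representative_block, run_length) pairs.

def compatible(a, b):
    """Two blocks are compatible if they agree wherever both are specified."""
    return all(p == q or p == 'x' or q == 'x' for p, q in zip(a, b))

def merge(a, b):
    """Merge two compatible blocks, keeping the specified bit where present."""
    return ''.join(q if p == 'x' else p for p, q in zip(a, b))

def stage_one(test_data, block_size):
    blocks = [test_data[i:i + block_size]
              for i in range(0, len(test_data), block_size)]
    encoded = []
    rep, run = blocks[0], 1
    for blk in blocks[1:]:
        if compatible(rep, blk):
            rep, run = merge(rep, blk), run + 1   # extend the merged run
        else:
            encoded.append((rep, run))            # emit representative + count
            rep, run = blk, 1
    encoded.append((rep, run))
    return encoded

print(stage_one("x1x0 01x0 0110 1x11".replace(" ", ""), 4))
# -> [('0110', 3), ('1x11', 1)]
```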

    Synchronization overhead in SOC compressed test

    An Efficient Test Vector Compression Technique Based on Block Merging

    In this paper, we present a new test data compression technique based on block merging. The technique capitalizes on the fact that many consecutive blocks of the test data can be merged. Compression is achieved by storing the merged block and the number of blocks merged. It also takes advantage of cases where the merged block can be filled with all 0s or all 1s. Test data decompression is performed on chip using simple circuitry that repeats the merged block the required number of times. The decompression circuitry has the advantage of being test data independent. Experimental results on benchmark circuits demonstrate the effectiveness of the proposed technique compared to previous approaches.
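    As an illustration of the decompression side described above, here is a hedged Python sketch: each codeword is assumed, purely for illustration, to carry a kind flag (stored pattern, all-zero fill, or all-one fill), a repeat count, and optionally the merged block, and the decoder simply repeats the block the required number of times. The codeword layout is not taken from the paper.

```python
# Illustrative decoder sketch for block-merging test data compression.
# Assumed codeword format (not from the paper): (kind, count, block), where
# kind is 'PAT' (repeat the stored merged block), 'ZERO' (all-0 block), or
# 'ONE' (all-1 block), and count is how many consecutive blocks it covers.

BLOCK_SIZE = 8  # assumed block width in bits

def decompress(codewords):
    out = []
    for kind, count, block in codewords:
        if kind == 'ZERO':
            block = '0' * BLOCK_SIZE
        elif kind == 'ONE':
            block = '1' * BLOCK_SIZE
        out.append(block * count)  # repeat the merged block 'count' times
    return ''.join(out)

codewords = [('ZERO', 3, None), ('PAT', 2, '01101100'), ('ONE', 1, None)]
print(decompress(codewords))
# -> 24 zeros, then '01101100' twice, then 8 ones
```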

    Techniques of design optimisation for algorithms implemented in software

    The overarching objective of this thesis was to develop tools for parallelising, optimising, and implementing algorithms on parallel architectures, in particular General Purpose Graphics Processing Units (GPGPUs). Two projects were chosen from different application areas in which GPGPUs are used: a defence application involving image compression, and a modelling application in bioinformatics (computational immunology). Each project had its own specific objectives, as well as supporting the overall research goal. The defence / image compression project was carried out in collaboration with the Jet Propulsion Laboratory. The specific questions were: to what extent an algorithm designed for bit-serial hardware implementation of the lossless compression of hyperspectral images on board unmanned aerial vehicles (UAVs) could be parallelised, whether GPGPUs could be used to implement that algorithm, and whether a software implementation with or without GPGPU acceleration could match the throughput of a dedicated hardware (FPGA) implementation. The dependencies within the algorithm were analysed, and the algorithm parallelised. The algorithm was implemented in software for GPGPU and optimised. During the optimisation process, profiling revealed less than optimal device utilisation, but no further optimisations resulted in an improvement in speed: the design had hit a local maximum of performance. Analysis of the arithmetic intensity and data flow exposed flaws in kernel occupancy, the standard metric used for GPU optimisation. Redesigning the implementation with revised criteria (fused kernels, lower occupancy, and greater data locality) led to a new implementation with 10x higher throughput. GPGPUs were shown to be viable for on-board implementation of the CCSDS lossless hyperspectral image compression algorithm, exceeding the performance of the hardware reference implementation and providing sufficient throughput for the next generation of image sensors as well.

    The second project was carried out in collaboration with biologists at the University of Arizona and involved modelling a complex biological system: VDJ recombination, which is involved in the formation of T-cell receptors (TCRs). Generation of immune receptors (T-cell receptors and antibodies) by VDJ recombination is an enormously complex process, which can theoretically synthesize more than 10^18 variants. Originally thought to be a random process, the underlying mechanisms clearly have a non-random nature that preferentially creates a small subset of immune receptors in many individuals. Understanding this bias is a longstanding problem in the field of immunology. Modelling the process of VDJ recombination to determine the number of ways each immune receptor can be synthesized, previously thought to be untenable, is a key first step in determining how this special population is made. The computational tools developed in this thesis have allowed immunologists for the first time to comprehensively test and invalidate a longstanding theory (convergent recombination) for how this special population is created, while generating the data needed to develop novel hypotheses.
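    The occupancy-versus-arithmetic-intensity argument above can be illustrated with a back-of-the-envelope roofline check; the Python sketch below uses assumed, illustrative GPU figures (not numbers from the thesis) to show why a memory-bound kernel gains nothing from higher occupancy alone.

```python
# Back-of-the-envelope roofline check with assumed, illustrative numbers
# (not taken from the thesis). A kernel doing 2 floating-point operations
# per 8 bytes read has an arithmetic intensity of 0.25 FLOP/byte; on a GPU
# with ~500 GB/s memory bandwidth and ~10 TFLOP/s peak compute it is
# memory-bound, so raising occupancy alone cannot raise its throughput,
# whereas fusing kernels to reuse already-loaded data can.

flops_per_element = 2
bytes_per_element = 8
intensity = flops_per_element / bytes_per_element   # FLOP per byte

mem_bandwidth = 500e9   # bytes/s (assumed)
peak_compute = 10e12    # FLOP/s  (assumed)

attainable = min(peak_compute, intensity * mem_bandwidth)
print(f"arithmetic intensity: {intensity:.2f} FLOP/B")
print(f"attainable: {attainable / 1e9:.0f} GFLOP/s "
      f"of {peak_compute / 1e12:.0f} TFLOP/s peak")
```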

    High throughput image compression and decompression on GPUs

    This work investigates possibilities to create a high-throughput, GPU-friendly, intra-only, wavelet-based video compression algorithm optimized for visually lossless applications. Addressing the key observation that JPEG 2000's entropy coder is a bottleneck and might be overly complex for a high-bit-rate scenario, various algorithmic alterations are proposed and evaluated. First, JPEG 2000's Selective Arithmetic Coding mode is realized on the GPU, but the gains in throughput are shown to be limited. Instead, two independent alterations not compliant with the standard are proposed, which (1) process each bit plane in only a single pass, giving up intra-bit-plane truncation points (the single-pass mode), and (2) introduce a true raw-coding mode that is parallelizable per sample and does not require any context modeling. Next, an alternative block coder from the literature, the Bitplane Coder with Parallel Coefficient Processing (BPC-PaCo), is evaluated. Since it trades signal adaptiveness for increased parallelism, it is shown here that a stationary probability model averaged over a set of test sequences yields competitive compression efficiency. A combination of BPC-PaCo with the single-pass mode is proposed and shown to increase the speedup with respect to the original JPEG 2000 entropy coder from 2.15x (BPC-PaCo with two passes) to 2.6x (BPC-PaCo with single-pass mode), at the cost of increasing the PSNR penalty by 0.3 dB to at most 1 dB. Furthermore, a parallel algorithm is presented that determines the optimal code block bit stream truncation points (given an available bit rate budget) and builds the entire code stream on the GPU, reducing the amount of data that has to be transferred back into host memory to a minimum. A theoretical runtime model is formulated that allows, based on benchmarking results from one GPU, the runtime of a kernel on another GPU to be predicted. Lastly, the first JPEG XS GPU decoder realization is presented. JPEG XS was designed as a low-complexity codec and, for the first time, explicitly demanded GPU-friendliness already in the call for proposals. At bit rates above 1 bpp, the decoder is around 2x faster than the original JPEG 2000 and 1.5x faster than JPEG 2000 with the fastest entropy coder evaluated here (BPC-PaCo with single-pass mode). With a GeForce GTX 1080, a decoding throughput of around 200 fps is achieved for a UHD 4:4:4 sequence.
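    For a rough sense of scale, the reported decoder figure can be turned into throughput numbers; the short Python calculation below assumes UHD means 3840x2160 and a compressed rate of 1 bpp, neither of which is stated precisely above, so it is only an order-of-magnitude illustration.

```python
# Order-of-magnitude throughput for the reported JPEG XS GPU decoder figure.
# Assumptions (not stated precisely above): UHD = 3840x2160 frames, 200 fps,
# and a compressed rate of 1 bit per pixel.

width, height, fps = 3840, 2160, 200
pixels_per_second = width * height * fps                 # decoded pixel rate
compressed_bits_per_second = width * height * 1 * fps    # at 1 bpp

print(f"{pixels_per_second / 1e9:.2f} Gpixel/s decoded")
print(f"{compressed_bits_per_second / 8 / 1e6:.0f} MB/s of compressed input at 1 bpp")
```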

    Integrating simultaneous bi-direction signalling in the test fabric of 3D stacked integrated circuits.

    The world has seen significant advancements in electronic devices' capabilities, most notably the ability to embed ultra-large-scale functionality in lightweight, area- and power-efficient devices. There has been an enormous push towards quality and reliability in consumer electronics, which have become an indispensable part of human life. Consequently, the tests conducted on these devices in the final stages before they are shipped to customers are of high significance to the research community. However, researchers have always struggled to find a balance between test time (hence test cost) and test overheads; unfortunately, the two are inversely proportional. On the other hand, the ever-increasing demand for more powerful and compact devices now faces a new challenge. Historically, with advancements in manufacturing technology, electronic devices witnessed miniaturization at an exponential pace, as predicted by Moore's law. However, further geometric or effective 2D scaling appears increasingly difficult due to performance and power concerns with smaller technology nodes. One promising way forward is to form 3D Stacked Integrated Circuits (SICs), in which individual dies are stacked vertically and interconnected using Through Silicon Vias (TSVs) before being packaged as a single chip. This allows more functionality to be embedded within a reduced footprint and addresses another critical problem observed in 2D designs: increasingly long interconnects and the associated latency. However, as more and more functionality is embedded into a small area, it becomes increasingly challenging to access the internal states (to observe or control them) after the device is fabricated, which is essential for testing. This access is restricted by the limited number of Chip Terminals (IC pins and vertical Through Silicon Vias) that a chip can be fitted with, by power consumption concerns, and by the chip area overhead that can be allocated for testing.

    This research investigates Simultaneous Bi-Directional Signalling (SBS) for use in Test Access Mechanism (TAM) designs in 3D SICs. SBS enables a chip to simultaneously send and receive test vectors on a single Chip Terminal (CT), effectively doubling the per-pin efficiency, which can be translated into additional test channels for test time reduction or into Chip Terminal reduction for resource efficiency. The research shows that SBS-based test access methods have significant potential to reduce test times and/or test resources compared to traditional approaches, thereby opening up new avenues towards cost-effectiveness and reliability of future electronics.

    High Performance Data Acquisition and Analysis Routines for the Nab Experiment

    Probes of the Standard Model of particle physics are pushing further and further into the so-called "precision frontier". In order to reach the precision goals of these experiments, a combination of elegant experimental design and robust data acquisition and analysis is required. Two experiments that embody this philosophy are the Nab and Calcium-45 experiments, which probe the understanding of the weak interaction by examining the beta decay of the free neutron and of Calcium-45, respectively. They both aim to measure correlation parameters in the neutron beta decay alphabet, a and b. The parameter a, the electron-neutrino correlation coefficient, is sensitive to λ, the ratio of the axial-vector and vector coupling strengths in the decay of the free neutron. This parameter λ, in tandem with a precision measurement of the neutron lifetime τ, provides a measurement of the matrix element Vud from the CKM quark mixing matrix. The CKM matrix, as a rotation matrix, must be unitary. Probes of Vud and Vus in recent years have revealed tension in this unitarity at the 2.2σ level. The measurement of a via the decay of free cold neutrons serves as an additional method of extracting Vud that is sensitive to a different set of systematic effects, and as such it is an excellent probe into the source of the deviation from unitarity. The parameter b, the Fierz interference term, appears as a distortion in the measured electron energy spectra from beta decay. This parameter, if non-zero, would indicate the existence of Scalar and/or Tensor couplings in the weak interaction, which according to the Standard Model is purely Vector minus Axial-Vector; it is therefore a search for physics beyond the Standard Model (BSM). The Nab and Calcium-45 experiments probe these parameters with a combination of elegant experimental design and brute-force collection and analysis of large amounts of digitized detector data. These datasets, particularly in the case of the Nab experiment, are anticipated to span multiple petabytes and will require high-performance online analysis and precision offline analysis routines in order to reach the experimental goals. Of particular note are the requirements for better than 3 keV energy resolution and an understanding of the uncertainty in the mean timing bias for the detected particles within 300 ps. Presented in this dissertation are an overview of the experiments and their design, a description of the data acquisition systems and analysis routines that have been developed to support the experiments, and a discussion of the data analysis performed for the Calcium-45 experiment.

    Frequent itemset mining on multiprocessor systems

    Frequent itemset mining is an important building block in many data mining applications such as market basket analysis, recommendation, web mining, fraud detection, and gene expression analysis. In many of them, the datasets being mined can easily grow to hundreds of gigabytes or even terabytes of data, so efficient algorithms are required to process such large amounts of data. In recent years, many frequent-itemset mining algorithms have been proposed, which, however, (1) often have high memory requirements and (2) do not exploit the large degrees of parallelism provided by modern multiprocessor systems. The high memory requirements arise mainly from inefficient data structures that have only been shown to be sufficient for small datasets. For large datasets, however, the use of these data structures forces the algorithms to go out-of-core, i.e., to access secondary memory, which leads to serious performance degradation. Exploiting the available parallelism is further required to mine large datasets because the serial performance of processors has almost stopped increasing. Algorithms should therefore exploit the large number of available threads as well as other kinds of parallelism (e.g., vector instruction sets) beyond thread-level parallelism. In this work, we tackle the high memory requirements of frequent itemset mining in two ways: we (1) compress the datasets being mined, because they must be kept in main memory during several mining invocations, and (2) improve existing mining algorithms with memory-efficient data structures. For compressing the datasets, we employ efficient encodings that show good compression performance on a wide variety of realistic datasets, reducing dataset size by up to 6.4x. The encodings can further be applied directly while loading the dataset from disk or network. Since encoding and decoding are repeatedly required for loading and mining the datasets, we reduce their cost by providing parallel encodings that achieve high throughputs for both tasks. For a memory-efficient representation of the mining algorithms' intermediate data, we propose compact data structures and even employ explicit compression. Both methods together reduce the intermediate data's size by up to 25x. The smaller memory requirements avoid or delay expensive out-of-core computation when large datasets are mined. To cope with the high parallelism provided by current multiprocessor systems, we identify the performance hot spots and scalability issues of existing frequent-itemset mining algorithms. The hot spots, which form the basic building blocks of these algorithms, cover (1) counting the frequency of fixed-length strings, (2) building prefix trees, (3) compressing integer values, and (4) intersecting lists of sorted integer values or bitmaps. For all of them, we discuss how to exploit the available parallelism and provide scalable solutions. Furthermore, almost all components of the mining algorithms must be parallelized to keep the sequential fraction of the algorithms as small as possible. We integrate the parallelized building blocks and components into three well-known mining algorithms and further analyze the impact of certain existing optimizations. Even single-threaded, our algorithms are often up to an order of magnitude faster than existing highly optimized algorithms, and they further scale almost linearly on a large 32-core multiprocessor system. Although our optimizations are intended for frequent-itemset mining algorithms, they can be applied with only minor changes to algorithms used for mining other types of itemsets.
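    One of the building blocks named above, intersecting sorted transaction-ID lists, can be sketched in a few lines of Python; this is the plain sequential two-pointer version (Eclat-style support counting), not the thesis's parallelized or vectorized kernel, and the example data is invented for illustration.

```python
# Illustrative sketch of one mining building block: intersecting sorted
# transaction-ID (tid) lists to compute the support of an itemset from the
# tidlists of its items. Sequential two-pointer version only.

def intersect_sorted(a, b):
    """Intersect two sorted, duplicate-free integer lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i]); i += 1; j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

# Hypothetical tidlists: which transactions contain each item.
tids = {'bread': [1, 2, 4, 5, 7], 'milk': [2, 3, 4, 7, 9], 'eggs': [2, 4, 8]}
support = len(intersect_sorted(intersect_sorted(tids['bread'], tids['milk']),
                               tids['eggs']))
print(support)   # itemset {bread, milk, eggs} occurs in 2 transactions
```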