
    Linear image processing operations with operational tight packing

    Computer hardware with native support for large-bitwidth operations can be used for the concurrent calculation of multiple independent linear image processing operations when these operations map integers to integers. This is achieved by packing multiple input samples into one large-bitwidth number, performing a single operation on that number, and unpacking the results. We propose an operational framework for tight packing, i.e., achieving the maximum packing possible for a given implementation. We validate our framework on the floating-point units natively supported in mainstream programmable processors. For image processing tasks where operational tight packing leads to increased packing in comparison to previously known operational packing, the processing throughput is increased by up to 25%. © 2010 IEEE
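    As a rough illustration of the packing idea described above (a minimal NumPy sketch, not the paper's operational framework): two small-dynamic-range integer samples can be packed into a single double-precision value, scaled with one multiplication, and unpacked exactly, because all intermediate integers stay within the 53-bit significand of IEEE-754 doubles.

```python
import numpy as np

# Toy illustration of packing two integer samples into one double and
# performing a single linear operation on the packed value. Assumes 8-bit
# inputs and a small integer scaling factor, so each partial result fits in
# 16 bits and the packed value stays far below the 53-bit mantissa limit.
SHIFT = 2 ** 16

a = np.array([12, 200, 37], dtype=np.float64)   # first input block
b = np.array([99, 3, 151], dtype=np.float64)    # second input block
w = 7.0                                         # integer scaling factor

packed = a * SHIFT + b          # pack: one number carries two samples
result = packed * w             # one multiplication computes both products

high = np.floor(result / SHIFT) # unpack w*a
low = result - high * SHIFT     # unpack w*b

assert np.array_equal(high, w * a) and np.array_equal(low, w * b)
```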

    Throughput-Distortion Computation Of Generic Matrix Multiplication: Toward A Computation Channel For Digital Signal Processing Systems

    The generic matrix multiply (GEMM) function is the core element of high-performance linear algebra libraries used in many computationally-demanding digital signal processing (DSP) systems. We propose an acceleration technique for GEMM based on dynamically adjusting the imprecision (distortion) of computation. Our technique applies adaptive scalar companding and rounding to input matrix blocks, followed by two forms of packing in floating-point that allow for the concurrent calculation of multiple results. Since the adaptive companding process controls the increase of concurrency (via packing), the increase in processing throughput (and the corresponding increase in distortion) depends on the input data statistics. To demonstrate this, we derive the optimal throughput-distortion control framework for GEMM for the broad class of zero-mean, independent and identically distributed input sources. Our approach converts matrix multiplication in programmable processors into a computation channel: when increasing the processing throughput, the output noise (error) increases due to (i) coarser quantization and (ii) computational errors caused by exceeding the machine-precision limitations. We show that, under certain distortion in the GEMM computation, the proposed framework can significantly surpass 100% of the peak performance of a given processor. The practical benefits of our proposal are shown in a face recognition system and a multi-layer perceptron system trained for metadata learning from a large music feature database.
    Comment: IEEE Transactions on Signal Processing (vol. 60, 2012)
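    A minimal sketch of the quantize-then-pack idea for GEMM (illustrative only; the paper's adaptive companding and throughput-distortion control are not reproduced, and the block sizes and bit depth below are arbitrary assumptions): two quantized row blocks of A are packed into one matrix so that a single matrix product yields both results, at the cost of quantization distortion.

```python
import numpy as np

# Toy sketch of quantize-then-pack GEMM (illustrative; not the paper's
# adaptive companding or its optimal throughput-distortion control).
rng = np.random.default_rng(0)
bits, n = 6, 64                                  # assumed bit depth and inner dimension
A1, A2 = rng.standard_normal((2, 32, n))         # two zero-mean i.i.d. input blocks
B = rng.integers(0, 2 ** bits, size=(n, 32)).astype(np.float64)

step = (np.abs(A1).max() + np.abs(A2).max()) / 2 ** bits
Q1, Q2 = np.round(A1 / step), np.round(A2 / step)    # scalar quantization

# |Q2 @ B| stays below n * 2^(2*bits), so this shift keeps the two results
# separable while every packed product remains exact in double precision.
SHIFT = float(2 ** int(np.ceil(np.log2(n)) + 2 * bits + 1))

packed = (Q1 * SHIFT + Q2) @ B                   # one GEMM call, two result blocks
P1 = np.round(packed / SHIFT)                    # high part: Q1 @ B
P2 = packed - P1 * SHIFT                         # low part:  Q2 @ B

print("max |error|, block 1:", np.abs(step * P1 - A1 @ B).max())
print("max |error|, block 2:", np.abs(step * P2 - A2 @ B).max())
```

    The shift that separates the two packed results plays, loosely, the role that the adaptive companding stage controls in the paper: packing more concurrent results forces coarser quantization and hence more distortion.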

    Throughput Scaling Of Convolution For Error-Tolerant Multimedia Applications

    Convolution and cross-correlation are the basis of filtering and of pattern or template matching in multimedia signal processing. We propose two throughput scaling options for any one-dimensional convolution kernel in programmable processors by adjusting the imprecision (distortion) of computation. Our approach is based on scalar quantization, followed by two forms of tight packing in floating-point (one of which is proposed in this paper) that allow for the concurrent calculation of multiple results. We illustrate how our approach can operate as an optional pre- and post-processing layer for off-the-shelf optimized convolution routines. This is useful for multimedia applications that are tolerant to processing imprecision and for cases where the input signals are inherently noisy (error-tolerant multimedia applications). Indicative experimental results with a digital music matching system and an MPEG-7 audio descriptor system demonstrate that the proposed approach offers up to a 175% increase in processing throughput against optimized (full-precision) convolution with virtually no effect on the accuracy of the results. Based on the marginal statistics of the input data, it is also shown how the throughput and distortion can be adjusted per input block of samples under constraints on the signal-to-noise ratio against the full-precision convolution.
    Comment: IEEE Trans. on Multimedia, 201
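    A minimal sketch of the pre-/post-processing view for convolution (illustrative; the quantization stage is omitted and np.convolve stands in for an optimized off-the-shelf routine): two integer signal blocks are packed, filtered with one call, and unpacked exactly.

```python
import numpy as np

# Minimal sketch of the pre-/post-processing idea: pack two integer signal
# blocks, call an off-the-shelf convolution routine once, then unpack both
# filtered blocks. (Illustrative; the paper's quantization stage is omitted.)
SHIFT = 2 ** 20                               # large enough for the partial results below

x1 = np.array([3, 1, 4, 1, 5, 9, 2, 6], dtype=np.float64)
x2 = np.array([2, 7, 1, 8, 2, 8, 1, 8], dtype=np.float64)
h = np.array([1, 2, 1], dtype=np.float64)     # non-negative integer kernel

y_packed = np.convolve(x1 * SHIFT + x2, h)    # single call filters both blocks
y1 = np.floor(y_packed / SHIFT)               # = np.convolve(x1, h)
y2 = y_packed - y1 * SHIFT                    # = np.convolve(x2, h)

assert np.array_equal(y1, np.convolve(x1, h))
assert np.array_equal(y2, np.convolve(x2, h))
```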

    Reliable Linear, Sesquilinear and Bijective Operations On Integer Data Streams Via Numerical Entanglement

    A new technique is proposed for fault-tolerant linear, sesquilinear and bijective (LSB) operations on M integer data streams (M ≥ 3), such as: scaling, additions/subtractions, inner or outer vector products, permutations and convolutions. In the proposed method, the M input integer data streams are linearly superimposed to form M numerically-entangled integer data streams that are stored in place of the original inputs. A series of LSB operations can then be performed directly using these entangled data streams. The results are extracted from the M entangled output streams by additions and arithmetic shifts. Any soft errors affecting any single disentangled output stream are guaranteed to be detectable via a specific post-computation reliability check. In addition, when utilizing a separate processor core for each of the M streams, the proposed approach can recover all outputs after any single fail-stop failure. Importantly, unlike algorithm-based fault tolerance (ABFT) methods, the number of operations required for the entanglement, extraction and validation of the results is linearly related to the number of inputs and does not depend on the complexity of the performed LSB operations. We have validated our proposal on an Intel processor (Haswell architecture with AVX2 support) via fast Fourier transforms, circular convolutions, and matrix multiplication operations. Our analysis and experiments reveal that the proposed approach incurs between 0.03% and 7% reduction in processing throughput for a wide variety of LSB operations. This overhead is 5 to 1000 times smaller than that of the equivalent ABFT method that uses a checksum stream. Thus, our proposal can be used in fault-generating processor hardware or safety-critical applications, where high reliability is required without the cost of ABFT or modular redundancy.
    Comment: to appear in IEEE Trans. on Signal Processing, 201
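    A toy illustration of the entanglement idea for M = 3 (a simplified circular superposition, not the exact construction or the reliability check of the paper): each entangled stream carries two originals, so after a linear operation every output is produced twice and a corrupted stream shows up as a mismatch.

```python
import numpy as np

# Toy illustration of numerical entanglement for M = 3 streams (a simplified
# circular superposition, not the paper's exact construction). Each entangled
# stream carries two originals; after a linear operation every output appears
# twice, so a corrupted stream is detected by a mismatch between copies.
S = 2 ** 24                                        # shift separating the two halves
rng = np.random.default_rng(1)
a, b, c = rng.integers(0, 2 ** 10, size=(3, 8))    # three non-negative integer streams

e = [a * S + b, b * S + c, c * S + a]              # entangled streams stored in place
w = 5                                              # the linear operation: scaling by w
out = [w * x for x in e]                           # operate directly on entangled data

# Extraction by shifts (exact, since w * stream < S for every stream).
hi = [o // S for o in out]
lo = [o - h * S for o, h in zip(out, hi)]

# Reliability check: w*b is produced both as lo[0] and hi[1], and so on.
for i in range(3):
    assert np.array_equal(lo[i], hi[(i + 1) % 3])
print("outputs:", hi[0], hi[1], hi[2])             # w*a, w*b, w*c
```

    Here the check only detects a mismatch between the two copies; the paper's construction additionally guarantees detection of soft errors on any single disentangled output stream and, with one core per stream, recovery after a single fail-stop failure.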

    Zero-error channel capacity and simulation assisted by non-local correlations

    Shannon's theory of zero-error communication is re-examined in the broader setting of using one classical channel to simulate another exactly, and in the presence of various resources that are all classes of non-signalling correlations: shared randomness, shared entanglement and arbitrary non-signalling correlations. Specifically, when the channel being simulated is noiseless, this reduces to the zero-error capacity of the channel, assisted by the various classes of non-signalling correlations. When the resource channel is noiseless, it results in the "reverse" problem of simulating a noisy channel exactly by a noiseless one, assisted by correlations. In both cases, 'one-shot' separations between the power of the different assisting correlations are exhibited. The most striking result of this kind is that entanglement can assist in zero-error communication, in stark contrast to the standard setting of communication with asymptotically vanishing error, in which entanglement does not help at all. In the asymptotic case, shared randomness is shown to be just as powerful as arbitrary non-signalling correlations for noisy channel simulation, which is not true for the asymptotic zero-error capacities. For assistance by arbitrary non-signalling correlations, linear programming formulas for capacity and simulation are derived, the former being equal (for channels with non-zero unassisted capacity) to the feedback-assisted zero-error capacity originally derived by Shannon to upper-bound the unassisted zero-error capacity. Finally, a kind of reversibility between non-signalling-assisted capacity and simulation is observed, mirroring the famous "reverse Shannon theorem".
    Comment: 18 pages, 1 figure. Small changes to text in v2. Removed an unnecessarily strong requirement in the premise of Theorem 1.
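    As background only (standard zero-error theory, not a result of this paper): the number of messages that can be sent with zero error in a single use of an unassisted channel is the independence number of its confusability graph, as in the small sketch below for the pentagon channel.

```python
from itertools import combinations
import numpy as np

# Background sketch (standard zero-error theory, not taken from the paper):
# two inputs are confusable if some output can be produced by both; the number
# of messages sendable with zero error in one channel use is the independence
# number of the confusability graph. Example: the "pentagon" channel.
N = 5
W = np.zeros((N, N))                      # W[x, y] = P(output y | input x)
for x in range(N):
    W[x, x] = W[x, (x + 1) % N] = 0.5

confusable = lambda x, xp: np.any((W[x] > 0) & (W[xp] > 0))

def max_independent_set(n):
    for k in range(n, 0, -1):             # brute force, fine for tiny alphabets
        for s in combinations(range(n), k):
            if all(not confusable(x, xp) for x, xp in combinations(s, 2)):
                return s
    return ()

print(max_independent_set(N))             # e.g. (0, 2): two zero-error messages
```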

    Software-based Approximate Computation Of Signal Processing Tasks

    This thesis introduces a new dimension in the performance scaling of signal processing systems by proposing software frameworks that achieve increased processing throughput when producing approximate results. The first contribution of this work is a new theory for the accelerated computation of multimedia processing based on the concept of tight packing (Chapter 2). Use of this theory accelerates small-dynamic-range linear signal processing tasks (such as convolution and transform decomposition) that map integers to integers, without incurring any accuracy loss. The concept of tight packing is combined with incremental computation that processes inputs in a bitplane-by-bitplane manner (Chapter 3), thereby leading to substantial throughput/distortion scalability within filtering, transform-decomposition and motion-estimation tasks. This framework also provides for region-of-interest computation and has inherent robustness to arbitrary termination of processing, imposed, for example, by a task scheduler. Finally, the concept of packed processing is extended to floating-point (lossy) matrix computations, with particular focus on the generic matrix multiplication (GEMM) routine of BLAS-3 (Chapters 4 and 5). This routine is a fundamental building block for several linear algebra and digital signal processing systems, such as face recognition and neural-network training for metadata-based retrieval systems. In order to compete with the best-performing software designs for GEMM, an implementation using single instruction, multiple data (SIMD) instructions is presented and analyzed. The proposed approach demonstrates substantial performance scaling in practice; specifically, it is shown to achieve up to twice the processing throughput of the best designs for GEMM when producing approximate results (on the same hardware). In summary, the proposed approximate computation of signal processing tasks can be selectively disabled, thereby reverting to conventional full-precision (lower-throughput) processing when deemed necessary. Importantly, the proposed software designs run on off-the-shelf computer hardware and provide for on-demand reconfiguration, depending on the input data and the precision specification (from full precision to noisy computation). Thus, the proposed approximate computation framework allows for backward compatibility and can be offered as an add-on service, creating significant competitive advantages for application developers. It can be used in mobile or high-performance computing systems when the precision of computation is not of critical importance (error-tolerant systems), or when the input data is intrinsically noisy.
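    A minimal sketch of the bitplane-by-bitplane idea of Chapter 3 (an assumption-laden toy: an unsigned 8-bit signal, a short FIR kernel, and plain np.convolve rather than the thesis' packed implementation): since convolution is linear, filtering the most significant bitplanes first refines the result incrementally and remains usable if processing is terminated early.

```python
import numpy as np

# Sketch of incremental, bitplane-by-bitplane filtering (illustrative only):
# convolution is linear, so filtering the most significant bitplanes first and
# accumulating 2^b-weighted partial results refines the output incrementally.
# Terminating early (e.g. by a scheduler) still leaves an approximate result.
x = np.array([17, 240, 3, 128, 66, 90, 255, 12], dtype=np.uint8)
h = np.array([1, 2, 1], dtype=np.int64)

y = np.zeros(len(x) + len(h) - 1, dtype=np.int64)
for b in range(7, -1, -1):                      # most significant bitplane first
    plane = (x >> b) & 1                        # extract bitplane b
    y += (1 << b) * np.convolve(plane.astype(np.int64), h)
    # `y` is already a usable approximation here; processing may stop early.

assert np.array_equal(y, np.convolve(x.astype(np.int64), h))
```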

    Reliable Linear, Sesquilinear, and Bijective Operations on Integer Data Streams Via Numerical Entanglement

    A new technique is proposed for fault-tolerant linear, sesquilinear and bijective (LSB) operations on M integer data streams (M ≥ 3), such as: scaling, additions/subtractions, inner or outer vector products, permutations and convolutions. In the proposed method, the M input integer data streams are linearly superimposed to form M numerically-entangled integer data streams that are stored in place of the original inputs. LSB operations can then be performed directly using these entangled data streams. The results are extracted from the M entangled output streams by additions and arithmetic shifts. Any soft errors affecting one disentangled output stream are guaranteed to be detectable via a post-computation reliability check. Additionally, when utilizing a separate processor core for each stream, our approach can recover all outputs after any single fail-stop failure. Importantly, unlike algorithm-based fault tolerance (ABFT) methods, the number of operations required for the entire process is linearly related to the number of inputs and does not depend on the complexity of the performed LSB operations. We have validated our proposal on an Intel processor via several types of operations: fast Fourier transforms, convolutions, and matrix multiplication operations. Our analysis and experiments reveal that the proposed approach incurs between 0.03% and 7% reduction in processing throughput for numerous LSB operations. This overhead is 5 to 1000 times smaller than that of the equivalent ABFT method that uses a checksum stream. Thus, our proposal can be used in fault-generating processor hardware or safety-critical applications, where high reliability is required without the cost of ABFT or modular redundancy.

    Core Failure Mitigation in Integer Sum-of-Product Computations on Cloud Computing Systems

    The decreasing mean-time-to-failure estimates in cloud computing systems indicate that multimedia applications running on such environments should be able to mitigate an increasing number of core failures at runtime. We propose a new roll-forward failure-mitigation approach for integer sum-of-product computations, with emphasis on generic matrix multiplication (GEMM) and convolution/cross-correlation (CONV) routines. Our approach is based on the production of redundant results within the numerical representation of the outputs via the use of numerical packing. This differs from all existing roll-forward solutions, which require a separate set of checksum (or duplicate) results. Our proposal imposes a 37.5% reduction in the maximum output bitwidth supported in comparison to integer sum-of-product realizations performed on 32-bit integer representations, which is comparable to the bitwidth requirement of checksum methods for multiple-core failure mitigation. Experiments with state-of-the-art GEMM and CONV routines running on a c4.8xlarge compute-optimized instance of Amazon Web Services Elastic Compute Cloud (AWS EC2) demonstrate that the proposed approach is able to mitigate up to one quad-core failure while achieving processing throughput that is: 1) comparable to that of the conventional, failure-intolerant, integer GEMM and CONV routines, and 2) substantially superior to that of the equivalent roll-forward failure-mitigation method based on checksum streams. Furthermore, when used within an image retrieval framework deployed over a cluster of AWS EC2 spot (i.e., low-cost albeit terminatable) instances, our proposal leads to: 1) 16%-23% cost reduction against the equivalent checksum-based method, and 2) more than 70% cost reduction against conventional failure-intolerant processing on AWS EC2 on-demand (i.e., higher-cost albeit guaranteed) instances.
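    A toy sketch of the roll-forward idea (illustrative only, not the paper's exact packing scheme or bitwidth budget): each "core" multiplies a packed block that carries its own rows of A together with the next core's rows, so every block product also exists inside another core's output and a single failed core can be rolled forward without checksum streams.

```python
import numpy as np

# Toy sketch of roll-forward mitigation via numerical packing (illustrative,
# not the paper's exact scheme): core i multiplies a packed block carrying its
# own rows of A plus the next core's rows, so each block product also appears
# inside a neighboring core's output and no separate checksum stream is kept.
S = 2 ** 26                                          # shift separating the two copies
rng = np.random.default_rng(2)
M, n = 4, 16                                         # number of "cores", inner dimension
A = rng.integers(0, 2 ** 4, size=(M, 8, n)).astype(np.float64)
B = rng.integers(0, 2 ** 4, size=(n, 8)).astype(np.float64)

# One packed GEMM call per core (all values stay below 2^53, so exact).
out = [(A[i] * S + A[(i + 1) % M]) @ B for i in range(M)]

hi = [np.floor(o / S) for o in out]                  # core i's own result: A[i] @ B
lo = [o - h * S for o, h in zip(out, hi)]            # redundant copy of A[i+1] @ B

failed = 2                                           # pretend core 2 fail-stopped;
recovered = lo[(failed - 1) % M]                     # roll forward from core 1's output
assert np.array_equal(recovered, A[failed] @ B)
```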