Search CORE

4,277 research outputs found

Pipelining Saturated Accumulation

Author: Chan Stephanie
DeHon André
Kapre Nachiket
Papadantonakis Karl
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 02/04/2008
Field of study

Aggressive pipelining and spatial parallelism allow integrated circuits (e.g., custom VLSI, ASICs, and FPGAs) to achieve high throughput on many Digital Signal Processing applications. However, cyclic data dependencies in the computation can limit parallelism and reduce the efficiency and speed of an implementation. Saturated accumulation is an important example where such a cycle limits the throughput of signal processing applications. We show how to reformulate saturated addition as an associative operation so that we can use a parallel-prefix calculation to perform saturated accumulation at any data rate supported by the device. This allows us, for example, to design a 16-bit saturated accumulator which can operate at 280 MHz on a Xilinx Spartan-3(XC3S-5000-4) FPGA, the maximum frequency supported by the component's DCM

CiteSeerX

Caltech Authors

Optimistic Parallelization of Floating-Point Accumulation

Author: DeHon André
Kapre Nachiket
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2007
Field of study

Floating-point arithmetic is notoriously non-associative due to the limited precision representation which demands intermediate values be rounded to fit in the available precision. The resulting cyclic dependency in floating-point accumulation inhibits parallelization of the computation, including efficient use of pipelining. In practice, however, we observe that floating-point operations are "mostly" associative. This observation can be exploited to parallelize floating-point accumulation using a form of optimistic concurrency. In this scheme, we first compute an optimistic associative approximation to the sum and then relax the computation by iteratively propagating errors until the correct sum is obtained. We map this computation to a network of 16 statically-scheduled, pipelined, double-precision floating-point adders on the Virtex-4 LX160 (-12) device where each floating-point adder runs at 296 MHz and has a pipeline depth of 10. On this 16 PE design, we demonstrate an average speedup of 6× with randomly generated data and 3-7× with summations extracted from Conjugate Gradient benchmarks

CiteSeerX

Crossref

Caltech Authors

ScholarlyCommons@Penn

On the efficiency of reductions in µ-SIMD media extensions

Author: Corbal San Adrián Jesús
Espasa Sans Roger
Valero Cortés Mateo
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2001
Field of study

Many important multimedia applications contain a significant fraction of reduction operations. Although, in general, multimedia applications are characterized for having high amounts of Data Level Parallelism, reductions and accumulations are difficult to parallelize and show a poor tolerance to increases in the latency of the instructions. This is specially significant for µ-SIMD extensions such as MMX or AltiVec. To overcome the problem of reductions in µ-SIMD ISAs, designers tend to include more and more complex instructions able to deal with the most common forms of reductions in multimedia. As long as the number of processor pipeline stages grows, the number of cycles needed to execute these multimedia instructions increases with every processor generation, severely compromising performance. The paper presents an in-depth discussion of how reductions/accumulations are performed in current µ-SIMD architectures and evaluates the performance trade-offs for near-future highly aggressive superscalar processors with three different styles of µ-SIMD extensions. We compare a MMX-like alternative to a MDMX-like extension that has packed accumulators to attack the reduction problem, and we also compare it to MOM, a matrix register ISA. We show that while packed accumulators present several advantages, they introduce artificial recurrences that severely degrade performance for processors with high number of registers and long latency operations. On the other hand, the paper demonstrates that longer SIMD media extensions such as MOM can take great advantage of accumulators by exploiting the associative parallelism implicit in reductions.Peer ReviewedPostprint (published version

UPCommons. Portal del coneixement obert de la UPC

Parallel Implementation of Efficient Search Schemes for the Inference of Cancer Progression Models

Author: Antoniotti Marco
Cazzaniga Paolo
Mauri Giancarlo
Nobile Marco S.
Ramazzotti Daniele
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2016
Field of study

The emergence and development of cancer is a consequence of the accumulation over time of genomic mutations involving a specific set of genes, which provides the cancer clones with a functional selective advantage. In this work, we model the order of accumulation of such mutations during the progression, which eventually leads to the disease, by means of probabilistic graphic models, i.e., Bayesian Networks (BNs). We investigate how to perform the task of learning the structure of such BNs, according to experimental evidence, adopting a global optimization meta-heuristics. In particular, in this work we rely on Genetic Algorithms, and to strongly reduce the execution time of the inference -- which can also involve multiple repetitions to collect statistically significant assessments of the data -- we distribute the calculations using both multi-threading and a multi-node architecture. The results show that our approach is characterized by good accuracy and specificity; we also demonstrate its feasibility, thanks to a 84x reduction of the overall execution time with respect to a traditional sequential implementation

arXiv.org e-Print Archive

Repository TU/e

Efficient Compressive Sampling of Spatially Sparse Fields in Wireless Sensor Networks

Author: Colonnese Stefania
Cusani Roberto
Rinauro Stefano
Ruggiero Giorgia
Scarano Gaetano
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013
Field of study

Wireless sensor networks (WSN), i.e. networks of autonomous, wireless sensing nodes spatially deployed over a geographical area, are often faced with acquisition of spatially sparse fields. In this paper, we present a novel bandwidth/energy efficient CS scheme for acquisition of spatially sparse fields in a WSN. The paper contribution is twofold. Firstly, we introduce a sparse, structured CS matrix and we analytically show that it allows accurate reconstruction of bidimensional spatially sparse signals, such as those occurring in several surveillance application. Secondly, we analytically evaluate the energy and bandwidth consumption of our CS scheme when it is applied to data acquisition in a WSN. Numerical results demonstrate that our CS scheme achieves significant energy and bandwidth savings wrt state-of-the-art approaches when employed for sensing a spatially sparse field by means of a WSN.Comment: Submitted to EURASIP Journal on Advances in Signal Processin

arXiv.org e-Print Archive

Springer - Publisher Connector

Archivio della ricerca- Università di Roma La Sapienza

Taking advantage of hybrid systems for sparse direct solvers via task-based runtimes

Author: Bosilca George
Faverge Mathieu
Lacoste Xavier
Ramet Pierre
Thibault Samuel
Publication venue
Publication date: 06/01/2014
Field of study

The ongoing hardware evolution exhibits an escalation in the number, as well as in the heterogeneity, of computing resources. The pressure to maintain reasonable levels of performance and portability forces application developers to leave the traditional programming paradigms and explore alternative solutions. PaStiX is a parallel sparse direct solver, based on a dynamic scheduler for modern hierarchical manycore architectures. In this paper, we study the benefits and limits of replacing the highly specialized internal scheduler of the PaStiX solver with two generic runtime systems: PaRSEC and StarPU. The tasks graph of the factorization step is made available to the two runtimes, providing them the opportunity to process and optimize its traversal in order to maximize the algorithm efficiency for the targeted hardware platform. A comparative study of the performance of the PaStiX solver on top of its native internal scheduler, PaRSEC, and StarPU frameworks, on different execution environments, is performed. The analysis highlights that these generic task-based runtimes achieve comparable results to the application-optimized embedded scheduler on homogeneous platforms. Furthermore, they are able to significantly speed up the solver on heterogeneous environments by taking advantage of the accelerators while hiding the complexity of their efficient manipulation from the programmer.Comment: Heterogeneity in Computing Workshop (2014

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server

Function-specific schemes for verifiable computation

Author: Papadopoulos Dimitrios
Publication venue
Publication date: 07/12/2016
Field of study

An integral component of modern computing is the ability to outsource data and computation to powerful remote servers, for instance, in the context of cloud computing or remote file storage. While participants can benefit from this interaction, a fundamental security issue that arises is that of integrity of computation: How can the end-user be certain that the result of a computation over the outsourced data has not been tampered with (not even by a compromised or adversarial server)? Cryptographic schemes for verifiable computation address this problem by accompanying each result with a proof that can be used to check the correctness of the performed computation. Recent advances in the field have led to the first implementations of schemes that can verify arbitrary computations. However, in practice the overhead of these general-purpose constructions remains prohibitive for most applications, with proof computation times (at the server) in the order of minutes or even hours for real-world problem instances. A different approach for designing such schemes targets specific types of computation and builds custom-made protocols, sacrificing generality for efficiency. An important representative of this function-specific approach is an authenticated data structure (ADS), where a specialized protocol is designed that supports query types associated with a particular outsourced dataset. This thesis presents three novel ADS constructions for the important query types of set operations, multi-dimensional range search, and pattern matching, and proves their security under cryptographic assumptions over bilinear groups. The scheme for set operations can support nested queries (e.g., two unions followed by an intersection of the results), extending previous works that only accommodate a single operation. The range search ADS provides an exponential (in the number of attributes in the dataset) asymptotic improvement from previous schemes for storage and computation costs. Finally, the pattern matching ADS supports text pattern and XML path queries with minimal cost, e.g., the overhead at the server is less than 4% compared to simply computing the result, for all our tested settings. The experimental evaluation of all three constructions shows significant improvements in proof-computation time over general-purpose schemes

Boston University Institutional Repository (OpenBU)

Distributed memory compiler design for sparse problems

Author: Berryman Harry
Hiranandani Seema
Saltz Joel
Wu Janet
Publication venue
Publication date
Field of study

A compiler and runtime support mechanism is described and demonstrated. The methods presented are capable of solving a wide range of sparse and unstructured problems in scientific computing. The compiler takes as input a FORTRAN 77 program enhanced with specifications for distributing data, and the compiler outputs a message passing program that runs on a distributed memory computer. The runtime support for this compiler is a library of primitives designed to efficiently support irregular patterns of distributed array accesses and irregular distributed array partitions. A variety of Intel iPSC/860 performance results obtained through the use of this compiler are presented

NASA Technical Reports Server

Visual Analysis of High-Dimensional Point Clouds using Topological Abstraction

Author: Oesterling Patrick
Publication venue
Publication date: 14/04/2016
Field of study

This thesis is about visualizing a kind of data that is trivial to process by computers but difficult to imagine by humans because nature does not allow for intuition with this type of information: high-dimensional data. Such data often result from representing observations of objects under various aspects or with different properties. In many applications, a typical, laborious task is to find related objects or to group those that are similar to each other. One classic solution for this task is to imagine the data as vectors in a Euclidean space with object variables as dimensions. Utilizing Euclidean distance as a measure of similarity, objects with similar properties and values accumulate to groups, so-called clusters, that are exposed by cluster analysis on the high-dimensional point cloud. Because similar vectors can be thought of as objects that are alike in terms of their attributes, the point cloud\''s structure and individual cluster properties, like their size or compactness, summarize data categories and their relative importance. The contribution of this thesis is a novel analysis approach for visual exploration of high-dimensional point clouds without suffering from structural occlusion. The work is based on implementing two key concepts: The first idea is to discard those geometric properties that cannot be preserved and, thus, lead to the typical artifacts. Topological concepts are used instead to shift away the focus from a point-centered view on the data to a more structure-centered perspective. The advantage is that topology-driven clustering information can be extracted in the data\''s original domain and be preserved without loss in low dimensions. The second idea is to split the analysis into a topology-based global overview and a subsequent geometric local refinement. The occlusion-free overview enables the analyst to identify features and to link them to other visualizations that permit analysis of those properties not captured by the topological abstraction, e.g. cluster shape or value distributions in particular dimensions or subspaces. The advantage of separating structure from data point analysis is that restricting local analysis only to data subsets significantly reduces artifacts and the visual complexity of standard techniques. That is, the additional topological layer enables the analyst to identify structure that was hidden before and to focus on particular features by suppressing irrelevant points during local feature analysis. This thesis addresses the topology-based visual analysis of high-dimensional point clouds for both the time-invariant and the time-varying case. Time-invariant means that the points do not change in their number or positions. That is, the analyst explores the clustering of a fixed and constant set of points. The extension to the time-varying case implies the analysis of a varying clustering, where clusters appear as new, merge or split, or vanish. Especially for high-dimensional data, both tracking---which means to relate features over time---but also visualizing changing structure are difficult problems to solve

Qucosa - Publikationsserver der Universität Leipzig