Dynamic MDS Matrices for Substantial Cryptographic Strength
Ciphers get their strength from the mathematical functions of confusion and
diffusion, also known as substitution and permutation. These were the basics of
classical cryptography and they are still the basic part of modern ciphers. In
block ciphers diffusion is achieved by the use of Maximum Distance Separable
(MDS) matrices. In this paper we present some methods for constructing dynamic
(and random) MDS matrices. Comment: Short paper at WISA'10, 201
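As an illustration of what "random MDS matrix" can mean in practice, the sketch below builds a Cauchy matrix from randomly chosen field elements and checks the MDS property directly (every square submatrix is nonsingular). This is a minimal sketch under stated assumptions, not the paper's construction: the Cauchy construction is a standard textbook one swapped in here, the prime field GF(257) stands in for the binary fields used in real ciphers, and the names `cauchy_matrix`, `det_mod`, and `is_mds` are hypothetical.

```python
import random
from itertools import combinations

P = 257  # assumption: a small prime field, chosen only to keep the arithmetic simple


def cauchy_matrix(n, p=P, rng=random):
    """Random n x n Cauchy matrix over GF(p): entry (i, j) is 1 / (x_i - y_j)."""
    vals = rng.sample(range(p), 2 * n)  # 2n distinct elements, so x_i != y_j
    xs, ys = vals[:n], vals[n:]
    return [[pow((x - y) % p, -1, p) for y in ys] for x in xs]


def det_mod(matrix, p=P):
    """Determinant modulo p via Gaussian elimination."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1
    for c in range(n):
        pivot = next((r for r in range(c, n) if m[r][c] % p), None)
        if pivot is None:
            return 0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det  # a row swap flips the sign
        det = det * m[c][c] % p
        inv = pow(m[c][c], -1, p)
        for r in range(c + 1, n):
            f = m[r][c] * inv % p
            for k in range(c, n):
                m[r][k] = (m[r][k] - f * m[c][k]) % p
    return det % p


def is_mds(matrix, p=P):
    """True if every square submatrix is nonsingular over GF(p)."""
    n = len(matrix)
    for k in range(1, n + 1):
        for rows in combinations(range(n), k):
            for cols in combinations(range(n), k):
                sub = [[matrix[r][c] for c in cols] for r in rows]
                if det_mod(sub, p) == 0:
                    return False
    return True


m = cauchy_matrix(4, rng=random.Random(0))
mds_ok = is_mds(m)
```

The exhaustive submatrix check is exponential in the matrix size, so it is only practical for the small dimensions (such as 4 x 4 or 8 x 8) typical of block-cipher diffusion layers.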
AlSub: Fully Parallel and Modular Subdivision
In recent years, mesh subdivision---the process of forging smooth free-form
surfaces from coarse polygonal meshes---has become an indispensable production
instrument. Although subdivision performance is crucial during simulation,
animation and rendering, state-of-the-art approaches still rely on serial
implementations for complex parts of the subdivision process. Therefore, they
often fail to harness the power of modern parallel devices, like the graphics
processing unit (GPU), for large parts of the algorithm and must resort to
time-consuming serial preprocessing. In this paper, we show that a complete
parallelization of the subdivision process for modern architectures is
possible. Building on sparse matrix linear algebra, we show how to structure
the complete subdivision process into a sequence of algebra operations. By
restructuring and grouping these operations, we adapt the process for different
use cases, such as regular subdivision of dynamic meshes, uniform subdivision
for immutable topology, and feature-adaptive subdivision for efficient
rendering of animated models. As the same machinery is used for all use cases,
identical subdivision results are achieved in all parts of the production
pipeline. As a second contribution, we show how these linear algebra
formulations can effectively be translated into efficient GPU kernels. Applying
our strategies to √3, Loop and Catmull-Clark subdivision shows
significant speedups of our approach compared to state-of-the-art solutions,
while we completely avoid serial preprocessing. Comment: Changed structure. Added content. Improved description.
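To make the "subdivision as linear algebra" idea concrete, the sketch below expresses a single step of Catmull-Clark refinement, the computation of face points as face centroids, as a sparse matrix times the vertex-position matrix. It is only a toy illustration under stated assumptions: the dict-per-row sparse format and the names `face_point_matrix` and `spmm` are hypothetical, the code is plain serial Python rather than a GPU kernel, and it does not reproduce the paper's actual formulation.

```python
def face_point_matrix(faces):
    """Sparse matrix as a list of rows {vertex_index: weight}.

    Row f averages the vertices of face f, so (matrix @ vertices)[f]
    is the centroid of face f.
    """
    return [{v: 1.0 / len(face) for v in face} for face in faces]


def spmm(sparse_rows, dense):
    """Multiply a sparse row-dict matrix by a dense list of coordinate tuples."""
    dim = len(dense[0])
    out = []
    for row in sparse_rows:
        acc = [0.0] * dim
        for col, weight in row.items():
            for d in range(dim):
                acc[d] += weight * dense[col][d]
        out.append(tuple(acc))
    return out


# Two quads sharing the edge (1, 2); made-up example geometry.
vertices = [
    (0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 2.0, 0.0),
    (0.0, 2.0, 0.0), (4.0, 0.0, 0.0), (4.0, 2.0, 0.0),
]
faces = [(0, 1, 2, 3), (1, 4, 5, 2)]

face_points = spmm(face_point_matrix(faces), vertices)
```

Because each output row depends only on its own sparse row, the rows can be computed independently, which is the property that makes this style of formulation amenable to parallel execution.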
Two-dimensional DCT/IDCT architecture
A fully parallel architecture for the computation of a two-dimensional (2-D) discrete cosine transform (DCT), based on row-column decomposition, is presented. It uses the same one-dimensional (1-D) DCT unit for the row and column computations and (N² + N) registers to perform the transposition. It possesses features of regularity and modularity, and is thus well suited for VLSI implementation. It can be used for the computation of either the forward or the inverse 2-D DCT. Each 1-D DCT unit uses N fully parallel vector inner product (VIP) units. The design of the VIP units is based on a systematic design methodology using radix-2^n arithmetic, which allows partitioning of the elements of each vector into small groups. Array multipliers without the final adder are used to produce the different partial product terms. This allows a more efficient use of 4:2 compressors for the accumulation of the products in the intermediate stages and reduces the number of accumulators from N to one. Using this procedure, the 2-D DCT architecture requires fewer than N² multipliers (in terms of area occupied) and only 2N adders. It can compute an N × N-point DCT at a rate of one complete transform per N cycles after an appropriate initial delay.
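The row-column decomposition mentioned above can be sketched in software: apply the same 1-D DCT routine to every row, transpose, apply it again, and transpose back. This is a minimal sketch assuming the orthonormal DCT-II; it says nothing about the VIP units, compressors, or register layout of the hardware design, and the function names `dct_1d`, `transpose`, and `dct_2d` are hypothetical.

```python
import math


def dct_1d(x):
    """Orthonormal 1-D DCT-II of a sequence (direct O(N^2) evaluation)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(
            x[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i in range(n)
        )
        out.append(scale * total)
    return out


def transpose(m):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]


def dct_2d(block):
    """2-D DCT by row-column decomposition, reusing the same 1-D routine."""
    rows_done = [dct_1d(row) for row in block]                    # row pass
    cols_done = [dct_1d(col) for col in transpose(rows_done)]     # column pass
    return transpose(cols_done)                                   # restore layout


ones = [[1.0] * 4 for _ in range(4)]
coeffs = dct_2d(ones)  # a constant block has only a DC coefficient
```

The transposition between the two passes plays the role that the transposition registers play in the architecture described in the abstract.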
Format Abstraction for Sparse Tensor Algebra Compilers
This paper shows how to build a sparse tensor algebra compiler that is
agnostic to tensor formats (data layouts). We develop an interface that
describes formats in terms of their capabilities and properties, and show how
to build a modular code generator where new formats can be added as plugins. We
then describe six implementations of the interface that compose to form the
dense, CSR/CSF, COO, DIA, ELL, and HASH tensor formats and countless variants
thereof. With these implementations at hand, our code generator can generate
code to compute any tensor algebra expression on any combination of the
aforementioned formats.
To demonstrate our technique, we have implemented it in the taco tensor
algebra compiler. Our modular code generator design makes it simple to add
support for new tensor formats, and the performance of the generated code is
competitive with hand-optimized implementations. Furthermore, by extending taco
to support a wider range of formats specialized for different application and
data characteristics, we can improve end-user application performance. For
example, if input data is provided in the COO format, our technique allows
computing a single matrix-vector multiplication directly with the data in COO,
which is up to 3.6× faster than by first converting the data to CSR. Comment: Presented at OOPSLA 201
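The COO-versus-CSR trade-off can be illustrated with a hand-written sketch: a matrix-vector product computed directly on COO triples, next to the convert-then-multiply route through CSR. This is not code generated by taco and makes no performance claim; the function names `spmv_coo`, `coo_to_csr`, and `spmv_csr` and the sample matrix are hypothetical.

```python
def spmv_coo(n_rows, rows, cols, vals, x):
    """y = A @ x computed directly from COO triples (any entry order)."""
    y = [0.0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        y[r] += v * x[c]
    return y


def coo_to_csr(n_rows, rows, cols, vals):
    """Convert COO triples to CSR arrays (row_ptr, col_idx, vals)."""
    row_ptr = [0] * (n_rows + 1)
    for r in rows:                      # count entries per row
        row_ptr[r + 1] += 1
    for i in range(n_rows):             # prefix sum gives row start offsets
        row_ptr[i + 1] += row_ptr[i]
    next_slot = row_ptr[:-1]            # copy: next free position in each row
    col_idx = [0] * len(vals)
    csr_vals = [0.0] * len(vals)
    for r, c, v in zip(rows, cols, vals):
        slot = next_slot[r]
        col_idx[slot] = c
        csr_vals[slot] = v
        next_slot[r] += 1
    return row_ptr, col_idx, csr_vals


def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A @ x computed from CSR arrays."""
    return [
        sum(vals[k] * x[col_idx[k]] for k in range(row_ptr[i], row_ptr[i + 1]))
        for i in range(len(row_ptr) - 1)
    ]


# 3 x 3 example with unsorted COO entries.
rows, cols, vals = [0, 0, 2, 1], [0, 2, 1, 1], [1.0, 2.0, 3.0, 4.0]
x = [1.0, 2.0, 3.0]

y_direct = spmv_coo(3, rows, cols, vals, x)
y_via_csr = spmv_csr(*coo_to_csr(3, rows, cols, vals), x)
```

For a single multiplication, the conversion itself already touches every nonzero more than once, which is consistent with the abstract's point that working directly in the input format can pay off.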
Development of symbolic algorithms for certain algebraic processes
This study investigates the problem of computing the exact greatest common divisor of two polynomials relative to an orthogonal basis, defined over the rational number field. The main objective of the study is to design and implement an effective and efficient symbolic algorithm for the general class of dense polynomials, given the rational number defining terms of their basis. From a general algorithm using the comrade matrix approach, the nonmodular and modular techniques are prescribed. If the coefficients of the generalized polynomials are multiprecision integers, multiprecision arithmetic will be required in the construction of the comrade matrix and the corresponding systems coefficient matrix. In addition, the application of the nonmodular elimination technique on this coefficient matrix extensively applies multiprecision rational number operations. The modular technique is employed to minimize the complexity involved in such computations. A divisor test algorithm that enables the detection of an unlucky reduction is a crucial device for an effective implementation of the modular technique. With the bound of the true solution not known a priori, the test is devised and carefully incorporated into the modular algorithm. The results show that the modular algorithm performs best for the class of relatively prime polynomials. The empirical computing time results show that the modular algorithm is markedly superior to the nonmodular algorithms in the case of sufficiently dense Legendre basis polynomials with a small GCD solution. In the case of dense Legendre basis polynomials with a large GCD solution, the modular algorithm is significantly superior to the nonmodular algorithms for higher-degree polynomials. For more definitive conclusions, the computing time functions of the algorithms that are presented in this report have been worked out. Further investigations have also been suggested.
Implementing Shor's algorithm on Josephson Charge Qubits
We investigate the physical implementation of Shor's factorization algorithm
on a Josephson charge qubit register. While we pursue a universal method to
factor a composite integer of any size, the scheme is demonstrated for the
number 21. We consider both the physical and algorithmic requirements for an
optimal implementation when only a small number of qubits is available. These
aspects of quantum computation are usually the topics of separate research
communities; we present a unifying discussion of both of these fundamental
features bridging Shor's algorithm to its physical realization using Josephson
junction qubits. In order to meet the stringent requirements set by a short
decoherence time, we accelerate the algorithm by decomposing the quantum
circuit into tailored two- and three-qubit gates and we find their physical
realizations through numerical optimization. Comment: 12 pages, submitted to Phys. Rev.
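For orientation, the classical skeleton of Shor's algorithm for the number 21 can be sketched as follows. The order-finding step, which the quantum register is meant to perform, is replaced here by brute force, so this is a minimal sketch of the number-theoretic post-processing only and says nothing about the gate decomposition or the Josephson qubit hardware. The function names `order` and `factors_from_order` are hypothetical.

```python
from math import gcd


def order(a, n):
    """Smallest r > 0 with a**r % n == 1 (brute force; requires gcd(a, n) == 1).

    Assumption: this stands in for the quantum period-finding subroutine.
    """
    r, value = 1, a % n
    while value != 1:
        value = value * a % n
        r += 1
    return r


def factors_from_order(a, n):
    """Try to split n using the order of a; return None if this a fails."""
    common = gcd(a, n)
    if common != 1:
        return common, n // common      # lucky guess: a shares a factor with n
    r = order(a, n)
    if r % 2:
        return None                     # odd order: choose another a
    half = pow(a, r // 2, n)
    if half == n - 1:
        return None                     # a**(r/2) == -1 (mod n): choose another a
    p = gcd(half - 1, n)
    return p, n // p


r_2 = order(2, 21)                      # 2 has order 6 modulo 21
split_2 = factors_from_order(2, 21)     # gcd(2**3 - 1, 21) = 7, so (7, 3)
split_5 = factors_from_order(5, 21)     # 5**3 % 21 == 20 == -1, so this base fails
```

The base 5 shows why the algorithm is probabilistic: some choices of `a` yield an order that does not reveal a factor, and another base has to be tried.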
Contract-Based General-Purpose GPU Programming
Using GPUs as general-purpose processors has revolutionized parallel
computing by offering, for a large and growing set of algorithms, massive
data-parallelization on desktop machines. An obstacle to widespread adoption,
however, is the difficulty of programming them and the low-level control of the
hardware required to achieve good performance. This paper suggests a
programming library, SafeGPU, that aims at striking a balance between
programmer productivity and performance, by making GPU data-parallel operations
accessible from within a classical object-oriented programming language. The
solution is integrated with the design-by-contract approach, which increases
confidence in functional program correctness by embedding executable program
specifications into the program text. We show that our library leads to modular
and maintainable code that is accessible to GPGPU non-experts, while providing
performance that is comparable with hand-written CUDA code. Furthermore,
runtime contract checking turns out to be feasible, as the contracts can be
executed on the GPU.