38 research outputs found

    Brief Announcement: Optimal Bit-Reversal Using Vector Permutations

    No full text
    Accepted version

    Fast Tree Search for Enumeration of a Lattice Model of Protein Folding

    Using a fast tree-searching algorithm and a Pentium cluster, we enumerated all the sequences and compact conformations (structures) for a protein folding model on a cubic lattice of size $4\times3\times3$. We used two types of amino acids, hydrophobic (H) and polar (P), to make up the sequences, so there were $2^{36} \approx 6.87 \times 10^{10}$ different sequences. The total number of distinct structures was 84,731,192. We made use of a simple solvation model in which the energy of a sequence folded into a structure is minus the number of hydrophobic amino acids in the "core" of the structure. For every sequence, we found its ground state or ground states, i.e., the structure or structures for which its energy is lowest. About 0.3% of the sequences have a unique ground state. The number of structures that are unique ground states of at least one sequence is 2,662,050, about 3% of the total number of structures. However, these "designable" structures differ drastically in their designability, defined as the number of sequences whose unique ground state is that structure. To understand this variation in designability, we studied the distribution of structures in a high dimensional space in which each structure is represented by a string of 1's and 0's, denoting core and surface sites, respectively. Comment: 18 pages, 10 figures
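
The solvation model and the definition of designability in the abstract above can be sketched on a toy scale. This is only an illustration under stated assumptions: the three 4-site structures and the function names are hypothetical, and the brute-force loop stands in for the paper's tree search over the $4\times3\times3$ lattice, which is not reproduced here.

```python
from collections import Counter
from itertools import product


def energy(sequence, structure):
    """Energy = minus the number of hydrophobic (H) residues on core sites ('1')."""
    return -sum(1 for aa, site in zip(sequence, structure)
                if aa == "H" and site == "1")


def unique_ground_state(sequence, structures):
    """Return the single lowest-energy structure, or None if there is a tie."""
    energies = [energy(sequence, s) for s in structures]
    lowest = min(energies)
    winners = [s for s, e in zip(structures, energies) if e == lowest]
    return winners[0] if len(winners) == 1 else None


def designability(structures, length):
    """Count, per structure, the sequences that have it as unique ground state."""
    counts = Counter()
    for sequence in product("HP", repeat=length):
        ground = unique_ground_state(sequence, structures)
        if ground is not None:
            counts[ground] += 1
    return counts


# Hypothetical toy structures: '1' marks a core site, '0' a surface site.
TOY_STRUCTURES = ["1100", "0110", "0011"]
toy_counts = designability(TOY_STRUCTURES, 4)
```

Even in this toy setting the structures end up with different designability counts, which is the kind of variation the abstract sets out to explain.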

    Software and hardware methods for memory access latency reduction on ILP processors

    While microprocessors have doubled their speed every 18 months, performance improvement of memory systems has continued to lag behind. To address the speed gap between CPU and memory, a standard multi-level caching organization has been built for fast data accesses before the data have to be accessed in the DRAM core. The existence of these caches in a computer system, such as L1, L2, L3, and DRAM row buffers, does not mean that data locality will be automatically exploited. The effective use of the memory hierarchy mainly depends on how data are allocated and how memory accesses are scheduled. In this dissertation, we propose several novel software and hardware techniques to effectively exploit the data locality and to significantly reduce memory access latency. We first presented a case study at the application level that reconstructs memory-intensive programs by utilizing program-specific knowledge. The problem of bit-reversals, a set of data reordering operations extensively used in scientific computing programs such as FFT, and an application with a special data access pattern that can cause severe cache conflicts, is identified in this study. We have proposed several software methods, including padding and blocking, to restructure the program to reduce those conflicts. Our methods outperform existing ones on both uniprocessor and multiprocessor systems. The access latency to the DRAM core has become increasingly long relative to CPU speed, causing memory accesses to be an execution bottleneck. In order to reduce the frequency of DRAM core accesses and so shorten the overall memory access latency, we have conducted three studies at this level of the memory hierarchy. First, motivated by our evaluation of the DRAM row buffer's performance role and our findings on the causes of its access conflicts, we propose a simple and effective memory interleaving scheme to reduce or even eliminate row buffer conflicts. Second, we propose a fine-grain priority scheduling scheme to reorder the sequence of data accesses on multi-channel memory systems, effectively exploiting the available bus bandwidth and access concurrency. In the final part of the dissertation, we first evaluate the design of cached DRAM and its organization alternatives associated with ILP processors. We then propose a new memory hierarchy integration that uses cached DRAM to construct a very large off-chip cache. We show that this structure outperforms a standard memory system with an off-level L3 cache for memory-intensive applications. Memory access latency has become a major performance bottleneck for memory-intensive applications. As long as DRAM technology retains its position as the most cost-effective choice for main memory, the memory performance problem will continue to exist. The studies conducted in this dissertation attempt to address this important issue. Our proposed software and hardware schemes are effective and applicable, and can be directly used in real-world memory system designs and implementations. Our studies also provide guidance for application programmers to understand memory performance implications, and for system architects to optimize memory hierarchies.
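
For readers unfamiliar with the bit-reversal reordering mentioned in the abstract above, a minimal sketch follows. It is the plain, unoptimized permutation only; the padding and blocking methods proposed in the dissertation are not shown, and the function names are illustrative assumptions.

```python
def bit_reverse(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reversal_permute(data):
    """Return a copy of `data` with element i moved to position bit_reverse(i)."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1
    out = list(data)
    for i in range(n):
        j = bit_reverse(i, bits)
        if i < j:  # swap each pair once
            out[i], out[j] = out[j], out[i]
    return out
```

Each swap pairs index i with an index whose bits are mirrored, so consecutive reads are matched with writes separated by large power-of-two strides. Strides of that kind tend to map to the same cache sets, which is one plausible reading of the "severe cache conflicts" the abstract refers to.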

    Describing and verifying FFT circuits using SharpHDL

    Fourier transforms are critical in a variety of fields, but in the past they were rarely used in applications because of the large processing power required. However, Cooley and Tukey's development of the Fast Fourier Transform (FFT) vastly simplified this. A large number of FFT algorithms have been developed, amongst which are the radix-2 and the radix-$2^2$. These are the ones that have been mostly used for practical applications due to their simple structure with constant butterfly geometry. Most of the research to date on the implementation and benchmarking of FFT algorithms has been performed using general purpose processors, Digital Signal Processors (DSPs) and dedicated FFT processor ICs, but as FPGAs have developed they have become a viable solution for computing FFTs. In this paper, SharpHDL, an object oriented HDL, will be used to implement the two mentioned FFT algorithms and test their equivalence. Peer-reviewed

    Parallel fast fourier transform in SPMD style of cilk

    Copyright © 2019 Inderscience Enterprises Ltd. In this paper, we propose a parallel one-dimensional non-recursive fast Fourier transform (FFT) program based on the conventional Cooley-Tukey algorithm, written in C using Cilk in single program multiple data (SPMD) style. This highly compact code is compared with a highly tuned parallel recursive FFT using Cilk, which is included in the Cilk package, version 5.4.6. Both algorithms are executed on multicore servers, and experimental results show that the performance of the SPMD-style Cilk FFT code is highly competitive and promising.
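
To make "non-recursive Cooley-Tukey" concrete, here is a minimal sequential sketch of an iterative radix-2 FFT. It is not the paper's code: the original is written in C with Cilk in SPMD style, while this version is single-threaded Python and the function name is an assumption.

```python
import cmath


def fft_iterative(values):
    """Iterative radix-2 decimation-in-time FFT for power-of-two lengths."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    bits = n.bit_length() - 1

    # Step 1: place the inputs in bit-reversed order.
    a = [0j] * n
    for i, value in enumerate(values):
        j = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        a[j] = complex(value)

    # Step 2: combine butterflies of size 2, 4, 8, ... up to n.
    size = 2
    while size <= n:
        half = size // 2
        step = cmath.exp(-2j * cmath.pi / size)  # primitive root of unity
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                u = a[start + k]
                t = w * a[start + k + half]
                a[start + k] = u + t
                a[start + k + half] = u - t
                w *= step
        size *= 2
    return a
```

Within one stage, the iterations of the `start` loop touch disjoint slices of the array, so they could in principle be divided among workers. How the paper actually partitions the work is not stated in the abstract.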

    Multiprocessor Out-of-Core FFTs with Distributed Memory and Parallel Disks

    This paper extends an earlier out-of-core Fast Fourier Transform (FFT) method for a uniprocessor with the Parallel Disk Model (PDM) to use multiple processors. Four out-of-core multiprocessor methods are examined. Operationally, these methods differ in the size of the mini-butterfly computed in memory and in how the data are organized on the disks and in the distributed memory of the multiprocessor. The methods also perform differing amounts of I/O and communication. Two of them have the remarkable property that even though they are computing the FFT on a multiprocessor, all interprocessor communication occurs outside the mini-butterfly computations. Performance results on a small workstation cluster indicate that, except for unusual combinations of problem size and memory size, the methods that do not perform interprocessor communication during the mini-butterfly computations require approximately 86% of the time of those that do. Moreover, the faster methods are much easier to implement.

    On the design of a high-performance adaptive router for CC-NUMA multiprocessors

    Copyright © 2003 IEEE. This work presents the design and evaluation of an adaptive packet router aimed at supporting CC-NUMA traffic. We exploit a simple and efficient packet injection mechanism to avoid deadlock, which leads to fully adaptive routing by employing only three virtual channels. In addition, we selectively use output buffers for implementing the most utilized virtual paths in order to reduce head-of-line blocking. The careful implementation of these features has resulted in a good trade-off between network performance and hardware cost. The outcome of this research is a High-Performance Adaptive Router (HPAR), which adequately balances the needs of parallel applications: minimal network latency at low loads and high throughput at heavy loads. The paper includes an evaluation process in which HPAR is compared with other adaptive routers using FIFO input buffering, with or without additional virtual channels to reduce head-of-line blocking. This evaluation considers both the VLSI costs of each router and their performance under synthetic and real application workloads. To make the comparison fair, all the routers use the same efficient deadlock avoidance mechanism. In all the experiments, HPAR exhibited the best response among all the routers tested. The throughput gains ranged from 10 percent to 40 percent with respect to its most direct rival, which employs more hardware resources. Other results show that HPAR achieves up to 83 percent of its theoretical maximum throughput under random traffic and up to 70 percent when running real applications. Moreover, the observed packet latencies were comparable to those exhibited by simpler routers. Therefore, HPAR can be considered a suitable candidate to implement packet interchange in the next generations of CC-NUMA multiprocessors. Valentín Puente, José-Ángel Gregorio, Ramón Beivide, and Cruz Iz

    Exploiting cache locality at run-time

    With the increasing gap between the speeds of the processor and memory system, memory access has become a major performance bottleneck in modern computer systems. Recently, Symmetric Multi-Processor (SMP) systems have emerged as a major class of high-performance platforms. Improving the memory performance of parallel applications with dynamic memory-access patterns on SMP systems is a hard problem. The solution to this problem is critical to the successful use of SMP systems because dynamic memory-access patterns occur in many real-world applications. This dissertation is aimed at solving this problem. Based on a rigorous analysis of cache-locality optimization, we propose a memory-layout oriented run-time technique to exploit the cache locality of parallel loops. Our technique has been implemented in a run-time system. Using simulation and measurement, we have shown that our run-time approach can achieve performance comparable to compiler optimizations for regular applications, whose load balance and cache locality can be well optimized by tiling and other program transformations. However, our approach was shown to significantly improve the memory performance of applications with dynamic memory-access patterns. Such applications are usually hard to optimize with static compiler optimizations. Several contributions are made in this dissertation. We present models to characterize the complexity and present a solution framework for optimizing cache locality. We present an effective estimation technique for memory-access patterns to support efficient locality optimizations and information integration. We present a memory-layout oriented run-time technique for locality optimization. We present efficient scheduling algorithms to trade off locality and load imbalance. We provide a detailed performance evaluation of the run-time technique.

    No Time to Hash: On Super Efficient Entropy Accumulation

    Real-world random number generators (RNGs) cannot afford to use (slow) cryptographic hashing every time they refresh their state $R$ with a new entropic input $X$. Instead, they use "superefficient" simple entropy-accumulation procedures, such as $R \leftarrow \mathsf{rot}_{\alpha, n}(R) \oplus X$, where $\mathsf{rot}_{\alpha,n}$ rotates an $n$-bit state $R$ by some fixed number $\alpha$. For example, Microsoft's RNG uses $\alpha=5$ for $n=32$ and $\alpha=19$ for $n=64$. Where do these numbers come from? Are they good choices? Should rotation be replaced by a better permutation $\pi$ of the input bits? In this work we initiate a rigorous study of these pragmatic questions, by modeling the sequence of successive entropic inputs $X_1,X_2,\ldots$ as independent (but otherwise adversarial) samples from some natural distribution family ${\mathcal D}$. Our contribution is as follows.
* We define $2$-monotone distributions as a rich family ${\mathcal D}$ that includes relevant real-world distributions (Gaussian, exponential, etc.), but avoids trivial impossibility results.
* For any $\alpha$ with $\gcd(\alpha,n)=1$, we show that rotation accumulates $\Omega(n)$ bits of entropy from $n$ independent samples $X_1,\ldots,X_n$ from any (unknown) $2$-monotone distribution with entropy $k > 1$.
* However, we also show that some choices of $\alpha$ perform much better than others for a given $n$. E.g., we show $\alpha=19$ is one of the best choices for $n=64$; in contrast, $\alpha=5$ is good, but generally worse than $\alpha=7$, for $n=32$.
* More generally, given a permutation $\pi$ and $k\ge 1$, we define a simple parameter, the covering number $C_{\pi,k}$, and show that it characterizes the number of steps before the rule $(R_1,\ldots,R_n)\leftarrow (R_{\pi(1)},\ldots, R_{\pi(n)})\oplus X$ accumulates nearly $n$ bits of entropy from independent, $2$-monotone samples of min-entropy $k$ each.
* We build a simple permutation $\pi^*$, which achieves nearly optimal $C_{\pi^*,k}\approx n/k$ for all values of $k$ simultaneously, and experimentally validate that it compares favorably with all rotations $\mathsf{rot}_{\alpha,n}$.
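
The rotate-then-XOR update described in the abstract above can be sketched in a few lines. Two assumptions are made here: rotation is taken to be a left rotation (the abstract does not say which direction), and all function names are hypothetical. The `positions_reached` helper only illustrates why $\gcd(\alpha,n)=1$ matters for spreading a single input bit across the state; it is not the paper's entropy analysis.

```python
def rotate_left(state, alpha, n):
    """Rotate an n-bit state left by alpha positions (direction is an assumption)."""
    mask = (1 << n) - 1
    alpha %= n
    state &= mask
    return ((state << alpha) | (state >> (n - alpha))) & mask


def accumulate(state, x, alpha, n):
    """One refresh step: R <- rot_{alpha,n}(R) XOR X."""
    return rotate_left(state, alpha, n) ^ (x & ((1 << n) - 1))


def positions_reached(alpha, n):
    """Bit positions that input bit 0 occupies over n successive rotations."""
    return {(step * alpha) % n for step in range(n)}


# Fold a few hypothetical 32-bit inputs into the state with alpha = 5.
state = 0
for sample in (0x1, 0x80000000, 0xDEADBEEF):
    state = accumulate(state, sample, alpha=5, n=32)
```

With $\alpha=5$, $n=32$ or $\alpha=19$, $n=64$, `positions_reached` covers every bit position, whereas a rotation amount sharing a factor with $n$ (for example $\alpha=4$, $n=32$) cycles through only a fraction of them.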