150 research outputs found

    A Theoretical Approach Involving Recurrence Resolution, Dependence Cycle Statement Ordering and Subroutine Transformation for the Exploitation of Parallelism in Sequential Code.

    Get PDF
    To exploit parallelism in Fortran code, this dissertation consists of a study of the following three issues: (1) recurrence resolution in Do-loops for vector processing, (2) dependence cycle statement ordering in Do-loops for parallel processing, and (3) sub-routine parallelization. For recurrence resolution, the major findings include: (1) the node splitting algorithm cannot be used directly to break an essential antidependence link, of which the source variable that results in antidependence is itself the sink variable of another true dependence so a correction method is proposed, (2) a sink variable renaming technique is capable of breaking an antidependence and/or output-dependence link, (3) for recurrences formed by only true dependences, a dynamic dependence concept and the derived technique are powerful, and (4) by integrating related techniques, an algorithm for resolving a general multistatement recurrence is developed. The performance of a parallel loop is determined by the level of parallelism and the time delay due to interprocessor communication and synchronization. For a dependence cycle of a single parallel loop executed in a general synchronization mode, the parallelism exposed varies with the alignment of statements. Statements are reordered on the basis of execution-time of the loop as estimated at compile-time. An improved timing formula and a derived statement ordering algorithm are proposed. Further extension of this algorithm to multiple perfectly nested Do-loops with simple global dependence cycle is also presented. The subroutine is a potential source for parallel processing. Several problems must be solved for subroutine parallelization: (1) the precedence of parallel executions of subroutines, (2) identification of the optimum execution mode for each subroutine and (3) the restructuring of a serial program. A five-step approach to parallelize called subroutines for a calling subroutine is proposed: (1) computation of control dependence, (2) approximation of the global effects of subroutines, (3) analysis of data dependence, (4) identification of execution mode, and (5) restructuring of calling and called subroutines. Application of these five steps in a recursive manner to different levels of calling subroutines in a program addresses the parallelization of subroutines

    Phylogeny-Aware Placement and Alignment Methods for Short Reads

    Get PDF
    In recent years bioinformatics has entered a new phase: New sequencing methods, generally referred to as Next Generation Sequencing (NGS) have become widely available. This thesis introduces algorithms for phylogeny aware analysis of short sequence reads, as generated by NGS methods in the context of metagenomic studies. A considerable part of this work focuses on the technical (w.r.t. performance) challenges of these new algorithms, which have been developed specifically to exploit parallelism

    Datacenter Design for Future Cloud Radio Access Network.

    Full text link
    Cloud radio access network (C-RAN), an emerging cloud service that combines the traditional radio access network (RAN) with cloud computing technology, has been proposed as a solution to handle the growing energy consumption and cost of the traditional RAN. Through aggregating baseband units (BBUs) in a centralized cloud datacenter, C-RAN reduces energy and cost, and improves wireless throughput and quality of service. However, designing a datacenter for C-RAN has not yet been studied. In this dissertation, I investigate how a datacenter for C-RAN BBUs should be built on commodity servers. I first design WiBench, an open-source benchmark suite containing the key signal processing kernels of many mainstream wireless protocols, and study its characteristics. The characterization study shows that there is abundant data level parallelism (DLP) and thread level parallelism (TLP). Based on this result, I then develop high performance software implementations of C-RAN BBU kernels in C++ and CUDA for both CPUs and GPUs. In addition, I generalize the GPU parallelization techniques of the Turbo decoder to the trellis algorithms, an important family of algorithms that are widely used in data compression and channel coding. Then I evaluate the performance of commodity CPU servers and GPU servers. The study shows that the datacenter with GPU servers can meet the LTE standard throughput with 4× to 16× fewer machines than with CPU servers. A further energy and cost analysis show that GPU servers can save on average 13× more energy and 6× more cost. Thus, I propose the C-RAN datacenter be built using GPUs as a server platform. Next I study resource management techniques to handle the temporal and spatial traffic imbalance in a C-RAN datacenter. I propose a “hill-climbing” power management that combines powering-off GPUs and DVFS to match the temporal C-RAN traffic pattern. Under a practical traffic model, this technique saves 40% of the BBU energy in a GPU-based C-RAN datacenter. For spatial traffic imbalance, I propose three workload distribution techniques to improve load balance and throughput. Among all three techniques, pipelining packets has the most throughput improvement at 10% and 16% for balanced and unbalanced loads, respectively.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120825/1/qizheng_1.pd

    TraTSA: A Transprecision Framework for Efficient Time Series Analysis

    Get PDF
    Time series analysis (TSA) comprises methods for extracting information in domains as diverse as medicine, seismology, speech recognition and economics. Matrix Profile (MP) is the state-of-the-art TSA technique, which provides the most similar neighbor to each subsequence of the time series. However, this computation requires a huge amount of floating-point (FP) operations, which are a major contributor ( 50%) to the energy consumption in modern computing platforms. In this sense, Transprecision Computing has recently emerged as a promising approach to improve energy efficiency and performance by using fewer bits in FP operations while providing accurate results. In this work, we present TraTSA, the first transprecision framework for efficient time series analysis based on MP. TraTSA allows the user to deploy a high-performance and energy-efficient computing solution with the exact precision required by the TSA application. To this end, we first propose implementations of TraTSA for both commodity CPU and FPGA platforms. Second, we propose an accuracy metric to compare the results with the double-precision MP. Third, we study MP’s accuracy when using a transprecision approach. Finally, our evaluation shows that, while obtaining results accurate enough, the FPGA transprecision MP (i) is 22.75 faster than a 72-core server, and (ii) the energy consumption is up to 3.3 lower than the double-precision executions.This work has been supported by the Government of Spain under project PID2019-105396RB-I00, and Junta de Andalucia under projects P18-FR-3433 and UMA18-FEDERJA-197. Funding for open access charge: Universidad de Málaga / CBUA

    The last mile: High-Assurance and High-Speed cryptographic implementations

    Get PDF
    We develop a new approach for building cryptographic implementations. Our approach goes the last mile and delivers assembly code that is provably functionally correct, protected against side-channels, and as efficient as handwritten assembly. We illustrate our approach using ChaCha20Poly1305, one of the two ciphersuites recommended in TLS 1.3, and deliver formally verified vectorized implementations which outperform the fastest non-verified code.We realize our approach by combining the Jasmin framework, which offers in a single language features of high-level and low-level programming, and the EasyCrypt proof assistant, which offers a versatile verification infrastructure that supports proofs of functional correctness and equivalence checking. Neither of these tools had been used for functional correctness before. Taken together, these infrastructures empower programmers to develop efficient and verified implementations by "game hopping", starting from reference implementations that are proved functionally correct against a specification, and gradually introducing program optimizations that are proved correct by equivalence checking.We also make several contributions of independent interest, including a new and extensible verified compiler for Jasmin, with a richer memory model and support for vectorized instructions, and a new embedding of Jasmin in EasyCrypt.This work is partially supported by project ONR N00014-19-1-2292. Manuel Barbosa was supported by grant SFRH/BSAB/143018/2018 awarded by FCT. This work was partially funded by national funds via FCT in the context of project PTDC/CCI-INF/31698/2017

    Deriving divide-and-conquer dynamic programming algorithms using solver-aided transformations

    Get PDF
    We introduce a framework allowing domain experts to manipulate computational terms in the interest of deriving better, more efficient implementations.It employs deductive reasoning to generate provably correct efficient implementations from a very high-level specification of an algorithm, and inductive constraint-based synthesis to improve automation. Semantic information is encoded into program terms through the use of refinement types. In this paper, we develop the technique in the context of a system called Bellmania that uses solver-aided tactics to derive parallel divide-and-conquer implementations of dynamic programming algorithms that have better locality and are significantly more efficient than traditional loop-based implementations. Bellmania includes a high-level language for specifying dynamic programming algorithms and a calculus that facilitates gradual transformation of these specifications into efficient implementations. These transformations formalize the divide-and conquer technique; a visualization interface helps users to interactively guide the process, while an SMT-based back-end verifies each step and takes care of low-level reasoning required for parallelism. We have used the system to generate provably correct implementations of several algorithms, including some important algorithms from computational biology, and show that the performance is comparable to that of the best manually optimized code.National Science Foundation (U.S.) (CCF-1139056)National Science Foundation (U.S.) (CCF- 1439084)National Science Foundation (U.S.) (CNS-1553510

    The PARSE Programming Paradigm. Part I: Software Development Methodology. Part II: Software Development Support Tools

    Get PDF
    The programming methodology of PARSE (parallel software environment), a software environment being developed for reconfigurable non-shared memory parallel computers, is described. This environment will consist of an integrated collection of language interfaces, automatic and semi-automatic debugging and analysis tools, and operating system —all of which are made more flexible by the use of a knowledge-based implementation for the tools that make up PARSE. The programming paradigm supports the user freely choosing among three basic approaches /abstractions for programming a parallel machine: logic-based descriptive, sequential-control procedural, and parallel-control procedural programming. All of these result in efficient parallel execution. The current work discusses the methodology underlying PARSE, whereas the companion paper, “The PARSE Programming Paradigm — II: Software Development Support Tools,” details each of the component tools

    Proceedings of the Fifth NASA/NSF/DOD Workshop on Aerospace Computational Control

    Get PDF
    The Fifth Annual Workshop on Aerospace Computational Control was one in a series of workshops sponsored by NASA, NSF, and the DOD. The purpose of these workshops is to address computational issues in the analysis, design, and testing of flexible multibody control systems for aerospace applications. The intention in holding these workshops is to bring together users, researchers, and developers of computational tools in aerospace systems (spacecraft, space robotics, aerospace transportation vehicles, etc.) for the purpose of exchanging ideas on the state of the art in computational tools and techniques
    • …
    corecore