1,437 research outputs found

    An exploration of CUDA and CBEA for a gravitational wave data-analysis application (Einstein@Home)

    Full text link
    We present a detailed approach for making use of two new computer hardware architectures -- CBEA and CUDA -- for accelerating a scientific data-analysis application (Einstein@Home). Our results suggest that both the architectures suit the application quite well and the achievable performance in the same software developmental time-frame, is nearly identical.Comment: Accepted for publication in International Conference on Parallel Processing and Applied Mathematics (PPAM 2009

    Mixing multi-core CPUs and GPUs for scientific simulation software

    Get PDF
    Recent technological and economic developments have led to widespread availability of multi-core CPUs and specialist accelerator processors such as graphical processing units (GPUs). The accelerated computational performance possible from these devices can be very high for some applications paradigms. Software languages and systems such as NVIDIA's CUDA and Khronos consortium's open compute language (OpenCL) support a number of individual parallel application programming paradigms. To scale up the performance of some complex systems simulations, a hybrid of multi-core CPUs for coarse-grained parallelism and very many core GPUs for data parallelism is necessary. We describe our use of hybrid applica- tions using threading approaches and multi-core CPUs to control independent GPU devices. We present speed-up data and discuss multi-threading software issues for the applications level programmer and o er some suggested areas for language development and integration between coarse-grained and ne-grained multi-thread systems. We discuss results from three common simulation algorithmic areas including: partial di erential equations; graph cluster metric calculations and random number generation. We report on programming experiences and selected performance for these algorithms on: single and multiple GPUs; multi-core CPUs; a CellBE; and using OpenCL. We discuss programmer usability issues and the outlook and trends in multi-core programming for scienti c applications developers

    Design and implementation of interface units for high speed fiber optics local area networks and broadband integrated services digital networks

    Get PDF
    The design and implementation of interface units for high speed Fiber Optic Local Area Networks and Broadband Integrated Services Digital Networks are discussed. During the last years, a number of network adapters that are designed to support high speed communications have emerged. This approach to the design of a high speed network interface unit was to implement package processing functions in hardware, using VLSI technology. The VLSI hardware implementation of a buffer management unit, which is required in such architectures, is described

    On the design of architecture-aware algorithms for emerging applications

    Get PDF
    This dissertation maps various kernels and applications to a spectrum of programming models and architectures and also presents architecture-aware algorithms for different systems. The kernels and applications discussed in this dissertation have widely varying computational characteristics. For example, we consider both dense numerical computations and sparse graph algorithms. This dissertation also covers emerging applications from image processing, complex network analysis, and computational biology. We map these problems to diverse multicore processors and manycore accelerators. We also use new programming models (such as Transactional Memory, MapReduce, and Intel TBB) to address the performance and productivity challenges in the problems. Our experiences highlight the importance of mapping applications to appropriate programming models and architectures. We also find several limitations of current system software and architectures and directions to improve those. The discussion focuses on system software and architectural support for nested irregular parallelism, Transactional Memory, and hybrid data transfer mechanisms. We believe that the complexity of parallel programming can be significantly reduced via collaborative efforts among researchers and practitioners from different domains. This dissertation participates in the efforts by providing benchmarks and suggestions to improve system software and architectures.Ph.D.Committee Chair: Bader, David; Committee Member: Hong, Bo; Committee Member: Riley, George; Committee Member: Vuduc, Richard; Committee Member: Wills, Scot

    On-Chip memories, the OS perspective

    Get PDF
    This paper is a work in progress study of the operating system services required to manage on-chip memories. We are evaluating different CMP on-chip memories configurations. Chip-MultiProcessors (CMP) architectures integrating multiple computing and memory elements presents different problems (coherency, latency, ...) that must be solved. On-chip local memories are directly addressable and their latency is much shorter than off-chip main memories. Since memory latency is a key factor for application performance, we study how the OS can help.Postprint (author’s final draft

    Exploring New Search Algorithms and Hardware for Phylogenetics: RAxML Meets the IBM Cell

    Get PDF
    Phylogenetic inference is considered to be one of the grand challenges in Bioinformatics due to the immense computational requirements. RAxML is currently among the fastest and most accurate programs for phylogenetic tree inference under the Maximum Likelihood (ML) criterion. First, we introduce new tree search heuristics that accelerate RAxML by a factor of 2.43 while returning equally good trees. The performance of the new search algorithm has been assessed on 18 real-world datasets comprising 148 up to 4,843 DNA sequences. We then present the implementation, optimization, and evaluation of RAxML on the IBM Cell Broadband Engine. We address the problems and provide solutions pertaining to the optimization of floating point code, control flow, communication, and scheduling of multi-level parallelism on the Cel

    Acceleration of Spiking Neural Networks on Multicore Architectures

    Get PDF
    The human cortex is the seat of learning and cognition. Biological scale implementations of cortical models have the potential to provide significantly more power problem solving capabilities than traditional computing algorithms. The large scale implementation and design of these models has attracted significant attention recently. High performance implementations of the models are needed to enable such large scale designs. This thesis examines the acceleration of the spiking neural network class of cortical models on several modern multicore processors. These include the Izhikevich, Wilson, Morris-Lecar, and Hodgkin-Huxley models. The architectures examined are the STI Cell, Sun UltraSPARC T2+, and Intel Xeon E5345. Results indicate that these modern multicore processors can provide significant speed-ups and thus are useful in developing large scale cortical models. The models are then implemented on a 50 TeraFLOPS 336 node PlayStation 3 cluster. Results indicate that the models scale well on this cluster and can emulate 108 neurons and 1010 synapses. These numbers are comparable to the large scale cortical model implementation studies performed by IBM using the Blue Gene/L supercomputer. This study indicates that a cluster of PlayStation 3s can provide an economical, yet powerful, platform for simulating large scale biological models

    ACOTES project: Advanced compiler technologies for embedded streaming

    Get PDF
    Streaming applications are built of data-driven, computational components, consuming and producing unbounded data streams. Streaming oriented systems have become dominant in a wide range of domains, including embedded applications and DSPs. However, programming efficiently for streaming architectures is a challenging task, having to carefully partition the computation and map it to processes in a way that best matches the underlying streaming architecture, taking into account the distributed resources (memory, processing, real-time requirements) and communication overheads (processing and delay). These challenges have led to a number of suggested solutions, whose goal is to improve the programmer’s productivity in developing applications that process massive streams of data on programmable, parallel embedded architectures. StreamIt is one such example. Another more recent approach is that developed by the ACOTES project (Advanced Compiler Technologies for Embedded Streaming). The ACOTES approach for streaming applications consists of compiler-assisted mapping of streaming tasks to highly parallel systems in order to maximize cost-effectiveness, both in terms of energy and in terms of design effort. The analysis and transformation techniques automate large parts of the partitioning and mapping process, based on the properties of the application domain, on the quantitative information about the target systems, and on programmer directives. This paper presents the outcomes of the ACOTES project, a 3-year collaborative work of industrial (NXP, ST, IBM, Silicon Hive, NOKIA) and academic (UPC, INRIA, MINES ParisTech) partners, and advocates the use of Advanced Compiler Technologies that we developed to support Embedded Streaming.Peer ReviewedPostprint (published version
    • …
    corecore