1,019 research outputs found

    On the acceleration of wavefront applications using distributed many-core architectures

    In this paper we investigate the use of distributed graphics processing unit (GPU)-based architectures to accelerate pipelined wavefront applications, a ubiquitous class of parallel algorithms used for the solution of a number of scientific and engineering applications. Specifically, we employ a recently developed port of the LU solver (from the NAS Parallel Benchmark suite) to investigate the performance of these algorithms on high-performance computing solutions from NVIDIA (Tesla C1060 and C2050) as well as on traditional clusters (AMD/InfiniBand and IBM BlueGene/P). Benchmark results are presented for problem classes A to C and a recently developed performance model is used to provide projections for problem classes D and E, the latter of which represents a billion-cell problem. Our results demonstrate that while the theoretical performance of GPU solutions will far exceed that of many traditional technologies, the sustained application performance is currently comparable for scientific wavefront applications. Finally, a breakdown of the GPU solution is conducted, exposing PCIe overheads and decomposition constraints. A new k-blocking strategy is proposed to improve the future performance of this class of algorithm on GPU-based architectures.

    Measuring NUMA effects with the STREAM benchmark

    Modern high-end machines feature multiple processor packages, each of which contains multiple independent cores and integrated memory controllers connected directly to dedicated physical RAM. These packages are connected via a shared bus, creating a system with a heterogeneous memory hierarchy. Since this shared bus has less bandwidth than the sum of the links to memory, aggregate memory bandwidth is higher when parallel threads all access memory local to their processor package than when they access memory attached to a remote package. However, the impact of this heterogeneous memory architecture is not easily understood from vendor benchmarks. Even where these measurements are available, they provide only best-case memory throughput. This work presents a series of modifications to the well-known STREAM benchmark to measure the effects of NUMA on both a 48-core AMD Opteron machine and a 32-core Intel Xeon machine.

    The Glasgow Parallel Reduction Machine: Programming Shared-memory Many-core Systems using Parallel Task Composition

    We present the Glasgow Parallel Reduction Machine (GPRM), a novel, flexible framework for parallel task-composition-based many-core programming. We allow the programmer to structure programs into task code, written as C++ classes, and communication code, written in a restricted subset of C++ with functional semantics and parallel evaluation. In this paper we discuss the GPRM, the virtual machine framework that enables the parallel task composition approach. We focus the discussion on GPIR, the functional language used as the intermediate representation of the bytecode running on the GPRM. Using examples in this language we show the flexibility and power of our task composition framework. We demonstrate the potential using an implementation of a merge sort algorithm on a 64-core Tilera processor, as well as on a conventional Intel quad-core processor and an AMD 48-core processor system. We also compare our framework with OpenMP tasks in a parallel pointer chasing algorithm running on the Tilera processor. Our results show that the GPRM programs outperform the corresponding OpenMP codes on all test platforms, and can greatly facilitate the writing of parallel programs, in particular non-data-parallel algorithms such as reductions. Comment: In Proceedings PLACES 2013, arXiv:1312.221

    Proceedings of the Second International Workshop on HyperTransport Research and Applications (WHTRA2011)

    Proceedings of the Second International Workshop on HyperTransport Research and Applications (WHTRA2011), which was held on Feb. 9th, 2011 in Mannheim, Germany. The Second International Workshop for Research on HyperTransport is an international high-quality forum for scientists, researchers and developers working in the area of HyperTransport. This includes not only developments and research in HyperTransport itself, but also work which is based on or enabled by HyperTransport. HyperTransport (HT) is an interconnection technology which is typically used as the system interconnect in modern computer systems, connecting the CPUs with each other and with the I/O bridges. Primarily designed as an interconnect between high-performance CPUs, it provides extremely low latency, high bandwidth and excellent scalability. The definition of the HTX connector allows the use of HT even for add-in cards. In contrast to other peripheral interconnect technologies like PCI-Express, no protocol conversion or intermediate bridging is necessary. HT is a direct connection between device and CPU with minimal latency. Another advantage is the possibility of cache-coherent devices. Because of these properties HT is of high interest for high-performance I/O like networking and storage, but also for co-processing and acceleration based on ASIC or FPGA technologies. Acceleration in particular is seeing a resurgence of interest today. One reason is the possibility of reducing power consumption through the use of accelerators. In the area of parallel computing, the low-latency communication allows for fine-grained communication schemes and is well suited to scalable systems. In summary, HT technology offers key advantages and great performance to any research aspect related to or based on interconnects. For more information please consult the workshop website (http://whtra.uni-hd.de)

    Evolutionary optimization of neural networks with heterogeneous computation: study and implementation

    In the optimization of artificial neural networks (ANNs) via evolutionary algorithms and the implementation of the necessary training for the objective function, there is often a trade-off between efficiency and flexibility. Pure software solutions on general-purpose processors tend to be slow because they do not take advantage of the inherent parallelism, whereas hardware realizations usually rely on optimizations that reduce the range of applicable network topologies, or they attempt to increase processing efficiency by means of low-precision data representation. This paper presents, first of all, a study that shows the need for a heterogeneous platform (CPU–GPU–FPGA) to accelerate the optimization of ANNs using genetic algorithms and, secondly, an implementation of a platform based on embedded systems with hardware accelerators implemented in a Field Programmable Gate Array (FPGA). The implementation of the individuals on a remote low-cost Altera FPGA allowed us to obtain a 3x–4x acceleration compared with a 2.83 GHz Intel Xeon Quad-Core and 6x–7x compared with a 2.2 GHz AMD Opteron Quad-Core 2354. The translation of this paper was funded by the Universitat Politecnica de Valencia, Spain. Fe, JD.; Aliaga Varea, RJ.; Gadea Gironés, R. (2015). Evolutionary optimization of neural networks with heterogeneous computation: study and implementation. The Journal of Supercomputing. 71(8):2944-2962. doi:10.1007/s11227-015-1419-7

    Thermodynamic Casimir effect: Universality and Corrections to Scaling

    We study the thermodynamic Casimir force for films in the three-dimensional Ising universality class with symmetry breaking boundary conditions. We focus on the effect of corrections to scaling and probe numerically the universality of our results. In particular we check our hypothesis that corrections are well described by an effective thickness L_{0,eff}=L_0+c (L_0+L_s)^{1-\omega} +L_s, where c and L_s are system specific parameters and \omega\approx 0.8 is the exponent of the leading bulk correction. We simulate the improved Blume-Capel model and the Ising model on the simple cubic lattice. First we analyse the behaviour of various quantities at the critical point. Taking into account corrections \propto L_0^{-\omega} in the case of the Ising model, we find good consistency of results obtained from these two different models. In particular, from the analysis of our data for the Ising model we obtain \Delta_{+-}-\Delta_{++}=3.200(5) for the difference of Casimir amplitudes, which compares well with \Delta_{+-}-\Delta_{++}=3.208(5) obtained by studying the improved Blume-Capel model. Next we study the behaviour of the thermodynamic Casimir force for large values of the scaling variable x=t [L_0/\xi_0]. This behaviour can be obtained up to an overall amplitude by expressing the partition function of the film in terms of eigenvalues and eigenstates of the transfer matrix and boundary states. Here we show how this overall amplitude can be computed with high accuracy. Finally we discuss our results for the scaling functions \theta_{+-} and \theta_{++} of the thermodynamic Casimir force for the whole range of the scaling variable. We conclude that our numerical results are in accordance with universality. Corrections to scaling are well approximated by an effective thickness. Comment: 35 pages, 5 figures, various typos corrected

    Analysis of Inter-Chip Communication Patterns on Multi-Core Distributed Shared-Memory Computers

    Multi-core multi-socket distributed shared-memory computers (DSM computers, for short) have become an important node architecture in scientific computing as they provide substantial computational capacity with relatively low space and power requirements. Compared to conventional computer networks, inter-chip networks used in DSM computers feature higher bandwidth, lower latency and tighter integration with the CPU. The inter-chip network is a shared resource among the user application and many other services, which can lead to considerable variation of execution times of identical communication tasks. In this work, we explore traffic patterns resulting from MPI collective communication primitives and investigate the question whether inter-chip link load is a reliable indicator and predictor for the execution time of collective communication primitives on a DSM computer. Our experiments on a Sun Fire X4600 M2 DSM computer with 32 cores (eight quad-core CPUs) indicate that specific single link loads are positively correlated with the execution time of MPI_ALLREDUCE. Observing patterns over multiple links allows refinement of the single-link observation.

    Computers working at the speed of light

    New insight on galaxy structure from GALPHAT I. Motivation, methodology, and benchmarks for Sersic models

    We introduce a new galaxy image decomposition tool, GALPHAT (GALaxy PHotometric ATtributes), to provide full posterior probability distributions and reliable confidence intervals for all model parameters. GALPHAT is designed to yield a high-speed, accurate likelihood computation, using grid interpolation and Fourier rotation. We benchmark this approach using an ensemble of simulated Sersic model galaxies over a wide range of observational conditions: the signal-to-noise ratio S/N, the ratio of galaxy size to the PSF and the image size, and errors in the assumed PSF; and a range of structural parameters: the half-light radius r_e and the Sersic index n. We characterise the strength of parameter covariance in the Sersic model, which increases with S/N and n, and the results strongly motivate the need for the full posterior probability distribution in galaxy morphology analyses and later inferences. The test results for simulated galaxies successfully demonstrate that, with a careful choice of Markov chain Monte Carlo algorithms and fast model image generation, GALPHAT is a powerful analysis tool for reliably inferring morphological parameters from a large ensemble of galaxies over a wide range of different observational conditions. (abridged) Comment: Submitted to MNRAS. The submitted version with high resolution figures can be downloaded from http://www.astro.umass.edu/~iyoon/GALPHAT/galphat1.pd