116 research outputs found
On the Performance Prediction of BLAS-based Tensor Contractions
Tensor operations are surging as the computational building blocks for a
variety of scientific simulations and the development of high-performance
kernels for such operations is known to be a challenging task. While for
operations on one- and two-dimensional tensors there exist standardized
interfaces and highly-optimized libraries (BLAS), for higher dimensional
tensors neither standards nor highly-tuned implementations exist yet. In this
paper, we consider contractions between two tensors of arbitrary dimensionality
and take on the challenge of generating high-performance implementations by
resorting to sequences of BLAS kernels. The approach consists in breaking the
contraction down into operations that only involve matrices or vectors. Since
in general there are many alternative ways of decomposing a contraction, we are
able to methodically derive a large family of algorithms. The main contribution
of this paper is a systematic methodology to accurately identify the fastest
algorithms in the bunch, without executing them. The goal is instead
accomplished with the help of a set of cache-aware micro-benchmarks for the
underlying BLAS kernels. The predictions we construct from such benchmarks
allow us to reliably single out the best-performing algorithms in a tiny
fraction of the time taken by the direct execution of the algorithms. Comment: Submitted to PMBS1
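As a rough illustration of the decomposition idea in this abstract, the sketch below computes one small contraction, C[a][b][c] = sum_i A[a][i] * B[i][b][c], in two alternative ways: as a sequence of matrix-matrix products and as a sequence of matrix-vector products. The helper names (`gemm`, `gemv`, `contract_gemm`, `contract_gemv`) and the chosen contraction are assumptions made for illustration; the pure-Python helpers merely stand in for BLAS kernels and are not the paper's implementation or its prediction method.

```python
def gemm(A, B):
    """Plain matrix-matrix product, standing in for a BLAS gemm call."""
    inner, cols = len(B), len(B[0])
    return [[sum(row[p] * B[p][j] for p in range(inner)) for j in range(cols)]
            for row in A]


def gemv(A, x):
    """Plain matrix-vector product, standing in for a BLAS gemv call."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def contract_gemm(A, B):
    """C[a][b][c] = sum_i A[a][i] * B[i][b][c], one gemm per slice c."""
    na, ni, nb, nc = len(A), len(B), len(B[0]), len(B[0][0])
    C = [[[0] * nc for _ in range(nb)] for _ in range(na)]
    for c in range(nc):
        # Extract the matrix B[:, :, c] and multiply it by A.
        B_slice = [[B[i][b][c] for b in range(nb)] for i in range(ni)]
        C_slice = gemm(A, B_slice)
        for a in range(na):
            for b in range(nb):
                C[a][b][c] = C_slice[a][b]
    return C


def contract_gemv(A, B):
    """Same contraction, one gemv per fiber B[:, b, c]."""
    na, ni, nb, nc = len(A), len(B), len(B[0]), len(B[0][0])
    C = [[[0] * nc for _ in range(nb)] for _ in range(na)]
    for b in range(nb):
        for c in range(nc):
            fiber = [B[i][b][c] for i in range(ni)]
            column = gemv(A, fiber)
            for a in range(na):
                C[a][b][c] = column[a]
    return C


# Hypothetical example data: a 2x2 matrix and a 2x2x2 tensor.
A = [[1, 2], [3, 4]]
B = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
assert contract_gemm(A, B) == contract_gemv(A, B)
```

Both variants produce the same tensor but issue different kernel calls on differently shaped operands, which is why their run times can differ and why ranking them without executing them is useful.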
Performance portability of Earth system models with user-controlled GGDML code translation
The increasing need for performance in earth system modeling and other scientific domains pushes computing technologies in diverse architectural directions. Developing models requires technical expertise and skill in using tools that can exploit the hardware capabilities. The heterogeneity of architectures complicates the development and maintainability of the models. To improve the software development process of earth system models, we provide an approach that simplifies code maintainability by fostering separation of concerns while providing performance portability. We propose the use of high-level language extensions that reflect scientific concepts. Scientists can use the programming language of their own choice to develop models and can optionally use the language extensions wherever they need them. The code translation is driven by configurations that are separated from the model source code. These configurations are prepared by scientific programmers to make optimal use of the machine’s features. The main contribution of this paper is the demonstration of a user-controlled source-to-source translation technique for earth system models that are written with higher-level semantics. We discuss a flexible code translation technique that is driven by the users through a configuration input prepared specifically to transform the code, and we use this technique to produce OpenMP- or OpenACC-enabled codes, in addition to MPI, to support multi-node configurations.
Running Genetic Algorithms in the Edge: A First Analysis
Nowadays, the volume of data produced by different kinds of devices is continuously growing, making it even more difficult to solve the many optimization problems that directly impact our quality of life. For instance, Cisco projected that by 2019 the volume of data will reach 507.5 zettabytes per year, and the cloud traffic will quadruple. This is not sustainable in the long term, so it is necessary to move part of the intelligence from the cloud to a highly decentralized computing model. Considering this, we propose a ubiquitous intelligent system composed of different kinds of endpoint devices such as smartphones, tablets, routers, wearables, and any other CPU-powered device. We want to use this system to solve tasks useful for smart cities. In this paper, we analyze whether these devices are suitable for this purpose and how we have to adapt the optimization algorithms to be efficient on heterogeneous hardware. To do this, we perform a set of experiments in which we measure the speed, memory usage, and battery consumption of these devices for a set of binary and combinatorial problems. Our conclusions reveal the strong and weak features of each device for running future algorithms at the edge of the cyber-physical system.
Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech.
This research has been partially funded by the Spanish MINECO and FEDER projects TIN2014-57341-R (http://moveon.lcc.uma.es), TIN2016-81766-REDT (http://cirti.es), TIN2017-88213-R (http://6city.lcc.uma.es), the Ministry of Education of Spain (FPU16/02595
Benchmark based on application signature to analyze and predict their behavior
Currently, there are benchmark sets that measure the performance of HPC systems under specific computing and communication properties. These benchmarks represent the kernels of applications that measure specific hardware components. If the user’s application is not represented by any benchmark, it is not possible to obtain an equivalent performance metric. In this work, we propose a benchmark based on the signature of an MPI application obtained by the PAS2P method. PAS2P creates the application signature in order to predict the execution time, which we expect to closely match the execution time of the full application. The signature has two performance qualities: the bounded time to execute it (a benchmark property) and the quality of prediction. Therefore, we propose to extend the signature with benchmark capabilities, such as reporting the efficiency of the application on the HPC system. The performance metrics will be produced by the proposed benchmark. The experimentation validates our proposal with an average prediction error close to 7%.
Instituto de Investigación en Informátic
Solving Weighted Least Squares (WLS) problems on ARM-based architectures
The Weighted Least Squares algorithm (WLS) is applied to numerous optimization
problems, but requires the use of high computational resources, especially
when complex arithmetic is involved. This work aims to accelerate the resolution of
a WLS problem by reducing the computational cost (relying on BLAS/LAPACK
routines) and the computational precision from double to single. As a test case, we
design an IIR filter for a Graphic Equalizer, where the numerical errors due to single
precision are easily visualized. In addition, given the importance of low power architectures
for this kind of implementations, we evaluate the performance, scalability,
and energy efficiency of each method on two different processors implementing the ARMv7 architecture, widely used in current mobile devices with power constraints.
Results show that the method with a high theoretical computational cost outperforms
other methods with lower theoretical cost in efficiency on architectures of this
type.
This work started in spring 2016 when Jose A. Belloch was a visiting postdoctoral researcher at Budapest University of Technology and Economics thanks to the European Network COST Action IC1305 inside the program Short Term Scientific Mission with the following reference: COST-SPASM-ECOST-STSM-IC1305-020416-072431. Dr. Jose A. Belloch is supported by GVA contract APOSTD/2016/069. The researchers from Universitat Jaume I are supported by the CICYT projects TIN2014-53495-R of MINECO and FEDER. The authors from the Universitat Politecnica de Valencia are supported by MINECO Projects TEC2015-67387-C4-1-R, PROMETEOII/2014/003 and CAPAP-H5 network TIN2014-53522-REDT. The researcher from UCM is supported by the EU (FEDER) and the Spanish MINECO, under Grants TIN 2015-65277-R and TIN2012-32180. The work of Balazs Bank was supported by the UNKP-16-4-III New National Excellence Program of the Ministry of Human Capacities, Hungary.
Belloch Rodríguez, JA.; Bank, B.; Igual Peña, FD.; Quintana Ortí, ES.; Vidal Maciá, AM. (2017). Solving Weighted Least Squares (WLS) problems on ARM-based architectures. Journal of Supercomputing. 73(1):530-542. https://doi.org/10.1007/s11227-016-1910-9
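For readers unfamiliar with the weighted least squares problem named in this abstract, the following minimal sketch fits a straight line y = b0 + b1*x by minimizing sum_i w_i * (y_i - b0 - b1*x_i)^2 through the 2x2 normal equations. The function name `wls_line_fit` and the real-valued line-fit setting are assumptions for illustration only; the paper itself targets complex arithmetic and filter design with BLAS/LAPACK routines, which this sketch does not reproduce.

```python
def wls_line_fit(x, y, w):
    """Weighted least squares fit of y = b0 + b1 * x.

    Solves the normal equations (X^T W X) beta = X^T W y in closed form
    for the two-parameter case. Weights must be positive.
    """
    s = sum(w)                                           # sum of weights
    sx = sum(wi * xi for wi, xi in zip(w, x))            # sum w*x
    sxx = sum(wi * xi * xi for wi, xi in zip(w, x))      # sum w*x^2
    sy = sum(wi * yi for wi, yi in zip(w, y))            # sum w*y
    sxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))  # sum w*x*y

    det = s * sxx - sx * sx  # determinant of X^T W X
    if det == 0:
        raise ValueError("degenerate problem: all weighted x values coincide")

    b0 = (sxx * sy - sx * sxy) / det
    b1 = (s * sxy - sx * sy) / det
    return b0, b1


# Hypothetical data lying exactly on y = 2 + 3x; any positive weights
# should then recover the same coefficients.
b0, b1 = wls_line_fit([0, 1, 2, 3], [2, 5, 8, 11], [1, 2, 1, 0.5])
```

Larger problems would typically form and solve the same normal equations, or an equivalent factorization, with library routines rather than a closed form.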
Tableau-based protein substructure search using quadratic programming
Background: Searching for proteins that contain similar substructures is an important task in structural biology. The exact solution of most formulations of this problem, including a recently published method based on tableaux, is too slow for practical use in scanning a large database.
Results: We developed an improved method for detecting substructural similarities in proteins using tableaux. Tableaux are compared efficiently by solving the quadratic program (QP) corresponding to the quadratic integer program (QIP) formulation of the extraction of maximally-similar tableaux. We compare the accuracy of the method in classifying protein folds with some existing techniques.
Conclusion: We find that including constraints based on the separation of secondary structure elements increases the accuracy of protein structure search using maximally-similar subtableau extraction, to a level where it has comparable or superior accuracy to existing techniques. We demonstrate that our implementation is able to search a structural database in a matter of hours on a standard PC.
Double-diffusive convection in an inclined porous layer with a concentration-based internal heat source
© 2017, The Author(s). The thermosolutal instability of double-diffusive convection in an inclined fluid-saturated porous layer with a concentration-based internal heat source is investigated. The linear instability of small-amplitude perturbations to the system is analyzed with respect to transverse and longitudinal rolls. The resultant eigenvalue problem is solved numerically utilizing the Chebyshev tau method. It is shown that an increasing inclination angle causes a strong stabilization in the transverse rolls irrespective of the internal heat source or vertical solutal Rayleigh number. Furthermore, substantial qualitative changes are demonstrated in the linear instability thresholds with variations in the inclination angle and concentration-based heat source
Case studies on the development of ScaLAPACK and the NAG Numerical PVM Library
In this paper we look at the development of ScaLAPACK, a software library for dense and banded numerical linear algebra, and the NAG Numerical PVM Library, which includes software for dense and sparse linear algebra, quadrature, optimization and random number generation. Both libraries are aimed at distributed memory machines, including networks of workstations. The paper concentrates on the underlying design and the testing of the libraries