112 research outputs found

    A methodology for speeding up matrix vector multiplication for single/multi-core architectures

    In this paper, a new methodology for computing dense matrix-vector multiplication is presented, for both embedded processors (without a SIMD unit) and general-purpose processors (single- and multi-core, with a SIMD unit). The methodology achieves higher execution speed than the state-of-the-art ATLAS library (speedups from 1.2 to 1.45). This is achieved by fully exploiting the combination of software parameters (e.g., data reuse) and hardware parameters (e.g., data cache associativity), which are considered simultaneously as one problem rather than separately, giving a smaller search space and high-quality solutions. The proposed methodology produces a different schedule for different values of (i) the number of levels of data cache; (ii) the data cache sizes; (iii) the data cache associativities; (iv) the data cache and main memory latencies; (v) the data array layout of the matrix; and (vi) the number of cores.
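The data reuse mentioned in the matrix-vector abstract above can be illustrated with a small sketch. This is not the authors' schedule: the function name `matvec_row_blocked` and the choice of two rows per pass are assumptions made for illustration, and pure Python only shows the loop structure, not the speedup a compiled kernel would see.

```python
def matvec_row_blocked(a, x):
    """Compute y = a @ x, handling two rows per pass so that each x[j]
    read from memory is reused for both rows (a simple form of data reuse).

    The block height of 2 is an illustrative assumption, not a tuned value.
    """
    n, m = len(a), len(x)
    y = [0.0] * n
    i = 0
    while i + 1 < n:
        row0, row1 = a[i], a[i + 1]
        s0 = s1 = 0.0
        for j in range(m):
            xj = x[j]          # read once, used for two rows
            s0 += row0[j] * xj
            s1 += row1[j] * xj
        y[i], y[i + 1] = s0, s1
        i += 2
    if i < n:                  # leftover row when n is odd
        y[i] = sum(a[i][j] * x[j] for j in range(m))
    return y
```

A production kernel would choose the number of rows per pass from the register file and cache parameters rather than fixing it at two.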

    A high-performance matrix-matrix multiplication methodology for CPU and GPU architectures

    Current compilers cannot generate code that competes with hand-tuned code in efficiency, even for a simple kernel such as matrix–matrix multiplication (MMM). A key step in program optimization is estimating optimal values for parameters such as tile sizes and the number of levels of tiling. Selecting scheduling parameter values is difficult and time-consuming because the values depend on each other; this is why they are usually found with search methods and empirical techniques. To overcome this problem, the scheduling sub-problems must be optimized together, as one problem and not separately. In this paper, an MMM methodology is presented in which the optimum scheduling parameters are found by reducing the search space theoretically, while the major scheduling sub-problems are addressed together as one problem according to the hardware architecture parameters and input size; for different hardware architecture parameters and/or input sizes, a different implementation is produced. This is achieved by fully exploiting the software characteristics (e.g., data reuse) and hardware architecture parameters (e.g., data cache sizes and associativities), giving high-quality solutions and a smaller search space. The methodology applies to a wide range of CPU and GPU architectures.
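To make the tile-size parameter in the MMM abstract above concrete, here is a minimal sketch of single-level loop tiling. It is not the methodology from the paper: the name `tiled_matmul` and the `tile` argument are hypothetical, and the tile size is passed in by hand rather than derived from cache sizes and associativities.

```python
def tiled_matmul(a, b, tile=2):
    """Multiply a (n x k) by b (k x m) using one level of loop tiling.

    `tile` is an illustrative parameter; a real schedule would derive it
    from the cache hierarchy instead of taking it as an argument.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # Outer loops walk over tiles; inner loops stay inside one tile so the
    # touched sub-blocks of a, b and c are reused while they are still cached.
    for ii in range(0, n, tile):
        for kk in range(0, k, tile):
            for jj in range(0, m, tile):
                for i in range(ii, min(ii + tile, n)):
                    for p in range(kk, min(kk + tile, k)):
                        a_ip = a[i][p]  # keep a[i][p] in a scalar for the j loop
                        for j in range(jj, min(jj + tile, m)):
                            c[i][j] += a_ip * b[p][j]
    return c
```

The result does not depend on `tile`; only the order in which elements are visited changes, which is what makes the tile size a pure performance parameter.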

    A methodology pruning the search space of six compiler transformations by addressing them together as one problem and by exploiting the hardware architecture details

    Today’s compilers have a plethora of optimizations and transformations to choose from, and the choice, order, and parameters of these transformations have a large impact on performance. Choosing the correct order and parameters of optimizations is a long-standing, still unsolved problem in compilation research: optimizing the sub-problems separately gives a different schedule/binary for each sub-problem, and these schedules cannot coexist, since refining one degrades the other. Researchers try to solve this problem with iterative compilation techniques, but the search space is so large that it cannot be searched even with modern supercomputers. Moreover, compiler transformations do not take hardware architecture details and data reuse into account efficiently. In this paper, a new iterative compilation methodology is presented that reduces the search space of six compiler transformations by addressing the above problems; the search space is reduced by many orders of magnitude, so an efficient solution can now be found. The transformations are loop tiling (including the number of levels of tiling), loop unrolling, register allocation, scalar replacement, loop interchange, and data array layouts. The search space is reduced (a) by addressing these transformations together as one problem and not separately, and (b) by taking into account the custom hardware architecture details (e.g., cache size and associativity) and algorithm characteristics (e.g., data reuse). The proposed methodology has been evaluated against iterative compilation and the gcc/icc compilers, on both embedded and general-purpose processors; it achieves significant performance gains at a compilation time that is many orders of magnitude lower.
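The idea of shrinking a search space with hardware details, as described in the abstract above, can be sketched for tile sizes alone. This is a simplified illustration and not the paper's method: the working-set formula ignores associativity, and the function name `feasible_tiles`, the candidate sizes, and the 32 KiB cache are all assumptions.

```python
from itertools import product

def feasible_tiles(candidates, cache_bytes, elem_bytes=8):
    """Keep (ti, tj, tk) tile triples whose three sub-blocks fit in cache.

    Working set of one tile step of C += A * B, in elements:
    ti*tk (A block) + tk*tj (B block) + ti*tj (C block).
    """
    keep = []
    for ti, tj, tk in product(candidates, repeat=3):
        working_set = (ti * tk + tk * tj + ti * tj) * elem_bytes
        if working_set <= cache_bytes:
            keep.append((ti, tj, tk))
    return keep

sizes = [8, 16, 32, 64, 128]           # hypothetical candidate tile sizes
all_count = len(sizes) ** 3            # combinations an exhaustive search would try
pruned = feasible_tiles(sizes, cache_bytes=32 * 1024)  # assumed 32 KiB L1 data cache
```

Only the triples in `pruned` would need to be compiled and timed; every other combination is rejected from the cache size alone, before any code is generated.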

    Empirical Installation of Linear Algebra Shared-Memory Subroutines for Auto-Tuning

    The final publication is available at Springer via http://dx.doi.org/10.1007/s10766-013-0249-6
    The introduction of auto-tuning techniques in linear algebra shared-memory routines is analyzed. Information obtained during the installation of the routines is used at run time to make decisions that reduce the total execution time. The study is carried out with routines at different levels (matrix multiplication, LU and Cholesky factorizations, and symmetric or general linear system routines) and with calls to routines in the LAPACK and PLASMA libraries with multithreaded implementations. Medium NUMA and large cc-NUMA systems are used in the experiments. This variety of routines, libraries and systems allows us to obtain general conclusions about the methodology to use for auto-tuning linear algebra shared-memory routines. Satisfactory execution times are obtained with the proposed methodology.
    Partially supported by Fundacion Seneca, Consejeria de Educacion de la Region de Murcia, 08763/PI/08, PROMETEO/2009/013 from Generalitat Valenciana, the Spanish Ministry of Education and Science through TIN2012-38341-C04-03, and the High-Performance Computing Network on Parallel Heterogeneus Architectures (CAPAP-H). The authors gratefully acknowledge the computer resources and assistance provided by the Supercomputing Centre of the Scientific Park Foundation of Murcia and by the Centre de Supercomputacio de Catalunya.
    Cámara, J.; Cuenca, J.; Giménez, D.; García, LP.; Vidal Maciá, AM. (2014). Empirical Installation of Linear Algebra Shared-Memory Subroutines for Auto-Tuning. International Journal of Parallel Programming. 42(3):408-434. https://doi.org/10.1007/s10766-013-0249-6
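The split between installation and run time described in the auto-tuning abstract above can be sketched as follows. This is an assumed, minimal design and not the procedure from the paper: `install`, `choose`, the nearest-size lookup, and the idea that the tuned parameter is a single value (for example a thread count) are all hypothetical.

```python
import time

def install(routine, sizes, candidates, timer=time.perf_counter):
    """Installation phase: time routine(n, p) for each problem size n and
    candidate parameter p, and record the fastest p for each size."""
    table = {}
    for n in sizes:
        best_p, best_t = None, float("inf")
        for p in candidates:
            start = timer()
            routine(n, p)
            elapsed = timer() - start
            if elapsed < best_t:
                best_p, best_t = p, elapsed
        table[n] = best_p
    return table

def choose(table, n):
    """Run-time phase: reuse the parameter recorded for the closest installed size."""
    nearest = min(table, key=lambda installed: abs(installed - n))
    return table[nearest]
```

At run time a caller would look up `choose(table, n)` before invoking the routine, so no timing experiments are repeated. The `timer` argument exists only so the installation step can be exercised with a deterministic clock.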

    svmPRAT: SVM-based Protein Residue Annotation Toolkit

    Background: Over the last decade several prediction methods have been developed for determining the structural and functional properties of individual protein residues using sequence and sequence-derived information. Most of these methods are based on support vector machines, as they provide accurate and generalizable prediction models. Results: We present a general-purpose protein residue annotation toolkit (svmPRAT) that allows biologists to formulate residue-wise prediction problems. svmPRAT formulates the annotation problem as a classification or regression problem using support vector machines. One of the key features of svmPRAT is the ease with which it incorporates any user-provided information in the form of feature matrices. For every residue, svmPRAT captures local information around the residue to create fixed-length feature vectors. svmPRAT implements accurate and fast kernel functions, and also introduces a flexible window-based encoding scheme that accurately captures signals and patterns for training effective predictive models. Conclusions: In this work we evaluate svmPRAT on several classification and regression problems, including disorder prediction, residue-wise contact order estimation, DNA-binding site prediction, and local structure alphabet prediction. svmPRAT has also been used to develop a state-of-the-art transmembrane helix prediction method called TOPTMH and a secondary structure prediction method called YASSPP. The toolkit provides practitioners with an efficient and easy-to-use tool for a wide variety of annotation problems. Availability: http://www.cs.gmu.edu/~mlbio/svmprat
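The step of turning local information around each residue into fixed-length feature vectors, mentioned in the svmPRAT abstract above, can be sketched generically. This is not svmPRAT's actual encoding: the function name `window_vectors`, the symmetric window, and the zero padding at the sequence ends are assumptions of this sketch.

```python
def window_vectors(features, w):
    """Return one fixed-length vector per residue.

    features: list of per-residue feature rows, all of the same length d.
    w: half-window size; each vector concatenates rows i-w .. i+w, so its
       length is (2*w + 1) * d. Positions past either end of the sequence
       are zero-padded (an assumption of this sketch).
    """
    d = len(features[0])
    pad = [0.0] * d
    vectors = []
    for i in range(len(features)):
        vec = []
        for j in range(i - w, i + w + 1):
            row = features[j] if 0 <= j < len(features) else pad
            vec.extend(row)
        vectors.append(vec)
    return vectors
```

Each output vector has the same length regardless of where the residue sits in the sequence, which is what lets a standard classifier or regressor be trained on them.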

    Promotion of a healthy lifestyle among 5-year-old overweight children: Health behavior outcomes of the 'Be active, eat right' study

    Background: This study evaluates the effects of an intervention performed by youth health care professionals on child health behaviors. The intervention consisted of offering healthy lifestyle counseling to parents of overweight (not obese) 5-year-old children. Effects of the intervention on the child having breakfast, drinking sweet beverages, watching television and playing outside were evaluated. Methods: Data were collected with the 'Be active, eat right' study, a cluster randomized controlled trial among nine youth health care centers in the Netherlands. Parents of overweight children received lifestyle counseling according to the intervention protocol in the intervention condition (n = 349) and usual care in the control condition (n = 288). Parents completed questionnaires regarding demographic characteristics, health behaviors and the home environment at baseline and at 2-year follow-up. Cluster-adjusted regression models were applied; interaction terms were explored. Results: The population for analysis consisted of 38.1% boys; mean age 5.8 [sd 0.4] years; mean BMI SDS 1.9 [sd 0.4]. There were no significant differences in the number of minutes of outside play or television viewing a day between children in the intervention and the control condition. Also, the odds ratio for having breakfast daily or drinking two or fewer glasses of sweet beverages a day showed no significant differences between the two conditions. Additional analyses showed that the odds ratio for drinking less than two glasses of sweet beverages at follow-up compared with baseline was significantly higher for children in both the intervention (p < 0.001) and the control condition (p = 0.029). Conclusions: Comparison of the children in the two conditions showed that the intervention does not contribute to a change in health behaviors. Further studies are needed to investigate opportunities to adjust the intervention protocol, such as integration of elements in the regular well-child visit. The intervention protocol for youth health care may become part of a broader approach to tackle childhood overweight and obesity. Trial registration: Current Controlled Trials ISRCTN04965410

    Reading Comprehension and Reading Comprehension Difficulties
