5,027 research outputs found

    Architecture-Aware Configuration and Scheduling of Matrix Multiplication on Asymmetric Multicore Processors

    Asymmetric multicore processors (AMPs) have recently emerged as an appealing technology for severely energy-constrained environments, especially in mobile appliances where heterogeneity in applications is mainstream. In addition, given the growing interest in low-power high performance computing, this type of architecture is also being investigated as a means to improve the throughput-per-Watt of complex scientific applications. In this paper, we design and embed several architecture-aware optimizations into a multi-threaded general matrix multiplication (gemm), a key operation of the BLAS, in order to obtain a high performance implementation for ARM big.LITTLE AMPs. Our solution is based on the reference implementation of gemm in the BLIS library, and integrates a cache-aware configuration as well as asymmetric-static and dynamic scheduling strategies that carefully tune and distribute the operation's micro-kernels among the big and LITTLE cores of the target processor. The experimental results on a Samsung Exynos 5422, a system-on-chip with ARM Cortex-A15 and Cortex-A7 clusters that implements the big.LITTLE model, show that our cache-aware versions of gemm with asymmetric scheduling attain important gains in performance with respect to their architecture-oblivious counterparts while exploiting all the resources of the AMP to deliver considerable energy efficiency.
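
    As a rough illustration of the static asymmetric scheduling idea described in the abstract above, the sketch below splits the iterations of a single loop between a fast and a slow cluster in proportion to an assumed speed ratio. The function name `split_asymmetric`, the ratio of 3, and the block size of 4 are hypothetical choices made for this example; they are not taken from the paper or from the BLIS library.

```python
def split_asymmetric(n_iters, big_to_little_ratio, block=1):
    """Split n_iters loop iterations between a fast and a slow cluster.

    The fast ("big") cluster receives a share proportional to
    big_to_little_ratio : 1, rounded down to a multiple of `block`
    so that chunks stay aligned with an assumed micro-kernel size.
    Returns (big_range, little_range) as half-open (start, end) pairs.
    """
    if n_iters < 0 or big_to_little_ratio <= 0 or block < 1:
        raise ValueError("invalid arguments")
    # Ideal (fractional) number of iterations for the big cluster.
    ideal = n_iters * big_to_little_ratio / (big_to_little_ratio + 1)
    # Round down to a whole number of blocks; the slow cluster takes the rest.
    big = min(int(ideal // block) * block, n_iters)
    return (0, big), (big, n_iters)


# Hypothetical example: 1024 iterations, big cores assumed 3x faster.
big_range, little_range = split_asymmetric(1024, big_to_little_ratio=3, block=4)
print(big_range, little_range)
```

    A fixed ratio like this only balances the load when the relative speed of the clusters is known in advance and stays stable; a dynamic strategy would instead let each cluster pull the next chunk when it finishes its current one.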

    Redshifts in the Southern Abell Redshift Survey Clusters. I. The Data

    The Southern Abell Redshift Survey (SARS) contains 39 clusters of galaxies with redshifts in the range 0.0 < z < 0.31 and a median redshift depth of z = 0.0845. SARS covers the region 0 21h (while avoiding the LMC and SMC) with b > 40. Cluster locations were chosen from the Abell and Abell-Corwin-Olowin catalogs while galaxy positions were selected from the Automatic Plate Measuring Facility galaxy catalog with extinction-corrected magnitudes in the range 15 <= b_j < 19. SARS utilized the Las Campanas 2.5 m duPont telescope, observing either 65 or 128 objects concurrently over a 1.5 sq deg field. New redshifts for 3440 galaxies are reported in the fields of these 39 clusters of galaxies. Comment: 20 pages, 5 figures, accepted for publication in the Astronomical Journal, Table 2 can be downloaded in its entirety from http://trotsky.arc.nasa.gov/~mway/SARS1/sars1-table2.cs

    X-ray total mass estimate for the nearby relaxed cluster A3571

    We constrain the total mass distribution in the cluster A3571, combining spatially resolved ASCA temperature data with ROSAT imaging data under the assumption that the cluster is in hydrostatic equilibrium. The total mass within r_500 (1.7/h_50 Mpc) is M_500 = 7.8[+1.4,-2.2] x 10^14 / h_50 Msun at 90% confidence, 1.1 times smaller than the isothermal estimate. The Navarro, Frenk & White "universal profile" is a good description of the dark matter density distribution in A3571. The gas density profile is shallower than the dark matter profile, scaling as r^{-2.1} at large radii, leading to a monotonically increasing gas mass fraction with radius. Within r_500 the gas mass fraction reaches a value of f_gas = 0.19[+0.06,-0.03] h_50^{-3/2} (90% confidence errors). Assuming that this value of f_gas is a lower limit for the universal value of the baryon fraction, we estimate the 90% confidence upper limit of the cosmological matter density to be Omega_m < 0.4. Comment: 10 pages, 4 figures, accepted by Ap
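
    For reference, the hydrostatic-equilibrium assumption mentioned in the abstract above is usually written as the standard textbook relation below, and the bound on Omega_m follows from treating f_gas as a lower limit on the baryon fraction Omega_b / Omega_m. The abstract itself does not state these formulas; the symbols (rho_gas for the gas density, T for the gas temperature, mu for the mean molecular weight, m_p for the proton mass, Omega_b for the baryon density) are conventional and are assumed here.

```latex
M(<r) = -\frac{k_B\, T(r)\, r}{G\, \mu\, m_p}
        \left( \frac{d\ln \rho_{\rm gas}}{d\ln r} + \frac{d\ln T}{d\ln r} \right),
\qquad
f_{\rm gas} \le \frac{\Omega_b}{\Omega_m}
\;\Longrightarrow\;
\Omega_m \le \frac{\Omega_b}{f_{\rm gas}}
```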

    El agronegocio del cultivo de tartago em el mundo.

    bitstream/CNPA/18351/1/CIRTEC101.pd

    Integration and exploitation of intra-routine malleability in BLIS

    Malleability is a property of certain applications (or tasks) that, given an external request or autonomously, can accommodate a dynamic modification of the degree of parallelism being exploited at runtime. Malleability improves resource usage (core occupation) on modern multicore architectures for applications that exhibit irregular and divergent execution paths and depend heavily on the performance of the underlying library. The integration of malleability within high-performance instances of the Basic Linear Algebra Subprograms (BLAS) is nonexistent and, in addition, difficult to attain given the rigidity of current application programming interfaces (APIs). In this paper, we overcome these issues by presenting the integration of a malleability mechanism within BLIS, a high-performance and portable framework to implement BLAS-like operations. For this purpose, we leverage low-level (yet simple) APIs to integrate on-demand malleability across all Level-3 BLAS routines, and we demonstrate the performance benefits of this approach by means of a higher-level dense matrix operation: the LU factorization with partial pivoting and look-ahead. The researchers from Universidad Complutense de Madrid were supported by the EU (FEDER) and Spanish MINECO (TIN2015-65277-R, RTI2018-093684-B-I00), and by Spanish CM (S2018/TCS-4423). The researcher from Universitat Politècnica de València was supported by the Spanish MINECO (TIN2017-82972-R). Rodríguez-Sánchez, R.; Igual, FD.; Quintana-Ortí, ES. (2020). Integration and exploitation of intra-routine malleability in BLIS. The Journal of Supercomputing (Online). 76(4):2860-2875. https://doi.org/10.1007/s11227-019-03078-z
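
    To make the notion of intra-routine malleability more concrete, here is a minimal serial sketch in which a blocked routine re-reads its permitted worker count before every block, so the degree of parallelism can change while the routine is still running. The function `malleable_blocked_sum` and the callback `workers_at` are hypothetical names invented for this example; they do not correspond to any BLIS API, and the per-worker slices are processed one after another rather than by real threads.

```python
def malleable_blocked_sum(data, block, workers_at):
    """Sum `data` block by block, re-reading the worker count before each block.

    `workers_at(i)` is a hypothetical callback standing in for an external
    malleability request; it returns how many workers block i may use.
    Returns (total, list of worker counts actually used per block).
    """
    total = 0
    used = []
    for i, start in enumerate(range(0, len(data), block)):
        chunk = data[start:start + block]
        # Clamp the request: at least one worker, at most one per element.
        workers = max(1, min(workers_at(i), len(chunk)))
        used.append(workers)
        # Divide the chunk into `workers` near-equal slices; a real library
        # would hand each slice to a thread instead of summing serially.
        step = -(-len(chunk) // workers)  # ceiling division
        for s in range(0, len(chunk), step):
            total += sum(chunk[s:s + step])
    return total, used


# Hypothetical request pattern: 1 worker, then 3, then 8 (clamped to 2).
requests = [1, 3, 8]
total, used = malleable_blocked_sum(list(range(10)), 4, lambda i: requests[i])
print(total, used)
```

    The point of the sketch is only the control flow: the worker count is a per-block decision rather than a value fixed when the routine is called.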

    The Optimal Gravitational Lens Telescope

    Given an observed gravitational lens mirage produced by a foreground deflector (e.g., galaxy, quasar, cluster, ...), it is possible via numerical lens inversion to retrieve the real source image, taking full advantage of the magnifying power of the cosmic lens. This has been achieved in the past for several remarkable gravitational lens systems. Instead, we propose here to invert an observed multiply imaged source directly at the telescope using an ad hoc optical instrument, which is described in the present paper. Compared to the previous method, this should allow one to detect fainter source features as well as to use such an optimal gravitational lens telescope to explore even fainter objects located behind and near the lens. Laboratory and numerical experiments illustrate this new approach.