11 research outputs found

    Acceleration of PageRank with customized precision based on mantissa segmentation

    [EN] We describe the application of a communication-reduction technique for the PageRank algorithm that dynamically adapts the precision of the data access to the numerical requirements of the algorithm as the iteration converges. Our variable-precision strategy, using a customized precision format based on mantissa segmentation (CPMS), abandons the IEEE 754 single- and double-precision number representation formats employed in the standard implementation of PageRank, and instead handles the data in memory using a customized floating-point format. The customized format enables fast data access at different accuracies, prevents overflow and underflow by preserving the IEEE 754 double-precision exponent, and avoids data duplication: all bits of the original IEEE 754 double-precision mantissa are preserved in memory, but reorganized for efficient reduced-precision access. With this approach, the truncated values (omitting significand bits), as well as the original IEEE 754 double-precision values, can be retrieved without duplicating the data in different formats. Our numerical experiments on an NVIDIA V100 GPU (Volta architecture) and a server equipped with two Intel Xeon Platinum 8168 CPUs (48 cores in total) show that, compared with a standard IEEE 754 double-precision implementation, the CPMS-based PageRank completes about 10% faster if high-accuracy output is needed, and about 30% faster if reduced output accuracy is acceptable.

    H. Anzt was supported by the "Impuls und Vernetzungsfond" of the Helmholtz Association under grant VH-NG-1241. G. Flegar and E. S. Quintana-Orti were supported by project TIN2017-82972-R of the MINECO and FEDER. This work was also supported by the EU H2020 project 732631 "OPRECOMP. Open Transprecision Computing," and the US Department of Energy Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under Award Numbers DE-SC0016513 and DE-SC-0010042.

    Gruetzmacher, T.; Cojean, T.; Flegar, G.; Anzt, H.; Quintana-Orti, ES. (2020). Acceleration of PageRank with customized precision based on mantissa segmentation. ACM Transactions on Parallel Computing. 7(1):1-19. https://doi.org/10.1145/3380934
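
    The segmentation idea in the abstract above can be illustrated with a minimal sketch. It assumes a two-segment layout: a 32-bit head holding the sign, the 11-bit exponent, and the top 20 significand bits, plus a 32-bit tail holding the remaining 32 significand bits. The segment sizes and the function names `split_segments`, `read_reduced`, and `read_full` are illustrative assumptions, not the authors' implementation; the abstract does not state how the mantissa is divided.

```python
import numpy as np


def split_segments(values):
    """Split float64 values into two 32-bit segments stored in separate arrays.

    Assumed layout (hypothetical, for illustration only):
      head = sign bit, 11-bit exponent, top 20 significand bits
      tail = low 32 significand bits
    """
    bits = np.ascontiguousarray(values, dtype=np.float64).view(np.uint64)
    head = (bits >> np.uint64(32)).astype(np.uint32)
    tail = (bits & np.uint64(0xFFFFFFFF)).astype(np.uint32)
    return head, tail


def read_reduced(head):
    """Reduced-precision read: touch only the head, zero-fill the tail bits."""
    bits = head.astype(np.uint64) << np.uint64(32)
    return bits.view(np.float64)


def read_full(head, tail):
    """Full-precision read: recombine both segments into the original double."""
    bits = (head.astype(np.uint64) << np.uint64(32)) | tail.astype(np.uint64)
    return bits.view(np.float64)


if __name__ == "__main__":
    x = np.array([0.15, 1.0 / 3.0, 1e-300, -2.5e200])
    head, tail = split_segments(x)
    print("reduced:", read_reduced(head))
    print("full   :", read_full(head, tail))
```

    In this sketch a reduced-precision read moves half as many bytes as a full read and keeps the double-precision exponent, so the relative truncation error stays below 2^-20; reading both segments restores the original values bit for bit, with no second copy of the data.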

    Toward a modular precision ecosystem for high performance computing

    [EN] With the memory bandwidth of current computer architectures being significantly lower than the (floating-point) arithmetic performance, many scientific computations leverage only a fraction of the computational power in today's high-performance architectures. At the same time, memory operations are the primary energy consumer of modern architectures, heavily impacting the resource cost of large-scale applications and the battery life of mobile devices. This article tackles this mismatch between floating-point arithmetic throughput and memory bandwidth by advocating a disruptive paradigm change with respect to how data are stored and processed in scientific applications. Concretely, the goal is to radically decouple the data storage format from the processing format and, ultimately, design a "modular precision ecosystem" that allows for more flexibility in terms of customized data access. For memory-bound scientific applications, dynamically adapting the memory precision to the numerical requirements allows for attractive resource savings. In this article, we demonstrate the potential of employing a modular precision ecosystem for the block-Jacobi preconditioner and the PageRank algorithm, two applications that are popular in their communities and at the same time characteristic representatives of the fields of numerical linear algebra and data analytics, respectively.

    The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Impuls und Vernetzungsfond of the Helmholtz Association under grant VH-NG-1241. G Flegar and ES Quintana-Ortí were supported by project TIN2017-82972-R of the MINECO and FEDER and the H2020 EU FETHPC Project 732631 OPRECOMP.

    Anzt, H.; Flegar, G.; Gruetzmacher, T.; Quintana-Orti, ES. (2019). Toward a modular precision ecosystem for high performance computing. International Journal of High Performance Computing Applications. 33(6):1069-1078. https://doi.org/10.1177/1094342019846547
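
    As a rough illustration of decoupling the storage format from the processing format, the sketch below runs a PageRank power iteration in which all arithmetic is done in float64 while the rank vector is stored in float32 until the residual falls below a threshold, and in float64 afterwards. The function name `pagerank_adaptive`, the parameter `switch_tol`, the use of plain float32 as the reduced storage format, and the restriction to graphs without dangling nodes are all assumptions made for this example; the article's own formats and switching criteria are not given in the abstract.

```python
import numpy as np


def pagerank_adaptive(links, damping=0.85, tol=1e-12, switch_tol=1e-5, max_iter=500):
    """PageRank by power iteration with storage precision adapted to convergence.

    links[i, j] is nonzero if page i links to page j. This sketch assumes
    every page has at least one outgoing link (no dangling nodes).
    Returns the rank vector (float64) and the number of iterations used.
    """
    n = links.shape[0]
    # Column-stochastic transition matrix: entry [j, i] = links[i, j] / outdegree(i).
    transition = links.T / links.sum(axis=1)

    storage = np.float32                      # start with reduced storage precision
    rank = np.full(n, 1.0 / n, dtype=storage)

    for iteration in range(1, max_iter + 1):
        current = rank.astype(np.float64)     # load: storage format -> arithmetic format
        updated = damping * (transition @ current) + (1.0 - damping) / n
        residual = np.abs(updated - current).sum()

        # Once the iteration is close to converged, keep all bits in memory.
        if storage is np.float32 and residual < switch_tol:
            storage = np.float64
        rank = updated.astype(storage)        # store: arithmetic format -> storage format

        if residual < tol:
            return rank.astype(np.float64), iteration
    return rank.astype(np.float64), max_iter


if __name__ == "__main__":
    links = np.array([[0, 1, 1, 0],
                      [0, 0, 1, 0],
                      [1, 0, 0, 1],
                      [0, 0, 1, 0]], dtype=np.float64)
    ranks, iterations = pagerank_adaptive(links)
    print(ranks, iterations)
```

    The early iterations tolerate the rounding introduced by the narrower storage format because the iterate is still far from the fixed point; switching to full storage precision later lets the residual drop to the requested tolerance.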

    Domain-Specific Optimization For Machine Learning System

    Machine learning (ML) systems have become an indispensable part of the ML ecosystem in recent years. The rapid growth of ML brings new system challenges, such as the need to handle larger-scale data and computation and the demand for higher execution performance and lower resource usage, which in turn drive the need to improve ML systems. General-purpose system optimization is widely used but brings limited benefits, because ML applications vary in execution behavior depending on their algorithms, input data, and configurations. It is difficult to perform comprehensive ML system optimizations without application-specific information. Therefore, domain-specific optimization, a method that optimizes particular types of ML applications based on their unique characteristics, is necessary for advanced ML systems. This dissertation performs domain-specific system optimizations for three important ML applications: graph-based applications, SGD-based applications, and Python-based applications. For SGD-based applications, this dissertation proposes a lossy compression scheme for constructing application checkpoints, called LC-Checkpoint. LC-Checkpoint aims to simultaneously maximize the compression rate of checkpoints and reduce the recovery cost of SGD-based training processes. Extensive experiments show that LC-Checkpoint achieves a high compression rate with a lower recovery cost than a state-of-the-art algorithm. For kernel regression applications, this dissertation designs and implements parallel software that targets million-scale datasets. The software is evaluated on two million-scale downstream applications (an equity return forecasting problem on the US stock dataset and an image classification problem on the ImageNet dataset) to demonstrate its efficacy and efficiency. For graph-based applications, this dissertation introduces ATMem, a runtime framework that optimizes application data placement on heterogeneous memory systems. ATMem aims to maximize the utilization of the small-capacity fast memory by placing on it only the critical data regions that yield the highest performance gains. Experimental results show that ATMem achieves a significant speedup over the baseline that places all data on the large-capacity slow memory, while placing only a minority of the data on the fast memory. The future research direction is to adapt ML algorithms to software systems and architectures, binding the design of ML algorithms closely to the implementation of ML systems, in order to achieve optimal solutions for ML applications.

    Fundamentals

    Volume 1 establishes the foundations of this new field. It covers all the steps from data collection, summarization, and clustering to the different aspects of resource-aware learning, i.e., hardware, memory, energy, and communication awareness. Machine learning methods are examined with respect to their resource requirements and to how scalability can be enhanced on diverse computing architectures, ranging from embedded systems to large computing clusters.


    Anales del XIII Congreso Argentino de Ciencias de la Computación (CACIC)

    Contents: Computer architectures; Embedded systems; Service-oriented architectures (SOA); Communication networks; Heterogeneous networks; Advanced networks; Wireless networks; Mobile networks; Active networks; Administration and monitoring of networks and services; Quality of Service (QoS, SLAs); Computer security and authentication, privacy; Infrastructure for digital signatures and digital certificates; Vulnerability analysis and detection; Operating systems; P2P systems; Middleware; Grid infrastructure; Integration services (Web Services or .Net). Red de Universidades con Carreras en Informática (RedUNCI)