369 research outputs found
Preparing sparse solvers for exascale computing
Sparse solvers provide essential functionality for a wide variety of scientific applications. Highly parallel sparse solvers are essential for continuing advances in high-fidelity, multi-physics and multi-scale simulations, especially as we target exascale platforms. This paper describes the challenges, strategies and progress of the US Department of Energy Exascale Computing Project towards providing sparse solvers for exascale computing platforms. We address the demands of systems with thousands of high-performance node devices, where exposing concurrency, hiding latency and creating alternative algorithms become essential. The efforts described here are works in progress, highlighting current successes and upcoming challenges. This article is part of a discussion meeting issue 'Numerical algorithms for high-performance computational science'.
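Since the abstract centers on sparse kernels whose concurrency must be exposed, a minimal illustration may help. The sketch below (my own, not code from the ECP effort) shows the compressed-sparse-row matrix-vector product that dominates Krylov-based sparse solvers; the independent per-row loop is the natural unit of parallelism that such libraries map to threads or GPU blocks.

```python
import numpy as np

def csr_matvec(indptr, indices, data, x):
    """y = A @ x for a matrix stored in Compressed Sparse Row form.

    Row i's nonzeros live in data[indptr[i]:indptr[i+1]]; each row is
    independent, which is what solver libraries parallelize.
    """
    n = len(indptr) - 1
    y = np.zeros(n)
    for i in range(n):
        for k in range(indptr[i], indptr[i + 1]):
            y[i] += data[k] * x[indices[k]]
    return y

# 3x3 example: [[2, 0, 1], [0, 3, 0], [4, 0, 5]]
indptr = np.array([0, 2, 3, 5])
indices = np.array([0, 2, 1, 0, 2])
data = np.array([2.0, 1.0, 3.0, 4.0, 5.0])
print(csr_matvec(indptr, indices, data, np.ones(3)))  # [3. 3. 9.]
```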
Programming parallel dense matrix factorizations and inversion for new-generation NUMA architectures
We propose a methodology to address the programmability issues derived from the emergence of new-generation shared-memory NUMA architectures. For this purpose, we employ dense matrix factorizations and matrix inversion (DMFI) as a use case, and we target two modern architectures (AMD Rome and Huawei Kunpeng 920) that exhibit configurable NUMA topologies. Our methodology pursues performance portability across different NUMA configurations by proposing multi-domain implementations for DMFI plus a hybrid task- and loop-level parallelization that configures multi-threaded executions to fix core-to-data binding, exploiting locality at the expense of minor code modifications. In addition, we introduce a generalization of the multi-domain implementations for DMFI that offers support for virtually any NUMA topology in present and future architectures. Our experimentation on the two target architectures for three representative dense linear algebra operations validates the proposal, reveals insights on the necessity of adapting both the codes and their execution to improve data access locality, and reports performance across architectures and inter- and intra-socket NUMA configurations competitive with state-of-the-art message-passing implementations, maintaining the ease of development usually associated with shared-memory programming.
This research was sponsored by project PID2019-107255GB of Ministerio de Ciencia, Innovación y Universidades; project S2018/TCS-4423 of Comunidad de Madrid; project 2017-SGR-1414 of the Generalitat de Catalunya; and the Madrid Government under the Multiannual Agreement with UCM in the line Program to Stimulate Research for Young Doctors in the context of the V PRICIT, project PR65/19-22445. This project has also received funding from the European High-Performance Computing Joint Undertaking (JU) under grant agreement No 955558.
The JU receives support from the European Union's Horizon 2020 research and innovation programme, and from Spain, Germany, France, Italy, Poland, Switzerland, and Norway. The work is also supported by grants PID2020-113656RB-C22 and PID2021-126576NB-I00 of MCIN/AEI/10.13039/501100011033 and by ERDF "A way of making Europe".
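To make the core-to-data binding idea concrete, here is an illustrative sketch (not the paper's code; the two-domain core map and the helper names are assumptions) of partitioning work by NUMA domain so that each worker touches only the rows its domain owns; on Linux the commented `sched_setaffinity` call would pin each worker to its domain's cores.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical 2-domain topology: the core IDs per NUMA domain are an
# assumption; in practice they are queried from the OS (e.g. `lscpu`).
DOMAIN_CORES = {0: {0, 1}, 1: {2, 3}}

def domain_partition(n_rows, n_domains):
    """Split rows into contiguous per-domain blocks, so that with
    first-touch allocation each domain owns the block it computes on."""
    bounds = np.linspace(0, n_rows, n_domains + 1, dtype=int)
    return [(int(bounds[d]), int(bounds[d + 1])) for d in range(n_domains)]

def worker(A, x, rows, domain):
    # On Linux the worker would first be pinned to its domain's cores:
    #   os.sched_setaffinity(0, DOMAIN_CORES[domain])
    lo, hi = rows
    return A[lo:hi] @ x  # operate only on the locally bound rows

A = np.arange(16.0).reshape(4, 4)
x = np.ones(4)
parts = domain_partition(len(A), 2)
with ThreadPoolExecutor(max_workers=2) as ex:
    chunks = ex.map(worker, [A] * 2, [x] * 2, parts, range(2))
    y = np.concatenate(list(chunks))
print(y)  # same result as A @ x, computed domain-by-domain
```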
Quantum ESPRESSO: a modular and open-source software project for quantum simulations of materials
Quantum ESPRESSO is an integrated suite of computer codes for electronic-structure calculations and materials modeling, based on density-functional theory, plane waves, and pseudopotentials (norm-conserving, ultrasoft, and projector-augmented wave). Quantum ESPRESSO stands for "opEn Source Package for Research in Electronic Structure, Simulation, and Optimization". It is freely available to researchers around the world under the terms of the GNU General Public License. Quantum ESPRESSO builds upon newly restructured electronic-structure codes that have been developed and tested by some of the original authors of novel electronic-structure algorithms and applied in the last twenty years by some of the leading materials modeling groups worldwide. Innovation and efficiency are still its main focus, with special attention paid to massively parallel architectures, and a great effort being devoted to user friendliness. Quantum ESPRESSO is evolving towards a distribution of independent and inter-operable codes in the spirit of an open-source project, where researchers active in the field of electronic-structure calculations are encouraged to participate in the project by contributing their own codes or by implementing their own ideas into existing codes.
Comment: 36 pages, 5 figures, resubmitted to J. Phys.: Condens. Matter
Dense and sparse parallel linear algebra algorithms on graphics processing units
One line of development followed in the field of supercomputing is the use of specific-purpose processors to speed up certain types of computations. In this thesis we study the use of graphics processing units as computer accelerators and apply it to the field of linear algebra. In particular, we work with the SLEPc library to solve large-scale eigenvalue problems, and to apply matrix functions in scientific applications. SLEPc is a parallel library based on the MPI standard and is developed with the premise of being scalable, i.e., allowing larger problems to be solved by increasing the number of processing units.
We address the linear eigenvalue problem, Ax = lambda x in its standard form, using iterative techniques, in particular Krylov methods, with which we compute a small portion of the eigenvalue spectrum. This type of algorithm is based on generating a subspace of reduced size (m) onto which the large problem (of dimension n, with m << n) is projected. Once the problem has been projected, it is solved by direct methods, which provide approximations to the eigenvalues of the initial problem. The operations used in the expansion of the subspace vary depending on whether the desired eigenvalues lie in the exterior or in the interior of the spectrum. When searching for exterior eigenvalues, the expansion is done by matrix-vector multiplications. We perform this operation on the GPU, either by using libraries or by creating functions that exploit the structure of the matrix. For eigenvalues in the interior of the spectrum, the expansion requires solving linear systems of equations. In this thesis we implement several algorithms, run on the GPU, to solve linear systems of equations for the specific case of matrices with block-tridiagonal structure.
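The projection scheme described above can be sketched in a few lines (plain NumPy, not SLEPc): an m-step Arnoldi process expands the Krylov subspace via matrix-vector products, and the eigenvalues of the small projected matrix approximate exterior eigenvalues of the large one.

```python
import numpy as np

def arnoldi(A, v0, m):
    """Build an orthonormal Krylov basis V and projected Hessenberg
    matrix H with A @ V[:, :m] ≈ V @ H; the eigenvalues of H[:m, :m]
    (Ritz values) approximate exterior eigenvalues of A."""
    n = len(v0)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = v0 / np.linalg.norm(v0)
    for j in range(m):
        w = A @ V[:, j]                 # subspace expansion: the matvec
        for i in range(j + 1):          # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:         # happy breakdown: invariant subspace
            break
        V[:, j + 1] = w / H[j + 1, j]
    return V, H

# Diagonal test matrix: the dominant eigenvalue 10 is captured quickly.
A = np.diag([10.0, 5.0, 1.0, 0.5])
V, H = arnoldi(A, np.ones(4), 3)
ritz = np.linalg.eigvals(H[:3, :3])
print(max(ritz.real))  # close to 10
```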
In the computation of matrix functions we have to distinguish between the direct application of a matrix function, f(A), and the action of a matrix function on a vector, f(A)b. The first case involves a dense computation that limits the size of the problem. The second allows us to work with large sparse matrices, and to solve it we also make use of Krylov methods. The expansion of the subspace is done by matrix-vector multiplications, and we use GPUs in the same way as when solving eigenvalue problems. In this case the projected problem starts with size m, but grows by m at each restart of the method. The projected problem is solved by applying a matrix function directly. We have implemented several algorithms to compute the matrix square root and exponential functions, in which the use of GPUs speeds up the computation.
Lamas Daviña, A. (2018). Dense and sparse parallel linear algebra algorithms on graphics processing units [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/112425
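The Krylov approach to f(A)b can be sketched as follows (an illustration in NumPy/SciPy, not the thesis implementation, and without restarts): project b onto a small Krylov subspace, then apply the dense matrix function directly to the projected matrix.

```python
import numpy as np
from scipy.linalg import expm

def krylov_expm_action(A, b, m):
    """Approximate exp(A) @ b: build an m-step Arnoldi basis, then apply
    the matrix function densely to the small projected matrix H."""
    n = len(b)
    beta = np.linalg.norm(b)
    V = np.zeros((n, m + 1))
    H = np.zeros((m + 1, m))
    V[:, 0] = b / beta
    for j in range(m):
        w = A @ V[:, j]
        for i in range(j + 1):          # modified Gram-Schmidt
            H[i, j] = V[:, i] @ w
            w -= H[i, j] * V[:, i]
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] < 1e-12:         # happy breakdown
            break
        V[:, j + 1] = w / H[j + 1, j]
    e1 = np.zeros(m); e1[0] = 1.0
    return beta * V[:, :m] @ (expm(H[:m, :m]) @ e1)  # dense f on small H

A = np.diag([-1.0, -2.0, -3.0])
b = np.ones(3)
y = krylov_expm_action(A, b, 3)   # m = n here, so exact up to rounding
print(y)  # ≈ [exp(-1), exp(-2), exp(-3)]
```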
Solving "Large-Scale" Matrix Problems on Multicore Processors and GPUs
Few realize that, for large matrices, many dense matrix computations achieve nearly the same performance when the matrices are stored on disk as when they are stored in a very large main memory. Similarly, few realize that, given the right programming abstractions, coding Out-of-Core (OOC) implementations of dense linear algebra operations (where data resides on disk and has to be explicitly moved in and out of main memory) is no more difficult than programming high-performance implementations for the case where the matrix is in memory. Finally, few realize that on a contemporary eight-core architecture or a platform equipped with a graphics processor (GPU) one can solve a 100,000 × 100,000 symmetric positive definite linear system in about one hour. Thus, for problems that used to be considered large, it is not necessary to use distributed-memory architectures with massive memories if one is willing to wait longer for the solution to be computed on a fast multithreaded architecture such as a multi-core computer or a GPU. This paper provides evidence in support of these claims.
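As a concrete toy illustration of the OOC idea (my own sketch using a NumPy memmap as the "disk", not the paper's runtime), a blocked right-looking Cholesky can stream one block at a time through memory, explicitly reading and writing back each block it touches:

```python
import os
import tempfile
import numpy as np

def ooc_cholesky(path, n, bs):
    """Blocked in-place Cholesky (lower factor) of an SPD matrix stored
    on disk; only small blocks are explicitly read into memory and
    written back, the essence of an Out-of-Core algorithm."""
    A = np.memmap(path, dtype=np.float64, mode='r+', shape=(n, n))
    for k in range(0, n, bs):
        kb = min(bs, n - k)
        L = np.linalg.cholesky(np.array(A[k:k+kb, k:k+kb]))  # read block
        A[k:k+kb, k:k+kb] = L                                # write factor
        for i in range(k + kb, n, bs):
            ib = min(bs, n - i)
            Aik = np.array(A[i:i+ib, k:k+kb])
            Lik = np.linalg.solve(L, Aik.T).T    # panel: Lik @ L.T = Aik
            A[i:i+ib, k:k+kb] = Lik
            for j in range(k + kb, i + ib, bs):  # trailing (lower) update
                jb = min(bs, n - j)
                Ljk = np.array(A[j:j+jb, k:k+kb])
                A[i:i+ib, j:j+jb] -= Lik @ Ljk.T
        A.flush()

# Build a small SPD test matrix and store it on disk.
rng = np.random.default_rng(0)
M = rng.standard_normal((6, 6))
S = M @ M.T + 6.0 * np.eye(6)
path = tempfile.NamedTemporaryFile(suffix='.bin', delete=False).name
mm = np.memmap(path, dtype=np.float64, mode='w+', shape=(6, 6))
mm[:] = S
mm.flush(); del mm

ooc_cholesky(path, n=6, bs=4)
mm2 = np.memmap(path, dtype=np.float64, mode='r', shape=(6, 6))
Lres = np.tril(np.array(mm2))
del mm2
os.remove(path)  # clean up the scratch file
print(np.allclose(Lres, np.linalg.cholesky(S)))  # True
```

The same structure scales to matrices far larger than memory: only the panel being factored and one trailing block are ever resident at once.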
Programming matrix algorithms-by-blocks for thread-level parallelism
With the emergence of thread-level parallelism as the primary means for continued improvement of performance, the programmability issue has reemerged as an obstacle to the use of architectural advances. We argue that evolving legacy libraries for dense and banded linear algebra is not a viable solution due to constraints imposed by early design decisions. We propose a philosophy of abstraction and separation of concerns that provides a promising solution in this problem domain. The first abstraction, FLASH, allows algorithms to express computation with matrices consisting of blocks, facilitating algorithms-by-blocks. Transparent to the library implementor, operand descriptions are registered for a particular operation a priori. A runtime system, SuperMatrix, uses this information to identify data dependencies between suboperations, allowing them to be scheduled to threads out-of-order and executed in parallel. But not all classical algorithms in linear algebra lend themselves to conversion to algorithms-by-blocks. We show how our recently proposed LU factorization with incremental pivoting and the closely related algorithm-by-blocks for the QR factorization, both originally designed for out-of-core computation, overcome this difficulty. Anecdotal evidence regarding the development of routines with a core functionality demonstrates how the methodology supports high productivity, while experimental results suggest that high performance is abundantly achievable.
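The dependency analysis can be sketched with a toy runtime (an illustration of the idea, not SuperMatrix's actual interface; all names here are made up): each task registers the blocks it reads and writes, the runtime derives the implied dependencies, and any task whose dependencies have completed may run, out of order, on a thread pool.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class Task:
    def __init__(self, fn, reads, writes):
        self.fn, self.reads, self.writes = fn, set(reads), set(writes)
        self.deps = set()

def build_dag(tasks):
    """Derive RAW/WAR/WAW dependencies from the declared block accesses,
    mirroring the a-priori operand registration described above."""
    writers, readers = {}, {}
    for t in tasks:
        for b in t.reads | t.writes:
            if b in writers:
                t.deps.add(writers[b])         # read/write-after-write
        for b in t.writes:
            t.deps.update(readers.get(b, ()))  # write-after-read
        for b in t.reads:
            readers.setdefault(b, set()).add(t)
        for b in t.writes:
            writers[b], readers[b] = t, set()
    return tasks

def run(tasks, workers=4):
    """Execute the DAG: a task becomes eligible the moment its last
    dependency finishes, so independent tasks overlap."""
    remaining = {t: set(t.deps) for t in tasks}
    dependents = {t: [] for t in tasks}
    for t in tasks:
        for d in t.deps:
            dependents[d].append(t)
    lock, all_done = threading.Lock(), threading.Event()
    n_done = [0]
    with ThreadPoolExecutor(max_workers=workers) as ex:
        def finish(t):
            t.fn()
            ready = []
            with lock:
                n_done[0] += 1
                for s in dependents[t]:
                    remaining[s].discard(t)
                    if not remaining[s]:
                        ready.append(s)
                if n_done[0] == len(tasks):
                    all_done.set()
            for s in ready:
                ex.submit(finish, s)
        for t in tasks:
            if not t.deps:
                ex.submit(finish, t)
        all_done.wait()

# Tasks of a 2x2 blocked Cholesky; the chain is discovered, not hand-coded.
log, log_lock = [], threading.Lock()
def op(name):
    def fn():
        with log_lock:
            log.append(name)
    return fn

tasks = build_dag([
    Task(op("potrf0"), reads=[],      writes=["A00"]),
    Task(op("trsm"),   reads=["A00"], writes=["A10"]),
    Task(op("syrk"),   reads=["A10"], writes=["A11"]),
    Task(op("potrf1"), reads=["A11"], writes=["A11"]),
])
run(tasks)
print(log)  # ['potrf0', 'trsm', 'syrk', 'potrf1']
```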
Computed tomography medical image reconstruction on affordable equipment by using Out-Of-Core techniques
Background and objective: As Computed Tomography scans are an essential medical test, many techniques have been proposed to reconstruct high-quality images using a smaller amount of radiation. One approach is to employ algebraic factorization methods to reconstruct the images, using fewer views than the traditional analytical methods. However, their main drawback is the high computational cost and hence the time needed to obtain the images, which is critical in daily clinical practice. For this reason, faster methods for solving this problem are required.
Methods: In this paper, we propose a new reconstruction method based on the QR factorization that is very efficient on affordable equipment (standard multicore processors and standard Solid-State Drives) by using Out-Of-Core techniques.
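The core numerical step can be illustrated in a few lines (a toy sketch with a random system matrix, not the paper's pipeline): factor the system matrix once with QR, after which each reconstruction reduces to a matrix-vector product and a cheap triangular solve.

```python
import numpy as np

# Toy system: W maps a flattened 2x2 "image" x to 6 projection values p.
# (W here is random; a real system matrix encodes ray-pixel intersections.)
rng = np.random.default_rng(42)
x_true = np.array([1.0, 0.5, 0.25, 0.75])   # flattened 2x2 image
W = rng.standard_normal((6, 4))             # overdetermined: views > pixels
p = W @ x_true                              # simulated projections

# QR-based reconstruction: the factorization is computed once; every
# new scan then needs only Q.T @ p plus a small triangular solve.
Q, R = np.linalg.qr(W)                      # reduced QR, R is 4x4
x_rec = np.linalg.solve(R, Q.T @ p)
print(np.allclose(x_rec, x_true))           # True (noise-free data)
```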
Results: Combining both affordable hardware and the new software proposed in our work, the images can be reconstructed very quickly and with high quality. We analyze the reconstructions using real Computed Tomography images selected from a dataset, comparing the QR method to LSQR and FBP. We measure the quality of the images using the Peak Signal-To-Noise Ratio and Structural Similarity Index metrics, obtaining very high values. We also compare the efficiency of using spinning disks versus Solid-State Drives, showing how the latter performs the Input/Output operations in a significantly lower amount of time.
Conclusions: The results indicate that our proposed method and software are valid to efficiently solve large-scale systems and can be applied to the Computed Tomography reconstruction problem to obtain high-quality images.
This research has been supported by "Universitat Politecnica de Valencia" and "Generalitat Valenciana" under PROMETEO/2018/035 and ACIF/2017/075, co-financed by FEDER and FSE funds, and by the "Spanish Ministry of Science, Innovation and Universities" under Grant RTI2018-098156-B-C54, co-financed by FEDER funds.
Chillarón-Pérez, M.; Quintana Ortí, G.; Vidal-Gimeno, V.; Verdú Martín, G. J. (2020). Computed tomography medical image reconstruction on affordable equipment by using Out-Of-Core techniques. Computer Methods and Programs in Biomedicine, 193:1-11. https://doi.org/10.1016/j.cmpb.2020.105488
Performance and Energy Optimization of the Iterative Solution of Sparse Linear Systems on Multicore Processors
In this dissertation we target the solution of large sparse systems of linear equations using preconditioned iterative methods based on Krylov subspaces. Specifically, we focus on ILUPACK, a library that offers multilevel ILU preconditioners for the effective solution of sparse linear systems.
The increase in the number of equations and the introduction of new HPC architectures motivate us to develop a parallel version of ILUPACK that optimizes both execution time and energy consumption on current multicore architectures and on clusters of nodes built from this type of technology. Thus, the main goal of this thesis is the design, implementation and evaluation of parallel and energy-efficient iterative sparse linear system solvers for multicore processors as well as recent manycore accelerators such as the Intel Xeon Phi. To fulfill this objective, we optimize ILUPACK by exploiting task parallelism via OmpSs and MPI, and we also develop an automatic framework to detect energy inefficiencies.
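The preconditioned-Krylov setting can be sketched with SciPy's single-level incomplete LU standing in for ILUPACK's multilevel preconditioner (an illustration of the technique, not ILUPACK itself):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# A standard sparse SPD test problem: the 1-D Poisson matrix.
n = 100
A = sp.diags([-1.0, 2.0, -1.0], offsets=[-1, 0, 1],
             shape=(n, n), format='csc')
b = np.ones(n)

# Incomplete LU factorization, applied as a preconditioner M ~= A^{-1};
# each preconditioner application is a pair of sparse triangular solves.
ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)
M = spla.LinearOperator((n, n), matvec=ilu.solve)

x, info = spla.gmres(A, b, M=M)   # info == 0 signals convergence
residual = np.linalg.norm(A @ x - b)
print(info, residual)
```

With a good preconditioner the iteration count drops sharply, which is precisely the trade-off (setup cost versus iteration cost, and its energy footprint) that the thesis studies.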