90 research outputs found

    Dense matrix computations on NUMA architectures with distance-aware work stealing

    Get PDF
    We employ the dynamic runtime system OmpSs to decrease the overhead of data motion in the now ubiquitous non-uniform memory access (NUMA) high concurrency environment of multicore processors. The dense numerical linear algebra algorithms of Cholesky factorization and symmetric matrix inversion are employed as representative benchmarks. Work stealing occurs within an innovative NUMA-aware scheduling policy to reduce data movement between NUMA nodes. The overall approach achieves separation of concerns by abstracting the complexity of the hardware from the end users so that high productivity can be achieved. Performance results on a large NUMA system outperform the state-of-the-art existing implementations up to a two fold speedup for the Cholesky factorization, as well as the symmetric matrix inversion, while the OmpSs-enabled code maintains strong similarity to its original sequential version.The authors would like to thank the National Institute for Computational Sciences for granting us access on the Nautilus system. The KAUST authors acknowledge support of the Extreme Computing Research Center. The BSC-affiliated authors thankfully acknowledges the support of the European Commission through the HiPEAC-3 Network of Excellence (FP7-ICT 287759), Intel-BSC Exascale Lab and IBM/BSC Exascale Initiative collaboration, Spanish Ministry of Education (FPU), Computación de Altas Prestaciones VI (TIN2012-34557), Generalitat de Catalunya (2014-SGR-1051) and the grant SEV-2011-00067 of the Severo Ochoa Program.Peer ReviewedPostprint (published version

    The fast multipole method at exascale

    Get PDF
    This thesis presents a top to bottom analysis on designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N- body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme scale systems. We specifically address two key challenges. The first challenge is how to engineer fast code for today’s platforms. We present the first in-depth study of multicore op- timizations and tuning for FMM, along with a systematic approach for transforming a conventionally-parallelized FMM into a highly-tuned one. We introduce novel opti- mizations that significantly improve the within-node scalability of the FMM, thereby enabling high-performance in the face of multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter- node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, if there are no significant changes could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of highlevel algorithm-architecture co-design. To demonstrate the scientific significance of FMM, we present two applications namely, direct simulation of blood which is a multi-scale multi-physics problem and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastruc- ture for the direct numerical simulation of blood. It comprises of two key algorithmic components of which FMM is one. We were able to simulate blood flow using Stoke- sian dynamics on 200,000 cores of Jaguar, a peta-flop system and achieve a sustained performance of 0.7 Petaflop/s. The second application we propose as future work in this thesis is biomolecular electrostatics where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is dense matrix vector multiply which we propose can be calculated using our scalable FMM. We propose to begin with the two dielectric problem where the electrostatic field is cal- culated using two continuum dielectric medium, the solvent and the molecule. This is only a first step to solving biologically challenging problems which have more than two dielectric medium, ion-exclusion layers, and solvent filled cavities. Finally, given the difficulty in producing high-performance scalable code, productivity is a key concern. Recently, numerical algorithms are being redesigned to take advantage of the architectural features of emerging multicore processors. These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model, called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially-ordered by semantic scheduling constraints. The CnC model is well-suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art mul- ticore systems. Our implementations in CnC was able to match and in some cases even exceed competing vendor-tuned and domain specific library codes. We combine these two distinct research efforts by expressing FMM in CnC, our approach tries to marry performance with productivity that will be critical on future systems. Looking forward, we would like to extend this to distributed memory machines, specifically implement FMM in the new distributed CnC, distCnC to express fine-grained paral- lelism which would require significant effort in alternative models.Ph.D

    Knowledge-Based Automatic Generation of Linear Algebra Algorithms and Code

    Get PDF
    This dissertation focuses on the design and the implementation of domain-specific compilers for linear algebra matrix equations. The development of efficient libraries for such equations, which lie at the heart of most software for scientific computing, is a complex process that requires expertise in a variety of areas, including the application domain, algorithms, numerical analysis and high-performance computing. Moreover, the process involves the collaboration of several people for a considerable amount of time. With our compilers, we aim to relieve the developers from both designing algorithms and writing code, and to generate routines that match or even surpass the performance of those written by human experts.Comment: Dissertatio

    Parallel architectures and runtime systems co-design for task-based programming models

    Get PDF
    The increasing parallelism levels in modern computing systems has extolled the need for a holistic vision when designing multiprocessor architectures taking in account the needs of the programming models and applications. Nowadays, system design consists of several layers on top of each other from the architecture up to the application software. Although this design allows to do a separation of concerns where it is possible to independently change layers due to a well-known interface between them, it is hampering future systems design as the Law of Moore reaches to an end. Current performance improvements on computer architecture are driven by the shrinkage of the transistor channel width, allowing faster and more power efficient chips to be made. However, technology is reaching physical limitations were the transistor size will not be able to be reduced furthermore and requires a change of paradigm in systems design. This thesis proposes to break this layered design, and advocates for a system where the architecture and the programming model runtime system are able to exchange information towards a common goal, improve performance and reduce power consumption. By making the architecture aware of runtime information such as a Task Dependency Graph (TDG) in the case of dataflow task-based programming models, it is possible to improve power consumption by exploiting the critical path of the graph. Moreover, the architecture can provide hardware support to create such a graph in order to reduce the runtime overheads and making possible the execution of fine-grained tasks to increase the available parallelism. Finally, the current status of inter-node communication primitives can be exposed to the runtime system in order to perform a more efficient communication scheduling, and also creates new opportunities of computation and communication overlap that were not possible before. An evaluation of the proposals introduced in this thesis is provided and a methodology to simulate and characterize the application behavior is also presented.El aumento del paralelismo proporcionado por los sistemas de cómputo modernos ha provocado la necesidad de una visión holística en el diseño de arquitecturas multiprocesador que tome en cuenta las necesidades de los modelos de programación y las aplicaciones. Hoy en día el diseño de los computadores consiste en diferentes capas de abstracción con una interfaz bien definida entre ellas. Las limitaciones de esta aproximación junto con el fin de la ley de Moore limitan el potencial de los futuros computadores. La mayoría de las mejoras actuales en el diseño de los computadores provienen fundamentalmente de la reducción del tamaño del canal del transistor, lo cual permite chips más rápidos y con un consumo eficiente sin apenas cambios fundamentales en el diseño de la arquitectura. Sin embargo, la tecnología actual está alcanzando limitaciones físicas donde no será posible reducir el tamaño de los transistores motivando así un cambio de paradigma en la construcción de los computadores. Esta tesis propone romper este diseño en capas y abogar por un sistema donde la arquitectura y el sistema de tiempo de ejecución del modelo de programación sean capaces de intercambiar información para alcanzar una meta común: La mejora del rendimiento y la reducción del consumo energético. Haciendo que la arquitectura sea consciente de la información disponible en el modelo de programación, como puede ser el grafo de dependencias entre tareas en los modelos de programación dataflow, es posible reducir el consumo energético explotando el camino critico del grafo. Además, la arquitectura puede proveer de soporte hardware para crear este grafo con el objetivo de reducir el overhead de construir este grado cuando la granularidad de las tareas es demasiado fina. Finalmente, el estado de las comunicaciones entre nodos puede ser expuesto al sistema de tiempo de ejecución para realizar una mejor planificación de las comunicaciones y creando nuevas oportunidades de solapamiento entre cómputo y comunicación que no eran posibles anteriormente. Esta tesis aporta una evaluación de todas estas propuestas, así como una metodología para simular y caracterizar el comportamiento de las aplicacionesPostprint (published version

    Toward Efficient Scheduling for Parallel Real-Time Tasks on Multiprocessors

    Get PDF
    Modern real-time applications are becoming more demanding computationally while their temporal requirements, dictated by the physical world, often remain unchanged. This coupled with the increasing prevalence of multiprocessors in real-time systems necessitates that highly computation-demanding real-time tasks need to be parallelized to exploit the parallelism offered by the underlying hardware, in order to satisfy their temporal constraints. Scheduling parallel real-time tasks, however, introduces a new layer of complexity due to the allowance for intra-task parallelism. This dissertation addresses the problem of scheduling parallel real-time tasks in which tasks may (or may not) access shared non-processor resources, such as in-memory buffers or data structures. Specifically, for independent tasks, we propose new scheduling algorithms and schedulability analyses for parallel tasks with these characteristics, under federated and global scheduling.Experimental results show that the proposed algorithms and analyses improve the previously introduced methods. For parallel tasks that may access shared non-processor resources, we present a blocking analysis for two different types of spinlocks; through evaluations, we make a recommendation for a preferable ordering of locks. We also study practical runtime parallel scheduler designs for soft real-time applications and present a design that is more suitable for soft real-time systems

    Adaptive Baseband Pro cessing and Configurable Hardware for Wireless Communication

    Get PDF
    The world of information is literally at one’s fingertips, allowing access to previously unimaginable amounts of data, thanks to advances in wireless communication. The growing demand for high speed data has necessitated theuse of wider bandwidths, and wireless technologies such as Multiple-InputMultiple-Output (MIMO) have been adopted to increase spectral efficiency.These advanced communication technologies require sophisticated signal processing, often leading to higher power consumption and reduced battery life.Therefore, increasing energy efficiency of baseband hardware for MIMO signal processing has become extremely vital. High Quality of Service (QoS)requirements invariably lead to a larger number of computations and a higherpower dissipation. However, recognizing the dynamic nature of the wirelesscommunication medium in which only some channel scenarios require complexsignal processing, and that not all situations call for high data rates, allowsthe use of an adaptive channel aware signal processing strategy to provide adesired QoS. Information such as interference conditions, coherence bandwidthand Signal to Noise Ratio (SNR) can be used to reduce algorithmic computations in favorable channels. Hardware circuits which run these algorithmsneed flexibility and easy reconfigurability to switch between multiple designsfor different parameters. These parameters can be used to tune the operations of different components in a receiver based on feedback from the digitalbaseband. This dissertation focuses on the optimization of digital basebandcircuitry of receivers which use feedback to trade power and performance. Aco-optimization approach, where designs are optimized starting from the algorithmic stage through the hardware architectural stage to the final circuitimplementation is adopted to realize energy efficient digital baseband hardwarefor mobile 4G devices. These concepts are also extended to the next generation5G systems where the energy efficiency of the base station is improved.This work includes six papers that examine digital circuits in MIMO wireless receivers. Several key blocks in these receiver include analog circuits thathave residual non-linearities, leading to signal intermodulation and distortion.Paper-I introduces a digital technique to detect such non-linearities and calibrate analog circuits to improve signal quality. The concept of a digital nonlinearity tuning system developed in Paper-I is implemented and demonstratedin hardware. The performance of this implementation is tested with an analogchannel select filter, and results are presented in Paper-II. MIMO systems suchas the ones used in 4G, may employ QR Decomposition (QRD) processors tosimplify the implementation of tree search based signal detectors. However,the small form factor of the mobile device increases spatial correlation, whichis detrimental to signal multiplexing. Consequently, a QRD processor capableof handling high spatial correlation is presented in Paper-III. The algorithm and hardware implementation are optimized for carrier aggregation, which increases requirements on signal processing throughput, leading to higher powerdissipation. Paper-IV presents a method to perform channel-aware processingwith a simple interpolation strategy to adaptively reduce QRD computationcount. Channel properties such as coherence bandwidth and SNR are used toreduce multiplications by 40% to 80%. These concepts are extended to usetime domain correlation properties, and a full QRD processor for 4G systemsfabricated in 28 nm FD-SOI technology is presented in Paper-V. The designis implemented with a configurable architecture and measurements show thatcircuit tuning results in a highly energy efficient processor, requiring 0.2 nJ to1.3 nJ for each QRD. Finally, these adaptive channel-aware signal processingconcepts are examined in the scope of the next generation of communicationsystems. Massive MIMO systems increase spectral efficiency by using a largenumber of antennas at the base station. Consequently, the signal processingat the base station has a high computational count. Paper-VI presents a configurable detection scheme which reduces this complexity by using techniquessuch as selective user detection and interpolation based signal processing. Hardware is optimized for resource sharing, resulting in a highly reconfigurable andenergy efficient uplink signal detector

    Enhanced scheduling techniques through lightweight monitoring for OmpSs-2

    Get PDF
    Manually tuning applications to achieve optimal configurations prevails as the default technique. This raises issues, since these configurations depend on many aspects. To solve them, we propose extending runtimes with tools to obtain and predict metrics in order to enhance scheduling policies

    Analysis and Design of Communication Avoiding Algorithms for Out of Memory(OOM) SVD

    Get PDF
    Many applications — including big data analytics, information retrieval, gene expression analysis, and numerical weather prediction – require the solution of large, dense singular value decomposition (SVD). The size of matrices used in many of these applications is becoming too large to fit into into a computer’s main memory at one time, and the traditional SVD algorithms that require all the matrix components to be loaded into memory before computation starts cannot be used directly. Moving data (communication) between levels of memory hierarchy and the disk exposes extra challenges to design SVD for such big matrices because of the exponential growth in the gap between floating-point arithmetic rate and bandwidth for many different storage devices on modern high performance computers. In this dissertation, we have analyzed communication overhead on hierarchical memory systems and disks for SVD algorithms and designed communication-avoiding (CA) Out of Memory (OOM) SVD algorithms. By Out of Memory we mean that the matrix is too big to fit in the main memory and therefore must reside in external or internal storage. We have studied communication overhead for classical one-stage blocked SVD and two-stage tiled SVD algorithms and proposed our OOM SVD algorithm, which reduces the communication cost. We have presented theoretical analysis and strategies to design CA OOM SVD algorithms, developed optimized implementation of CA OOM SVD for multicore architecture, and presented its performance results. When matrices are tall, performance of OOM SVD can be improved significantly by carrying out QR decomposition on the original matrix in the first place. The upper triangular matrix generated by QR decomposition may fit in the main memory, and in-core SVD can be used efficiently. Even if the upper triangular matrix does not fit in the main memory, OOM SVD will work on a smaller matrix. That is why we have analyzed communication reduction for OOM QR algorithm, implemented optimized OOM tiled QR for multicore systems and showed performance improvement of OOM SVD algorithms for tall matrices
    corecore