    Science Driven Supercomputing Architectures: Analyzing Architectural Bottlenecks with Applications and Benchmark Probes

    Emerging trends such as growing architectural diversity and increased emphasis on energy and power efficiency motivate the need for code that adapts to its execution context (input dataset and target architecture). Unfortunately, writing such code remains difficult, and is typically attempted only by a small group of motivated expert programmers who are highly knowledgeable about the relationship between software and its hardware mapping. In this dissertation, we introduce novel abstractions and techniques based on automatic performance tuning that enable both experts and nonexperts (application developers) to produce adaptive code. We present two new frameworks for adaptive programming: Nitro and Surge. Nitro enables expert programmers to specify code variants, or alternative implementations of the same computation, together with meta-information for selecting among them. It then utilizes supervised classification to select an optimal code variant at runtime based on characteristics of the execution context. Surge, on the other hand, provides a high-level nested data-parallel programming interface for application developers to specify computations. It then employs a two-level mechanism to automatically generate code variants and then tunes them using Nitro. The resulting code performs on par with or better than handcrafted reference implementations on both CPUs and GPUs. In addition to abstractions for expressing code variants, this dissertation also presents novel strategies for adaptively tuning them. First, we introduce a technique for dynamically selecting an optimal code variant at runtime based on characteristics of the input dataset. On five high-performance GPU applications, variants tuned using this strategy achieve over 93% of the performance of variants selected through exhaustive search.
Next, we present a novel approach based on multitask learning to develop a code variant selection model on a target architecture from training on different source architectures. We evaluate this approach on a set of six benchmark applications and a collection of six NVIDIA GPUs from three distinct architecture generations. Finally, we implement support for combined code variant and frequency selection based on multiple objectives, including power and energy efficiency. Using this strategy, we construct a GPU sorting implementation that provides improved energy and power efficiency with a less than proportional drop in sorting throughput.
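
The idea of selecting among code variants with a supervised classifier can be illustrated with a small sketch. This is not the Nitro API: the two sorting variants, the `features` function, the comparison-count cost measure, and the 1-nearest-neighbor classifier are all assumptions chosen to keep the example short and deterministic. Training labels each sample input with the cheaper variant (found by running both), and at runtime the classifier picks a variant from input features alone.

```python
def insertion_sort(data):
    """Variant A: cheap on nearly sorted input. Returns (sorted, comparisons)."""
    out, comparisons = list(data), 0
    for i in range(1, len(out)):
        key, j = out[i], i - 1
        while j >= 0:
            comparisons += 1
            if out[j] <= key:
                break
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = key
    return out, comparisons


def merge_sort(data):
    """Variant B: steady O(n log n) cost. Returns (sorted, comparisons)."""
    if len(data) <= 1:
        return list(data), 0
    mid = len(data) // 2
    left, left_cost = merge_sort(data[:mid])
    right, right_cost = merge_sort(data[mid:])
    merged, comparisons, i, j = [], left_cost + right_cost, 0, 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    return merged + left[i:] + right[j:], comparisons


# Hypothetical variant registry (illustrative names only).
VARIANTS = {"insertion": insertion_sort, "merge": merge_sort}


def features(data):
    """Hypothetical input feature: fraction of adjacent pairs already in order."""
    in_order = sum(1 for a, b in zip(data, data[1:]) if a <= b)
    return (in_order / max(len(data) - 1, 1),)


def best_variant(data):
    """Exhaustive search used only at training time: run every variant."""
    costs = {name: fn(data)[1] for name, fn in VARIANTS.items()}
    return min(costs, key=costs.get)


def train(n=64, step=8):
    """Label inputs of varying disorder (reversed prefix of length k)."""
    model = []
    for k in range(0, n + 1, step):
        data = list(range(k - 1, -1, -1)) + list(range(k, n))
        model.append((features(data), best_variant(data)))
    return model


def select_variant(model, data):
    """1-nearest-neighbor classification in feature space."""
    f = features(data)
    _, label = min(model, key=lambda ex: sum((a - b) ** 2 for a, b in zip(ex[0], f)))
    return label


def adaptive_sort(model, data):
    """Run whichever variant the classifier predicts to be cheapest."""
    return VARIANTS[select_variant(model, data)](data)[0]
```

A call such as `adaptive_sort(train(), values)` never runs both variants at runtime; the exhaustive comparison happens only while building the training set.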

    A Fortran Kernel Generation Framework for Scientific Legacy Code

    Quality assurance is an essential part of software development. Complex module structure impedes testing and further development. For complex and poorly designed scientific software, module developers and software testers must spend considerable extra effort monitoring the impact of unrelated modules and testing the constraints of the whole system. In addition, widely used benchmarks cannot give programmers an accurate, program-specific evaluation of system performance. In this situation, generated kernels can provide considerable insight for performance tuning. Therefore, to improve the productivity of scientific software engineering tasks such as performance tuning, debugging, and verification of simulation results, we developed a prototype platform that automatically extracts compute kernels from complex legacy scientific code. In addition, because scientific research and experiments require long-running simulations and large data transfers, we apply message-passing-based parallelization and I/O behavior optimization to improve the performance of the kernel extractor framework, and then use profiling tools to guide the parallel distribution. Abnormal event detection is another important aspect of scientific research; when huge observational datasets are combined with simulation results, it becomes not only essential but also extremely difficult. In this dissertation, to detect both high-frequency and low-frequency events, we reconfigured the framework and equipped it with an in-situ data transfer infrastructure. By combining a signal-processing preprocessing step (decimation) with a machine learning detection model trained on the stream data, our framework significantly decreases the amount of data that must be transferred for concurrent data analysis (between distributed CPU/GPU computing nodes).
Finally, the dissertation presents the implementation of the framework and a case study of the ACME Land Model (ALM) for demonstration. The generated compute kernels, which are cheaper to run than the full code, can be used in performance tuning experiments and quality assurance, including debugging legacy code, verifying simulation results by tracking variables at single and multiple points, collaborating with compiler vendors, and generating custom benchmark tests.
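
The decimation step mentioned above reduces a data stream before it is transferred. A minimal sketch of the general technique follows, assuming a simple block average as the low-pass filter; the function name `decimate` and the choice of filter are illustrative and are not taken from the framework itself.

```python
def decimate(signal, factor):
    """Low-pass filter by block averaging, then keep one value per block.

    Averaging each block of `factor` samples acts as a crude anti-aliasing
    filter; emitting one value per block downsamples by `factor`.
    A trailing partial block is dropped.
    """
    if factor < 1:
        raise ValueError("factor must be a positive integer")
    reduced = []
    for start in range(0, len(signal) - factor + 1, factor):
        block = signal[start:start + factor]
        reduced.append(sum(block) / factor)
    return reduced
```

With `factor = 4`, only a quarter of the samples need to cross the network, at the cost of losing frequency content above the new sampling limit.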

    Advances in uncertainty modelling : from epistemic uncertainty estimation to generalized generative flow networks

    Decision-making problems often occur under uncertainty, encompassing both aleatoric uncertainty arising from inherent randomness in processes and epistemic uncertainty due to limited knowledge. This thesis explores the concept of uncertainty, a crucial aspect of machine learning and a key factor for rational agents to determine where to allocate their resources for achieving the best possible results. Traditionally, uncertainty is encoded in a posterior distribution, obtained by approximate Bayesian inference techniques. This thesis's first set of contributions revolves around the mathematical properties of generative flow networks, which are probabilistic models over discrete sequences and amortized samplers of unnormalized probability distributions. Generative flow networks find applications in Bayesian inference and can be used for uncertainty estimation. Additionally, they are helpful for search problems in large compositional spaces. Beyond strengthening their underlying mathematical framework, the thesis provides a comparative study with hierarchical variational methods, shedding light on the significant advantages of generative flow networks, both from a theoretical point of view and via diverse experiments. These contributions include a theory extending generative flow networks to continuous or more general spaces, which allows modelling the Bayesian posterior and uncertainty in many interesting settings. The theory is experimentally validated in various domains. This thesis's second line of work is about alternative measures of epistemic uncertainty beyond posterior modelling.
The presented method, called Direct Epistemic Uncertainty Estimation (DEUP), overcomes a major shortcoming of approximate Bayesian inference techniques caused by model misspecification. DEUP relies on maintaining a secondary predictor of the errors of the main predictor, from which measures of epistemic uncertainty can be deduced.
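
The notion of a secondary predictor trained on the errors of a main predictor can be sketched as follows. This is a toy illustration under stated assumptions, not the method as implemented in the thesis: the target function, the linear main predictor, the k-nearest-neighbor error predictor, and all function names are hypothetical, and the predicted squared error is used directly as the uncertainty score, which is a simplification.

```python
def fit_line(xs, ys):
    """Main predictor: ordinary least-squares line through (xs, ys)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_knn(xs, ys, k=3):
    """Secondary predictor: average of the k nearest observed values."""
    points = list(zip(xs, ys))

    def predict(x):
        nearest = sorted(points, key=lambda p: abs(p[0] - x))[:k]
        return sum(y for _, y in nearest) / len(nearest)

    return predict


def target(x):
    """Hypothetical ground truth; deliberately not linear."""
    return x * x


def build_predictors():
    # Main predictor sees data only on [0, 1], so it is misspecified elsewhere.
    train_x = [i / 10 for i in range(11)]
    main = fit_line(train_x, [target(x) for x in train_x])

    # Observed squared errors of the main predictor on a wider held-out grid.
    held_x = [i * 0.25 - 2.0 for i in range(21)]  # -2.0 ... 3.0
    errors = [(main(x) - target(x)) ** 2 for x in held_x]

    # Secondary predictor maps an input to the expected error of `main`.
    error_predictor = fit_knn(held_x, errors)
    return main, error_predictor
```

Querying `error_predictor` far from the training range (for example at `x = 3.0`) yields a much larger score than querying inside it (for example at `x = 0.5`), which is the behavior one wants from an uncertainty measure.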

    Hierarchical Bayesian optimization of targeted motor outputs with spatiotemporal neurostimulation

    The idea for this thesis by article sprang from the following question: can we use neural prostheses to stimulate specific muscles in order to help recovery of motor control after stroke or cervical injury? This question is of crucial importance to more than 15 million people each year around the globe, and is at the heart of the research of Numa Dancause and Marco Bonizzato, our collaborators in the Neuroscience department at the University of Montreal. It is now possible to implant large-capacity electrodes in the cortex for electrical stimulation, but it is still difficult to predict their effect on the brain and the rest of the body. Nevertheless, preliminary but promising results on rats and monkeys have shown that a non-negligible motor recovery is obtained after stimulation of regions of motor cortex that are still functional. The difficulties related to optimal motor cortical stimulation hence consist in finding both one of these regions and a stimulation protocol with optimal recovery efficacy. Until now this search has been performed by hand, but recent and upcoming large-scale stimulation technologies that deliver spatio-temporal signals across multiple sites make such exhaustive manual searches impossible. A promising approach to automating and optimizing this discovery is the use of Bayesian optimization. My work has consisted in developing and refining such techniques with two scientific questions in mind: (1) how can we evoke complex movements by chaining cortical microstimulations, and (2) can these outperform single-channel stimulations in terms of recovery efficacy?
We present in the main article of this thesis our hierarchical Bayesian optimization approach, which uses Gaussian processes to exploit known properties of the brain to speed up the search, as well as first results answering question 1. We leave to future work a definitive answer to the second question.

    Distributed-memory large deformation diffeomorphic 3D image registration

    We present a parallel distributed-memory algorithm for large deformation diffeomorphic registration of volumetric images that produces large isochoric deformations (locally volume preserving). Image registration is a key technology in medical image analysis. Our algorithm uses a partial differential equation constrained optimal control formulation. Finding the optimal deformation map requires the solution of a highly nonlinear problem that involves pseudo-differential operators, biharmonic operators, and pure advection operators both forward and backward in time. A key issue is the time to solution, which poses the demand for efficient optimization methods as well as an effective utilization of high performance computing resources. To address this problem we use a preconditioned, inexact, Gauss-Newton-Krylov solver. Our algorithm integrates several components: a spectral discretization in space, a semi-Lagrangian formulation in time, analytic adjoints, different regularization functionals (including volume-preserving ones), a spectral preconditioner, a highly optimized distributed Fast Fourier Transform, and a cubic interpolation scheme for the semi-Lagrangian time-stepping. We demonstrate the scalability of our algorithm on images with resolution of up to $1024^3$ on the "Maverick" and "Stampede" systems at the Texas Advanced Computing Center (TACC). The critical problem in the medical imaging application domain is strong scaling, that is, solving registration problems of a moderate size of $256^3$, a typical resolution for medical images. We are able to solve the registration problem for images of this size in less than five seconds on 64 x86 nodes of TACC's "Maverick" system. Comment: accepted for publication at SC16 in Salt Lake City, Utah, USA; November 201
