    Resilience for large ensemble computations

    With the increasing power of supercomputers, ever more detailed models of physical systems can be simulated, and ever larger problem sizes can be considered for any kind of numerical system. During the last twenty years, the performance of the fastest clusters went from the teraFLOPS domain (ASCI RED: 2.3 teraFLOPS) to the pre-exaFLOPS domain (Fugaku: 442 petaFLOPS), and we will soon have the first supercomputer with a peak performance exceeding one exaFLOPS (El Capitan: 1.5 exaFLOPS). Ensemble techniques are experiencing a renaissance with the availability of those extreme scales. Recent techniques in particular, such as particle filters, will benefit from it. Current ensemble methods in climate science, such as ensemble Kalman filters, exhibit a linear dependency between the problem size and the ensemble size, while particle filters show an exponential dependency. Nevertheless, with the prospect of massive computing power come challenges such as power consumption and fault tolerance. The mean time between failures shrinks with the number of components in the system, and failures are expected every few hours at exascale. In this thesis, we explore and develop techniques to protect large ensemble computations from failures. We present novel approaches in differential checkpointing, elastic recovery, fully asynchronous checkpointing, and checkpoint compression. Furthermore, we design and implement a fault-tolerant particle filter with pre-emptive particle prefetching and caching. Finally, we design and implement a framework for the automatic validation and application of lossy compression in ensemble data assimilation. Altogether, we present five contributions in this thesis, where the first two improve state-of-the-art checkpointing techniques, and the last three address the resilience of ensemble computations. The contributions represent stand-alone fault-tolerance techniques; however, they can also be used to improve the properties of each other. 
For instance, we utilize elastic recovery (2nd contribution) to provide resilience in an online ensemble data assimilation framework (3rd contribution), and we build our validation framework (5th contribution) on top of our particle filter implementation (4th contribution). We further demonstrate that our contributions improve resilience and performance with experiments on various architectures such as Intel, IBM, and ARM processors.
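
The claim that the mean time between failures shrinks with the number of components can be illustrated with a minimal sketch. It assumes independent components with exponentially distributed lifetimes, which the abstract does not state, and the 25-year component MTBF is a hypothetical figure chosen only for illustration.

```python
def system_mtbf_hours(component_mtbf_hours: float, n_components: int) -> float:
    """MTBF of a system that fails as soon as any one of its components fails.

    Assumption (not from the abstract): components fail independently with
    exponentially distributed lifetimes, so their failure rates add up.
    """
    if n_components <= 0:
        raise ValueError("n_components must be positive")
    return component_mtbf_hours / n_components


if __name__ == "__main__":
    # Hypothetical component MTBF of 25 years, expressed in hours.
    component_mtbf = 25 * 365 * 24
    for n in (1_000, 10_000, 100_000):
        print(n, "components:", round(system_mtbf_hours(component_mtbf, n), 2), "hours")
```

Under these assumptions the system MTBF falls linearly with the component count, which is why checkpointing cost matters more as machines grow.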

    Algorithms and architectures for MCMC acceleration in FPGAs

    Markov Chain Monte Carlo (MCMC) is a family of stochastic algorithms which are used to draw random samples from arbitrary probability distributions. This task is necessary to solve a variety of problems in Bayesian modelling, e.g. prediction and model comparison, making MCMC a fundamental tool in modern statistics. Nevertheless, due to the increasing complexity of Bayesian models, the explosion in the amount of data they need to handle and the computational intensity of many MCMC algorithms, performing MCMC-based inference is often impractical in real applications. This thesis tackles this computational problem by proposing Field Programmable Gate Array (FPGA) architectures for accelerating MCMC and by designing novel MCMC algorithms and optimization methodologies which are tailored for FPGA implementation. The contributions of this work include: 1) An FPGA architecture for the Population-based MCMC algorithm, along with two modified versions of the algorithm which use custom arithmetic precision in large parts of the implementation without introducing error in the output. Mapping the two modified versions to an FPGA allows for more parallel modules to be instantiated in the same chip area. 2) An FPGA architecture for the Particle MCMC algorithm, along with a novel algorithm which combines Particle MCMC and Population-based MCMC to tackle multi-modal distributions. A proposed FPGA architecture for the new algorithm achieves higher datapath utilization than the Particle MCMC architecture. 3) A generic method to optimize the arithmetic precision of any MCMC algorithm that is implemented on FPGAs. The method selects the minimum precision among a given set of precisions, while guaranteeing a user-defined bound on the output error. 
By applying the above techniques to large-scale Bayesian problems, it is shown that significant speedups (one or two orders of magnitude) are possible compared to state-of-the-art MCMC algorithms implemented on CPUs and GPUs, opening the way for handling complex statistical analyses in the era of ubiquitous, ever-increasing data.
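
For readers unfamiliar with MCMC, the basic sampling loop can be sketched with a random-walk Metropolis sampler. This is a generic textbook member of the MCMC family, not the Population-based or Particle MCMC algorithms of the thesis; the function names, the step size, and the standard normal target are assumptions made for illustration.

```python
import math
import random


def metropolis(log_density, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for a one-dimensional target."""
    rng = random.Random(seed)
    x = x0
    log_p = log_density(x)
    samples = []
    for _ in range(n_samples):
        # Propose a move from a symmetric Gaussian centred on the current state.
        candidate = x + rng.gauss(0.0, step)
        log_p_candidate = log_density(candidate)
        # Accept with probability min(1, p(candidate) / p(x)).
        u = 1.0 - rng.random()  # uniform in (0, 1], so log(u) is defined
        if math.log(u) < log_p_candidate - log_p:
            x, log_p = candidate, log_p_candidate
        samples.append(x)
    return samples


def standard_normal_log_density(x):
    # Unnormalised log-density; the constant cancels in the acceptance ratio.
    return -0.5 * x * x


if __name__ == "__main__":
    draws = metropolis(standard_normal_log_density, 0.0, 20_000)
    print("sample mean:", sum(draws) / len(draws))
```

Each iteration depends on the previous state, so a single chain is inherently sequential. That is one reason hardware approaches tend to look for parallelism across chains or inside the density evaluation.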

    Efficient Implementation of Particle Filters in Application-Specific Instruction-Set Processor

    This thesis considers the problem of the implementation of particle filters (PFs) in Application-Specific Instruction-set Processors (ASIPs). Due to the diversity and complexity of PFs, implementing them requires both computational efficiency and design flexibility. ASIP design can offer an interesting degree of design flexibility. Hence, our research focuses on improving the throughput of PFs in this flexible ASIP design environment. A general approach is first proposed to characterize the computational complexity of PFs. Since PFs can be used for a wide variety of applications, we employ two types of blocks, application-specific and algorithm-specific, to distinguish the properties of PFs. In accordance with profiling results, we identify the likelihood processing and resampling processing blocks as the main bottlenecks in the algorithm-specific blocks. We explore the optimization of these two blocks at the algorithmic and architectural levels. The algorithmic level offers high potential for speed and throughput improvements. 
Hence, in this work we begin at the algorithm level by analyzing the complexity of the likelihood processing and resampling processing blocks, then proceed with their simplification and modification. We simplify the likelihood processing block by proposing a uniform quantization scheme, the Uniform Quantization Likelihood Evaluation (UQLE). The results show a significant improvement in performance without losing accuracy. The worst case of the UQLE software implementation in fixed-point arithmetic with 32 quantized intervals achieves a 23.7× average speedup over the software implementation of ELE. We also propose two novel resampling algorithms to replace the sequential Systematic Resampling (SR) algorithm in PFs. They are the reformulated SR and Parallel Systematic Resampling (PSR) algorithms. The reformulated SR algorithm combines a group of loops into a single loop to facilitate parallel implementation in an ASIP. The PSR algorithm makes the iterations independent, thus allowing the resampling algorithm to perform loop iterations in parallel. In addition, our proposed PSR algorithm has lower computational complexity than the SR algorithm. At the architecture level, ASIPs are appealing for the implementation of PFs because they strike a good balance between computational efficiency and design flexibility. They can provide considerable throughput improvement by the inclusion of custom instructions, while retaining the ease of programming of general-purpose processors. Hence, after identifying the bottlenecks of PFs in the algorithm-specific blocks, we describe customized instructions for the UQLE, reformulated SR, and PSR algorithms in an ASIP. These instructions provide significantly higher throughput when compared to a pure software implementation running on a general-purpose processor. The custom instruction implementation of UQLE with 32 intervals achieves a 34× speedup over the worst case of its software implementation with 3.75 K additional gates. 
An implementation of the reformulated SR algorithm is evaluated with four weights calculated in parallel and eight categories defined by uniformly distributed numbers that are compared simultaneously. It achieves a 23.9× speedup over the sequential SR algorithm in a general-purpose processor. This comes at a cost of only 54 K additional gates. For the PSR algorithm, four custom instructions, when configured to support four weights input in parallel, lead to a 53.4× speedup over the floating-point SR implementation on a general-purpose processor at a cost of 47.3 K additional gates. Finally, we consider the specific application of video tracking, and an implementation of a histogram-based PF in an ASIP. We identify the histogram calculation as the main bottleneck in the application-specific blocks. We therefore propose a Parallel Array Histogram Architecture (PAHA) engine for accelerating the histogram calculation in ASIPs. Implementation results show that a 16-way PAHA can achieve a speedup of 43.75× when compared to its software implementation in a general-purpose processor.
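
As background for the resampling discussion, the baseline sequential Systematic Resampling step can be sketched as follows. This is the standard textbook formulation, not the reformulated SR or PSR algorithms proposed in the thesis; the function name and the optional `u0` argument (a fixed offset used here for reproducibility) are assumptions.

```python
import random


def systematic_resampling(weights, u0=None, rng=random):
    """Return the indices of the particles selected by systematic resampling."""
    n = len(weights)
    total = sum(weights)
    if n == 0 or total <= 0:
        raise ValueError("weights must be non-empty with a positive sum")
    if u0 is None:
        u0 = rng.random()  # single uniform draw in [0, 1)
    indices = []
    j = 0
    cumulative = weights[0] / total
    for i in range(n):
        # Evenly spaced positions all share the one random offset u0.
        position = (i + u0) / n
        # Advance through the cumulative weights until the position is covered.
        while position > cumulative and j < n - 1:
            j += 1
            cumulative += weights[j] / total
        indices.append(j)
    return indices


if __name__ == "__main__":
    print(systematic_resampling([0.1, 0.2, 0.3, 0.4], u0=0.5))
```

The running index `j` and the running sum `cumulative` carry over from one iteration to the next, which is the loop dependence that makes this baseline sequential.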

    Towards resource-aware computing for task-based runtimes and parallel architectures

    Current large-scale systems show increasing power demands, to the point that power has become a huge strain on facilities and budgets. The increasing restrictions in terms of power consumption of High Performance Computing (HPC) systems and data centers have forced hardware vendors to include power capping capabilities in their commodity processors. Power capping opens up new opportunities for applications to directly manage their power behavior at user level. However, constraining power consumption causes the individual sockets of a parallel system to deliver different performance levels under the same power cap, even when they are identically designed, which is an effect caused by manufacturing variability. Modern chips suffer from heterogeneous power consumption due to manufacturing issues, a problem known as manufacturing or process variability. As a result, systems that do not consider such variability suffer performance degradation and waste power. In order to avoid such negative impact, users and system administrators must actively counteract any manufacturing variability. In this thesis we show that parallel systems benefit from taking into account the consequences of manufacturing variability, in terms of both performance and energy efficiency. In order to evaluate our work we have also implemented our own task-based version of the PARSEC benchmark suite. This allows us to test our methodology using state-of-the-art parallelization techniques and real-world workloads. We present two approaches to mitigate manufacturing variability: power redistribution at runtime level, and power- and variability-aware job scheduling at system-wide level. A parallel runtime system can be used to effectively deal with this new kind of performance heterogeneity by compensating for the uneven effects of power capping. 
In the context of a NUMA node composed of several multi-core sockets, our system is able to optimize the energy and concurrency levels assigned to each socket to maximize performance. Applied transparently within the parallel runtime system, it does not require any programmer interaction such as changing the application source code or manually reconfiguring the parallel system. We compare our novel runtime analysis with an offline approach and demonstrate that it can achieve equal performance at a fraction of the cost. In the second approach presented in this thesis, we show that it is possible to predict the impact of this variability on specific applications by using variability-aware power prediction models. Based on these power models, we propose two job scheduling policies that consider the effects of manufacturing variability for each application and that ensure that power consumption stays under a system-wide power budget. We evaluate our policies under different power budgets and traffic scenarios, consisting of both single- and multi-node parallel applications.
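
The idea of scheduling under a system-wide power budget can be sketched as a simple admission check. This is a hypothetical first-fit illustration, not either of the two policies proposed in the thesis; the `Job` class, the `predicted_power_watts` field (standing in for the output of a power prediction model), and the job figures are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    predicted_power_watts: float  # hypothetical output of a power model


def admit_jobs(queue, power_budget_watts):
    """Start queued jobs in order whenever their predicted power still fits.

    Jobs that do not fit are left waiting, and later jobs may still start.
    """
    running, waiting = [], []
    used = 0.0
    for job in queue:
        if used + job.predicted_power_watts <= power_budget_watts:
            running.append(job)
            used += job.predicted_power_watts
        else:
            waiting.append(job)  # stays queued until power is released
    return running, waiting, used


if __name__ == "__main__":
    queue = [Job("a", 300.0), Job("b", 500.0), Job("c", 150.0)]
    running, waiting, used = admit_jobs(queue, power_budget_watts=500.0)
    print([j.name for j in running], [j.name for j in waiting], used)
```

A variability-aware policy would presumably make the predicted power depend on which sockets a job is placed on, a detail this sketch leaves out.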

    Tracking and classification with wireless sensor networks and the transferable belief model

    The use of small, cheap, networked devices to collaboratively perform a task presents an attractive opportunity for many scenarios. One such scenario is the tracking and classification of an object moving through a region of interest. A single sensor is capable of very little, but a group of sensors can potentially provide a flexible, self-organising system that can carry out tasks in harsh conditions for long periods of time. This thesis presents a new framework for tracking and classification with a wireless sensor network. Existing algorithms have been integrated and extended within this framework to perform tracking and classification whilst managing energy usage in order to balance the quality of information with the cost of obtaining it. Novel improvements are presented to perform tracking and classification in more realistic scenarios where a target is moving in a non-linear fashion over a varying terrain. The framework presented in this thesis can be used not only in algorithm development, but also as a tool to aid sensor deployment planning. All of the algorithms presented in this thesis have a common basis that results from the integration of a wireless sensor network management algorithm and a tracking and classification algorithm, both of which are considered state-of-the-art. Tracking is performed with a particle filter, and classification is performed with the Transferable Belief Model. Simulations are used throughout this thesis in order to compare the performance of different algorithms. A large number of simulations are used in each experiment with various parameter combinations in order to provide a detailed analysis of each algorithm and scenario. The work presented in this thesis could be of use to developers of wireless sensor network algorithms, and also to people who plan the deployment of nodes. 
This thesis focuses on military scenarios, but the research presented is not limited to them.
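
To give a flavour of classification with the Transferable Belief Model, the sketch below combines two mass functions with the unnormalised conjunctive rule, in which conflicting evidence is kept as mass on the empty set. The class labels and mass values are hypothetical, and the abstract does not say which combination rule the thesis uses, so this shows only the general mechanism.

```python
from collections import defaultdict


def conjunctive_combination(m1, m2):
    """Combine two mass functions with the unnormalised conjunctive rule.

    Each mass function maps a frozenset of class labels to its mass.
    """
    combined = defaultdict(float)
    for focal_a, mass_a in m1.items():
        for focal_b, mass_b in m2.items():
            # Mass flows to the intersection; conflict lands on the empty set.
            combined[focal_a & focal_b] += mass_a * mass_b
    return dict(combined)


if __name__ == "__main__":
    # Hypothetical evidence from two sensors about the class of a target.
    vehicle = frozenset({"vehicle"})
    person = frozenset({"person"})
    either = vehicle | person
    sensor_1 = {vehicle: 0.6, either: 0.4}
    sensor_2 = {person: 0.5, either: 0.5}
    for focal, mass in conjunctive_combination(sensor_1, sensor_2).items():
        print(sorted(focal), round(mass, 3))
```

The mass that ends up on the empty set measures how strongly the two sources disagree.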

    An aptamer-based sensing platform for luteinising hormone pulsatility measurement

    Normal fertility in humans involves highly orchestrated communication across the hypothalamic-pituitary-gonadal (HPG) axis. The pulsatile release of Luteinising Hormone (LH) is a critical element for downstream regulation of sex steroid hormone synthesis and the production of mature eggs. Changes in the LH pulsatile pattern have been linked to hypothalamic dysfunction, resulting in multiple reproductive and growth disorders including Polycystic Ovary Syndrome (PCOS), Hypothalamic Amenorrhea (HA), and delayed/precocious puberty. Therefore, assessing the pulsatility of LH is important not only for academic investigation of infertility, but also for clinical decisions and monitoring of treatment. However, there is currently no clinically available tool for measuring human LH pulsatility. The immunoassay system is expensive and requires large volumes of patient blood, limiting its application for LH pulsatility monitoring. In this thesis, I propose a novel method using aptamer-enabled sensing technology to develop a device platform to measure LH pulsatility. I first generated a novel aptamer binding molecule against LH by a nitrocellulose membrane-based in vitro selection, then characterised its high-affinity and specific binding properties by multiple biophysical/chemical methods. I then developed a sensitive electrochemical detection method using this aptamer. The principal mechanism is that structure switching upon binding changes the electron transfer rate of the MB redox label. I then adapted this assay to numerous device platforms under our rapid prototyping strategy, including a 96-well automated platform, a continuous sensing platform and a chip-based multiple-electrode platform. The best-performing device was found to be the AELECAP (Automated ELEctroChemical Aptamer Platform), a 96-well plate-based automatic micro-wire sensing platform capable of measuring a series of low-volume luteinising hormone samples within a short time. 
Clinical samples were evaluated using AELECAP. A series of clinical samples were measured, including LH pulsatility profiles of menopausal females (high LH amplitude), normal females/males (normal LH amplitude) and females with hypothalamic amenorrhea (no LH pulsatility). Twelve patients of each type were included, with 50 blood samples collected every 10 minutes over 8 hours. Results showed that the system can distinguish the LH pulsatile pattern among the cohorts, and pulsatility profiles were consistent with the results measured by clinical assays. AELECAP shows high potential as a novel approach for clinical aptamer-based sensing. AELECAP competes with current automated immunometric assay systems, offering lower costs, lower reagent use, and a simpler setup. There is potential for this approach to be further developed as a tool for infertility research and to assist clinicians in personalised treatment with hormonal therapy.

    Computational Proteomics Using Network-Based Strategies

    This thesis examines the productive application of networks to proteomics, with a specific biological focus on liver cancer. Contemporary (shotgun) proteomics is plagued by coverage and consistency issues. These can be resolved via network-based approaches. The application of three classes of network-based approaches is examined: a traditional cluster-based approach termed Proteomics Expansion Pipeline (PEP), a generalization of PEP termed Maxlink, and a feature-based approach termed Proteomics Signature Profiling (PSP). PEP is an improvement on prevailing cluster-based approaches. It uses a state-of-the-art cluster identification algorithm as well as network-cleaning approaches to identify the critical network regions indicated by the liver cancer data set. The top PARP1-associated cluster was identified and independently validated. Maxlink allows identification of undetected proteins based on the number of links to identified differential proteins. It is more sensitive than PEP due to more relaxed requirements. Here, the novel roles of ARRB1/2 and ACTB are identified and discussed in the context of liver cancer. Both PEP and Maxlink are unable to deal with consistency issues. PSP is the first method able to deal with both coverage and consistency issues, and is termed feature-based since the network-based clusters it uses are predicted independently of the data. It is also capable of using real complexes or predicted pathway subnets. By combining pathways and complexes, a novel basis of liver cancer progression implicating nucleotide pool imbalance aggravated by mutations of key DNA repair complexes was identified. Finally, comparative evaluations suggested that pure network-based methods are vastly outperformed by feature-based network methods utilizing real complexes. This indicates that the quality of current networks is insufficient to provide strong biological rigor for data analysis, and that networks should be carefully evaluated before further validations.
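
The link-counting idea attributed to Maxlink can be sketched in a few lines. This is an illustration of the one-sentence description above, not the thesis's implementation; the function name, the placeholder protein identifiers, and the tie-breaking by name are assumptions, and any thresholding or network cleaning used in the actual method is omitted.

```python
def rank_by_links(edges, differential, detected):
    """Rank undetected proteins by their number of links to differential proteins.

    edges: iterable of (protein, protein) pairs from an interaction network.
    differential: set of detected proteins found to be differential.
    detected: set of all proteins detected in the experiment.
    """
    linked = {}
    for a, b in edges:
        # Treat each edge as undirected by checking both orientations.
        for protein, neighbour in ((a, b), (b, a)):
            if protein not in detected and neighbour in differential:
                linked.setdefault(protein, set()).add(neighbour)
    ranking = [(protein, len(partners)) for protein, partners in linked.items()]
    # Most links first; ties are broken alphabetically for a stable order.
    return sorted(ranking, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Placeholder identifiers, not real proteins.
    edges = [("X", "A"), ("X", "B"), ("Y", "A"), ("A", "B"), ("Z", "C")]
    print(rank_by_links(edges, differential={"A", "B"}, detected={"A", "B", "C"}))
```

Undetected proteins with many differential neighbours rise to the top of the ranking and become candidates for follow-up.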
