
    Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable

    There has been significant recent interest in parallel graph processing due to the need to quickly analyze the large graphs available today. Many graph codes have been designed for distributed memory or external memory. However, today even the largest publicly available real-world graph (the Hyperlink Web graph, with over 3.5 billion vertices and 128 billion edges) can fit in the memory of a single commodity multicore server. Nevertheless, most experimental work in the literature reports results on much smaller graphs, and the work on the Hyperlink graph uses distributed or external memory. It is therefore natural to ask whether we can efficiently solve a broad class of graph problems on this graph in memory. This paper shows that theoretically efficient parallel graph algorithms can scale to the largest publicly available graphs using a single machine with a terabyte of RAM, processing them in minutes. We give implementations of theoretically efficient parallel algorithms for 20 important graph problems. We also present the optimizations and techniques used in our implementations, which were crucial in enabling us to process these large graphs quickly. We show that the running times of our implementations outperform existing state-of-the-art implementations on the largest real-world graphs. For many of the problems we consider, this is the first time they have been solved on graphs at this scale. We have made the implementations developed in this work publicly available as the Graph-Based Benchmark Suite (GBBS). This is the full version of the paper appearing in the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2018.
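    As a rough, back-of-envelope check on the in-memory claim above, the following Python sketch estimates the footprint of the Hyperlink Web graph in a plain compressed sparse row (CSR) layout. The figures are illustrative only; GBBS itself uses its own compressed graph formats, which this sketch does not reproduce.

        # Rough memory estimate for the Hyperlink Web graph in a plain
        # CSR layout (one offset per vertex, one neighbor id per edge).
        vertices = 3.5e9    # ~3.5 billion vertices
        edges = 128e9       # ~128 billion directed edges

        tib = 2.0 ** 40
        offsets = vertices * 8 / tib     # one 64-bit offset per vertex
        edges_64 = edges * 8 / tib       # 64-bit neighbor ids
        edges_32 = edges * 4 / tib       # 32-bit ids just fit: 2**32 > 3.5e9

        print(f"vertex offsets:     {offsets:5.2f} TiB")    # ~0.03 TiB
        print(f"edges (64-bit ids): {edges_64:5.2f} TiB")   # ~0.93 TiB
        print(f"edges (32-bit ids): {edges_32:5.2f} TiB")   # ~0.47 TiB

    Even with 32-bit vertex ids, the raw edges alone occupy roughly half a terabyte, which is why a terabyte of RAM makes in-memory processing plausible while still motivating the optimizations the paper describes.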

    Study of Fine-Grained, Irregular Parallel Applications on a Many-Core Processor

    This dissertation demonstrates the possibility of obtaining strong speedups for a variety of parallel applications versus the best serial and parallel implementations on commodity platforms. These results were obtained using the PRAM-inspired Explicit Multi-Threading (XMT) many-core computing platform, which is designed to efficiently support execution of both serial and parallel code and switching between the two.
    Biconnectivity: For finding the biconnected components of a graph, we demonstrate speedups of 9x to 33x on XMT relative to the best serial algorithm, using a relatively modest silicon budget; further evidence suggests that speedups of 21x to 48x are possible. For graph connectivity, we demonstrate that XMT outperforms two contemporary NVIDIA GPUs of similar or greater silicon area. Prior studies of parallel biconnectivity algorithms achieved speedups of at most 4x, and we could not find GPU biconnectivity code to compare against.
    Triconnectivity: We present a parallel solution to the problem of determining the triconnected components of an undirected graph. We obtain significant speedups on XMT over the only published optimal (linear-time) serial implementation of a triconnected-components algorithm running on a modern CPU. To our knowledge, no other parallel implementation of a triconnected-components algorithm has been published for any platform.
    Burrows-Wheeler compression: We present novel work-optimal parallel algorithms for Burrows-Wheeler compression and decompression of strings over a constant alphabet, along with their empirical evaluation. To validate these theoretical algorithms, we implement them on XMT and show speedups of up to 25x for compression and 13x for decompression versus bzip2, the de facto standard implementation of Burrows-Wheeler compression.
    Fast Fourier transform (FFT): Using FFT as an example, we examine the impact that the adoption of some enabling technologies, including silicon photonics, would have on the performance of a many-core architecture. The results show that a single-chip many-core processor could potentially outperform a large high-performance computing cluster.
    Boosted decision trees: This chapter focuses on the hybrid memory architecture of the XMT computer platform, a key part of which is a flexible all-to-all interconnection network that connects processors to shared memory modules. First, to understand some recent advances in GPU memory architecture and how they relate to this hybrid memory architecture, we use microbenchmarks, including list ranking. Then, we contrast the scalability of applications with that of routines: regardless of the scalability needs of full applications, some routines may involve smaller problem sizes, and in particular smaller levels of parallelism, perhaps even serial execution. To see how a hybrid memory architecture can benefit such applications, we simulate a computer with such an architecture and demonstrate the potential for a speedup of 3.3x over NVIDIA's most powerful GPU to date for XGBoost, an implementation of boosted decision trees, a timely machine-learning approach.
    Boolean satisfiability (SAT): SAT is an important performance-hungry problem with applications in many domains. However, most work on parallelizing SAT solvers has focused on coarse-grained, mostly embarrassingly parallel approaches. Here, we study fine-grained parallelism that can speed up existing sequential SAT solvers, and we show the potential for speedups of up to 382x across a variety of problem instances.
    We hope that these results will stimulate future research.
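    For orientation on the Burrows-Wheeler chapter above, here is the textbook serial transform and its inverse in Python. This is only the quadratic reference formulation; it does not reproduce the dissertation's work-optimal parallel algorithms.

        # Textbook (serial) Burrows-Wheeler transform and inverse.
        def bwt(s: str, sentinel: str = "\0") -> str:
            """Last column of the sorted rotation matrix of s + sentinel."""
            s += sentinel                       # unique terminator
            rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
            return "".join(row[-1] for row in rotations)

        def ibwt(last: str, sentinel: str = "\0") -> str:
            """Invert the BWT by repeatedly sorting prefixed columns."""
            table = [""] * len(last)
            for _ in range(len(last)):
                table = sorted(last[i] + table[i] for i in range(len(last)))
            row = next(r for r in table if r.endswith(sentinel))
            return row.rstrip(sentinel)

        encoded = bwt("banana")                 # "annb\0aa"
        assert ibwt(encoded) == "banana"        # the transform round-trips

    The transform tends to group equal characters into runs ("banana" becomes "annb\0aa"), which is what makes the later move-to-front and entropy-coding stages of bzip2-style compressors effective.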

    Feasibility Study of Scaling an XMT Many-Core

    The reason for the recent focus on communication avoidance is that high rates of data movement become infeasible due to excessive power dissipation. However, shifting the responsibility of minimizing data movement to the parallel algorithm designer comes at significant cost to programmers' productivity, as well as (i) reduced speedups and (ii) the risk of repelling application developers from adopting parallelism. The UMD Explicit Multi-Threading (XMT) framework has demonstrated advantages in ease of parallel programming through its support of PRAM-like programming, combined with strong, often unprecedented speedups. Such programming and speedups involve considerable data movement between processors and shared memory. Another reason that XMT is a good test case for a study of data movement is that XMT permits isolation and direct study of most of its data movement (and the associated power dissipation). Our new results demonstrate that an XMT single-chip many-core processor with tens of thousands of cores and a high-throughput network-on-chip is thermally feasible, though at some cost. This leads to a perhaps game-changing outcome: instead of imposing upfront strict restrictions on data movement, as advocated in a recent report from the National Academies, opt for due diligence that accounts for the full impact on cost. For example, does the increased cost of communication avoidance (including programmers' productivity, reduced speedups, and desertion risk) indeed offset the cost of the solution we present? More specifically, we investigate in this paper the design of an XMT many-core for 3D VLSI with microfluidic cooling. We used state-of-the-art simulation tools to model the power and thermal properties of such an architecture with 8k to 64k lightweight cores, requiring between 2 and 8 silicon layers. Inter-chip communication using silicon-compatible photonics is also considered. We found that, with the use of microfluidic cooling, power dissipation becomes a cost issue rather than a feasibility constraint. Robustness of the results is also discussed.

    Computing Connected Components with Linear Communication Cost in Pregel-like Systems

    The paper studies two fundamental problems in graph analytics: computing the Connected Components (CCs) and the BiConnected Components (BCCs) of a graph. With the recent advent of Big Data, developing efficient distributed algorithms for computing the CCs and BCCs of a big graph has received increasing interest. In line with existing research efforts, in this paper we focus on the Pregel programming model, though the techniques may be extended to other programming models, including MapReduce and Spark. The state-of-the-art techniques for computing CCs and BCCs in Pregel incur O(m × #supersteps) total cost for both data communication and computation, where m is the number of edges in a graph and #supersteps is the number of supersteps. Since network communication is usually much slower than computation, communication costs dominate the total running time of the existing techniques. In this paper, we propose a new paradigm based on graph decomposition that reduces the total communication cost from O(m × #supersteps) to O(m), for both computing CCs and computing BCCs. Moreover, the total computation costs of our techniques are smaller than those of the existing techniques in practice, though theoretically they are almost the same. Comprehensive empirical studies demonstrate that our approaches can outperform the existing techniques by an order of magnitude in total running time.
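    To make the cost model concrete, the following Python sketch simulates the baseline Pregel-style CC algorithm (min-label propagation) on a single machine. It is the O(m × #supersteps) baseline the paper improves on, not the proposed decomposition paradigm; the toy graph and counters are illustrative.

        from collections import defaultdict

        def pregel_cc(edges, n):
            """Baseline min-label propagation; counts supersteps and messages."""
            adj = defaultdict(list)
            for u, v in edges:
                adj[u].append(v)
                adj[v].append(u)
            label = list(range(n))           # each vertex starts with its own id
            active = set(range(n))
            supersteps = messages = 0
            while active:
                supersteps += 1
                inbox = defaultdict(list)
                for u in active:             # active vertices broadcast labels
                    for v in adj[u]:
                        inbox[v].append(label[u])
                        messages += 1
                active = set()
                for v, incoming in inbox.items():
                    best = min(incoming)
                    if best < label[v]:      # stay active only if label improved
                        label[v] = best
                        active.add(v)
            return label, supersteps, messages

        print(pregel_cc([(0, 1), (1, 2), (3, 4)], 5))   # ([0, 0, 0, 3, 3], 3, 11)

    Every superstep with active vertices rebroadcasts labels along their edges, so total communication grows with the number of supersteps (roughly the graph diameter); the decomposition paradigm proposed in the paper caps it at O(m).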

    ConnectIt: A Framework for Static and Incremental Parallel Graph Connectivity Algorithms

    Connected components is a fundamental kernel in graph applications, both because it measures how well-connected a graph is and because it serves as a subroutine in many other graph algorithms. The fastest existing parallel multicore algorithms for connectivity are based on some form of edge sampling and/or linking and compressing trees. However, many combinations of these design choices have been left unexplored. In this paper, we design the ConnectIt framework, which provides different sampling strategies as well as various tree linking and compression schemes. ConnectIt enables us to obtain several hundred new variants of connectivity algorithms, most of which extend to computing spanning forests. In addition to static graphs, we also extend ConnectIt to support mixes of insertions and connectivity queries in the concurrent setting. We present an experimental evaluation of ConnectIt on a 72-core machine, which we believe is the most comprehensive evaluation of parallel connectivity algorithms to date. Compared to a collection of state-of-the-art static multicore algorithms, we obtain an average speedup of 37.4x (a 2.36x average speedup over the fastest existing implementation for each graph). Using ConnectIt, we are able to compute connectivity on the largest publicly available graph (with over 3.5 billion vertices and 128 billion edges) in under 10 seconds on a 72-core machine, a 3.1x speedup over the fastest existing connectivity result for this graph in any computational setting. Our incremental algorithms can ingest graph updates at rates of up to several billion edges per second. Finally, to guide the user in selecting the best ConnectIt variants for different situations, we provide a detailed analysis of the different strategies in terms of their work and locality.
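    As a point of reference for the design space, here is a minimal serial union-find in Python with one particular linking rule (lower index wins) and full path compression. ConnectIt's contribution is the systematic, concurrent exploration of many such linking, compression, and sampling choices, which this single-variant sketch does not capture.

        def find(parent, x):
            """Find the root of x, compressing the path behind it."""
            root = x
            while parent[root] != root:
                root = parent[root]
            while parent[x] != root:         # path compression
                parent[x], x = root, parent[x]
            return root

        def connectivity(n, edges):
            parent = list(range(n))
            for u, v in edges:
                ru, rv = find(parent, u), find(parent, v)
                if ru != rv:
                    parent[max(ru, rv)] = min(ru, rv)   # one linking rule of many
            return [find(parent, v) for v in range(n)]

        print(connectivity(5, [(0, 1), (1, 2), (3, 4)]))   # [0, 0, 0, 3, 3]

    A sampling phase, as in ConnectIt, would first run this on a subset of the edges to identify a likely largest component, so that most of the remaining edges can be skipped after a single check.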

    Power and Performance Studies of the Explicit Multi-Threading (XMT) Architecture

    Power and thermal constraints gained critical importance in the design of microprocessors over the past decade. Chipmakers failed to keep power at bay while sustaining the performance growth of serial computers at the rate expected by consumers. As an alternative, they turned to fitting an increasing number of simpler cores on a single die. While this is a step forward for relaxing the constraints, the issue of power is far from resolved, and it is joined by new challenges, which we explain next. As we move into the era of many-cores, processors consisting of hundreds, even thousands, of cores, single-task parallelism is the natural path for building faster general-purpose computers. Alas, the introduction of parallelism to the mainstream general-purpose domain brings another long-elusive problem into focus: ease of parallel programming. The result is a dual challenge in which power efficiency and ease of programming are both vital for the prevalence of up-and-coming many-core architectures. These observations led to the central goal of this dissertation: a first-order validation of the claim that, even under power and thermal constraints, ease of programming and competitive performance need not be conflicting objectives for a massively parallel general-purpose processor. As our platform, we choose the eXplicit Multi-Threading (XMT) many-core architecture for fine-grained parallel programs, developed at the University of Maryland. We hope that our findings will be a trailblazer for future commercial products. XMT scales up to a thousand or more lightweight cores and aims at improving single-task execution time while making the programmer's task as easy as possible. The performance advantages and ease of programming of XMT have been shown in a number of publications, including a study that we present in this dissertation. The feasibility of the hardware concept has been exhibited via FPGA and ASIC prototypes (with our partial involvement). Our contributions target the study of the power and thermal envelopes of an envisioned 1024-core XMT chip (XMT1024) under programs from popular parallel benchmark suites. First, we compare XMT against an area- and power-equivalent commercial high-end many-core GPU. We demonstrate that XMT can provide an average speedup of 8.8x on irregular parallel programs that are common and important in general-purpose computing. Even under the worst-case power-estimation assumptions for XMT, the average speedup is only reduced by half. We further this study by experimentally evaluating the performance advantages of Dynamic Thermal Management (DTM) when applied to XMT1024. DTM techniques are frequently used in current single- and multi-core processors, but until now their effects on single-tasked many-cores had not been examined in detail. Our purpose is to explore how existing techniques can be tailored to XMT to improve performance. Performance improvements of up to 46% over a generic global management technique have been demonstrated. The insights we provide can guide designers of other similar many-core architectures. A significant infrastructure contribution of this dissertation is a highly configurable cycle-accurate simulator, XMTSim. To our knowledge, XMTSim is currently the only publicly available shared-memory many-core simulator with extensive capabilities for estimating power and temperature, as well as for evaluating dynamic power and thermal management algorithms. As a major component of the XMT programming toolchain, it not only serves as the infrastructure for this work but has also contributed to other publications and dissertations.
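    As a toy illustration of the generic global DTM policies that the dissertation compares against (not the XMT-tailored techniques it proposes; every threshold and number here is made up), consider a Python control loop that throttles the clock when the hottest on-chip sensor crosses a trip temperature:

        def dtm_step(temps_c, freq_ghz, t_trip=85.0, t_safe=75.0,
                     f_min=0.5, f_max=1.5, step=0.1):
            """One control interval of a global throttling policy."""
            hottest = max(temps_c)
            if hottest >= t_trip:                 # too hot: throttle down
                return max(f_min, freq_ghz - step)
            if hottest <= t_safe:                 # comfortably cool: speed up
                return min(f_max, freq_ghz + step)
            return freq_ghz                       # hysteresis band: hold

        freq = 1.5
        for temps in ([70, 72], [83, 88], [86, 90], [74, 73]):
            freq = dtm_step(temps, freq)
            print(f"hottest={max(temps)}C -> {freq:.1f} GHz")

    Tailoring such a policy to a single-tasked many-core, e.g., by managing clusters of lightweight cores individually rather than globally, is the kind of refinement the dissertation evaluates.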

    Efficient Online Processing for Advanced Analytics

    With the advent of emerging technologies and the Internet of Things, the importance of online data analytics has become more pronounced. Businesses and companies are adopting approaches that provide responsive analytics in order to stay competitive in the global marketplace. Online analytics allow data analysts to promptly react to patterns or to gain preliminary insights from early results that aid in research, decision making, and effective strategy planning. The growth of data velocity in a variety of domains, including high-frequency trading, social networks, infrastructure monitoring, and advertising, requires adopting online engines that can efficiently process continuous streams of data. This thesis presents foundations, techniques, and system designs that extend the state of the art in online query processing to efficiently support relational joins with arbitrary join predicates (beyond traditional equi-joins) and to support other data models (beyond relational) that target machine learning and graph computations. The thesis is divided into two parts. We first present a brief overview of Squall, our open-source online query-processing engine that supports SQL-like queries on top of streams. Then, we focus on extending Squall to support efficient theta-join processing. Scalable distributed join processing requires a partitioning policy that evenly distributes the processing load while minimizing the size of maintained state and the number of duplicated messages. Efficient load balancing demands a priori statistics, which are not available in the online setting. We propose a novel operator that continuously adjusts itself to the data dynamics through adaptive dataflow routing and state repartitioning. It is also resilient to data skew, maintains high throughput rates, avoids blocking during state repartitioning, and behaves as a black-box dataflow operator with provable performance guarantees. Our evaluation demonstrates that the proposed operator outperforms state-of-the-art static partitioning schemes in resource utilization, throughput, and execution time by up to 7x. In the second part, we present a novel framework that supports the Incremental View Maintenance (IVM) of workloads expressed as linear algebra programs. Linear algebra is a concrete substrate for advanced analytical tasks including machine learning, scientific computation, and graph algorithms. Previous work on relational-calculus IVM is not applicable to matrix-algebra workloads, because a single entry change to an input matrix results in changes all over the intermediate views, rendering IVM useless in comparison to re-evaluation. We present Lago, a unified modular compiler framework that supports the IVM of a broad class of linear algebra programs. Lago automatically derives and optimizes incremental trigger programs of analytical computations, while freeing the user from erroneous manual derivations, low-level implementation details, and performance tuning. We present a novel technique that captures Δ changes as low-rank matrices. Low-rank matrices are representable in a compressed factored form that enables cheaper computations. Lago automatically propagates the factored representation across program statements to derive an efficient trigger program. Moreover, Lago extends its support to other domains that use different semi-ring configurations, e.g., graph applications. Our evaluation results demonstrate orders-of-magnitude performance improvements.
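    A minimal NumPy sketch of the factored-delta idea behind Lago follows; the names and shapes are illustrative, not Lago's actual API. If an input matrix changes by a low-rank delta, a maintained product view can be updated far more cheaply than by re-evaluating it.

        import numpy as np

        rng = np.random.default_rng(0)
        n, k, m = 1000, 1000, 1000
        A = rng.standard_normal((n, k))
        B = rng.standard_normal((k, m))
        C = A @ B                          # materialized view C = A B

        # A single-entry update A[i, j] += 3 is the rank-1 delta u @ v.T.
        i, j = 7, 42
        u = np.zeros((n, 1)); u[i, 0] = 3.0
        v = np.zeros((k, 1)); v[j, 0] = 1.0

        C += u @ (v.T @ B)                 # ~(n + k) * m flops, not n * k * m
        A += u @ v.T
        assert np.allclose(C, A @ B)       # incremental view matches re-evaluation

    A single-entry change is the extreme rank-1 case; Lago's compiler generalizes this by propagating factored deltas through entire linear-algebra programs.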

    Adaptive Computing Systems for Aerospace

    Today's computer systems are growing more and more complex, at a pace that requires the development of novel and more effective methodologies to automate their design. Space, in particular, represents a challenging environment: without protection from ionizing and particle radiation, CMOS-based electronics are subject to transient faults, performance degradation, accelerated wear, and, ultimately, system failure. Traditional approaches adopted to guarantee reliability and extended lifetime are based on redundancy that is established at design time. These solutions are expensive and sometimes inefficient, as they increase the complexity and size of a system, exposing it to higher risks of overheating and radiation-induced errors. Moreover, critical systems (e.g., time-constrained ones and those where access is limited) must be able to cope with pivotal situations without relying on human intervention. Hence the emerging interest in computer systems with adaptive capabilities as the most suitable solution for novel high-performance embedded devices for aerospace.
    Self-adaptive computing carries unmatched potential and great promise for the creation of a new generation of smart, more reliable computers, and it addresses the challenge of designing and programming modern and future computer systems that must meet conflicting goals. Drawing from the fields of artificial intelligence and reconfigurable systems, we aim at developing self-adaptive computer systems for aerospace. Our goal is to improve their efficiency, fault tolerance, and computational capabilities. The first step in this research is an experimental analysis of the most popular multi-objective design-space exploration algorithms for high-level design. These algorithms were collected from the recent literature and include heuristic, evolutionary, and statistical methods. Their comparison provides insights that we use to define guidelines for choosing the most appropriate optimization algorithms, given the features of the design space. For the creation of a self-managing optimization framework, enabling the adaptive trade-off of multiple objectives, we leverage the tools of probabilistic graphical models. We introduce a mechanism based on dynamic hidden Markov models that balances the availability and lifetime of multiprocessor systems. This is achieved by estimating the occurrence of permanent faults amid transient faults, and by dynamically migrating the computation to excess resources when failure occurs. The dynamic nature of the model makes it adjustable to different mission profiles and fault rates. The results show that we are able to lead systems to extended lifetimes while keeping their availability close to ideal. On account of the stringent timing constraints imposed by aerospace systems, we then investigate the optimization of fault tolerance under real-time requirements. We propose a methodology to improve the reliability of computation in the presence of transient errors when mapping real-time tasks onto a homogeneous multiprocessor system with voltage- and frequency-scaling capabilities. In this framework, we take advantage of probability theory to define a novel trade-off between power consumption and fault tolerance. As we recognize that resilience is a pervasive property of interest (e.g., for the design and analysis of generic complex systems), we adapt a formal definition of it to a probabilistic framework derived, once again, from hidden Markov models. This allows us to realistically model the stochastic evolution and partial observability of complex real-world environments. Within this framework, we propose an efficient algorithm for the exact computation of the essential inference step required for checking generic properties. To demonstrate the flexibility of this approach, we validate it in the context, among others, of a self-aware, reconfigurable computing system for aerospace. Finally, we move the scope of our research towards robotics and multi-agent systems, a topic of thriving popularity in space exploration. We tackle the problem of connectivity assessment and maintenance in the distributed and self-adaptive context of swarm robotics. We review the limitations of existing solutions and propose a novel methodology to create connected complex geometries that handle multiple tasks simultaneously. Additional contributions in the areas of (i) CubeSat design, (ii) the modelling of space radiation for FPGA fault injection, and (iii) probabilistic timing analysis for real-time systems are summarized in the appendices.
    In the author's opinion, this research provides a number of useful stepping stones towards the creation of a new generation of computing systems that autonomously, and reliably, perform their tasks for longer periods of time, fostering simpler and cheaper space exploration.
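    In the spirit of the fault estimator described above, here is a toy two-state hidden Markov model filter in Python. The hidden state is whether a core has developed a permanent fault; each observation is whether a task run erred, which a transient fault can cause even on a healthy core. All probabilities are invented for illustration, and the thesis' dynamic HMMs are considerably richer.

        HEALTHY, FAULTY = 0, 1
        trans = [[0.999, 0.001],   # a healthy core rarely turns faulty
                 [0.0,   1.0]]     # a permanent fault never heals
        emit = [[0.98, 0.02],      # healthy: 2% transient error rate
                [0.20, 0.80]]      # faulty: errors on 80% of runs

        def forward_filter(observations, prior=(0.999, 0.001)):
            """P(permanent fault | observations so far), after each run."""
            belief, history = list(prior), []
            for obs in observations:             # obs: 0 = pass, 1 = error
                pred = [sum(belief[s] * trans[s][t] for s in (0, 1))
                        for t in (0, 1)]
                upd = [pred[t] * emit[t][obs] for t in (0, 1)]
                z = sum(upd)
                belief = [p / z for p in upd]
                history.append(belief[FAULTY])
            return history

        for p in forward_filter([0, 0, 1, 1, 1, 1]):
            print(f"P(permanent fault) = {p:.4f}")   # rises with the error burst

    Once the filtered fault probability crosses a chosen threshold, the computation would be migrated to spare resources, trading a small amount of availability for an extended system lifetime.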

    LIPIcs, Volume 274, ESA 2023, Complete Volume

    LIPIcs, Volume 274, ESA 2023, Complete Volume