
    High-Level Synthesis for Embedded Systems


    Improving scalability of large-scale distributed Spiking Neural Network simulations on High Performance Computing systems using novel architecture-aware streaming hypergraph partitioning

    After theory and experimentation, modelling and simulation is regarded as the third pillar of science, helping scientists to further their understanding of complex systems. In recent years there has been a growing scientific focus on computational neuroscience as a means to understand the brain and its functions, with large international projects (Human Brain Project, Brain Activity Map, MindScope and China Brain Project) aiming to further our knowledge of high-level cognitive functions. They are a testament to the enormous interest, difficulty and importance of solving the mysteries of the brain. Spiking Neural Network (SNN) simulations are widely used in the domain to facilitate experimentation. Scaling SNN simulations to large networks usually results in a more-than-linear increase in computational complexity, and the computing resources required for brain-scale simulation far surpass the capabilities of today's personal computers. If those demands are to be met, distributed computation models need to be adopted, since improvements in individual processor speed have slowed due to physical limits on heat dissipation. This is a significant change that requires careful management of the workload at many levels: partitioning of work, balancing of communication and workload, efficient inter-process communication, and efficient use of available memory. If large-scale neuronal network models are to be run successfully, simulators must address these concerns and offer viable solutions to the challenges they pose.

Large-scale SNN simulations exhibit most of the issues found in general large-scale distributed HPC computation. Commonly used workload distribution algorithms (round-robin, random and manual allocation) do not take into account connectivity locality, which is natural in biological networks; this can lead to increased communication requirements when the simulation is distributed over multiple computing nodes. State-of-the-art SNN simulators distribute spike data using dense communication collectives, the common approach to point-to-point communication in distributed computation, whereas sparse communication collectives have been suggested to incur lower overheads when the application's communication pattern is sparse. In this work we characterise the bottlenecks of communication-bound SNN simulations and identify communication balance and sparsity as the main contributors to scalability. We propose hypergraph partitioning to distribute neurons across computing nodes to minimise communication (increasing sparsity); a hypergraph is a generalisation of a graph in which a (hyper)edge can link two or more vertices at once. Coupled with a novel use of sparse-aware communication collectives, this increases computational efficiency by up to 40.8 percentage points and reduces simulation time by up to 73%, compared to the round-robin allocation common in neuronal simulators.

HPC systems have, by design, highly hierarchical communication network links, with qualitative differences in communication speed and latency between computing nodes. This can create a mismatch between the communication patterns of a distributed simulation and the physical capabilities of the hardware. If large distributed simulations are to take full advantage of these systems, the communication properties of the HPC system need to be taken into account when allocating workload, so that frequent, heavy communication is routed through fast network links. Strategies that consider these heterogeneous physical communication capabilities are called architecture-aware. After demonstrating that hypergraph partitioning leads to more efficient workload allocation in SNN simulations, this thesis proposes a novel sequential hypergraph partitioning algorithm that incorporates network bandwidth via profiling. This leads to a significant reduction in execution time (up to 14x speedup in synthetic benchmark simulations compared to architecture-agnostic partitioners).

The motivating context of this work is large-scale brain simulation; however, in the era of social media, large graphs and hypergraphs are increasingly relevant in many other scientific applications. A common feature of such graphs is that they are too large for a single machine to handle, in terms of both performance and memory requirements. State-of-the-art multilevel partitioners have been shown to struggle to scale to large graphs in distributed memory, not only because they take a long time to run, but also because they require full knowledge of the graph (not possible for dynamic graphs) and must fit the graph entirely in memory (not possible for very large graphs). To address those limitations we propose a parallel implementation of our architecture-aware streaming hypergraph partitioning algorithm (HyperPRAW) to model distributed applications. Results demonstrate that HyperPRAW produces consistent speedups over previous streaming approaches that only consider hyperedge overlap (up to 5.2x). Compared to a global multilevel partitioner on dense hypergraphs (those with high average cardinality), HyperPRAW produces workload allocations that speed up runtime in a synthetic simulation benchmark by up to 4.3x. HyperPRAW has the potential to scale to very large hypergraphs because it only requires local information to make allocation decisions, with an order of magnitude smaller memory footprint than global partitioners. The combined contributions of this thesis lead to a novel, parallel, scalable, streaming hypergraph partitioning algorithm (HyperPRAW) that can help scale large distributed simulations on HPC systems. HyperPRAW tackles three of the main scalability challenges: it produces highly balanced distributed computation and communication, minimising idle time between computing nodes; it reduces communication overhead by placing frequently communicating simulation elements close to each other (where the communication cost is minimal); and it provides a solution with a reasonable memory footprint, allowing larger problems to be tackled than with state-of-the-art alternatives such as global multilevel partitioning.
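    To make the streaming, architecture-aware idea concrete, the following is a minimal sketch of the kind of one-pass assignment loop such a partitioner performs. The scoring formula, parameter names and profiled bandwidth matrix are illustrative assumptions, not HyperPRAW's exact algorithm.

```python
# A minimal sketch of one-pass, architecture-aware streaming assignment.
# The scoring formula and names are illustrative assumptions, not
# HyperPRAW's exact algorithm.

def stream_partition(vertices, hyperedges_of, num_parts, bandwidth, capacity):
    """Assign each vertex (e.g. a neuron) to a partition as it streams past.

    hyperedges_of -- dict: vertex -> hyperedge ids it belongs to
    bandwidth     -- bandwidth[p][q]: profiled link speed between parts p, q,
                     normalised so that 0 < bandwidth[p][q] <= 1 for p != q
    capacity      -- soft per-partition vertex limit (load balance)
    """
    assignment = {}                 # vertex -> partition
    load = [0] * num_parts          # vertices placed on each partition
    edge_parts = {}                 # hyperedge -> set of partitions it touches

    for v in vertices:              # one pass, local information only
        best, best_score = 0, float("-inf")
        for p in range(num_parts):
            # Reward partitions that v's hyperedges already touch, weighting
            # remote partitions by the profiled bandwidth of the link to them.
            score = 0.0
            for e in hyperedges_of.get(v, ()):
                for q in edge_parts.get(e, ()):
                    score += 1.0 if q == p else bandwidth[p][q]
            score -= load[p] / capacity     # penalise imbalance
            if score > best_score:
                best, best_score = p, score
        assignment[v] = best
        load[best] += 1
        for e in hyperedges_of.get(v, ()):
            edge_parts.setdefault(e, set()).add(best)
    return assignment
```

Because the loop only consults the running `edge_parts` and `load` state, its memory footprint grows with the partitions each hyperedge touches rather than with the whole hypergraph, which is what lets streaming partitioners scale where global multilevel methods cannot.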

    Many-core and heterogeneous architectures: programming models and compilation toolchains

    The abstract is in the attachment.

    Hardware-software codesign in a high-level synthesis environment

    Interfacing hardware-oriented high-level synthesis to software development is a computationally hard problem for which no general solution exists. Under special conditions, the hardware-software codesign (system-level synthesis) problem may be analyzed with traditional tools and efficient heuristics. This dissertation introduces a new alternative to the currently used heuristic methods. The new approach combines the results of top-down hardware development with existing basic hardware units (bottom-up libraries) and compiler generation tools. The optimization goal is to maximize operating frequency or minimize cost with reasonable tradeoffs in other properties. The dissertation research provides a unified approach to hardware-software codesign. The improvements over previously existing design methodologies are presented in the framework of an academic CAD environment (PIPE). This CAD environment implements a sufficient subset of functions of commercial microelectronics CAD packages. The results may be generalized for other general-purpose algorithms or environments. Reference benchmarks are used to validate the new approach. Most of the well-known benchmarks are based on discrete-time numerical simulations, digital filtering applications, and cryptography (an emerging field in benchmarking). As there is a need for high-performance applications, an additional requirement for this dissertation is to investigate pipelined hardware-software systems' performance and design methods. The results demonstrate that the quality of existing heuristics does not change in the enhanced, hardware-software environment.
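    To illustrate the optimization goal stated above, the toy sketch below chooses one implementation per operation from a bottom-up library so as to maximize operating frequency under a cost budget. The library contents and the exhaustive search are purely illustrative; the dissertation's method operates at a much larger scale within the PIPE environment.

```python
# Toy sketch: pick one unit per operation from a bottom-up library so that
# operating frequency is maximised under a cost budget. Data is hypothetical.
from itertools import product

# library[op] = candidate hardware units as (max_frequency_MHz, cost) pairs
library = {
    "mul": [(200, 5), (350, 12)],
    "add": [(400, 2), (500, 4)],
}

def best_design(library, cost_budget):
    best = None
    for choice in product(*library.values()):
        cost = sum(c for _, c in choice)
        if cost > cost_budget:
            continue
        freq = min(f for f, _ in choice)    # clock limited by the slowest unit
        if best is None or freq > best[0]:
            best = (freq, cost, choice)
    return best

print(best_design(library, cost_budget=15))   # -> (350, 14, ((350, 12), (400, 2)))
```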

    Proceedings of the 5th International Workshop on Reconfigurable Communication-centric Systems on Chip 2010 - ReCoSoC'10 - May 17-19, 2010 Karlsruhe, Germany. (KIT Scientific Reports; 7551)

    ReCoSoC is intended to be a periodic annual meeting to expose and discuss gathered expertise as well as state-of-the-art research around SoC-related topics through plenary invited papers and posters. The workshop aims to provide a prospective view of tomorrow's challenges in the multibillion-transistor era, taking into account the emerging techniques and architectures exploring the synergy between flexible on-chip communication and system reconfigurability.

    Parallel VLSI Circuit Analysis and Optimization

    The prevalence of multi-core processors in recent years has introduced new opportunities and challenges to Electronic Design Automation (EDA) research and development. In this dissertation, several parallel Very Large Scale Integration (VLSI) circuit analysis and optimization methods that utilize the multi-core computing platform to tackle some of the most difficult contemporary Computer-Aided Design (CAD) problems are presented. The first CAD application addressed in this dissertation is analyzing and optimizing mesh-based clock distribution networks. A mesh-based clock distribution network (also known as a clock mesh) is used in high-performance microprocessor designs as a reliable way of distributing clock signals to the entire chip. The second CAD application addressed in this dissertation is SPICE-like (Simulation Program with Integrated Circuit Emphasis) circuit simulation. SPICE simulation is often regarded as the bottleneck of the design flow, and parallel circuit simulation has recently attracted a lot of attention. The first part of the dissertation discusses circuit analysis techniques. First, a combination of a clock-network-specific model order reduction algorithm and a port sliding scheme is presented to tackle the challenges in analyzing large clock meshes with a large number of clock drivers. Our techniques run much faster than standard SPICE simulation and existing model order reduction techniques, and they also provide a basis for clock mesh optimization. Then, a hierarchical multi-algorithm parallel circuit simulation (HMAPS) framework is presented as a novel technique for parallel circuit simulation. The inter-algorithm parallelism approach in HMAPS is completely different from existing intra-algorithm parallel circuit simulation techniques and achieves superlinear speedup in practice. The second part of the dissertation covers parallel circuit optimization. A modified asynchronous parallel pattern search (APPS) based method which utilizes the efficient clock mesh simulation techniques for the clock driver size optimization problem is presented. Our modified APPS method runs much faster than a continuous optimization method and effectively reduces the clock skew for all test circuits. The third part of the dissertation describes parallel performance modeling and optimization of the HMAPS framework. The performance models and runtime optimization scheme improve the speed of HMAPS even further; the dynamically adapted HMAPS becomes a complete solution for parallel circuit simulation.
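    The inter-algorithm parallelism in HMAPS can be pictured as racing several different solvers on the same problem and keeping whichever finishes first. A minimal sketch follows, with dummy solvers standing in for real numerical integration methods; names and timings are illustrative assumptions.

```python
# Sketch of inter-algorithm parallelism: several different solver algorithms
# attack the same simulation concurrently; the first one to finish wins.
import time
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def run_first_to_finish(solvers, circuit):
    """Run all solvers on the same problem; return (name, result) of the winner."""
    with ThreadPoolExecutor(max_workers=len(solvers)) as pool:
        futures = {pool.submit(s, circuit): s.__name__ for s in solvers}
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        winner = done.pop()
        return futures[winner], winner.result()

def fast_solver(circuit):        # stand-in for a method that converges quickly
    time.sleep(0.1); return "waveform A"

def robust_solver(circuit):      # stand-in for a slower, more robust method
    time.sleep(0.3); return "waveform B"

print(run_first_to_finish([fast_solver, robust_solver], circuit=None))
```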

    An Efficient Execution Model for Reactive Stream Programs

    Stream programming is a paradigm where a program is structured as a set of computational nodes connected by streams. Focusing on data moving between computational nodes via streams, this programming model fits well for applications that process long sequences of data. We call such applications reactive stream programs (RSPs) to distinguish them from stream programs with rather small and finite input data. In stream programming, concurrency is expressed implicitly via communication streams. This helps to reduce the complexity of parallel programming. For this reason, stream programming has gained popularity as a programming model for parallel platforms. However, it is also challenging to analyse and improve the performance without an understanding of the program's internal behaviour. This thesis targets an efficient execution model for deploying RSPs on parallel platforms. This execution model includes a monitoring framework to understand the internal behaviour of RSPs, scheduling strategies for RSPs on uniform shared-memory platforms, and mapping techniques for deploying RSPs on heterogeneous distributed platforms. The foundation of the execution model is a study of the performance of RSPs in terms of throughput and latency, which includes quantitative formulae for throughput and latency and identifies the factors that influence these performance metrics. Based on this study, the thesis exploits characteristics of RSPs to derive effective scheduling strategies on uniform shared-memory platforms. Aiming to optimise both throughput and latency, these scheduling strategies are implemented in two heuristic-based schedulers. Both are designed to be centralised, to provide load balancing for RSPs with dynamic behaviour as well as dynamic structures. The first uses the notion of positive and negative data demands on each stream to determine the scheduling priorities; this scheduler is independent of the runtime system. The second requires the runtime system to provide the position information for each computational node in the RSP, and uses that to decide the scheduling priorities. Our experiments show that both schedulers provide similar performance while being significantly better than a reference implementation without dynamic load balancing. Also based on the study of RSP performance, this thesis presents two new heuristic partitioning algorithms used to map RSPs onto heterogeneous distributed platforms: Kernighan-Lin Adaptation (KLA) and Congestion Avoidance (CA), where the main objective is to optimise throughput. This is a multi-parameter optimisation problem to which existing graph partitioning algorithms are not applicable. Compared to the generic meta-heuristic Simulated Annealing algorithm, both proposed algorithms achieve equally good or better results. KLA is faster for small benchmarks while slower for large ones; in contrast, CA is always orders of magnitude faster, even for very large benchmarks.
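    The first scheduler's demand-driven priorities can be sketched as follows. TARGET, the Node class and the priority formula are illustrative assumptions rather than the thesis's exact heuristic.

```python
# Sketch of demand-based scheduling for reactive stream programs: a node's
# priority combines the positive demand of its starved output streams with
# the backlog waiting on its inputs. All names and formulas are assumptions.

TARGET = 4   # desired buffer occupancy per stream (assumed constant)

class Node:
    def __init__(self, name, inputs, outputs):
        self.name, self.inputs, self.outputs = name, inputs, outputs

def priority(node, fill):
    """fill[s] = number of items currently buffered on stream s."""
    demand = sum(max(0, TARGET - fill[s]) for s in node.outputs)  # starved consumers
    backlog = sum(fill[s] for s in node.inputs)                   # data waiting
    return demand + backlog

def schedule_step(nodes, fill):
    """Pick the most urgent node that has data on all of its input streams."""
    runnable = [n for n in nodes if not n.inputs or all(fill[s] > 0 for s in n.inputs)]
    return max(runnable, key=lambda n: priority(n, fill), default=None)

# Example: source -> f -> sink, with stream "a" empty and "b" backed up.
nodes = [Node("source", [], ["a"]), Node("f", ["a"], ["b"]), Node("sink", ["b"], [])]
fill = {"a": 0, "b": 6}
print(schedule_step(nodes, fill).name)   # "sink": draining the backlog is most urgent
```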

    Asynchronous Teams and Tasks in a Message Passing Environment

    As the discipline of scientific computing grows, so too does the "skills gap" between increasingly complex scientific applications and the efficient algorithms they require. Increasing demand for computational power on the march towards exascale requires innovative approaches, and closing the skills gap avoids the many pitfalls that lead to poor utilisation of resources and wasted investment. This thesis tackles two challenges: asynchronous algorithms for parallel computing, and fault tolerance. First I present a novel asynchronous task invocation methodology for Discontinuous Galerkin codes called enclave tasking. The approach modifies the parallel ordering of tasks, allowing efficient scaling on dynamic meshes on up to 756 cores. It ensures high levels of concurrency and intermixes tasks of different computational properties. Critical tasks along domain boundaries are prioritised to overlap computation and communication. The second contribution is the teaMPI library, which forms teams of MPI processes exchanging consistency data through an asynchronous "heartbeat". In contrast to previous approaches, teaMPI operates fully asynchronously with reduced overhead. It is also capable of detecting individually slow or failing ranks and inconsistent data among replicas. Finally I provide an outlook on how asynchronous teams using enclave tasking can be combined into an advanced team-based diffusive load balancing scheme. Both concepts are integrated into and contribute towards the ExaHyPE project, a next-generation code that solves hyperbolic equation systems on dynamically adaptive Cartesian grids.
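    The replica heartbeat idea behind teaMPI can be sketched with mpi4py. This is a standalone illustration, not teaMPI's actual interface: the two-team layout, message tag and lag threshold are assumptions.

```python
# Sketch: split MPI_COMM_WORLD into two replica teams; each rank exchanges a
# "heartbeat" with its twin in the other team to detect slow replicas.
# Assumes an even world size and exactly two teams.
import time
from mpi4py import MPI

world = MPI.COMM_WORLD
size = world.Get_size()
half = size // 2
rank = world.Get_rank()
team = rank // half              # team 0: ranks [0, half), team 1: the rest
twin = (rank + half) % size      # same team-local position in the other team

# Team-local communicator on which the actual simulation would run.
team_comm = world.Split(color=team, key=rank)

HEARTBEAT_TAG = 77               # arbitrary tag reserved for heartbeats

def heartbeat_step(step, compute):
    """Overlap one unit of real work with a heartbeat exchange to the twin."""
    send = world.isend((step, time.time()), dest=twin, tag=HEARTBEAT_TAG)
    recv = world.irecv(source=twin, tag=HEARTBEAT_TAG)
    compute()                                  # real simulation work goes here
    their_step, their_time = recv.wait()       # twin's heartbeat
    if abs(time.time() - their_time) > 5.0:    # assumed slack threshold (s)
        print(f"rank {rank}: replica twin is lagging at step {their_step}")
    send.wait()
```

Unlike this simplified loop, which waits for the twin's heartbeat every step, teaMPI exchanges heartbeats fully asynchronously, so a slow replica does not stall the healthy team.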

    The hArtes Tool Chain

    This chapter describes the different design steps needed to go from legacy code to a transformed application that can be efficiently mapped on the hArtes platform

    Cross-Layer Rapid Prototyping and Synthesis of Application-Specific and Reconfigurable Many-accelerator Platforms

    Technological advances of recent years have laid the foundation for the informatisation of society, with impacts on its economic, political, cultural and social dimensions. At the peak of this development, more and more everyday devices are connected to the web, giving rise to the term "Internet of Things". The future holds the full connection and interaction of IT and communications systems with the natural world, marking the transition to cyber-physical systems and offering meta-services in the physical world, such as personalized medical care, autonomous transportation and smart energy cities. Outlining the necessities of this dynamically evolving market, computer engineers are required to implement computing platforms that incorporate increased systemic complexity while covering a wide range of meta-characteristics, such as cost, design time, reliability and reuse, prescribed by a conflicting set of functional, technological and construction constraints. This thesis aims to address these design challenges by developing methodologies and hardware/software co-design tools that enable the rapid implementation and efficient synthesis of architectural solutions providing the operating meta-features required by the modern market. Specifically, this thesis presents a) methodologies to accelerate the design flow for both reconfigurable and application-specific architectures, b) coarse-grain heterogeneous architectural templates for processing and communication acceleration, and c) efficient multi-objective synthesis techniques, both at a high abstraction level of programming and at the physical silicon level.

Regarding the acceleration of the design flow, the proposed methodology employs virtual platforms to hide architectural details and drastically reduce simulation time. An extension of this framework introduces systemic co-simulation using reconfigurable acceleration platforms as intermediate co-emulation platforms. The development cycle of a hardware/software product is thus accelerated by moving from a vertical serial flow to a circular interactive loop. Moreover, the simulation capabilities are enriched with efficient techniques for detecting and correcting design errors, as well as methods for checking the system's performance metrics against the desired specifications during all phases of system development. In orthogonal correlation with this methodological framework, a new architectural template is proposed, aiming to bridge the gap between design complexity and technological productivity by using specialized hardware accelerators in heterogeneous system-on-chip and network-on-chip platforms. A novel co-design methodology is presented for the hardware accelerators and their respective programming software, including the allocation of tasks to the available resources of the system/network. The introduced framework provides implementation techniques for the accelerators, using either conventional programming flows with a hardware description language or abstract programming-model flows using techniques from high-level synthesis. In either case, the designer is given the option of optimizing systemic measures such as processing speed, throughput, reliability, power consumption and silicon area. Finally, to address the increased complexity of design tools for reconfigurable systems, novel multi-objective optimization evolutionary algorithms are proposed which exploit modern multicore processors and the coarse-grain nature of multithreaded programming environments (e.g. OpenMP) to reduce placement time while simultaneously grouping applications according to their intrinsic characteristics to explore the design space more effectively.

The efficiency of the proposed architectural templates, design tools and methodology flows is evaluated against existing state-of-the-art solutions using applications from typical computing domains, such as digital signal processing, multimedia and arithmetic complexity, as well as systemic heterogeneous environments, such as a computer vision system for autonomous robotic space navigation and many-accelerator systems for HPC and workstations/datacenters. The results strengthen the author's belief that this thesis provides competitive expertise for addressing complex modern, and projected future, design challenges.
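    The multi-objective placement optimization described above can be illustrated with a minimal Pareto-dominance sketch: candidate placements are scored in parallel (a thread pool standing in for the OpenMP threads mentioned above) and only non-dominated solutions are kept. The objectives and the placement encoding are illustrative assumptions.

```python
# Sketch of the multi-objective core of an evolutionary placer: score
# candidates in parallel, then filter by Pareto dominance. Objectives and
# the placement encoding are illustrative assumptions.
from concurrent.futures import ThreadPoolExecutor
import random

def evaluate(placement):
    """Return (wirelength, power); both objectives are to be minimised."""
    wirelength = sum(abs(a - b) for a, b in zip(placement, placement[1:]))
    power = sum(placement)                    # stand-in second objective
    return wirelength, power

def dominates(a, b):
    """True if score vector a is no worse in every objective and better in one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(population):
    with ThreadPoolExecutor() as pool:        # parallel fitness evaluation
        scores = list(pool.map(evaluate, population))
    return [p for p, s in zip(population, scores)
            if not any(dominates(t, s) for t in scores)]

population = [random.sample(range(16), 8) for _ in range(32)]   # random placements
front = pareto_front(population)
print(f"{len(front)} non-dominated placements out of {len(population)}")
```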