313 research outputs found
ACiS: smart switches with application-level acceleration
Network performance has contributed fundamentally to the growth of supercomputing over the past decades. In parallel, High Performance Computing (HPC) peak performance has depended, first, on ever faster/denser CPUs, and then, just on increasing density alone. As operating frequency, and now feature size, have levelled off, two new approaches are becoming central to achieving higher net performance: configurability and integration. Configurability enables hardware to map to the application, as well as vice versa. Integration enables system components that have generally been single function-e.g., a network to transport data—to have additional functionality, e.g., also to operate on that data. More generally, integration enables compute-everywhere: not just in CPU and accelerator, but also in network and, more specifically, the communication switches.
In this thesis, we propose four novel methods of enhancing HPC performance through Advanced Computing in the Switch (ACiS). More specifically, we propose various flexible and application-aware accelerators that can be embedded into or attached to existing communication switches to improve the performance and scalability of HPC and Machine Learning (ML) applications. We follow a modular design discipline through introducing composable plugins to successively add ACiS capabilities.
In the first work, we propose an inline accelerator to communication switches for user-definable collective operations. MPI collective operations can often be performance killers in HPC applications; we seek to solve this bottleneck by offloading them to reconfigurable hardware within the switch itself. We also introduce a novel mechanism that enables the hardware to support MPI communicators of arbitrary shape and that is scalable to very large systems.
In the second work, we propose a look-aside accelerator for communication switches that is capable of processing packets at line-rate. Functions requiring loops and states are addressed in this method. The proposed in-switch accelerator is based on a RISC-V compatible Coarse Grained Reconfigurable Arrays (CGRAs).
To facilitate usability, we have developed a framework to compile user-provided C/C++ codes to appropriate back-end instructions for configuring the accelerator.
In the third work, we extend ACiS to support fused collectives and the combining of collectives with map operations. We observe that there is an opportunity of fusing communication (collectives) with computation. Since the computation can vary for different applications, ACiS support should be programmable in this method.
In the fourth work, we propose that switches with ACiS support can control and manage the execution of applications, i.e., that the switch be an active device with decision-making capabilities. Switches have a central view of the network; they can collect telemetry information and monitor application behavior and then use this information for control, decision-making, and coordination of nodes.
We evaluate the feasibility of ACiS through extensive RTL-based simulation as well as deployment in an open-access cloud infrastructure. Using this simulation framework, when considering a Graph Convolutional Network (GCN) application as a case study, a speedup of on average 3.4x across five real-world datasets is achieved on 24 nodes compared to a CPU cluster without ACiS capabilities
LIPIcs, Volume 261, ICALP 2023, Complete Volume
LIPIcs, Volume 261, ICALP 2023, Complete Volum
GAMMA: Eine Software für den automatischen Aufbau von Makromolekülen
Die Vielfalt möglicher makromolekularer Strukturen gepaart mit den explorativen Fähigkeiten computergestützter Generierung chemischer (Makro-)Molekülstrukturen bietet großes Forschungspotential. Die Eigenschaften solcher Strukturen, darunter häufig Hohlräume, in die kleine Moleküle passen, führen z. B. zu Anwendungen in Wirt-Gast-Strukturen. Das Hauptthema der vorliegenden Arbeit war die Entwicklung eines neuen Computerprogrammes, das, ausgehend von (kleinen) molekularen Bausteinen, makromolekulare Strukturen erstellt. Nach Vorgabe eines Templatgraphen werden molekulare Bausteine so miteinander verknüpft, dass ein Makromolekül entsteht. Dieses kann anschließend um polare funktionelle Gruppen erweitert werden, die das elektrische Feld des Makromoleküls gezielt verändern, um beispielsweise eine katalytische Wirkung zu entfalten
Massive Data-Centric Parallelism in the Chiplet Era
Traditionally, massively parallel applications are executed on distributed
systems, where computing nodes are distant enough that the parallelization
schemes must minimize communication and synchronization to achieve scalability.
Mapping communication-intensive workloads to distributed systems requires
complicated problem partitioning and dataset pre-processing. With the current
AI-driven trend of having thousands of interconnected processors per chip,
there is an opportunity to re-think these communication-bottlenecked workloads.
This bottleneck often arises from data structure traversals, which cause
irregular memory accesses and poor cache locality.
Recent works have introduced task-based parallelization schemes to accelerate
graph traversal and other sparse workloads. Data structure traversals are split
into tasks and pipelined across processing units (PUs). Dalorex demonstrated
the highest scalability (up to thousands of PUs on a single chip) by having the
entire dataset on-chip, scattered across PUs, and executing the tasks at the PU
where the data is local. However, it also raised questions on how to scale to
larger datasets when all the memory is on chip, and at what cost.
To address these challenges, we propose a scalable architecture composed of a
grid of Data-Centric Reconfigurable Array (DCRA) chiplets. Package-time
reconfiguration enables creating chip products that optimize for different
target metrics, such as time-to-solution, energy, or cost, while software
reconfigurations avoid network saturation when scaling to millions of PUs
across many chip packages. We evaluate six applications and four datasets, with
several configurations and memory technologies, to provide a detailed analysis
of the performance, power, and cost of data-local execution at scale. Our
parallelization of Breadth-First-Search with RMAT-26 across a million PUs
reaches 3323 GTEPS
Recommended from our members
High Performance Silicon Photonic Interconnected Systems
Advances in data-driven applications, particularly artificial intelligence and deep learning, are driving the explosive growth of computation and communication in today’s data centers and high-performance computing (HPC) systems. Increasingly, system performance is not constrained by the compute speed at individual nodes, but by the data movement between them. This calls for innovative architectures, smart connectivity, and extreme bandwidth densities in interconnect designs. Silicon photonics technology leverages mature complementary metal-oxide-semiconductor (CMOS) manufacturing infrastructure and is promising for low cost, high-bandwidth, and reconfigurable interconnects. Flexible and high-performance photonic switched architectures are capable of improving the system performance. The work in this dissertation explores various photonic interconnected systems and the associated optical switching functionalities, hardware platforms, and novel architectures. It demonstrates the capabilities of silicon photonics to enable efficient deep learning training.
We first present field programmable gate array (FPGA) based open-loop and closed-loop control for optical spectral-and-spatial switching of silicon photonic cascaded micro-ring resonator (MRR) switches. Our control achieves wavelength locking at the user-defined resonance of the MRR for optical unicast, multicast, and multiwavelength-select functionalities. Digital-to-analog converters (DACs) and analog-to-digital converters (ADCs) are necessary for the control of the switch. We experimentally demonstrate the optical switching functionalities using an FPGA-based switch controller through both traditional multi-bit DAC/ADC and novel single-wired DAC/ADC circuits. For system-level integration, interfaces to the switch controller in a network control plane are developed. The successful control and the switching functionalitiesachieved are essential for system-level architectural innovations as presented in the following sections.
Next, this thesis presents two novel photonic switched architectures using the MRR-based switches. First, a photonic switched memory system architecture was designed to address memory challenges in deep learning. The reconfigurable photonic interconnects provide scalable solutions and enable efficient use of disaggregated memory resources for deep learning training. An experimental testbed was built with a processing system and two remote memory nodes using silicon photonic switch fabrics and system performance improvements were demonstrated. The collective results and existing high-bandwidth optical I/Os show the potential of integrating the photonic switched memory to state-of-the-art processing systems. Second, the scaling trends of deep learning models and distributed training workloads are challenging network capacities in today’s data centers and HPCs. A system architecture that leverages SiP switch-enabled server regrouping is proposed to tackle the challenges and accelerate distributed deep learning training. An experimental testbed with a SiP switch-enabled reconfigurable fat tree topology was built to evaluate the network performance of distributed ring all-reduce and parameter server workloads. We also present system-scale simulations. Server regrouping and bandwidth steering were performed on a large-scale tapered fat tree with 1024 compute nodes to show the benefits of using photonic switched architectures in systems at scale.
Finally, this dissertation explores high-bandwidth photonic interconnect designs for disaggregated systems. We first introduce and discuss two disaggregated architectures leveraging extreme high bandwidth interconnects with optically interconnected computing resources. We present the concept of rack-scale graphics processing unit (GPU) disaggregation with optical circuit switches and electrical aggregator switches. The architecture can leverage the flexibility of high bandwidth optical switches to increase hardware utilization and reduce application runtimes. A testbed was built to demonstrate resource disaggregation and defragmentation. In addition, we also present an extreme high-bandwidth optical interconnect accelerated low-latency communication architecture for deep learning training. The disaggregated architecture utilizes comb laser sources and MRR-based cross-bar switching fabrics to enable an all-to-all high bandwidth communication with a constant latency cost for distributed deep learning training. We discuss emerging technologies in the silicon photonics platform, including light source, transceivers, and switch architectures, to accommodate extreme high bandwidth requirements in HPC and data center environments. A prototype hardware innovation - Optical Network Interface Cards (comprised of FPGA, photonic integrated circuits (PIC), electronic integrated circuits (EIC), interposer, and high-speed printed circuit board (PCB)) is presented to show the path toward fast lanes for expedited execution at 10 terabits.
Taken together, the work in this dissertation demonstrates the capabilities of high-bandwidth silicon photonic interconnects and innovative architectural designs to accelerate deep learning training in optically connected data center and HPC systems
A hierarchical parallel implementation model for algebra-based CFD simulations on hybrid supercomputers
(English) Continuous enhancement in hardware technologies enables scientific computing to advance incessantly and reach further aims. Since the start of the global race for exascale high-performance computing (HPC), massively-parallel devices of various architectures have been incorporated into the newest supercomputers, leading to an increasing hybridization of HPC systems. In this context of accelerated innovation, software portability and efficiency become crucial.
Traditionally, scientific computing software development is based on calculations in iterative stencil loops (ISL) over a discretized geometry—the mesh. Despite being intuitive and versatile, the interdependency between algorithms and their computational implementations in stencil applications usually results in a large number of subroutines and introduces an inevitable complexity when it comes to portability and sustainability. An alternative is to break the interdependency between algorithm and implementation to cast the calculations into a minimalist set of kernels.
The portable implementation model that is the object of this thesis is not restricted to a particular numerical method or problem. However, owing to the CTTC's long tradition in computational fluid dynamics (CFD) and without loss of generality, this work is targeted to solve transient CFD simulations. By casting discrete operators and mesh functions into (sparse) matrices and vectors, it is shown that all the calculations in a typical CFD algorithm boil down to the following basic linear algebra subroutines: the sparse matrix-vector product, the linear combination of vectors, and the dot product.
The proposed formulation eases the deployment of scientific computing software in massively parallel hybrid computing systems and is demonstrated in the large-scale, direct numerical simulation of transient turbulent flows.(Català) La millora contínua en tecnologies de la informàtica possibilita a la comunitat de computació científica avançar incessantment i assolir ulteriors objectius. Des de l'inici de la cursa global per a la computació d'alt rendiment (HPC) d'exa-escala, s'han incorporat dispositius massivament paral·lels de diverses arquitectures als supercomputadors més nous, donant lloc a una creixent hibridació dels sistemes HPC. En aquest context d'innovació accelerada, la portabilitat i l'eficiència del programari esdevenen crucials. Tradicionalment, el desenvolupament de programari informàtic científic es basa en càlculs en bucles de patrons iteratius (ISL) sobre una geometria discretitzada: la malla. Tot i ser intuïtiva i versàtil, la interdependència entre algorismes i les seves implementacions computacionals en aplicacions de patrons sol donar lloc a un gran nombre de subrutines i introdueix una complexitat inevitable quan es tracta de portabilitat i sostenibilitat. Una alternativa és trencar la interdependència entre l'algorisme i la implementació per reduir els càlculs a un conjunt minimalista de subrutines. El model d'implementació portable objecte d'aquesta tesi no es limita a un mètode o problema numèric concret. No obstant això, i a causa de la llarga tradició del CTTC en dinàmica de fluids computacional (CFD) i sense pèrdua de generalitat, aquest treball està dirigit a resoldre simulacions CFD transitòries. Mitjançant la conversió d'operadors discrets i funcions de malla en matrius (disperses) i vectors, es demostra que tots els càlculs d'un algorisme CFD típic es redueixen a les següents subrutines bàsiques d'àlgebra lineal: el producte dispers matriu-vector, la combinació lineal de vectors, i el producte escalar. La formulació proposada facilita el desplegament de programari de computació científica en sistemes informàtics híbrids massivament paral·lels i es demostra el seu rendiment en la simulació numèrica directa de gran escala de fluxos turbulents transitoris.Enginyeria tèrmic
Real-time analysis of MPI programs for NoC-based many-cores using time division multiplexing
Worst-case execution time (WCET) analysis is crucial for designing hard real-time systems. While the WCET of tasks in a single core system can be upper bounded in isolation, the tasks in a many-core system are subject to shared memory interferences which impose high overestimation of the WCET bounds. However, many-core-based massively parallel applications will enter the area of real-time systems in the years ahead. Explicit message-passing and a clear separation of computation and communication facilitates WCET analysis for those programs.
A standard programming model for message-based communication is the message passing interface (MPI). It provides an application independent interface for different standard communication operations (e.g. broadcast, gather, ...). Thereby, it uses efficient communication patterns with deterministic behaviour. In applying these known structures, we target to provide a WCET analysis for communication that is reusable for different applications if the communication is executed on the same underlying platform. Hence, the analysis must be performed once per hardware platform and can be reused afterwards with only adapting several parameters such as the number of nodes participating in that communication. Typically, the processing elements of many-core platforms are connected via a Network-on-Chip (NoC) and apply techniques such as time-division multiplexing (TDM) to provide guaranteed services for the network. Hence, the hardware and the applied technique for guaranteed service needs to facilitate this reusability of the analysis as well.
In this work we review different general-purpose TDM schedules that enable a WCET approximation independent of the placement of tasks on processing elements of a many-core which uses a NoC with torus topology. Furthermore, we provide two new schedules that show a similar performance as the state-of-the-art schedules but additionally serve situations where the presented state-of-the-art schedules perform poorly. Based on these schedules a procedure for the WCET analysis of the communication patterns used in MPI is proposed. Finally, we show how to apply the results of the analysis to calculate the WCET upper bound for a complete MPI program.
Detailed insights in the performance of the applied TDM schedules are provided by comparing the schedules to each other in terms of timing. Additionally, we discuss the exhibited timing of the general-purpose schedules compared to a state-of-the-art application specific TDM schedule to put in relation both types of schedules. We apply the proposed procedure to several standard types of communication provided in MPI and compare different patterns that are used to implement a specific communication. Our evaluation investigates the communications’ building blocks of the timing bounds and shows the tremendous impact of choosing the appropriate communication pattern. Finally, a case study demonstrates the application of the presented procedure to a complete MPI program.
With the method proposed in this work it is possible to perform a reusable WCET timing analysis for the communication in a NoC that is independent of the placement of tasks on the chip. Moreover, as the applied schedules are not optimized for a specific application but can be used for all applications in the same way, there are only marginal changes in the timing of the communication when the software is adapted or updated. Thus, there is no need to perform the timing analysis from scratch in such cases
Discontinuous Galerkin Spectral Element Methods for Astrophysical Flows in Multi-physics Applications
In engineering applications, discontinuous Galerkin methods (DG) have been proven to be a powerful and flexible class of high order methods for problems in computational fluid dynamics. However, the potential benefits of DG for applications in astrophysical contexts is still relatively unexplored in its entirety. To this day, a decent number of studies surveying DG for astrophysical flows have been conducted. But the adoption of DG by the astrophysics community is just beginning to gain traction and integration of DG into established, multi-physics simulation frameworks for comprehensive astrophysical modeling is still lacking. It is our firm believe, that the full potential of novel approaches for numerically solving the fluid equations only shows under the pressure of real-world simulations with all aspects of multi-physics, challenging flow configurations, resolution and runtime constraints, and efficiency metrics on high-performance systems involved. Thus, we see the pressing need to propel DG from the well-trodden path of cataloguing test results under "optimal" laboratory conditions towards the harsh and unforgiving environment of large-scale astrophysics simulations. Consequently, the core of this work is the development and deployment of a robust DG scheme solving the ideal magneto-hydrodynamics equations with multiple species on three-dimensional Cartesian grids with adaptive mesh refinement. We chose to implement DG within the venerable simulation framework FLASH, with a specific focus on multi-physics problems in astrophysics. This entails modifications of the vanilla DG scheme to make it fit seamlessly within FLASH in such a way that all other physics modules can be naturally coupled without additional implementation overhead. A key ingredient is that our DG scheme uses mean value data organized into blocks - the central data structure in FLASH. Having the opportunity to work on mean values, allows us to rely on a rock-solid, monotone Finite Volume (FV) scheme as "backup" whenever the high order DG method fails in cases when the flow gets too harsh. Finding ways to combine the two schemes in a fail-safe manner without loosing primary conservation while still maintaining high order accuracy for smooth, well-resolved flows involves a series of careful considerations, which we document in this thesis. The result of our work is a novel shock capturing scheme - a hybrid between FV and DG - with smooth transitions between low and high order fluxes according to solution smoothness estimators. We present extensive validations and test cases, specifically its interaction with multi-physics modules in FLASH such as (self-)gravity and radiative transfer. We also investigate the benefits and pitfalls of integrating end-to-end entropy stability into our numerical scheme, with special focus on highly compressible turbulent flows and shocks. Our implementation of DG in FLASH allows us to conduct preliminary yet comprehensive astrophysics simulations proving that our new solver is ready for assessments and investigations by the astrophysics community
Collected Papers (on various scientific topics), Volume XIII
This thirteenth volume of Collected Papers is an eclectic tome of 88 papers in various fields of sciences, such as astronomy, biology, calculus, economics, education and administration, game theory, geometry, graph theory, information fusion, decision making, instantaneous physics, quantum physics, neutrosophic logic and set, non-Euclidean geometry, number theory, paradoxes, philosophy of science, scientific research methods, statistics, and others, structured in 17 chapters (Neutrosophic Theory and Applications; Neutrosophic Algebra; Fuzzy Soft Sets; Neutrosophic Sets; Hypersoft Sets; Neutrosophic Semigroups; Neutrosophic Graphs; Superhypergraphs; Plithogeny; Information Fusion; Statistics; Decision Making; Extenics; Instantaneous Physics; Paradoxism; Mathematica; Miscellanea), comprising 965 pages, published between 2005-2022 in different scientific journals, by the author alone or in collaboration with the following 110 co-authors (alphabetically ordered) from 26 countries: Abduallah Gamal, Sania Afzal, Firoz Ahmad, Muhammad Akram, Sheriful Alam, Ali Hamza, Ali H. M. Al-Obaidi, Madeleine Al-Tahan, Assia Bakali, Atiqe Ur Rahman, Sukanto Bhattacharya, Bilal Hadjadji, Robert N. Boyd, Willem K.M. Brauers, Umit Cali, Youcef Chibani, Victor Christianto, Chunxin Bo, Shyamal Dalapati, Mario Dalcín, Arup Kumar Das, Elham Davneshvar, Bijan Davvaz, Irfan Deli, Muhammet Deveci, Mamouni Dhar, R. Dhavaseelan, Balasubramanian Elavarasan, Sara Farooq, Haipeng Wang, Ugur Halden, Le Hoang Son, Hongnian Yu, Qays Hatem Imran, Mayas Ismail, Saeid Jafari, Jun Ye, Ilanthenral Kandasamy, W.B. Vasantha Kandasamy, Darjan Karabašević, Abdullah Kargın, Vasilios N. Katsikis, Nour Eldeen M. Khalifa, Madad Khan, M. Khoshnevisan, Tapan Kumar Roy, Pinaki Majumdar, Sreepurna Malakar, Masoud Ghods, Minghao Hu, Mingming Chen, Mohamed Abdel-Basset, Mohamed Talea, Mohammad Hamidi, Mohamed Loey, Mihnea Alexandru Moisescu, Muhammad Ihsan, Muhammad Saeed, Muhammad Shabir, Mumtaz Ali, Muzzamal Sitara, Nassim Abbas, Munazza Naz, Giorgio Nordo, Mani Parimala, Ion Pătrașcu, Gabrijela Popović, K. Porselvi, Surapati Pramanik, D. Preethi, Qiang Guo, Riad K. Al-Hamido, Zahra Rostami, Said Broumi, Saima Anis, Muzafer Saračević, Ganeshsree Selvachandran, Selvaraj Ganesan, Shammya Shananda Saha, Marayanagaraj Shanmugapriya, Songtao Shao, Sori Tjandrah Simbolon, Florentin Smarandache, Predrag S. Stanimirović, Dragiša Stanujkić, Raman Sundareswaran, Mehmet Șahin, Ovidiu-Ilie Șandru, Abdulkadir Șengür, Mohamed Talea, Ferhat Taș, Selçuk Topal, Alptekin Ulutaș, Ramalingam Udhayakumar, Yunita Umniyati, J. Vimala, Luige Vlădăreanu, Ştefan Vlăduţescu, Yaman Akbulut, Yanhui Guo, Yong Deng, You He, Young Bae Jun, Wangtao Yuan, Rong Xia, Xiaohong Zhang, Edmundas Kazimieras Zavadskas, Zayen Azzouz Omar, Xiaohong Zhang, Zhirou Ma.
- …