82 research outputs found

    A new-generation class of parallel architectures and their performance evaluation

    The development of computers with hundreds or thousands of processors and a capability for very high performance is essential for many computational problems, such as weather modeling, fluid dynamics, and aerodynamics. Several interconnection networks have been proposed for parallel computers. Nevertheless, the majority of them suffer from rather poor topological properties that result in large memory latencies for DSM (Distributed Shared-Memory) computers. On the other hand, scalable networks with very good topological properties are often impossible to build because of their prohibitively high VLSI (e.g., wiring) complexity. One such network is the generalized hypercube (GH). The GH supports full connectivity of its nodes in each dimension and is characterized by outstanding topological properties. In addition, low-dimensional GHs have very large bisection widths. In this dissertation we propose a new class of processor interconnections, namely HOWs (Highly Overlapping Windows), that are more generic than the GH, are highly scalable, and have comparable performance. We analyze the communication capabilities of 2-D HOW systems and demonstrate that in practical cases HOW systems perform much better than binary hypercubes for important communication patterns. These properties come in addition to the good scalability and low hardware complexity of HOW systems. We present algorithms for one-to-one, one-to-all broadcast, all-to-all broadcast, one-to-all personalized, and all-to-all personalized communication on HOW systems. These algorithms are developed and evaluated for several communication models. In addition, we develop techniques for the efficient embedding of popular topologies, such as the ring, the torus, and the hypercube, into 1-D and 2-D HOW systems. The objective is to show that 2-D HOW systems are not only scalable and easy to implement, but also yield good embeddings of several classical topologies.
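
    As a point of reference for the GH's rich per-dimension connectivity, the following is a minimal sketch assuming the standard mixed-radix definition of the generalized hypercube (the function name and node representation are ours, not the dissertation's):

```python
def gh_neighbors(node, radices):
    """Neighbors of `node` in a generalized hypercube GH(radices).

    A node is a mixed-radix tuple; in each dimension it is connected to
    every node that differs from it in that digit alone, so the degree
    is sum(r - 1 for r in radices) and the diameter equals the number
    of dimensions.
    """
    neighbors = []
    for dim, radix in enumerate(radices):
        for digit in range(radix):
            if digit != node[dim]:
                neighbors.append(node[:dim] + (digit,) + node[dim + 1:])
    return neighbors

# Example: in a 2-D GH with radix 4 per dimension, every node has
# (4 - 1) + (4 - 1) = 6 neighbors, and any node is at most 2 hops away.
print(gh_neighbors((1, 2), (4, 4)))
```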

    On the implementation of P-RAM algorithms on feasible SIMD computers

    The P-RAM model of computation has proved to be a very useful theoretical model for exploiting and extracting the inherent parallelism in problems and thus for designing parallel algorithms. It is therefore important to examine whether results obtained for this model can be translated onto machines considered more realistic under current technological constraints. In this thesis, we show how many techniques and algorithms designed for the P-RAM can be implemented on the feasible SIMD class of computers. The first investigation concerns classes of problems solvable on the P-RAM model using the recursive techniques of compression, tree contraction, and divide and conquer. For such problems, specific methods are emphasised to achieve efficient implementations on some SIMD architectures. Problems such as list ranking and polynomial and expression evaluation are shown to have efficient solutions on the 2-dimensional mesh-connected computer. The balanced binary tree technique is widely employed to solve many problems in the P-RAM model. By proposing an implicit embedding of the binary tree of size n on a (√n × √n) mesh-connected computer (contrary to the usual H-tree approach, which requires a mesh of size ≈ (2√n × 2√n)), we show that many of the problems solvable using this technique can be efficiently implemented on this architecture. Two efficient O(√n) algorithms for solving the bracket matching problem are presented. Consequently, the problems of expression evaluation (where the expression is given in array form), evaluating algebraic expressions with a carrier of constant bounded size, and parsing expressions of both bracket and input-driven languages are all shown to have efficient solutions on the 2-dimensional mesh-connected computer. Turning to non-tree-structured computations, we show that the Eulerian tour problem for a given graph with m edges and maximum vertex degree d can be solved in O(d√n) parallel time on the 2-dimensional mesh-connected computer. A way to increase processor utilisation on the 2-dimensional mesh-connected computer is also presented. The method suggested consists of pipelining sets of iteratively solvable problems, each of which at each step of its execution uses only a fraction of the available PEs.
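
    List ranking is the canonical example of the P-RAM techniques discussed above. As a rough illustration (a sketch in our own notation that simulates the synchronous P-RAM rounds sequentially rather than on a mesh), pointer jumping halves every node's remaining distance to the tail in each round:

```python
def list_rank(succ):
    """Pointer-jumping list ranking: for each node of a linked list,
    compute its distance to the tail. `succ[i]` is i's successor, with
    succ[i] == i marking the tail.

    Each round halves the remaining pointer length, so a P-RAM finishes
    in O(log n) synchronous steps; here every round is simulated
    sequentially over all n nodes.
    """
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    succ = list(succ)
    for _ in range(max(1, n).bit_length()):
        # Snapshot both arrays so all "processors" read pre-step values.
        new_rank = [rank[i] + rank[succ[i]] for i in range(n)]
        new_succ = [succ[succ[i]] for i in range(n)]
        rank, succ = new_rank, new_succ
    return rank

# List 0 -> 1 -> 2 -> 3 (tail): ranks are [3, 2, 1, 0].
print(list_rank([1, 2, 3, 3]))
```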

    Efficient Communication Acceleration for Next-Gen Scale-up Deep Learning Training Platforms

    Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPUs/TPUs) via fast, customized interconnects. As the size of DL models and the compute efficiency of the accelerators have continued to increase, there has been a corresponding steady increase in the bandwidth of these interconnects. Systems today provide hundreds of gigabytes per second (GB/s) of interconnect bandwidth via a mix of solutions, such as multi-chip packaging modules (MCM) and proprietary interconnects (e.g., NVLink), that together form the scale-up network of accelerators. However, as we identify in this work, a significant portion of this bandwidth goes under-utilized. This is because (i) using compute cores for executing collective operations such as all-reduce decreases overall compute efficiency, (ii) there is memory bandwidth contention between the accesses for arithmetic operations and those for collectives, and (iii) there is significant internal bus congestion that increases the latency of communication operations. To address this challenge, we propose a novel microarchitecture, called Accelerator Collectives Engine (ACE), for DL collective communication offload. ACE is a smart network interface (NIC) tuned to cope with the high-bandwidth and low-latency requirements of scale-up networks and is able to efficiently drive the various scale-up network systems (e.g., switch-based or point-to-point topologies). We evaluate the benefits of ACE with micro-benchmarks (e.g., single-collective performance) and popular DL models using an end-to-end DL training simulator. For modern DL workloads, ACE on average increases network bandwidth utilization by 1.97x, resulting in 2.71x and 1.44x speedups in iteration time for ResNet-50 and GNMT, respectively.
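
    To make concrete why all-reduce stresses the interconnect, the following is a minimal simulation of the classic ring all-reduce schedule (our own illustrative sketch; ACE itself offloads such collectives to dedicated hardware rather than executing them this way):

```python
import numpy as np

def ring_allreduce(node_data):
    """Ring all-reduce over p simulated nodes.

    node_data[i] is node i's input, pre-split into p equal chunks.
    After a reduce-scatter phase and an all-gather phase (p - 1 steps
    each), every node holds the elementwise sum of all inputs. Each node
    transmits about 2 * (p - 1) / p of its data in total, which is why
    all-reduce is bandwidth-bound rather than compute-bound.
    """
    p = len(node_data)
    data = [[np.asarray(c, dtype=float).copy() for c in node]
            for node in node_data]

    # Reduce-scatter: at step s, node i forwards chunk (i - s) mod p to
    # its ring successor, which accumulates it.
    for s in range(p - 1):
        for i in range(p):
            c = (i - s) % p
            data[(i + 1) % p][c] += data[i][c]

    # All-gather: node i now owns the fully reduced chunk (i + 1) mod p
    # and circulates it around the ring.
    for s in range(p - 1):
        for i in range(p):
            c = (i + 1 - s) % p
            data[(i + 1) % p][c] = data[i][c].copy()
    return data

# Three nodes, each contributing [1, 1, 1]; every node ends with [3, 3, 3].
out = ring_allreduce([[[1], [1], [1]]] * 3)
print([np.concatenate(node) for node in out])
```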

NeuroBench: A Framework for Benchmarking Neuromorphic Computing Algorithms and Systems

    Neuromorphic computing shows promise for advancing the computing efficiency and capabilities of AI applications using brain-inspired principles. However, the neuromorphic research field currently lacks standardized benchmarks, making it difficult to accurately measure technological advancements, compare performance with conventional methods, and identify promising future research directions. Prior neuromorphic computing benchmark efforts have not seen widespread adoption due to a lack of inclusive, actionable, and iterative benchmark design and guidelines. To address these shortcomings, we present NeuroBench: a benchmark framework for neuromorphic computing algorithms and systems. NeuroBench is a collaboratively designed effort from an open community of nearly 100 co-authors across over 50 institutions in industry and academia, aiming to provide a representative structure for standardizing the evaluation of neuromorphic approaches. The NeuroBench framework introduces a common set of tools and a systematic methodology for inclusive benchmark measurement, delivering an objective reference framework for quantifying neuromorphic approaches in both hardware-independent (algorithm track) and hardware-dependent (system track) settings. In this article, we present initial performance baselines across various model architectures on the algorithm track and outline the system track benchmark tasks and guidelines. NeuroBench is intended to continually expand its benchmarks and features to foster and track the progress made by the research community.

    Hypercube-Based Topologies With Incremental Link Redundancy.

    Hypercube structures have received a great deal of attention due to the attractive properties inherent in their topology. Parallel algorithms targeted at this topology can be partitioned into many tasks, each running on one node processor. A high degree of performance is achievable by running every task individually and concurrently on each node processor available in the hypercube. Nevertheless, performance can be greatly degraded if the node processors spend much of their time just communicating with one another. The goal in designing hypercubes is, therefore, to achieve a high ratio of computation time to communication time. This dissertation primarily addresses ways to enhance system performance by minimizing the communication time among processors. The need for improving the performance of hypercube networks is clearly explained. Three novel hypercube-related topologies with improved performance are proposed and analyzed. First, the Bridged Hypercube (BHC) is introduced. It is shown that this design is remarkably more efficient and cost-effective than the standard hypercube due to its low diameter. Basic routing algorithms such as one-to-one and broadcasting are developed for the BHC and proven optimal. Shortcomings of the BHC, such as its asymmetry and limited applicability, are clearly discussed. The Folded Hypercube (FHC), a symmetric network with low diameter and low node degree, is then introduced. This new topology is shown to support highly efficient communication among the processors. For the FHC, optimal routing algorithms are developed and proven to be remarkably more efficient than those of the conventional hypercube. For both the BHC and the FHC, network parameters such as average distance, message traffic density, and communication delay are derived and comparatively analyzed. Lastly, to enhance the fault tolerance of the hypercube, a new design called the Fault Tolerant Hypercube (FTH) is proposed. The FTH is shown to exhibit a graceful degradation in performance in the presence of faults. Probabilistic models based on Markov chains are employed to characterize the fault tolerance of the FTH, and the results are verified by Monte Carlo simulation. The most attractive feature of all the new topologies is the asymptotically zero overhead associated with them. The designs are simple and implementable, and they lend themselves to many parallel processing applications requiring a high degree of performance.
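
    As a small illustration of the folded construction (a sketch assuming the usual definition of the FHC, i.e., a binary n-cube augmented with one complement link per node; the code and names are ours):

```python
def fhc_neighbors(node, n):
    """Neighbors of `node` (an n-bit label) in the Folded Hypercube FHC(n):
    the n standard hypercube links, each flipping one bit, plus one extra
    "fold" link to the bitwise complement. The fold cuts the diameter
    from n down to ceil(n / 2) at the cost of one extra port per node.
    """
    neighbors = [node ^ (1 << i) for i in range(n)]  # cube links
    neighbors.append(node ^ ((1 << n) - 1))          # complement link
    return neighbors

# In FHC(4), node 0000 connects to 0001, 0010, 0100, 1000, and 1111.
print([format(v, "04b") for v in fhc_neighbors(0b0000, 4)])
```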

    Task allocation and migration on a star-network

    Modern-day applications require computational power that cannot be satisfied by uniprocessor systems, so the use of multiprocessor systems becomes necessary. This thesis presents an approach to allocating tasks on a multiprocessor system called the star network. Generally, an incoming task requires only a part of the star network, not the whole network, for its execution. We therefore need a task allocation strategy that can identify the free processors forming a substar and allocate tasks to these substars. A task executes for a time equal to its residence time and then relinquishes the substar. Sometimes there may be enough free processors in the network to form a substar that can host the next incoming task, yet the allocation strategy fails to recognize these free processors as a substar. To create a substar of free processors for the next task, task migration has to be performed so that the free processors are regrouped into a substar. In this work, three processor allocation strategies are presented: static, dynamic, and dynamic with task migration. Using simulations, these strategies are compared to obtain the percentage improvement of one over another, and a comparative study of their behavior on star networks and hypercubes is carried out. For both networks, incorporating task migration into dynamic allocation achieves a saving of 5-11% over simple dynamic allocation.
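
    For readers unfamiliar with the topology, here is a minimal sketch of star-graph adjacency under its standard definition (the representation and names are ours): nodes are permutations of n symbols, and each edge swaps the first symbol with the symbol in position i.

```python
def star_neighbors(perm):
    """Neighbors of a node in the n-star graph S_n. Nodes are
    permutations of n symbols; an edge swaps the first symbol with the
    symbol in position i, for i = 2..n, so every node has degree n - 1.
    """
    p = list(perm)
    out = []
    for i in range(1, len(p)):
        q = p[:]
        q[0], q[i] = q[i], q[0]
        out.append(tuple(q))
    return out

# S_3 has 3! = 6 nodes of degree 2 (a 6-cycle). A substar S_2 consists
# of the nodes that agree on the last position, e.g. (., ., 3).
print(star_neighbors((1, 2, 3)))  # [(2, 1, 3), (3, 2, 1)]
```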

    Information Leakage Attacks and Countermeasures

    The scientific community has been working consistently on the pervasive problem of information leakage, uncovering numerous attack vectors and proposing various countermeasures. Despite these efforts, leakage incidents remain prevalent as the complexity of systems and protocols increases and sophisticated modeling methods become more accessible to adversaries. This work studies how information leakages manifest in, and impact, interconnected systems and their users. We first focus on online communications and investigate leakages in the Transport Layer Security (TLS) protocol. Using modern machine learning models, we show that an eavesdropping adversary can efficiently exploit meta-information (e.g., packet sizes) not protected by TLS encryption to launch fingerprinting attacks at an unprecedented scale, even under non-optimal conditions. We then turn our attention to ultrasonic communications and discuss their security shortcomings and how adversaries could exploit them to compromise users of anonymity networks (even though such networks aim to offer a greater level of privacy than TLS). Following up on these, we delve into physical-layer leakages that concern a wide array of (networked) systems such as servers, embedded nodes, Tor relays, and hardware cryptocurrency wallets. We revisit location-based side-channel attacks and develop an exploitation neural network. Our model demonstrates the capabilities of a modern adversary but also provides an inexpensive tool for auditors to detect such leakages early in the development cycle. Subsequently, we investigate techniques that further minimize the impact of leakages found in production components. Our proposed system design distributes both the custody of secrets and the execution of cryptographic operations across several components, thus making the exploitation of leaks difficult.
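
    To illustrate the class of attack, the following is a toy sketch on synthetic data: TLS encrypts payloads but not record lengths, so a trace's packet-size sequence can identify which page was fetched. Real fingerprinting pipelines use far richer features (direction, timing, bursts); everything here, including the data, is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_sites, traces_per_site, trace_len = 5, 40, 50

# Each "site" gets a characteristic packet-size signature; each visit
# produces a noisy version of it, as an eavesdropper would observe.
X, y = [], []
for site in range(n_sites):
    base = rng.integers(100, 1500, size=trace_len)
    for _ in range(traces_per_site):
        noise = rng.integers(-50, 50, size=trace_len)
        X.append(np.clip(base + noise, 0, 1500))
        y.append(site)

X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), np.array(y), test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(f"fingerprinting accuracy: {clf.score(X_test, y_test):.2f}")
```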

    Schematics of Graphs and Hypergraphs

    Graph drawing, a subfield of computer science, pursues the goal of realizing graphs, or their generalization, hypergraphs, geometrically. If one restricts oneself to visually highlighting the essential information in a drawing model, one speaks of schematics. The main instruments are construction algorithms and characterizations of the graph classes suitable for such constructions. In this work, schematics for graphs and hypergraphs are formalized and investigated with these instruments. The dissertation first examines the "partial edge drawing" (PED) model for graphs (with respect to straight-line drawings). To visually eliminate crossings in the middle of an edge, every edge is replaced by a crossing-free piece (a stub) at its source and target vertices. A PED variant in which the length ratio between stub and edge is exactly 1⁄4 (a 1⁄4-SHPED) has established itself as the standard. For 1⁄4-SHPEDs, construction algorithms, a classification, an implementation, and an evaluation are presented. In addition, PED variants with fixed vertex positions and variants based on orthogonal drawings are explored. The BUS model for hypergraphs is then investigated, in which hyperedges are represented by fat horizontal or vertical segments called buses. A complete characterization is given of the planar incidence graphs of hypergraphs that admit a planar drawing in the BUS model, and various planar BUS variants with fixed vertex positions are discussed. Finally, a point set of subquadratic size is given for the first time that admits a planar embedding (vertices are mapped to points) of every 2-outerplanar graph.
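
    As a quick illustration of the stub idea (a sketch in our own code, following the definition above; rendering the segments is left to the reader):

```python
def shped_stubs(points, edges, ratio=0.25):
    """Replace each straight-line edge by its two stubs, as in a
    delta-SHPED: one stub of `ratio` times the edge length at the
    source and one at the target (ratio = 0.25 gives the standard
    1/4-SHPED). Returns a list of ((x1, y1), (x2, y2)) segments.
    """
    segments = []
    for u, v in edges:
        (ux, uy), (vx, vy) = points[u], points[v]
        dx, dy = vx - ux, vy - uy
        segments.append(((ux, uy), (ux + ratio * dx, uy + ratio * dy)))
        segments.append(((vx, vy), (vx - ratio * dx, vy - ratio * dy)))
    return segments

# Two crossing edges: the middle halves vanish, so the crossing
# disappears while every edge's endpoints and direction stay readable.
pts = {0: (0.0, 0.0), 1: (1.0, 1.0), 2: (0.0, 1.0), 3: (1.0, 0.0)}
print(shped_stubs(pts, [(0, 1), (2, 3)]))
```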