82 research outputs found

    A new-generation class of parallel architectures and their performance evaluation

    The development of computers with hundreds or thousands of processors and a capability for very high performance is essential for many computational problems, such as weather modeling, fluid dynamics, and aerodynamics. Several interconnection networks have been proposed for parallel computers. Nevertheless, the majority of them suffer from rather poor topological properties that result in large memory latencies for DSM (Distributed Shared-Memory) computers. On the other hand, scalable networks with very good topological properties are often impossible to build because of their prohibitively high VLSI (e.g., wiring) complexity. One such network is the generalized hypercube (GH). The GH supports full connectivity of its nodes in each dimension and is characterized by outstanding topological properties. In addition, low-dimensional GHs have very large bisection widths. In this dissertation we propose a new class of processor interconnections, namely HOWs (Highly Overlapping Windows), that are more generic than the GH, are highly scalable, and have comparable performance. We analyze the communication capabilities of 2-D HOW systems and demonstrate that in practical cases HOW systems perform much better than binary hypercubes for important communication patterns. These properties come in addition to the good scalability and low hardware complexity of HOW systems. We present algorithms for one-to-one, one-to-all broadcast, all-to-all broadcast, one-to-all personalized, and all-to-all personalized communication on HOW systems. These algorithms are developed and evaluated for several communication models. In addition, we develop techniques for the efficient embedding of popular topologies, such as the ring, the torus, and the hypercube, into 1-D and 2-D HOW systems. The objective is to show that 2-D HOW systems are not only scalable and easy to implement, but also yield good embeddings of several classical topologies.
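
    As a point of reference for the GH's rich per-dimension connectivity, the following is a minimal sketch assuming the standard mixed-radix definition of the generalized hypercube (the function name and node representation are ours, not the dissertation's):

```python
def gh_neighbors(node, radices):
    """Neighbors of `node` in a generalized hypercube GH(radices).

    A node is a mixed-radix tuple; in each dimension it is connected to
    every node that differs from it in that digit alone, so the degree
    is sum(r - 1 for r in radices) and the diameter equals the number
    of dimensions.
    """
    neighbors = []
    for dim, radix in enumerate(radices):
        for digit in range(radix):
            if digit != node[dim]:
                neighbors.append(node[:dim] + (digit,) + node[dim + 1:])
    return neighbors

# Example: in a 2-D GH with radix 4 per dimension, every node has
# (4 - 1) + (4 - 1) = 6 neighbors, and any node is at most 2 hops away.
print(gh_neighbors((1, 2), (4, 4)))
```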

    On the implementation of P-RAM algorithms on feasible SIMD computers

    The P-RAM model of computation has proved to be a very useful theoretical model for exploiting and extracting the inherent parallelism in problems and thus for designing parallel algorithms. It is therefore important to examine whether results obtained for this model can be translated onto machines considered more realistic under current technological constraints. In this thesis, we show how many techniques and algorithms designed for the P-RAM can be implemented on the feasible SIMD class of computers. The first investigation concerns classes of problems solvable on the P-RAM model using the recursive techniques of compression, tree contraction, and divide and conquer. For such problems, specific methods are emphasised to achieve efficient implementations on some SIMD architectures. Problems such as list ranking and polynomial and expression evaluation are shown to have efficient solutions on the 2-dimensional mesh-connected computer. The balanced binary tree technique is widely employed to solve many problems in the P-RAM model. By proposing an implicit embedding of the binary tree of size n on a (√n × √n) mesh-connected computer (contrary to the usual H-tree approach, which requires a mesh of size ≈ (2√n × 2√n)), we show that many of the problems solvable using this technique can be efficiently implemented on this architecture. Two efficient O(√n) algorithms for solving the bracket matching problem are presented. Consequently, the problems of expression evaluation (where the expression is given in array form), evaluating algebraic expressions with a carrier of constant bounded size, and parsing expressions of both bracket and input-driven languages are all shown to have efficient solutions on the 2-dimensional mesh-connected computer. Turning to non-tree-structured computations, we show that the Eulerian tour problem for a given graph with m edges and maximum vertex degree d can be solved in O(d√n) parallel time on the 2-dimensional mesh-connected computer. A way to increase processor utilisation on the 2-dimensional mesh-connected computer is also presented. The method suggested consists of pipelining sets of iteratively solvable problems, each of which at each step of its execution uses only a fraction of the available PEs.
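
    List ranking is the canonical example of the P-RAM techniques discussed above. As a rough illustration (a sketch in our own notation that simulates the synchronous P-RAM rounds sequentially rather than on a mesh), pointer jumping halves every node's remaining distance to the tail in each round:

```python
def list_rank(succ):
    """Pointer-jumping list ranking: for each node of a linked list,
    compute its distance to the tail. `succ[i]` is i's successor, with
    succ[i] == i marking the tail.

    Each round halves the remaining pointer length, so a P-RAM finishes
    in O(log n) synchronous steps; here every round is simulated
    sequentially over all n nodes.
    """
    n = len(succ)
    rank = [0 if succ[i] == i else 1 for i in range(n)]
    succ = list(succ)
    for _ in range(max(1, n).bit_length()):
        # Snapshot both arrays so all "processors" read pre-step values.
        new_rank = [rank[i] + rank[succ[i]] for i in range(n)]
        new_succ = [succ[succ[i]] for i in range(n)]
        rank, succ = new_rank, new_succ
    return rank

# List 0 -> 1 -> 2 -> 3 (tail): ranks are [3, 2, 1, 0].
print(list_rank([1, 2, 3, 3]))
```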

    Efficient Communication Acceleration for Next-Gen Scale-up Deep Learning Training Platforms

    Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPUs/TPUs) via fast, customized interconnects. As the size of DL models and the compute efficiency of the accelerators have continued to increase, there has been a corresponding steady increase in the bandwidth of these interconnects. Systems today provide hundreds of gigabytes per second (GB/s) of interconnect bandwidth via a mix of solutions, such as multi-chip packaging modules (MCM) and proprietary interconnects (e.g., NVLink), that together form the scale-up network of accelerators. However, as we identify in this work, a significant portion of this bandwidth goes under-utilized. This is because (i) using compute cores for executing collective operations such as all-reduce decreases overall compute efficiency, (ii) there is memory bandwidth contention between the accesses for arithmetic operations and those for collectives, and (iii) there is significant internal bus congestion that increases the latency of communication operations. To address this challenge, we propose a novel microarchitecture, called Accelerator Collectives Engine (ACE), for DL collective communication offload. ACE is a smart network interface (NIC) tuned to cope with the high-bandwidth and low-latency requirements of scale-up networks and is able to efficiently drive the various scale-up network systems (e.g., switch-based or point-to-point topologies). We evaluate the benefits of ACE with micro-benchmarks (e.g., single-collective performance) and popular DL models using an end-to-end DL training simulator. For modern DL workloads, ACE on average increases network bandwidth utilization by 1.97x, resulting in 2.71x and 1.44x speedups in iteration time for ResNet-50 and GNMT, respectively.
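
    To make concrete why all-reduce stresses the interconnect, the following is a minimal simulation of the classic ring all-reduce schedule (our own illustrative sketch; ACE itself offloads such collectives to dedicated hardware rather than executing them this way):

```python
import numpy as np

def ring_allreduce(node_data):
    """Ring all-reduce over p simulated nodes.

    node_data[i] is node i's input, pre-split into p equal chunks.
    After a reduce-scatter phase and an all-gather phase (p - 1 steps
    each), every node holds the elementwise sum of all inputs. Each node
    transmits about 2 * (p - 1) / p of its data in total, which is why
    all-reduce is bandwidth-bound rather than compute-bound.
    """
    p = len(node_data)
    data = [[np.asarray(c, dtype=float).copy() for c in node]
            for node in node_data]

    # Reduce-scatter: at step s, node i forwards chunk (i - s) mod p to
    # its ring successor, which accumulates it.
    for s in range(p - 1):
        for i in range(p):
            c = (i - s) % p
            data[(i + 1) % p][c] += data[i][c]

    # All-gather: node i now owns the fully reduced chunk (i + 1) mod p
    # and circulates it around the ring.
    for s in range(p - 1):
        for i in range(p):
            c = (i + 1 - s) % p
            data[(i + 1) % p][c] = data[i][c].copy()
    return data

# Three nodes, each contributing [1, 1, 1]; every node ends with [3, 3, 3].
out = ring_allreduce([[[1], [1], [1]]] * 3)
print([np.concatenate(node) for node in out])
```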

NeuroBench: A Framework for Benchmarking Neuromorphic Computing Algorithms and Systems

    Neuromorphic computing shows promise for advancing the computing efficiency and capabilities of AI applications using brain-inspired principles. However, the neuromorphic research field currently lacks standardized benchmarks, making it difficult to accurately measure technological advancements, compare performance with conventional methods, and identify promising future research directions. Prior neuromorphic computing benchmark efforts have not seen widespread adoption due to a lack of inclusive, actionable, and iterative benchmark design and guidelines. To address these shortcomings, we present NeuroBench: a benchmark framework for neuromorphic computing algorithms and systems. NeuroBench is a collaboratively designed effort from an open community of nearly 100 co-authors across over 50 institutions in industry and academia, aiming to provide a representative structure for standardizing the evaluation of neuromorphic approaches. The NeuroBench framework introduces a common set of tools and a systematic methodology for inclusive benchmark measurement, delivering an objective reference framework for quantifying neuromorphic approaches in both hardware-independent (algorithm track) and hardware-dependent (system track) settings. In this article, we present initial performance baselines across various model architectures on the algorithm track and outline the system track benchmark tasks and guidelines. NeuroBench is intended to continually expand its benchmarks and features to foster and track the progress made by the research community.

    Hypercube-Based Topologies With Incremental Link Redundancy.

    Hypercube structures have received a great deal of attention due to the attractive properties inherent in their topology. Parallel algorithms targeted at this topology can be partitioned into many tasks, each running on one node processor. A high degree of performance is achievable by running every task individually and concurrently on each node processor available in the hypercube. Nevertheless, performance can be greatly degraded if the node processors spend much of their time just communicating with one another. The goal in designing hypercubes is, therefore, to achieve a high ratio of computation time to communication time. This dissertation primarily addresses ways to enhance system performance by minimizing the communication time among processors. The need for improving the performance of hypercube networks is clearly explained. Three novel hypercube-related topologies with improved performance are proposed and analyzed. First, the Bridged Hypercube (BHC) is introduced. It is shown that this design is remarkably more efficient and cost-effective than the standard hypercube due to its low diameter. Basic routing algorithms such as one-to-one and broadcasting are developed for the BHC and proven optimal. Shortcomings of the BHC, such as its asymmetry and limited applicability, are clearly discussed. The Folded Hypercube (FHC), a symmetric network with low diameter and low node degree, is then introduced. This new topology is shown to support highly efficient communication among the processors. For the FHC, optimal routing algorithms are developed and proven to be remarkably more efficient than those of the conventional hypercube. For both the BHC and the FHC, network parameters such as average distance, message traffic density, and communication delay are derived and comparatively analyzed. Lastly, to enhance the fault tolerance of the hypercube, a new design called the Fault Tolerant Hypercube (FTH) is proposed. The FTH is shown to exhibit a graceful degradation in performance in the presence of faults. Probabilistic models based on Markov chains are employed to characterize the fault tolerance of the FTH, and the results are verified by Monte Carlo simulation. The most attractive feature of all the new topologies is the asymptotically zero overhead associated with them. The designs are simple and implementable, and they lend themselves to many parallel processing applications requiring a high degree of performance.
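
    As a small illustration of the folded construction (a sketch assuming the usual definition of the FHC, i.e., a binary n-cube augmented with one complement link per node; the code and names are ours):

```python
def fhc_neighbors(node, n):
    """Neighbors of `node` (an n-bit label) in the Folded Hypercube FHC(n):
    the n standard hypercube links, each flipping one bit, plus one extra
    "fold" link to the bitwise complement. The fold cuts the diameter
    from n down to ceil(n / 2) at the cost of one extra port per node.
    """
    neighbors = [node ^ (1 << i) for i in range(n)]  # cube links
    neighbors.append(node ^ ((1 << n) - 1))          # complement link
    return neighbors

# In FHC(4), node 0000 connects to 0001, 0010, 0100, 1000, and 1111.
print([format(v, "04b") for v in fhc_neighbors(0b0000, 4)])
```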

    Task allocation and migration on a star-network

    Modern-day applications require computational power that cannot be satisfied by uniprocessor systems, so the use of multiprocessor systems becomes necessary. This thesis presents an approach to allocating tasks on a multiprocessor system called the star network. Generally, an incoming task requires only a part of the star network, not the whole network, for its execution. We therefore need a task allocation strategy that can identify the free processors forming a substar and allocate tasks to these substars. A task executes for a time equal to its residence time and then relinquishes the substar. Sometimes there may be enough free processors in the network to form a substar that can host the next incoming task, yet the allocation strategy fails to recognize these free processors as a substar. To create a substar of free processors for the next task, task migration has to be performed so that the free processors are regrouped into a substar. In this work, three processor allocation strategies are presented: static, dynamic, and dynamic with task migration. Using simulations, these strategies are compared to obtain the percentage improvement of one over another, and a comparative study of their behavior on star networks and hypercubes is carried out. For both networks, incorporating task migration into dynamic allocation achieves a saving of 5-11% over simple dynamic allocation.
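
    For readers unfamiliar with the topology, here is a minimal sketch of star-graph adjacency under its standard definition (the representation and names are ours): nodes are permutations of n symbols, and each edge swaps the first symbol with the symbol in position i.

```python
def star_neighbors(perm):
    """Neighbors of a node in the n-star graph S_n. Nodes are
    permutations of n symbols; an edge swaps the first symbol with the
    symbol in position i, for i = 2..n, so every node has degree n - 1.
    """
    p = list(perm)
    out = []
    for i in range(1, len(p)):
        q = p[:]
        q[0], q[i] = q[i], q[0]
        out.append(tuple(q))
    return out

# S_3 has 3! = 6 nodes of degree 2 (a 6-cycle). A substar S_2 consists
# of the nodes that agree on the last position, e.g. (., ., 3).
print(star_neighbors((1, 2, 3)))  # [(2, 1, 3), (3, 2, 1)]
```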

    Information Leakage Attacks and Countermeasures

    The scientific community has been working consistently on the pervasive problem of information leakage, uncovering numerous attack vectors and proposing various countermeasures. Despite these efforts, leakage incidents remain prevalent as the complexity of systems and protocols increases and sophisticated modeling methods become more accessible to adversaries. This work studies how information leakages manifest in, and impact, interconnected systems and their users. We first focus on online communications and investigate leakages in the Transport Layer Security (TLS) protocol. Using modern machine learning models, we show that an eavesdropping adversary can efficiently exploit meta-information (e.g., packet sizes) not protected by TLS encryption to launch fingerprinting attacks at an unprecedented scale, even under non-optimal conditions. We then turn our attention to ultrasonic communications and discuss their security shortcomings and how adversaries could exploit them to compromise users of anonymity networks (even though such networks aim to offer a greater level of privacy than TLS). Following up on these, we delve into physical-layer leakages that concern a wide array of (networked) systems such as servers, embedded nodes, Tor relays, and hardware cryptocurrency wallets. We revisit location-based side-channel attacks and develop an exploitation neural network. Our model demonstrates the capabilities of a modern adversary but also provides an inexpensive tool for auditors to detect such leakages early in the development cycle. Subsequently, we investigate techniques that further minimize the impact of leakages found in production components. Our proposed system design distributes both the custody of secrets and the execution of cryptographic operations across several components, thus making the exploitation of leaks difficult.
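
    To illustrate the class of attack, the following is a toy sketch on synthetic data: TLS encrypts payloads but not record lengths, so a trace's packet-size sequence can identify which page was fetched. Real fingerprinting pipelines use far richer features (direction, timing, bursts); everything here, including the data, is illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_sites, traces_per_site, trace_len = 5, 40, 50

# Each "site" gets a characteristic packet-size signature; each visit
# produces a noisy version of it, as an eavesdropper would observe.
X, y = [], []
for site in range(n_sites):
    base = rng.integers(100, 1500, size=trace_len)
    for _ in range(traces_per_site):
        noise = rng.integers(-50, 50, size=trace_len)
        X.append(np.clip(base + noise, 0, 1500))
        y.append(site)

X_train, X_test, y_train, y_test = train_test_split(
    np.array(X), np.array(y), test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(f"fingerprinting accuracy: {clf.score(X_test, y_test):.2f}")
```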

    Schematics of Graphs and Hypergraphs

    Graph drawing, a subfield of computer science, pursues the goal of realizing graphs, or their generalization, hypergraphs, geometrically. If one restricts oneself to visually highlighting the essential information in a drawing model, one speaks of schematics. The main instruments are construction algorithms and characterizations of the graph classes suitable for such constructions. In this work, schematics for graphs and hypergraphs are formalized and investigated with these instruments. The dissertation first examines the "partial edge drawing" (PED) model for graphs (with respect to straight-line drawings). To visually eliminate crossings in the middle of an edge, every edge is replaced by a crossing-free piece (a stub) at its source and target vertices. A PED variant in which the length ratio between stub and edge is exactly 1⁄4 (a 1⁄4-SHPED) has established itself as the standard. For 1⁄4-SHPEDs, construction algorithms, a classification, an implementation, and an evaluation are presented. In addition, PED variants with fixed vertex positions and variants based on orthogonal drawings are explored. The BUS model for hypergraphs is then investigated, in which hyperedges are represented by fat horizontal or vertical segments called buses. A complete characterization is given of the planar incidence graphs of hypergraphs that admit a planar drawing in the BUS model, and various planar BUS variants with fixed vertex positions are discussed. Finally, a point set of subquadratic size is given for the first time that admits a planar embedding (vertices are mapped to points) of every 2-outerplanar graph.
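
    As a quick illustration of the stub idea (a sketch in our own code, following the definition above; rendering the segments is left to the reader):

```python
def shped_stubs(points, edges, ratio=0.25):
    """Replace each straight-line edge by its two stubs, as in a
    delta-SHPED: one stub of `ratio` times the edge length at the
    source and one at the target (ratio = 0.25 gives the standard
    1/4-SHPED). Returns a list of ((x1, y1), (x2, y2)) segments.
    """
    segments = []
    for u, v in edges:
        (ux, uy), (vx, vy) = points[u], points[v]
        dx, dy = vx - ux, vy - uy
        segments.append(((ux, uy), (ux + ratio * dx, uy + ratio * dy)))
        segments.append(((vx, vy), (vx - ratio * dx, vy - ratio * dy)))
    return segments

# Two crossing edges: the middle halves vanish, so the crossing
# disappears while every edge's endpoints and direction stay readable.
pts = {0: (0.0, 0.0), 1: (1.0, 1.0), 2: (0.0, 1.0), 3: (1.0, 0.0)}
print(shped_stubs(pts, [(0, 1), (2, 3)]))
```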