536 research outputs found
Parallel processing and expert systems
Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 90's cannot enjoy an increased level of autonomy without the efficient use of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real time demands are met for large expert systems. Speed-up via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial labs in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems was surveyed. The survey is divided into three major sections: (1) multiprocessors for parallel expert systems; (2) parallel languages for symbolic computations; and (3) measurements of parallelism of expert system. Results to date indicate that the parallelism achieved for these systems is small. In order to obtain greater speed-ups, data parallelism and application parallelism must be exploited
Research on parallel logic language implementation and architecture at ICOT
Abstract is not availabl
GraphR: Accelerating Graph Processing Using ReRAM
This paper presents GRAPHR, the first ReRAM-based graph processing
accelerator. GRAPHR follows the principle of near-data processing and explores
the opportunity of performing massive parallel analog operations with low
hardware and energy cost. The analog computation is suit- able for graph
processing because: 1) The algorithms are iterative and could inherently
tolerate the imprecision; 2) Both probability calculation (e.g., PageRank and
Collaborative Filtering) and typical graph algorithms involving integers (e.g.,
BFS/SSSP) are resilient to errors. The key insight of GRAPHR is that if a
vertex program of a graph algorithm can be expressed in sparse matrix vector
multiplication (SpMV), it can be efficiently performed by ReRAM crossbar. We
show that this assumption is generally true for a large set of graph
algorithms. GRAPHR is a novel accelerator architecture consisting of two
components: memory ReRAM and graph engine (GE). The core graph computations are
performed in sparse matrix format in GEs (ReRAM crossbars). The
vector/matrix-based graph computation is not new, but ReRAM offers the unique
opportunity to realize the massive parallelism with unprecedented energy
efficiency and low hardware cost. With small subgraphs processed by GEs, the
gain of performing parallel operations overshadows the wastes due to sparsity.
The experiment results show that GRAPHR achieves a 16.01x (up to 132.67x)
speedup and a 33.82x energy saving on geometric mean compared to a CPU baseline
system. Com- pared to GPU, GRAPHR achieves 1.69x to 2.19x speedup and consumes
4.77x to 8.91x less energy. GRAPHR gains a speedup of 1.16x to 4.12x, and is
3.67x to 10.96x more energy efficiency compared to PIM-based architecture.Comment: Accepted to HPCA 201
Parallel processing and expert systems
Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 1990s cannot enjoy an increased level of autonomy without the efficient implementation of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real-time demands are met for larger systems. Speedup via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial laboratories in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems is surveyed. The survey discusses multiprocessors for expert systems, parallel languages for symbolic computations, and mapping expert systems to multiprocessors. Results to date indicate that the parallelism achieved for these systems is small. The main reasons are (1) the body of knowledge applicable in any given situation and the amount of computation executed by each rule firing are small, (2) dividing the problem solving process into relatively independent partitions is difficult, and (3) implementation decisions that enable expert systems to be incrementally refined hamper compile-time optimization. In order to obtain greater speedups, data parallelism and application parallelism must be exploited
A Survey of Graph Pre-processing Methods: From Algorithmic to Hardware Perspectives
Graph-related applications have experienced significant growth in academia
and industry, driven by the powerful representation capabilities of graph.
However, efficiently executing these applications faces various challenges,
such as load imbalance, random memory access, etc. To address these challenges,
researchers have proposed various acceleration systems, including software
frameworks and hardware accelerators, all of which incorporate graph
pre-processing (GPP). GPP serves as a preparatory step before the formal
execution of applications, involving techniques such as sampling, reorder, etc.
However, GPP execution often remains overlooked, as the primary focus is
directed towards enhancing graph applications themselves. This oversight is
concerning, especially considering the explosive growth of real-world graph
data, where GPP becomes essential and even dominates system running overhead.
Furthermore, GPP methods exhibit significant variations across devices and
applications due to high customization. Unfortunately, no comprehensive work
systematically summarizes GPP. To address this gap and foster a better
understanding of GPP, we present a comprehensive survey dedicated to this area.
We propose a double-level taxonomy of GPP, considering both algorithmic and
hardware perspectives. Through listing relavent works, we illustrate our
taxonomy and conduct a thorough analysis and summary of diverse GPP techniques.
Lastly, we discuss challenges in GPP and potential future directions
Automated CNN pipeline generation for heterogeneous architectures
Heterogeneity is a vital feature in emerging processor chip designing. Asymmetric multicore-clusters such as high-performance cluster and power efficient cluster are common in modern edge devices. One example is Intel\u27s Alder Lake featuring Golden Cove high-performance cores and Gracemont power-efficient cores. Chiplet-based technology allows organization of multi cores in form of multi-chip-modules, thus housing large number of cores in a processor. Interposer based packaging has enabled embedding High Bandwidth Memory (HBM) on chip and reduced transmission latency and energy consumption of chiplet-chiplet interconnect.\ua0For Instance Intel\u27s XeHPC Ponte Vecchio package integrates multi-chip GPU organization along with HBM modules.Since new devices feature heterogeneity at the level of cores, memory and on-chip interconnect, it has become important to steer optimization at application level in order to leverage the new heterogeneous, high-performing and power-efficient features of underlying computing platforms. An important high-performance application paradigm is Convolution Neural Networks (CNN). CNNs are widely used in many practical applications. The pipelined parallel implementation of CNN is favored for inference on edge devices. In this Licentiate thesis we present a novel scheme for automatic scheduling of CNN pipelines on heterogeneous devices. A pipeline schedule is a configuration that provides information on depth of pipeline, grouping of CNN layers into pipeline stages and mapping of pipeline stages onto computing units. We utilize simple compile-time hints which consists of workload information of individual CNN layers and performance hints of computing units.The proposed approach provides near optimal solution for a throughput maximizing pipeline. We model the problem as a design space exploration technique. We developed a time-efficient design space navigation through heuristics extracted from the knowledge of CNN structure and underlying computing platform. The proposed search scheme converges faster and utilizes real-time performance measurements as fitness values. The results demonstrate that the proposed scheme converges faster and can scale when used with larger networks and computing platforms. Since the scheme utilizes online performance measurements, one of the challenges is to avoid expensive configurations during online tuning. The results demonstrate that on average, ~80\% of the tested configurations are sub-optimal solutions.Another challenge is to reduce convergence time. The experiments show that proposed approach is 35x faster than stochastic optimization algorithms. Since the design space is large and complex, We show that the proposed scheme explores only ~0.1% of the total design space in case of large CNNs (having 50+ layers) and results in near-optimal solution
The Landscape of Compute-near-memory and Compute-in-memory: A Research and Commercial Overview
In today's data-centric world, where data fuels numerous application domains,
with machine learning at the forefront, handling the enormous volume of data
efficiently in terms of time and energy presents a formidable challenge.
Conventional computing systems and accelerators are continually being pushed to
their limits to stay competitive. In this context, computing near-memory (CNM)
and computing-in-memory (CIM) have emerged as potentially game-changing
paradigms. This survey introduces the basics of CNM and CIM architectures,
including their underlying technologies and working principles. We focus
particularly on CIM and CNM architectures that have either been prototyped or
commercialized. While surveying the evolving CIM and CNM landscape in academia
and industry, we discuss the potential benefits in terms of performance,
energy, and cost, along with the challenges associated with these cutting-edge
computing paradigms
PIMSAB: A Processing-In-Memory System with Spatially-Aware Communication and Bit-Serial-Aware Computation
Bit-serial Processing-In-Memory (PIM) is an attractive paradigm for
accelerator architectures, for parallel workloads such as Deep Learning (DL),
because of its capability to achieve massive data parallelism at a low area
overhead and provide orders-of-magnitude data movement savings by moving
computational resources closer to the data. While many PIM architectures have
been proposed, improvements are needed in communicating intermediate results to
consumer kernels, for communication between tiles at scale, for reduction
operations, and for efficiently performing bit-serial operations with
constants.
We present PIMSAB, a scalable architecture that provides spatially aware
communication network for efficient intra-tile and inter-tile data movement and
provides efficient computation support for generally inefficient bit-serial
compute patterns. Our architecture consists of a massive hierarchical array of
compute-enabled SRAMs (CRAMs) and is codesigned with a compiler to achieve high
utilization. The key novelties of our architecture are: (1) providing efficient
support for spatially-aware communication by providing local H-tree network for
reductions, by adding explicit hardware for shuffling operands, and by
deploying systolic broadcasting, and (2) taking advantage of the divisible
nature of bit-serial computations through adaptive precision, bit-slicing and
efficient handling of constant operations.
When compared against a similarly provisioned modern Tensor Core GPU (NVIDIA
A100), across common DL kernels and an end-to-end DL network (Resnet18), PIMSAB
outperforms the GPU by 3x, and reduces energy by 4.2x. We compare PIMSAB with
similarly provisioned state-of-the-art SRAM PIM (Duality Cache) and DRAM PIM
(SIMDRAM) and observe a speedup of 3.7x and 3.88x respectively.Comment: Aman Arora and Jian Weng are co-first authors with equal contributio
Configurable data center switch architectures
In this thesis, we explore alternative architectures for implementing con_gurable Data Center Switches along with the advantages that can be provided by such switches. Our first contribution centers around determining switch architectures that can be implemented on Field Programmable Gate Array (FPGA) to provide configurable switching protocols. In the process, we identify a gap in the availability of frameworks to realistically evaluate the performance of switch architectures in data centers and contribute a simulation framework that relies on realistic data center traffic patterns. Our framework is then used to evaluate the performance of currently existing as well as newly proposed FPGA-amenable switch designs. Through collaborative work with Meng and Papaphilippou, we establish that only small-medium range switches can be implemented on today's FPGAs. Our second contribution is a novel switch architecture that integrates a custom in-network hardware accelerator with a generic switch to accelerate Deep Neural Network training applications in data centers. Our proposed accelerator architecture is prototyped on an FPGA, and a scalability study is conducted to demonstrate the trade-offs of an FPGA implementation when compared to an ASIC implementation. In addition to the hardware prototype, we contribute a light weight load-balancing and congestion control protocol that leverages the unique communication patterns of ML data-parallel jobs to enable fair sharing of network resources across different jobs. Our large-scale simulations demonstrate the ability of our novel switch architecture and light weight congestion control protocol to both accelerate the training time of machine learning jobs by up to 1.34x and benefit other latency-sensitive applications by reducing their 99%-tile completion time by up to 4.5x. As for our final contribution, we identify the main requirements of in-network applications and propose a Network-on-Chip (NoC)-based architecture for supporting a heterogeneous set of applications. Observing the lack of tools to support such research, we provide a tool that can be used to evaluate NoC-based switch architectures.Open Acces
Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture
Many modern workloads, such as neural networks, databases, and graph
processing, are fundamentally memory-bound. For such workloads, the data
movement between main memory and CPU cores imposes a significant overhead in
terms of both latency and energy. A major reason is that this communication
happens through a narrow bus with high latency and limited bandwidth, and the
low data reuse in memory-bound workloads is insufficient to amortize the cost
of main memory access. Fundamentally addressing this data movement bottleneck
requires a paradigm where the memory system assumes an active role in computing
by integrating processing capabilities. This paradigm is known as
processing-in-memory (PIM).
Recent research explores different forms of PIM architectures, motivated by
the emergence of new 3D-stacked memory technologies that integrate memory with
a logic layer where processing elements can be easily placed. Past works
evaluate these architectures in simulation or, at best, with simplified
hardware prototypes. In contrast, the UPMEM company has designed and
manufactured the first publicly-available real-world PIM architecture.
This paper provides the first comprehensive analysis of the first
publicly-available real-world PIM architecture. We make two key contributions.
First, we conduct an experimental characterization of the UPMEM-based PIM
system using microbenchmarks to assess various architecture limits such as
compute throughput and memory bandwidth, yielding new insights. Second, we
present PrIM, a benchmark suite of 16 workloads from different application
domains (e.g., linear algebra, databases, graph processing, neural networks,
bioinformatics).Comment: Our open source software is available at
https://github.com/CMU-SAFARI/prim-benchmark
- …