310 research outputs found
Dynamic Systolization for Developing Multiprocessor Supercomputers
A dynamic network approach is introduced for developing reconfigurable systolic arrays or wavefront processors. This allows one to design very powerful and flexible processors to be used in a general-purpose, reconfigurable, and fault-tolerant multiprocessor computer system. The concepts of macro-dataflow and multitasking can be integrated to handle variable-resolution granularities in computationally intensive algorithms. A multiprocessor architecture, Remps, is proposed based on these design methodologies. The Remps architecture is generalized from the Cedar, HEP, Cray X-MP, Trac, NYU Ultracomputer, S-1, Pumps, Chip, and SAM projects. Our goal is to provide a multiprocessor research model for developing design methodologies, multiprocessing and multitasking supports, dynamic systolic/wavefront array processors, interconnection networks, reconfiguration techniques, and performance analysis tools. These system design and operational techniques should be useful to those who are developing or evaluating multiprocessor supercomputers.
Automated problem scheduling and reduction of synchronization delay effects
It is anticipated that in order to make effective use of many future high performance architectures, programs will have to exhibit at least a medium grained parallelism. A framework is presented for partitioning very sparse triangular systems of linear equations that is designed to produce favorable performance results in a wide variety of parallel architectures. Efficient methods for solving these systems are of interest because (1) they provide a useful model problem for exploring heuristics for the aggregation, mapping and scheduling of relatively fine grained computations whose data dependencies are specified by directed acyclic graphs, and (2) they can find direct application in the development of parallel algorithms for scientific computation. Simple expressions are derived that describe how to schedule computational work with varying degrees of granularity. The Encore Multimax was used as a hardware simulator to investigate the performance effects of using the partitioning techniques presented in shared memory architectures with varying relative synchronization costs.
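As an illustration of the kind of scheduling problem this abstract describes, the following minimal Python sketch groups the rows of a sparse lower-triangular system into dependency levels and solves one level at a time. This is standard level scheduling, used here as an assumption; it is not necessarily the partitioning framework the authors propose. The names `level_schedule` and `solve_by_levels` and the dict-of-rows storage are hypothetical.

```python
from collections import defaultdict


def level_schedule(rows):
    """Group rows of a sparse lower-triangular system into levels.

    rows[i] is a dict {j: L_ij} holding the nonzeros of row i (j <= i).
    Row i depends on every earlier row j with L_ij != 0, so the
    dependencies form a directed acyclic graph. Rows that share a level
    do not depend on each other.
    """
    level = [0] * len(rows)
    for i, row in enumerate(rows):
        deps = [j for j in row if j < i]
        # A row sits one level below its deepest dependency.
        level[i] = 1 + max((level[j] for j in deps), default=-1)
    groups = defaultdict(list)
    for i, lv in enumerate(level):
        groups[lv].append(i)
    return [groups[lv] for lv in sorted(groups)]


def solve_by_levels(rows, b):
    """Forward substitution, visiting the rows level by level."""
    x = [0.0] * len(rows)
    for group in level_schedule(rows):
        # Rows in one group could run in parallel; a parallel version
        # would need one synchronization point per level.
        for i in group:
            s = sum(v * x[j] for j, v in rows[i].items() if j < i)
            x[i] = (b[i] - s) / rows[i][i]
    return x


rows = [{0: 2.0}, {1: 1.0}, {0: 1.0, 2: 1.0}, {1: 1.0, 2: 1.0, 3: 2.0}]
levels = level_schedule(rows)
solution = solve_by_levels(rows, [2.0, 3.0, 4.0, 9.0])
```

In this sketch the number of levels equals the number of synchronization points, so merging several fine-grained rows into one coarser task trades parallelism against synchronization cost, which is the trade-off the abstract refers to.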
Rationale for and design of a generic tiled hierarchical phased array beamforming architecture
The purpose of the phased array beamforming project is to develop a generic, flexible, efficient phased array receiver platform, using a mixed signal hardware/software-codesign approach. The results will be applicable to any radio (RF) system, but we will focus on satellite receiver (DVB-S) and radar applications. We will present a preliminary mapping of beamforming processing on a tiled architecture and determine its scalability.
The functionality, size and cost constraints imply an integrated mixed signal CMOS solution. For a generic flexible multi-standard solution, a software defined radio approach is taken. Because a scalable and dependable solution is needed, a tiled hierarchical architecture is proposed with reconfigurable hardware to regain flexibility. A mapping is provided of beamforming on the proposed architecture. The advantages and disadvantages of each solution are discussed with respect to applicability and scalability.
Different beamforming processing solutions can be mapped on the same proposed tiled hierarchical architecture. This provides a flexible, scalable and reconfigurable solution for a wide application domain. Beamforming is a data-driven streaming process which lends itself well to a regular scalable architecture. Beamsteering, on the other hand, is much more control-oriented, and future work will focus on how to support beamsteering on the proposed architecture as well.
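To make the streaming, hierarchical character of beamforming concrete, here is a minimal Python sketch of a narrowband phase-shift beamformer for a uniform linear array, in which each tile sums its own antenna elements and a top level combines the tile outputs. The array geometry, the tile size, and all function names are assumptions made for illustration; the abstract does not specify the actual mapping.

```python
import cmath
import math


def steering_weights(num_elements, spacing_wavelengths, angle_rad):
    """Phase-shift weights for a uniform linear array (assumed geometry)."""
    phase = -2j * math.pi * spacing_wavelengths * math.sin(angle_rad)
    return [cmath.exp(phase * n) for n in range(num_elements)]


def tile_partial_sum(samples, weights):
    """Work done inside one tile: weighted sum over its own elements."""
    return sum(w.conjugate() * x for w, x in zip(weights, samples))


def beamform_tiled(snapshot, weights, tile_size):
    """Combine per-tile partial sums at the top level of the hierarchy."""
    total = 0j
    for start in range(0, len(snapshot), tile_size):
        stop = start + tile_size
        total += tile_partial_sum(snapshot[start:stop], weights[start:stop])
    return total / len(snapshot)


# One snapshot of a unit-amplitude plane wave arriving from 0.3 rad.
N, D = 8, 0.5
snapshot = steering_weights(N, D, 0.3)
on_target = beamform_tiled(snapshot, steering_weights(N, D, 0.3), tile_size=4)
off_target = beamform_tiled(snapshot, steering_weights(N, D, -0.5), tile_size=4)
flat = beamform_tiled(snapshot, steering_weights(N, D, 0.3), tile_size=N)
```

Because each tile only needs its own samples and weights, the partial sums are independent and the structure scales by adding tiles. Changing the steering angle only changes the weights, which reflects the control-oriented nature of beamsteering.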
Towards effective modeling and programming multi-core tiled reconfigurable architectures
For a generic, flexible, efficient array antenna receiver platform, a hierarchical reconfigurable tiled architecture has been proposed. The architecture provides a flexible reconfigurable solution, but partitioning, mapping, modeling and programming such systems remain an issue. We will advocate a model-based design approach and propose a single semantic (programming) model for representing the specification, design and implementation. This approach tackles these problems at a higher conceptual level, thereby exploiting the inherent composability and parallelism available in the formalism. A case study illustrates the use of the semantic model with examples from analogue/digital co-design and hardware/software co-design.
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and
energy efficiency over the traditional load/store devices currently employed in
large scale computing systems. The adoption of high-level synthesis (HLS) from
languages such as C/C++ and OpenCL has greatly increased programmer
productivity when designing for such platforms. While this has enabled a wider
audience to target specialized hardware, the optimization principles known from
traditional software design are no longer sufficient to implement
high-performance codes. Fast and efficient codes for reconfigurable platforms
are thus still challenging to design. To alleviate this, we present a set of
optimizing transformations for HLS, targeting scalable and efficient
architectures for high-performance computing (HPC) applications. Our work
provides a toolbox for developers, where we systematically identify classes of
transformations, the characteristics of their effect on the HLS code and the
resulting hardware (e.g., increases data reuse or resource consumption), and
the objectives that each transformation can target (e.g., resolve interface
contention, or increase parallelism). We show how these can be used to
efficiently exploit pipelining, on-chip distributed fast memory, and on-chip
streaming dataflow, allowing for massively parallel architectures. To quantify
the effect of our transformations, we use them to optimize a set of
throughput-oriented FPGA kernels, demonstrating that our enhancements are
sufficient to scale up parallelism within the hardware constraints. With the
transformations covered, we hope to establish a common framework for
performance engineers, compiler developers, and hardware developers, to tap
into the performance potential offered by specialized hardware architectures
using HLS.
Efficiently and Transparently Maintaining High SIMD Occupancy in the Presence of Wavefront Irregularity
Demand is increasing for high throughput processing of irregular streaming applications; examples of such applications from scientific and engineering domains include biological sequence alignment, network packet filtering, automated face detection, and big graph algorithms. With wide SIMD, lightweight threads, and low-cost thread-context switching, wide-SIMD architectures such as GPUs allow considerable flexibility in the way application work is assigned to threads. However, irregular applications are challenging to map efficiently onto wide SIMD because data-dependent filtering or replication of items creates an unpredictable data wavefront of items ready for further processing. Straightforward implementations of irregular applications on a wide-SIMD architecture are prone to load imbalance and reduced occupancy, while more sophisticated implementations require advanced use of parallel GPU operations to redistribute work efficiently among threads.
This dissertation will present strategies for addressing the performance challenges of wavefront-irregular applications on wide-SIMD architectures. These strategies are embodied in a developer framework called Mercator that (1) allows developers to map irregular applications onto GPUs according to the streaming paradigm while abstracting from low-level data movement and (2) includes generalized techniques for transparently overcoming the obstacles to high throughput presented by wavefront-irregular applications on a GPU. Mercator forms the centerpiece of this dissertation, and we present its motivation, performance model, implementation, and extensions in this work.
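The occupancy problem described in this abstract can be illustrated with a small Python sketch of stream compaction: items that survive a data-dependent filter are packed densely using an exclusive prefix sum, so that later stages process full-width batches. This is a generic technique shown sequentially under stated assumptions, not Mercator's actual implementation; `compact` and `occupancy` are hypothetical names.

```python
from itertools import accumulate


def compact(items, keep):
    """Pack the items that pass `keep` into a dense list.

    An exclusive prefix sum over the keep-flags gives each surviving
    item its output slot, which is how the packing is commonly done in
    parallel on wide-SIMD hardware.
    """
    items = list(items)
    flags = [1 if keep(x) else 0 for x in items]
    offsets = [0] + list(accumulate(flags))[:-1]  # exclusive scan
    out = [None] * sum(flags)
    for x, flag, slot in zip(items, flags, offsets):
        if flag:
            out[slot] = x
    return out


def occupancy(num_items, simd_width):
    """Fraction of lanes doing useful work with full-width batches."""
    if num_items == 0:
        return 0.0
    batches = -(-num_items // simd_width)  # ceiling division
    return num_items / (batches * simd_width)


survivors = compact(range(10), lambda x: x % 3 == 0)
```

Without compaction, the four survivors in this example would stay scattered across ten lanes. After compaction they fill a single batch of width four.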
Model-driven development of data intensive applications over cloud resources
The proliferation of sensors over the last years has generated large amounts
of raw data, forming data streams that need to be processed. In many cases,
cloud resources are used for such processing, exploiting their flexibility, but
these sensor streaming applications often need to support operational and
control actions that have real-time and low-latency requirements that go beyond
the cost effective and flexible solutions supported by existing cloud
frameworks, such as Apache Kafka, Apache Spark Streaming, or Map-Reduce
Streams. In this paper, we describe a model-driven and stepwise refinement
methodological approach for streaming applications executed over clouds. The
central role is assigned to a set of Petri Net models for specifying functional
and non-functional requirements. They support model reuse, and a way to combine
formal analysis, simulation, and approximate computation of minimal and maximal
boundaries of non-functional requirements when the problem is either
mathematically or computationally intractable. We show how our proposal can
assist developers in their design and implementation decisions from a
performance perspective. The methodology supports performance analysis at all
stages of the engineering process: we can (i) analyse how the application can be
mapped onto cloud resources, and (ii)
obtain key performance indicators, including throughput or economic cost, so
that developers are assisted in their development tasks and in their decision
making. To illustrate our approach, we make use of the pipelined
wavefront array.
Comment: Preprint