Search CORE

9,938 research outputs found

Recommended from our members

Executing matrix multiply on a process oriented data flow machine

Author: Bic Lubomir
Nagel Mark D.
Roy John M.A.
Publication venue: eScholarship, University of California
Publication date: 01/01/1990
Field of study

The Process-Oriented Dataflow System (PODS) is an execution model that combines the von Neumann and dataflow models of computation to gain the benefits of each. Central to PODS is the concept of array distribution and its effects on partitioning and mapping of processes.In PODS arrays are partitioned by simply assigning consecutive elements to each processing element (PE) equally. Since PODS uses single assignment, there will be only one producer of each element. This producing PE owns that element and will perform the necessary computations to assign it. Using this approach the filling loop is distributed across the PEs. This simple partitioning and mapping scheme provides excellent results for executing scientific code on MIMD machines. In this way PODS allows MIMD machines to exploit vector and data parallelism easily while still providing the flexibility of MIMD over SIMD for multi-user systems.In this paper, the classic matrix multiply algorithm, with 1024 data points, is executed on a PODS simulator and the results are presented and discussed. Matrix multiply is a good example because it has several interesting properties: there are multiple code-blocks; a new array must be dynamically allocated and distributed; there is a loop-carried dependency in the innermost loop; the two input arrays have different access patterns; and the sizes of the input arrays are not known at compile time. Matrix multiply also forms the basis for many important scientific algorithms such as: LU decomposition, convolution, and the Fast-Fourier Transform.The results show that PODS is comparable to both Iannucci's Hybrid Architecture and MIT's TTDA in terms of overhead and instruction power. They also show that PODS easily distributes the work load evenly across the PEs. The key result is that PODS can scale matrix multiply in a near linear fashion until there is little or no work to be performed for each PE. Then overhead and message passing become a major component of the execution time. With larger problems (e.g., >/=16k data points) this limit would be reached at around 256 PEs

eScholarship - University of California

Mira: A Framework for Static Performance Analysis

Author: Meng Kewen
Norris Boyana
Publication venue
Publication date: 22/05/2017
Field of study

The performance model of an application can pro- vide understanding about its runtime behavior on particular hardware. Such information can be analyzed by developers for performance tuning. However, model building and analyzing is frequently ignored during software development until perfor- mance problems arise because they require significant expertise and can involve many time-consuming application runs. In this paper, we propose a fast, accurate, flexible and user-friendly tool, Mira, for generating performance models by applying static program analysis, targeting scientific applications running on supercomputers. We parse both the source code and binary to estimate performance attributes with better accuracy than considering just source or just binary code. Because our analysis is static, the target program does not need to be executed on the target architecture, which enables users to perform analysis on available machines instead of conducting expensive exper- iments on potentially expensive resources. Moreover, statically generated models enable performance prediction on non-existent or unavailable architectures. In addition to flexibility, because model generation time is significantly reduced compared to dynamic analysis approaches, our method is suitable for rapid application performance analysis and improvement. We present several scientific application validation results to demonstrate the current capabilities of our approach on small benchmarks and a mini application

arXiv.org e-Print Archive

Crossref

Coarse-grained reconfigurable array architectures

Author: A Lambrechts
B Bougard
B Bougard
B Mei
B Mei
B Mei
B Sutter De
G Venkataramani
H Park
H Park
J Lee
JMP Cardoso
JW Waerdt van de
K Berkel van
K Bondalapati
K Sankaralingam
KE Coons
LH Lee
M Ahn
M Gebhart
M Schlansker
M Taylor
M Woh
MD Galanis
MH Lee
S Friedman
SA Mahlke
T Oh
Y Kim
Y Kim
Y Kim
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010
Field of study

Coarse-Grained Reconﬁgurable Array (CGRA) architectures accelerate the same inner loops that beneﬁt from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efﬁciently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on ﬂexibility, performance, and power-efﬁciency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual ﬁne-tuning of source code

Crossref

Ghent University Academic Bibliography

Phase Synchronization Operator for On-Chip Brain Functional Connectivity Computation

Author: Delgado Restituto Manuel
Rodríguez Vázquez Ángel Benito
Romaine James Brian
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2019
Field of study

This paper presents an integer-based digital processor for the calculation of phase synchronization between two neural signals. It is based on the measurement of time periods between two consecutive minima. The simplicity of the approach allows for the use of elementary digital blocks, such as registers, counters, and adders. The processor, fabricated in a 0.18- μ m CMOS process, only occupies 0.05 mm 2 and consumes 15 nW from a 0.5 V supply voltage at a signal input rate of 1024 S/s. These low-area and low-power features make the proposed processor a valuable computing element in closed-loop neural prosthesis for the treatment of neural disorders, such as epilepsy, or for assessing the patterns of correlated activity in neural assemblies through the evaluation of functional connectivity maps.Ministerio de Economía y Competitividad TEC2016-80923-POffice of Naval Research (USA) N00014-19-1-215

idUS. Depósito de Investigación Universidad de Sevilla

Solid State Television Camera (CID)

Author: Green W. T.
Steele D. W.
Publication venue
Publication date
Field of study

The design, development and test are described of a charge injection device (CID) camera using a 244x248 element array. A number of video signal processing functions are included which maximize the output video dynamic range while retaining the inherently good resolution response of the CID. Some of the unique features of the camera are: low light level performance, high S/N ratio, antiblooming, geometric distortion, sequential scanning and AGC

NASA Technical Reports Server

An intelligent allocation algorithm for parallel processing

Author: Ananthram Kishan G.
Carroll Chester C.
Homaifar Abdollah
Publication venue
Publication date
Field of study

The problem of allocating nodes of a program graph to processors in a parallel processing architecture is considered. The algorithm is based on critical path analysis, some allocation heuristics, and the execution granularity of nodes in a program graph. These factors, and the structure of interprocessor communication network, influence the allocation. To achieve realistic estimations of the executive durations of allocations, the algorithm considers the fact that nodes in a program graph have to communicate through varying numbers of tokens. Coarse and fine granularities have been implemented, with interprocessor token-communication duration, varying from zero up to values comparable to the execution durations of individual nodes. The effect on allocation of communication network structures is demonstrated by performing allocations for crossbar (non-blocking) and star (blocking) networks. The algorithm assumes the availability of as many processors as it needs for the optimal allocation of any program graph. Hence, the focus of allocation has been on varying token-communication durations rather than varying the number of processors. The algorithm always utilizes as many processors as necessary for the optimal allocation of any program graph, depending upon granularity and characteristics of the interprocessor communication network

NASA Technical Reports Server

On-Orbit Validation of a Framework for Spacecraft-Initiated Communication Service Requests with NASA's SCaN Testbed

Author: Downey Joseph A.
Gannon Adam M.
Mortensen Dale J.
Piasecki Marie T.
Publication venue
Publication date
Field of study

We design, analyze, and experimentally validate a framework for demand-based allocation of high-performance space communication service in which the user spacecraft itself initiates a request for service. Leveraging machine-to-machine communications, the automated process has potential to improve the responsiveness and efficiency of space network operations. We propose an augmented ground station architecture in which a hemispherical-pattern antenna allows for reception of service requests sent from any user spacecraft within view. A suite of ground-based automation software acts upon these direct-to-Earth requests and allocates access to high-performance service through a ground station or relay satellite in response to immediate user demand. A software-defined radio transceiver, optimized for reception of weak signals from the helical antenna, is presented. Design and testing of signal processing equipment and a software framework to handle service requests is discussed. Preliminary results from on-orbit demonstrations with a testbed onboard the International Space Station are presented to verify feasibility of the concept

NASA Technical Reports Server