Executing matrix multiply on a process oriented data flow machine
The Process-Oriented Dataflow System (PODS) is an execution model that combines the von Neumann and dataflow models of computation to gain the benefits of each. Central to PODS is the concept of array distribution and its effects on partitioning and mapping of processes. In PODS arrays are partitioned by simply assigning consecutive elements to each processing element (PE) equally. Since PODS uses single assignment, there will be only one producer of each element. This producing PE owns that element and will perform the necessary computations to assign it. Using this approach the filling loop is distributed across the PEs. This simple partitioning and mapping scheme provides excellent results for executing scientific code on MIMD machines. In this way PODS allows MIMD machines to exploit vector and data parallelism easily while still providing the flexibility of MIMD over SIMD for multi-user systems. In this paper, the classic matrix multiply algorithm, with 1024 data points, is executed on a PODS simulator and the results are presented and discussed. Matrix multiply is a good example because it has several interesting properties: there are multiple code-blocks; a new array must be dynamically allocated and distributed; there is a loop-carried dependency in the innermost loop; the two input arrays have different access patterns; and the sizes of the input arrays are not known at compile time. Matrix multiply also forms the basis for many important scientific algorithms such as LU decomposition, convolution, and the Fast Fourier Transform. The results show that PODS is comparable to both Iannucci's Hybrid Architecture and MIT's TTDA in terms of overhead and instruction power. They also show that PODS easily distributes the work load evenly across the PEs. The key result is that PODS can scale matrix multiply in a near linear fashion until there is little or no work to be performed for each PE. Then overhead and message passing become a major component of the execution time. With larger problems (e.g., ≥16k data points) this limit would be reached at around 256 PEs
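The partitioning rule described in this abstract (consecutive elements assigned to each PE, with the owning PE computing its own elements) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PODS implementation: the function names `block_range` and `matmul_owner_computes` are hypothetical, the PEs are simulated sequentially, and the result array is treated as flattened in row-major order.

```python
def block_range(pe, n_elements, n_pes):
    """Flat indices owned by one PE under contiguous block partitioning.

    Assumption: blocks have size ceil(n_elements / n_pes), so trailing PEs
    may own fewer elements, or none at all.
    """
    block = -(-n_elements // n_pes)  # ceiling division
    start = min(pe * block, n_elements)
    return range(start, min(start + block, n_elements))


def matmul_owner_computes(a, b, n_pes):
    """Multiply a (rows x inner) by b (inner x cols), one simulated PE at a time.

    Each element of the result is written exactly once (single assignment)
    by the PE that owns it. Returns the product and the per-PE element count.
    """
    rows, inner, cols = len(a), len(b), len(b[0])
    c = [[0] * cols for _ in range(rows)]
    work = []
    for pe in range(n_pes):
        owned = block_range(pe, rows * cols, n_pes)
        for flat in owned:
            i, j = divmod(flat, cols)  # row-major flat index -> (row, column)
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(inner))
        work.append(len(owned))
    return c, work


if __name__ == "__main__":
    product, work = matmul_owner_computes([[1, 2], [3, 4]], [[5, 6], [7, 8]], 3)
    print(product, work)
```

With more PEs than blocks of work, some entries of the per-PE count drop to zero, which is the regime in which the abstract says overhead and message passing start to dominate.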
Mira: A Framework for Static Performance Analysis
The performance model of an application can provide understanding about its
runtime behavior on particular hardware. Such information can be analyzed by
developers for performance tuning. However, model building and analysis are
frequently neglected during software development until performance problems
arise because they require significant expertise and can involve many
time-consuming application runs. In this paper, we propose a fast, accurate,
flexible and user-friendly tool, Mira, for generating performance models by
applying static program analysis, targeting scientific applications running on
supercomputers. We parse both the source code and binary to estimate
performance attributes with better accuracy than considering just source or
just binary code. Because our analysis is static, the target program does not
need to be executed on the target architecture, which enables users to perform
analysis on available machines instead of conducting expensive experiments on
potentially expensive resources. Moreover, statically generated models enable
performance prediction on non-existent or unavailable architectures. In
addition to flexibility, because model generation time is significantly reduced
compared to dynamic analysis approaches, our method is suitable for rapid
application performance analysis and improvement. We present several scientific
application validation results to demonstrate the current capabilities of our
approach on small benchmarks and a mini application
Coarse-grained reconfigurable array architectures
Coarse-Grained Reconfigurable Array (CGRA) architectures accelerate the same inner loops that benefit from the high ILP support in VLIW architectures. By executing non-loop code on other cores, however, CGRAs can focus on such loops to execute them more efficiently. This chapter discusses the basic principles of CGRAs, and the wide range of design options available to a CGRA designer, covering a large number of existing CGRA designs. The impact of different options on flexibility, performance, and power-efficiency is discussed, as well as the need for compiler support. The ADRES CGRA design template is studied in more detail as a use case to illustrate the need for design space exploration, for compiler support and for the manual fine-tuning of source code
Phase Synchronization Operator for On-Chip Brain Functional Connectivity Computation
This paper presents an integer-based digital processor for the calculation of phase synchronization between two neural signals. It is based on the measurement of time periods between two consecutive minima. The simplicity of the approach allows for the use of elementary digital blocks, such as registers, counters, and adders. The processor, fabricated in a 0.18-μm CMOS process, only occupies 0.05 mm² and consumes 15 nW from a 0.5 V supply voltage at a signal input rate of 1024 S/s. These low-area and low-power features make the proposed processor a valuable computing element in closed-loop neural prosthesis for the treatment of neural disorders, such as epilepsy, or for assessing the patterns of correlated activity in neural assemblies through the evaluation of functional connectivity maps. Ministerio de Economía y Competitividad TEC2016-80923-P; Office of Naval Research (USA) N00014-19-1-215
Solid State Television Camera (CID)
The design, development, and testing of a charge injection device (CID) camera using a 244x248 element array are described. A number of video signal processing functions are included which maximize the output video dynamic range while retaining the inherently good resolution response of the CID. Some of the unique features of the camera are: low light level performance, high S/N ratio, antiblooming, geometric distortion, sequential scanning and AGC
An intelligent allocation algorithm for parallel processing
The problem of allocating nodes of a program graph to processors in a parallel processing architecture is considered. The algorithm is based on critical path analysis, some allocation heuristics, and the execution granularity of nodes in a program graph. These factors, and the structure of the interprocessor communication network, influence the allocation. To achieve realistic estimations of the execution durations of allocations, the algorithm considers the fact that nodes in a program graph have to communicate through varying numbers of tokens. Coarse and fine granularities have been implemented, with interprocessor token-communication duration varying from zero up to values comparable to the execution durations of individual nodes. The effect on allocation of communication network structures is demonstrated by performing allocations for crossbar (non-blocking) and star (blocking) networks. The algorithm assumes the availability of as many processors as it needs for the optimal allocation of any program graph. Hence, the focus of allocation has been on varying token-communication durations rather than varying the number of processors. The algorithm always utilizes as many processors as necessary for the optimal allocation of any program graph, depending upon granularity and characteristics of the interprocessor communication network
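The critical path analysis mentioned in this abstract can be illustrated with a small Python sketch. It is not the paper's allocation algorithm: it only computes the length of the longest path through a program graph, under the simplifying assumption that every edge pays the same fixed token-communication delay (as if each node ran on its own processor). The function name `critical_path_length` and the parameter `comm_cost` are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path_length(durations, edges, comm_cost=0):
    """Longest finish time over all nodes of a directed acyclic program graph.

    durations: dict mapping node -> execution duration
    edges:     iterable of (producer, consumer) pairs
    comm_cost: assumed uniform delay added on every edge (an illustration only;
               the abstract describes varying token counts per edge)
    """
    preds = {node: set() for node in durations}
    for src, dst in edges:
        preds[dst].add(src)

    finish = {}
    # static_order() yields each node only after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        start = max((finish[p] + comm_cost for p in preds[node]), default=0)
        finish[node] = start + durations[node]
    return max(finish.values(), default=0)


if __name__ == "__main__":
    durations = {"a": 2, "b": 3, "c": 1, "d": 4}
    edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
    print(critical_path_length(durations, edges, comm_cost=0))
    print(critical_path_length(durations, edges, comm_cost=1))
```

Raising `comm_cost` from zero toward the node durations lengthens the critical path, which is the kind of effect the abstract describes when it varies token-communication durations.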
On-Orbit Validation of a Framework for Spacecraft-Initiated Communication Service Requests with NASA's SCaN Testbed
We design, analyze, and experimentally validate a framework for demand-based allocation of high-performance space communication service in which the user spacecraft itself initiates a request for service. Leveraging machine-to-machine communications, the automated process has the potential to improve the responsiveness and efficiency of space network operations. We propose an augmented ground station architecture in which a hemispherical-pattern antenna allows for reception of service requests sent from any user spacecraft within view. A suite of ground-based automation software acts upon these direct-to-Earth requests and allocates access to high-performance service through a ground station or relay satellite in response to immediate user demand. A software-defined radio transceiver, optimized for reception of weak signals from the helical antenna, is presented. Design and testing of signal processing equipment and a software framework to handle service requests are discussed. Preliminary results from on-orbit demonstrations with a testbed onboard the International Space Station are presented to verify feasibility of the concept