    Computer vision algorithms on reconfigurable logic arrays

    Scalability study in parallel computing

    An asymptotic scalability metric, called Constant-Memory-per-Processor (CMP) scalability, is presented. This metric is useful in analyzing the performance of a parallel algorithm on distributed-memory architectures as the number of processors grows while the memory size per processor remains fixed. To illustrate the CMP-scalability metric, parallel Matrix Multiplication (MM), Gauss-Jordan Elimination (GJE) with partial pivoting, and Fast Fourier Transform (FFT) algorithms are considered on the hypercube and two-dimensional mesh topologies.

    A comparison between the asymptotic CMP-scalability and isoefficiency-scalability metrics is performed to attain a better understanding of scalability. An analysis of GJE and FFT on a mesh predicts that GJE is asymptotically more scalable than FFT under the isoefficiency-scalability metric, while the CMP-scalability metric predicts that FFT is asymptotically more scalable than GJE. Closer investigation reveals that both are correct: each metric amounts to a different planar cross-section of the multi-dimensional performance surface. Combining information from both metrics, we show how to predict the relative change in performance of two algorithms along the fixed-processor planar cross-section and the fixed-problem-size planar cross-section.

    Scalability metrics such as the CMP-scalability and isoefficiency-scalability metrics indicate asymptotic behavior as the number of processors becomes large. However, we question how useful these metrics are on a specific machine with a fixed number of processors and a fixed memory per processor. We investigate the utility of the CMP- and isoefficiency-scalability metrics through a detailed analysis of the three algorithms on a 16K-processor MasPar MP-1. Our analysis includes the effects of the parallel computer's processor and communication speeds on the accuracy of these scalability metrics.
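
    To make the fixed-memory regime concrete, the following minimal Python sketch applies CMP-style scaling to matrix multiplication on a two-dimensional mesh. The Cannon-style cost model and the constants T_FLOP, T_COMM, and MEM_PER_PE are illustrative assumptions, not the dissertation's formulas; the point is only that when the problem size is chosen to keep words per processor constant, the modeled efficiency stays flat as the machine grows.

        # Sketch: CMP-style scaling of n x n matrix multiplication on a
        # sqrt(p) x sqrt(p) mesh, under an assumed Cannon-style cost model.
        import math

        T_FLOP = 1.0            # time per arithmetic operation (assumed)
        T_COMM = 10.0           # time per word communicated (assumed)
        MEM_PER_PE = 1_000_000  # words per processor, held fixed under CMP scaling

        def efficiency_mm(n, p):
            t_seq = T_FLOP * n**3                  # best sequential time
            t_comp = T_FLOP * n**3 / p             # per-processor compute time
            t_comm = T_COMM * n**2 / math.sqrt(p)  # block shifts on the mesh
            return t_seq / (p * (t_comp + t_comm))

        for p in [16, 64, 256, 1024, 4096, 16384]:
            n = int(math.sqrt(MEM_PER_PE * p))     # keep n^2 / p at MEM_PER_PE
            print(f"p={p:6d}  n={n:8d}  efficiency={efficiency_mm(n, p):.3f}")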

    Broadcast with mask on a Massively Parallel Processing on a Chip

    The delay of instruction broadcast has a significant impact on the performance of Single Instruction Multiple Data (SIMD) architectures. This is especially true for massively parallel processing Systems-on-Chip (mppSoC), where the processing stage and the setup of the communication mechanism each need several clock periods. Subnetting is the strategy of partitioning a single physical network into several smaller logical sub-networks (subnets). This technique gives better control over the broadcast instruction domain and the data traffic between network nodes. Furthermore, it separates synchronous communication from asynchronous processing, which maintains reliable communication and rapid processing across parallel processors. This paper describes the design of a communication model called broadcast with mask. The model is dedicated to mppSoC architectures with a huge number of processor elements because it maintains performance even as the number of processors increases. Simulation results and an FPGA implementation validate our approach.
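
    As a rough illustration of the broadcast-with-mask model, the Python sketch below has a controller tag each broadcast with a subnet identifier, and every processing element accepts the value only if its address falls in that subnet. The address width, the prefix-matching rule, and the 16-PE array are assumptions standing in for the paper's actual mask encoding.

        # Sketch: masked broadcast in a SIMD array; PEs outside the addressed
        # subnet ignore the value. Encoding details are illustrative.
        class PE:
            def __init__(self, addr):
                self.addr = addr  # PE address within the array
                self.reg = None   # destination register of the broadcast

        def broadcast_with_mask(pes, value, subnet_bits, subnet_id):
            """Deliver value only to PEs whose high subnet_bits address bits
            equal subnet_id; in hardware all PEs check in parallel."""
            addr_bits = max(pe.addr for pe in pes).bit_length()
            shift = addr_bits - subnet_bits
            for pe in pes:
                if pe.addr >> shift == subnet_id:
                    pe.reg = value

        pes = [PE(a) for a in range(16)]  # 16-PE array, 4-bit addresses
        broadcast_with_mask(pes, 42, subnet_bits=2, subnet_id=3)
        print([pe.reg for pe in pes])     # only PEs 12..15 receive 42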

    Approaches to the implementation of binary relation inference network.

    by C.W. Tong. Thesis (M.Phil.), Chinese University of Hong Kong, 1994. Includes bibliographical references (leaves 96-98). Contents:
    1. Introduction
       1.1 The Availability of Parallel Processing Machines
           1.1.1 Neural Networks
       1.2 Parallel Processing in the Continuous-Time Domain
       1.3 Binary Relation Inference Network
    2. Binary Relation Inference Network
       2.1 Binary Relation Inference Network
           2.1.1 Network Structure
       2.2 Shortest Path Problem
           2.2.1 Problem Statement
           2.2.2 A Binary Relation Inference Network Solution
    3. A Binary Relation Inference Network Prototype
       3.1 The Prototype
           3.1.1 The Network
           3.1.2 Computational Element
           3.1.3 Network Response Time
       3.2 Improving Response
           3.2.1 Removing Feedback
           3.2.2 Selecting Minimum with Diodes
       3.3 Speeding Up the Network Response
       3.4 Conclusion
    4. VLSI Building Blocks
       4.1 The Site
       4.2 The Unit
           4.2.1 A Minimum Finding Circuit
           4.2.2 A Tri-state Comparator
       4.3 The Computational Element
           4.3.1 Network Performances
       4.4 Discussion
    5. A VLSI Chip
       5.1 Spatial Configuration
       5.2 Layout
           5.2.1 Computational Elements
           5.2.2 The Network
           5.2.3 I/O Requirements
           5.2.4 Optional Modules
       5.3 A Scalable Design
    6. The Inverse Shortest Paths Problem
       6.1 Problem Statement
       6.2 The Embedded Approach
           6.2.1 The Formulation
           6.2.2 The Algorithm
       6.3 Implementation Results
       6.4 Other Implementations
           6.4.1 Sequential Machine
           6.4.2 Parallel Machine
       6.5 Discussion
    7. Closed Semiring Optimization Circuits
       7.1 Transitive Closure Problem
           7.1.1 Problem Statement
           7.1.2 Inference Network Solutions
       7.2 Closed Semirings
       7.3 Closed Semirings and the Binary Relation Inference Network
           7.3.1 Minimum Spanning Tree
           7.3.2 VLSI Implementation
       7.4 Conclusion
    8. Conclusions
       8.1 Summary of Achievements
       8.2 Future Work
           8.2.1 VLSI Fabrication
           8.2.2 Network Robustness
           8.2.3 Inference Network Applications
           8.2.4 Architecture for the Bellman-Ford Algorithm
    Bibliography
    Appendices:
    A. Detailed Schematic
       A.1 Schematic of the Inference Network Structures
           A.1.1 Unit with Self-Feedback
           A.1.2 Unit with Self-Feedback Removed
           A.1.3 Unit with a Compact Minimizer
           A.1.4 Network Modules
       A.2 Inference Network Interface Circuits
    B. Circuit Simulation and Layout Tools
       B.1 Circuit Simulation
       B.2 VLSI Circuit Design
       B.3 VLSI Circuit Layout
    C. The Conjugate-Gradient Descent Algorithm
    D. Shortest Path Problem on MasPar

    NASA high performance computing and communications program

    The National Aeronautics and Space Administration's HPCC program is part of a new Presidential initiative aimed at producing a 1000-fold increase in supercomputing speed and a 100-fold improvement in available communications capability by 1997. As more advanced technologies are developed under the HPCC program, they will be used to solve NASA's 'Grand Challenge' problems, which include improving the design and simulation of advanced aerospace vehicles, allowing people at remote locations to communicate more effectively and share information, increasing scientists' abilities to model the Earth's climate and forecast global environmental trends, and improving the development of advanced spacecraft. NASA's HPCC program is organized into three projects unique to the agency's mission: the Computational Aerosciences (CAS) project, the Earth and Space Sciences (ESS) project, and the Remote Exploration and Experimentation (REE) project. An additional project, the Basic Research and Human Resources (BRHR) project, exists to promote long-term research in computer science and engineering and to increase the pool of trained personnel in a variety of scientific disciplines. This document presents an overview of the objectives and organization of these projects, as well as summaries of individual research and development programs within each project.

    SCAC-Net: Reconfigurable Interconnection Network in SCAC Massively parallel SoC

    Parallel communication plays a critical role in massively parallel systems, especially in distributed-memory systems executing parallel programs on shared data. Integrating an interconnection network in these systems is therefore essential to ensure inter-node data exchange. Choosing the most effective communication structure means meeting certain criteria: speed, size, and power consumption. Indeed, the communication phase should be as fast as possible so as not to compromise parallel computation, using small, low-power modules that keep the interconnection network extensible in a scalable system. To meet these criteria, and following a module-reuse methodology, we chose to integrate a reconfigurable SCAC-Net interconnection network to communicate data in the SCAC massively parallel SoC. This paper presents the detailed hardware implementation and discusses the performance evaluation of the proposed reconfigurable SCAC-Net network.

    Modula-2* and its compilation

    Automatic visual recognition using parallel machines

    Invariant features and quick matching algorithms are two major concerns in automatic visual recognition. The former reduces the size of an established model database, and the latter shortens the computation time. This dissertation discusses both line invariants under perspective projection and a parallel implementation of a dynamic programming technique for shape recognition. The feasibility of using parallel machines is demonstrated through the dramatically reduced time complexity. Our algorithms are implemented on the AP1000 MIMD parallel machine. For processing an object with n features, the time complexity of the proposed parallel algorithm is O(n), while that of a uniprocessor is O(n²). Two applications, one for shape matching and the other for chain-code extraction, demonstrate the usefulness of our methods. Invariants from four general lines under perspective projection are also discussed. In contrast to approaches that use the epipolar geometry, we investigate invariants under isotropy subgroups. Theoretically, two independent invariants can be found for four general lines in 3D space. In practice, we show how to obtain these two invariants from the projective images of four general lines without the need for camera calibration. A projective invariant recognition system based on a hypothesis-generation-testing scheme is run on the hypercube parallel architecture. Object recognition is achieved by matching the scene projective invariants to the model projective invariants, a process called transfer; the hypothesis-generation-testing scheme is then implemented on the hypercube parallel architecture.
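
    The O(n) parallel bound is the classic wavefront effect in dynamic programming: all cells on one anti-diagonal of the DP table are independent, so n processors can fill each of the 2n-1 anti-diagonals in constant time. The Python sketch below shows that traversal order on a toy edit-distance cost between two chain-code strings; the cost function and the serial simulation of the parallel inner loop are assumptions, not the dissertation's exact matching algorithm.

        # Sketch: anti-diagonal (wavefront) traversal of a DP matching table.
        # On a parallel machine the inner loop over i runs simultaneously on
        # n PEs, giving O(n) steps instead of O(n^2) on one processor.
        def dp_match_wavefront(a, b):
            n, m = len(a), len(b)
            D = [[0] * (m + 1) for _ in range(n + 1)]
            for i in range(n + 1):
                D[i][0] = i
            for j in range(m + 1):
                D[0][j] = j
            for d in range(2, n + m + 1):          # sweep anti-diagonals i+j=d
                for i in range(max(1, d - m), min(n, d - 1) + 1):  # parallel
                    j = d - i
                    cost = 0 if a[i - 1] == b[j - 1] else 1
                    D[i][j] = min(D[i - 1][j] + 1,         # delete
                                  D[i][j - 1] + 1,         # insert
                                  D[i - 1][j - 1] + cost)  # match/substitute
            return D[n][m]

        print(dp_match_wavefront("02123", "02223"))  # toy chain codes -> 1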

    Medical image tomography: A statistically tailored neural network approach

    In medical computed tomography (CT), tomographic images are reconstructed from planar information collected 180° to 360° around the patient. In clinical applications, the reconstructions are typically produced using a filtered backprojection algorithm. Filtered backprojection methods have limitations that create a high percentage of statistical uncertainty in the reconstructed images. Many techniques have been developed that produce better reconstructions, but they tend to be computationally expensive and thus impractical for clinical use.

    Artificial neural networks (ANNs) have been shown to be adept at learning and then simulating complex functional relationships. For medical tomography, a neural network can be trained to produce a reconstructed medical image given the planar data as input. Once trained, an ANN can produce an accurate reconstruction very quickly.

    A backpropagation ANN with statistically derived activation functions has been developed to improve the trainability and generalization ability of a network producing accurate reconstructions. The tailored activation functions are derived from the estimated probability density functions (p.d.f.s) of the ANN training data set. A set of sigmoid derivative functions is fitted to the p.d.f.s and then integrated to produce the ANN activation functions, which are also estimates of the cumulative distribution functions (c.d.f.s) of the training data. The statistically tailored activation functions and their derivatives are substituted for the logistic function and its derivative typically used in backpropagation ANNs.

    A set of geometric images was derived for training an ANN for cardiac SPECT image reconstruction. The planar projections of the geometric images were simulated using the Monte Carlo method to produce sixty-four 64-quadrant planar views taken 180° about each image. A 4096 x 629 x 4096 architecture ANN was simulated on the MasPar MP-2, a massively parallel single-instruction multiple-data (SIMD) computer. Trained on the set of geometric tomographic images, the ANN was able to generalize the input-to-output function of planar data to tomogram and accurately reconstruct actual cardiac SPECT images.
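
    The construction of the tailored activations can be illustrated directly: write the estimated p.d.f. as a weighted sum of logistic-sigmoid derivatives, and integrating term by term yields a weighted sum of sigmoids, an estimate of the training data's c.d.f., which becomes the activation function. In the Python sketch below the mixture parameters are made-up placeholders rather than values fitted to real projection data.

        # Sketch: p.d.f. fitted as a mixture of logistic derivatives; its
        # integral (a mixture of logistics) is the tailored activation.
        import math

        BUMPS = [(0.6, -1.0, 0.5), (0.4, 2.0, 1.0)]  # (weight, center, scale), assumed

        def sigmoid(x):
            return 1.0 / (1.0 + math.exp(-x))

        def activation(x):
            """Tailored activation = estimated c.d.f. of the training data."""
            return sum(w * sigmoid((x - c) / s) for w, c, s in BUMPS)

        def activation_deriv(x):
            """Derivative = the fitted p.d.f.; used in the backprop update."""
            return sum(w * sigmoid((x - c) / s) * (1.0 - sigmoid((x - c) / s)) / s
                       for w, c, s in BUMPS)

        for x in (-3.0, 0.0, 3.0):
            print(f"x={x:+.1f}  act={activation(x):.3f}  pdf={activation_deriv(x):.3f}")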

    Efficient Mapping of Neural Network Models on a Class of Parallel Architectures.

    This dissertation develops a formal and systematic methodology for the efficient mapping of several contemporary artificial neural network (ANN) models onto k-ary n-cube parallel architectures (KNCs). We apply the general mapping to several important ANN models, including feedforward ANNs trained with the backpropagation algorithm, radial basis function networks, cascade correlation learning, and adaptive resonance theory networks. Our approach utilizes a parallel task graph representing the concurrent operations of the ANN model during training. The mapping of the ANN is performed in two steps. First, the parallel task graph of the ANN is mapped to a virtual KNC of compatible dimensionality; this involves decomposing each operation into its atomic tasks. Second, the dimensionality of the virtual KNC architecture is recursively reduced through a sequence of transformations until a desired metric is optimized. We refer to this process as folding the virtual architecture. The optimization criteria we consider in this dissertation are defined in terms of the iteration time of the algorithm on the folded architecture. If necessary, the mapping scheme may utilize a subset of the processors of a given KNC architecture if that yields the most efficient simulation. A unique feature of our mapping is that it systematically selects an appropriate degree of parallelism, leading to a highly efficient realization of the ANN model on KNC architectures. A novel feature of our work is its ability to efficiently map unit-allocating ANNs, networks that possess a dynamic structure which grows during training. We present a highly efficient scheme for simulating such networks on existing KNC parallel architectures. We assume an upper bound on the size of the neural network and perform the folding such that the iteration time of the largest network is minimized. We show that our mapping leads to near-optimal simulation of smaller instances of the neural network. In addition, with our mapping no data migration or task rescheduling is needed as the size of the network grows.
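
    One way to picture the folding step: reduce the radix of the virtual KNC dimension by dimension, so each physical node simulates a fixed-size block of virtual nodes. The modulo rule in the Python sketch below is only an illustrative transformation; the dissertation derives the sequence of transformations that actually optimizes iteration time.

        # Sketch: fold an 8-ary 2-cube of virtual nodes onto a 4-ary 2-cube
        # by mapping each coordinate x to x mod k_phys (illustrative rule).
        def fold(coord, k_phys):
            return tuple(x % k_phys for x in coord)

        k_virt, k_phys = 8, 4
        load = {}  # physical node -> virtual nodes it simulates
        for x in range(k_virt):
            for y in range(k_virt):
                load.setdefault(fold((x, y), k_phys), []).append((x, y))

        print(len(load), "physical nodes, each simulating",
              {len(v) for v in load.values()}, "virtual nodes")  # 16, {4}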