609 research outputs found

    Parallel computing in combinatorial optimization


    Report from the MPP Working Group to the NASA Associate Administrator for Space Science and Applications

    NASA's Office of Space Science and Applications (OSSA) gave a select group of scientists the opportunity to test and implement their computational algorithms on the Massively Parallel Processor (MPP) located at Goddard Space Flight Center, beginning in late 1985. One year later, the Working Group presented its report, which addressed the following: algorithms, programming languages, architecture, programming environments, the relationship between theory and practice, and performance measurement. The findings point to a number of demonstrated computational techniques for which the MPP architecture is ideally suited. For example, systolic VLSI simulation (where distances are short), lattice simulation, neural network simulation, and image problems not only executed much faster on the MPP than on conventional computers, but were also found to be easier to program on the MPP's architecture than on a CYBER 205 or even a VAX. The report also makes technical recommendations covering all aspects of MPP use, as well as recommendations concerning the future of the MPP and machines based on similar architectures, expansion of the Working Group, and study of the role of future parallel processors for the space station, EOS, and the Great Observatories era.
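Lattice problems of the kind listed above map naturally onto a SIMD array machine like the MPP: one lattice cell per processing element, with every cell updated synchronously in lockstep. A minimal serial sketch of one such synchronous update step (illustrative Python, not code from the report):

```python
# One synchronous update step of a 1D lattice with periodic boundaries.
# On a SIMD array each cell would be computed by its own processing
# element; here we mimic that by building the whole new state at once.
def lattice_step(cells):
    """Average each cell with its two neighbors (periodic boundary)."""
    n = len(cells)
    return [(cells[(i - 1) % n] + cells[i] + cells[(i + 1) % n]) / 3.0
            for i in range(n)]

# A single "hot" cell diffuses outward; the total stays conserved.
state = [0.0] * 8
state[4] = 9.0
state = lattice_step(state)
```

The key SIMD property is that every output cell depends only on the previous state, so all cells can be computed in parallel with no ordering constraints.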

    Doctor of Philosophy

    The embedded system space is characterized by rapid evolution in the complexity and functionality of applications. In addition, the short time-to-market nature of the business motivates the use of programmable devices capable of meeting the conflicting constraints of low energy, high performance, and short design times. The keys to meeting these conflicting constraints are specialization and maximally extracting available application parallelism. General-purpose processors are flexible but are either too power hungry or lack the necessary performance. Application-specific integrated circuits (ASICs) efficiently meet the performance and power needs but are inflexible. Programmable domain-specific architectures (DSAs) are an attractive middle ground, but their design requires significant time, resources, and expertise in a variety of specialties, ranging from application algorithms to architecture and, ultimately, circuit design. This dissertation presents CoGenE, a design framework that automates the design of energy-performance-optimal DSAs for embedded systems. For a given application domain and a user-chosen initial architectural specification, CoGenE consists of a Compiler to generate the execution binary, a simulator Generator to collect performance/energy statistics, and an Explorer that modifies the current architecture to improve energy-performance-area characteristics. This process repeats automatically until the user-specified constraints are achieved, removing or reducing the time needed to understand the application, manually design the DSA, and generate object code for the DSA. Thus, CoGenE is a new design methodology that represents a significant improvement in performance, energy dissipation, design time, and resources. This dissertation employs the face recognition domain to showcase a flexible architectural design methodology that creates "ASIC-like" DSAs.
The DSAs are instruction set architecture (ISA)-independent and achieve good energy-performance characteristics by coscheduling the often conflicting constraints of data access, data movement, and computation through a flexible interconnect. This flexibility, however, comes at the cost of significantly increased programming complexity and code generation time. To address this problem, the CoGenE compiler employs integer linear programming (ILP)-based 'interconnect-aware' scheduling techniques for automatic code generation. The CoGenE explorer employs an iterative technique to search the complete design space and select a set of energy-performance-optimal candidates. When compared to manual designs, results demonstrate that CoGenE produces superior designs for three application domains: face recognition, speech recognition, and wireless telephony. While CoGenE is well suited to applications that exhibit streaming behavior, multithreaded applications like ray tracing present a different but important challenge. To demonstrate its generality, CoGenE is evaluated in designing a novel multicore N-wide SIMD architecture, known as StreamRay, for the ray tracing domain. CoGenE is used to synthesize the SIMD execution cores, the compiler that generates the application binary, and the interconnection subsystem. Further, separating address and data computations in space reduces data movement and contention for resources, thereby significantly improving performance compared to existing ray tracing approaches.
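The compile-simulate-explore loop that CoGenE automates can be caricatured as a simple improvement search over architecture candidates. Every name and the toy cost model below are illustrative assumptions, not the framework's actual API:

```python
# Hypothetical sketch of a compile/simulate/explore loop: repeatedly
# perturb an architecture spec and keep only changes that lower an
# energy-performance cost, until the iteration budget runs out.
def explore(app, arch, cost, mutate, budget=100):
    best, best_cost = arch, cost(app, arch)
    for _ in range(budget):
        cand = mutate(best)          # Explorer proposes a variant
        c = cost(app, cand)          # Compiler + Generator evaluate it
        if c < best_cost:            # keep only improving designs
            best, best_cost = cand, c
    return best, best_cost

# Toy usage: the "architecture" is just a SIMD issue width; the cost
# penalizes both runtime (fewer lanes -> slower) and energy (more
# lanes -> hungrier), so the search settles on a middle ground.
cost = lambda app, w: app / w + 0.1 * w
best, c = explore(64, 1, cost, mutate=lambda w: w + 1, budget=50)
```

A real explorer searches a much richer space (functional units, interconnect topology, memory hierarchy), but the keep-if-better skeleton is the same.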

    Acceleration of k-Nearest Neighbor and SRAD Algorithms Using Intel FPGA SDK for OpenCL

    Field Programmable Gate Arrays (FPGAs) have been widely used for accelerating machine learning algorithms. However, the high design cost and time of implementing FPGA-based accelerators using traditional HDL-based design methodologies have discouraged users from designing them. In recent years, a new CAD tool called Intel FPGA SDK for OpenCL (IFSO) has allowed fast and efficient design of FPGA-based hardware accelerators from high-level specifications such as OpenCL, so that even software engineers with basic hardware design knowledge can design FPGA-based accelerators. In this thesis, IFSO has been used to explore acceleration of the k-Nearest-Neighbour (kNN) algorithm and Speckle Reducing Anisotropic Diffusion (SRAD) simulation using FPGAs. kNN is a popular algorithm used in machine learning. Bitonic sorting and radix sorting algorithms were used in the kNN algorithm to check whether they provide any performance improvements. Acceleration of SRAD simulation was also explored. The experimental results obtained for these algorithms from FPGA-based acceleration were compared with state-of-the-art CPU implementations. The optimized algorithms were implemented on two different FPGAs (Intel Stratix A7 and Intel Arria 10 GX). Experimental results show that the FPGA-based accelerators provided similar or better execution times (speedups of up to 80X) and better power efficiency (a 75% reduction in power consumption) than traditional platforms such as a workstation based on two Intel Xeon E5-2620 series processors (each with 6 cores, running at 2.4 GHz).
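At its core, kNN selects the k smallest of the candidate distances and takes a majority vote; the thesis's FPGA kernels replace the generic sort below with hardware-friendly bitonic or radix sorting networks, but the algorithm is the same. A minimal serial sketch (illustrative Python, not the thesis's OpenCL code):

```python
# Minimal kNN classification: rank training points by squared
# Euclidean distance to the query, then majority-vote over the
# labels of the k nearest.
def knn_classify(train, query, k):
    """train: list of (feature_vector, label) pairs."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, query)), label)
        for x, label in train
    )
    top = [label for _, label in dists[:k]]
    return max(set(top), key=top.count)

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((5.0, 5.0), "B"), ((5.1, 4.9), "B")]
label = knn_classify(train, (0.2, 0.1), k=3)
```

Note that only the k smallest distances are needed, which is why partial sorting networks such as a bitonic top-k stage map so well to FPGA pipelines.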

    Fast data-parallel rendering of digital volume images.

    by Song Zou. Year shown on spine: 1997. Thesis (M.Phil.)--Chinese University of Hong Kong, 1995. Includes bibliographical references (leaves 69-[72]).
    Chapter 1 --- Introduction
    Chapter 2 --- Related works
    Chapter 2.1 --- Spatial domain methods
    Chapter 2.2 --- Transformation based methods
    Chapter 2.3 --- Parallel Implementation
    Chapter 3 --- Parallel computation model
    Chapter 3.1 --- Introduction
    Chapter 3.2 --- Classifications of Parallel Computers
    Chapter 3.3 --- The SIMD machine architectures
    Chapter 3.4 --- The communication within the parallel processors
    Chapter 3.5 --- The parallel display mechanisms
    Chapter 4 --- Data preparation
    Chapter 4.1 --- Introduction
    Chapter 4.2 --- Original data layout in the processor array
    Chapter 4.3 --- Shading
    Chapter 4.4 --- Classification
    Chapter 5 --- Fast data parallel rotation and resampling algorithms
    Chapter 5.1 --- Introduction
    Chapter 5.2 --- Affine Transformation
    Chapter 5.3 --- Related works
    Chapter 5.3.1 --- Resampling in ray tracing
    Chapter 5.3.2 --- Direct Rotation
    Chapter 5.3.3 --- General resampling approaches
    Chapter 5.3.4 --- Rotation by shear
    Chapter 5.4 --- The minimum mismatch rotation
    Chapter 5.5 --- Load balancing
    Chapter 5.6 --- Resampling algorithm
    Chapter 5.6.1 --- Nearest neighbor
    Chapter 5.6.2 --- Linear Interpolation
    Chapter 5.6.3 --- Aitken's Algorithm
    Chapter 5.6.4 --- Polynomial resampling in 3D
    Chapter 5.7 --- A comparison between the resampling algorithms
    Chapter 5.7.1 --- The quality
    Chapter 5.7.2 --- Implementation and cost
    Chapter 6 --- Data reordering using binary swap
    Chapter 6.1 --- The sorting algorithm
    Chapter 6.2 --- The communication cost
    Chapter 7 --- Ray composition
    Chapter 7.1 --- Introduction
    Chapter 7.2 --- Ray Composition by Monte Carlo Method
    Chapter 7.3 --- The Associative Color Model
    Chapter 7.4 --- Parallel Implementation
    Chapter 7.5 --- Discussion and further improvement
    Chapter 8 --- Conclusion and further work
    Bibliography
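Chapter 5 compares several resampling schemes; the two simplest, nearest-neighbor and linear interpolation, can be sketched in 1D (hypothetical code, not from the thesis, which works on 3D volumes):

```python
# Two basic resampling schemes for fetching a value at a fractional
# sample position x along a row of voxel values.
def nearest(samples, x):
    """Nearest-neighbor: round x to the closest voxel index."""
    return samples[int(x + 0.5)]

def linear(samples, x):
    """Linear interpolation between the two bracketing voxels."""
    i = int(x)
    t = x - i                     # fractional offset in [0, 1)
    return (1 - t) * samples[i] + t * samples[i + 1]

row = [0.0, 10.0, 20.0, 30.0]
```

Nearest-neighbor is cheapest but blocky; linear interpolation costs one multiply-add per axis and smooths the result, which is the quality/cost trade-off examined in section 5.7.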