448 research outputs found
Accelerated Modeling of Near and Far-Field Diffraction for Coronagraphic Optical Systems
Accurately predicting the performance of coronagraphs and tolerancing optical
surfaces for high-contrast imaging requires a detailed accounting of
diffraction effects. Unlike simple Fraunhofer diffraction modeling, near and
far-field diffraction effects, such as the Talbot effect, are captured by
plane-to-plane propagation using Fresnel and angular spectrum propagation. This
approach requires a sequence of computationally intensive Fourier transforms
and quadratic phase functions, which limits the design and aberration
sensitivity parameter space that can be explored at high fidelity in the
course of coronagraph design. This study presents the results of optimizing the
multi-surface propagation module of the open-source Physical Optics Propagation
in PYthon (POPPY) package. This optimization was performed by implementing and
benchmarking Fourier transforms and array operations on graphics processing
units, as well as optimizing multithreaded numerical calculations using the
NumExpr python library where appropriate, to speed the end-to-end simulation of
observatory and coronagraph optical systems. Using realistic systems, this
study demonstrates a greater than five-fold decrease in wall-clock runtime over
POPPY's previous implementation and describes opportunities for further
improvements in diffraction modeling performance. Comment: Presented at SPIE ASTI 2018, Austin, Texas. 11 pages, 6 figures
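The plane-to-plane step described in this abstract (a Fourier transform, multiplication by a phase function, and an inverse transform) can be sketched with plain NumPy. This is a minimal illustration of textbook angular spectrum propagation, not POPPY's implementation or API; the function name `angular_spectrum_propagate` and its parameters are assumptions made for this example, and evanescent components are simply discarded.

```python
import numpy as np


def angular_spectrum_propagate(field, wavelength, pixel_scale, distance):
    """Propagate a sampled complex field between two parallel planes.

    Hypothetical helper for illustration only (not part of POPPY).
    field       : 2-D complex array sampled on a uniform grid
    wavelength  : wavelength in meters
    pixel_scale : grid spacing in meters per pixel
    distance    : propagation distance in meters
    """
    n_y, n_x = field.shape
    # Spatial frequencies (cycles per meter) in FFT ordering.
    fx = np.fft.fftfreq(n_x, d=pixel_scale)
    fy = np.fft.fftfreq(n_y, d=pixel_scale)
    fxx, fyy = np.meshgrid(fx, fy)

    # Squared longitudinal spatial frequency; negative values are evanescent.
    arg = 1.0 / wavelength**2 - fxx**2 - fyy**2
    propagating = arg > 0
    kz = 2.0 * np.pi * np.sqrt(np.where(propagating, arg, 0.0))

    # Transfer function: pure phase for propagating waves, zero otherwise.
    transfer = np.where(propagating, np.exp(1j * kz * distance), 0.0)

    # Forward FFT, multiply by the transfer function, inverse FFT.
    return np.fft.ifft2(np.fft.fft2(field) * transfer)
```

The two FFTs dominate the cost of each surface-to-surface step, which is why moving them to a GPU or a multithreaded library is the natural target for the speedups the abstract reports.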
Improving Performance Isolation on Chip Multiprocessors via an Operating System Scheduler
We describe a new operating system scheduling algorithm that improves performance isolation on chip multiprocessors (CMP). Poor performance isolation occurs when an application’s performance is determined by the behaviour of its co-runners, i.e., other applications simultaneously running with it. This performance dependency is caused by unfair, co-runner-dependent cache allocation on CMPs. Poor performance isolation interferes with the operating system’s control over priority enforcement and hinders QoS provisioning. Previous solutions required modifications to the hardware. We present a new software solution. Our cache-fair algorithm ensures that the application runs as quickly as it would under fair cache allocation, regardless of how the cache is actually allocated. If the thread executes fewer instructions per cycle than it would under fair cache allocation, the scheduler increases that thread’s CPU timeslice. This way, the thread’s overall performance does not suffer because it is allowed to use the CPU longer. We describe our implementation of the algorithm in Solaris™ 10, and show that it significantly improves performance isolation for SPEC CPU, SPEC JBB and TPC-C.
Engineering and Applied Science
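The timeslice adjustment described in this abstract can be sketched as follows. The abstract gives no formula, so this example assumes the simplest rule consistent with it: scale the timeslice by the ratio of fair to actual instructions per cycle (IPC), so the thread retires as many instructions as it would have under fair cache allocation. The function name, the parameters, and the assumption that the fair IPC is estimated elsewhere are all hypothetical.

```python
def cache_fair_timeslice(base_timeslice_ms, fair_ipc, actual_ipc):
    """Return a timeslice that compensates a thread for a reduced cache share.

    Hypothetical sketch, not the Solaris implementation.
    base_timeslice_ms : timeslice the thread would normally receive
    fair_ipc          : estimated IPC under fair cache allocation (assumed given)
    actual_ipc        : IPC measured under the actual cache allocation
    """
    if fair_ipc <= 0 or actual_ipc <= 0:
        raise ValueError("IPC values must be positive")
    if actual_ipc >= fair_ipc:
        # The thread is not running slower than its fair rate: no extension.
        return base_timeslice_ms
    # Extend the timeslice so that instructions retired match the fair case:
    # actual_ipc * adjusted == fair_ipc * base.
    return base_timeslice_ms * fair_ipc / actual_ipc
```

For example, a thread with a 10 ms base timeslice that runs at half its fair IPC would receive 20 ms under this rule.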
Design and Performance of Scalable High-Performance Programmable Routers - Doctoral Dissertation, August 2002
The flexibility to adapt to new services and protocols without changes in the underlying hardware is and will increasingly be a key requirement for advanced networks. Introducing a processing component into the data path of routers and implementing packet processing in software provides this ability. In such a programmable router, a powerful processing infrastructure is necessary to achieve a level of performance that is comparable to custom silicon-based routers and to demonstrate the feasibility of this approach. This work aims at the general design of such programmable routers and, specifically, at the design and performance analysis of the processing subsystem. The necessity of programmable routers is motivated, and a router design is proposed. Based on the design, a general performance model is developed and quantitatively evaluated using a new network processor benchmark. Operational challenges, like scheduling of packets to processing engines, are addressed, and novel algorithms are presented. The results of this work give qualitative and quantitative insights into this new domain that combines issues from networking, computer architecture, and system design.
Performance Modeling of Multithreaded Programs for Mobile Asymmetric Chip Multiprocessors
Asymmetric chip multiprocessors (ACMPs) have multiple core types that are instruction-set compatible but optimized differently to trade performance and power in mobile devices. The challenge for ACMPs is to map the program to the best core type and thread count to achieve performance requirements under power constraints. This paper describes an empirical strategy, MONARCH, to automatically build estimation models that capture how a multithreaded program’s performance scales with thread count and core type. We show that MONARCH’s models are accurate and useful to find mappings that achieve performance goals while minimizing power.
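The mapping decision described in this abstract (pick the core type and thread count that meet a performance goal at the lowest power) can be illustrated with a small selection routine. This sketch does not reproduce MONARCH's models; it assumes the per-mapping performance and power estimates already exist, and the function name, the dictionary layout, and any values passed to it are hypothetical.

```python
def pick_mapping(predictions, min_performance):
    """Choose the lowest-power mapping that meets a performance goal.

    Hypothetical sketch, not MONARCH itself.
    predictions     : dict mapping (core_type, thread_count) to a tuple of
                      (estimated_performance, estimated_power)
    min_performance : required performance, in the same units as the estimates
    Returns the chosen (core_type, thread_count), or None if no mapping
    meets the goal.
    """
    feasible = [
        (mapping, power)
        for mapping, (performance, power) in predictions.items()
        if performance >= min_performance
    ]
    if not feasible:
        return None
    # Among mappings that meet the goal, take the one with the least power.
    return min(feasible, key=lambda item: item[1])[0]
```

With estimates such as `{("little", 4): (2.4, 1.1), ("big", 1): (2.0, 1.5)}` and a goal of 2.0, the routine would return `("little", 4)`, since both mappings meet the goal and the first draws less power.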
Cache-Fair Thread Scheduling for Multicore Processors
We present a new operating system scheduling algorithm for multicore processors. Our algorithm reduces the effects of unequal CPU cache sharing that occur on these processors and cause unfair CPU sharing, priority inversion, and inadequate CPU accounting. We describe the implementation of our algorithm in the Solaris operating system and demonstrate that it produces fairer schedules enabling better priority enforcement and improved performance stability for applications. With conventional scheduling algorithms, application performance on multicore processors varies by up to 36% depending on the runtime characteristics of concurrent processes. We reduce this variability by up to a factor of seven.
Engineering and Applied Science
Doctor of Philosophy
With the explosion of chip transistor counts, the semiconductor industry has struggled with ways to continue scaling computing performance in line with historical trends. In recent years, the de facto solution to utilize excess transistors has been to increase the size of the on-chip data cache, allowing fast access to an increased portion of main memory. These large caches allowed the continued scaling of single thread performance, which had not yet reached the limit of instruction level parallelism (ILP). As we approach the potential limits of parallelism within a single threaded application, new approaches such as chip multiprocessors (CMP) have become popular for scaling performance utilizing thread level parallelism (TLP). This dissertation identifies the operating system as a ubiquitous area where single threaded performance and multithreaded performance have often been ignored by computer architects. We propose that novel hardware and OS co-design has the potential to significantly improve current chip multiprocessor designs, enabling increased performance and improved power efficiency. We show that the operating system contributes a nontrivial overhead to even the most computationally intense workloads and that this OS contribution grows to a significant fraction of total instructions when executing several common applications found in the datacenter. We demonstrate that architectural improvements have had little to no effect on the performance of the OS over the last 15 years, leaving ample room for improvements. We specifically consider three potential solutions to improve OS execution on modern processors. First, we consider the potential of a separate operating system processor (OSP) operating concurrently with general purpose processors (GPP) in a chip multiprocessor organization, with several specialized structures acting as efficient conduits between these processors.
Second, we consider the potential of segregating existing caching structures to decrease cache interference between the OS and application. Third, we propose that there are components within the OS itself that should be refactored to be both multithreaded and cache topology aware, which in turn improves the performance and scalability of many-threaded applications.
The fast multipole method at exascale
This thesis presents a top-to-bottom analysis of designing and implementing fast algorithms for current and future systems. We present new analysis, algorithmic techniques, and implementations of the Fast Multipole Method (FMM) for solving N-body problems. We target the FMM because it is broadly applicable to a variety of scientific particle simulations used to study electromagnetic, fluid, and gravitational phenomena, among others. Importantly, the FMM has asymptotically optimal time complexity with guaranteed approximation accuracy. As such, it is among the most attractive solutions for scalable particle simulation on future extreme scale systems.
We specifically address two key challenges. The first challenge is how to engineer fast code for today’s platforms. We present the first in-depth study of multicore optimizations and tuning for FMM, along with a systematic approach for transforming a conventionally-parallelized FMM into a highly-tuned one. We introduce novel optimizations that significantly improve the within-node scalability of the FMM, thereby enabling high performance in the face of multicore and manycore systems. The second challenge is how to understand scalability on future systems. We present a new algorithmic complexity analysis of the FMM that considers both intra- and inter-node communication costs. Using these models, we present results for choosing the optimal algorithmic tuning parameter. This analysis also yields the surprising prediction that although the FMM is largely compute-bound today, and therefore highly scalable on current systems, the trajectory of processor architecture designs, if there are no significant changes, could cause it to become communication-bound as early as the year 2015. This prediction suggests the utility of our analysis approach, which directly relates algorithmic and architectural characteristics, for enabling a new kind of high-level algorithm-architecture co-design.
To demonstrate the scientific significance of FMM, we present two applications: direct simulation of blood, which is a multi-scale, multi-physics problem, and large-scale biomolecular electrostatics. MoBo (Moving Boundaries) is the infrastructure for the direct numerical simulation of blood. It comprises two key algorithmic components, of which FMM is one. We were able to simulate blood flow using Stokesian dynamics on 200,000 cores of Jaguar, a petaflop system, and achieved a sustained performance of 0.7 Petaflop/s. The second application, which we propose as future work in this thesis, is biomolecular electrostatics, where we solve for the electrical potential using the boundary-integral formulation discretized with boundary element methods (BEM). The computational kernel in solving the large linear system is a dense matrix-vector multiply, which we propose can be calculated using our scalable FMM. We propose to begin with the two-dielectric problem, where the electrostatic field is calculated using two continuum dielectric media, the solvent and the molecule. This is only a first step to solving biologically challenging problems which have more than two dielectric media, ion-exclusion layers, and solvent-filled cavities.
Finally, given the difficulty in producing high-performance scalable code, productivity is a key concern. Recently, numerical algorithms are being redesigned to take advantage of the architectural features of emerging multicore processors. These new classes of algorithms express fine-grained asynchronous parallelism and hence reduce the cost of synchronization. We performed the first extensive performance study of a recently proposed parallel programming model, called Concurrent Collections (CnC). In CnC, the programmer expresses her computation in terms of application-specific operations, partially-ordered by semantic scheduling constraints. The CnC model is well-suited to expressing asynchronous-parallel algorithms, so we evaluate CnC using two dense linear algebra algorithms in this style for execution on state-of-the-art multicore systems. Our implementations in CnC were able to match and in some cases even exceed competing vendor-tuned and domain-specific library codes. We combine these two distinct research efforts by expressing FMM in CnC; our approach tries to marry performance with productivity, which will be critical on future systems. Looking forward, we would like to extend this to distributed memory machines, specifically to implement FMM in the new distributed CnC, distCnC, to express fine-grained parallelism which would require significant effort in alternative models.
Ph.D
Power models, energy models and libraries for energy-efficient concurrent data structures and algorithms
EXCESS deliverable D2.3. More information at http://www.excess-project.eu/ This deliverable reports the results of the power models, energy models and libraries for energy-efficient concurrent data structures and algorithms as available by project month 30 of Work Package 2 (WP2). It reports i) the latest results of Task 2.2-2.4 on providing programming abstractions and libraries for developing energy-efficient data structures and algorithms and ii) the improved results of Task 2.1 on investigating and modeling the trade-off between energy and performance of concurrent data structures and algorithms. The work has been conducted on two main EXCESS platforms: Intel platforms with recent Intel multicore CPUs and Movidius Myriad platforms.