31 research outputs found
Parallel fast fourier transform in SPMD style of cilk
Copyright © 2019 Inderscience Enterprises Ltd. In this paper, we propose a parallel one-dimensional non-recursive fast Fourier transform (FFT) program based on conventional Cooley-Tukey’s algorithm written in C using Cilk in single program multiple data (SPMD) style. As a highly compact designed code, this code is compared with a highly tuned parallel recursive fast Fourier transform (FFT) using Cilk, which is included in Cilk package of version 5.4.6. Both algorithms are executed on multicore servers, and experimental results show that the performance of the SPMD style of Cilk fast Fourier transform (FFT) parallel code is highly competitive and promising
Cooperative hierarchical resource management for efficient composition of parallel software
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2010.Cataloged from PDF version of thesis.Includes bibliographical references (p. 93-96).There cannot be a thriving software industry in the upcoming manycore era unless programmers can compose arbitrary parallel codes without sacrificing performance. We believe that the efficient composition of parallel codes is best achieved by exposing unvirtualized hardware resources and sharing these cooperatively across parallel codes within an application. This thesis presents Lithe, a user-level framework that enables efficient composition of parallel software components. Lithe provides the basic primitives, standard interface, and thin runtime to enable parallel codes to efficiently use and share processing resources. Lithe can be inserted underneath the runtimes of legacy parallel software environments to provide bolt-on composability - without changing a single line of the original application code. Lithe can also serve as the foundation for building new parallel abstractions and runtime systems that automatically interoperate with one another. We have built and ported a wide range of interoperable scheduling, synchronization, and domain-specific libraries using Lithe. We show that the modifications needed are small and impose no performance penalty when running each library standalone. We also show that Lithe improves the performance of real world applications composed of multiple parallel libraries by simply relinking them with the new library binaries. Moreover, the Lithe version of an application even outperformed a third-party expert-tuned implementation by being more adaptive to different phases of the computation.by Heidi Pan.Ph.D
Type Oriented Parallel Programming
Context: Parallel computing is an important field within the sciences. With the emergence of multi, and soon many, core CPUs this is moving more and more into the domain of general computing. HPC programmers want performance, but at the moment this comes at a cost; parallel languages are either efficient or conceptually simple, but not both.
Aim: To develop and evaluate a novel programming paradigm which will address the problem of parallel programming and allow for languages which are both conceptually simple and efficient.
Method: A type-based approach, which allows the programmer to control all aspects of parallelism by the use and combination of types has been developed. As a vehicle to present and analyze this new paradigm a parallel language, Mesham, and associated compilation tools have also been created. By using types to express parallelism the programmer can exercise efficient, flexible control in a high level abstract model yet with a sufficiently rich amount of information in the source code upon which the compiler can perform static analysis and optimization.
Results: A number of case studies have been implemented in Mesham. Official benchmarks have been performed which demonstrate the paradigm allows one to write code which is comparable, in terms of performance, with existing high performance solutions. Sections of the parallel simulation package, Gadget-2, have been ported into Mesham, where substantial code simplifications have been made.
Conclusions: The results obtained indicate that the type-based approach does satisfy the aim of the research described in this thesis. By using this new paradigm the
programmer has been able to write parallel code which is both simple and efficient
Compiling for parallel multithreaded computation on symmetric multiprocessors
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 1998.Includes bibliographical references (p. 145-149).by Andrew Shaw.Ph.D
Recommended from our members
Data-Driven Programming Abstractions and Optimization for Multi-Core Platforms
Multi-core platforms have spread to all corners of the computing industry, and trends in design and power indicate that the shift to multi-core will become even wider-spread in the future. As the number of cores on a chip rises, the complexity of memory systems and on-chip interconnects increases drastically. The programmer inherits this complexity in the form of new responsibilities for task decomposition, synchronization, and data movement within an application, which hitherto have been concealed by complex processing pipelines or deemed unimportant since tasks were largely executed sequentially. To some extent, the need for explicit parallel programming is inevitable, due to limits in the instruction-level parallelism that can be automatically extracted from a program. However, these challenges create a great opportunity for the development of new programming abstractions which hide the low-level architectural complexity while exposing intuitive high-level mechanisms for expressing parallelism. Many models of parallel programming fall into the category of data-centric models, where the structure of an application depends on the role of data and communication in the relationships between tasks. The utilization of the inter-core communication networks and effective scaling to large data sets are decidedly important in designing efficient implementations of parallel applications. The questions of how many low-level architectural details should be exposed to the programmer, and how much parallelism in an application a programmer should expose to the compiler remain open-ended, with different answers depending on the architecture and the application in question. I propose that the key to unlocking the capabilities of multi-core platforms is the development of abstractions and optimizations which match the patterns of data movement in applications with the inter-core communication capabilities of the platforms. After a comparative analysis that confirms and stresses the importance of finding a good match between the programming abstraction, the application, and the architecture, this dissertation proposes two techniques that showcase the power of leveraging data dependency patterns in parallel performance optimizations. Flexible Filters dynamically balance load in stream programs by creating flexibility in the runtime data flow through the addition of redundant stream filters. This technique combines a static mapping with dynamic flow control to achieve light-weight, distributed and scalable throughput optimization. The properties of stream communication, i.e., FIFO pipes, enable flexible filters by exposing the backpressure dependencies between tasks. Next, I present Huckleberry, a novel recursive programming abstraction developed in order to allow programmers to expose data locality in divide-and-conquer algorithms at a high level of abstraction. Huckleberry automatically converts sequential recursive functions with explicit data partitioning into parallel implementations that can be ported across changes in the underlying architecture including the number of cores and the amount of on-chip memory. I then present a performance model for multi-core applications which provides an efficient means to evaluate the trade-offs between the computational and communication requirements of applications together with the hardware resources of a target multi-core architecture. The model encompasses all data-driven abstractions that can be reduced to a task graph representation and is extensible to performance techniques such as Flexible Filters that alter an application's original task graph. Flexible Filters and Huckleberry address the challenges of parallel programming on multi-core architectures by taking advantage of properties specific to the stream and recursive paradigms, and the performance model creates a unifying framework based on the communication between tasks in parallel applications. Combined, these contributions demonstrate that specialization with respect to communication patterns enhances the ability of parallel programming abstractions and optimizations to harvest the power of multi-core platforms
X10 for high-performance scientific computing
High performance computing is a key technology that enables large-scale physical
simulation in modern science. While great advances have been made in methods and
algorithms for scientific computing, the most commonly used programming models
encourage a fragmented view of computation that maps poorly to the underlying
computer architecture.
Scientific applications typically manifest physical locality, which means that interactions
between entities or events that are nearby in space or time are stronger
than more distant interactions. Linear-scaling methods exploit physical locality by approximating
distant interactions, to reduce computational complexity so that cost is
proportional to system size. In these methods, the computation required for each
portion of the system is different depending on that portion’s contribution to the
overall result. To support productive development, application programmers need
programming models that cleanly map aspects of the physical system being simulated
to the underlying computer architecture while also supporting the irregular
workloads that arise from the fragmentation of a physical system.
X10 is a new programming language for high-performance computing that uses
the asynchronous partitioned global address space (APGAS) model, which combines
explicit representation of locality with asynchronous task parallelism. This thesis
argues that the X10 language is well suited to expressing the algorithmic properties
of locality and irregular parallelism that are common to many methods for physical
simulation.
The work reported in this thesis was part of a co-design effort involving researchers
at IBM and ANU in which two significant computational chemistry codes
were developed in X10, with an aim to improve the expressiveness and performance
of the language. The first is a Hartree–Fock electronic structure code, implemented
using the novel Resolution of the Coulomb Operator approach. The second evaluates
electrostatic interactions between point charges, using either the smooth particle
mesh Ewald method or the fast multipole method, with the latter used to simulate
ion interactions in a Fourier Transform Ion Cyclotron Resonance mass spectrometer.
We compare the performance of both X10 applications to state-of-the-art software
packages written in other languages.
This thesis presents improvements to the X10 language and runtime libraries for
managing and visualizing the data locality of parallel tasks, communication using
active messages, and efficient implementation of distributed arrays. We evaluate these improvements in the context of computational chemistry application examples.
This work demonstrates that X10 can achieve performance comparable to established
programming languages when running on a single core. More importantly,
X10 programs can achieve high parallel efficiency on a multithreaded architecture,
given a divide-and-conquer pattern parallel tasks and appropriate use of worker-local
data. For distributed memory architectures, X10 supports the use of active messages
to construct local, asynchronous communication patterns which outperform global,
synchronous patterns. Although point-to-point active messages may be implemented
efficiently, productive application development also requires collective communications;
more work is required to integrate both forms of communication in the X10
language. The exploitation of locality is the key insight in both linear-scaling methods and
the APGAS programming model; their combination represents an attractive opportunity
for future co-design efforts
スケジューリング遅延に基づいたタスク並列ランタイムシステムの性能差の解析
学位の種別: 課程博士審査委員会委員 : (主査)東京大学准教授 豊田 正史, 東京大学教授 田浦 健次朗, 東京大学准教授 入江 英嗣, 東京大学教授 中島 研吾, 理化学研究所チームリーダ 佐藤 三久, 東京工業大学准教授 横田 理央University of Tokyo(東京大学
PYDAC: A DISTRIBUTED RUNTIME SYSTEM AND PROGRAMMING MODEL FOR A HETEROGENEOUS MANY-CORE ARCHITECTURE
Heterogeneous many-core architectures that consist of big, fast cores and small, energy-efficient cores are very promising for future high-performance computing (HPC) systems. These architectures offer a good balance between single-threaded perfor- mance and multithreaded throughput. Such systems impose challenges on the design of programming model and runtime system. Specifically, these challenges include (a) how to fully utilize the chip’s performance, (b) how to manage heterogeneous, un- reliable hardware resources, and (c) how to generate and manage a large amount of parallel tasks.
This dissertation proposes and evaluates a Python-based programming framework called PyDac. PyDac supports a two-level programming model. At the high level, a programmer creates a very large number of tasks, using the divide-and-conquer strategy. At the low level, tasks are written in imperative programming style. The runtime system seamlessly manages the parallel tasks, system resilience, and inter- task communication with architecture support. PyDac has been implemented on both an field-programmable gate array (FPGA) emulation of an unconventional het- erogeneous architecture and a conventional multicore microprocessor. To evaluate the performance, resilience, and programmability of the proposed system, several micro-benchmarks were developed. We found that (a) the PyDac abstracts away task communication and achieves programmability, (b) the micro-benchmarks are scalable on the hardware prototype, but (predictably) serial operation limits some micro-benchmarks, and (c) the degree of protection versus speed could be varied in redundant threading that is transparent to programmers
A dependency-aware parallel programming model
Designing parallel codes is hard. One of the most important roadblocks to parallel programming is the presence of data dependencies. These restrict parallelism and, in general, to work them around requires complex analysis and leads to convoluted solutions that decrease the quality of the code.
This thesis proposes a solution to parallel programming that incorporates data dependencies into the model. The programming model can handle that information and to dynamically find parallelism that otherwise would be hard to find. This approach improves both programmability and parallelism, and thus performance.
While this problem has already been solved in OpenMP 4 at the time of this publication, this research begun before the problem was even being considered for OpenMP 3. In fact, some of the contributions of this thesis have had an influence on the approach taken in OpenMP 4. However, the contributions go beyond that and cover aspects that have not been considered yet in OpenMP 4.
The approach we propose is based on function-level dependencies across disjoint blocks of contiguous memory. While finding dependencies under those constraints is simple, it is much harder to do so over strided and possibly partially overlapping sets of data. This thesis also proposes a solution to this problem. By doing so, we increase the range of applicability of the original solution and increase the span of applicability of the programming model. OpenMP4 does not currently cover this aspect.
Finally, we present a solution to take advantage of the performance characteristics of Non-Uniform Memory Access architectures. Our proposal is at the programming model level and does not require changes in the code. It automatically distributes the data and does not rely on data migration nor replication. Instead, it is based exclusively on scheduling the computations. While this process is automatic, it can be tuned through minor changes in the code that do not require any change in the programming model.
Throughout the thesis, we demonstrate the effectiveness of the proposal through benchmarks that are either hard to program using other paradigms or that have different solutions. In most cases, our solutions perform either on par or better than already existing solutions. This includes the implementations available in well-known high-performance parallel libraries.Dissenyar codis paral·lels es complex. Un dels principals esculls a l'hora de programar aplicacions paral·leles és la presència de dependències. Aquestes constrenyen el paral·lelisme, i en general, per evitar-les es requereix realitzar anàlisis complicades que donen lloc a solucions complexes que redueixen la qualitat del codi. Aquesta tesi proposa una solució a la programació paral·lela que incorpora al model les dependències de dades. El model de programació és capaç d'utilitzar aquesta informació per a trobar paral·lelisme que altrament seria molt difícil de detectar i d'extreure. Aquest enfoc augmenta la programabilitat i el paral·lelisme, i per tant també el rendiment. Tot i que al moment de la publicació d'aquesta tesi, el problema ja ha estat resolt a OpenMP 4, la recerca d'aquesta tesi va començar abans de que el problema s'hagués plantejat en l'àmbit d'OpenMP 3. De fet, algunes de les contribucions de la tesi han influït en la solució emprada a OpenMP 4. Tanmateix, les contribucions van més enllà i cobreixen aspectes que encara no han estat considerats a OpenMP 4. La proposta es basa en dependències a nivell de funció entre blocs de memòria continus i sense intersecció. Tot i que trobar dependències sota aquestes condicions és senzill, fer-ho sobre dades no contínues amb possibles interseccions parcials és molt més complex. Aquesta tesi també proposa una solució a aquest problema. Fent això, es millora el rang d'aplicació de la solució original i per tant el del model de programació. Aquest és un dels aspectes que encara no es contemplen a OpenMP 4. Finalment, es presenta una solució que té en compte les característiques de rendiment de les arquitectures NUMA (Accés No Uniforme a la Memòria). La proposta es planteja a nivell del model de programació i no precisa de canvis al codi ja que les dades es distribueixen automàticament. En lloc de basar-se en la migració i la replicació de les dades, es basa exclusivament en la planificació de l'execució de les computacions. Tot i que aquest procés és automàtic, es pot afinar mitjançant petits canvis en el codi que no arriben a alterar el model de programació. Al llarg d'aquesta tesi es demostra la efectivitat de les propostes a través de bancs de proves que son difícils de programar amb altres paradigmes o que tenen solucions diferents. A la majoria dels casos les nostres solucions tenen un rendiment similar o millor que les solucions preexistents, que inclouen implementacions en ben reconegues biblioteques paral·leles d'alt rendiment