197 research outputs found
Compiler analysis for trace-level speculative multithreaded architectures
Trace-level speculative multithreaded processors exploit trace-level speculation by means of two threads working cooperatively. One thread, the speculative thread, executes instructions ahead of the other by predicting the live-output values of selected traces. The other thread executes the speculated traces and verifies the speculation made by the first thread. In this paper, we propose a static program analysis for identifying candidate traces to be speculated. This approach identifies large regions of code whose live-output values are likely to be successfully predicted. We present several heuristics, based on compiler analysis and program profiling information, to determine the best opportunities for dynamic speculation. Simulation results show that the proposed trace-recognition techniques achieve an average speedup close to 38% on a collection of SPEC2000 benchmarks.
PERFORMANCE OPTIMIZATION OF A STRUCTURED CFD CODE - GHOST ON COMMODITY CLUSTER ARCHITECTURES
This thesis focuses on optimizing the performance of an in-house, structured, 2D CFD code, GHOST, on commodity cluster architectures. The basic philosophy of the work is to optimize the cache usage of the code through efficient coding techniques, without changing the underlying numerical algorithm. Various optimization techniques that were implemented, and the resulting changes in performance, are presented. Two techniques, external and internal blocking, that were implemented earlier to tune the performance of this code are reviewed. What follows is a further tuning effort to circumvent the problems associated with the blocking techniques. Finally, to establish the generality of the optimization techniques, testing was done on a more complicated test case. All the techniques presented in this thesis have been tested on steady, laminar test cases. The optimized versions of the code are shown to achieve better performance on the variety of commodity cluster architectures chosen in this study.
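The external/internal blocking idea above can be illustrated with a generic loop-tiling sketch (not the actual GHOST code): a 2D array traversal is split into B×B sub-blocks so the working set of each sub-block fits in cache.

```cpp
#include <cstddef>
#include <vector>

// Tiled (blocked) matrix transpose: processing BxB sub-blocks keeps both the
// source rows and the destination columns resident in cache, instead of
// streaming the whole destination column per source row.
constexpr std::size_t B = 32;  // tile size; tuned to the cache size in practice

void transpose_blocked(const std::vector<double>& src, std::vector<double>& dst,
                       std::size_t n) {
    for (std::size_t ii = 0; ii < n; ii += B)
        for (std::size_t jj = 0; jj < n; jj += B)
            for (std::size_t i = ii; i < ii + B && i < n; ++i)
                for (std::size_t j = jj; j < jj + B && j < n; ++j)
                    dst[j * n + i] = src[i * n + j];  // dst is the transpose of src
}
```

The numerical result is identical to the untiled loop; only the memory access order, and hence the cache behaviour, changes.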
Resource-aware Programming in a High-level Language - Improved performance with manageable effort on clustered MPSoCs
Until 2001, Moore's law and Dennard scaling meant that improved CPUs doubled performance roughly every 18 months.
Today, concurrency is the dominant means of acceleration, from supercomputers down to mobile devices.
However, newer phenomena such as "dark silicon" increasingly hinder further speedups through hardware alone.
To achieve further speedups, software must become more aware of its hardware resources.
Connected with this phenomenon is increasingly heterogeneous hardware.
Supercomputers integrate accelerators such as GPUs.
Mobile SoCs (e.g., smartphones) integrate ever more capabilities.
Exploiting specialized hardware is a well-known way to reduce energy consumption, another important aspect that must be weighed against raw speed.
For example, supercomputers are also ranked by "performance per watt".
Currently, low-level systems programmers are used to reasoning about hardware, while the typical high-level programmer prefers to abstract away from the platform as much as possible (e.g., the cloud).
"High-level" does not mean that hardware is irrelevant, but that it can be abstracted.
If you develop a Java application for Android, battery life can be an important concern.
Eventually, however, even high-level languages must become resource-aware in order to improve speed or energy consumption.
I worked on these problems within the Transregio "Invasive Computing".
In my dissertation I present a framework for making high-level-language applications resource-aware in order to improve their performance.
This can, for example, yield increased efficiency or faster execution for the system as a whole.
A core idea is that applications do not optimize themselves.
Instead, they pass all relevant information to the operating system.
The operating system has a global view and makes the decisions about resources.
We call this process "invasion".
The application's task is to adapt to these decisions, but not to make them itself.
The challenge is to define a language with which applications can communicate resource constraints and performance information.
Such a language must be expressive enough for complex information, extensible to new resource types, and convenient for the programmer.
The central contributions of this dissertation are:
A theoretical model of resource management, which captures the essence of the resource-aware framework,
justifies the correctness of the operating system's decisions with respect to an application's constraints,
and proves my theses of efficiency and speedup in theory.
A framework and a compilation path for resource-aware programming in the high-level language X10.
To evaluate the approach, we implemented applications from high-performance computing.
A speedup of 5x was measured.
A memory consistency model for the X10 programming language, as this is a necessary step towards a formal semantics linking the theoretical model and the concrete implementation.
In summary, I show that resource-aware programming in high-level languages on future many-core architectures is feasible with manageable effort and improves performance.
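The invasion process described above, where the application states a wish, the operating system decides, and the application adapts, can be sketched roughly as follows. This is a hypothetical C++ illustration of the invade/infect/retreat pattern; the names and the fixed grant policy are assumptions for the sketch, not the dissertation's actual X10 API.

```cpp
#include <functional>

// Hypothetical sketch: the application asks for resources ("invade"), the
// runtime decides what to grant, the application runs work adapted to the
// grant ("infect"), and finally releases the claim ("retreat").
struct Claim { int cores; };

// Stand-in for the OS decision: grants at most 4 cores regardless of the wish.
Claim invade(int wanted_cores) { return Claim{wanted_cores < 4 ? wanted_cores : 4}; }

// Run the work adapted to however many cores were actually granted.
int infect(const Claim& c, const std::function<int(int)>& work) {
    int result = 0;
    for (int core = 0; core < c.cores; ++core) result += work(core);
    return result;
}

void retreat(Claim& c) { c.cores = 0; }  // release the claimed resources
```

The key point of the pattern is visible even in this toy: the application never decides how many cores it gets; it only adapts its work to the claim it was granted.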
Practical Sparse Matrices in C++ with Hybrid Storage and Template-Based Expression Optimisation
Despite the importance of sparse matrices in numerous fields of science,
software implementations remain difficult to use for non-expert users,
generally requiring the understanding of underlying details of the chosen
sparse matrix storage format. In addition, to achieve good performance, several
formats may need to be used in one program, requiring explicit selection and
conversion between the formats. This can be both tedious and error-prone,
especially for non-expert users. Motivated by these issues, we present a
user-friendly and open-source sparse matrix class for the C++ language, with a
high-level application programming interface deliberately similar to the widely
used MATLAB language. This facilitates prototyping directly in C++ and aids the
conversion of research code into production environments. The class internally
uses two main approaches to achieve efficient execution: (i) a hybrid storage
framework, which automatically and seamlessly switches between three underlying
storage formats (compressed sparse column, Red-Black tree, coordinate list)
depending on which format is best suited and/or available for specific
operations, and (ii) a template-based meta-programming framework to
automatically detect and optimise execution of common expression patterns.
Empirical evaluations on large sparse matrices with various densities of
non-zero elements demonstrate the advantages of the hybrid storage framework
and the expression optimisation mechanism.
Comment: extended and revised version of an earlier conference paper.
arXiv:1805.0338
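The hybrid storage idea, cheap incremental insertion in one format and a one-time conversion to a column-oriented format before column-wise operations, can be sketched with standard C++ containers. This is an illustrative toy using COO and CSC only, not the paper's actual class (which also uses a Red-Black tree and handles duplicates and caching).

```cpp
#include <algorithm>
#include <cstddef>
#include <tuple>
#include <vector>

// Illustrative hybrid sparse matrix: insertions go into a coordinate list
// (COO); before column-wise queries the matrix is converted once to
// compressed sparse column (CSC) form, which makes per-column access cheap.
struct HybridSparse {
    std::size_t n_rows, n_cols;
    std::vector<std::tuple<std::size_t, std::size_t, double>> coo;  // (row, col, value)
    std::vector<double> val;            // CSC values
    std::vector<std::size_t> row_idx;   // CSC row indices
    std::vector<std::size_t> col_ptr;   // CSC column pointers
    bool csc_valid = false;

    HybridSparse(std::size_t r, std::size_t c) : n_rows(r), n_cols(c) {}

    void insert(std::size_t r, std::size_t c, double v) {
        coo.emplace_back(r, c, v);      // O(1) append; no re-packing per insert
        csc_valid = false;              // the CSC view is now stale
    }

    void to_csc() {                     // one-time conversion before reads
        std::sort(coo.begin(), coo.end(), [](auto& a, auto& b) {
            return std::tie(std::get<1>(a), std::get<0>(a)) <
                   std::tie(std::get<1>(b), std::get<0>(b));
        });
        val.clear(); row_idx.clear();
        col_ptr.assign(n_cols + 1, 0);
        for (auto& [r, c, v] : coo) {
            val.push_back(v);
            row_idx.push_back(r);
            ++col_ptr[c + 1];           // count entries per column
        }
        for (std::size_t c = 1; c <= n_cols; ++c) col_ptr[c] += col_ptr[c - 1];
        csc_valid = true;
    }

    std::size_t col_nnz(std::size_t c) {  // O(1) with CSC; O(nnz) scan with COO
        if (!csc_valid) to_csc();
        return col_ptr[c + 1] - col_ptr[c];
    }
};
```

The switch is transparent to the caller: `col_nnz` triggers the conversion lazily, mirroring the paper's "automatic and seamless" format selection.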
Exploring novel designs of NLP solvers: Architecture and Implementation of WORHP
Mathematical optimization in general, and nonlinear programming in particular, are applied in many fields, such as the automotive sector, the aerospace industry, and the space agencies. With some established NLP solvers having been available for decades, and with the mathematical community being rather conservative in this respect, many of their programming standards are severely outdated. It is safe to assume that such usability shortcomings impede the wider use of NLP methods; a representative example is the use of static workspaces by legacy FORTRAN codes. This dissertation gives an account of the construction of the European NLP solver WORHP, using and combining software standards and techniques that have not previously been applied to mathematical software to this extent. Examples include automatic code generation, a consistent reverse communication architecture, and the elimination of static workspaces. The result is a novel, industrial-grade NLP solver that overcomes many technical weaknesses of established NLP solvers and other mathematical software.
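Reverse communication, one of the techniques named above, inverts the usual callback design: the solver returns control to the caller whenever it needs a problem-specific evaluation, and the caller re-enters the solver afterwards. A minimal sketch of the control flow (a toy Newton iteration, not WORHP's actual interface):

```cpp
#include <cmath>

// Minimal reverse-communication loop: instead of taking function-pointer
// callbacks, the solver returns an Action telling the caller what to compute
// next; the caller fills in the requested value and calls step() again.
enum class Action { EvalF, EvalDF, Iterate, Done };

struct Solver {
    double x = 1.0;        // current iterate
    double f = 0.0;        // caller-supplied f(x)
    double df = 0.0;       // caller-supplied f'(x)
    int stage = 0;

    Action step() {        // one Newton step on f(x) = 0, driven in pieces
        switch (stage) {
            case 0: stage = 1; return Action::EvalF;    // caller must set f
            case 1: stage = 2; return Action::EvalDF;   // caller must set df
            case 2:
                if (std::fabs(f) < 1e-12) return Action::Done;
                x -= f / df;                            // Newton update
                stage = 0;
                return Action::Iterate;
        }
        return Action::Done;
    }
};
```

Because the solver never calls user code, the caller keeps full control over evaluation (parallelism, logging, checkpointing), which is a major reason legacy-free NLP solvers favour this architecture.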
Scaling Robot Motion Planning to Multi-core Processors and the Cloud
Imagine a world in which robots safely interoperate with humans, gracefully and efficiently accomplishing everyday tasks. The robot's motions for these tasks, constrained by the design of the robot and the task at hand, must avoid collisions with obstacles. Unfortunately, planning a constrained obstacle-free motion for a robot is computationally complex---often resulting in slow computation of inefficient motions. The methods in this dissertation speed up this motion plan computation with new algorithms and data structures that leverage readily available parallel processing, whether that processing power is on the robot or in the cloud, enabling robots to operate more safely, more gracefully, and with improved efficiency. The contributions of this dissertation that enable faster motion planning are novel parallel lock-free algorithms, fast and concurrent nearest neighbor searching data structures, cache-aware operation, and split robot-cloud computation. Parallel lock-free algorithms avoid contention over shared data structures, resulting in empirical speedup proportional to the number of CPU cores working on the problem. Fast nearest neighbor data structures speed up searching in SO(3) and SE(3) metric spaces, which are needed for rigid body motion planning. Concurrent nearest neighbor data structures improve searching performance on metric spaces common to robot motion planning problems, while providing asymptotically wait-free concurrent operation. Cache-aware operation avoids long memory access times, allowing the algorithm to exhibit superlinear speedup. Split robot-cloud computation enables robots with low-power CPUs to react to changing environments by having the robot compute reactive paths in real-time from a set of motion plan options generated in a computationally intensive cloud-based algorithm.
We demonstrate the scalability and effectiveness of our contributions in solving motion planning problems both in simulation and on physical robots of varying design and complexity. Problems include finding a solution to a complex motion planning problem, pre-computing motion plans that converge towards the optimal, and reactive interaction with dynamic environments. Robots include 2D holonomic robots, 3D rigid-body robots, a self-driving 1/10 scale car, articulated robot arms with and without mobile bases, and a small humanoid robot.
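The SO(3) metric underlying the nearest neighbor structures mentioned above can be made concrete: the distance between two unit quaternions is the rotation angle between them, computed so that q and -q (the same rotation) are at distance zero. A brute-force sketch (illustrative only; the dissertation's contribution is concurrent tree-based search, not a linear scan):

```cpp
#include <array>
#include <cmath>
#include <cstddef>
#include <vector>

// Distance on SO(3) between unit quaternions a and b: arccos(|<a,b>|).
// The absolute value folds the double cover, so q and -q are identified.
using Quat = std::array<double, 4>;  // (w, x, y, z), assumed unit-norm

double so3_distance(const Quat& a, const Quat& b) {
    double dot = 0.0;
    for (int i = 0; i < 4; ++i) dot += a[i] * b[i];
    dot = std::fabs(dot);            // q and -q represent the same rotation
    if (dot > 1.0) dot = 1.0;        // clamp floating-point rounding noise
    return std::acos(dot);
}

// Brute-force nearest neighbor under the SO(3) metric.
std::size_t nearest(const std::vector<Quat>& pts, const Quat& query) {
    std::size_t best = 0;
    double best_d = so3_distance(pts[0], query);
    for (std::size_t i = 1; i < pts.size(); ++i) {
        double d = so3_distance(pts[i], query);
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}
```

Tree-based structures accelerate exactly this query; getting the double-cover identification right is what makes SO(3) trickier than Euclidean space.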
Proceedings of the Third International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2016) Sofia, Bulgaria
Proceedings of: Third International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2016). Sofia (Bulgaria), October, 6-7, 2016
Scalability of preconditioners as a strategy for parallel computation of compressible fluid flow
Parallel implementations of a Newton-Krylov-Schwarz algorithm are used to solve a model problem representing low Mach number compressible fluid flow over a backward-facing step. The Mach number is specifically selected to result in a numerically "stiff" matrix problem, based on an implicit finite volume discretization of the compressible 2D Navier-Stokes/energy equations using primitive variables. Newton's method is used to linearize the discrete system, and a preconditioned Krylov projection technique is used to solve the resulting linear system. Domain decomposition enables the development of a global preconditioner via the parallel construction of contributions derived from subdomains. Formation of the global preconditioner is based upon additive and multiplicative Schwarz algorithms, with and without subdomain overlap. The degree of parallelism of this technique is further enhanced with the use of a matrix-free approximation for the Jacobian used in the Krylov technique (in this case, GMRES(k)). Of paramount interest to this study is the implementation and optimization of these techniques on parallel shared-memory hardware, namely the Cray C90 and SGI Challenge architectures. These architectures were chosen as representative and commonly available to researchers interested in the solution of problems of this type. The Newton-Krylov-Schwarz solution technique is increasingly being investigated for computational fluid dynamics (CFD) applications due to the advantages of full coupling of all variables and equations, rapid non-linear convergence, and moderate memory requirements. A parallel version of this method that scales effectively on the above architectures would be extremely attractive to practitioners, resulting in efficient, cost-effective, parallel solutions exhibiting the benefits of the solution technique.