Search CORE

2 research outputs found

Exascale Ready Work-Optimal Matrix Inversion

Author: Steffen Raoul
Publication venue: Karlsruher Institut für Technologie
Publication date: 01/01/2012
Field of study

In dieser Diplomarbeit entwickle ich einen neuen Algorithmus OPT zur Inversion von Matrizen. Ich beweise Eigenschaften zur parallelen Laufzeit und zum Arbeitsaufwand von OPT. OPT ist kombiniert aus Strassens Algorithmus zur Inversion von Matrizen und aus Newton Approximation und basiert auf einer Subroutine zur Matrixmultiplikation. OPT ist ein arbeitsoptimaler Algorithmus, d.h. er benötigt höchstens einen konstanten Faktor mehr Arbeit als jeder andere (arbeitsoptimale) Algorithmus. Außerdem benötigt OPT nur plylogarithmische Zeit auf höchstens O(n3) Prozessoren, wobei die Prozessorzahl von der Multiplikationsroutine bestimmt wird. Damit vereint er diese beiden Vorteile von Strassens Algorithmus und Newton Approximation. Ich beweise eine neue Abschätzung zur numerischen Stabilität von Strassens Algorithmus kombiniert mit Newton Approximation. Im Zuge der Diplomarbeit habe ich OPT, Strassens Inversionsalgorithmus und Newton Approximation zusammen mit einer Matrixcontainerklasse in einem flexiblen Testprogramm implementiert. Ich beschreibe das Design der Implementierung und die Verwendung und Schwierigkeiten von BLAS für die Matrixmultiplikationsroutine. Im experimentellen Teil vergleiche ich die Laufzeit und die numerische Stabilität von OPT mit der Routine aus der Intel Math Kernel Library (MKL). Die konstanten Faktoren des Arbeitsaufwands von OPT erweisen sich als nicht mehr als doppelt so hoch wie die der MKL-Routine. Wie vorhergesagt skaliert OPT sehr gut. Selbst auf einem Computer mit nur acht Kernen ist er bereits deutlich schneller als die MKL-Routine. Bezüglich der numerischen Stabilität werden OPT und Strassens Algorithmus dessen schlechtem Ruf nicht gerecht. Stattdessen erzeugen sie Ergebnisse vergleichbar mit denen der MKL-Routine. Ich entdecke eine unerwartete Instabilität von Newton Approximation wodurch sie schlechtere Ergebnisse erzeugt als alle anderen Algorithmen in der Implementierung. Zu dieser Instabilität präsentiere ich einige weitere Experimente

KITopen

Recommended from our members

Modern FPGA placement techniques with hardware acceleration

Author: Dhar Shounak
Publication venue
Publication date: 24/04/2021
Field of study

In deep sub-micron technology nodes, Application-Specific Integrated Circuits (ASICs) are becoming expensive to design and manufacture. For this reason, Field Programmable Gate Arrays (FPGAs), which are general purpose and flexible programmable hardware, are gaining more design wins in low volume and fast evolving applications. Modern FPGAs are becoming popular in high performance data analytics, search engines, autonomous cars, communication and networking applications. FPGAs are also accompanied with a complete Computer-Aided Design (CAD) toolchain, that is used to optimally map and fit the design applications or workloads onto the underlying target FPGA device. These design applications mapped onto the FPGA demand high maximum achievable clock frequency (Fmax) and low power consumption while maintaining a low compilation time, which is a major hindrance in widespread adoption of FPGAs. The focus of this Ph.D. dissertation is the placement problem for FPGAs, which takes a major portion of the FPGA CAD tool runtime. A new algorithm for spreading cells during FPGA global placement is proposed, which achieves better wirelength and routing congestion and takes less runtime than the algorithm used in the state-of-the-art academic FPGA placer. We also propose FPGA acceleration of various subsystems of an analytic global placement algorithm, including wirelength gradient computation and spreading, which achieves significant speedup over the multi-threaded CPU version. A new detailed placement algorithm is proposed, which offers better tradeoff between quality and runtime compared to existing methods. This algorithm is also accelerated on a GPU and an FPGA, achieving significant speedup over multi-threaded CPU implementation. Another detailed placement algorithm is also proposed which physically re-aligns timing critical paths and improves Fmax with minimal runtime overhead. Both of these algorithms for detailed placement have shown good results on industrial benchmarks and have been integrated into an industrial FPGA CAD tool flowElectrical and Computer Engineerin

Texas ScholarWorks