18 research outputs found
Massiv-Parallele Algorithmen zum Laden von Daten auf Moderner Hardware
While systems face an ever-growing amount of data that needs to be ingested, queried and analysed, processors are seeing only moderate improvements in sequential processing performance. This thesis addresses the fundamental shift towards increasingly parallel processors and contributes multiple massively parallel algorithms to accelerate different stages of the ingestion pipeline, such as data parsing and sorting.Systeme sehen sich mit einer stetig anwachsenden Menge an Daten konfrontiert, die geladen und analysiert, sowie Anfragen darauf bearbeitet werden müssen. Gleichzeitig nimmt die sequentielle Verarbeitungsgeschwindigkeit von Prozessoren nur noch moderat zu. Diese Arbeit adressiert den Wandel hin zu zunehmend parallelen Prozessoren und leistet mit mehreren massiv-parallelen Algorithmen einen Beitrag um unterschiedliche Phasen der Datenverarbeitung wie zum Beispiel Parsing und Sortierung zu beschleunigen
Recommended from our members
Compiling Irregular Software to Specialized Hardware
High-level synthesis (HLS) has simplified the design process for energy-efficient hardware accelerators: a designer specifies an accelerator’s behavior in a “high-level” language, and a toolchain synthesizes register-transfer level (RTL) code from this specification. Many HLS systems produce efficient hardware designs for regular algorithms (i.e., those with limited conditionals or regular memory access patterns), but most struggle with irregular algorithms that rely on dynamic, data-dependent memory access patterns (e.g., traversing pointer-based structures like lists, trees, or graphs). HLS tools typically provide imperative, side-effectful languages to the designer, which makes it difficult to correctly specify and optimize complex, memory-bound applications.
In this dissertation, I present an alternative HLS methodology that leverages properties of functional languages to synthesize hardware for irregular algorithms. The main contribution is an optimizing compiler that translates pure functional programs into modular, parallel dataflow networks in hardware. I give an overview of this compiler, explain how its source and target together enable parallelism in the face of irregularity, and present two specific optimizations that further exploit this parallelism. Taken together, this dissertation verifies my thesis that pure functional programs exhibiting irregular memory access patterns can be compiled into specialized hardware and optimized for parallelism.
This work extends the scope of modern HLS toolchains. By relying on properties of pure functional languages, our compiler can synthesize hardware from programs containing constructs that commercial HLS tools prohibit, e.g., recursive functions and dynamic memory allocation. Hardware designers may thus use our compiler in conjunction with existing HLS systems to accelerate a wider class of algorithms than before
Generating renderers
Most production renderers developed for the film industry are huge pieces of software that are able to render extremely complex scenes. Unfortunately, they are implemented using the currently available programming models that are not well suited to modern computing hardware like CPUs with vector units or GPUs. Thus, they have to deal with the added complexity of expressing parallelism and using hardware features in those models. Since compilers cannot alone optimize and generate efficient programs for any type of hardware, because of the large optimization spaces and the complexity of the underlying compiler problems, programmers have to rely on compiler-specific hardware intrinsics or write non-portable code. The consequence of these limitations is that programmers resort to writing the same code twice when they need to port their algorithm on a different architecture, and that the code itself becomes difficult to maintain, as algorithmic details are buried under hardware details. Thankfully, there are solutions to this problem, taking the form of Domain-Specific Lan- guages. As their name suggests, these languages are tailored for one domain, and compilers can therefore use domain-specific knowledge to optimize algorithms and choose the best execution policy for a given target hardware. In this thesis, we opt for another way of encoding domain- specific knowledge: We implement a generic, high-level, and declarative rendering and traversal library in a functional language, and later refine it for a target machine by providing partial evaluation annotations. The partial evaluator then specializes the entire renderer according to the available knowledge of the scene: Shaders are specialized when their inputs are known, and in general, all redundant computations are eliminated. Our results show that the generated renderers are faster and more portable than renderers written with state-of-the-art competing libraries, and that in comparison, our rendering library requires less implementation effort.Die meisten in der Filmindustrie zum Einsatz kommenden Renderer sind riesige Softwaresysteme, die in der Lage sind, extrem aufwendige Szenen zu rendern. Leider sind diese mit den aktuell verfügbaren Programmiermodellen implementiert, welche nicht gut geeignet sind für moderne Rechenhardware wie CPUs mit Vektoreinheiten oder GPUs. Deshalb müssen Entwickler sich mit der zusätzlichen Komplexität auseinandersetzen, Parallelismus und Hardwarefunktionen in diesen Programmiermodellen auszudrücken. Da Compiler nicht selbständig optimieren und effiziente Programme für jeglichen Typ Hardware generieren können, wegen des großen Optimierungsraumes und der Komplexität des unterliegenden Kompilierungsproblems, müssen Programmierer auf Compiler-spezifische Hardware-“Intrinsics” zurückgreifen, oder nicht portierbaren Code schreiben. Die Konsequenzen dieser Limitierungen sind, dass Programmierer darauf zurückgreifen den gleichen Code zweimal zu schreiben, wenn sie ihre Algorithmen für eine andere Architektur portieren müssen, und dass der Code selbst schwer zu warten wird, da algorithmische Details unter Hardwaredetails verloren gehen. Glücklicherweise gibt es Lösungen für dieses Problem, in der Form von DSLs. Diese Sprachen sind maßgeschneidert für eine Domäne und Compiler können deshalb Domänenspezifisches Wissen nutzen, um Algorithmen zu optimieren und die beste Ausführungsstrategie für eine gegebene Zielhardware zu wählen. In dieser Dissertation wählen wir einen anderen Weg, Domänenspezifisches Wissen zu enkodieren: Wir implementieren eine generische, high-level und deklarative Rendering- und Traversierungsbibliothek in einer funktionalen Programmiersprache, und verfeinern sie später für eine Zielmaschine durch Bereitstellung von Annotationen für die partielle Auswertung. Der “Partial Evaluator” spezialisiert dann den kompletten Renderer, basierend auf dem verfügbaren Wissen über die Szene: Shader werden spezialisiert, wenn ihre Eingaben bekannt sind, und generell werden alle redundanten Berechnungen eliminiert. Unsere Ergebnisse zeigen, dass die generierten Renderer schneller und portierbarer sind, als Renderer geschrieben mit den aktuellen Techniken konkurrierender Bibliotheken und dass, im Vergleich, unsere Rendering Bibliothek weniger Implementierungsaufwand erfordert.This work was supported by the Federal Ministry of Education and Research (BMBF) as part of the Metacca and ProThOS projects as well as by the Intel Visual Computing Institute (IVCI) and Cluster of Excellence on Multimodal Computing and Interaction (MMCI) at Saarland University. Parts of it were also co-funded by the European Union(EU), as part of the Dreamspace project
Switching techniques for broadband ISDN
The properties of switching techniques suitable for use in broadband networks have been investigated. Methods for evaluating the performance of such switches have been reviewed. A notation has been introduced to describe a class of binary self-routing networks. Hence a technique has been developed for determining the nature of the equivalence between two networks drawn from this class. The necessary and sufficient condition for two packets not to collide in a binary self-routing network has been obtained. This has been used to prove the non-blocking property of the Batcher-banyan switch. A condition for a three-stage network with channel grouping and link speed-up to be nonblocking has been obtained, of which previous conditions are special cases.
A new three-stage switch architecture has been proposed, based upon a novel cell-level algorithm for path allocation in the intermediate stage of the switch. The algorithm is suited to hardware implementation using parallelism to achieve a very short execution time. An array of processors is required to implement the algorithm The processor has been shown to be of simple design. It must be initialised with a count representing the number of cells requesting a given output module. A fast method has been described for performing the request counting using a non-blocking binary self-routing network. Hardware is also required to forward routing tags from the processors to the appropriate data cells, when they have been allocated a path through the intermediate stage. A method of distributing these routing tags by means of a non-blocking copy network has been presented.
The performance of the new path allocation algorithm has been determined by simulation. The rate of cell loss can increase substantially in a three-stage switch when the output modules are non-uniformly loaded. It has been shown that the appropriate use of channel grouping in the intermediate stage of the switch can reduce the effect of non-uniform loading on performance
Doctor of Philosophy
dissertationStochastic methods, dense free-form mapping, atlas construction, and total variation are examples of advanced image processing techniques which are robust but computationally demanding. These algorithms often require a large amount of computational power as well as massive memory bandwidth. These requirements used to be ful lled only by supercomputers. The development of heterogeneous parallel subsystems and computation-specialized devices such as Graphic Processing Units (GPUs) has brought the requisite power to commodity hardware, opening up opportunities for scientists to experiment and evaluate the in uence of these techniques on their research and practical applications. However, harnessing the processing power from modern hardware is challenging. The di fferences between multicore parallel processing systems and conventional models are signi ficant, often requiring algorithms and data structures to be redesigned signi ficantly for efficiency. It also demands in-depth knowledge about modern hardware architectures to optimize these implementations, sometimes on a per-architecture basis. The goal of this dissertation is to introduce a solution for this problem based on a 3D image processing framework, using high performance APIs at the core level to utilize parallel processing power of the GPUs. The design of the framework facilitates an efficient application development process, which does not require scientists to have extensive knowledge about GPU systems, and encourages them to harness this power to solve their computationally challenging problems. To present the development of this framework, four main problems are described, and the solutions are discussed and evaluated: (1) essential components of a general 3D image processing library: data structures and algorithms, as well as how to implement these building blocks on the GPU architecture for optimal performance; (2) an implementation of unbiased atlas construction algorithms|an illustration of how to solve a highly complex and computationally expensive algorithm using this framework; (3) an extension of the framework to account for geometry descriptors to solve registration challenges with large scale shape changes and high intensity-contrast di fferences; and (4) an out-of-core streaming model, which enables developers to implement multi-image processing techniques on commodity hardware
Toward Optimizing Distributed Programs Directed by Configurations
Networks of workstations are now viable environments for running distributed
and parallel applications. Recent advances in software interconnection
technology enables programmers to prepare applications to run in dynamically
changing environments because module interconnection activity is regarded as
an essentially distinct and different intellectual activity so as isolated
from that of implementing individual modules. But there remains the question
of how to optimize the performance of those applications for a given execution
environment: how can developers realize performance gains without paying a high
programming cost to specialize their application for the target environment?
Interconnection technology has allowed programmers to tailor and tune their
applications on distributed environments, but the traditional approach to this
process has ignored the performance issue over gracefully seemless integration
of various software components
Path switching over multirate Benes network.
Mui Sze Wai.Thesis (M.Phil.)--Chinese University of Hong Kong, 2003.Includes bibliographical references (leaves 62-65).Abstracts in English and Chinese.Chapter 1. --- Introduction --- p.1Chapter 1.1 --- Evolution of Multirate Networks --- p.2Chapter 1.2 --- Some Results from Previous Work --- p.2Chapter 1.3 --- Multirate Traffic on Benes Network --- p.5Chapter 1.4 --- Organization --- p.7Chapter 2. --- Background Knowledge on Benes Network and Path Switching --- p.8Chapter 2.1 --- Benes Network --- p.9Chapter 2.1.1 --- Construction of Large Switching Fabrics --- p.9Chapter 2.1.2 --- Routing in Benes Network --- p.11Chapter 2.1.3 --- Performance when Operated as a Large Switch Fabric --- p.13Chapter 2.2 --- Path Switching --- p.14Chapter 2.2.1 --- Basic Concept of Path Switching --- p.14Chapter 2.2.2 --- Capacity Allocation and Route Assignment --- p.15Chapter 3. --- Path Switching over Benes Network --- p.20Chapter 3.1 --- The Model of path-switched Benes Network --- p.21Chapter 3.2 --- Module-to-Module Implementation --- p.21Chapter 3.2.1 --- The First Stage (Input Module) --- p.22Chapter 3.2.2 --- The Middle Stage (Central Module) --- p.23Chapter 3.2.3 --- The Last Stage (Output Module) --- p.24Chapter 3.3 --- Port-to-Port Implementation --- p.24Chapter 3.3.1 --- Uniform Traffic --- p.25Chapter 3.3.2 --- Mult irate Traffic --- p.26Chapter 3.4 --- Closing remarks --- p.29Chapter 4. --- Performance Analysis --- p.31Chapter 4.1 --- Traffic Constraints and Perform- ance Guarantees --- p.32Chapter 4.1.1 --- Arrival Curve and Service Curve --- p.33Chapter 4.1.2 --- Delay Bound and Backlog Bound --- p.36Chapter 4.2 --- Service Guarantees --- p.39Chapter 4.3 --- Deterministic Bounds --- p.42Chapter 4.3.1 --- Delay --- p.42Chapter 4.3.2 --- Backlog at Input Module --- p.44Chapter 4.3.3 --- Backlog at Output Module --- p.47Chapter 5. --- Simulation Results --- p.52Chapter 5.1 --- Uniform Traffic --- p.53Chapter 5.2 --- Multirate Traffic --- p.55Chapter 6. --- Conclusions and Future Research --- p.59Chapter 6.1 --- Suggestions for future research --- p.61Bibliography --- p.6
Recommended from our members
Performance-efficient mechanisms for managing irregularity in throughput processors
textRecent graphics processing units (GPUs) have emerged as a promising platform for general purpose computing and have been shown to be very efficient in executing parallel applications with regular control and memory access behavior. Current GPU architectures primarily adopt the single-instruction multiple-thread (SIMT) programming model that balances programmability and hardware efficiency. With SIMT, the programmer writes application code to be executed by scalar threads and each thread is supported with conditional branch and fine-grained load/store instruction for ease of programming. At the same time, the hardware and software collaboratively enable the grouping of scalar threads to be executed in a vectorized single-instruction multiple-data (SIMD) in-order pipeline, simplifying hardware design. As GPUs gain momentum in being utilized in various application domains, these throughput processors will increasingly demand more efficient execution of irregular applications. Current GPUs, however, suffer from reduced thread-level parallelism, underutilization of compute resources, inefficient on-chip caching, and waste in off-chip memory bandwidth utilization for highly irregular programs with divergent control and memory accesses. In this dissertation, I develop techniques that enable simple, robust, and highly effective performance optimizations for SIMT-based throughput processor architectures such that they can better manage irregularity. I first identify that previously suggested optimizations to the divergent control flow problem suffers from the following limitations: 1) serialized execution of diverging paths, 2) lack of robustness across regular/irregular codes, and 3) limited applicability. Based on such observations, I propose and evaluate three novel mechanisms that resolve the aforementioned issues, providing significant performance improvements while minimizing implementation overhead. In the second half of the dissertation, I observe that conventional coarse-grained memory hierarchy designs do not take into account the massively multi-threaded nature of GPUs, which leads to substantial waste in off-chip memory bandwidth utilization. I design and evaluate a locality-aware memory hierarchy for throughput processors, which retains the advantages of coarse-grained accesses for spatially and temporally local programs while permitting selective fine-grained access to memory. By adaptively adjusting the access granularity, memory bandwidth and energy consumption are reduced for data with low spatial/temporal locality without wasting control overheads or prefetching potential for data with high spatial locality.Electrical and Computer Engineerin
Procesamiento paralelo : Balance de carga dinámico en algoritmo de sorting
Algunas técnicas de sorting intentan balancear la carga mediante un muestreo inicial de los datos a ordenar y una distribución de los mismos de acuerdo a pivots. Otras redistribuyen listas parcialmente ordenadas de modo que cada procesador almacene un número aproximadamente igual de claves, y todos tomen parte del proceso de merge durante la ejecución. Esta Tesis presenta un nuevo método que balancea dinámicamente la carga basado en un enfoque diferente, buscando realizar una distribución del trabajo utilizando un estimador que permita predecir la carga de trabajo pendiente.
El método propuesto es una variante de Sorting by Merging Paralelo, esto es, una técnica basada en comparación. Las ordenaciones en los bloques se realizan mediante el método de Burbuja o Bubble Sort con centinela. En este caso, el trabajo a realizar -en términos de comparaciones e intercambios- se encuentra afectada por el grado de desorden de los datos. Se estudió la evolución de la cantidad de trabajo en cada iteración del algoritmo para diferentes tipos de secuencias de entrada, n datos con valores de a n sin repetición, datos al azar con distribución normal, observándose que el trabajo disminuye en cada iteración. Esto se utilizó para obtener una estimación del trabajo restante esperado a partir de una iteración determinada, y basarse en el mismo para corregir la distribución de la carga.
Con esta idea, el métoEs revisado por: http://sedici.unlp.edu.ar/handle/10915/9500Facultad de Ciencias Exacta