4 research outputs found

    Generating renderers

    Most production renderers developed for the film industry are huge pieces of software, capable of rendering extremely complex scenes. Unfortunately, they are implemented using the currently available programming models, which are not well suited to modern computing hardware such as CPUs with vector units or GPUs, so they must also carry the added complexity of expressing parallelism and exploiting hardware features within those models. Because of the large optimization spaces and the complexity of the underlying compiler problems, compilers alone cannot optimize and generate efficient programs for every type of hardware, and programmers therefore have to rely on compiler-specific hardware intrinsics or write non-portable code. The consequence of these limitations is that programmers resort to writing the same code twice when they need to port an algorithm to a different architecture, and that the code itself becomes difficult to maintain, as algorithmic details are buried under hardware details. Thankfully, there are solutions to this problem, in the form of Domain-Specific Languages (DSLs). As their name suggests, these languages are tailored to one domain, and their compilers can therefore use domain-specific knowledge to optimize algorithms and choose the best execution policy for a given target hardware. In this thesis, we opt for another way of encoding domain-specific knowledge: we implement a generic, high-level, and declarative rendering and traversal library in a functional language, and later refine it for a target machine by providing partial evaluation annotations. The partial evaluator then specializes the entire renderer according to the available knowledge of the scene: shaders are specialized when their inputs are known, and, in general, all redundant computations are eliminated. Our results show that the generated renderers are faster and more portable than renderers written with state-of-the-art competing libraries, and that, in comparison, our rendering library requires less implementation effort.

    This work was supported by the Federal Ministry of Education and Research (BMBF) as part of the Metacca and ProThOS projects, as well as by the Intel Visual Computing Institute (IVCI) and the Cluster of Excellence on Multimodal Computing and Interaction (MMCI) at Saarland University. Parts of it were also co-funded by the European Union (EU) as part of the Dreamspace project.
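
    The abstract does not show the partial-evaluation annotations themselves, and the thesis's library is written in a functional language rather than Python. Purely as a conceptual sketch, the hypothetical Python fragment below mimics the effect described: computations that depend only on compile-time scene knowledge are folded away, leaving a cheaper residual shader that takes only the runtime inputs.

        # Hypothetical toy model of shader specialization (not the thesis's
        # actual annotation syntax). A partial evaluator folds away every
        # computation that depends only on compile-time values; here a
        # hand-written closure plays that role.

        def specialize_shade(base_color, blend):
            # base_color and blend come from the scene description, so they
            # are known when the renderer is generated: precompute the
            # blended color once, as the partial evaluator would.
            pre = tuple(blend * c + (1.0 - blend) * 1.0 for c in base_color)

            def shade(light_intensity):
                # Only the runtime input survives in the residual shader.
                return tuple(light_intensity * p for p in pre)

            return shade

        # Specializing for a matte red material removes the blend arithmetic
        # from the per-ray code path entirely.
        matte_red = specialize_shade((1.0, 0.0, 0.0), blend=0.8)
        print(matte_red(0.5))  # -> (0.5, 0.1, 0.1)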

    Parallelization of the training of neural networks in heterogeneous systems

    In recent times, artificial neural networks have seen a significant increase in use, due to their flexibility and adaptability to a myriad of tasks. However, for a neural network to work correctly, a training period is necessary, during which the model learns to identify patterns in the input data it receives. This process is long and computationally expensive, owing to the complexity of its operations and the sheer amount of training data required. To mitigate this, techniques have appeared over the years that attempt to reduce training time, complexity, and energy consumption. One of the most common is the use of dedicated accelerators, such as GPUs, instead of conventional processors, because of their higher speed and energy efficiency on certain tasks relative to general-purpose processors. However, due to the rising complexity of neural networks and of the problems they attempt to solve, a single accelerator has become insufficient. This makes it necessary to use several accelerators in parallel, distributing the workload among them so as to optimize performance. A great variety of parallelization techniques exist today, and almost every ML framework implements its own distribution strategies, so knowing which is best for each situation, and what its effects are, has become very difficult. In this project, a benchmark is proposed to study the impact of these techniques, recording several metrics such as training time and model accuracy. To that end, several neural-network frameworks and their parallelization strategies are evaluated, and one of them, PyTorch, is chosen for the rest of the project. In this framework, a benchmark is built that trains a ResNet-34 model on an image classification dataset, measuring variables such as end-to-end training time, the evolution of the model's accuracy over time, and the final accuracy obtained. To gain more insight into these metrics, various experiments have been designed and conducted around this benchmark, each in a different execution environment: using only the CPU, using one GPU, using multiple GPUs in parallel, and so on; documenting not only the benchmark's output but also its energy consumption, in order to evaluate energy efficiency. A hybrid parallelization model is also proposed, in which the available GPUs are used in conjunction with the CPU to train the network, giving each device a copy of the model and a subset of the training data, and its effectiveness and viability are evaluated afterwards. The results of these experiments are very positive: the scalability of the distributed model is almost linear with respect to the parallelized part, and energy consumption does not increase significantly as a result of parallelization, making the energy efficiency of this paradigm almost double that of non-distributed training.
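
    The abstract names PyTorch, ResNet-34, and a one-replica-per-device distribution of the model and data, but not the benchmark code itself. The sketch below shows a conventional DistributedDataParallel setup of that shape; the dataset (CIFAR-10), the hyperparameters, and the single-node launch command are illustrative assumptions rather than details from the thesis.

        # Minimal single-node data-parallel training sketch (assumed setup,
        # not the thesis benchmark). Launch with:
        #   torchrun --nproc_per_node=<num_gpus> train.py

        import torch
        import torch.distributed as dist
        import torchvision
        from torch.nn.parallel import DistributedDataParallel as DDP
        from torch.utils.data import DataLoader
        from torch.utils.data.distributed import DistributedSampler

        def main():
            dist.init_process_group("nccl")   # one process per GPU
            rank = dist.get_rank()            # == local GPU index on one node
            torch.cuda.set_device(rank)

            model = torchvision.models.resnet34(num_classes=10).cuda(rank)
            model = DDP(model, device_ids=[rank])  # one replica per device

            dataset = torchvision.datasets.CIFAR10(
                "data", train=True, download=True,
                transform=torchvision.transforms.ToTensor())
            sampler = DistributedSampler(dataset)  # disjoint shard per replica
            loader = DataLoader(dataset, batch_size=128, sampler=sampler)

            optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
            loss_fn = torch.nn.CrossEntropyLoss()

            for epoch in range(10):
                sampler.set_epoch(epoch)      # reshuffle shards every epoch
                for images, labels in loader:
                    images, labels = images.cuda(rank), labels.cuda(rank)
                    optimizer.zero_grad()
                    # DDP all-reduces (averages) gradients during backward.
                    loss_fn(model(images), labels).backward()
                    optimizer.step()

            dist.destroy_process_group()

        if __name__ == "__main__":
            main()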

    A domain-extensible compiler with controllable automation of optimisations

    In high-performance domains like image processing, physics simulation, or machine learning, program performance is critical. Programmers called performance engineers are responsible for the challenging task of optimising programs. Two major challenges prevent modern compilers targeting heterogeneous architectures from reliably automating optimisation. First, domain-specific compilers such as Halide for image processing and TVM for machine learning are difficult to extend with the new optimisations required by new algorithms and hardware. Second, automatic optimisation is often unable to achieve the required performance, and performance engineers often fall back to painstaking manual optimisation. This thesis shows the potential of the Shine compiler to achieve domain-extensibility and controllable automation, and to generate high-performance code. Domain-extensibility facilitates adapting compilers to new algorithms and hardware. Controllable automation enables performance engineers to gradually take control of the optimisation process. The first research contribution is to add 3 code generation features to Shine, namely synchronisation barrier insertion, kernel execution, and storage folding. Adding these features requires making novel design choices in terms of compiler extensibility and controllability. The rest of this thesis builds on these features to generate code with competitive runtime compared to established domain-specific compilers. The second research contribution is to demonstrate how extensibility and controllability are exploited to optimise a standard image processing pipeline for corner detection. Shine applies 6 well-known image processing optimisations, 2 of which are not supported by Halide. Our results on 4 ARM multi-core CPUs show that the code generated by Shine for corner detection runs up to 1.4× faster than the Halide code. However, we observe that controlling rewriting is tedious, motivating the need for more automation. The final research contribution is to introduce sketch-guided equality saturation, a semi-automated technique that allows performance engineers to guide program rewriting by specifying rewrite goals as sketches: program patterns that leave details unspecified. We evaluate this approach by applying 7 realistic optimisations of matrix multiplication. Without guidance, the compiler fails to apply the 5 most complex optimisations, even given an hour and 60GB of RAM. With the guidance of at most 3 sketch guides, each 10 times smaller than the complete program, the compiler applies the optimisations in seconds using less than 1GB of RAM.
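
    Shine's rewrite system and its sketch language are not shown in the abstract. As a loose, hypothetical analogue in Python, the toy below explores made-up arithmetic rewrite rules breadth-first (a much weaker stand-in for equality saturation) and stops when a term matches a sketch, i.e. a goal pattern whose "?" holes leave details unspecified.

        # Toy "sketch-guided" rewriting over arithmetic terms (illustrative
        # only; Shine's engine and sketches are far richer).
        from collections import deque

        # Terms are ("op", lhs, rhs) tuples; leaves are variable names.
        RULES = [
            # (a + b) * c  ->  a*c + b*c   (distribute)
            (lambda e: e[0] == "*" and isinstance(e[1], tuple) and e[1][0] == "+",
             lambda e: ("+", ("*", e[1][1], e[2]), ("*", e[1][2], e[2]))),
            # a * b  ->  b * a             (commute)
            (lambda e: e[0] == "*",
             lambda e: ("*", e[2], e[1])),
        ]

        def rewrites(expr):
            # Yield every one-step rewrite, at the root and in subterms.
            if not isinstance(expr, tuple):
                return
            for guard, rule in RULES:
                if guard(expr):
                    yield rule(expr)
            for i, sub in enumerate(expr[1:], start=1):
                for new_sub in rewrites(sub):
                    yield expr[:i] + (new_sub,) + expr[i + 1:]

        def matches(expr, sketch):
            # A sketch matches a term up to "?" holes, which match anything.
            if sketch == "?":
                return True
            if not isinstance(sketch, tuple) or not isinstance(expr, tuple):
                return expr == sketch
            return len(expr) == len(sketch) and all(
                matches(e, s) for e, s in zip(expr, sketch))

        def guide(expr, sketch, limit=10_000):
            # Breadth-first search for any term that matches the sketch.
            seen, queue = {expr}, deque([expr])
            while queue and len(seen) < limit:
                term = queue.popleft()
                if matches(term, sketch):
                    return term
                for nxt in rewrites(term):
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
            return None  # sketch not reachable within the budget

        start = ("*", ("+", "a", "b"), "c")
        goal = ("+", "?", "?")     # "end up as a sum", details unspecified
        print(guide(start, goal))  # ('+', ('*', 'a', 'c'), ('*', 'b', 'c'))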