130 research outputs found
Software Porting of a 3D Reconstruction Algorithm to Razorcam Embedded System on Chip
A method is presented to calculate depth information for a UAV navigation system from keypoints in two consecutive image frames, using a monocular camera sensor as input and the OpenCV library. The method was first implemented in software and run on a general-purpose Intel CPU, then ported to the RazorCam Embedded Smart-Camera System and run on an ARM CPU onboard the Xilinx Zynq-7000. The results of performance and accuracy testing of the software implementation are then shown and analyzed, demonstrating a successful port of the software to the RazorCam embedded system on chip that could potentially be used onboard a UAV with tight constraints of size, weight, and power. The potential impacts will be seen through the continuation of this research in the Smart ES lab at the University of Arkansas.
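For readers unfamiliar with this kind of pipeline, the following is a minimal sketch of how a two-frame, keypoint-based depth estimate can be assembled from standard OpenCV calls. The intrinsic matrix, file names, and parameter values are illustrative assumptions; this is not the thesis's actual implementation.

```cpp
// Sketch: depth (up to scale) from keypoints in two consecutive
// monocular frames. Assumes calibrated intrinsics K (hypothetical values).
#include <opencv2/opencv.hpp>
#include <vector>

int main() {
    cv::Mat frame0 = cv::imread("frame0.png", cv::IMREAD_GRAYSCALE);
    cv::Mat frame1 = cv::imread("frame1.png", cv::IMREAD_GRAYSCALE);
    cv::Mat K = (cv::Mat_<double>(3, 3) << 700, 0, 320,
                                           0, 700, 240,
                                           0,   0,   1); // assumed intrinsics

    // 1. Detect and describe keypoints in both frames.
    auto orb = cv::ORB::create(2000);
    std::vector<cv::KeyPoint> kp0, kp1;
    cv::Mat d0, d1;
    orb->detectAndCompute(frame0, cv::noArray(), kp0, d0);
    orb->detectAndCompute(frame1, cv::noArray(), kp1, d1);

    // 2. Match descriptors between the two frames.
    cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
    std::vector<cv::DMatch> matches;
    matcher.match(d0, d1, matches);

    std::vector<cv::Point2f> p0, p1;
    for (const auto& m : matches) {
        p0.push_back(kp0[m.queryIdx].pt);
        p1.push_back(kp1[m.trainIdx].pt);
    }

    // 3. Recover relative camera motion from the essential matrix.
    cv::Mat mask, R, t;
    cv::Mat E = cv::findEssentialMat(p0, p1, K, cv::RANSAC, 0.999, 1.0, mask);
    cv::recoverPose(E, p0, p1, K, R, t, mask);

    // 4. Triangulate matched keypoints into 3D points.
    cv::Mat P0 = K * cv::Mat::eye(3, 4, CV_64F);
    cv::Mat Rt, pts4d;
    cv::hconcat(R, t, Rt);
    cv::Mat P1 = K * Rt;
    cv::triangulatePoints(P0, P1, p0, p1, pts4d);
    // Each column of pts4d is a homogeneous 3D point; Z/W gives its depth.
    return 0;
}
```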
Online Modeling and Tuning of Parallel Stream Processing Systems
Writing performant computer programs is hard. Code for high-performance applications is profiled, tweaked, and re-factored for months, specifically for the hardware on which it is to run. Consumer application code doesn't get the benefit of this endless massaging, even though heterogeneous processor environments are beginning to resemble those in more performance-oriented arenas. This thesis offers a path to performant, parallel code (through stream processing) which is tuned online and automatically adapts to the environment it is given. This approach has the potential to reduce the tuning costs associated with high-performance code and brings the benefit of performance tuning to consumer applications where it would otherwise be cost-prohibitive. This thesis introduces a stream processing library and multiple techniques to enable its online modeling and tuning.

Stream processing (also termed data-flow programming) is a compute paradigm that views an application as a set of logical kernels connected via communications links, or streams. Stream processing is increasingly used by computational-x and x-informatics fields (e.g., biology, astrophysics) where the focus is on safe and fast parallelization of specific big-data applications. A major advantage of stream processing is that it enables parallelization without requiring manual end-user management of the non-deterministic behavior often characteristic of more traditional parallel processing methods. Many big-data and high-performance applications involve high-throughput processing, necessitating the use of many parallel compute kernels on several compute cores. Optimizing the orchestration of kernels has been the focus of much theoretical and empirical modeling work.

Purely theoretical parallel programming models can fail when the assumptions implicit within the model are mismatched with reality (i.e., the model is incorrectly applied). Often it is unclear whether the assumptions are actually being met, even when verified under controlled conditions. Full empirical optimization solves this problem by extensively searching the range of likely configurations under native operating conditions; this, however, is expensive in both time and energy. For large, massively parallel systems, even deciding which modeling paradigm to use is often prohibitively expensive, and the answer is transient with workload and hardware. In an ideal world, a parallel run-time would re-optimize an application continuously to match its environment, with little additional overhead. This work presents methods aimed at doing just that, through low-overhead instrumentation, modeling, and optimization. Online optimization provides a good trade-off between static optimization and online heuristics; to enable it, modeling decisions must be fast and relatively accurate.

Online modeling and optimization of a stream processing system first requires a stream processing framework that is amenable to the intended type of dynamic manipulation. To fill this void, we developed the RaftLib C++ template library, which enables use of the stream processing paradigm in C++ applications (it is the run-time that forms the basis of almost all the work within this dissertation). The application topology is specified by the user; almost everything else is optimizable by the run-time. RaftLib takes advantage of the knowledge gained during the design of several prior streaming languages (notably Auto-Pipe).
The resultant framework enables online task migration, auto-parallelization, online buffer reallocation, and other useful dynamic behaviors that were not available in many previous stream processing systems. Several benchmark applications have been designed to assess the performance gains of our approaches and to compare performance against other leading stream processing frameworks. Information is essential to any modeling task; to that end, a low-overhead instrumentation framework has been developed which is both dynamic and adaptive. Discovering a fast and relatively optimal configuration for a stream processing application often necessitates solving for buffer sizes within a finite-capacity queueing network; we show that a generalized gain/loss network flow model can bootstrap this process under certain conditions. Any modeling effort requires that a model be selected, often a highly manual task involving many expensive operations. This dissertation demonstrates that machine learning methods (such as a support vector machine) can successfully select models at run-time for a streaming application. The full set of approaches is incorporated into the open-source RaftLib framework.
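RaftLib's own interfaces are not reproduced in this listing. As a hedged illustration of the paradigm only, the following plain C++ sketch (all names hypothetical, not RaftLib's API) shows the core idea: kernels as independent units of compute connected by a bounded stream, where the buffer capacity is exactly the kind of parameter the online modeling described above would tune.

```cpp
// Paradigm sketch only: one "kernel" per thread, connected by a stream
// (a bounded thread-safe queue). This is NOT RaftLib's API.
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

template <typename T>
class Stream {  // bounded FIFO link between two kernels
public:
    explicit Stream(size_t cap) : cap_(cap) {}
    void push(T v) {
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < cap_ || closed_; });
        q_.push(std::move(v));
        not_empty_.notify_one();
    }
    std::optional<T> pop() {  // empty optional => upstream finished
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop();
        not_full_.notify_one();
        return v;
    }
    void close() {  // called by the producer when it is done
        std::lock_guard<std::mutex> lk(m_);
        closed_ = true;
        not_empty_.notify_all();
    }
private:
    std::queue<T> q_;
    size_t cap_;
    bool closed_ = false;
    std::mutex m_;
    std::condition_variable not_empty_, not_full_;
};

int main() {
    Stream<int> link(8);        // capacity: what an online tuner would adjust
    std::thread producer([&] {  // source kernel
        for (int i = 0; i < 16; ++i) link.push(i * i);
        link.close();
    });
    std::thread consumer([&] {  // sink kernel
        while (auto v = link.pop()) std::cout << *v << '\n';
    });
    producer.join();
    consumer.join();
    return 0;
}
```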
Media gateway using a GPU
Master's in Computer and Telematics Engineering
Towards an Accurate Description of Strongly Correlated Chemical Systems with Phaseless Auxiliary-Field Quantum Monte Carlo - Methodological Advances and Applications
The exact and phaseless variants of auxiliary-field quantum Monte Carlo (AFQMC) have been shown to be capable of producing accurate ground-state energies for a wide variety of systems, including those which exhibit substantial electron correlation effects. The first chapter of this thesis provides an overview of the relevant electronic structure problem and the phaseless AFQMC (ph-AFQMC) methodology.
The computational cost of performing these calculations has to date been relatively high, impeding many important applications of these approaches. In Chapter 2 we present a correlated sampling methodology for AFQMC which relies on error cancellation to dramatically accelerate the calculation of energy differences of relevance to chemical transformations. In particular, we show that our correlated sampling-based ph-AFQMC approach is capable of calculating redox properties, deprotonation free energies, and hydrogen abstraction energies in an efficient manner without sacrificing accuracy. We validate the computational protocol by calculating the ionization potentials and electron affinities of the atoms contained in the G2 test set and then proceed to utilize a composite method, which treats fixed-geometry processes with correlated sampling-based AFQMC and relaxation energies via MP2, to compute the ionization potential, deprotonation free energy, and the O-H bond dissociation energy of methanol, all to within chemical accuracy. We show that the efficiency of correlated sampling relative to uncorrelated calculations increases with system and basis set size and that correlated sampling greatly reduces the required number of random walkers to achieve a target statistical error. This translates to reductions in wall-times by factors of 55, 25, and 24 for the ionization potential of the K atom, the deprotonation of methanol, and hydrogen abstraction from the O-H bond of methanol, respectively.
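The variance-reduction principle behind correlated sampling can be illustrated with a toy common-random-numbers experiment; this is a hedged sketch of the statistical idea only, unrelated to the actual AFQMC propagation. Two nearby observables are estimated with shared versus independent samples, and the shared-sample difference shows a far smaller statistical error, which is why far fewer random walkers are needed for a target error bar.

```cpp
// Toy illustration of correlated sampling: estimating a small difference
// D = E[f] - E[g] with SHARED samples cancels fluctuations; independent
// samples do not. (Not AFQMC itself; f, g, and all values are made up.)
#include <cmath>
#include <iostream>
#include <random>

int main() {
    std::mt19937 rng(42);
    std::normal_distribution<double> gauss(0.0, 1.0);
    auto f = [](double x) { return std::cos(x); };
    auto g = [](double x) { return std::cos(x + 0.05); };  // perturbed system

    const int n = 100000;
    double sc = 0, sc2 = 0, su = 0, su2 = 0;
    for (int i = 0; i < n; ++i) {
        double x = gauss(rng);    // shared "walker"
        double y = gauss(rng);    // independent "walker"
        double dc = f(x) - g(x);  // correlated difference
        double du = f(x) - g(y);  // uncorrelated difference
        sc += dc; sc2 += dc * dc;
        su += du; su2 += du * du;
    }
    auto stderr_of = [n](double s, double s2) {  // standard error of the mean
        double m = s / n;
        return std::sqrt((s2 / n - m * m) / n);
    };
    std::cout << "correlated:   " << sc / n << " +/- " << stderr_of(sc, sc2) << '\n'
              << "uncorrelated: " << su / n << " +/- " << stderr_of(su, su2) << '\n';
    return 0;
}
```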
In Chapter 3 we present an implementation of ph-AFQMC utilizing graphical processing units (GPUs). The AFQMC method is recast in terms of matrix operations which are spread across thousands of processing cores and are executed in batches using custom Compute Unified Device Architecture kernels and the hardware-optimized cuBLAS matrix library. Algorithmic advances include a batched Sherman-Morrison-Woodbury algorithm to quickly update matrix determinants and inverses, density-fitting of the two-electron integrals, an energy algorithm involving a high-dimensional precomputed tensor, and the use of single-precision floating point arithmetic. These strategies result in dramatic reductions in wall-times for both single- and multi-determinant trial wavefunctions. For typical calculations we find speed-ups of roughly two orders of magnitude using just a single GPU card. Furthermore, we achieve near-unity parallel efficiency using 8 GPU cards on a single node, and can reach moderate system sizes via a local memory-slicing approach. We illustrate the robustness of our implementation on hydrogen chains of increasing length, and through the calculation of all-electron ionization potentials of the first-row transition metal atoms. We compare long imaginary-time calculations utilizing a population control algorithm with our previously published correlated sampling approach, and show that the latter improves not only the efficiency but also the accuracy of the computed ionization potentials. Taken together, the GPU implementation combined with correlated sampling provides a compelling computational method that will broaden the application of ph-AFQMC to the description of realistic correlated electronic systems.
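The rank-1 case of the Sherman-Morrison-Woodbury update mentioned above can be stated compactly: for A' = A + u v^T with known A^{-1} and det A, det A' = det A (1 + v^T A^{-1} u) and A'^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u). The following CPU-side sketch uses Eigen (an assumption of this illustration; the thesis batches many such updates on the GPU with CUDA/cuBLAS) to update a determinant and inverse in O(n^2) instead of recomputing them in O(n^3).

```cpp
// Rank-1 Sherman-Morrison update of a determinant and inverse.
// Eigen is used here only for illustration; not the thesis's GPU code.
#include <Eigen/Dense>
#include <cmath>
#include <iostream>

int main() {
    const int n = 4;
    // Diagonally dominated random matrix, so A is safely invertible.
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n)
                      + 5.0 * Eigen::MatrixXd::Identity(n, n);
    Eigen::VectorXd u = Eigen::VectorXd::Random(n);
    Eigen::VectorXd v = Eigen::VectorXd::Random(n);

    Eigen::MatrixXd Ainv = A.inverse();
    double detA = A.determinant();

    // O(n^2) update instead of an O(n^3) refactorization.
    Eigen::VectorXd Au = Ainv * u;                 // A^{-1} u
    Eigen::RowVectorXd vA = v.transpose() * Ainv;  // v^T A^{-1}
    double denom = 1.0 + v.dot(Au);
    double detUpdated = detA * denom;
    Eigen::MatrixXd AinvUpdated = Ainv - (Au * vA) / denom;

    // Verify against direct recomputation.
    Eigen::MatrixXd Anew = A + u * v.transpose();
    std::cout << "det error: " << std::abs(detUpdated - Anew.determinant()) << '\n'
              << "inv error: " << (AinvUpdated - Anew.inverse()).norm() << '\n';
    return 0;
}
```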
In Chapter 4 the bond dissociation energies of a set of 44 3d transition metal-containing diatomics are computed with ph-AFQMC utilizing the correlated sampling technique. We investigate molecules with H, N, O, F, Cl, and S ligands, including those in the 3dMLBE20 database first compiled by Truhlar and co-workers with calculated and experimental values that have since been revised by various groups. In order to make a direct comparison of the accuracy of our ph-AFQMC calculations with previously published results from 10 DFT functionals, CCSD(T), and icMR-CCSD(T), we establish an objective selection protocol which utilizes the most recent experimental results except for a few cases with well-specified discrepancies. With the remaining set of 41 molecules, we find that ph-AFQMC gives robust agreement with experiment superior to that of all other methods, with a mean absolute error (MAE) of 1.4(4) kcal/mol and maximum error of 3(3) kcal/mol (parentheses account for reported experimental uncertainties and the statistical errors of our ph-AFQMC calculations). In comparison, CCSD(T) and B97, the best performing DFT functional considered here, have MAEs of 2.8 and 3.7 kcal/mol, respectively, and maximum errors in excess of 17 kcal/mol (for the CoS diatomic). While a larger and more diverse data set would be required to demonstrate that ph-AFQMC is truly a benchmark method for transition metal systems, our results indicate that the method has tremendous potential, exhibiting unprecedented consistency and accuracy compared to other approximate quantum chemical approaches.
The energy gap between the lowest-lying singlet and triplet states is an important quantity in chemical photocatalysis, with relevant applications ranging from triplet fusion in optical upconversion to the design of organic light-emitting devices. The ab initio prediction of singlet-triplet (ST) gaps is challenging due to the potentially biradical nature of the involved states, combined with the potentially large size of relevant molecules. In Chapter 5, we show that ph-AFQMC can accurately predict ST gaps for chemical systems with singlet states of highly biradical nature, including a set of 13 small molecules and the ortho-, meta-, and para-isomers of benzyne. With respect to gas-phase experiments, ph-AFQMC using CASSCF trial wavefunctions achieves a mean absolute error of ~1 kcal/mol. Furthermore, we find that in the context of a spin-projection technique, ph-AFQMC using unrestricted single-determinant trial wavefunctions, which can be readily obtained for even very large systems, produces equivalently high accuracy. We proceed to show that this scalable methodology is capable of yielding accurate ST gaps for all linear polyacenes for which experimental measurements exist, i.e., naphthalene, anthracene, tetracene, and pentacene. Our results suggest a protocol for selecting either unrestricted Hartree-Fock or Kohn-Sham orbitals for the single-determinant trial wavefunction, based on the extent of spin contamination. These findings provide a reliable computational tool with which to investigate specific photochemical processes involving large molecules that may have substantial biradical character. We compute the ST gaps for a set of anthracene derivatives which are potential triplet-triplet annihilators for optical upconversion, and compare our ph-AFQMC predictions with those from DFT and CCSD(T) methods.
We conclude with a discussion of ongoing projects, further methodological improvements on the horizon, and future applications of ph-AFQMC to chemical systems of interest in the fields of biology, drug discovery, catalysis, and condensed matter physics.
Performance and energy-efficiency optimization in massively parallel systems
Heterogeneous systems are becoming increasingly relevant due to their performance and energy-efficiency capabilities, and are present in all types of computing platforms, from embedded devices and servers to HPC nodes in large data centers. Their complexity means that they are usually used under the task paradigm and the host-device programming model. This strongly penalizes accelerator utilization and system energy consumption, as well as making it difficult to adapt applications.
Co-execution allows all devices to simultaneously compute the same problem, cooperating to consume less time and energy. However, programmers must handle all device management, workload distribution and code portability between systems, significantly complicating their programming.
This thesis offers contributions to improve performance and energy efficiency in these massively parallel systems. The proposals address the following generally conflicting objectives: usability and programmability are improved, while ensuring enhanced system abstraction and extensibility, and at the same time performance, scalability and energy efficiency are increased. To achieve this, two runtime systems with completely different approaches are proposed.
EngineCL, focused on OpenCL and with a high-level API, provides an extensible modular system and favors maximum compatibility between all types of devices. Its versatility allows it to be adapted to environments for which it was not originally designed, including applications with time-constrained executions or molecular dynamics HPC simulators, such as the one used in an international research center.
Considering industrial trends and emphasizing professional applicability, CoexecutorRuntime provides a flexible C++/SYCL-based system that provides co-execution support for oneAPI technology. This runtime brings programmers closer to the problem domain, enabling the exploitation of dynamic adaptive strategies that improve efficiency in all types of applications.Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant),
Funding: This PhD has been supported by the Spanish Ministry of Education (FPU16/03299 grant) and the Spanish Science and Technology Commission under contracts TIN2016-76635-C2-2-R and PID2019-105660RB-C22. This work has also been partially supported by the Mont-Blanc 3: European Scalable and Power Efficient HPC Platform based on Low-Power Embedded Technology project (G.A. No. 671697) from the European Union's Horizon 2020 Research and Innovation Programme (H2020 Programme). Some activities have also been funded by the Spanish Science and Technology Commission under contract TIN2016-81840-REDT (CAPAP-H6 network). The "Integration II: Hybrid programming models" work of Chapter 4 was partially performed under the project HPC-EUROPA3 (INFRAIA-2016-1-730897), with the support of the EC Research Innovation Action under the H2020 Programme. In particular, the author gratefully acknowledges the support of the SPMT Department of the High Performance Computing Center Stuttgart (HLRS).
Runtime methods for energy-efficient image processing using significance-driven learning
Ph.D. Thesis.

Image and video processing applications are opening up a whole range of opportunities for processing at the "edge" or in IoT applications, as the demand for high-accuracy processing of high-resolution images increases. However, this comes with an increase in the quantity of data to be processed and stored, causing a significant increase in the computational challenges. There is growing interest in developing hardware systems that provide energy-efficient solutions to this challenge. The challenges in image processing are unique because an increase in resolution not only increases the data to be processed but also greatly increases the amount of information detail that can be scavenged from the data. This thesis addresses the concept of extracting the significant image information to enable intelligent processing of the data within a heterogeneous system.

We propose a unique way of defining image significance, based on what causes us to react when something "catches our eye", whether it be static or dynamic, and whether it be in our central field of focus or our peripheral vision. This significance technique proves to be a relatively economical process in terms of energy and computational effort. We investigate opportunities for further computational and energy efficiency that are available through elective use of heterogeneous system elements.

We utilise significance to adaptively select regions of interest for selective levels of processing dependent on their relative significance. We further demonstrate that, by exploiting the computational slack time released by this process, we can throttle the processor speed to achieve greater energy savings. This demonstrates a reduction in computational effort and an improvement in energy efficiency, a process that we term adaptive approximate computing.

We demonstrate that our approach reduces energy by 50 to 75%, dependent on user quality demand, for a real-time performance requirement of 10 fps on a WQXGA image, when compared with an existing approach that is agnostic of significance. We further hypothesise that, by use of heterogeneous elements, savings of up to 90% could be achievable in both performance and energy when compared with running OpenCV on the CPU alone.
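As a toy illustration of the selective-processing idea (a sketch under assumed parameters, not the thesis's actual significance model), the following OpenCV fragment uses a cheap frame-difference map as "significance" and spends full effort only on tiles that exceed a threshold. The slack created by skipped tiles is what a runtime could then convert into energy savings by throttling the processor.

```cpp
// Toy sketch: frame difference as a cheap "significance" map; significant
// tiles get an expensive path, the rest a cheap one. Tile size and the
// threshold are illustrative assumptions.
#include <opencv2/opencv.hpp>

int main() {
    cv::VideoCapture cap(0);      // assumed camera source
    cv::Mat prev, frame, gray;
    const int tile = 64;          // processing granularity
    const double thresh = 8.0;    // significance threshold

    while (cap.read(frame)) {
        cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
        if (prev.empty()) { gray.copyTo(prev); continue; }

        cv::Mat diff;
        cv::absdiff(gray, prev, diff);  // cheap "dynamic" significance

        for (int y = 0; y + tile <= gray.rows; y += tile) {
            for (int x = 0; x + tile <= gray.cols; x += tile) {
                cv::Rect roi(x, y, tile, tile);
                double sig = cv::mean(diff(roi))[0];
                if (sig > thresh) {
                    // Significant: full-effort path (e.g., edge detection).
                    cv::Mat edges;
                    cv::Canny(gray(roi), edges, 50, 150);
                }
                // Insignificant: skip, creating computational slack that
                // frequency throttling could turn into energy savings.
            }
        }
        gray.copyTo(prev);
        if (cv::waitKey(1) == 27) break;  // Esc to quit
    }
    return 0;
}
```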