639 research outputs found

    Algorithmic issues in visual object recognition

    Get PDF
    This thesis is divided into two parts covering two aspects of research in the area of visual object recognition. Part I is about human detection in still images. Human detection is a challenging computer vision task due to the wide variability in human visual appearances and body poses. In this part, we present several enhancements to human detection algorithms. First, we present an extension to the integral images framework that allows constant-time computation of non-uniformly weighted summations over rectangular regions using a bundle of integral images. Such a computational element is commonly used in constructing gradient-based feature descriptors, which are the most successful in shape-based human detection. Second, we introduce deformable features as an alternative to the conventional static features used in classifiers based on boosted ensembles. Deformable features can enhance the accuracy of human detection by adapting to pose changes that can be described as translations of body features. Third, we present a comprehensive evaluation framework for cascade-based human detectors. The presented framework facilitates comparison between cascade-based detection algorithms, provides a confidence measure for the results, and sets out a practical evaluation scenario. Part II explores the possibilities of enhancing the speed of core algorithms used in visual object recognition using the computing capabilities of Graphics Processing Units (GPUs). First, we present an implementation of Graph Cut on GPUs, which achieves up to 4x speedup compared to a CPU implementation. The Graph Cut algorithm has many applications related to visual object recognition, such as segmentation and 3D point matching. Second, we present an efficient sparse approximation of kernel matrices for GPUs that can significantly speed up kernel-based learning algorithms, which are widely used in object detection and recognition. We present an implementation of the Affinity Propagation clustering algorithm based on this representation, which is about 6 times faster than another GPU implementation based on a conventional sparse matrix representation.
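
    As a point of reference for the first contribution, the following is a minimal sketch of the standard integral-image (summed-area table) primitive that the bundle extension builds on; the function names are illustrative, not from the thesis.

        import numpy as np

        def integral_image(img):
            """Summed-area table: ii[y, x] = sum of img[:y, :x]."""
            # Zero-pad so queries need no boundary checks.
            ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
            ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
            return ii

        def rect_sum(ii, y0, x0, y1, x1):
            """Uniformly weighted sum over img[y0:y1, x0:x1] in constant
            time (four table lookups), independent of rectangle size."""
            return ii[y1, x1] - ii[y0, x1] - ii[y1, x0] + ii[y0, x0]

    A bundle, as described above, would maintain several such tables (for instance, one per quantized weight level) so that a non-uniformly weighted sum still costs a fixed number of lookups; the precise construction is the thesis's contribution.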

    Multi-GPU maximum entropy image synthesis for radio astronomy

    Full text link
    The maximum entropy method (MEM) is a well-known deconvolution technique in radio interferometry. This method solves a non-linear optimization problem with an entropy regularization term. Other heuristics such as CLEAN are faster but highly user dependent. Nevertheless, MEM has the following advantages: it is unsupervised, it has a statistical basis, and it achieves better resolution and better image quality under certain conditions. This work presents a high-performance GPU version of non-gridding MEM, which is tested using real and simulated data. We propose a single-GPU and a multi-GPU implementation for single- and multi-spectral data, respectively. We also make use of the Peer-to-Peer and Unified Virtual Addressing features of newer GPUs, which allow multiple GPUs to be exploited transparently and efficiently. Several ALMA data sets are used to demonstrate the effectiveness in imaging and to evaluate GPU performance. The results show that a speedup of 1000 to 5000 times over a sequential version can be achieved, depending on data and image size. This allows the HD142527 CO(6-5) short-baseline data set to be reconstructed in 2.1 minutes, instead of the 2.5 days taken by a sequential CPU version.
    Comment: 11 pages, 13 figures
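
    A rough sketch of the kind of objective MEM minimizes may clarify the abstract: a chi-squared data-fidelity term plus an entropy regularizer, optimized with positivity-preserving gradient steps. The dense operator A below is a toy stand-in; the non-gridding GPU version described above evaluates the data term directly against the measured visibilities.

        import numpy as np

        def mem_objective(I, A, d, lam, M=1.0):
            """Chi-squared data term plus entropy regularizer for a
            positive image I, measurement operator A, data d, prior
            level M, and regularization weight lam."""
            r = A @ I - d
            return 0.5 * r @ r + lam * np.sum(I * np.log(I / M))

        def mem_step(I, A, d, lam, step=1e-3, M=1.0, eps=1e-12):
            """One projected gradient step; the maximum() keeps the
            image positive, which the entropy term requires."""
            grad = A.T @ (A @ I - d) + lam * (np.log(I / M) + 1.0)
            return np.maximum(I - step * grad, eps)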

    Large-scale Machine Learning in High-dimensional Datasets

    Get PDF

    Dense and sparse parallel linear algebra algorithms on graphics processing units

    Full text link
    One line of development followed in the field of supercomputing is the use of special-purpose processors to speed up certain types of computation. In this thesis we study the use of graphics processing units as computational accelerators and apply them to the field of linear algebra. In particular, we work with the SLEPc library to solve large-scale eigenvalue problems and to apply matrix functions in scientific applications. SLEPc is a parallel library based on the MPI standard and is developed with the premise of being scalable, i.e. of allowing larger problems to be solved by increasing the number of processing units. We address the linear eigenvalue problem, Ax = lambda x in its standard form, using iterative techniques, in particular Krylov methods, with which we compute a small portion of the eigenvalue spectrum. This type of algorithm is based on generating a subspace of reduced size m in which to project the problem of large dimension n, with m << n. Once the problem has been projected, it is solved by direct methods, which provide approximations to the eigenvalues of the initial problem we wanted to solve. The operations used in the expansion of the subspace vary depending on whether the desired eigenvalues lie in the exterior or in the interior of the spectrum. When searching for exterior eigenvalues, the expansion is done by matrix-vector multiplications. We perform this operation on the GPU, either by using libraries or by writing functions that take advantage of the structure of the matrix. For eigenvalues in the interior of the spectrum, the expansion requires solving linear systems of equations. In this thesis we implement several algorithms, run on the GPU, for solving linear systems of equations in the specific case of matrices with block-tridiagonal structure. In the computation of matrix functions we have to distinguish between the direct application of a function to a matrix, f(A), and the action of a matrix function on a vector, f(A)b. The first case involves a dense computation that limits the size of the problem. The second allows us to work with large sparse matrices, and to solve it we again make use of Krylov methods. The expansion of the subspace is done by matrix-vector multiplications, and we use GPUs in the same way as when solving eigenvalue problems. In this case the projected problem starts at size m but grows by m at each restart of the method. The projected problem is solved by applying a matrix function directly. We have implemented several algorithms to compute the matrix square root and the matrix exponential, in which the use of GPUs speeds up the computation.
    Lamas Daviña, A. (2018). Dense and sparse parallel linear algebra algorithms on graphics processing units [Doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/112425
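
    Where the abstract describes projecting the large problem onto a Krylov subspace of size m via matrix-vector products, the textbook Arnoldi iteration makes the idea concrete. This NumPy sketch is illustrative only, not SLEPc code; in the GPU setting, the A @ v product is the kernel that gets offloaded.

        import numpy as np

        def arnoldi(A, b, m):
            """Build an orthonormal basis V of span{b, Ab, ..., A^(m-1) b}
            and the projected matrix H = V^T A V (assumes no breakdown)."""
            n = b.size
            V = np.zeros((n, m + 1))
            H = np.zeros((m + 1, m))
            V[:, 0] = b / np.linalg.norm(b)
            for j in range(m):
                w = A @ V[:, j]            # matrix-vector product (GPU kernel)
                for i in range(j + 1):     # modified Gram-Schmidt
                    H[i, j] = V[:, i] @ w
                    w -= H[i, j] * V[:, i]
                H[j + 1, j] = np.linalg.norm(w)
                V[:, j + 1] = w / H[j + 1, j]
            return V[:, :m], H[:m, :m]

    The eigenvalues of the small m x m matrix H (the Ritz values) are the direct-method approximations to the exterior eigenvalues of A mentioned above.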

    Large Scale Electronic Structure Studies on the Energetics of Dislocations in Al-Mg Materials System and Its Connection to Mesoscale Models

    Full text link
    Computational modeling of dislocation behavior is vital for designing new lightweight metallic alloys. However, extraordinary challenges are posed by the multiscale physics, which ranges over a vast span of interacting length scales, from electronic-structure and atomic-scale effects at the dislocation core (< 10^-9 m) to long-ranged elastic interactions at the continuum scale (~10 µm). In particular, quantification of the energetics associated with electronic-structure effects inside the dislocation core, and of their interaction with external macroscopic elastic fields, has not been explored due to limitations of current electronic-structure methods based on the widely used plane-wave discretization. This thesis seeks to address the above challenges by developing computational methodologies to conduct large-scale real-space electronic-structure studies of the energetics of dislocations in aluminum and magnesium, and uses these results to develop phenomenological connections to mesoscale models of plasticity, such as discrete dislocation dynamics (DDD), which study the collective behavior of dislocations at longer length scales (~1-15 µm). First, a local real-space formulation of orbital-free density functional theory is developed based on prior work and implemented using a finite-element discretization. The local real-space formulation, coupled with bulk Dirichlet boundary conditions, enables a direct computation of the isolated dislocation core energy. Studies on dislocations in aluminum and magnesium suggest that the core size, the region with a significant contribution of electronic effects to dislocation energetics, is around seven to eleven times the magnitude of the Burgers vector. This is in stark contrast to prior displacement-field-based core-size estimates of one to three times the magnitude of the Burgers vector. Interestingly, our study further indicates that the core energy of dislocations in both aluminum and magnesium depends strongly on external macroscopic strains, with a non-zero slope at zero external strain. Next, the computed dislocation core energetics is used to develop a continuum model for an arbitrary aggregate of dislocations in an infinite isotropic elastic continuum. This model, which accounts for the dependence of the core energy on macroscopic deformation, provides a phenomenological approach to incorporating electronic-structure effects into mesoscale DDD simulations. Applying this model to derive nodal forces in a discrete dislocation network leads to additional configurational forces beyond those considered in existing DDD models. Using case studies, we show that even at distances of 10-15 nm between dislocations, these additional configurational forces are non-trivial relative to the elastic Peach-Koehler force. Furthermore, the core force model is incorporated into a DDD implementation, where a significant influence of the core force on elementary dislocation mechanisms in aluminum, such as the critical stress of a Frank-Read source and the structure of a dislocation binary junction, is demonstrated. To enable such electronic-structure studies of dislocations in generic material systems, calculations using the more accurate and transferable Kohn-Sham density functional theory (KS-DFT) are required. The final part of this thesis extends previous work on real-space adaptive spectral finite-element discretization of KS-DFT to develop numerical strategies and implementation innovations that significantly reduce the computational prefactor while increasing the arithmetic intensity and lowering the data-movement costs on both many-core and heterogeneous architectures. This has enabled systematically convergent and massively parallel (demonstrated up to 192,000 MPI tasks) KS-DFT calculations on material systems of up to ~100,000 electrons. Using GPUs, an unprecedented sustained performance of 46 PFLOPS (27.8% of peak FP64 performance) is demonstrated on a large-scale benchmark dislocation system in magnesium containing 105,080 electrons.
    PhD, Mechanical Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. https://deepblue.lib.umich.edu/bitstream/2027.42/153417/1/dsambit_1.pd
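
    For orientation, the elastic Peach-Koehler force that the thesis's additional configurational core forces are compared against is the classical per-unit-length force on a dislocation line. A minimal sketch, with hypothetical numbers loosely based on aluminum:

        import numpy as np

        def peach_koehler(sigma, b, xi):
            """Classical elastic Peach-Koehler force per unit length:
            F = (sigma . b) x xi, for stress tensor sigma, Burgers
            vector b, and unit line direction xi."""
            return np.cross(sigma @ b, xi)

        s = 1.0e8                                      # Pa, assumed shear stress
        sigma = np.array([[0, s, 0], [s, 0, 0], [0, 0, 0]], dtype=float)
        b = np.array([2.86e-10, 0.0, 0.0])             # m, approx. Burgers vector of Al
        xi = np.array([0.0, 0.0, 1.0])                 # edge dislocation line along z
        print(peach_koehler(sigma, b, xi))             # glide force per unit length, N/m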

    Lattice QCD with Domain Decomposition on Intel Xeon Phi Co-Processors

    Full text link
    The gap between the cost of moving data and the cost of computing continues to grow, making it ever harder to design iterative solvers for extreme-scale architectures. This problem can be alleviated by alternative algorithms that reduce the amount of data movement. We investigate this in the context of Lattice Quantum Chromodynamics and implement such an alternative solver algorithm, based on domain decomposition, on Intel Xeon Phi co-processor (KNC) clusters. We demonstrate close-to-linear on-chip scaling to all 60 cores of the KNC. With a mix of single and half precision, the domain-decomposition method sustains 400-500 Gflop/s per chip. Compared to an optimized KNC implementation of a standard solver [1], our full multi-node domain-decomposition solver strong-scales to more nodes and reduces the time-to-solution by a factor of 5.
    Comment: 12 pages, 7 figures, presented at Supercomputing 2014, November 16-21, 2014, New Orleans, Louisiana, USA, speaker Simon Heybrock; SC '14 Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 69-80, IEEE Press, Piscataway, NJ, USA (c)201
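
    As a loose illustration of why domain decomposition reduces data movement, the sketch below applies a damped overlapping additive Schwarz iteration to a 1D Poisson matrix standing in for the lattice Dirac operator: each sweep solves small subdomain problems independently, so only data in the overlap regions must be exchanged between processors.

        import numpy as np

        def additive_schwarz(A, b, domains, sweeps=100, damp=0.5):
            """Damped additive Schwarz: independent local solves per
            sweep; only overlap data would cross node boundaries."""
            x = np.zeros_like(b)
            for _ in range(sweeps):
                r = b - A @ x
                dx = np.zeros_like(x)
                for lo, hi in domains:      # each local solve is independent
                    dx[lo:hi] += np.linalg.solve(A[lo:hi, lo:hi], r[lo:hi])
                x += damp * dx              # damping handles double-counted overlaps
            return x

        n = 64                              # toy 1D Laplacian, not a Dirac operator
        A = 2 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)
        b = np.ones(n)
        x = additive_schwarz(A, b, [(0, 24), (16, 48), (40, 64)])
        print(np.linalg.norm(b - A @ x))    # residual shrinks with more sweeps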

    High-Level GPU Programming: Domain-Specific Optimization and Inference

    Get PDF
    When writing computer software one is often forced to balance the need for high run-time performance with high programmer productivity. By using a high-level language it is often possible to cut development times, but this typically comes at the cost of reduced run-time performance. Using a lower-level language, programs can be made very efficient, but at the cost of increased development time. Real-time computer graphics is an area with very high demands on both performance and visual quality. Typically, large portions of such applications are written in lower-level languages and also rely on dedicated hardware, in the form of programmable graphics processing units (GPUs), for handling computationally demanding rendering algorithms. These GPUs are parallel stream processors, specialized towards computer graphics, whose computational performance is more than an order of magnitude higher than that of corresponding CPUs. This has revolutionized computer graphics and also led to GPUs being used to solve more general numerical problems, such as fluid and physics simulation, protein folding, image processing, and databases. Unfortunately, the highly specialized nature of GPUs has also made them difficult to program. In this dissertation we show that GPUs can be programmed at a higher level than current lower-level languages allow, while maintaining performance. By constructing a domain-specific language (DSL), which provides appropriate domain-specific abstractions and user annotations, it is possible to write programs in a more abstract and modular manner. Using knowledge of the domain, the DSL compiler can generate very efficient code. We show by experiment that the performance of our DSLs is equal to that of GPU programs written by hand using current low-level languages, and that control over the trade-offs between visual quality and performance is retained. In the papers included in this dissertation, we present domain-specific languages targeted at numerical processing and computer graphics, respectively. These DSLs have been implemented as embedded languages in Python, a dynamic programming language that provides a rich set of high-level features. In this dissertation we show how these features can be used to facilitate the construction of embedded languages.
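
    A toy example of the embedding technique: Python operator overloading records an expression tree, and a compile step emits one fused low-level loop from the whole expression. This is only a sketch of the mechanism, not one of the dissertation's languages.

        class Expr:
            """Expression node recorded by operator overloading."""
            def __init__(self, op, args):
                self.op, self.args = op, args
            def __add__(self, other): return Expr('+', [self, other])
            def __mul__(self, other): return Expr('*', [self, other])
            def emit(self):
                if self.op == 'var':
                    return self.args[0] + '[i]'
                left, right = self.args
                return f'({left.emit()} {self.op} {right.emit()})'

        def var(name):
            return Expr('var', [name])

        def compile_kernel(name, out, expr):
            """Emit one fused elementwise loop as C-like source; a real
            DSL compiler would emit a GPU kernel at this point."""
            return (f'void {name}(int n, float *{out} /*, inputs */) {{\n'
                    f'  for (int i = 0; i < n; i++)\n'
                    f'    {out}[i] = {expr.emit()};\n'
                    f'}}')

        a, x, y = var('a'), var('x'), var('y')
        print(compile_kernel('fused_axpy', 'out', a * x + y))

    Because the compiler sees the whole expression tree rather than one operator at a time, it can fuse operations and avoid materializing temporaries, which is where such embedded DSLs recover low-level performance.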

    High-performance and hardware-aware computing: proceedings of the second International Workshop on New Frontiers in High-performance and Hardware-aware Computing (HipHaC'11), San Antonio, Texas, USA, February 2011 (in conjunction with HPCA-17)

    Get PDF
    High-performance system architectures are increasingly exploiting heterogeneity. The HipHaC workshop aims at combining new aspects of parallel, heterogeneous, and reconfigurable microprocessor technologies with concepts of high-performance computing and, particularly, numerical solution methods. Compute- and memory-intensive applications can only benefit from the full hardware potential if all features on all levels are taken into account in a holistic approach.

    Polyhedral Compilation: Applications, Approximations and GPU-specific Optimizations

    Get PDF
    Polyhedral compilation has been successful in analyzing, optimizing, and automatically parallelizing affine computations for modern heterogeneous target architectures. Many tools have been developed to automate the analysis and transformation of the affine control parts of programs, including widely used open-source and production compilers such as GCC, LLVM, and IBM XL. This thesis contributes to the polyhedral model along three orthogonal dimensions:
    • Applications: applies polyhedral loop transformations to deep learning computation kernels to demonstrate the effectiveness of complex loop transformations on these kernels.
    • Approximations: develops two efficient algorithms to over-approximate convex polyhedra by U-TVPI polyhedra, with applications in polyhedral compilation as well as automated program verification.
    • GPU-specific optimizations: builds an end-to-end, fully automatic compiler framework that generates cache-optimized CUDA code from a sequential C program using polyhedral modeling techniques (a hand-written sketch of such a transformation follows below).
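
    The sketch referenced in the last item: a hand-tiled matrix multiplication, written in Python for readability. Tiling is the kind of affine loop transformation a polyhedral compiler derives automatically (and, in the thesis's framework, emits as CUDA).

        def matmul_tiled(A, B, C, n, T=32):
            """Cache-blocked matrix multiply: iterate over T x T tiles so
            each tile of B and C stays resident in fast memory."""
            for ii in range(0, n, T):
                for jj in range(0, n, T):
                    for kk in range(0, n, T):
                        for i in range(ii, min(ii + T, n)):
                            for k in range(kk, min(kk + T, n)):
                                a = A[i][k]
                                for j in range(jj, min(jj + T, n)):
                                    C[i][j] += a * B[k][j]

        n = 4
        A = [[1.0] * n for _ in range(n)]
        B = [[2.0] * n for _ in range(n)]
        C = [[0.0] * n for _ in range(n)]
        matmul_tiled(A, B, C, n, T=2)
        print(C[0][0])   # 8.0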