gSuite: A Flexible and Framework Independent Benchmark Suite for Graph Neural Network Inference on GPUs
As interest in Graph Neural Networks (GNNs) grows, benchmarking and performance
characterization studies of GNNs are becoming increasingly important. Many
studies to date investigate the performance and computational efficiency of
GNNs. However, this work has been carried out using a few high-level GNN
frameworks. Although these frameworks provide ease of use, they carry many
dependencies on other libraries. The layers of implementation detail and the
dependencies complicate the performance analysis of GNN models built on top of
these frameworks, especially when using architectural simulators. Furthermore,
prior characterization studies generally overlook the different approaches to
GNN computation and evaluate only one of the common computational models.
Motivated by these shortcomings, we developed a benchmark suite that is
framework independent, supports versatile computational models, is easily
configurable, and can be used with architectural simulators without additional
effort.
Our benchmark suite, which we call gSuite, relies only on the hardware
vendor's libraries and is therefore independent of any other framework.
gSuite enables detailed performance characterization studies of GNN inference
using both contemporary GPU profilers and architectural GPU simulators. To
illustrate the benefits of the suite, we perform a detailed characterization
study of a set of well-known GNN models with various datasets, running gSuite
both on a real GPU card and on a timing-detailed GPU simulator. We also examine
the effect of computational models on performance. We use several evaluation
metrics to rigorously measure the performance of GNN computation.
Comment: IEEE International Symposium on Workload Characterization (IISWC) 202
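As an illustration of what "versatile computational models" means here, the sketch below contrasts the two common ways GNN aggregation is computed: as a sparse matrix-matrix product (SpMM) over the whole graph, or as edge-wise gather/scatter message passing. This is a minimal numpy/scipy sketch for intuition, not gSuite's GPU implementation; all names are illustrative.

```python
# Illustrative sketch (not gSuite's API): two common GNN aggregation
# models that a characterization study might compare.
import numpy as np
import scipy.sparse as sp

def aggregate_spmm(adj: sp.csr_matrix, feats: np.ndarray) -> np.ndarray:
    """Whole-graph aggregation as a sparse matrix-matrix product (SpMM)."""
    return adj @ feats

def aggregate_gather_scatter(src, dst, feats, num_nodes):
    """Edge-wise message passing: gather source features, scatter-add them
    to destinations. Equivalent to SpMM with a binary adjacency matrix."""
    out = np.zeros((num_nodes, feats.shape[1]))
    np.add.at(out, dst, feats[src])   # scatter-add of gathered messages
    return out

# Tiny example: a 3-node graph with edges 0->1, 1->2, 2->0.
src, dst = np.array([0, 1, 2]), np.array([1, 2, 0])
feats = np.random.rand(3, 4)
adj = sp.csr_matrix((np.ones(3), (dst, src)), shape=(3, 3))
assert np.allclose(aggregate_spmm(adj, feats),
                   aggregate_gather_scatter(src, dst, feats, 3))
```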
Multi-tasking scheduling for heterogeneous systems
Heterogeneous platforms play an increasingly important role in modern computer
systems, combining high performance with low power consumption. From mobile
devices to supercomputers, an increasing number of computer systems are
heterogeneous.
The most well-known heterogeneous systems, CPU+GPU platforms, have been widely
used in recent years. As they become more mainstream, serving multiple tasks from
multiple users is an emerging challenge, and a good scheduler can greatly improve
performance. However, indiscriminately allocating tasks based on availability leads
to poor performance. Modern GPUs have a large number of hardware resources, and
most tasks cannot efficiently utilize all of them. Concurrent task execution on the
GPU is a promising solution; however, indiscriminately running tasks in parallel
causes slowdowns.
This thesis focuses on scheduling OpenCL kernels. A runtime framework is developed
to determine where to schedule OpenCL kernels: it predicts the best-fit device
using a machine learning-based classifier, then schedules each kernel accordingly to
either the CPU or the GPU. To improve GPU utilization, a kernel merging approach is
proposed. Kernels are merged if their predicted co-execution provides better
performance than sequential execution. A machine learning-based classifier is
developed to find the best kernel pairs for co-execution on the GPU. Finally, a
runtime framework is developed to schedule kernels separately on either the CPU or
the GPU, and to run kernels in pairs when their co-execution improves performance.
The approaches developed in this thesis significantly improve system performance
and outperform all existing techniques.
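As a rough illustration of the kernel-pairing step, the sketch below trains a classifier to predict whether two kernels benefit from co-execution. The feature set, the toy training data, and the choice of a random forest are all assumptions for illustration; the thesis's actual classifier and features may differ.

```python
# Hedged sketch of the general idea: predict whether two OpenCL kernels
# should be merged for GPU co-execution. Features and model are
# illustrative assumptions, not the thesis's exact design.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical per-pair features: combined compute intensity, combined
# memory intensity, work-group size, and solo GPU occupancy.
X_train = np.array([
    [0.8, 0.2, 256, 0.3],   # compute-heavy pair, low solo occupancy ...
    [0.1, 0.9, 128, 0.9],   # ... vs. a pair that saturates the GPU alone
])
y_train = np.array([1, 0])  # 1 = co-execution beat sequential execution

clf = RandomForestClassifier(n_estimators=50, random_state=0)
clf.fit(X_train, y_train)

def should_merge(pair_features):
    """Merge the kernel pair only if co-execution is predicted to win."""
    return bool(clf.predict(np.asarray(pair_features).reshape(1, -1))[0])
```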
Parallel data-local training for optimizing Word2Vec embeddings for word and graph embeddings
The Word2Vec model is a neural network-based unsupervised word embedding technique widely used in applications such as natural language processing, bioinformatics, and graph mining. As Word2Vec repeatedly performs Stochastic Gradient Descent (SGD) to minimize the objective function, it is very compute-intensive. However, existing methods for parallelizing Word2Vec are not sufficiently optimized for data locality to achieve high performance. In this paper, we develop a parallel data-locality-enhanced Word2Vec algorithm based on Skip-gram with a novel negative sampling method that decouples the loss calculation for positive and negative samples; this allows us to efficiently reformulate the negative-sample computation over a sentence as matrix-matrix operations. Experimental results demonstrate that our parallel implementations on multi-core CPUs and GPUs achieve significant performance improvements over existing state-of-the-art parallel Word2Vec implementations while maintaining evaluation quality. We also show the utility of our Word2Vec implementation within the Node2Vec algorithm, which accelerates embedding learning for large graphs.
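The key reformulation the abstract describes can be sketched in a few lines: if one set of negative samples is shared across a sentence, the negative-sample scores for all center words reduce to a single matrix-matrix product. The numpy sketch below illustrates that idea only; it is not the paper's implementation, and all names and sizes are assumptions.

```python
# Illustrative sketch: shared negative samples turn many vector-vector
# score computations into one locality-friendly GEMM.
import numpy as np

rng = np.random.default_rng(0)
d, sent_len, n_neg = 64, 10, 5

W_in = rng.normal(size=(1000, d))         # input (center) embeddings
W_out = rng.normal(size=(1000, d))        # output (context) embeddings
sentence = rng.integers(0, 1000, sent_len)
negatives = rng.integers(0, 1000, n_neg)  # shared across the sentence

C = W_in[sentence]                        # (sent_len, d) center vectors
N = W_out[negatives]                      # (n_neg, d) negative vectors

# One matrix-matrix product gives all center x negative scores at once:
neg_scores = C @ N.T                      # (sent_len, n_neg)

# Equivalent, locality-poor formulation: one dot product per pair.
ref = np.array([[c @ n for n in N] for c in C])
assert np.allclose(neg_scores, ref)
```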
Mapping parallel programs to heterogeneous multi-core systems
Heterogeneous computer systems are ubiquitous in all areas of computing, from mobile
to high-performance computing. They promise to deliver increased performance
at lower energy cost than purely homogeneous, CPU-based systems. In recent years,
GPU-based heterogeneous systems have become increasingly popular. They combine
a programmable GPU with a multi-core CPU. GPUs have become flexible enough
to not only handle graphics workloads but also various kinds of general-purpose
algorithms. They are thus used as a coprocessor or accelerator alongside the CPU.
Developing applications for GPU-based heterogeneous systems involves several
challenges. Firstly, not all algorithms are equally suited for GPU computing. It is thus
important to carefully map the tasks of an application to the most suitable processor
in a system. Secondly, current frameworks for heterogeneous computing, such as
OpenCL, are low-level, requiring a thorough understanding of the hardware by the
programmer. This high barrier to entry could be lowered by automatically generating
and tuning this code from a high-level and thus more user-friendly programming
language. Both challenges are addressed in this thesis.
For the task mapping problem a machine learning-based approach is presented in
this thesis. It combines static features of the program code with runtime information
on input sizes to predict the optimal mapping of OpenCL kernels. This approach is
further extended to also take contention on the GPU into account. Both methods are
able to outperform competing mapping approaches by a significant margin.
Furthermore, this thesis develops a method for targeting GPU-based heterogeneous
systems from OpenMP, a directive-based framework for parallel computing.
OpenMP programs are translated to OpenCL and optimized for GPU performance.
At runtime, a predictive model decides whether to execute the original OpenMP code
on the CPU or the generated OpenCL code on the GPU. This approach is shown to
outperform both a competing approach and hand-tuned code.
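A minimal sketch of the mapping idea, assuming a hypothetical feature set: static code features are combined with the runtime input size and fed to a classifier that picks the CPU or the GPU. The features, toy training data, and decision-tree model are illustrative, not the thesis's exact design.

```python
# Hedged sketch: map an OpenCL kernel to CPU or GPU from static code
# features plus a runtime input-size feature.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# [arith/mem-op ratio, branch count, log2(input size)] -> best device
X = np.array([[4.0, 2, 24],    # regular, compute-heavy, large input
              [0.5, 30, 10]])  # branchy kernel on a small input
y = np.array(["GPU", "CPU"])

model = DecisionTreeClassifier().fit(X, y)

def map_kernel(static_feats, input_size):
    """Predict the best-fit device for one kernel launch."""
    x = np.append(static_feats, np.log2(max(input_size, 1)))
    return model.predict(x.reshape(1, -1))[0]

print(map_kernel([4.0, 2], 1 << 24))  # likely "GPU" under this toy model
```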
Towards The Efficient Use Of Fine-Grained Provenance In Data Science Applications
Recent years have witnessed increased demand for users to be able to interpret the results of data science pipelines, locate erroneous data items in the input, evaluate the importance of individual input data items, and acknowledge the contributions of data curators. Such applications often involve the use of provenance at a fine-grained level and require very fast response times. To address this issue, my goal is to expedite the use of fine-grained provenance in applications within both the database and machine learning domains, which are ubiquitous in contemporary data science pipelines.
In applications from the database domain, I focus on the problem of data citation and provide two different types of solutions, Rewriting-based solutions and Provenance-based solutions, to generate fine-grained citations to database query results by implicitly or explicitly leveraging provenance information.
In applications from the ML domain, the first problem I consider is incrementally updating ML models after the deletion of a small subset of training samples. This is critical for understanding the importance of individual training samples to ML models, especially in online pipelines. For this problem, I provide two solutions, PrIU and DeltaGrad, to incrementally update ML models constructed by SGD/GD methods; they utilize provenance information collected during the training phase on the full dataset, before the deletion requests. The second ML application I focus on is cleaning label uncertainties in the ML training dataset more efficiently and cheaply. To address this problem, I propose a solution, CHEF, to reduce the cost and overhead at each phase of the label-cleaning pipeline while maintaining overall model performance. I also propose initial ideas for removing some assumptions used in these solutions to extend them to more general scenarios.
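To make the incremental-update idea concrete, here is a much-simplified sketch, not PrIU or DeltaGrad themselves: during full-batch gradient descent, per-sample gradients are cached as provenance; to approximate retraining without a few samples, the updates are replayed with those cached contributions subtracted. This is only a first-order approximation, since deleting samples also shifts later iterates; the actual solutions are more sophisticated.

```python
# Simplified sketch of provenance-based incremental unlearning for a
# linear model trained by full-batch gradient descent.
import numpy as np

def train_and_cache(X, y, lr=0.1, steps=100):
    """Train with GD, caching each sample's gradient at every step."""
    n, d = X.shape
    theta, cache = np.zeros(d), []
    for _ in range(steps):
        residual = X @ theta - y              # linear-regression residuals
        per_sample = residual[:, None] * X    # (n, d) per-sample gradients
        cache.append(per_sample)
        theta -= lr * per_sample.mean(axis=0)
    return theta, cache

def approx_delete(X, y, cache, deleted, lr=0.1):
    """Approximate the model retrained without `deleted` sample indices,
    replaying updates with cached gradient contributions subtracted."""
    n, d = X.shape
    keep = n - len(deleted)
    theta = np.zeros(d)
    for per_sample in cache:
        grad = (per_sample.sum(axis=0)
                - per_sample[deleted].sum(axis=0)) / keep
        theta -= lr * grad
    return theta
```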
Invariant object recognition: biologically plausible and machine learning approaches
Understanding the processes that facilitate object recognition is a task that draws on a wide range of fields, integrating knowledge from neuroscience, psychology, computer science and mathematics. The substantial work done in these fields has led to two major outcomes: firstly, a rich interplay between computational models and biological experiments that seek to explain the biological processes that underpin object recognition; secondly, engineered vision systems that on many tasks are approaching the performance of humans.
This work first highlights the importance of ensuring that models aiming for biological relevance actually produce biologically plausible representations, consistent with what has been measured within the primate visual cortex. To accomplish this, two leading biologically plausible models, HMAX and VisNet, are compared on a set of visual processing tasks.
The work then changes approach, focusing on models that do not explicitly seek to model any biological process, but rather solve a particular vision task with the goal of increased performance. This section explores the recently discovered problem of convolutional networks being susceptible to adversarial exemplars. An extension of previous work is shown that allows state-of-the-art networks to be fooled into classifying any image as any label while leaving the original image visually unchanged. Secondly, an efficient implementation of applying dropout in a batchwise fashion is introduced that approximately halves the computational cost, allowing models twice as large to be trained. Finally, an extension to Deep Belief Networks is proposed that constrains the connectivity of a given layer to a topologically local region of the previous one.
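Since the abstract only names the technique, here is a minimal numpy sketch of the batchwise-dropout idea it refers to: sharing one dropout mask across the whole minibatch lets the dropped units be sliced out of the weight matrix before the multiply, so the matrix product itself shrinks. Variable names and sizes are illustrative.

```python
# Batchwise dropout sketch: one mask per minibatch shrinks the GEMM,
# whereas standard per-example dropout must compute the full product
# and then zero dropped activations.
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_hid, p = 32, 512, 1024, 0.5

X = rng.normal(size=(batch, d_in))
W = rng.normal(size=(d_in, d_hid))

keep = rng.random(d_hid) > p          # one mask for the whole batch
H_small = X @ W[:, keep]              # GEMM over ~half the columns

# Full GEMM followed by masking, for comparison:
H_full = (X @ W) * keep
assert np.allclose(H_small, H_full[:, keep])
```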
Improving Pattern Recognition and Neural Network Algorithms With Applications to Solar Panel Energy Optimization
Artificial intelligence is a big part of automation, and with today's technological advances it has taken great strides towards positioning itself as the technology of the future to control, enhance, and perfect automation. Computer vision encompasses pattern recognition, classification, and machine learning; it is at the core of decision making and is a vast and fruitful branch of artificial intelligence. In this work, we present novel algorithms and techniques built upon existing technologies to improve pattern recognition and neural network training, initially motivated by a multidisciplinary effort to build a robot that helps maintain and optimize solar panel energy production.
Our contributions detail an improved non-linear pre-processing technique to enhance poorly illuminated images, based on modifications to standard histogram equalization. While the original motivation was to improve nocturnal navigation, the results have applications in surveillance, search and rescue, medical image enhancement, and many other areas.
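For reference, the standard histogram equalization that the technique modifies can be written in a few lines; the thesis's non-linear modification is not reproduced here, so this is only the textbook baseline for an 8-bit grayscale image.

```python
# Textbook histogram equalization for an 8-bit grayscale image: remap
# intensities through the normalized cumulative histogram.
import numpy as np

def equalize(img: np.ndarray) -> np.ndarray:
    """img: 2-D uint8 array; returns the equalized uint8 image."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())  # normalize to [0, 1]
    return (cdf * 255).astype(np.uint8)[img]           # lookup-table remap
```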
We created a vision system for precise camera distance positioning, motivated by the need to correctly position the robot for capturing solar panel images for classification. The classification algorithm marks solar panels as clean or dirty for later processing. Our algorithm extends past image classification: based on historical and experimental data, it identifies the optimal moment at which to perform maintenance on marked solar panels so as to minimize energy and profit loss.
To improve upon the classification algorithm, we delved into feedforward neural networks because of their recent advancements, proven universal approximation and classification capabilities, and excellent recognition rates. We explore state-of-the-art neural network training techniques, offering pointers and insights, culminating in the implementation of a complete library with support for modern deep learning architectures, multilayer perceptrons, and convolutional neural networks.
Our research with neural networks encountered a great deal of difficulty regarding hyperparameter estimation for good training convergence rate and accuracy. Most hyperparameters, including the architecture, learning rate, regularization, and trainable parameter (weight) initialization, are chosen via a trial-and-error process with some educated guesses. However, we developed the first quantitative method to compare weight initialization strategies, a critical hyperparameter choice during training, and to estimate which of a group of candidate strategies will, with high probability, make the network converge to the highest classification accuracy fastest. Our method provides a quick, objective measure for comparing initialization strategies and selecting the best among them beforehand, without having to complete multiple training sessions for each candidate and compare final results.
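The abstract does not spell out the metric, so the following is only an illustrative proxy for the general idea: rank candidate initialization strategies by the loss reached after a short probe run rather than a full training session per candidate. The model (least squares), the candidate strategies, and the probe length are all assumptions.

```python
# Illustrative proxy: compare weight-initialization strategies by the
# training loss after a short probe run on a least-squares model.
import numpy as np

def probe_loss(init_fn, X, y, steps=20, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = init_fn(rng, X.shape[1])
    for _ in range(steps):
        grad = X.T @ (X @ w - y) / len(y)   # least-squares gradient
        w -= lr * grad
    return float(np.mean((X @ w - y) ** 2))

strategies = {
    "zeros":  lambda rng, d: np.zeros(d),
    "normal": lambda rng, d: rng.normal(scale=0.1, size=d),
    "large":  lambda rng, d: rng.normal(scale=10.0, size=d),
}
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
y = X @ rng.normal(size=5)
best = min(strategies, key=lambda s: probe_loss(strategies[s], X, y))
print("best candidate by probe loss:", best)
```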