A Survey of CUDA-based Multidimensional Scaling on GPU Architecture
The need to analyze large amounts of multivariate data raises the fundamental problem of dimensionality reduction, defined as the process of mapping data from a high-dimensional space into a low-dimensional one. One of the most popular methods for this problem is multidimensional scaling. Owing to technological advances, both the dimensionality and the volume of the input data are increasing steadily, yet processing these data within a reasonable time frame remains an open problem. Recent developments in graphics hardware make it possible to perform general-purpose parallel computations on powerful hardware, offering an opportunity to solve many time-constrained problems in both graphical and non-graphical domains. The purpose of this survey is to describe and analyze recent implementations of multidimensional scaling algorithms on graphics processing units and to assess the applicability of these algorithms to such architectures, based on experimental results that show a decrease in execution time for multi-level approaches.
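The baseline that these GPU implementations accelerate is classical (Torgerson) multidimensional scaling, which embeds points from a pairwise-distance matrix via double centering and an eigendecomposition. A minimal NumPy sketch of that baseline (function name and toy data are illustrative, not from any surveyed implementation):

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (Torgerson) MDS: embed n points in k dimensions
    from an (n, n) matrix D of pairwise Euclidean distances."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]             # keep the top-k eigenpairs
    scale = np.sqrt(np.maximum(w[idx], 0.0))  # clip tiny negative eigenvalues
    return V[:, idx] * scale                  # (n, k) embedding

# toy usage: four collinear points embed in 1-D with distances preserved
X = np.array([[0.0], [1.0], [2.0], [4.0]])
D = np.abs(X - X.T)
Y = classical_mds(D, k=1)
```

The matrix products and the eigensolver dominate the cost at O(n^3), which is exactly why the surveyed work maps these kernels, or iterative stress-majorization alternatives, onto the GPU.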
Enabling a High Throughput Real Time Data Pipeline for a Large Radio Telescope Array with GPUs
The Murchison Widefield Array (MWA) is a next-generation radio telescope
currently under construction in the remote Western Australia Outback. Raw data
will be generated continuously at 5 GiB/s, grouped into 8s cadences. This high
throughput motivates the development of on-site, real time processing and
reduction in preference to archiving, transport and off-line processing. Each
batch of 8s data must be completely reduced before the next batch arrives.
Maintaining real time operation will require a sustained performance of around
2.5 TFLOP/s (including convolutions, FFTs, interpolations and matrix
multiplications). We describe a scalable heterogeneous computing pipeline
implementation, exploiting both the high computing density and FLOP-per-Watt
ratio of modern GPUs. The architecture is highly parallel within and across
nodes, with all major processing elements performed by GPUs. Necessary
scatter-gather operations along the pipeline are loosely synchronized between
the nodes hosting the GPUs. The MWA will be a frontier scientific instrument
and a pathfinder for planned peta- and exascale facilities.

Comment: Version accepted by Computer Physics Communications.
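The core of such a pipeline is an FX correlator: an F stage that channelizes each antenna's samples with an FFT, and an X stage that cross-multiplies and time-averages channels into visibilities. A minimal NumPy sketch of those two stages (the array sizes are hypothetical, not the MWA's actual configuration, and real systems run these kernels on GPUs at far larger scale):

```python
import numpy as np

# Hypothetical toy dimensions, not the MWA's configuration.
n_ant, n_blocks, n_chan = 4, 16, 8
rng = np.random.default_rng(0)
# complex baseband samples: (antennas, time blocks, samples per block)
x = (rng.standard_normal((n_ant, n_blocks, n_chan))
     + 1j * rng.standard_normal((n_ant, n_blocks, n_chan)))

# F stage: channelize each time block with an FFT
X = np.fft.fft(x, axis=-1)                             # (ant, block, chan)

# X stage: cross-multiply antenna pairs and average over time blocks,
# producing one visibility matrix per frequency channel
V = np.einsum('abc,dbc->adc', X, X.conj()) / n_blocks  # (ant, ant, chan)
```

The visibility matrix is Hermitian per channel, and its diagonal holds each antenna's (real, positive) autocorrelation power, which is a convenient sanity check for any implementation.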
A scalable H-matrix approach for the solution of boundary integral equations on multi-GPU clusters
In this work, we consider the solution of boundary integral equations by
means of a scalable hierarchical matrix approach on clusters equipped with
graphics hardware, i.e. graphics processing units (GPUs). To this end, we
extend our existing single-GPU hierarchical matrix library hmglib such that it
is able to scale on many GPUs and such that it can be coupled to arbitrary
application codes. Using a model GPU implementation of a boundary element
method (BEM) solver, we are able to achieve more than 67 percent relative
parallel speed-up going from 128 to 1024 GPUs for a model geometry test case
with 1.5 million unknowns and a real-world geometry test case with almost 1.2
million unknowns. On 1024 GPUs of the cluster Titan, it takes less than 6
minutes to solve the 1.5 million unknowns problem, with 5.7 minutes for the
setup phase and 20 seconds for the iterative solver. To the best of the
authors' knowledge, this is the first fully GPU-based, distributed-memory
parallel, open-source hierarchical matrix library using the traditional
H-matrix format and adaptive cross approximation, with an application to BEM
problems.
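The adaptive cross approximation (ACA) mentioned here builds a low-rank factorization of an admissible matrix block from a few of its rows and columns. The sketch below uses full pivoting on a densely formed block for clarity; production H-matrix codes such as the one described use partial pivoting and evaluate only the needed matrix entries on the fly:

```python
import numpy as np

def aca(A, tol=1e-8, max_rank=None):
    """Cross approximation with full pivoting (illustrative only).
    Returns U, V with A ~= U @ V; rank grows until the largest
    residual entry drops below tol."""
    R = np.array(A, dtype=float)          # residual (real codes never form this)
    m, n = R.shape
    max_rank = max_rank or min(m, n)
    us, vs = [], []
    for _ in range(max_rank):
        i, j = np.unravel_index(np.argmax(np.abs(R)), R.shape)
        pivot = R[i, j]
        if abs(pivot) <= tol:
            break
        us.append(R[:, j] / pivot)        # scaled column of the residual
        vs.append(R[i, :].copy())         # row of the residual
        R -= np.outer(us[-1], vs[-1])     # rank-1 update
    U = np.column_stack(us) if us else np.zeros((m, 0))
    V = np.vstack(vs) if vs else np.zeros((0, n))
    return U, V

# A smooth kernel evaluated between well-separated point clusters is
# numerically low-rank, so ACA terminates well below full rank.
x = np.linspace(0.0, 1.0, 40)
y = np.linspace(5.0, 6.0, 40)
A = 1.0 / np.abs(x[:, None] - y[None, :])   # 1/r kernel block
U, V = aca(A)
```

The rapid rank decay on admissible blocks is what makes the overall H-matrix representation, and hence the BEM solver, scale to millions of unknowns.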
Adaptive multispectral GPU accelerated architecture for Earth Observation satellites
In recent years the growth in quantity, diversity and capability of Earth Observation (EO) satellites has enabled increases in the achievable payload data dimensionality and volume. However, the lack of equivalent advancement in downlink technology has resulted in an onboard data bottleneck. This bottleneck must be alleviated for EO satellites to continue to efficiently provide high-quality and increasing quantities of payload data. This research explores the selection and implementation of state-of-the-art multidimensional image compression algorithms and proposes a new onboard data processing architecture to help alleviate the bottleneck and increase the data throughput of the platform. The proposed system is based upon a backplane architecture to provide scalability across different satellite platform sizes and varying mission objectives. The heterogeneous nature of the architecture allows the benefits of both Field Programmable Gate Array (FPGA) and Graphics Processing Unit (GPU) hardware to be leveraged for maximised data processing throughput.
Performance Comparison Of Two Data Mining Algorithms On Big Data Platforms
In this Big Data era, the need for performing large-scale computations is evident, and a better understanding of which platforms can run these computations most efficiently is needed. In this thesis, we compare four such big data platforms, namely Hadoop, Spark, GPU, and multicore CPU. We compare these platforms using two prominent data mining algorithms, K-means clustering and K-nearest neighbour classification, and discuss specific implementation-level details. We provide several insights into the best possible implementations of these algorithms and systematically compare the benefits and drawbacks of each platform. We conduct experiments varying data size and parameters to obtain the runtime and scalability of each platform. Our experiments show that GPU and multicore CPU are faster but have certain limitations, whereas Hadoop and Spark are able to handle large-scale datasets. We also observe that Spark performs better than Hadoop for both iterative and non-iterative jobs. In summary, we examine different characteristics of four big data platforms and provide a comparative analysis for the cases of two algorithms. Since many other data mining algorithms use these two methods either during pre-processing or as an integral component, we hope that our analysis will have an impact on many applications and algorithms beyond those reported in this thesis.
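The first of the two benchmarked algorithms, K-means, is iterative: each pass recomputes all point-to-center distances, which is the part GPUs and multicore CPUs parallelize and Hadoop/Spark distribute. A minimal Lloyd's-algorithm sketch in NumPy (not the thesis's actual implementation; the toy data are illustrative):

```python
import numpy as np

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means. The assignment step's distance matrix is
    the computational hotspot on every platform compared in the thesis."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # assignment step: nearest center for every point
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # update step: recompute each center as its cluster mean,
        # keeping the old center if a cluster goes empty
        new = np.array([X[labels == c].mean(0) if (labels == c).any()
                        else centers[c] for c in range(k)])
        converged = np.allclose(new, centers)
        centers = new
        if converged:
            break
    return centers, labels

# toy usage: two well-separated blobs are recovered as two clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (50, 2)),
               rng.normal(5.0, 0.1, (50, 2))])
centers, labels = kmeans(X, 2)
```

The iterative structure is also why the thesis's Spark-vs-Hadoop observation matters: Spark caches the dataset in memory across iterations, whereas Hadoop rereads it each pass.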
Dynamic automatic differentiation of GPU broadcast kernels
We show how forward-mode automatic differentiation (AD) can be employed within larger reverse-mode computations to dynamically differentiate broadcast operations in a GPU-friendly manner. Our technique fully exploits the broadcast Jacobian's inherent sparsity structure, and unlike a pure reverse-mode approach, this "mixed-mode" approach does not require a backwards pass over the broadcast operation's subgraph, obviating the need for several reverse-mode-specific programmability restrictions on user-authored broadcast operations. Most notably, this approach allows broadcast fusion in primal code despite the presence of data-dependent control flow. We discuss an experiment in which a Julia implementation of our technique outperformed pure reverse-mode TensorFlow and Julia implementations for differentiating through broadcast operations within an HM-LSTM cell update calculation.
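The sparsity the abstract refers to is that an elementwise broadcast has a (block-)diagonal Jacobian, so a single forward dual-number pass yields everything the enclosing reverse pass needs. A toy Python sketch of this mixed-mode idea (the paper's implementation is in Julia; the class and function names here are illustrative, not its API):

```python
import numpy as np

class Dual:
    """Array-valued dual number: val carries the primal, eps the tangent."""
    def __init__(self, val, eps):
        self.val, self.eps = val, eps
    def __mul__(self, o):
        if isinstance(o, Dual):
            return Dual(self.val * o.val,
                        self.eps * o.val + self.val * o.eps)
        return Dual(self.val * o, self.eps * o)
    __rmul__ = __mul__
    def __add__(self, o):
        if isinstance(o, Dual):
            return Dual(self.val + o.val, self.eps + o.eps)
        return Dual(self.val + o, self.eps)
    __radd__ = __add__

def dsin(d):
    """sin lifted to duals: d/dx sin(x) = cos(x)."""
    return Dual(np.sin(d.val), np.cos(d.val) * d.eps)

def broadcast_vjp(f, x, ybar):
    """Reverse-mode pullback of an elementwise f via one forward dual pass:
    the tangent out.eps is the Jacobian diagonal, so xbar = f'(x) * ybar
    with no backwards pass over f's subgraph."""
    out = f(Dual(x, np.ones_like(x)))
    return out.val, out.eps * ybar

f = lambda d: dsin(d * 2.0) + d      # elementwise: sin(2x) + x
x = np.linspace(0.0, 1.0, 5)
y, xbar = broadcast_vjp(f, x, np.ones_like(x))
```

Because `f` is only ever executed forwards, it may contain data-dependent control flow without breaking the surrounding reverse-mode tape, which is the fusion benefit the abstract highlights.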