3,598 research outputs found
Efficient Parallel Video Encoding on Heterogeneous Systems
Proceedings of: First International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2014). Porto (Portugal), August 27-28, 2014.In this study we propose an efficient method for collaborative H.264/AVC inter-loop encoding in heterogeneous CPU+GPU systems. This method relies on specifically developed extensive library of highly optimized parallel algorithms for both CPU and GPU architectures, and all inter-loop modules. In order to minimize the overall encoding time, this method integrates adaptive load balancing for the most computationally intensive, inter-prediction modules, which is based on dynamically built functional performance models of heterogenous devices and inter-loop modules. The proposed method also introduces efficient communication-aware techniques, which maximize data reusing, and decrease the overhead of expensive data transfers in collaborative video encoding. The experimental results show that the proposed method is able of achieving real-time video encoding for very demanding video coding parameters, i.e., full HD video format, 64×64 pixels search area and the exhaustive motion estimation.This work was supported by national funds through FCT – Fundação para a Ciência e a Tecnologia, under projects PEst-OE/EEI/LA0021/2013, PTDC/EEI-ELC/3152/2012 and PTDC/EEA-ELC/117329/2010
Scalable and interpretable product recommendations via overlapping co-clustering
We consider the problem of generating interpretable recommendations by
identifying overlapping co-clusters of clients and products, based only on
positive or implicit feedback. Our approach is applicable on very large
datasets because it exhibits almost linear complexity in the input examples and
the number of co-clusters. We show, both on real industrial data and on
publicly available datasets, that the recommendation accuracy of our algorithm
is competitive to that of state-of-art matrix factorization techniques. In
addition, our technique has the advantage of offering recommendations that are
textually and visually interpretable. Finally, we examine how to implement our
technique efficiently on Graphical Processing Units (GPUs).Comment: In IEEE International Conference on Data Engineering (ICDE) 201
Scaling Deep Learning on GPU and Knights Landing clusters
The speed of deep neural networks training has become a big bottleneck of
deep learning research and development. For example, training GoogleNet by
ImageNet dataset on one Nvidia K20 GPU needs 21 days. To speed up the training
process, the current deep learning systems heavily rely on the hardware
accelerators. However, these accelerators have limited on-chip memory compared
with CPUs. To handle large datasets, they need to fetch data from either CPU
memory or remote processors. We use both self-hosted Intel Knights Landing
(KNL) clusters and multi-GPU clusters as our target platforms. From an
algorithm aspect, current distributed machine learning systems are mainly
designed for cloud systems. These methods are asynchronous because of the slow
network and high fault-tolerance requirement on cloud systems. We focus on
Elastic Averaging SGD (EASGD) to design algorithms for HPC clusters. Original
EASGD used round-robin method for communication and updating. The communication
is ordered by the machine rank ID, which is inefficient on HPC clusters.
First, we redesign four efficient algorithms for HPC systems to improve
EASGD's poor scaling on clusters. Async EASGD, Async MEASGD, and Hogwild EASGD
are faster \textcolor{black}{than} their existing counterparts (Async SGD,
Async MSGD, and Hogwild SGD, resp.) in all the comparisons. Finally, we design
Sync EASGD, which ties for the best performance among all the methods while
being deterministic. In addition to the algorithmic improvements, we use some
system-algorithm codesign techniques to scale up the algorithms. By reducing
the percentage of communication from 87% to 14%, our Sync EASGD achieves 5.3x
speedup over original EASGD on the same platform. We get 91.5% weak scaling
efficiency on 4253 KNL cores, which is higher than the state-of-the-art
implementation
3D high definition video coding on a GPU-based heterogeneous system
H.264/MVC is a standard for supporting the sensation of 3D, based on coding from 2 (stereo) to N views. H.264/MVC adopts many coding options inherited from single view H.264/AVC, and thus its complexity is even higher, mainly because the number of processing views is higher. In this manuscript, we aim at an efficient parallelization of the most computationally intensive video encoding module for stereo sequences. In particular, inter prediction and its collaborative execution on a heterogeneous platform. The proposal is based on an efficient dynamic load balancing algorithm and on breaking encoding dependencies. Experimental results demonstrate the proposed algorithm's ability to reduce the encoding time for different stereo high definition sequences. Speed-up values of up to 90× were obtained when compared with the reference encoder on the same platform. Moreover, the proposed algorithm also provides a more energy-efficient approach and hence requires less energy than the sequential reference algorith
Simulating the weak death of the neutron in a femtoscale universe with near-Exascale computing
The fundamental particle theory called Quantum Chromodynamics (QCD) dictates
everything about protons and neutrons, from their intrinsic properties to
interactions that bind them into atomic nuclei. Quantities that cannot be fully
resolved through experiment, such as the neutron lifetime (whose precise value
is important for the existence of light-atomic elements that make the sun shine
and life possible), may be understood through numerical solutions to QCD. We
directly solve QCD using Lattice Gauge Theory and calculate nuclear observables
such as neutron lifetime. We have developed an improved algorithm that
exponentially decreases the time-to solution and applied it on the new CORAL
supercomputers, Sierra and Summit. We use run-time autotuning to distribute GPU
resources, achieving 20% performance at low node count. We also developed
optimal application mapping through a job manager, which allows CPU and GPU
jobs to be interleaved, yielding 15% of peak performance when deployed across
large fractions of CORAL.Comment: 2018 Gordon Bell Finalist: 9 pages, 9 figures; v2: fixed 2 typos and
appended acknowledgement
FLASH: Randomized Algorithms Accelerated over CPU-GPU for Ultra-High Dimensional Similarity Search
We present FLASH (\textbf{F}ast \textbf{L}SH \textbf{A}lgorithm for
\textbf{S}imilarity search accelerated with \textbf{H}PC), a similarity search
system for ultra-high dimensional datasets on a single machine, that does not
require similarity computations and is tailored for high-performance computing
platforms. By leveraging a LSH style randomized indexing procedure and
combining it with several principled techniques, such as reservoir sampling,
recent advances in one-pass minwise hashing, and count based estimations, we
reduce the computational and parallelization costs of similarity search, while
retaining sound theoretical guarantees.
We evaluate FLASH on several real, high-dimensional datasets from different
domains, including text, malicious URL, click-through prediction, social
networks, etc. Our experiments shed new light on the difficulties associated
with datasets having several million dimensions. Current state-of-the-art
implementations either fail on the presented scale or are orders of magnitude
slower than FLASH. FLASH is capable of computing an approximate k-NN graph,
from scratch, over the full webspam dataset (1.3 billion nonzeros) in less than
10 seconds. Computing a full k-NN graph in less than 10 seconds on the webspam
dataset, using brute-force (), will require at least 20 teraflops. We
provide CPU and GPU implementations of FLASH for replicability of our results
- …