834 research outputs found
On data skewness, stragglers, and MapReduce progress indicators
We tackle the problem of predicting the performance of MapReduce
applications, designing accurate progress indicators that keep programmers
informed on the percentage of completed computation time during the execution
of a job. Through extensive experiments, we show that state-of-the-art progress
indicators (including the one provided by Hadoop) can be seriously harmed by
data skewness, load imbalance, and straggling tasks. This is mainly due to
their implicit assumption that the running time depends linearly on the input
size. We thus design a novel profile-guided progress indicator, called
NearestFit, that operates without the linear hypothesis assumption and exploits
a careful combination of nearest neighbor regression and statistical curve
fitting techniques. Our theoretical progress model requires fine-grained
profile data, which can be very difficult to manage in practice. To overcome
this issue, we resort to computing accurate approximations for some of the
quantities used in our model through space- and time-efficient data streaming
algorithms. We implemented NearestFit on top of Hadoop 2.6.0. An extensive
empirical assessment over the Amazon EC2 platform on a variety of real-world
benchmarks shows that NearestFit is practical w.r.t. space and time overheads
and that its accuracy is generally very good, even in scenarios where
competitors incur non-negligible errors and wide prediction fluctuations.
Overall, NearestFit significantly improves the current state of the art on
progress analysis for MapReduce.
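The linear hypothesis mentioned in the abstract above can be illustrated with a small sketch. It is not NearestFit: it only compares the fraction of input consumed with the fraction of time elapsed for a hypothetical reduce task whose cost per key group is supplied by the caller. The function name `progress_curves` and the quadratic cost used in the usage lines are assumptions made for illustration.

```python
def progress_curves(group_sizes, cost):
    """Return (input_fraction, time_fraction) after each key group.

    group_sizes: number of values in each key group, in processing order.
    cost: hypothetical function mapping a group size to its running time.
    """
    total_input = sum(group_sizes)
    total_time = sum(cost(n) for n in group_sizes)
    done_input = 0
    done_time = 0
    curve = []
    for n in group_sizes:
        done_input += n
        done_time += cost(n)
        # A linear indicator would report done_input / total_input,
        # while the elapsed share of running time is done_time / total_time.
        curve.append((done_input / total_input, done_time / total_time))
    return curve


# Skewed input: one key group holds 60% of the values (assumed quadratic cost).
curve = progress_curves([10, 10, 10, 10, 60], cost=lambda n: n * n)
for input_fraction, time_fraction in curve:
    print(f"input {input_fraction:.0%}  time {time_fraction:.0%}")
```

With this skewed input, a size-based indicator would report 40% after the fourth group although only 10% of the total running time has elapsed, which is the kind of error the abstract attributes to the linear assumption.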
Graph Convolutional Neural Networks for Web-Scale Recommender Systems
Recent advancements in deep neural networks for graph-structured data have
led to state-of-the-art performance on recommender system benchmarks. However,
making these methods practical and scalable to web-scale recommendation tasks
with billions of items and hundreds of millions of users remains a challenge.
Here we describe a large-scale deep recommendation engine that we developed and
deployed at Pinterest. We develop a data-efficient Graph Convolutional Network
(GCN) algorithm PinSage, which combines efficient random walks and graph
convolutions to generate embeddings of nodes (i.e., items) that incorporate
both graph structure as well as node feature information. Compared to prior GCN
approaches, we develop a novel method based on highly efficient random walks to
structure the convolutions and design a novel training strategy that relies on
harder-and-harder training examples to improve robustness and convergence of
the model. We also develop an efficient MapReduce model inference algorithm to
generate embeddings using a trained model. We deploy PinSage at Pinterest and
train it on 7.5 billion examples on a graph with 3 billion nodes representing
pins and boards, and 18 billion edges. According to offline metrics, user
studies and A/B tests, PinSage generates higher-quality recommendations than
comparable deep learning and graph-based alternatives. To our knowledge, this
is the largest application of deep graph embeddings to date and paves the way
for a new generation of web-scale recommender systems based on graph
convolutional architectures.
Comment: KDD 201
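The abstract above says that random walks are used to structure the convolutions. One possible reading, sketched below, is to pick a node's neighborhood as the nodes most often visited by short random walks starting from it. This is an assumption for illustration, not the deployed PinSage code; the function name, parameters, and toy graph are all hypothetical.

```python
import random
from collections import Counter


def random_walk_neighborhood(graph, start, num_walks=200, walk_len=3,
                             top_t=2, seed=0):
    """Return the top_t nodes most visited by short random walks from start.

    graph: dict mapping each node to a non-empty list of adjacent nodes.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    visits = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(walk_len):
            node = rng.choice(graph[node])
            if node != start:
                visits[node] += 1
    return [node for node, _ in visits.most_common(top_t)]


# Toy graph: a-b-c form a path, d-e form a separate component.
toy_graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "d": ["e"], "e": ["d"]}
print(random_walk_neighborhood(toy_graph, "a"))
```

Ranking by visit count lets the neighborhood size stay fixed regardless of node degree, which is one reason such a scheme could suit very large graphs.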
An Improvement in K-NN Graph Construction with Locality Sensitive Hashing on MapReduce
Thesis (Master's) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2015.
The k nearest neighbor (k-NN) graph construction is an important operation with many web-related applications, including collaborative filtering, similarity search, and many others in data mining and machine learning. Despite its many elegant properties, the brute-force k-NN graph construction method has a computational complexity of O(n^2), which is prohibitive for large-scale data sets. [...] based distributed frameworks, MapReduce is gaining increasingly widespread use in applications that process large amounts of data. Based on the divide-and-conquer (two-stage) strategy, we use the locality sensitive hashing technique, which is suited to high-dimensional and sparse data, to divide users into small groups, and calculate similarity with the brute-force method on MapReduce. Specifically, the candidate group generation stage is important since the brute-force calculation is performed in the following step. In this paper, we propose an efficient algorithm for approximating k-NN graphs by regrouping candidate groups using hierarchical LSH. Experimental results show that our approach is more effective than the existing method in terms of graph accuracy and scan rate.
Chapter 1. Introduction
Chapter 2. Related Work
Chapter 3. Background
3.1 LSH (Locality Sensitive Hashing)
3.1.1 MinHash
3.2 Apache Hadoop
3.2.1 MapReduce
Chapter 4. k-NN Graph Construction Using LSH
4.1 Problem Definition
4.2 Generating Neighbor Candidate Groups with MinHash
4.3 Regrouping Neighbor Candidate Groups
4.3.1 Regrouping with a Fixed Group Size
4.3.2 Regrouping through Rehashing
4.4 Similarity Checking within Neighbor Candidate Groups and k-NN Extraction
Chapter 5. Performance Evaluation
5.1 Experimental Setup
5.1.1 Data Sets
5.1.2 Compared Algorithms
5.1.3 Evaluation Criteria
5.2 Experimental Results
5.2.1 Time and Accuracy
5.2.2 Accuracy and Similarity Computation Ratio (Scan Rate)
5.2.3 Accuracy and Number of Hash Tables
5.2.4 Effect of Increasing Cluster Size
5.2.5 Job Completion Time and Reduce Shuffle Bytes
5.2.6 Map Execution Time
Chapter 6. Conclusion and Future Work
References
Appendix
Abstract
Maste
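The two-stage strategy described in the thesis abstract above (group users with LSH, then compare only within groups) can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not the thesis implementation: it uses MinHash signatures built from `hashlib`, Jaccard similarity, hypothetical function names and parameters, and it omits both the MapReduce distribution and the hierarchical regrouping of oversized groups.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if a or b else 0.0


def minhash(items, seed):
    """One MinHash value: the minimum seeded hash over a non-empty set."""
    return min(
        int.from_bytes(
            hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest(),
            "big",
        )
        for x in items
    )


def candidate_groups(profiles, num_tables=4, band=2):
    """Stage 1: bucket users whose MinHash signatures agree in some table."""
    groups = defaultdict(set)
    for t in range(num_tables):
        for user, items in profiles.items():
            key = (t,) + tuple(minhash(items, t * band + j) for j in range(band))
            groups[key].add(user)
    return [g for g in groups.values() if len(g) > 1]


def knn_graph(profiles, k, **lsh_params):
    """Stage 2: brute-force similarity inside each candidate group only."""
    sims = defaultdict(dict)
    for group in candidate_groups(profiles, **lsh_params):
        for u, v in combinations(sorted(group), 2):
            s = jaccard(profiles[u], profiles[v])
            sims[u][v] = s
            sims[v][u] = s
    # Keep the k most similar candidates per user (ties broken by name).
    return {
        u: sorted(sims[u], key=lambda v: (-sims[u][v], v))[:k]
        for u in profiles
    }
```

Because similarity is computed only inside candidate groups, the cost is driven by group sizes rather than by n^2, which is why the abstract singles out oversized candidate groups as the bottleneck.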
Accelerating Spatial Data Processing with MapReduce
Abstract: MapReduce is a key-value based programming model and an associated implementation for processing large data sets. It has been adopted in various scenarios and seems promising. However, when spatial computation is expressed straightforwardly in this key-value based model, difficulties arise from ill-suited features and performance degradation. In this paper, we present the following methods: 1) a splitting method for balancing workload, 2) a pending file structure and redundant data partitioning for dealing with relations between spatial objects, and 3) a strip-based two-direction plane sweeping algorithm for accelerating computation. Based on these methods, an ANN (all nearest neighbors) query and astronomical cross-certification are developed. Performance evaluation shows that the MapReduce-based spatial applications outperform their traditional DBMS-based counterparts.
GGNN: Graph-based GPU Nearest Neighbor Search
Approximate nearest neighbor (ANN) search in high dimensions is an integral
part of several computer vision systems and gains importance in deep learning
with explicit memory representations. Since PQT and FAISS started to leverage
the massive parallelism offered by GPUs, GPU-based implementations are a
crucial resource for today's state-of-the-art ANN methods. While most of these
methods allow for faster queries, less emphasis is devoted to accelerating the
construction of the underlying index structures. In this paper, we propose a
novel search structure based on nearest neighbor graphs and information
propagation on graphs. Our method is designed to take advantage of GPU
architectures to accelerate the hierarchical building of the index structure
and for performing the query. Empirical evaluation shows that GGNN
significantly surpasses the state-of-the-art GPU- and CPU-based systems in
terms of build time, accuracy, and search speed.
Shared Nearest-Neighbor Quantum Game-Based Attribute Reduction with Hierarchical Coevolutionary Spark and Its Application in Consistent Segmentation of Neonatal Cerebral Cortical Surfaces
© 2012 IEEE. The unprecedented increase in data volume has become a severe challenge for conventional patterns of data mining and learning systems tasked with handling big data. The recently introduced Spark platform is a new processing method for big data analysis and related learning systems, which has attracted increasing attention from both the scientific community and industry. In this paper, we propose a shared nearest-neighbor quantum game-based attribute reduction (SNNQGAR) algorithm that incorporates the hierarchical coevolutionary Spark model. We first present a shared coevolutionary nearest-neighbor hierarchy with self-evolving compensation that considers the features of nearest-neighborhood attribute subsets and calculates the similarity between attribute subsets according to the shared neighbor information of attribute sample points. We then present a novel attribute weight tensor model to generate ranking vectors of attributes and apply them to balance the relative contributions of different neighborhood attribute subsets. To optimize the model, we propose an embedded quantum equilibrium game paradigm (QEGP) to ensure that noisy attributes do not degrade the big data reduction results. A combination of the hierarchical coevolutionary Spark model and an improved MapReduce framework is then constructed so that it can better parallelize the SNNQGAR and efficiently determine the preferred reduction solutions of the distributed attribute subsets. The experimental comparisons demonstrate the superior performance of the SNNQGAR, which outperforms most of the state-of-the-art attribute reduction algorithms. Moreover, the results indicate that the SNNQGAR can be successfully applied to segment overlapping and interdependent fuzzy cerebral tissues, and it exhibits a stable and consistent segmentation performance for neonatal cerebral cortical surfaces.
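The abstract above relies on similarity derived from shared neighbor information. A generic form of shared nearest-neighbor similarity, sketched below, counts how many entries two points have in common in their k-nearest-neighbor lists. This is not the SNNQGAR algorithm; the function names and the one-dimensional toy data are assumptions made for illustration.

```python
def knn_lists(points, k):
    """Brute-force k nearest neighbors (as indices) for 1-D points."""
    lists = []
    for i, p in enumerate(points):
        others = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: (abs(points[j] - p), j),  # ties broken by index
        )
        lists.append(others[:k])
    return lists


def snn_similarity(neighbors, i, j):
    """Number of neighbors shared by the k-NN lists of points i and j."""
    return len(set(neighbors[i]) & set(neighbors[j]))


# Two well-separated clusters on a line.
points = [0, 1, 2, 3, 10, 11, 12, 13]
neighbors = knn_lists(points, k=3)
print(snn_similarity(neighbors, 0, 1))  # points in the same cluster
print(snn_similarity(neighbors, 0, 4))  # points in different clusters
```

Points in the same cluster share neighbors while points in different clusters share none, so the count acts as a similarity that is less sensitive to raw distance scales.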
- …