834 research outputs found

On data skewness, stragglers, and MapReduce progress indicators

Author: Chambers J. M.
Dai J.
Gufler B.
Herodotou H.
Herodotou H.
Li J.
Ousterhout K.
Zaharia M.
Publication date: 01/01/2015

We tackle the problem of predicting the performance of MapReduce applications by designing accurate progress indicators that keep programmers informed of the percentage of computation time completed during the execution of a job. Through extensive experiments, we show that state-of-the-art progress indicators (including the one provided by Hadoop) can be seriously harmed by data skewness, load imbalance, and straggling tasks. This is mainly due to their implicit assumption that the running time depends linearly on the input size. We thus design a novel profile-guided progress indicator, called NearestFit, that operates without the linearity assumption and exploits a careful combination of nearest neighbor regression and statistical curve fitting techniques. Our theoretical progress model requires fine-grained profile data, which can be very difficult to manage in practice. To overcome this issue, we resort to computing accurate approximations for some of the quantities used in our model through space- and time-efficient data streaming algorithms. We implemented NearestFit on top of Hadoop 2.6.0. An extensive empirical assessment over the Amazon EC2 platform on a variety of real-world benchmarks shows that NearestFit is practical w.r.t. space and time overheads and that its accuracy is generally very good, even in scenarios where competitors incur non-negligible errors and wide prediction fluctuations. Overall, NearestFit significantly improves the current state of the art on progress analysis for MapReduce.
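
The abstract above attributes the errors of existing progress indicators to the assumption that running time grows linearly with input size. A toy calculation can make that gap concrete. This is only a sketch: the task sizes, the quadratic cost model, and the function names are hypothetical, tasks are assumed to run one after another, and none of this is the NearestFit method itself.

```python
def linear_progress(done_sizes, all_sizes):
    """Progress under the linear assumption: fraction of input already processed."""
    return sum(done_sizes) / sum(all_sizes)


def true_progress(done_sizes, all_sizes, cost):
    """Fraction of total running time already spent, given a per-task cost model."""
    return sum(cost(s) for s in done_sizes) / sum(cost(s) for s in all_sizes)


def quadratic(size):
    """Assumed superlinear cost model, chosen for illustration only."""
    return size * size


# Hypothetical skewed job: nine small tasks and one large straggler.
sizes = [1] * 9 + [11]
done = sizes[:9]  # the nine small tasks have finished, the straggler has not

est = linear_progress(done, sizes)           # 9 / 20 = 0.45
act = true_progress(done, sizes, quadratic)  # 9 / 130, roughly 0.07

print(f"linear estimate: {est:.2f}, actual time fraction: {act:.2f}")
```

Under these assumptions the linear estimate reports 45% progress while only about 7% of the running time has elapsed, which is the kind of overestimate that skew and stragglers can cause.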

arXiv.org e-Print Archive

Crossref

Archivio della ricerca- LUISS Libera Università Internazionale degli Studi Sociali Guido Carli di Roma

Archivio della ricerca- Università di Roma La Sapienza

Graph Convolutional Neural Networks for Web-Scale Recommender Systems

Author: Chen Kaifeng
Eksombatchai Pong
Hamilton William L.
He Ruining
Leskovec Jure
Ying Rex
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 05/06/2018

Recent advancements in deep neural networks for graph-structured data have led to state-of-the-art performance on recommender system benchmarks. However, making these methods practical and scalable to web-scale recommendation tasks with billions of items and hundreds of millions of users remains a challenge. Here we describe a large-scale deep recommendation engine that we developed and deployed at Pinterest. We develop a data-efficient Graph Convolutional Network (GCN) algorithm, PinSage, which combines efficient random walks and graph convolutions to generate embeddings of nodes (i.e., items) that incorporate both graph structure and node feature information. Compared to prior GCN approaches, we develop a novel method based on highly efficient random walks to structure the convolutions and design a novel training strategy that relies on harder-and-harder training examples to improve robustness and convergence of the model. We also develop an efficient MapReduce model inference algorithm to generate embeddings using a trained model. We deploy PinSage at Pinterest and train it on 7.5 billion examples on a graph with 3 billion nodes representing pins and boards, and 18 billion edges. According to offline metrics, user studies and A/B tests, PinSage generates higher-quality recommendations than comparable deep learning and graph-based alternatives. To our knowledge, this is the largest application of deep graph embeddings to date and paves the way for a new generation of web-scale recommender systems based on graph convolutional architectures. Comment: KDD 201

arXiv.org e-Print Archive

Crossref

An Improvement in K-NN Graph Construction with Locality Sensitive Hashing on MapReduce

Author: 이인회
Publication venue: Seoul National University Graduate School
Publication date: 01/02/2015

Thesis (Master's) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2015. 김형주.

The k-nearest neighbor (k-NN) graph is a data structure that records the k nearest neighbors of every node. Its construction is an important operation with many web-related applications, including information retrieval, recommender systems, collaborative filtering, similarity search, and many others in data mining and machine learning. Despite its many elegant properties, brute-force k-NN graph construction has a time complexity of O(n^2), which is prohibitive for large-scale data sets. Algorithms are therefore being studied that build k-NN graphs on MapReduce, a distributed framework in increasingly widespread use for processing large amounts of data, using locality sensitive hashing (LSH), a technique that is efficient for high-dimensional, sparse data. Following a divide-and-conquer (two-stage) strategy, users are first divided into small neighbor-candidate groups with LSH, and similarity is then computed by brute force only for pairs within each candidate group. Because the similarity computation is the most time-consuming part of graph construction, how the candidate groups are formed is important, and existing methods are limited in their ability to prevent large candidate groups. In this thesis, we propose an efficient algorithm for approximating k-NN graphs that regroups large candidate groups using hierarchical LSH. Experimental results show that our approach builds a more accurate approximate graph faster than existing methods and is more effective in terms of graph accuracy and scan rate.

Contents:
Chapter 1 Introduction
Chapter 2 Related Work
Chapter 3 Background: 3.1 LSH (Locality Sensitive Hashing); 3.1.1 MinHash; 3.2 Apache Hadoop; 3.2.1 MapReduce
Chapter 4 k-NN Graph Construction Using LSH: 4.1 Problem Definition; 4.2 Neighbor-Candidate Group Generation Using MinHash; 4.3 Neighbor-Candidate Group Reorganization; 4.3.1 Group Reorganization with a Fixed Group Size; 4.3.2 Group Reorganization via Rehashing; 4.4 Similarity Computation within Neighbor-Candidate Groups and k-NN Extraction
Chapter 5 Performance Evaluation: 5.1 Experimental Setup; 5.1.1 Datasets; 5.1.2 Compared Algorithms; 5.1.3 Evaluation Criteria; 5.2 Experimental Results; 5.2.1 Time and Accuracy; 5.2.2 Accuracy and Relative Similarity-Computation Ratio (Scan Rate); 5.2.3 Accuracy and Number of Hash Tables; 5.2.4 Effect of Increasing Cluster Size; 5.2.5 Job Completion Time and Reduce Shuffle Bytes; 5.2.6 Map Execution Time
Chapter 6 Conclusion and Future Work
References
Appendix
Abstract
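
The two-stage strategy summarized in this entry (group users with LSH, then compute similarity by brute force only inside each group, re-hashing groups that grow too large) can be sketched on a single machine as follows. This is a minimal illustration, not the thesis's MapReduce implementation: the function names, the `max_size` and `max_levels` parameters, the use of one MinHash value per level, and the seeding scheme are all assumptions made for the example, and every user is assumed to have a non-empty item set.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two sets."""
    return len(a & b) / len(a | b) if (a or b) else 0.0


def minhash(items, seed):
    """One MinHash value: the smallest seeded 64-bit hash over a non-empty set."""
    return min(
        int.from_bytes(
            hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest(), "big"
        )
        for x in items
    )


def candidate_groups(data, users=None, seed=0, level=0, max_size=4, max_levels=3):
    """Stage 1: bucket users by MinHash; re-hash oversized buckets (assumed scheme)."""
    users = sorted(data) if users is None else users
    buckets = defaultdict(list)
    for u in users:
        buckets[minhash(data[u], seed + level)].append(u)
    groups = []
    for members in buckets.values():
        if len(members) > max_size and level + 1 < max_levels:
            # Split a large group again with a different hash function.
            groups.extend(
                candidate_groups(data, members, seed, level + 1, max_size, max_levels)
            )
        else:
            groups.append(members)
    return groups


def knn_graph(data, k=2, tables=4, max_size=4):
    """Stage 2: brute-force similarity inside each group, then keep the top k."""
    sims = defaultdict(dict)
    for t in range(tables):
        for group in candidate_groups(data, seed=t * 100, max_size=max_size):
            for u, v in combinations(group, 2):
                s = jaccard(data[u], data[v])
                sims[u][v] = s
                sims[v][u] = s
    # Sort by descending similarity, breaking ties by user name.
    return {u: sorted(sims[u], key=lambda v: (-sims[u][v], v))[:k] for u in data}


# Hypothetical toy data: users "a" and "b" have identical item sets.
data = {"a": {1, 2, 3}, "b": {1, 2, 3}, "c": {7, 8, 9}, "d": {7, 8, 10}}
graph = knn_graph(data, k=1)
print(graph["a"], graph["b"])
```

Users with identical item sets always share a MinHash value, so "a" and "b" end up in the same group in every table and find each other as nearest neighbors. Users that never share a bucket are never compared, which is where the approximation (and the saving over O(n^2) comparisons) comes from.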

SNU Open Repository and Archive

Towards distributed node similarity search on graphs

Author: CHEN Lu
GAO Yunjun
GUO Wei
WEN Shiting
ZHANG Tianming
ZHENG Baihua
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/06/2020

Institutional Knowledge at Singapore Management University

VBN

Accelerating Spatial Data Processing with MapReduce

Author: Bibo Tu
Jiao Dai
Jizhong Han
Kai Wang
Wei Zhou
Xuan Song
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2010

MapReduce is a key-value based programming model and an associated implementation for processing large data sets. It has been adopted in various scenarios and seems promising. However, when spatial computation is expressed straightforwardly in this key-value based model, difficulties arise from ill-fitting features and performance degradation. In this paper, we present the following methods: 1) a splitting method for balancing workload, 2) a pending file structure and redundant data partition for dealing with relations between spatial objects, and 3) a strip-based two-direction plane sweeping algorithm for accelerating computation. Based on these methods, an ANN (all nearest neighbors) query and astronomical cross-certification are developed. Performance evaluation shows that the MapReduce-based spatial applications outperform the traditional ones on DBMS.

CiteSeerX

Crossref

GGNN: Graph-based GPU Nearest Neighbor Search

Author: Groh Fabian
Lensch Hendrik P. A.
Ruppert Lukas
Wieschollek Patrick
Publication date: 12/04/2021

Approximate nearest neighbor (ANN) search in high dimensions is an integral part of several computer vision systems and gains importance in deep learning with explicit memory representations. Since PQT and FAISS started to leverage the massive parallelism offered by GPUs, GPU-based implementations are a crucial resource for today's state-of-the-art ANN methods. While most of these methods allow for faster queries, less emphasis is devoted to accelerating the construction of the underlying index structures. In this paper, we propose a novel search structure based on nearest neighbor graphs and information propagation on graphs. Our method is designed to take advantage of GPU architectures to accelerate the hierarchical building of the index structure and the execution of queries. Empirical evaluation shows that GGNN significantly surpasses the state-of-the-art GPU- and CPU-based systems in terms of build time, accuracy and search speed.

arXiv.org e-Print Archive

Publikationsserver der Universität Tübingen

Shared Nearest-Neighbor Quantum Game-Based Attribute Reduction with Hierarchical Coevolutionary Spark and Its Application in Consistent Segmentation of Neonatal Cerebral Cortical Surfaces

Author: Cao Z
Ding W
Lin CT
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2018

The unprecedented increase in data volume has become a severe challenge for conventional patterns of data mining and learning systems tasked with handling big data. The recently introduced Spark platform is a new processing method for big data analysis and related learning systems, which has attracted increasing attention from both the scientific community and industry. In this paper, we propose a shared nearest-neighbor quantum game-based attribute reduction (SNNQGAR) algorithm that incorporates the hierarchical coevolutionary Spark model. We first present a shared coevolutionary nearest-neighbor hierarchy with self-evolving compensation that considers the features of nearest-neighborhood attribute subsets and calculates the similarity between attribute subsets according to the shared neighbor information of attribute sample points. We then present a novel attribute weight tensor model to generate ranking vectors of attributes and apply them to balance the relative contributions of different neighborhood attribute subsets. To optimize the model, we propose an embedded quantum equilibrium game paradigm (QEGP) to ensure that noisy attributes do not degrade the big data reduction results. A combination of the hierarchical coevolutionary Spark model and an improved MapReduce framework is then constructed so that it can better parallelize the SNNQGAR to efficiently determine the preferred reduction solutions of the distributed attribute subsets. The experimental comparisons demonstrate the superior performance of the SNNQGAR, which outperforms most of the state-of-the-art attribute reduction algorithms. Moreover, the results indicate that the SNNQGAR can be successfully applied to segment overlapping and interdependent fuzzy cerebral tissues, and it exhibits a stable and consistent segmentation performance for neonatal cerebral cortical surfaces.

OPUS - University of Technology Sydney

University of Tasmania Open Access Repository