1 research outputs found

    Scalable Methods and Algorithms for Very Large Graphs Based on Sampling

    Get PDF
    Analyzing real-life networks is a computationally intensive task due to the sheer size of networks. Direct analysis is even impossible when the network data is not entirely accessible. For instance, user networks in Twitter and Facebook are not available for third parties to explore their properties directly. Thus, sampling-based algorithms are indispensable. This dissertation addresses the conļ¬dence interval (CI) and bias problems in real-world network analysis. It uses estimations of the number of triangles (hereafter āˆ†) and clustering coefficient (hereafter C) as a case study. Metric āˆ† in a graph is an important measurement for understanding the graph. It is also directly related to C in a graph, which is one of the most important indicators for social networks. The methods proposed in this dissertation can be utilized in other sampling problems. First, we proposed two new methods to estimate āˆ† based on random edge sampling in both streaming and non-streaming models. These methods outperformed the state-of-the-art methods consistently and could be better by orders of magnitude when the graph is very large. More importantly, we proved the improvement ratio analytically and veriļ¬ed our result extensively in real-world networks. The analytical results were achieved by simplifying the variances of the estimators based on the assumption that the graph is very large. We believe that such big data assumption can lead to interesting results not only in triangle estimation but also in other sampling problems. Next, we studied the estimation of C in both streaming and non-streaming sampling models. Despite numerous algorithms proposed in this area, the bias and variance of the estimators remain an open problem. We quantiļ¬ed the bias using Taylor expansion and found that the bias can be determined by the structure of the sampled data. Based on the understanding of the bias, we gave new estimators that correct the bias. The results were derived analytically and veriļ¬ed in 56 real networks ranging in diļ¬€erent sizes and structures. The experiments reveal that the bias ranges widely from data to data. The relative bias can be as high as 4% in non-streaming model and 2% in streaming model, or it can be negative. We also derived the variances of the estimators, and the estimators for the variances. Our simpliļ¬ed estimators can be used in practice to control the accuracy level of estimations
    corecore