
    Sample-Based Estimation of Node Similarity in Streaming Bipartite Graphs

    This thesis analyzes the estimation of node similarity in streaming bipartite graphs. An important model in many data mining applications, the bipartite graph represents the relationships between two sets of nodes that have no edges within either set, e.g., customers and the products or services they buy, users and the events or groups they take part in, or individuals and the diseases they are subject to. In most of these cases, the data naturally arrives as a stream over time. Node similarity in this thesis refers mainly to neighborhood-based similarity, specifically the Common Neighbors (CN) measure. We analyze the distributional properties of CN in terms of the CN score, its dense ranks (in which objects of equal weight receive the same rank and ranks are consecutive), and its fraction in the full projection graph, also called the similarity graph. We find that, in real-world datasets, the pairs of nodes with large CN values constitute only a relatively small fraction. Because of this property, real-world streaming bipartite graphs offer an opportunity to save space through weighted sampling, which preferentially selects high-weight edges. We therefore propose a new one-pass scheme that samples the projection graph of a streaming bipartite graph in fixed storage and provides unbiased estimates of the CN similarity weights.
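
To make the quantities in the abstract concrete, here is a minimal Python sketch of the exact (unsampled) baseline: it accumulates CN scores for pairs of left-side nodes as bipartite edges arrive, then assigns dense ranks. It is not the thesis's sampling scheme, which the abstract does not specify. The function names `common_neighbors` and `dense_ranks`, the `(left, right)` edge format, and the toy stream are all assumptions made for illustration.

```python
from collections import defaultdict


def common_neighbors(stream):
    """Exact CN scores for pairs of left-side nodes.

    `stream` yields (left, right) edges one at a time (assumed format).
    Returns {(a, b): number of right-side neighbors shared by a and b},
    with each pair stored in sorted order.
    """
    left_seen = defaultdict(set)  # right node -> left nodes seen with it
    cn = defaultdict(int)         # (a, b) -> CN score so far
    for left, right in stream:
        seen = left_seen[right]
        if left in seen:
            continue  # ignore a repeated edge
        # The new edge adds one common neighbor (`right`) for every
        # left node already attached to `right`.
        for other in seen:
            pair = (left, other) if left < other else (other, left)
            cn[pair] += 1
        seen.add(left)
    return dict(cn)


def dense_ranks(scores):
    """Dense ranking: equal scores share a rank, ranks are consecutive.

    Rank 1 is the largest score.
    """
    distinct = sorted(set(scores.values()), reverse=True)
    rank_of = {score: i + 1 for i, score in enumerate(distinct)}
    return {pair: rank_of[score] for pair, score in scores.items()}


# Hypothetical toy stream of (customer, product) edges.
edges = [
    ("alice", "book"), ("bob", "book"),
    ("alice", "pen"), ("bob", "pen"), ("carol", "pen"),
]
scores = common_neighbors(edges)
ranks = dense_ranks(scores)
```

In this toy stream, `alice` and `bob` share two products and so hold dense rank 1, while the two pairs involving `carol` share one product each and both hold rank 2. The exact baseline stores every pair in the projection graph, which is the cost a fixed-storage sampling scheme is meant to avoid.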
