Sample-Based Estimation of Node Similarity in Streaming Bipartite Graphs
My thesis focuses on the estimation of node similarity in streaming bipartite
graphs. As an important model in many data mining applications, the bipartite
graph represents the relationships between two sets of non-interconnected nodes, e.g., customers
and the products/services they buy, users and the events/groups they are involved
in, or individuals and the diseases they are subject to. In most of these cases, the data
naturally streams over time.
Node similarity in my thesis refers mainly to neighborhood-based similarity,
specifically the Common Neighbors (CN) measure. We analyze the distributional properties of CN
in terms of the CN score, its dense ranks (in which objects of equal weight receive the same
rank and ranks are consecutive), and its fraction in the full projection graph, which is also
called the similarity graph. We find that, in real-world datasets, the pairs of nodes with large
CN values make up only a relatively small fraction. Given this property, real-world
streaming bipartite graphs offer an opportunity to save space through weighted sampling,
which can preferentially select high-weight edges.
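
To make the CN score and its dense ranks concrete, here is a minimal exact (unsampled) sketch in Python. It assumes the stream arrives as (left, right) pairs, for example (customer, product), and that similarity is measured between left-side nodes. The function names and the toy stream are illustrative assumptions and do not come from the thesis.

```python
from collections import defaultdict


def common_neighbors(stream):
    """Exact CN score for every pair of left-side nodes sharing a right-side node.

    `stream` is an iterable of (left, right) edges; this layout is an assumption.
    """
    lefts_of = defaultdict(set)  # right node -> left nodes seen so far
    cn = defaultdict(int)        # (u, v) with u < v -> number of common neighbors
    for left, right in stream:
        if left in lefts_of[right]:
            continue  # a repeated edge adds no new common neighbor
        for other in lefts_of[right]:
            pair = (min(left, other), max(left, other))
            cn[pair] += 1
        lefts_of[right].add(left)
    return dict(cn)


def dense_ranks(scores):
    """Dense ranking: equal scores share a rank, and ranks are consecutive."""
    distinct = sorted(set(scores.values()), reverse=True)
    rank_of = {score: i + 1 for i, score in enumerate(distinct)}
    return {pair: rank_of[score] for pair, score in scores.items()}


# Hypothetical toy stream of (customer, product) edges.
edges = [("a", 1), ("b", 1), ("a", 2), ("b", 2), ("c", 2), ("c", 3)]
cn = common_neighbors(edges)
ranks = dense_ranks(cn)
```

Here the pair (a, b) shares two products and takes rank 1, while (a, c) and (b, c) each share one product and both take rank 2. This exact computation stores every pair of the projection graph, which is the storage cost that sampling is meant to avoid.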
Therefore, in this thesis, we propose a new one-pass scheme for sampling the projection
graphs of streaming bipartite graphs in fixed storage and providing unbiased estimates of
the CN similarity weights.
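
The abstract does not spell out the proposed scheme, so the sketch below is not it. As a stand-in, it shows classical priority sampling, a well-known way to keep a fixed-size weighted sample that favors heavy items and still gives unbiased weight estimates. It assumes each projection-graph edge arrives once with its final CN weight, which is a simplification: in a real bipartite stream these weights grow over time. All names are hypothetical.

```python
import heapq
import random


def priority_sample(weighted_edges, k, rng=None):
    """Keep at most k edges and return {edge: unbiased weight estimate}.

    Each edge gets priority weight / u with u uniform in (0, 1]. The k edges
    with the highest priorities are kept, and the (k+1)-th highest priority
    serves as the threshold. A kept edge is estimated as max(weight, threshold);
    edges that are not kept are implicitly estimated as 0.
    """
    rng = rng or random.Random()
    heap = []  # min-heap of (priority, edge, weight), never more than k + 1 entries
    for edge, weight in weighted_edges:
        u = 1.0 - rng.random()  # random() is in [0, 1), so u is in (0, 1]
        heapq.heappush(heap, (weight / u, edge, weight))
        if len(heap) > k + 1:
            heapq.heappop(heap)  # drop the current lowest priority
    if len(heap) <= k:
        # Everything fit in the sample, so the estimates are exact.
        return {edge: float(weight) for _, edge, weight in heap}
    threshold = heapq.heappop(heap)[0]  # the (k+1)-th highest priority
    return {edge: max(float(weight), threshold) for _, edge, weight in heap}


# Hypothetical CN weights for three node pairs; keep only two of them.
weights = {("a", "b"): 2, ("a", "c"): 1, ("b", "c"): 1}
estimates = priority_sample(weights.items(), k=2, rng=random.Random(0))
```

Heavier edges receive higher priorities on average, so they are more likely to survive in the sample, and the thresholded estimate compensates for the edges that were dropped.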