91 research outputs found
Integration of Link and Semantic Relations for Information Recommendation
Information services on the Internet are an important tool for discovering information of interest to users. Many approaches to information discovery on the Internet have been proposed, with search engines being the most common. However, most current approaches can discover only keyword-matching information and cannot automatically recommend the most recent and relevant information to users. Sometimes users can give only a fuzzy keyword instead of an accurate one, so some desired information is ignored by search engines. Moreover, current search engines cannot discover latent but logically relevant information or services for users. This paper measures the semantic similarity and link similarity between keywords. Based on these measures, it introduces the concept of web page similarity and presents a method for information recommendation. Finally, an experimental evaluation and comparisons with existing studies are performed.
VERSE: Versatile Graph Embeddings from Similarity Measures
Embedding a web-scale information network into a low-dimensional vector space
facilitates tasks such as link prediction, classification, and visualization.
Past research has addressed the problem of extracting such embeddings by
adopting methods from words to graphs, without defining a clearly
comprehensible graph-related objective. Yet, as we show, the objectives used in
past works implicitly utilize similarity measures among graph nodes.
In this paper, we carry the similarity orientation of previous works to its
logical conclusion; we propose VERtex Similarity Embeddings (VERSE), a simple,
versatile, and memory-efficient method that derives graph embeddings explicitly
calibrated to preserve the distributions of a selected vertex-to-vertex
similarity measure. VERSE learns such embeddings by training a single-layer
neural network. While its default, scalable version does so via sampling
similarity information, we also develop a variant using the full information
per vertex. Our experimental study on standard benchmarks and real-world
datasets demonstrates that VERSE, instantiated with diverse similarity
measures, outperforms state-of-the-art methods in terms of precision and recall
in major data mining tasks and supersedes them in time and space efficiency,
while the scalable sampling-based variant achieves equally good results as the
non-scalable full variant. Comment: In WWW 2018: The Web Conference. 10 pages, 5 figures
SimRank*: effective and scalable pairwise similarity search based on graph topology
Given a graph, how can we quantify similarity between two nodes in an effective and scalable way? SimRank is an attractive measure of pairwise similarity based on graph topologies. Its underpinning philosophy that “two nodes are similar if they are pointed to (have incoming edges) from similar nodes” can be regarded as an aggregation of similarities based on incoming paths. Despite its popularity in various applications (e.g., web search and social networks), SimRank has an undesirable trait, i.e., “zero-similarity”: it accommodates only the paths of equal length from a common “center” node, whereas a large portion of other paths are fully ignored. In this paper, we propose an effective and scalable similarity model, SimRank*, to remedy this problem. (1) We first provide a sufficient and necessary condition of the “zero-similarity” problem that exists in Jeh and Widom’s SimRank model, Li et al.’s SimRank model, Random Walk with Restart (RWR), and ASCOS++. (2) We next present our treatment, SimRank*, which can resolve this issue while inheriting the merit of the simple SimRank philosophy. (3) We reduce the series form of SimRank* to a closed form, which looks simpler than SimRank but which enriches semantics without suffering from increased computational overhead. This leads to an iterative form of SimRank*, which requires O(Knm) time and O(n^2) memory for computing all n^2 pairs of similarities on a graph of n nodes and m edges for K iterations. (4) To improve the computational time of SimRank* further, we leverage a novel clustering strategy via edge concentration. Due to its NP-hardness, we devise an efficient heuristic to speed up all-pairs SimRank* computation to O(Knm̃) time, where m̃ is generally much smaller than m. (5) To scale SimRank* on billion-edge graphs, we propose two memory-efficient single-source algorithms, i.e., ss-gSR* for geometric SimRank*, and ss-eSR* for exponential SimRank*, which can retrieve similarities between all n nodes and a given query on an as-needed basis. This significantly reduces the O(n^2) memory of all-pairs search to either O(Kn+m̃) for geometric SimRank*, or O(n+m̃) for exponential SimRank*, without any loss of accuracy, where m̃ ≪ n^2. (6) We also compare SimRank* with another remedy of SimRank that adds self-loops on each node and demonstrate that SimRank* is more effective. (7) Using real and synthetic datasets, we empirically verify the richer semantics of SimRank*, and validate its high computational efficiency and scalability on large graphs with billions of edges.
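The "zero-similarity" trait described in this abstract can be illustrated with a minimal sketch of the original Jeh and Widom SimRank iteration. This is not SimRank* and not the authors' code; the function name `simrank`, the decay factor `c=0.8`, and the two toy graphs are assumptions chosen for illustration only.

```python
from itertools import product


def simrank(nodes, edges, c=0.8, iterations=10):
    """Naive all-pairs SimRank (Jeh and Widom form), for small graphs only.

    nodes: list of hashable node ids; edges: list of directed (u, v) pairs.
    c is the decay factor (0.8 is an assumed, commonly used value).
    """
    in_nbrs = {v: [] for v in nodes}
    for u, v in edges:
        in_nbrs[v].append(u)

    # Start from the identity: every node is fully similar to itself only.
    sim = {(a, b): 1.0 if a == b else 0.0 for a, b in product(nodes, repeat=2)}

    for _ in range(iterations):
        new = {}
        for a, b in product(nodes, repeat=2):
            if a == b:
                new[(a, b)] = 1.0
            elif not in_nbrs[a] or not in_nbrs[b]:
                # A node without incoming edges is similar to nothing else.
                new[(a, b)] = 0.0
            else:
                total = sum(sim[(i, j)] for i in in_nbrs[a] for j in in_nbrs[b])
                new[(a, b)] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        sim = new
    return sim


# Toy graph 1: a -> b and a -> c. Both paths from the common source "a"
# have length 1, so b and c receive a nonzero score.
star = simrank(["a", "b", "c"], [("a", "b"), ("a", "c")])

# Toy graph 2: a -> b -> c. The paths from "a" to b and to c have unequal
# lengths (1 and 2), so plain SimRank scores the pair (b, c) as zero.
chain = simrank(["a", "b", "c"], [("a", "b"), ("b", "c")])

print(star[("b", "c")], chain[("b", "c")])
```

In the second toy graph, `b` and `c` are both reachable from `a`, yet the pair scores zero because the two paths differ in length. This is the situation the abstract says SimRank* is designed to remedy.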
Sequence queries on temporal graphs
Graphs that evolve over time are called temporal graphs. They can be used to describe and represent real-world networks, including transportation networks, social networks, and communication networks, with higher fidelity and accuracy. However, research is still limited on how to manage large-scale temporal graphs and execute queries over these graphs efficiently and effectively. This thesis investigates the problems of temporal graph data management related to node and edge sequence queries. In temporal graphs, nodes and edges can evolve over time. Therefore, sequence queries on nodes and edges can be key components in managing temporal graphs. In this thesis, the node sequence query decomposes into two parts: graph node similarity and subsequence matching. For node similarity, this thesis proposes a modified tree edit distance that is metric and polynomially computable and has a natural, intuitive interpretation. Note that the proposed node similarity works even for inter-graph nodes and therefore can be used for graph de-anonymization, network transfer learning, and cross-network mining, among other tasks. The subsequence matching query proposed in this thesis is a framework that can be adopted to index generic sequence and time-series data, including trajectory data and even DNA sequences for subsequence retrieval. For edge sequence queries, this thesis proposes an efficient storage and optimized indexing technique that allows for efficient retrieval of temporal subgraphs that satisfy certain temporal predicates. For this problem, this thesis develops a lightweight data management engine prototype that can support time-sensitive temporal graph analytics efficiently even on a single PC.
Computational Approaches for Estimating Life Cycle Inventory Data
Data gaps in life cycle inventory (LCI) are stumbling blocks for investigating the life cycle performance and impact of emerging technologies. It can be tedious, expensive and time consuming for LCI practitioners to collect LCI data or to wait for experimental data to become available. I propose a computational approach to estimate missing LCI data using link prediction techniques in network science. LCI data in Ecoinvent 3.1 is used to test the method. The proposed approach is based on the similarities between different processes or environmental interventions in the LCI database. By comparing two processes’ material inputs and emission outputs, I measure the similarity of these processes. I hypothesize that similar processes tend to have similar material inputs and emission outputs, which are the life cycle inventory data I want to estimate. In particular, I measure similarity using four metrics, including average difference, Pearson correlation coefficient, Euclidean distance, and SimRank, with or without data normalization. I test these four metrics and the normalization method for their performance in estimating missing LCI data. The results show that processes in the same industrial classification have higher similarities, which validates the approach of measuring the similarity between unit processes. I remove a small set of data (from one data point to 50) for each process and then use the rest of the LCI data to train the model for estimating the removed data. It is found that approximately 80% of removed data can be successfully estimated with less than 10% error. This study is the first attempt in the search for an effective computational method for estimating missing LCI data. It is anticipated that this approach will significantly transform LCI compilation and LCA studies in the future. Master of Science, Natural Resources and Environment, University of Michigan. http://deepblue.lib.umich.edu/bitstream/2027.42/134693/3/Cai_Jiarui_Document.pd
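Two of the four similarity metrics named in this abstract, the Pearson correlation coefficient and Euclidean distance, can be sketched as follows. This is an illustrative sketch rather than the thesis's implementation. The function names and the example process vectors are hypothetical, and the abstract does not specify how processes are encoded or normalized.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("Pearson correlation is undefined for a constant vector")
    return cov / math.sqrt(var_x * var_y)


def euclidean(x, y):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# Hypothetical unit processes, each described by the amounts of the same
# three flows (material inputs or emission outputs). Values are made up.
process_a = [1.0, 2.0, 3.0]
process_b = [2.0, 4.0, 6.0]

print(pearson(process_a, process_b))
print(euclidean(process_a, process_b))
```

The two metrics respond differently to scale: `process_b` is exactly twice `process_a`, so the correlation is perfect while the Euclidean distance is nonzero. That difference is one reason a study might compare results with and without data normalization.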
struc2gauss: Structural role preserving network embedding via Gaussian embedding
Network embedding (NE) plays a principal role in network mining, due to its ability to map nodes into efficient low-dimensional embedding vectors. However, two major limitations exist in state-of-the-art NE methods: role preservation and uncertainty modeling. Almost all previous methods represent a node as a point in space and focus on local structural information, i.e., neighborhood information. However, neighborhood information does not capture global structural information, and a point vector representation fails to model the uncertainty of node representations. In this paper, we propose a new NE framework, struc2gauss, which learns node representations in the space of Gaussian distributions and performs network embedding based on global structural information. struc2gauss first employs a given node similarity metric to measure the global structural information, then generates structural context for nodes, and finally learns node representations via Gaussian embedding. Different structural similarity measures of networks and energy functions of Gaussian embedding are investigated. Experiments conducted on real-world networks demonstrate that struc2gauss effectively captures global structural information where state-of-the-art network embedding methods fail to do so, outperforms other methods on structure-based clustering and classification tasks, and provides more information on the uncertainties of node representations.
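The abstract does not name the energy functions it investigates. As a hedged illustration, one measure commonly used to compare Gaussian node representations is the KL divergence between two diagonal-covariance Gaussians. The sketch below assumes that choice; the function name and example values are hypothetical and not taken from struc2gauss.

```python
import math


def kl_diag_gaussians(mu0, var0, mu1, var1):
    """KL(N0 || N1) for Gaussians with diagonal covariance.

    mu0, mu1: mean vectors; var0, var1: per-dimension variances (all > 0).
    """
    total = 0.0
    for m0, v0, m1, v1 in zip(mu0, var0, mu1, var1):
        total += v0 / v1 + (m1 - m0) ** 2 / v1 - 1.0 + math.log(v1 / v0)
    return 0.5 * total


# Hypothetical 2-dimensional node embeddings: a mean vector plus a variance
# vector, where a larger variance stands for a more uncertain representation.
node_u = ([0.0, 0.0], [1.0, 1.0])
node_v = ([1.0, 0.0], [1.0, 4.0])

print(kl_diag_gaussians(*node_u, *node_v))
print(kl_diag_gaussians(*node_v, *node_u))  # KL divergence is asymmetric
```

Unlike a distance between point vectors, this quantity depends on the variances as well as the means, which is how a Gaussian representation can carry uncertainty information.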