91 research outputs found

    Integration of Link and Semantic Relations for Information Recommendation

    Get PDF
    Information services on the Internet are being used as an important tool to facilitate discovery of the information that is of user interests. Many approaches have been proposed to discover the information on the Internet, while the search engines are the most common ones. However, most of the current approaches of information discovery can discover the keyword-matching information only but cannot recommend the most recent and relative information to users automatically. Sometimes users can give only a fuzzy keyword instead of an accurate one. Thus, some desired information would be ignored by the search engines. Moreover, the current search engines cannot discover the latent but logically relevant information or services for users. This paper measures the semantic-similarity and link-similarity between keywords. Based on that, it introduces the concept of similarity of web pages, and presents a method for information recommendation. The experimental evaluation and comparisons with the existing studies are finally performed

    VERSE: Versatile Graph Embeddings from Similarity Measures

    Full text link
    Embedding a web-scale information network into a low-dimensional vector space facilitates tasks such as link prediction, classification, and visualization. Past research has addressed the problem of extracting such embeddings by adopting methods from words to graphs, without defining a clearly comprehensible graph-related objective. Yet, as we show, the objectives used in past works implicitly utilize similarity measures among graph nodes. In this paper, we carry the similarity orientation of previous works to its logical conclusion; we propose VERtex Similarity Embeddings (VERSE), a simple, versatile, and memory-efficient method that derives graph embeddings explicitly calibrated to preserve the distributions of a selected vertex-to-vertex similarity measure. VERSE learns such embeddings by training a single-layer neural network. While its default, scalable version does so via sampling similarity information, we also develop a variant using the full information per vertex. Our experimental study on standard benchmarks and real-world datasets demonstrates that VERSE, instantiated with diverse similarity measures, outperforms state-of-the-art methods in terms of precision and recall in major data mining tasks and supersedes them in time and space efficiency, while the scalable sampling-based variant achieves equally good results as the non-scalable full variant.Comment: In WWW 2018: The Web Conference. 10 pages, 5 figure

    SimRank*: effective and scalable pairwise similarity search based on graph topology

    Get PDF
    Given a graph, how can we quantify similarity between two nodes in an effective and scalable way? SimRank is an attractive measure of pairwise similarity based on graph topologies. Its underpinning philosophy that “two nodes are similar if they are pointed to (have incoming edges) from similar nodes” can be regarded as an aggregation of similarities based on incoming paths. Despite its popularity in various applications (e.g., web search and social networks), SimRank has an undesirable trait, i.e., “zero-similarity”: it accommodates only the paths of equal length from a common “center” node, whereas a large portion of other paths are fully ignored. In this paper, we propose an effective and scalable similarity model, SimRank*, to remedy this problem. (1) We first provide a sufficient and necessary condition of the “zero-similarity” problem that exists in Jeh and Widom’s SimRank model, Li et al. ’s SimRank model, Random Walk with Restart (RWR), and ASCOS++. (2) We next present our treatment, SimRank*, which can resolve this issue while inheriting the merit of the simple SimRank philosophy. (3) We reduce the series form of SimRank* to a closed form, which looks simpler than SimRank but which enriches semantics without suffering from increased computational overhead. This leads to an iterative form of SimRank*, which requires O(Knm) time and O(n2) memory for computing all (n2) pairs of similarities on a graph of n nodes and m edges for K iterations. (4) To improve the computational time of SimRank* further, we leverage a novel clustering strategy via edge concentration. Due to its NP-hardness, we devise an efficient heuristic to speed up all-pairs SimRank* computation to O(Knm~) time, where m~ is generally much smaller than m. (5) To scale SimRank* on billion-edge graphs, we propose two memory-efficient single-source algorithms, i.e., ss-gSR* for geometric SimRank*, and ss-eSR* for exponential SimRank*, which can retrieve similarities between all n nodes and a given query on an as-needed basis. This significantly reduces the O(n2) memory of all-pairs search to either O(Kn+m~) for geometric SimRank*, or O(n+m~) for exponential SimRank*, without any loss of accuracy, where m~≪n2 . (6) We also compare SimRank* with another remedy of SimRank that adds self-loops on each node and demonstrate that SimRank* is more effective. (7) Using real and synthetic datasets, we empirically verify the richer semantics of SimRank*, and validate its high computational efficiency and scalability on large graphs with billions of edges

    Sequence queries on temporal graphs

    Get PDF
    Graphs that evolve over time are called temporal graphs. They can be used to describe and represent real-world networks, including transportation networks, social networks, and communication networks, with higher fidelity and accuracy. However, research is still limited on how to manage large scale temporal graphs and execute queries over these graphs efficiently and effectively. This thesis investigates the problems of temporal graph data management related to node and edge sequence queries. In temporal graphs, nodes and edges can evolve over time. Therefore, sequence queries on nodes and edges can be key components in managing temporal graphs. In this thesis, the node sequence query decomposes into two parts: graph node similarity and subsequence matching. For node similarity, this thesis proposes a modified tree edit distance that is metric and polynomially computable and has a natural, intuitive interpretation. Note that the proposed node similarity works even for inter-graph nodes and therefore can be used for graph de-anonymization, network transfer learning, and cross-network mining, among other tasks. The subsequence matching query proposed in this thesis is a framework that can be adopted to index generic sequence and time-series data, including trajectory data and even DNA sequences for subsequence retrieval. For edge sequence queries, this thesis proposes an efficient storage and optimized indexing technique that allows for efficient retrieval of temporal subgraphs that satisfy certain temporal predicates. For this problem, this thesis develops a lightweight data management engine prototype that can support time-sensitive temporal graph analytics efficiently even on a single PC

    Computational Approaches for Estimating Life Cycle Inventory Data

    Full text link
    Data gaps in life cycle inventory (LCI) are stumbling blocks for investigating the life cycle performance and impact of emerging technologies. It can be tedious, expensive and time consuming for LCI practitioners to collect LCI data or to wait for experime ntal data become available. I propose a computational approach to estimate missing LCI data using link prediction techniques in network science. LCI data in E coinvent 3.1 is used to test the method. The proposed approach is based on the similarities between different processes or environmental intervention s in the LCI database. By comparing two processes’ material inputs and emission outputs, I measure the similarity of these processes. I hypothesize that similar processes tend to have similar material inputs and emission outputs which are life cycle inventory data I want to estimate. In particular, I measure similarity using four metrics, including average difference, Pearson correlation coefficient, Euclidean di stance, and SimRank with or without data normalization . I test these four metrics and normalization method for their performance of estimating missing LCI data. The results show that processes in the same industrial classification have higher similarities, which validat e the approach of measuring the similarity between unit processes. I remove a small set of data (from one data point to 50) for each process and then use the rest of LCI data as to train the model for estimating the removed data. I t is found that approximately 80% of removed data can be successfully estimated with less than 10% errors. This st udy is the first attempt in the searching for an effective computational method for estimating missing LCI data. I t is anticipate d that this approach wil l significantly transform LCI compilation and LCA studies in future.Master of ScienceNatural Resources and EnvironmentUniversity of Michiganhttp://deepblue.lib.umich.edu/bitstream/2027.42/134693/3/Cai_Jiarui_Document.pd

    struc2gauss: Structural role preserving network embedding via Gaussian embedding

    Get PDF
    Network embedding (NE) is playing a principal role in network mining, due to its ability to map nodes into efficient low-dimensional embedding vectors. However, two major limitations exist in state-of-the-art NE methods: role preservation and uncertainty modeling. Almost all previous methods represent a node into a point in space and focus on local structural information, i.e., neighborhood information. However, neighborhood information does not capture global structural information and point vector representation fails in modeling the uncertainty of node representations. In this paper, we propose a new NE framework, struc2gauss, which learns node representations in the space of Gaussian distributions and performs network embedding based on global structural information. struc2gauss first employs a given node similarity metric to measure the global structural information, then generates structural context for nodes and finally learns node representations via Gaussian embedding. Different structural similarity measures of networks and energy functions of Gaussian embedding are investigated. Experiments conducted on real-world networks demonstrate that struc2gauss effectively captures global structural information while state-of-the-art network embedding methods fail to, outperforms other methods on the structure-based clustering and classification task and provides more information on uncertainties of node representations
    corecore