9 research outputs found

    Bayesian Non-Exhaustive Classification A Case Study: Online Name Disambiguation using Temporal Record Streams

    Get PDF
    The name entity disambiguation task aims to partition the records of multiple real-life persons so that each partition contains records pertaining to a unique person. Most of the existing solutions for this task operate in a batch mode, where all records to be disambiguated are initially available to the algorithm. However, more realistic settings require that the name disambiguation task be performed in an online fashion, in addition to, being able to identify records of new ambiguous entities having no preexisting records. In this work, we propose a Bayesian non-exhaustive classification framework for solving online name disambiguation task. Our proposed method uses a Dirichlet process prior with a Normal * Normal * Inverse Wishart data model which enables identification of new ambiguous entities who have no records in the training data. For online classification, we use one sweep Gibbs sampler which is very efficient and effective. As a case study we consider bibliographic data in a temporal stream format and disambiguate authors by partitioning their papers into homogeneous groups. Our experimental results demonstrate that the proposed method is better than existing methods for performing online name disambiguation task.Comment: to appear in CIKM 201

    Name Disambiguation in Anonymized Graphs using Network Embedding

    Get PDF
    In real-world, our DNA is unique but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesake of one another. Such mistakes deteriorate the performance of document retrieval, web search, and more seriously, cause improper attribution of credit or blame in digital forensic. To resolve this issue, the name disambiguation task is designed which aims to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing solutions to this task substantially rely on feature engineering, such as biographical feature extraction, or construction of auxiliary features from Wikipedia. However, for many scenarios, such features may be costly to obtain or unavailable due to the risk of privacy violation. In this work, we propose a novel name disambiguation method. Our proposed method is non-intrusive of privacy because instead of using attributes pertaining to a real-life person, our method leverages only relational data in the form of anonymized graphs. In the methodological aspect, the proposed method uses a novel representation learning model to embed each document in a low dimensional vector space where name disambiguation can be solved by a hierarchical agglomerative clustering algorithm. Our experimental results demonstrate that the proposed method is significantly better than the existing name disambiguation methods working in a similar setting

    Analysis and Actions on Graph Data.

    Full text link
    Graphs are commonly used for representing relations between entities and handling data processing in various research fields, especially in social, cyber and physical networks. Many data mining and inference tasks can be interpreted as certain actions on the associated graphs, including graph spectral decompositions, and insertions and removals of nodes or edges. For instance, the task of graph clustering is to group similar nodes on a graph, and it can be solved by graph spectral decompositions. The task of cyber attack is to find effective node or edge removals that lead to maximal disruption in network connectivity. In this dissertation, we focus on the following topics in graph data analytics: (1) Fundamental limits of spectral algorithms for graph clustering in single-layer and multilayer graphs. (2) Efficient algorithms for actions on graphs, including graph spectral decompositions and insertions and removals of nodes or edges. (3) Applications to deep community detection, event propagation in online social networks, and topological network resilience for cyber security. For (1), we established fundamental principles governing the performance of graph clustering for both spectral clustering and spectral modularity methods, which play an important role in unsupervised learning and data science. The framework is then extended to multilayer graphs entailing heterogeneous connectivity information. For (2), we developed efficient algorithms for large-scale graph data analytics with theoretical guarantees, and proposed theory-driven methods for automatic model order selection in graph clustering. For (3), we proposed a disruptive method for discovering deep communities in graphs, developed a novel method for analyzing event propagation on Twitter, and devised effective graph-theoretic approaches against explicit and lateral attacks in cyber systems.PHDElectrical & Computer Eng PhDUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/135752/1/pinyu_1.pd

    Towards Name Disambiguation: Relational, Streaming, and Privacy-Preserving Text Data

    Get PDF
    In the real world, our DNA is unique but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval, web search, and more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task 1 is designed to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing algorithms for this task mainly suffer from the following drawbacks. First, the majority of existing solutions substantially rely on feature engineering, such as biographical feature extraction, or construction of auxiliary features from Wikipedia. However, for many scenarios, such features may be costly to obtain or unavailable in privacy sensitive domains. Instead we solve the name disambiguation task in restricted setting by leveraging only the relational data in the form of anonymized graphs. Second, most of the existing works for this task operate in a batch mode, where all records to be disambiguated are initially available to the algorithm. However, more realistic settings require that the name disambiguation task should be performed in an online streaming fashion in order to identify records of new ambiguous entities having no preexisting records. Finally, we investigate the potential disclosure risk of textual features used in name disambiguation and propose several algorithms to tackle the task in a privacy-aware scenario. In summary, in this dissertation, we present a number of novel approaches to address name disambiguation tasks from the above three aspects independently, namely relational, streaming, and privacy preserving textual data
    corecore