Name Disambiguation from link data in a collaboration graph using temporal and topological features
In a social community, multiple persons may share the same name, phone number
or some other identifying attributes. This, along with other phenomena, such as
name abbreviation, name misspelling, and human error leads to erroneous
aggregation of records of multiple persons under a single reference. Such
mistakes affect the performance of document retrieval, web search, and database
integration and, more importantly, cause improper attribution of credit (or blame).
The task of entity disambiguation partitions the records belonging to multiple
persons with the objective that each decomposed partition is composed of
records of a unique person. Existing solutions to this task use either
biographical attributes, or auxiliary features that are collected from external
sources, such as Wikipedia. However, for many scenarios, such auxiliary
features are not available, or they are costly to obtain. Besides, attempting
to collect biographical or external data carries the risk of privacy
violation. In this work, we propose a method for solving the entity disambiguation
task using link information obtained from a collaboration network. Our method is
non-intrusive of privacy as it uses only the time-stamped graph topology of an
anonymized network. Experimental results on two real-life academic
collaboration networks show that the proposed method has satisfactory
performance. Comment: The short version of this paper has been accepted to ASONAM 201
Global geometric graph kernels and applications
This thesis explores the topics of graph kernels and classification of graphs. Graph kernels have received considerable attention in the last decade, in part because of their value in many practical applications, such as chemoinformatics and molecular biology, in which classification using graph kernels has become the standard model for several problems. Perhaps even more important is the inclusion of graph kernels in the rich field of kernel methods, making a large family of machine learning algorithms, including support vector machines, applicable to data naturally represented as graphs. Graph kernels are similarity functions defined on pairs of graphs. Traditionally, graph kernels compare graphs in terms of features of subgraphs such as walks, paths or tree patterns. For the kernels to remain computationally efficient, these subgraphs are often chosen to be small. Because of this fact, most graph kernels adopt an inherently local perspective on the graph and may fail to discern global properties, such as the girth or the chromatic number, that are not captured in local structure. Furthermore, existing work on graph kernels lacks results justifying a particular choice of kernel for a given application. In this thesis we propose two new graph kernels, designed to capture global properties of graphs, as described above. At the core of these kernels is the Lovász number, an important concept in graph theory with strong connections to graph properties like the chromatic number and the size of the largest clique. We give efficient sampling approximations to both kernels, allowing them to scale to large graphs. We also show that we can characterize the separation margin induced by these kernels in certain classification tasks. This serves as initial progress towards making theory aid kernel choice. We make an extensive empirical evaluation of both kernels on synthetic data with known global properties, and on real graphs frequently used to benchmark graph kernels.
Finally, we present a new application of graph kernels in the field of data mining by redefining an important subproblem of entity disambiguation as a graph classification problem. We show empirically that our proposed method improves on the state of the art.
Name Disambiguation in Anonymized Graphs using Network Embedding
In the real world, our DNA is unique but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval, web search, and more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task aims to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing solutions to this task substantially rely on feature engineering, such as biographical feature extraction, or construction of auxiliary features from Wikipedia. However, for many scenarios, such features may be costly to obtain or unavailable due to the risk of privacy violation. In this work, we propose a novel name disambiguation method. Our proposed method is non-intrusive of privacy because instead of using attributes pertaining to a real-life person, our method leverages only relational data in the form of anonymized graphs. In the methodological aspect, the proposed method uses a novel representation learning model to embed each document in a low-dimensional vector space where name disambiguation can be solved by a hierarchical agglomerative clustering algorithm. Our experimental results demonstrate that the proposed method is significantly better than the existing name disambiguation methods working in a similar setting.
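The final clustering step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the document embeddings are already available as plain coordinate tuples and uses average linkage with a distance threshold; the linkage rule, the threshold, and the sample vectors are illustrative assumptions, not details taken from the paper.

```python
import math

def average_linkage_clusters(vectors, threshold):
    """Repeatedly merge the two closest clusters until none is within threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def cluster_distance(a, b):
        # Average pairwise Euclidean distance between members of a and b.
        total = sum(math.dist(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = cluster_distance(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > threshold:
            break  # remaining clusters are too far apart to merge
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return [sorted(c) for c in clusters]

# Hypothetical 2-D embeddings of four documents that share one author name.
embeddings = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 4.9)]
partitions = average_linkage_clusters(embeddings, threshold=1.0)
```

Each returned partition lists the indices of documents attributed to one person. Here the two nearby pairs end up in separate partitions.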
Towards Name Disambiguation: Relational, Streaming, and Privacy-Preserving Text Data
In the real world, our DNA is unique but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval, web search, and more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task is designed to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing algorithms for this task mainly suffer from the following drawbacks. First, the majority of existing solutions substantially rely on feature engineering, such as biographical feature extraction, or construction of auxiliary features from Wikipedia. However, for many scenarios, such features may be costly to obtain or unavailable in privacy-sensitive domains. Instead, we solve the name disambiguation task in a restricted setting by leveraging only the relational data in the form of anonymized graphs. Second, most of the existing works for this task operate in a batch mode, where all records to be disambiguated are initially available to the algorithm. However, more realistic settings require that the name disambiguation task should be performed in an online streaming fashion in order to identify records of new ambiguous entities having no preexisting records. Finally, we investigate the potential disclosure risk of textual features used in name disambiguation and propose several algorithms to tackle the task in a privacy-aware scenario. In summary, in this dissertation, we present a number of novel approaches to address name disambiguation tasks from the above three aspects independently, namely relational, streaming, and privacy-preserving textual data.
Bayesian Non-Exhaustive Classification A Case Study: Online Name Disambiguation using Temporal Record Streams
The name entity disambiguation task aims to partition the records of multiple
real-life persons so that each partition contains records pertaining to a
unique person. Most of the existing solutions for this task operate in a batch
mode, where all records to be disambiguated are initially available to the
algorithm. However, more realistic settings require that the name
disambiguation task be performed in an online fashion, while also being
able to identify records of new ambiguous entities having no preexisting
records. In this work, we propose a Bayesian non-exhaustive classification
framework for solving the online name disambiguation task. Our proposed method uses
a Dirichlet process prior with a Normal * Normal * Inverse Wishart data model
which enables identification of new ambiguous entities who have no records in
the training data. For online classification, we use a one-sweep Gibbs sampler,
which is very efficient and effective. As a case study, we consider
bibliographic data in a temporal stream format and disambiguate authors by
partitioning their papers into homogeneous groups. Our experimental results
demonstrate that the proposed method is better than existing methods for
performing the online name disambiguation task. Comment: to appear in CIKM 201
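To see why a Dirichlet process prior leaves room for previously unseen entities, the sketch below computes the standard Chinese restaurant process prior for the next record. It covers the prior only. The data model named in the abstract (the likelihood term) is omitted, and the function name and the concentration value `alpha` are assumptions made for illustration.

```python
def crp_prior(cluster_sizes, alpha):
    """Prior probability that the next record joins each existing cluster or a new one.

    cluster_sizes: number of records already assigned to each known entity.
    alpha: concentration parameter (assumed value; larger favors new entities).
    """
    n = sum(cluster_sizes)
    existing = [size / (n + alpha) for size in cluster_sizes]
    new = alpha / (n + alpha)
    return existing, new

# Two known authors with 3 and 1 papers; alpha = 1.0 is an arbitrary choice.
existing_probs, new_prob = crp_prior([3, 1], alpha=1.0)
```

The probability of a new entity is never zero, so a record that matches no known entity can still open its own cluster. In a full model these prior weights would be multiplied by the likelihood of the record under each cluster.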
Learning with Geometric Embeddings of Graphs
Graphs are natural representations of problems and data in many fields. For example, in computational biology, interaction networks model the functional relationships between genes in living organisms; in the social sciences, graphs are used to represent friendships and business relations among people; in chemoinformatics, graphs represent atoms and molecular bonds. Fields like these are often rich in data, to the extent that manual analysis is not feasible and machine learning algorithms are necessary to exploit the wealth of available information. Unfortunately, in machine learning research, there is a huge bias in favor of algorithms operating only on continuous vector valued data, algorithms that are not suitable for the combinatorial structure of graphs. In this thesis, we show how to leverage both the expressive power of graphs and the strength of established machine learning tools by introducing methods that combine geometric embeddings of graphs with standard learning algorithms. We demonstrate the generality of this idea by developing embedding algorithms for both simple and weighted graphs and applying them in both supervised and unsupervised learning problems such as classification and clustering. Our results provide both theoretical support for the usefulness of graph embeddings in machine learning and empirical evidence showing that this framework is often more flexible and better performing than competing machine learning algorithms for graphs
Whois? Deep Author Name Disambiguation using Bibliographic Data
As the number of authors is increasing exponentially over the years, the number
of authors sharing the same names is increasing proportionally. This makes it
challenging to assign newly published papers to the correct authors.
Therefore, Author Name Ambiguity (ANA) is considered a critical open problem in
digital libraries. This paper proposes an Author Name Disambiguation (AND)
approach that links author names to their real-world entities by leveraging
their co-authors and domain of research. To this end, we use a collection from
the DBLP repository that contains more than 5 million bibliographic records
authored by around 2.6 million co-authors. Our approach first groups authors
who share the same last names and same first name initials. The author within
each group is identified by capturing the relation with his/her co-authors and
area of research, which is represented by the titles of the validated
publications of the corresponding author. To this end, we train a neural
network model that learns from the representations of the co-authors and
titles. We validated the effectiveness of our approach by conducting extensive
experiments on a large dataset. Comment: Accepted for publication @ TPDL202
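The first grouping step described in this abstract (same last name and same first-name initial) can be sketched as follows. The sketch assumes names arrive as "First ... Last" strings. The helper names and sample authors are hypothetical, and the neural model that resolves authors within each group is not shown.

```python
from collections import defaultdict

def blocking_key(full_name):
    """Return (last name, first initial), lowercased; assumes 'First ... Last' order."""
    parts = full_name.strip().split()
    first, last = parts[0], parts[-1]
    return (last.lower(), first[0].lower())

def group_by_blocking_key(names):
    """Group author name strings that share the same blocking key."""
    groups = defaultdict(list)
    for name in names:
        groups[blocking_key(name)].append(name)
    return dict(groups)

# Hypothetical author strings, for illustration only.
authors = ["Wei Wang", "Wen Wang", "W. Wang", "Anna Schmidt"]
groups = group_by_blocking_key(authors)
```

All three "Wang" variants land in one group, which is the ambiguous set a downstream model would then have to split into real-world persons.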
Generalized Shortest Path Kernel on Graphs
We consider the problem of classifying graphs using graph kernels. We define
a new graph kernel, called the generalized shortest path kernel, based on the
number and length of shortest paths between nodes. For our example
classification problem, we consider the task of classifying random graphs from
two well-known families, by the number of clusters they contain. We verify
empirically that the generalized shortest path kernel outperforms the original
shortest path kernel on a number of datasets. We give a theoretical analysis
for explaining our experimental results. In particular, we estimate
distributions of the expected feature vectors for the shortest path kernel and
the generalized shortest path kernel, and we show some evidence explaining why
our graph kernel outperforms the shortest path kernel for our graph
classification problem. Comment: Short version presented at Discovery Science 2015 in Banf
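One way to read "number and length of shortest paths" is sketched below. Each node pair is described by its shortest-path length and, optionally, by how many shortest paths connect it, and two graphs are compared by the inner product of the resulting histograms. This feature map is an assumption made for illustration and may differ from the paper's exact definition. Graphs are given as adjacency dictionaries of undirected, unweighted graphs.

```python
from collections import Counter, deque

def shortest_path_stats(adj, source):
    """BFS from source; map each reachable node to (distance, number of shortest paths)."""
    dist = {source: 0}
    count = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                count[v] = count[u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                count[v] += count[u]  # another shortest path reaches v via u
    return {v: (dist[v], count[v]) for v in dist if v != source}

def pair_histogram(adj, use_counts):
    """Histogram over unordered node pairs, keyed by length or by (length, count)."""
    hist = Counter()
    for s in adj:
        for t, stats in shortest_path_stats(adj, s).items():
            if s < t:  # count each unordered pair once
                hist[stats if use_counts else stats[0]] += 1
    return hist

def path_kernel(adj_a, adj_b, use_counts=True):
    """Inner product of the two pair histograms."""
    ha = pair_histogram(adj_a, use_counts)
    hb = pair_histogram(adj_b, use_counts)
    return sum(n * hb[key] for key, n in ha.items())

# A 4-cycle and a 4-node path, as small illustrative graphs.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
generalized = path_kernel(cycle, path, use_counts=True)
original = path_kernel(cycle, path, use_counts=False)
```

With `use_counts=False` the pairs at distance 2 in the cycle and in the path look identical. With `use_counts=True` they differ, because opposite nodes of the cycle are joined by two shortest paths, so the generalized variant separates the two graphs more sharply.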