Whois? Deep Author Name Disambiguation using Bibliographic Data
As the number of authors is increasing exponentially over years, the number
of authors sharing the same names is increasing proportionally. This makes it
challenging to assign newly published papers to their adequate authors.
Therefore, Author Name Ambiguity (ANA) is considered a critical open problem in
digital libraries. This paper proposes an Author Name Disambiguation (AND)
approach that links author names to their real-world entities by leveraging
their co-authors and domain of research. To this end, we use a collection from
the DBLP repository that contains more than 5 million bibliographic records
authored by around 2.6 million co-authors. Our approach first groups authors
who share the same last names and same first name initials. The author within
each group is identified by capturing the relation with his/her co-authors and
area of research, which is represented by the titles of the validated
publications of the corresponding author. To this end, we train a neural
network model that learns from the representations of the co-authors and
titles. We validated the effectiveness of our approach by conducting extensive
experiments on a large dataset. Comment: Accepted for publication @ TPDL202
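The grouping step described in the abstract above (same last name, same first-name initial) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `blocking_key` and `group_by_block`, the sample names, and the assumption that names are written as "First [Middle] Last" are all hypothetical.

```python
from collections import defaultdict


def blocking_key(full_name: str) -> str:
    """Build a 'lastname_initial' key, assuming 'First [Middle] Last' order."""
    parts = full_name.strip().lower().split()
    if not parts:
        return ""
    # Last token is taken as the last name, first character of the first
    # token as the first-name initial (an assumption about name layout).
    return f"{parts[-1]}_{parts[0][0]}"


def group_by_block(names):
    """Group author name strings that share the same blocking key."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)
    return dict(blocks)


# Hypothetical sample names, not taken from DBLP.
names = ["Wei Wang", "Wen Wang", "Anna Smith", "W. Wang"]
blocks = group_by_block(names)
print(blocks)
```

Each resulting group holds the candidates that a later model would have to tell apart, for example using co-author and title representations as the abstract describes.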
Name Disambiguation from link data in a collaboration graph using temporal and topological features
In a social community, multiple persons may share the same name, phone number
or some other identifying attributes. This, along with other phenomena, such as
name abbreviation, name misspelling, and human error leads to erroneous
aggregation of records of multiple persons under a single reference. Such
mistakes affect the performance of document retrieval, web search, database
integration, and more importantly, improper attribution of credit (or blame).
The task of entity disambiguation partitions the records belonging to multiple
persons with the objective that each decomposed partition is composed of
records of a unique person. Existing solutions to this task use either
biographical attributes, or auxiliary features that are collected from external
sources, such as Wikipedia. However, for many scenarios, such auxiliary
features are not available, or they are costly to obtain. Moreover, collecting
biographical or external data carries the risk of privacy
violation. In this work, we propose a method for solving the entity disambiguation
task using link information obtained from a collaboration network. Our method
does not intrude on privacy, as it uses only the time-stamped graph topology of an
anonymized network. Experimental results on two real-life academic
collaboration networks show that the proposed method has satisfactory
performance. Comment: The short version of this paper has been accepted to ASONAM 201
Author Matching Classification with Anomaly Detection Approach for Bibliomethric Repository Data
Author name disambiguation (AND) is a complex problem in the process of identifying an author in a digital library (DL). AND classification depends heavily on the grouping process and on the data processing techniques applied before the data enters the classifier. In general, pre-processing for author matching relies on pairwise comparison and similarity. For a sufficiently large dataset, the pairwise technique used in this study combines the attributes in the AND dataset and defines a binary class for each author-matching combination: pairs of different authors are labeled 0 and pairs of the same author are labeled 1. This produces highly imbalanced data, in which class 0 makes up 98.9% of the data and class 1 only 1.1%. Class 1 can therefore be treated and processed as an anomaly within the whole dataset. For this reason, the study uses anomaly detection, with the Isolation Forest algorithm as the classifier. The reported accuracy reaches 99.5%.
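The pairwise labeling described above, and the class imbalance it produces, can be illustrated with a small sketch. The record IDs, author IDs, and the helper name `label_pairs` are hypothetical and not taken from the study; the Isolation Forest step itself is not shown here.

```python
from itertools import combinations


def label_pairs(records):
    """Label every record pair: 1 if both belong to the same author, else 0.

    `records` is a list of (record_id, author_id) tuples (hypothetical layout).
    """
    return [
        (id_a, id_b, 1 if author_a == author_b else 0)
        for (id_a, author_a), (id_b, author_b) in combinations(records, 2)
    ]


# Hypothetical toy data: five records, only two of them by the same author.
records = [("r1", "A"), ("r2", "A"), ("r3", "B"), ("r4", "C"), ("r5", "D")]
pairs = label_pairs(records)

# Share of matching pairs (class 1) among all pairs.
positive_share = sum(label for _, _, label in pairs) / len(pairs)
print(len(pairs), positive_share)
```

Because the number of pairs grows quadratically while matching pairs stay rare, class 1 quickly becomes a small minority, which is the motivation the abstract gives for treating it as an anomaly.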
Deep Neural Network Structure to Improve Individual Performance based Author Classification
This paper proposes an improved method for the author name disambiguation problem, covering both homonyms and synonyms. The prepared data consists of the distances between each pair of authors' attributes, computed with the Levenshtein distance. Using deep neural networks, we found large gains in performance. The results show an accuracy of 99.6% with a low number of hidden layers.
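As a rough illustration of the distance features mentioned above, the sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and applies it field by field to a pair of records. The record fields and the helper name `pair_features` are assumptions for illustration; the abstract does not specify which attributes were used.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def pair_features(rec_a: dict, rec_b: dict, fields):
    """One Levenshtein distance per attribute for a pair of author records."""
    return [levenshtein(rec_a[f], rec_b[f]) for f in fields]


# Hypothetical records; the field names are assumptions.
rec_a = {"name": "J. Smith", "affiliation": "MIT"}
rec_b = {"name": "John Smith", "affiliation": "MIT"}
features = pair_features(rec_a, rec_b, ["name", "affiliation"])
print(features)
```

A vector like `features` is the kind of input a pairwise classifier could consume, one row per candidate author pair.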
A Semantic Graph-Based Approach for Mining Common Topics From Multiple Asynchronous Text Streams
In the age of Web 2.0, a substantial amount of unstructured
content is distributed through multiple text streams in an
asynchronous fashion, which makes it increasingly difficult
to glean and distill useful information. An effective way to
explore the information in text streams is topic modelling,
which can further facilitate other applications such as search,
information browsing, and pattern mining. In this paper, we
propose a semantic graph based topic modelling approach
for structuring asynchronous text streams. Our model
integrates topic mining and time synchronization, two core
modules for addressing the problem, into a unified model.
Specifically, for handling the lexical gap issues, we use global
semantic graphs of each timestamp for capturing the
hidden interaction among entities from all the text streams.
For dealing with the source asynchronism problem, local
semantic graphs are employed to discover similar topics of
different entities that can be potentially separated by time
gaps. Our experiment on two real-world datasets shows that
the proposed model significantly outperforms the existing
ones.
Comparison and analysis of supervised machine learning algorithms
Intrusion detection is used when investigating a network for signs of infiltration. An intrusion detection system is designed to prevent unwanted access to the system. A number of researchers have employed data mining techniques to detect infiltrations in this field. This study proposes supervised machine learning algorithms based on distance measurements. In terms of detection rate, accuracy, false alarm rate, and Matthews correlation coefficient, the supervised machine learning techniques surpass other algorithms. The supervised machine learning algorithms also surpassed all other approaches in serial execution time.
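The evaluation measures named above can all be derived from a binary confusion matrix. The sketch below uses the standard definitions; the function name `detection_metrics` and the counts are hypothetical and are not results from the study.

```python
from math import sqrt


def detection_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard intrusion-detection metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    mcc_denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        # Detection rate: share of real attacks that were flagged (recall).
        "detection_rate": tp / (tp + fn) if (tp + fn) else 0.0,
        # False alarm rate: share of normal traffic flagged as attack.
        "false_alarm_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        # Matthews correlation coefficient, defined as 0 when undefined.
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


# Hypothetical counts chosen only to exercise the formulas.
metrics = detection_metrics(tp=90, fp=5, tn=95, fn=10)
print(metrics)
```

Unlike accuracy, the Matthews correlation coefficient stays informative when attack and normal classes are of very different sizes, which is a common reason for reporting it alongside the other three.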
NLP Driven Models for Automatically Generating Survey Articles for Scientific Topics.
This thesis presents new methods that use natural language processing (NLP) driven models for summarizing research in scientific fields. Given a topic query in the form of a text string, we present methods for finding research articles relevant to the topic as well as summarization algorithms that use lexical and discourse information present in the text of these articles to generate coherent and readable extractive summaries of past research on the topic. In addition to summarizing prior research, good survey articles should also forecast future trends. With this motivation, we present work on forecasting future impact of scientific publications using NLP driven features. PhD, Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/113407/1/rahuljha_1.pd
Entity-Oriented Search
This open access book covers all facets of entity-oriented search—where “search” can be interpreted in the broadest sense of information access—from a unified point of view, and provides a coherent and comprehensive overview of the state of the art. It represents the first synthesis of research in this broad and rapidly developing area. Selected topics are discussed in-depth, the goal being to establish fundamental techniques and methods as a basis for future research and development. Additional topics are treated at a survey level only, containing numerous pointers to the relevant literature. A roadmap for future research, based on open issues and challenges identified along the way, rounds out the book. The book is divided into three main parts, sandwiched between introductory and concluding chapters. The first two chapters introduce readers to the basic concepts, provide an overview of entity-oriented search tasks, and present the various types and sources of data that will be used throughout the book. Part I deals with the core task of entity ranking: given a textual query, possibly enriched with additional elements or structural hints, return a ranked list of entities. This core task is examined in a number of different variants, using both structured and unstructured data collections, and numerous query formulations. In turn, Part II is devoted to the role of entities in bridging unstructured and structured data. Part III explores how entities can enable search engines to understand the concepts, meaning, and intent behind the query that the user enters into the search box, and how they can provide rich and focused responses (as opposed to merely a list of documents)—a process known as semantic search. The final chapter concludes the book by discussing the limitations of current approaches, and suggesting directions for future research. Researchers and graduate students are the primary target audience of this book. 
A general background in information retrieval is sufficient to follow the material, including an understanding of basic probability and statistics concepts as well as a basic knowledge of machine learning concepts and supervised learning algorithms.