2,353 research outputs found

    Distantly Labeling Data for Large Scale Cross-Document Coreference

    Full text link
    Cross-document coreference, the problem of resolving entity mentions across multi-document collections, is crucial to automated knowledge base construction and data mining tasks. However, the scarcity of large labeled data sets has hindered supervised machine learning research for this task. In this paper we develop and demonstrate an approach based on ``distantly-labeling'' a data set from which we can train a discriminative cross-document coreference model. In particular we build a dataset of more than a million people mentions extracted from 3.5 years of New York Times articles, leverage Wikipedia for distant labeling with a generative model (and measure the reliability of such labeling); then we train and evaluate a conditional random field coreference model that has factors on cross-document entities as well as mention-pairs. This coreference model obtains high accuracy in resolving mentions and entities that are not present in the training data, indicating applicability to non-Wikipedia data. Given the large amount of data, our work is also an exercise demonstrating the scalability of our approach.Comment: 16 pages, submitted to ECML 201

    Author identification in bibliographic data using deep neural networks

    Get PDF
    Author name disambiguation (AND) is a challenging task for scholars who mine bibliographic information for scientific knowledge. A constructive approach for resolving name ambiguity is to use computer algorithms to identify author names. Some algorithm-based disambiguation methods have been developed by computer and data scientists. Among them, supervised machine learning has been stated to produce decent to very accurate disambiguation results. This paper presents a combination of principal component analysis (PCA) as a feature reduction and deep neural networks (DNNs), as a supervised algorithm for classifying AND problems. The raw data is grouped into four classes, i.e., synonyms, homonyms, homonyms-synonyms, and non-homonyms-synonyms classification. We have taken into account several hyperparameters tuning, such as learning rate, batch size, number of the neuron and hidden units, and analyzed their impact on the accuracy of results. To the best of our knowledge, there are no previous studies with such a scheme. The proposed DNNs are validated with other ML techniques such as Naïve Bayes, random forest (RF), and support vector machine (SVM) to produce a good classifier. By exploring the result in all data, our proposed DNNs classifier has an outperformed other ML technique, with accuracy, precision, recall, and F1-score, which is 99.98%, 97.98%, 97.86%, and 99.99%, respectively. In the future, this approach can be easily extended to any dataset and any bibliographic records provider

    Ariadne's Thread - Interactive Navigation in a World of Networked Information

    Full text link
    This work-in-progress paper introduces an interface for the interactive visual exploration of the context of queries using the ArticleFirst database, a product of OCLC. We describe a workflow which allows the user to browse live entities associated with 65 million articles. In the on-line interface, each query leads to a specific network representation of the most prevailing entities: topics (words), authors, journals and Dewey decimal classes linked to the set of terms in the query. This network represents the context of a query. Each of the network nodes is clickable: by clicking through, a user traverses a large space of articles along dimensions of authors, journals, Dewey classes and words simultaneously. We present different use cases of such an interface. This paper provides a link between the quest for maps of science and on-going debates in HCI about the use of interactive information visualisation to empower users in their search.Comment: CHI'15 Extended Abstracts, April 18-23, 2015, Seoul, Republic of Korea. ACM 978-1-4503-3146-3/15/0

    Whois? Deep Author Name Disambiguation using Bibliographic Data

    Full text link
    As the number of authors is increasing exponentially over years, the number of authors sharing the same names is increasing proportionally. This makes it challenging to assign newly published papers to their adequate authors. Therefore, Author Name Ambiguity (ANA) is considered a critical open problem in digital libraries. This paper proposes an Author Name Disambiguation (AND) approach that links author names to their real-world entities by leveraging their co-authors and domain of research. To this end, we use a collection from the DBLP repository that contains more than 5 million bibliographic records authored by around 2.6 million co-authors. Our approach first groups authors who share the same last names and same first name initials. The author within each group is identified by capturing the relation with his/her co-authors and area of research, which is represented by the titles of the validated publications of the corresponding author. To this end, we train a neural network model that learns from the representations of the co-authors and titles. We validated the effectiveness of our approach by conducting extensive experiments on a large dataset.Comment: Accepted for publication @ TPDL202

    Towards Name Disambiguation: Relational, Streaming, and Privacy-Preserving Text Data

    Get PDF
    In the real world, our DNA is unique but many people share names. This phenomenon often causes erroneous aggregation of documents of multiple persons who are namesakes of one another. Such mistakes deteriorate the performance of document retrieval, web search, and more seriously, cause improper attribution of credit or blame in digital forensics. To resolve this issue, the name disambiguation task 1 is designed to partition the documents associated with a name reference such that each partition contains documents pertaining to a unique real-life person. Existing algorithms for this task mainly suffer from the following drawbacks. First, the majority of existing solutions substantially rely on feature engineering, such as biographical feature extraction, or construction of auxiliary features from Wikipedia. However, for many scenarios, such features may be costly to obtain or unavailable in privacy sensitive domains. Instead we solve the name disambiguation task in restricted setting by leveraging only the relational data in the form of anonymized graphs. Second, most of the existing works for this task operate in a batch mode, where all records to be disambiguated are initially available to the algorithm. However, more realistic settings require that the name disambiguation task should be performed in an online streaming fashion in order to identify records of new ambiguous entities having no preexisting records. Finally, we investigate the potential disclosure risk of textual features used in name disambiguation and propose several algorithms to tackle the task in a privacy-aware scenario. In summary, in this dissertation, we present a number of novel approaches to address name disambiguation tasks from the above three aspects independently, namely relational, streaming, and privacy preserving textual data
    • …
    corecore