2,987 research outputs found

    LAGOS-AND: A Large Gold Standard Dataset for Scholarly Author Name Disambiguation

    In this paper, we present a method to automatically build large labeled datasets for the author ambiguity problem in the academic world by leveraging the authoritative academic resources ORCID and DOI. Using the method, we built LAGOS-AND, two large, gold-standard datasets for author name disambiguation (AND), of which LAGOS-AND-BLOCK is created for clustering-based AND research and LAGOS-AND-PAIRWISE is created for classification-based AND research. Our LAGOS-AND datasets are substantially different from the existing ones. The initial versions of the datasets (v1.0, released in February 2021) include 7.5M citations authored by 798K unique authors (LAGOS-AND-BLOCK) and close to 1M instances (LAGOS-AND-PAIRWISE). Both datasets show close similarities to the whole Microsoft Academic Graph (MAG) across validations of six facets. In building the datasets, we reveal the degree of variation of last names in three literature databases, PubMed, MAG, and Semantic Scholar, by comparing the author names they host with the authors' official last names shown on the ORCID pages. Furthermore, we evaluate several baseline disambiguation methods as well as MAG's author ID system on our datasets, and the evaluation helps identify several interesting findings. We hope the datasets and findings will bring new insights for future studies. The code and datasets are publicly available. Comment: 33 pages, 7 tables, 7 figures

    We are not alone! (at least, most of us). Homonymy in large scale social groups

    This article puts forward an estimate of the proportion of homonyms in large-scale groups, based on the distribution of first names and last names in a subset of these groups. The estimate rests on a generalization of the "birthday paradox problem". The main result is that, in societies such as France or the United States, identity collisions (based on first + last names) are frequent: the large majority of the population has at least one homonym. In smaller settings, homonymy is much less frequent: even if a small group of a few thousand people contains at least one pair of homonyms, only a few individuals have a homonym.
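The contrast drawn in this abstract (a group can very likely contain some pair of homonyms while few individuals have one) can be illustrated with a minimal sketch. It is not the article's estimator: it assumes names are drawn independently from a known frequency table, and it uses the standard Poisson approximation for the birthday-type collision probability. The function names and the toy name table are hypothetical.

```python
import math


def homonym_share(name_counts, group_size):
    """Expected share of group members who have at least one homonym.

    Assumes each of the `group_size` members draws a name independently
    from the empirical distribution given by `name_counts`.
    """
    total = sum(name_counts.values())
    share = 0.0
    for count in name_counts.values():
        p = count / total
        # A person with this name has a homonym unless all of the other
        # group_size - 1 members drew a different name.
        share += p * (1.0 - (1.0 - p) ** (group_size - 1))
    return share


def prob_any_collision(name_counts, group_size):
    """Approximate probability that at least one pair shares a name.

    Poisson approximation to the generalized birthday problem:
    1 - exp(-C(n, 2) * sum_i p_i^2).
    """
    total = sum(name_counts.values())
    sum_sq = sum((c / total) ** 2 for c in name_counts.values())
    pairs = group_size * (group_size - 1) / 2
    return 1.0 - math.exp(-pairs * sum_sq)


if __name__ == "__main__":
    # Toy table: 365 equally frequent names (the classic birthday setting).
    names = {f"name_{i}": 1 for i in range(365)}
    print(f"P(some pair of homonyms): {prob_any_collision(names, 23):.3f}")
    print(f"Share of members with a homonym: {homonym_share(names, 23):.3f}")
```

With 365 equally likely names and 23 people, some pair collides about half the time, yet only a small share of the members has a homonym. A real name distribution is heavily skewed, which raises both figures.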

    Knowledge management for TEXPROS

    Most document processing systems today apply AI technologies to support intelligent system behavior. For the application of AI technologies in such systems, the core problem is how to represent and manage different kinds of knowledge to support the functionalities of their inference engine components. In other words, knowledge management has become a critical issue in document processing systems. In this dissertation, within the scope of the TEXt PROcessing System (TEXPROS), we identify knowledge of various kinds that is applicable in the system. We investigate several problems of managing this knowledge and then develop a knowledge base for TEXPROS. In developing this knowledge base, we present approaches to representing and managing different kinds of knowledge to support the functionalities of its inference engine components. In TEXPROS, a dual-model paradigm, which contains the folder organization and the document type hierarchy, is used to represent and manage documents. We introduce a new System Catalog structure to represent and manage the knowledge for TEXPROS. This knowledge includes the system-level information of the folder organization and the document type hierarchy, and the operational-level information of the document base itself. A unified storage approach is employed to store both the operational-level and the system-level information. This storage houses the frame template base and the frame instance base. An enhanced two-level thesaurus model is presented in this dissertation. For dealing with special kinds of data in processing documents, a new structure, DataDomain, is presented, which supports the extended thesaurus functionalities, pattern recognition and data type operations. Based on the dual-model paradigm of TEXPROS, a concept of “Semantic Range” is presented to solve sense ambiguity problems. We also present approaches to implementing the general KeyTerm transformation and approximate term matching of TEXPROS. Finally, a new component, the “Registration Center”, at the knowledge management level of TEXPROS is presented. The registration center aims to help users handle knowledge packages for a specific working domain and to solve the knowledge porting problem for TEXPROS. The dissertation concludes with future research work.

    An ontology-based approach to systems biology literature retrieval and processing

    This paper details the SysBio Explorer, a Systems Biology Literature Retrieval and Processing Framework, which aims at the automatic inference of regulatory and metabolic networks from biomedical literature. The SysBio Explorer does not focus on any organism or problem in particular and encompasses a number of processing and analysis techniques. It works over full-text documents, applying Natural Language Processing techniques and using biomedical dictionaries and ontologies together with hand-made rules. Besides biological entity recognition and relation extraction, document classification, relevance assessment and authoring networks are also within its present scope. The framework is described in terms of its design requirements and implementation decisions, exposing current achievements but also highlighting present obstacles and future work. Experiments on real-world problems concerning the organisms E. coli, S. cerevisiae and H. pylori are used in its validation. Funding: Fundação para a Ciência e a Tecnologia (FCT) - POCI/BIO/60139/2004; POSI/PLP/43931/2001; POSC project POSC/339/1.3/C/NAC. Fundação para a Computação Científica Nacional

    In the Name of the Name: RDF Literals, ER Attributes, and the Potential to Rethink the Structures and Visualizations of Catalogs

    The aim of this study is to contribute to the field of machine-processable bibliographic data that is suitable for the Semantic Web. We examine the Entity Relationship (ER) model, which has been selected by IFLA as a “conceptual framework” in order to model the FR family (FRBR, FRAD, and RDA), and the problems ER causes as we move towards the Semantic Web. Subsequently, while maintaining the semantics of the aforementioned standards but rejecting the ER as a conceptual framework for bibliographic data, this paper builds on the RDF (Resource Description Framework) potential and documents how both the RDF and Linked Data’s rationale can affect the way we model bibliographic data. In this way, a new approach to bibliographic data emerges where the distinction between description and authorities is obsolete. Instead, the integration of the authorities with descriptive information becomes fundamental so that a network of correlations can be established between the entities and the names by which the entities are known. Naming is a vital issue for human cultures because names are not random sequences of characters or sounds that stand just as identifiers for the entities—they also have socio-cultural meanings and interpretations. Thus, instead of describing indivisible resources, we could describe entities that appear in a variety of names on various resources. In this study, a method is proposed to connect the names with the entities they represent and, in this way, to document the provenance of these names by connecting specific resources with specific names

    Studying PubMed usages in the field for complex problem solving: Implications for tool design

    Peer Reviewed
    http://deepblue.lib.umich.edu/bitstream/2027.42/97504/1/asi22796.pd

    A data driven approach to mapping urban neighbourhoods

    Neighbourhoods have been described by the UK Secretary of State for Communities and Local Government as the “building blocks of public service society”. Despite this, difficulties in data collection combined with the concept’s subjective nature have left most countries lacking official neighbourhood definitions. This issue has implications not only for policy, but for the field of computational social science as a whole (with many studies being forced to use administrative units as proxies despite the fact that these bear little connection to resident perceptions of social boundaries). In this paper we illustrate that the mass linguistic datasets now available on the internet need only be combined with relatively simple linguistic computational models to produce definitions that are not only probabilistic and dynamic, but do not require a priori knowledge of neighbourhood names

    Master of Science

    Thesis. ILIAD is a microcomputer-based program that has been developed to facilitate the construction and management of bibliographic knowledge bases. The contents of these knowledge bases are selected bibliographic citations to the literature. The contents are organized by knowledge models that permit access to and retrieval of relevant information. A knowledge structure referred to as a 'relation' is the fundamental unit of this knowledge model. It is shown that conceptual networks of relations may be built and used to perform high-quality, efficient information queries of the knowledge base. Quality and efficiency are assessed by the measures of recall and precision, which are functions of the general relevance of the information retrieved, and by time.
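The recall and precision measures named in this abstract can be sketched as follows. This is the standard set-based definition, not code from ILIAD; the function name and the citation identifiers are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = relevant items retrieved / all items retrieved
    recall    = relevant items retrieved / all relevant items
    Both are reported as 0.0 when their denominator is empty.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Hypothetical citation IDs returned by a query vs. those judged relevant.
    p, r = precision_recall(["c1", "c2", "c3", "c4"], ["c2", "c4", "c5"])
    print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the four retrieved citations are relevant (precision 0.5), and two of the three relevant citations were found (recall about 0.67).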