15 research outputs found

    Rule-based deduplication of article records from bibliographic databases

    We recently designed and deployed a metasearch engine, Metta, that sends queries and retrieves search results from five leading biomedical databases: PubMed, EMBASE, CINAHL, PsycINFO and the Cochrane Central Register of Controlled Trials. Because many articles are indexed in more than one of these databases, it is desirable to deduplicate the retrieved article records. This is not a trivial problem because data fields contain a lot of missing and erroneous entries, and because certain types of information are recorded differently (and inconsistently) in the different databases. The present report describes our rule-based method for deduplicating article records across databases and includes an open-source script module that can be deployed freely. Metta was designed to satisfy the particular needs of people who are writing systematic reviews in evidence-based medicine. These users want the highest possible recall in retrieval, so it is important to err on the side of not deduplicating any records that refer to distinct articles, and it is important to perform deduplication online in real time. Our deduplication module is designed with these constraints in mind. Articles that share the same publication year are compared sequentially on parameters including PubMed ID number, digital object identifier, journal name, article title and author list, using text approximation techniques. In a review of Metta searches carried out by public users, we found that the deduplication module was more effective at identifying duplicates than EndNote without making any erroneous assignments
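
    The report's open-source script module is not reproduced in this abstract; the sketch below is a minimal, hypothetical illustration of a year-blocked, rule-based comparison in Python. The Record fields, the similarity thresholds, and the use of difflib.SequenceMatcher as the text-approximation step are assumptions for illustration, not the Metta module's actual logic.

```python
# Hypothetical sketch of year-blocked, rule-based deduplication (not the Metta module itself).
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Record:
    pmid: Optional[str]      # PubMed ID, if the source database provides one
    doi: Optional[str]       # digital object identifier
    journal: str
    title: str
    authors: str             # e.g. "Smith J; Jones A"
    year: Optional[int]


def _similar(a: str, b: str, threshold: float) -> bool:
    """Approximate string match; the thresholds used below are illustrative assumptions."""
    a, b = a.lower().strip(), b.lower().strip()
    return bool(a) and bool(b) and SequenceMatcher(None, a, b).ratio() >= threshold


def is_duplicate(r1: Record, r2: Record) -> bool:
    """Conservative rule cascade: only records sharing a publication year are compared,
    and any disagreement on a reliable identifier rules the pair out."""
    if r1.year != r2.year:                      # blocking step on publication year
        return False
    if r1.pmid and r2.pmid:                     # exact identifiers decide when both are present
        return r1.pmid == r2.pmid
    if r1.doi and r2.doi:
        return r1.doi.lower() == r2.doi.lower()
    # Otherwise fall back to approximate matching on bibliographic fields;
    # all of them must agree so that distinct articles are never merged.
    return (_similar(r1.journal, r2.journal, 0.80)
            and _similar(r1.title, r2.title, 0.90)
            and _similar(r1.authors, r2.authors, 0.75))
```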

    A Suite of Record Normalization Methods, from Naive Ones to Those That Globally Mine a Group of Duplicate Records

    The promise of Big Data hinges on addressing several big data integration challenges, such as record linkage at scale, real-time data fusion, and integration of the Deep Web. Although much work has been devoted to these problems, there is limited work on creating a uniform, standard record from a group of records corresponding to the same real-world entity. We refer to this task as record normalization. Such a record representation, called the normalized record, is important for both front-end and back-end applications. In this paper, we formalize the record normalization problem and present an in-depth analysis of normalization granularity levels (e.g., record, field, and value component) and of normalization forms (e.g., common versus complete). We propose a comprehensive framework for computing the normalized record. The proposed framework includes a suite of record normalization methods, from naive ones, which use only the information gathered from the records themselves, to sophisticated approaches, which globally mine a group of duplicate records before selecting a value for an attribute of the normalized record
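
    The paper's framework itself is not shown here; as a rough sketch of the naive end of the spectrum, the snippet below normalizes a cluster of duplicate records field by field, keeping the most frequent non-empty value and breaking ties by length. The field names, the toy cluster, and the tie-breaking rule are assumptions for illustration.

```python
# Naive field-level normalization: pick, per field, the most frequent non-empty
# value in a cluster of duplicate records (ties broken by the longer value).
from collections import Counter
from typing import Dict, List


def normalize_cluster(duplicates: List[Dict[str, str]]) -> Dict[str, str]:
    fields = {f for rec in duplicates for f in rec}
    normalized = {}
    for field in sorted(fields):
        values = [rec[field].strip() for rec in duplicates if rec.get(field, "").strip()]
        if not values:
            normalized[field] = ""
            continue
        counts = Counter(values)
        # Most frequent value wins; among equally frequent values prefer the longest,
        # on the assumption that it is the most complete variant.
        normalized[field] = max(values, key=lambda v: (counts[v], len(v)))
    return normalized


# Example: three duplicate records for the same article.
cluster = [
    {"title": "Record normalization methods", "journal": "Inf. Syst."},
    {"title": "Record Normalization Methods", "journal": "Information Systems"},
    {"title": "Record Normalization Methods", "journal": "Information Systems"},
]
print(normalize_cluster(cluster))
# {'journal': 'Information Systems', 'title': 'Record Normalization Methods'}
```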

    srBERT: automatic article classification model for systematic review using BERT

    Background Systematic reviews (SRs) are recognized as reliable evidence, which enables evidence-based medicine to be applied to clinical practice. However, owing to the significant effort required for an SR, its creation is time-consuming, which often leads to out-of-date results. To support SR tasks, tools for automating these tasks have been considered; however, applying a general natural language processing model to domain-specific articles, and the insufficient text data available for training, pose challenges. Methods The research objective is to automate the classification of included articles using the Bidirectional Encoder Representations from Transformers (BERT) algorithm. In particular, srBERT models based on the BERT algorithm are pre-trained using abstracts of articles from two types of datasets, and the resulting model is then fine-tuned using the article titles. The performance of our proposed models is compared with that of existing general machine-learning models. Results Our results indicate that the proposed srBERTmy model, pre-trained with abstracts of articles and a generated vocabulary, achieved state-of-the-art performance in both classification and relation-extraction tasks; for the first task, it achieved an accuracy of 94.35% (89.38%), an F1 score of 66.12 (78.64), and an area under the receiver operating characteristic curve of 0.77 (0.9) on the original and (generated) datasets, respectively. In the second task, the model achieved an accuracy of 93.5% with a loss of 27%, thereby outperforming the other evaluated models, including the original BERT model. Conclusions Our research shows the possibility of automatic article classification using machine-learning approaches to support SR tasks and its broad applicability. However, because the performance of our model depends on the size and class ratio of the training dataset, it is important to secure a dataset of sufficient quality, which may pose challenges. The authors received no financial support for the research, authorship, and publication of this article
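
    srBERT itself is not included in this abstract; as a rough illustration of the fine-tuning step described above (classifying article titles for inclusion), the sketch below uses the Hugging Face transformers library with a generic BERT checkpoint. The checkpoint name, label set, hyperparameters, and toy titles are assumptions, not the authors' code or data.

```python
# Illustrative fine-tuning of a generic BERT checkpoint for include/exclude
# classification of article titles (not the authors' srBERT code or data).
import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

CHECKPOINT = "bert-base-uncased"   # assumption: any BERT-style checkpoint


class TitleDataset(Dataset):
    def __init__(self, titles, labels, tokenizer):
        self.enc = tokenizer(titles, truncation=True, padding=True,
                             max_length=64, return_tensors="pt")
        self.labels = torch.tensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item


tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)

# Toy data standing in for screened titles (1 = include in the review, 0 = exclude).
train = TitleDataset(
    ["Randomized trial of drug A versus placebo",
     "A survey of deep learning for image segmentation"],
    [1, 0],
    tokenizer,
)

args = TrainingArguments(output_dir="srbert-sketch", num_train_epochs=1,
                         per_device_train_batch_size=2, logging_steps=1)
trainer = Trainer(model=model, args=args, train_dataset=train)
trainer.train()
```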

    String similarity measures and the deduplication of records in bibliographic databases

    Thesis/Objective – The article presents a method for deduplicating/linking bibliographic records in databases based on string similarity metrics. The proposal draws on the author's own experience acquired while building a bibliographic database and conducting bibliometric research on data acquired from publicly available bibliographic databases. The formal description of the method is illustrated with data obtained from the CYTBIN database. Research methods – Developing the method required a review of the information architecture of selected Polish bibliographic databases and an identification of the problems that affect them, resulting not only from their data models but also from the construction of their graphical user interfaces. Several string similarity metrics were analyzed and some of them were used as components of the finally proposed compound method. The method enables the evaluation of bibliographic record similarity based on record attributes. Results – The results, presented for data acquired from the CYTBIN database, enabled the empirical verification of the proposed method. In addition, the author analyzed the similarity distribution of bibliographic records from the CYTBIN database calculated for the proposed method and for the Jaro-Winkler algorithm applied to the titles of bibliographic units. Conclusions – The proposed method, after adjusting its parameters to the specificity of selected bibliographic databases, can be used to improve the quality of bibliographic data. Depending on the performance of the computer system, a proactive model (verification before adding a given record to the database) and/or a reactive model (verification of all or only recently added records, performed for instance during periods of low system load at daily intervals) can be implemented
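
    The article's own parameters and code are not reproduced in this abstract; the sketch below illustrates the general idea of a compound, attribute-weighted record similarity built on the Jaro-Winkler measure mentioned above. The attribute weights, the decision to weight four fields, and the Winkler prefix scaling used here are assumptions, not the parameters proposed in the article.

```python
# Sketch of a compound, attribute-weighted record similarity built on Jaro-Winkler.
# The weights and field choices are illustrative, not the article's parameters.

def jaro(s1: str, s2: str) -> float:
    if s1 == s2:
        return 1.0
    n1, n2 = len(s1), len(s2)
    if not n1 or not n2:
        return 0.0
    window = max(max(n1, n2) // 2 - 1, 0)
    m1, m2 = [False] * n1, [False] * n2
    matches = 0
    for i, ch in enumerate(s1):                          # find matching characters
        for j in range(max(0, i - window), min(i + window + 1, n2)):
            if not m2[j] and s2[j] == ch:
                m1[i], m2[j] = True, True
                matches += 1
                break
    if not matches:
        return 0.0
    k = transpositions = 0                               # count transpositions
    for i in range(n1):
        if m1[i]:
            while not m2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    t = transpositions // 2
    return (matches / n1 + matches / n2 + (matches - t) / matches) / 3


def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a common prefix (up to 4 chars)."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def record_similarity(r1: dict, r2: dict) -> float:
    """Weighted combination over bibliographic attributes (weights are assumptions)."""
    weights = {"title": 0.5, "authors": 0.3, "journal": 0.1, "year": 0.1}
    score = 0.0
    for field, w in weights.items():
        a, b = str(r1.get(field, "")).lower(), str(r2.get(field, "")).lower()
        score += w * jaro_winkler(a, b)
    return score
```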

    Evidence surveillance to keep up to date with new research

    Evaluation of unique identifiers used as keys to match identical publications in Pure and SciVal:a case study from health science

    Unique identifiers (UIDs) are seen as an effective key for matching identical publications across databases or identifying duplicates within a database. The objective of the present study is to investigate how well UIDs work as match keys in the integration between Pure and SciVal, based on a case with publications from the health sciences. We evaluate the matching process based on information about coverage, precision, and the characteristics of publications matched versus not matched with UIDs as the match keys, and we analyze this information to detect errors, if any, in the matching process. As an example we also briefly discuss how publication sets formed by using UIDs as the match keys may affect the bibliometric indicators: number of publications, number of citations, and average number of citations per publication. The objective is addressed in a literature review and a case study. The literature review shows that only a few studies evaluate how well UIDs work as a match key. From the literature we identify four error types: duplicate digital object identifiers (DOIs), incorrect DOIs in reference lists and databases, DOIs not registered by the database where a bibliometric analysis is performed, and erroneous optical or special character recognition. The case study explores the use of UIDs in the integration between the databases Pure and SciVal; specifically, journal publications in English are matched between the two databases. We find all error types except erroneous optical or special character recognition in our publication sets. In particular, duplicate DOIs constitute a problem for the calculation of bibliometric indicators, as keeping the duplicates to improve the reliability of citation counts and deleting them to improve the reliability of publication counts will both distort the calculation of the average number of citations per publication. The use of UIDs as a match key in citation linking is implemented in many settings, and the availability of UIDs may become critical for the inclusion of a publication or a database in a bibliometric analysis
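
    No matching script accompanies this abstract; the sketch below illustrates the general idea of using a normalized DOI as the match key between two exported publication lists and of flagging duplicate DOIs, one of the error types the study identifies. The field names and the normalization rule are simplifying assumptions.

```python
# Sketch of matching two exported publication lists on a normalized DOI, while
# flagging duplicate DOIs, which the study identifies as a source of error.
from collections import defaultdict
from typing import Dict, List, Tuple


def normalize_doi(doi: str) -> str:
    """Lower-case and strip common resolver prefixes; a simplifying assumption."""
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi


def index_by_doi(pubs: List[Dict]) -> Dict[str, List[Dict]]:
    index = defaultdict(list)
    for pub in pubs:
        if pub.get("doi"):
            index[normalize_doi(pub["doi"])].append(pub)
    return index


def match_on_doi(pure_pubs: List[Dict], scival_pubs: List[Dict]) -> Tuple[list, list, list]:
    pure_idx, scival_idx = index_by_doi(pure_pubs), index_by_doi(scival_pubs)
    matched = sorted(set(pure_idx) & set(scival_idx))          # DOIs present in both exports
    unmatched_pure = sorted(set(pure_idx) - set(scival_idx))   # candidates for manual checking
    # DOIs attached to more than one record distort publication and citation counts.
    duplicate_dois = sorted(d for d, recs in pure_idx.items() if len(recs) > 1)
    return matched, unmatched_pure, duplicate_dois
```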

    Implementation of a functional prototype for integrating heterogeneous bibliographic databases

    Biblored (Bogotá's public library system), together with Fundalectura (Foundation for the Promotion of Reading), has identified the need for a tool that lets external users consult the bibliographic material available for loan and place reservations from a single application, even though the data reside in several independent, autonomous databases. The information queried covers Biblored's 4 major libraries, 4 local libraries and 9 neighbourhood libraries, as well as the Bibloestaciones. These catalogues belong to the main actors of the district public library system: Biblored and its libraries, Fundalectura with the Bibloestaciones, and the Secretaría de Cultura, Recreación y Deporte (Biblored), which administers the community libraries. The system must also be designed so that, in the short, medium or long term, further actors such as community libraries or specialized collections can join it. The problem to solve is the integration of these heterogeneous bibliographic databases, which is addressed as follows. The first part presents the directions in which the academic community and industry are working on catalogue integration, within the framework defined by the system requirements agreed with Biblored and Fundalectura. The second part reviews the main techniques for enterprise application integration (EAI) and the processes of extraction, transformation and loading of data (ETL), as well as the two main integration mechanisms: centralized unification with indexing, and metasearch. The advantages of each method, and how they can be combined, are presented, leading to a functional and sustainable prototype
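
    The prototype itself is not shown here; as a rough sketch of the metasearch mechanism mentioned above, the snippet below queries several independent catalogue sources in parallel and merges the results on a crude key. The CatalogueSource interface, the stub catalogues, and the ISBN-based merge rule are hypothetical, not Biblored's or Fundalectura's systems.

```python
# Minimal metasearch sketch: query several independent catalogue databases in
# parallel and merge the results. The interface and merge key are hypothetical.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# A "source" is any callable that takes a query string and returns record dicts.
CatalogueSource = Callable[[str], List[Dict]]


def metasearch(query: str, sources: Dict[str, CatalogueSource]) -> List[Dict]:
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(search, query) for name, search in sources.items()}
        merged: Dict[str, Dict] = {}
        for name, future in futures.items():
            for rec in future.result():
                key = rec.get("isbn") or f"{name}:{rec.get('id')}"   # crude merge key
                merged.setdefault(key, {**rec, "sources": []})
                merged[key]["sources"].append(name)
    return list(merged.values())


# Usage with two stubbed catalogues standing in for independent library databases.
def biblored_stub(q):
    return [{"id": "b1", "isbn": "9780140449136", "title": "La Odisea"}]

def bibloestaciones_stub(q):
    return [{"id": "s7", "isbn": "9780140449136", "title": "La Odisea"}]

print(metasearch("odisea", {"biblored": biblored_stub,
                            "bibloestaciones": bibloestaciones_stub}))
```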

    Automating Systematic Reviews
