
    BERT WEAVER: Using WEight AVERaging to enable lifelong learning for transformer-based models in biomedical semantic search engines

    Recent developments in transfer learning have boosted advances in natural language processing tasks. Performance, however, depends on high-quality, manually annotated training data. In the biomedical domain in particular, it has been shown that a single training corpus is not enough to learn generic models that predict well on new data. To be usable in real-world applications, state-of-the-art models therefore need the ability of lifelong learning: improving performance as soon as new data become available, without retraining the whole model from scratch. We present WEAVER, a simple yet efficient post-processing method that infuses old knowledge into the new model, thereby reducing catastrophic forgetting. We show that applying WEAVER sequentially results in word embedding distributions similar to those obtained by combined training on all data at once, while being computationally more efficient. Because no data sharing is required, the method is also readily applicable to federated learning settings and can, for example, benefit the mining of electronic health records from different clinics.
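
    WEAVER's core operation, averaging the parameters of the previously trained model with those of the model fine-tuned on new data, can be sketched in a few lines. The following is a minimal illustration, assuming PyTorch state dicts with matching keys; the function name and the fixed interpolation weight are illustrative, not the exact WEAVER formulation.

```python
import torch

def weight_average(old_state, new_state, alpha=0.5):
    """Interpolate two model state dicts parameter-wise.

    alpha=0.5 gives a plain average; the weighting actually used by
    WEAVER may differ (illustrative assumption).
    """
    return {name: alpha * old_state[name] + (1.0 - alpha) * new_state[name]
            for name in old_state}

# Usage sketch: after fine-tuning `model` on a new corpus, fuse its weights
# with those saved from the previous task instead of retraining on all data.
# old_state = torch.load("model_task1.pt")
# model.load_state_dict(weight_average(old_state, model.state_dict()))
```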

    preVIEW: from a fast prototype towards a sustainable semantic search system for central access to COVID-19 preprints

    The current COVID-19 pandemic emphasizes the use of so-called preprints, a type of publication that is not subject to peer review. Due to the pandemic's global relevance, an immense number of COVID-19-related preprints appears every day. To help researchers find relevant information, we have developed the semantic search engine preVIEW, which currently integrates preprints from seven different preprint servers. For semantic indexing, we implemented various text mining components to tag, for example, diseases or SARS-CoV-2-specific proteins. While the service initially served as a prototype developed together with users, we present a re-engineering towards a sustainable semantic search system, which became inevitable due to the continuously growing number of preprint publications. This enables easy reuse of the components and allows rapid adaptation of the service to further user needs.
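
    As a rough illustration of how such semantic indexing can work, the sketch below tags entity mentions in abstract text via case-insensitive dictionary lookup. The mini-dictionary and class labels are hypothetical; preVIEW's actual text mining components and vocabularies are far more extensive.

```python
import re

# Hypothetical mini-dictionary; the real vocabularies (diseases,
# SARS-CoV-2-specific proteins, etc.) are much larger.
DICTIONARY = {
    "covid-19": "Disease",
    "sars-cov-2": "Virus",
    "spike protein": "Protein",
}

def tag(text):
    """Return (surface form, semantic class, offset) for each dictionary hit."""
    hits = []
    for term, label in DICTIONARY.items():
        for m in re.finditer(re.escape(term), text, flags=re.IGNORECASE):
            hits.append((m.group(), label, m.start()))
    return sorted(hits, key=lambda h: h[2])

print(tag("The spike protein of SARS-CoV-2 drives COVID-19 pathology."))
```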

    The Autoimmune Disease Database: a dynamically compiled literature-derived database

    BACKGROUND: Autoimmune diseases are disorders caused by an immune response directed against the body's own organs, tissues, and cells. In practice, more than 80 clinically distinct diseases, among them systemic lupus erythematosus and rheumatoid arthritis, are classified as autoimmune diseases. Although their etiology is unclear, these diseases share certain similarities at the molecular level, e.g. susceptibility regions on the chromosomes or the involvement of common genes. To gain an overview of these related diseases, a manual literature review is not feasible; it requires automated analysis of the more than 500,000 Medline documents related to autoimmune disorders. RESULTS: In this paper we present the first version of the Autoimmune Disease Database, to our knowledge the first comprehensive literature-based database covering all known or suspected autoimmune diseases. This dynamically compiled database allows researchers to link autoimmune diseases to candidate genes or proteins through the use of named entity recognition, which identifies genes/proteins in the corresponding Medline abstracts. The Autoimmune Disease Database covers 103 autoimmune disease concepts. This list was expanded to include synonyms and spelling variants, yielding a list of over 1,200 disease names. The current version of the database provides links to 541,690 abstracts and over 5,000 unique genes/proteins. CONCLUSION: The Autoimmune Disease Database provides researchers with a tool to navigate potential gene-disease relationships in Medline abstracts in the context of autoimmune diseases.
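
    The linking step described above can be pictured as a simple co-occurrence compilation: every abstract contributes its recognized disease and gene/protein mentions to a disease-to-gene map. The sketch below assumes pre-annotated abstracts; the documents and entity lists are invented for illustration.

```python
from collections import defaultdict

# Each abstract is assumed to arrive pre-annotated with the disease and
# gene/protein mentions found by a named entity recognizer (toy data).
abstracts = [
    {"diseases": ["rheumatoid arthritis"], "genes": ["PTPN22", "HLA-DRB1"]},
    {"diseases": ["systemic lupus erythematosus"], "genes": ["PTPN22"]},
]

links = defaultdict(set)
for doc in abstracts:
    for disease in doc["diseases"]:
        links[disease].update(doc["genes"])

for disease, genes in sorted(links.items()):
    print(disease, "->", sorted(genes))
```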

    ProMiner: rule-based protein and gene entity recognition

    Background: Identification of gene and protein names in biomedical text is a challenging task, as the corresponding nomenclature has evolved over time. This has led to multiple synonyms for individual genes and proteins, as well as names that may be ambiguous with other gene names or with general English words. The Gene List Task of the BioCreAtIvE challenge evaluation enables comparison of systems addressing the problem of protein and gene name identification on common benchmark data. Methods: The ProMiner system uses a pre-processed synonym dictionary to identify potential name occurrences in biomedical text and associate protein and gene database identifiers with the detected matches. It follows a rule-based approach, and its search algorithm is geared towards recognition of multi-word names [1]. To account for the large number of ambiguous synonyms in the considered organisms, the system has been extended to use specific variants of the detection procedure for highly ambiguous and case-sensitive synonyms. Based on all detected synonyms fo
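
    The detection procedure can be approximated by a longest-match dictionary scan over token sequences, with a separate case-sensitive list for short, ambiguous synonyms. The sketch below is a strong simplification of ProMiner's rule-based approach, and its dictionaries are illustrative.

```python
# Toy dictionaries: multi-word synonyms mapped to database identifiers, plus
# case-sensitive entries ("WAS" the gene must not match the word "was").
SYNONYMS = {("tumor", "necrosis", "factor"): "TNF", ("p53",): "TP53"}
CASE_SENSITIVE = {("WAS",): "WAS"}

def find_mentions(tokens):
    mentions, i = [], 0
    while i < len(tokens):
        match = None
        for n in range(min(3, len(tokens) - i), 0, -1):  # longest match first
            span = tuple(tokens[i:i + n])
            lowered = tuple(t.lower() for t in span)
            if span in CASE_SENSITIVE:
                match = (CASE_SENSITIVE[span], i, n)
            elif lowered in SYNONYMS:
                match = (SYNONYMS[lowered], i, n)
            if match:
                break
        if match:
            mentions.append(match)
        i += match[2] if match else 1
    return mentions

print(find_mentions("The tumor necrosis factor pathway and p53".split()))
```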

    Interactive cohort exploration for spinocerebellar ataxias using synthetic cohort data for visualization

    Motivation: Visualization of data is a crucial step towards understanding clinical data and deriving hypotheses from it. For clinicians, however, visualization often requires great effort due to a lack of technical knowledge about data handling and visualization. SCAview offers an easy-to-use solution with an intuitive design that enables various kinds of plotting functions. The aim was to provide an intuitive solution with a low entry barrier for clinical users: little to no onboarding is required before creating plots, while the complexity of the questions addressed can grow up to specific corner cases. To allow for an easy start and for testing SCAview, we incorporated a synthetic cohort dataset based on real data of rare neurological movement disorders: the most common autosomal-dominantly inherited spinocerebellar ataxias (SCAs) type 1, 2, 3, and 6 (SCA1, 2, 3, and 6). Methods: We created a Django-based backend application that serves the data to a React-based frontend, which uses Plotly for plotting. A synthetic cohort was created to deploy a version of SCAview without violating any data protection guidelines: we added normally distributed noise to the data, preventing re-identification while keeping distributions and general correlations. Results: This work presents SCAview, a user-friendly, interactive web-based service that enables data visualization in a clickable interface with intuitive graphical handling. The service is deployed and can be tested with a synthetic cohort created from a large, longitudinal dataset from observational studies in the most common SCAs.
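
    The synthesis step is essentially column-wise Gaussian noise. A minimal sketch, assuming a numeric pandas DataFrame and a noise level proportional to each column's standard deviation (the proportion is an illustrative choice, not the value used for SCAview):

```python
import numpy as np
import pandas as pd

def synthesize(cohort: pd.DataFrame, rel_scale: float = 0.1,
               seed: int = 42) -> pd.DataFrame:
    """Add normally distributed noise to every numeric column.

    Scaling the noise by a fraction of each column's standard deviation
    hinders re-identification while roughly preserving the columns'
    distributions and their correlations.
    """
    rng = np.random.default_rng(seed)
    synthetic = cohort.copy()
    for col in synthetic.select_dtypes("number"):
        synthetic[col] += rng.normal(0.0, rel_scale * cohort[col].std(),
                                     len(cohort))
    return synthetic
```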

    Patent Retrieval in Chemistry based on semantically tagged Named Entities

    Gurulingappa H, Müller B, Klinger R, et al. Patent Retrieval in Chemistry based on semantically tagged Named Entities. In: Voorhees EM, Buckland LP, eds. The Eighteenth Text REtrieval Conference (TREC 2009) Proceedings. Gaithersburg, Maryland, USA; 2009. This paper reports on the work conducted by Fraunhofer SCAI for the TREC Chemistry (TREC-CHEM) track 2009. The team of Fraunhofer SCAI participated in two tasks, namely Technology Survey and Prior Art Search. The core of the framework is an index of 1.2 million chemical patents provided as a data set by TREC. For the technology survey, three runs were submitted based on semantic dictionaries and noun phrases. For the prior art search task, several fields were introduced into the index that contained normalized noun phrases as well as biomedical and chemical entities. Altogether, 36 runs were submitted for this task, based on automatic querying with tokens, noun phrases, and entities along with different search strategies.
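
    A fielded index of this kind can be pictured as one inverted index per field, queried independently. The toy sketch below uses invented field names and documents; a production system would use a search engine rather than in-memory dictionaries.

```python
from collections import defaultdict

# One inverted index per field (tokens, noun phrases, chemical entities).
index = defaultdict(lambda: defaultdict(set))

def add(doc_id, field, terms):
    for term in terms:
        index[field][term.lower()].add(doc_id)

def search(field, term):
    return index[field][term.lower()]

add("US123", "tokens", ["catalytic", "hydrogenation"])
add("US123", "entities", ["palladium"])
add("US456", "entities", ["palladium", "ethanol"])

print(search("entities", "palladium"))  # {'US123', 'US456'}
```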

    Overview of BioCreative II gene normalization

    Background: The goal of the gene normalization task is to link genes or gene products mentioned in the literature to biological databases. This is a key step in an accurate search of the biological literature. It is a challenging task, even for the human expert: genes are often described rather than referred to by gene symbol and, confusingly, one gene name may refer to different genes (often from different organisms). For BioCreative II, the task was to list the Entrez Gene identifiers for human genes or gene products mentioned in PubMed/MEDLINE abstracts. We selected abstracts associated with articles previously curated for human genes. We provided 281 expert-annotated abstracts containing 684 gene identifiers for training, and a blind test set of 262 documents containing 785 identifiers, with a gold standard created by expert annotators. Inter-annotator agreement was measured at over 90%. Results: Twenty groups submitted one to three runs each, for a total of 54 runs. Three systems achieved F-measures (balanced precision and recall) between 0.80 and 0.81. Combining the system outputs using simple voting schemes and classifiers yielded improved results; the best composite system achieved an F-measure of 0.92 with 10-fold cross-validation. A 'maximum recall' system based on the pooled responses of all participants gave a recall of 0.97 (with precision 0.23), identifying 763 out of 785 identifiers. Conclusion: Major advances for the BioCreative II gene normalization task include broader participation (20 versus 8 teams) and a pooled system performance comparable to that of human experts, at over 90% agreement. These results show promise for tools linking the literature with biological databases.
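
    A simple voting scheme of the kind mentioned in the results can be sketched as follows: an identifier is accepted when at least a minimum number of systems return it. The threshold and the example identifiers are illustrative.

```python
from collections import Counter

def vote(system_outputs, min_votes=2):
    """Keep gene identifiers returned by at least `min_votes` systems.

    Raising `min_votes` trades recall for precision; the composite
    systems in the challenge tuned such schemes via cross-validation.
    """
    counts = Counter(gid for run in system_outputs for gid in set(run))
    return {gid for gid, n in counts.items() if n >= min_votes}

# Three hypothetical systems' Entrez Gene identifiers for one abstract:
runs = [{"7157", "1956"}, {"7157"}, {"7157", "672"}]
print(vote(runs))  # {'7157'}
```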

    Making COVID-19 research data more accessible: building a nationwide information infrastructure

    Public health research and epidemiological and clinical studies are necessary to better understand the COVID-19 pandemic and to take appropriate action; numerous research projects have therefore been initiated in Germany since early 2020. However, given the sheer amount of information, it is currently difficult to get an overview of the diverse research activities and their results. Within the initiative "National Research Data Infrastructure for Personal Health Data" (NFDI4Health), the COVID-19 task force creates easier access to SARS-CoV-2- and COVID-19-related clinical, epidemiological, and public health research data. In doing so, it follows the FAIR data principles (findable, accessible, interoperable, reusable), which are intended to expedite the communication of results. The task force's core work includes building a study portal with metadata, data collection instruments, study documents, study results, and publications, as well as a search engine for preprint publications. Further work covers a concept for linking research and routine data, services for improved handling of image data, and the application of standardized analysis routines for harmonized quality assessments. The infrastructure under construction facilitates the findability and handling of German COVID-19 research. The developments started within the NFDI4Health COVID-19 task force are reusable for other research topics, as the challenges addressed are generic to the findability and handling of research data.