
    Overview of LiLAS 2020 -- Living Labs for Academic Search

    Academic Search is a timeless challenge that the field of Information Retrieval has been dealing with for many years. Even today, the search for academic material is a broad field of research that has recently turned to problems such as the COVID-19 pandemic. However, test collections and specialized data sets like CORD-19 only allow for system-oriented experiments, while the evaluation of algorithms in real-world environments is available only to researchers from industry. In LiLAS, we open up two academic search platforms to allow participating researchers to evaluate their systems in a Docker-based research environment. This overview paper describes the motivation, the infrastructure, and the two systems, LIVIVO and GESIS Search, that are part of this CLEF lab. (Comment: manuscript version of the CLEF 2020 proceedings paper.)

    In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central

    Motivation: Although full-text articles are provided by publishers in electronic formats, it remains a challenge to find related work beyond the title and abstract. Identifying related articles based on their abstracts is a good starting point; the process is straightforward and does not consume as many resources as full-text-based similarity would. However, further analyses may require an in-depth understanding of the full content: two articles with highly related abstracts can differ substantially in their full text. How similarity differs when considering title-and-abstract versus full-text, and which semantic similarity metric provides better results when dealing with full-text articles, are the main issues addressed in this manuscript.

    Methods: We benchmarked three similarity metrics, BM25, PMRA, and Cosine, to determine which one performs best when using concept-based annotations on full-text documents. We also evaluated variations in similarity values based on title-and-abstract against those relying on full-text. Our test dataset comprises the Genomics track article collection from the 2005 Text Retrieval Conference. Initially, we used entity recognition software to semantically annotate titles and abstracts as well as full text with concepts defined in the Unified Medical Language System (UMLS®). For each article, we created a document profile, i.e., a set of identified concepts with their term frequency and inverse document frequency; we then applied the similarity metrics to those document profiles. We considered correlation, precision, recall, and F1 to determine which similarity metric performs best with concept-based annotations. For those full-text articles available in PubMed Central Open Access (PMC-OA), we also performed dispersion analyses to understand how similarity varies when considering full-text articles.

    Results: We found that the PubMed Related Articles (PMRA) similarity metric is the most suitable for full-text articles annotated with UMLS concepts. For similarity values above 0.8, all metrics exhibited an F1 around 0.2 and a recall around 0.1; BM25 showed the highest precision, close to 1; in all cases the concept-based metrics performed better than the word-stem-based one. Our experiments show that similarity values vary when considering only title-and-abstract versus full-text similarity. Therefore, analyses based on full text become useful when a given research question requires going beyond title and abstract, particularly regarding connectivity across articles.

    Availability: Visualization available at ljgarcia.github.io/semsim.benchmark/; data available at http://dx.doi.org/10.5281/zenodo.13323. The authors acknowledge the support of the members of the Temporal Knowledge Bases Group at Universitat Jaume I. Funding: LJGC and AGC are both self-funded; RB is funded by the "Ministerio de Economía y Competitividad" under contract number TIN2011-24147.
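    As a rough illustration of the profile-based comparison this abstract describes, the sketch below computes the Cosine metric over two document profiles, i.e., maps from UMLS concept IDs to tf-idf weights. This is a minimal sketch, not the authors' code; the concept IDs and weights are invented for illustration.

    ```python
    # Minimal sketch: cosine similarity between two concept-based document
    # profiles (UMLS concept ID -> tf-idf weight). IDs and weights are made up.
    import math

    def cosine(profile_a: dict[str, float], profile_b: dict[str, float]) -> float:
        """Cosine similarity between two concept -> tf-idf weight maps."""
        shared = set(profile_a) & set(profile_b)
        dot = sum(profile_a[c] * profile_b[c] for c in shared)
        norm_a = math.sqrt(sum(w * w for w in profile_a.values()))
        norm_b = math.sqrt(sum(w * w for w in profile_b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    # Hypothetical profiles for two articles.
    doc1 = {"C0017337": 1.2, "C0033684": 0.4, "C0162326": 0.9}
    doc2 = {"C0017337": 0.8, "C0162326": 1.1, "C0600570": 0.3}
    print(f"cosine(doc1, doc2) = {cosine(doc1, doc2):.3f}")
    ```

    BM25 and PMRA can be computed over the same profiles; they differ in how term frequency and document length are weighted, which is exactly what the benchmark compares.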

    An open annotation ontology for science on web 3.0

    Background: There is currently a gap between the rich and expressive collection of published biomedical ontologies and the natural language text of the biomedical papers consumed daily by scientific researchers. The purpose of this paper is to provide an open, shareable structure for dynamic integration of biomedical domain ontologies with scientific documents, in the form of an Annotation Ontology (AO), thus closing this gap and enabling the application of formal biomedical ontologies directly to the literature as it emerges.

    Methods: Initial requirements for AO were elicited by analysis of integration needs between biomedical web communities and of needs for representing and integrating the results of biomedical text mining. An analysis of the strengths and weaknesses of previous efforts in this area was also performed. A series of increasingly refined annotation tools was then developed along with a metadata model in OWL, and deployed to users at a major pharmaceutical company and a major academic center for feedback and additional requirements. Further requirements and critiques of the model were also elicited through discussions with many colleagues and incorporated into the work.

    Results: This paper presents the Annotation Ontology (AO), an open ontology in OWL-DL for annotating scientific documents on the web. AO supports both human and algorithmic content annotation. It enables "stand-off" or independent metadata anchored to specific positions in a web document by any one of several methods; the document may be annotated but is not required to be under the update control of the annotator. AO contains a provenance model to support versioning and a set model for specifying groups and containers of annotations. AO is freely available under an open source license at http://purl.org/ao/, and extensive documentation including screencasts is available on AO's Google Code page: http://code.google.com/p/annotation-ontology/.

    Conclusions: The Annotation Ontology meets critical requirements for an open, freely shareable model in OWL of annotation metadata created against scientific documents on the Web. We believe AO can become a very useful common model for annotation metadata on web documents and will enable biomedical domain ontologies to be used widely to annotate the scientific literature. Potential collaborators and those with new relevant use cases are invited to contact the authors.
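    To make the "stand-off" idea concrete, here is a loose Python/rdflib sketch of an annotation that points at a position in a document without modifying the document itself. The AO namespace comes from the abstract, but the property names used below (ao:annotatesResource, ao:context, ao:body) are assumptions for illustration, not verified terms from the ontology.

    ```python
    # Hedged sketch of a stand-off annotation in the spirit of AO; property
    # names are assumptions based on the abstract, not verified AO terms.
    from rdflib import Graph, Literal, Namespace, RDF, URIRef

    AO = Namespace("http://purl.org/ao/")
    g = Graph()
    g.bind("ao", AO)

    doc = URIRef("http://example.org/papers/1234")    # the annotated document
    ann = URIRef("http://example.org/annotations/1")  # the annotation itself

    g.add((ann, RDF.type, AO.Annotation))
    g.add((ann, AO.annotatesResource, doc))            # stand-off link to the doc
    g.add((ann, AO.context, Literal("chars 120-138"))) # anchor inside the doc
    g.add((ann, AO.body, Literal("Mentions the BRCA1 gene")))

    print(g.serialize(format="turtle"))
    ```

    The key property of this design is visible in the triples: the document URI appears only as the object of a link, so the annotator never needs write access to the document itself.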

    Introducing the FAIR Principles for research software

    Research software is a fundamental and vital part of research, yet significant challenges to discoverability, productivity, quality, reproducibility, and sustainability exist. Improving the practice of scholarship is a common goal of the open science, open source, and FAIR (Findable, Accessible, Interoperable and Reusable) communities, and research software is now being understood as a type of digital object to which FAIR should be applied. This emergence reflects a maturing research community that better understands the crucial role of FAIR research software in maximising research value. The FAIR for Research Software (FAIR4RS) Working Group has adapted the FAIR Guiding Principles to create the FAIR Principles for Research Software (FAIR4RS Principles). The contents and context of the FAIR4RS Principles are summarised here to provide a basis for discussion of their adoption. Examples of implementation by organisations are provided to share information on how to maximise the value of research outputs, and to encourage others to amplify the importance and impact of this work.

    UniProt: the universal protein knowledgebase in 2021

    The aim of the UniProt Knowledgebase is to provide users with a comprehensive, high-quality and freely accessible set of protein sequences annotated with functional information. In this article, we describe significant updates that we have made over the last two years to the resource. The number of sequences in UniProtKB has risen to approximately 190 million, despite continued work to reduce sequence redundancy at the proteome level. We have adopted new methods of assessing proteome completeness and quality. We continue to extract detailed annotations from the literature to add to reviewed entries and supplement these in unreviewed entries with annotations provided by automated systems such as the newly implemented Association-Rule-Based Annotator (ARBA). We have developed a credit-based publication submission interface to allow the community to contribute publications and annotations to UniProt entries. We describe how UniProtKB responded to the COVID-19 pandemic through expert curation of relevant entries that were rapidly made available to the research community through a dedicated portal. UniProt resources are available under a CC-BY (4.0) license via the web at https://www.uniprot.org/
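    The abstract notes that UniProt resources are available via the web. As one hedged example of programmatic access, the snippet below queries UniProtKB's public REST search service; the endpoint (https://rest.uniprot.org/uniprotkb/search), query syntax, and JSON field names are assumptions based on the public API and should be checked against https://www.uniprot.org/help/api before use.

    ```python
    # Hedged example: querying UniProtKB's REST search API. Endpoint and
    # parameter names are assumptions; consult the UniProt API docs.
    import requests

    resp = requests.get(
        "https://rest.uniprot.org/uniprotkb/search",
        params={
            "query": "gene:BRCA1 AND organism_id:9606 AND reviewed:true",
            "format": "json",
            "size": 5,
        },
        timeout=30,
    )
    resp.raise_for_status()
    for entry in resp.json().get("results", []):
        print(entry["primaryAccession"])  # e.g. reviewed human BRCA1 entries
    ```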

    Connections across scientific publications based on semantic annotations

    The core information in scientific publications is encoded as natural language text in monolithic documents; it is therefore not well integrated with other structured and unstructured data resources. The text format requires additional processing to semantically interlink publications and ultimately reach interoperability of the contained data. Data infrastructures such as the Linked Open Data initiative, based on the Resource Description Framework, support the connectivity of data from scientific publications once the identification of concepts and relations has been achieved and the content has been interconnected semantically. In this manuscript we produce and analyze semantic annotations in scientific articles to investigate the interconnectivity across articles. In an initial experiment based on articles from PubMed Central, we demonstrate the means and results leading to this interconnectivity, using annotations of Medical Subject Headings concepts, Unified Medical Language System terms, and semantic abstractions of relations. We conclude that the different methods contribute to different types of relatedness between articles, which could later be used in recommendation systems based on semantic links across a network of scientific publications.
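    As a minimal sketch of annotation-based interconnectivity (not the authors' pipeline), the snippet below links articles by the overlap of their assigned concept identifiers, using Jaccard similarity over invented MeSH-style descriptor sets.

    ```python
    # Minimal sketch: relatedness between articles as Jaccard overlap of their
    # concept annotations (e.g. MeSH descriptor IDs). All IDs are illustrative.
    annotations = {
        "PMC1001": {"D009369", "D016158", "D011471"},
        "PMC1002": {"D009369", "D011471", "D063646"},
        "PMC1003": {"D003920", "D007333"},
    }

    def jaccard(a: set[str], b: set[str]) -> float:
        return len(a & b) / len(a | b) if a | b else 0.0

    articles = sorted(annotations)
    for i, x in enumerate(articles):
        for y in articles[i + 1:]:
            score = jaccard(annotations[x], annotations[y])
            if score > 0:
                print(f"{x} <-> {y}: {score:.2f}")  # candidate semantic link
    ```

    Different annotation sources (MeSH headings, UMLS terms, relation abstractions) plugged into the same scheme yield the different types of relatedness the abstract mentions.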

    Biolinks, datasets and algorithms supporting semantic-based distribution and similarity for scientific publications

    Background: Finding articles related to a publication of interest remains a challenge in the Life Sciences domain as the number of scientific publications grows day by day. Publication repositories such as PubMed and Elsevier provide lists of similar articles; there, similarity is commonly calculated based on the title, abstract, and some keywords assigned to articles. Here we present the datasets and algorithms used in Biolinks. Biolinks uses ontological concepts extracted from publications and makes it possible to calculate a distribution score according to semantic groups, as well as a semantic similarity based on either all identified annotations or narrowed to one or more particular semantic groups. Biolinks supports both title-and-abstract and full-text articles.

    Materials: In a previous work [1], 4,240 articles from the TREC-05 collection [2] were selected. The titles and abstracts of those 4,240 articles were annotated with Unified Medical Language System (UMLS) concepts; these annotations are referred to as our TA-dataset and correspond to the JSON files under the pubmed folder in the JSON-LD.zip file. Of those 4,240 articles, full text was available for only 62. The title-and-abstract annotations for those 62 articles, the TAFT-dataset, are located under the pubmed-pmc folder in the JSON-LD.zip file, which also contains the full-text annotations, the FT-dataset, under the pmc folder. The list of articles with title-and-abstract is found in the genomics.qrels.large.pubmed.onlyRelevants.titleAndAbstract.tsv file, while those with full text are recorded in the genomics.qrels.large.pmc.onlyRelevants.fullContent.tsv file. Here we include the annotations on title and abstract as well as those for full text for all our datasets (profiles.zip). We also provide the global similarity matrices (similarity.zip).

    Methods: The TA-dataset was used to calculate the Information Gain (IG) according to the UMLS semantic groups; see IG_umls_groups.PMID.xlsx. A new grouping is proposed for Biolinks; see biolinks_groups.tsv. The IG was calculated for the Biolinks groups as well (IG_biolinks_groups.PMID.xlsx), showing an improvement of around 5%. To assess the similarity metric with respect to the cohesion of the TREC-05 groups, we used Silhouette Coefficient analyses; an additional dataset, the Stem-TAFT-dataset, was used and compared to the TAFT and FT datasets. The Biolinks groups were used to calculate a semantic group distribution score for each article in all our datasets. A semantic similarity metric based on PubMed related articles [3] is also provided; the Biolinks groups can be used to narrow the similarity to one or more selected groups. All the corresponding algorithms are open access and available on GitHub under the Apache-2.0 license; a frozen version, biotea-io-parser-master.zip, is provided here. To facilitate the analysis of our datasets based on the annotations as well as the distribution and similarity scores, some web-based visualization components were created, all of them open access and available on GitHub under the Apache-2.0 license; frozen versions are provided here, see the files biotea-vis-annotation-master.zip, biotea-vis-similarity-master.zip, biotea-vis-tooltip-master.zip, and biotea-vis-topicDistribution-master.zip. These components are brought together by biotea-vis-biolinks-master.zip. A demo is provided at http://ljgarcia.github.io/biotea-biolinks/; this demo was built on top of GitHub Pages, and a frozen version of the gh-pages branch is provided here, see biotea-biolinks-gh-pages.zip.

    Conclusions: Biolinks assigns a weight to each semantic group based on the annotations extracted from either title-and-abstract or full-text articles. It also measures the similarity of a pair of documents using the semantic information. The distribution and similarity metrics can be narrowed to a subset of the semantic groups, enabling researchers to focus on what is most relevant to them.

    [1] Garcia Castro, L.J., R. Berlanga, and A. Garcia. In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central Open Access. Journal of Biomedical Informatics, 2015. 57: p. 204-218.
    [2] Text Retrieval Conference 2005 - Genomics Track. TREC-05 Genomics Track ad hoc relevance judgement. 2005 [cited 2016 23rd August]; Available from: http://trec.nist.gov/data/genomics/05/genomics.qrels.large.txt
    [3] Lin, J. and W.J. Wilbur. PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinformatics, 2007. 8(1): p. 423.
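    To make the distribution-score idea concrete, here is a small sketch under assumed data shapes; it is not the Biolinks code. Annotations are bucketed by semantic group, the distribution score is each group's share of an article's annotations, and similarity can be narrowed to one group by filtering the profile first. The group labels and tuple format below are assumptions.

    ```python
    # Hedged sketch of Biolinks-style group scores; group names and the
    # annotation format are assumptions, not the actual Biolinks data model.
    from collections import Counter

    # Hypothetical annotations: (UMLS concept ID, semantic group) pairs.
    article = [
        ("C0017337", "GENE"), ("C0033684", "CHEM"),
        ("C0012634", "DISO"), ("C0017337", "GENE"),
    ]

    def distribution_score(annotations: list[tuple[str, str]]) -> dict[str, float]:
        """Share of an article's annotations falling in each semantic group."""
        counts = Counter(group for _, group in annotations)
        total = sum(counts.values())
        return {group: n / total for group, n in counts.items()}

    def group_profile(annotations: list[tuple[str, str]], group: str) -> set[str]:
        """Concept set narrowed to one semantic group, for group-wise similarity."""
        return {cui for cui, g in annotations if g == group}

    print(distribution_score(article))    # e.g. {'GENE': 0.5, 'CHEM': 0.25, ...}
    print(group_profile(article, "GENE")) # feed into a pairwise similarity metric
    ```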