
    The strength of co-authorship in gene name disambiguation

    Background: A biomedical entity mention in articles and other free texts is often ambiguous. For example, 13% of gene names (aliases) may refer to more than one gene. The task of Gene Symbol Disambiguation (GSD), a special case of Word Sense Disambiguation (WSD), is to assign a unique gene identifier to every identified gene name alias in biology-related articles. Supervised and unsupervised machine learning WSD techniques have been applied in the biomedical field with promising results. Here we examine, through graph-based semi-supervised methods, how the GSD task can exploit one of the special features of biological articles: the authors of the documents are known.

    Results: Our key hypothesis is that a biologist refers to each particular gene by a fixed gene alias, and that this holds for the co-authors as well. To make use of the co-authorship information we built the inverse co-author graph on MedLine abstracts: its nodes are articles, and there is an edge between two nodes if and only if the two articles have a mutual author. We introduce two methods that use graph-based distances between abstracts for the GSD task. We found that a disambiguation decision can be made in 85% of cases, with an extremely high (99.5%) precision rate, using only information obtained from the inverse co-author graph. To attain full coverage we incorporated the co-authorship information into two GSD systems; in our experiments this procedure achieved precision of 94.3%, 98.85%, 96.05% and 99.63% on the human, mouse, fly and yeast GSD evaluation sets, respectively.

    Conclusion: Based on the promising results obtained so far, we suggest that co-authorship information and the circumstances of an article's release (such as the journal title and year of publication) can be a crucial building block of any sophisticated similarity measure among biological articles, and hence the methods introduced here should be useful for other biomedical natural language processing tasks (such as organism or target disease detection) as well.
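    The inverse co-author graph and the distance-based decision rule are concrete enough to sketch. The Python below is a minimal illustration of that idea, not the authors' implementation; the function names, the BFS depth cut-off and the abstention rule are assumptions made here.

```python
from collections import defaultdict, deque

def build_inverse_coauthor_graph(article_authors):
    """Nodes are article IDs; two articles are connected iff they share
    at least one author (the abstract's inverse co-author graph)."""
    by_author = defaultdict(set)
    for article, authors in article_authors.items():
        for author in authors:
            by_author[author].add(article)
    graph = defaultdict(set)
    for articles in by_author.values():
        for a in articles:
            graph[a] |= articles - {a}
    return graph

def graph_distance(graph, source, target, max_depth=3):
    """BFS shortest-path distance between two articles; None if the
    target is unreachable within max_depth hops (assumed cut-off)."""
    if source == target:
        return 0
    seen, frontier = {source}, deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth >= max_depth:
            continue
        for neighbour in graph[node]:
            if neighbour == target:
                return depth + 1
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return None

def disambiguate(graph, query, labelled):
    """Pick the gene identifier of the closest labelled article; abstain
    (return None) on ties or when nothing is reachable."""
    by_distance = defaultdict(set)
    for article, gene_id in labelled.items():
        d = graph_distance(graph, query, article)
        if d is not None:
            by_distance[d].add(gene_id)
    if not by_distance:
        return None
    closest = by_distance[min(by_distance)]
    return closest.pop() if len(closest) == 1 else None

# Toy usage with invented article IDs and gene identifiers:
papers = {"p1": {"Smith", "Lee"}, "p2": {"Lee"}, "p3": {"Khan"}}
g = build_inverse_coauthor_graph(papers)
print(disambiguate(g, "p2", {"p1": "geneA", "p3": "geneB"}))  # -> geneA
```

    Abstaining on ties or unreachable cases mirrors the paper's design: the graph alone decides 85% of cases, and a fallback text-based GSD system supplies the remaining coverage.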

    Evaluation of SpliceAI for Improved Genetic Variant Classification in Inherited Ophthalmic Disease Genes

    A dissertation submitted by Melissa Jean Reeves in partial fulfillment of the requirements for the degree of Doctor of Philosophy at Virginia Commonwealth University, 2023. Major Director: Melissa Jamerson, PhD, MLS(ASCP), Associate Professor, Department of Medical Laboratory Sciences.

    Inherited ophthalmic diseases impact individuals around the globe. Inherited retinal diseases (IRDs) are the leading cause of blindness in individuals aged 15 to 45, and the personal, social, and economic impact of vision loss is profound. Because symptoms vary between individuals, some diseases are difficult to diagnose on phenotype alone, and clinicians often seek genetic testing to confirm clinical diagnoses when other avenues have failed. Clinical laboratories use all available data, such as frequency, population, or computational data, to evaluate genetic variants and determine their classification. When a laboratory lacks sufficient evidence to classify a variant as pathogenic or benign at the time of testing, the variant may be classified as being of uncertain significance. Because inherited retinal diseases are considered rare, treatments are limited and mostly offered through clinical trials. Clinical trials often impose stringent inclusion and exclusion criteria to ensure the optimal outcome for the study, so patients usually must have definitive genetic results to qualify; a variant of uncertain significance would likely disqualify an individual from a clinical trial.

    Functional assays, such as the minigene assay, have been used extensively, and with relative ease, across multiple genes and diseases. This study aimed to investigate a novel methodology for the minigene assay and to establish the sensitivity of SpliceAI for predicting synonymous splice effects in variants with a SpliceAI change (∆) score ≥ 0.8 in inherited ophthalmic disease genes. The study used the “P” (process) component of the Structure-Process-Outcome (SPO) Donabedian model to evaluate the addition of the minigene assay to the clinical testing workflow, and it highlights the importance of using a well-validated framework, such as Donabedian’s, in conjunction with clinical laboratory quality improvements.

    Of the 617 synonymous variants in 20 ophthalmic disease genes targeted in the database, 86 synonymous variants in 14 genes scored ≥ 0.8. Twenty synonymous variants in two ophthalmic disease genes (ABCA4 and CHD7) were selected for this preliminary study, and twenty wildtype and variant pairs were assessed using the novel minigene test to review splice outcomes. The study established that this novel minigene test can be used in a clinical laboratory as part of the clinical testing pipeline. Of the 20 variants targeted, 14 could be evaluated by minigene; six did not produce high-quality data and will need to be repeated. Eleven of the 14 variants reviewed showed aberrant splice effects through the minigene assay, matching the SpliceAI prediction; three matched the wildtype transcript and were therefore considered discordant. Based on these results, the sensitivity of SpliceAI for predicting splice effects in synonymous variants in inherited ophthalmic diseases is approximately 79%, slightly less than the expected 80%. The shortfall in sensitivity is likely due to the small sample size of this study. A Fisher’s exact test evaluating the concordance between minigene outcomes and SpliceAI predictions yielded a p-value of 0.2222, indicating no statistically significant difference between the two. These results indicate that SpliceAI has a predictive efficiency of 79% in ophthalmic disease genes, well below what a clinical laboratory would need (> 95%) to rely on it alone for variant classification. Though the predictive efficiency is less than expected, this preliminary study offers insight into the predictive value of SpliceAI for synonymous variants in inherited ophthalmic disease genes, and it introduces a novel minigene method that other clinical laboratories can reliably use across other diseases and genes.
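    The headline numbers above can be reproduced in a few lines. The following Python sketch is illustrative only: the variant IDs are invented, and since the abstract reports the Fisher’s exact p-value (0.2222) but not the underlying 2×2 table, the table used here is simply one arrangement of the reported counts that reproduces that p-value.

```python
from scipy.stats import fisher_exact

# Illustrative records: (variant, max SpliceAI delta score). The study
# screened 617 synonymous variants in 20 genes; these IDs are made up.
variants = [
    ("ABCA4:c.100A>G", 0.91),
    ("CHD7:c.200C>T", 0.85),
    ("ABCA4:c.300G>A", 0.42),
]

# Step 1: the study's inclusion filter -- keep delta scores >= 0.8.
candidates = [v for v, score in variants if score >= 0.8]
print(len(candidates), "variant(s) pass the filter")

# Step 2: sensitivity as reported -- concordant / evaluable variants.
concordant, evaluable = 11, 14
print(f"sensitivity = {concordant / evaluable:.1%}")   # 78.6%, i.e. ~79%

# Step 3: Fisher's exact test. The abstract gives p = 0.2222 but not the
# 2x2 table; the table below (minigene outcome vs. SpliceAI prediction)
# is one arrangement of the reported counts that matches that p-value.
_, p = fisher_exact([[11, 3], [14, 0]])
print(f"p = {p:.4f}")                                  # 0.2222
```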

    Ontology-based knowledge management for technology intensive industries


    Information management applied to bioinformatics

    Bioinformatics, the discipline concerned with biological information management, is essential in the post-genome era, where the complexity of data processing allows for contemporaneous multi-level research, including research at the genome, transcriptome, proteome and metabolome levels, and the integration of these -omic studies towards an understanding of biology at the systems level. This research is also having a major impact on disease research and drug discovery, particularly through pharmacogenomics studies. In this study, innovative resources were generated via two case studies: one of the Research & Development Genetics (RDG) department at AstraZeneca, Alderley Park, and the other of the Pharmacogenomics Group at the Sanger Institute in Cambridge, UK. In the AstraZeneca case study, senior scientists were interviewed using semi-structured interviews to determine information behaviour through the study of scientific workflows. Document analysis was used to generate an understanding of the underpinning concepts and formed one of the sources of context-dependent information on which the interview questions were based. The objectives of the Sanger Institute case study were slightly different: interviews were carried out with eight scientists, together with participant observation, to collect data for developing a database standard for one process of their pharmacogenomics workflow.

    The results indicated that AstraZeneca would benefit from upgrading their data management solutions in the laboratory and from developing resources for the storage of data from larger-scale projects such as whole genome scans; these projects will also generate very large amounts of data, whose analysis will require more sophisticated statistical methods. At the Sanger Institute, a minimum information standard was reported for the manual design of primers and included in a decision-making tree developed for Polymerase Chain Reactions (PCRs). This tree also illustrates problems that can be encountered when designing primers, along with procedures that can be taken to address such issues.
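    As a rough illustration of what such a minimum information standard might capture, here is a hypothetical primer record in Python. The field set is invented for illustration; the thesis defines the actual standard, which the abstract does not reproduce.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class PrimerRecord:
    """Hypothetical 'minimum information' record for a manually designed
    PCR primer; the real field set is defined in the thesis."""
    primer_id: str
    target_gene: str
    sequence: str              # 5'->3' nucleotide sequence
    melting_temp_c: float      # predicted Tm, degrees Celsius
    designed_by: str
    notes: Optional[str] = None

def gc_content(seq: str) -> float:
    """GC fraction -- one routine check in a primer-design decision tree."""
    s = seq.upper()
    return (s.count("G") + s.count("C")) / len(s)

p = PrimerRecord("PGX-001", "CYP2D6", "ATGCGTACGTTAGC", 58.2, "J. Doe")
print(f"{p.primer_id}: GC = {gc_content(p.sequence):.0%}")
```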

    Development and operations of the astrophysics data system

    Monthly progress reports are given for the period October 1993 through March 1994. Each month's report includes a general summary and overviews of Administrative functions, Systems Engineering, User Committee, User Support, Test and QA, System Integration, Development, Operations, and Suppliers of Data. These overviews include user and query statistics for the month.

    Data integration strategies for informing computational design in synthetic biology

    The potential design space for biological systems is complex, vast and multidimensional; effective large-scale synthetic biology therefore requires computational design and simulation. By constraining this design space, the time- and cost-efficient design of biological systems can be facilitated. One way in which a tractable design space can be achieved is to use the extensive and growing amount of biological data available to inform the design process: by using existing knowledge, design efforts can be focused on biologically plausible areas of the design space. However, biological data is large, incomplete, heterogeneous, and noisy, and it must be integrated in a systematic fashion in order to maximise its benefit. To date, data integration has not been widely applied to design in synthetic biology. The aim of this project is to apply data integration techniques to facilitate the efficient design of novel biological systems, with a specific focus on the development and application of integration techniques for the design of genetic regulatory networks in the model bacterium Bacillus subtilis.

    A dataset was constructed by integrating data from a range of sources in order to capture existing knowledge about B. subtilis 168. The dataset is represented as a computationally accessible, semantically rich network which includes information concerning biological entities and their relationships. Also included are sequence-based features mined from the B. subtilis genome, which are a useful source of parts for synthetic biology. In addition, information about the interactions of these parts has been captured, in order to facilitate the construction of circuits with desired behaviours. This dataset was also modelled in the form of an ontology, providing a formal specification of parts and their interactions. The ontology is a major step towards the unification of the data required for modelling with a range of part catalogues specifically designed for synthetic biology, and its data is available to existing reasoners for implicit knowledge extraction. The ontology was applied to the automated identification of promoters, operators and coding sequences, and information from it was also used to generate dynamic models of parts.

    The work described here contributed to the development of a formalism called Standard Virtual Parts (SVPs), which aims to represent models of biological parts in a standardised manner. SVPs comprise a mapping between biological parts and modular computational models, so a genetic circuit designed at a part-level abstraction can be investigated in detail by analysing a circuit model composed of SVPs. The ontology was used to construct SVPs in the form of standard Systems Biology Markup Language (SBML) models. These models are publicly available from a computationally accessible repository and include metadata which facilitates the computational composition of SVPs into models of larger biological systems.

    To test a genetic circuit in vitro or in vivo, the genetic elements necessary to encode the entities in the in silico model, and their associated behaviour, must be derived. Ultimately, this process results in the specification of a synthesisable DNA sequence. For large models, particularly those produced computationally, this transformation is challenging, so a model-to-sequence conversion algorithm was developed and implemented as a Java application called MoSeC. Using MoSeC, both CellML and SBML models built with SVPs can be converted into DNA sequences ready to synthesise.

    Selection of the host bacterial cell for a synthetic genetic circuit is very important. In order not to interfere with the existing cellular machinery, orthogonal parts from other species are used, since these parts are less likely to have undesired interactions with the host. To find orthogonal transcription factors (OTFs) and their target binding sequences, a subset of the integrated B. subtilis dataset was used: B. subtilis gene regulatory networks were used to reconstruct regulatory networks in closely related Bacillus species. The resulting system, called BacillusRegNet, stores both experimental data for B. subtilis and homology predictions in other species, and it was mined to extract OTFs and their binding sequences in order to facilitate the engineering of novel regulatory networks in other Bacillus species. Although the techniques presented here were demonstrated using B. subtilis, they can be applied to any other organism. The approaches and tools developed in this project demonstrate the utility of this novel integrated approach to synthetic biology.
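    The model-to-sequence step is easy to caricature in code. Below is a deliberately tiny Python sketch of the idea (MoSeC itself is a Java application consuming SVP-based SBML/CellML models and their metadata); the Part type, names and sequences are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Part:
    """A biological part with its DNA sequence, loosely in the spirit of
    a Standard Virtual Part; names, roles and sequences are illustrative."""
    name: str
    role: str        # e.g. "promoter", "RBS", "CDS", "terminator"
    sequence: str    # 5'->3' DNA

def model_to_sequence(design):
    """Toy version of model-to-sequence conversion: flatten an ordered
    part-level design into one synthesisable DNA string. This assumes
    the part order has already been resolved from the model."""
    return "".join(part.sequence for part in design)

circuit = [
    Part("Pspac", "promoter", "TTGACA" + "N" * 17 + "TATAAT"),  # -35/-10 with spacer placeholder
    Part("rbs1", "RBS", "AGGAGG"),
    Part("gfp", "CDS", "ATGAGTAAAGGAGAA"),                      # truncated CDS for brevity
    Part("term1", "terminator", "AAAAAAAGGGCGG"),
]
print(model_to_sequence(circuit))
```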

    Aggregation of biological knowledge for immunological and virological applications


    A framework for analyzing changes in health care lexicons and nomenclatures

    Ontologies play a crucial role in current web-based biomedical applications for capturing contextual knowledge in the domain of life sciences. Many of the so-called bio-ontologies and controlled vocabularies are known to be seriously defective from both terminological and ontological perspectives, and do not sufficiently comply with the standards required to be considered formal ontologies. They are therefore continuously evolving in order to fix these problems and provide valid knowledge. Moreover, many problems in ontology evolution originate from incomplete knowledge about the given domain; as our knowledge improves, the related definitions in the ontologies must be altered. This problem is inadequately addressed by available tools and algorithms, mostly due to the lack of suitable knowledge representation formalisms for dealing with temporal abstract notations and to an overreliance on human factors. In addition, most current approaches have focused on changes within the internal structure of ontologies, while interactions with other existing ontologies have been widely neglected. In this research, after revealing and classifying some of the common alterations in a number of popular biomedical ontologies, we present a novel agent-based framework, RLR (Represent, Legitimate, and Reproduce), to semi-automatically manage the evolution of bio-ontologies with minimal human intervention, with emphasis on the FungalWeb Ontology. RLR assists and guides ontology engineers through the change management process in general, and aids in tracking and representing the changes, particularly through the use of category theory. Category theory has been used as a mathematical vehicle for modeling changes in ontologies and representing agents' interactions, independent of any specific choice of ontology language or particular implementation. We have also employed rule-based hierarchical graph transformation techniques to propose a more specific semantics for analyzing ontological changes and transformations between different versions of an ontology, as well as for tracking the effects of a change at different levels of abstraction. The RLR framework thus enables one to manage changes in ontologies not as standalone artifacts in isolation, but in contact with other ontologies in an openly distributed semantic web environment. The emphasis on generality and abstractness makes RLR more feasible in the multi-disciplinary domain of biomedical ontology change management.
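    Change tracking between ontology versions, stripped of the category-theoretic machinery, reduces to comparing structured representations. The Python below is a minimal stand-in for that idea, not RLR itself; the triple representation and the FungalWeb-flavoured example terms are invented.

```python
def ontology_as_graph(triples):
    """Represent an ontology version as a set of (subject, relation,
    object) triples -- a deliberately minimal stand-in for the richer
    graph structures RLR's transformation rules operate on."""
    return set(triples)

def diff_versions(old, new):
    """Track changes between two ontology versions: what was added and
    what was removed. This only illustrates change detection."""
    return {"added": new - old, "removed": old - new}

v1 = ontology_as_graph([
    ("Hyphae", "is_a", "FungalStructure"),
    ("Chitin", "part_of", "CellWall"),
])
v2 = ontology_as_graph([
    ("Hyphae", "is_a", "FungalStructure"),
    ("Chitin", "part_of", "FungalCellWall"),   # refined definition
])
print(diff_versions(v1, v2))
```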

    Representing and Redefining Specialised Knowledge: Medical Discourse

    This volume brings together five selected papers on medical discourse which show how specialised medical corpora provide a framework that helps those engaging with medical discourse to determine how the everyday and the specialised combine to shape the discourse of medical professionals and non-medical communities in relation to both long- and short-term factors. The papers contribute, in an exemplary way, to illustrating the shifting boundaries in today’s society between the two major poles making up the medical discourse cline: healthcare discourse at the one end, which records the demand for personalised therapies and individual medical services, and clinical discourse at the other, which documents research into society’s collective medical needs.