Application of Machine Learning in Microbiology
Microorganisms are ubiquitous and closely related to people’s daily lives. Since they were first discovered in the 19th century, researchers have shown great interest in microorganisms. Traditionally, people studied microorganisms through cultivation, but this method is expensive and time-consuming, and it cannot keep pace with the development of high-throughput sequencing technology. To deal with this problem, machine learning (ML) methods have been widely applied to the field of microbiology. Literature reviews have shown that ML can be used in many aspects of microbiology research, especially classification problems, and for exploring the interactions between microorganisms and their surrounding environment. In this study, we summarize the application of ML in microbiology.
A Novel Human Microbe-Disease Association Prediction Method Based on the Bidirectional Weighted Network
The survival of human beings is inseparable from microbes. More and more studies have proved that microbes can affect human physiological processes in various ways and are closely related to some human diseases. In this paper, based on known microbe-disease associations, a bidirectional weighted network was first constructed by integrating the schemes of normalized Gaussian interactions and bidirectional recommendations. Then, based on the newly constructed bidirectional network, a computational model called BWNMHMDA was developed to predict potential relationships between microbes and diseases. Finally, in order to evaluate the superiority of the new prediction model BWNMHMDA, the frameworks of LOOCV and 5-fold cross-validation were implemented, and simulation results indicated that BWNMHMDA could achieve reliable AUCs of 0.9127 and 0.8967 ± 0.0027 in these two frameworks respectively, outperforming some state-of-the-art methods. Moreover, case studies of asthma, colorectal carcinoma, and chronic obstructive pulmonary disease were implemented to further estimate the performance of BWNMHMDA. Experimental results showed that 10, 9, and 8 of the top 10 predicted microbes were confirmed by related literature in these three case studies respectively, which also demonstrated that our new model BWNMHMDA could achieve satisfying prediction performance.
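The evaluation scheme described in this abstract (LOOCV and 5-fold cross-validation over known associations, reported as AUC) can be sketched roughly as follows. This is a minimal illustration of leave-one-out evaluation for association prediction, not the actual BWNMHMDA implementation; the toy degree-based scoring model and the example associations are hypothetical placeholders.

```python
import itertools
from collections import Counter

def auc(pos_scores, neg_scores):
    """Pairwise AUC: probability a positive outscores a negative (ties count 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def loocv_auc(known, candidates, score_model):
    """Hold out each known association in turn and rank it against all unknown pairs."""
    unknown = [c for c in candidates if c not in set(known)]
    aucs = []
    for held_out in known:
        train = [a for a in known if a != held_out]
        score = score_model(train)            # fit on the remaining associations
        aucs.append(auc([score(held_out)], [score(u) for u in unknown]))
    return sum(aucs) / len(aucs)

# Toy model: score a (microbe, disease) pair by how often each endpoint
# appears in the training associations (a crude neighbourhood heuristic).
def toy_model(train):
    m_deg = Counter(m for m, _ in train)
    d_deg = Counter(d for _, d in train)
    return lambda pair: m_deg[pair[0]] + d_deg[pair[1]]

microbes, diseases = ["m1", "m2", "m3"], ["d1", "d2", "d3"]
known = [("m1", "d1"), ("m1", "d2"), ("m2", "d1")]
candidates = list(itertools.product(microbes, diseases))
print(round(loocv_auc(known, candidates, toy_model), 3))
```

The k-fold variant follows the same pattern, holding out one fold of known associations at a time rather than a single pair.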
Novel Algorithm Development for ‘Next-Generation’ Sequencing Data Analysis
In recent years, the decreasing cost of ‘next-generation’ sequencing has spawned numerous applications for interrogating whole genomes and transcriptomes in research, diagnostic and forensic settings. While the innovations in sequencing have been explosive, the development of scalable and robust bioinformatics software and algorithms for the analysis of new types of data generated by these technologies has struggled to keep up. As a result, large volumes of NGS data available in public repositories are severely underutilised, despite providing a rich resource for data mining applications. Indeed, the bottleneck in genome and transcriptome sequencing experiments has shifted from data generation to bioinformatics analysis and interpretation.
This thesis focuses on development of novel bioinformatics software to bridge the gap between data availability and interpretation. The work is split between two core topics – computational prioritisation/identification of disease gene variants and identification of RNA N6-adenosine methylation from sequencing data.
The first chapter briefly discusses the emergence and establishment of NGS technology as a core tool in biology and its current applications and perspectives.
Chapter 2 introduces the problem of variant prioritisation in the context of Mendelian disease, where tens of thousands of potential candidates are generated by a typical sequencing experiment. Novel software developed for candidate gene prioritisation is described that utilises data mining of tissue-specific gene expression profiles (Chapter 3). The second part of the chapter investigates an alternative approach to candidate variant prioritisation by leveraging functional and phenotypic descriptions of genes and diseases from multiple biomedical domain ontologies (Chapter 4).
Chapter 5 discusses N6-adenosine methylation, a recently re-discovered post-transcriptional modification of RNA. The core of the chapter describes novel software developed for transcriptome-wide detection of this epitranscriptomic mark from sequencing data. Chapter 6 presents a case study application of the software, reporting the previously uncharacterised RNA methylome of Kaposi’s Sarcoma Herpes Virus. The chapter further discusses a putative novel N6-methyladenosine RNA-binding protein and its possible roles in the progression of viral infection.
Systems Analytics and Integration of Big Omics Data
A “genotype"" is essentially an organism's full hereditary information which is obtained from its parents. A ""phenotype"" is an organism's actual observed physical and behavioral properties. These may include traits such as morphology, size, height, eye color, metabolism, etc. One of the pressing challenges in computational and systems biology is genotype-to-phenotype prediction. This is challenging given the amount of data generated by modern Omics technologies. This “Big Data” is so large and complex that traditional data processing applications are not up to the task. Challenges arise in collection, analysis, mining, sharing, transfer, visualization, archiving, and integration of these data. In this Special Issue, there is a focus on the systems-level analysis of Omics data, recent developments in gene ontology annotation, and advances in biological pathways and network biology. The integration of Omics data with clinical and biomedical data using machine learning is explored. This Special Issue covers new methodologies in the context of gene–environment interactions, tissue-specific gene expression, and how external factors or host genetics impact the microbiome
Grand Celebration: 10th Anniversary of the Human Genome Project
In 1990, scientists began working together on one of the largest biological research projects ever proposed. The project proposed to sequence the three billion nucleotides in the human genome. The Human Genome Project took 13 years and was completed in April 2003, at a cost of approximately three billion dollars. It was a major scientific achievement that forever changed the understanding of our own nature. The sequencing of the human genome was in many ways a triumph for technology as much as it was for science. From the Human Genome Project, powerful technologies have been developed (e.g., microarrays and next generation sequencing) and new branches of science have emerged (e.g., functional genomics and pharmacogenomics), paving new ways for advancing genomic research and medical applications of genomics in the 21st century. The investigations have provided new tests and drug targets, as well as insights into the basis of human development and diagnosis/treatment of cancer and several mysterious human diseases. This genomic revolution is prompting a new era in medicine, which brings both challenges and opportunities. Parallel to the promising advances over the last decade, the study of the human genome has also revealed how complicated human biology is, and how much remains to be understood. The legacy of the understanding of our genome has just begun. To celebrate the 10th anniversary of the essential completion of the Human Genome Project, in April 2013 Genes launched this Special Issue, which highlights the recent scientific breakthroughs in human genomics, with a collection of papers written by authors who are leading experts in the field.
Discovery and characterization of novel non-coding 3′ UTR mutations in NFKBIZ and their functional implications in diffuse large B-cell lymphoma
Diffuse large B-cell lymphoma (DLBCL) is a very heterogeneous disease that has historically been divided into two subtypes driven by distinct molecular mechanisms. The activated B-cell (ABC) subtype of DLBCL has the worst overall survival and is characterized by activation of the NF-κB signaling pathway. Although many genetic alterations have been identified in DLBCL, there remain cases with few or no known genetic drivers. This suggests that there are still novel drivers of DLBCL yet to be discovered. In this thesis I aimed to leverage whole genome sequencing data to identify novel regions of the genome that were recurrently mutated, with a specific focus on non-coding regions. Through this analysis we identified numerous novel putative driver mutations within the non-coding genome. One of the most highly recurrently mutated regions was in the 3′ untranslated region (UTR) of the NFKBIZ gene. Amplifications of this gene have been previously discovered in ABC DLBCL and this gene is known to activate NF-κB signaling. Therefore, we hypothesized that these 3′ UTR mutations were acting as drivers in DLBCL. The remaining portion of this thesis is focused on the functional characterization of NFKBIZ 3′ UTR mutations and how they drive DLBCL and contribute to treatment resistance. To this end, I introduced NFKBIZ 3′ UTR mutations into DLBCL cell lines and determined that they cause both elevated mRNA and protein expression. These mutations conferred a selective growth advantage to DLBCL cell lines both in vitro and in vivo, and overexpression of NFKBIZ in primary germinal center B-cells also provided cells a growth advantage. Lastly, I found that NFKBIZ-mutant cell lines were more resistant to a selection of targeted therapeutics (ibrutinib, idelalisib and masitinib). Taken together, this thesis highlights the importance of surveying the entire cancer genome, including non-coding regions, when searching for novel drivers.
I demonstrated that mutations in the 3′ UTR of a gene can act as driver mutations conferring cell growth advantages and treatment resistance. This work also implicates NFKBIZ 3′ UTR mutations as potentially useful biomarkers for predicting treatment response and informing the most effective treatment options for patients.
Using molecular QTLs to identify cell types and causal variants for complex traits
Genetic associations have been discovered for many human complex traits, and yet for most associated loci the causal variants and molecular mechanisms remain unknown. Studies mapping quantitative trait loci (QTLs) for molecular phenotypes, such as gene expression, RNA splicing, and chromatin accessibility, provide rich data that can link variant effects in specific cell types with complex traits. These genetic effects can also now be modeled in vitro by differentiating human induced pluripotent stem cells (iPSCs) into specific cell types, including inaccessible cell types such as those of the brain. In this thesis, I explore a range of approaches for using QTLs to identify causal variants and to link these with molecular functions and complex traits.
In Chapter 2, I describe QTL mapping in 123 sensory neuronal cell lines differentiated from human iPSCs. I observed that gene expression was highly variable across iPSC-derived neuronal cultures in specific gene categories, and that a portion of this variability was explained by commonly used iPSC culture conditions, which influenced differentiation efficiency. A number of QTLs overlapped with common disease associations; however, using simulations I showed that identifying causal regulatory variants with a recall-by-genotype approach in iPSC-derived neurons is likely to require large sample sizes, even for variants with moderately large effect sizes.
In Chapter 3, I developed a computational model that uses publicly available gene expression QTL data, along with molecular annotations, to generate cell type-specific probability of regulatory function (PRF) scores for each variant. I found that predictive power was improved when the model was modified to use the quantitative value of annotations. PRF scores outperformed other genome-wide scores, including CADD and GWAVA, in identifying likely causal eQTL variants.
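The general idea described here, combining quantitative molecular annotations into a per-variant regulatory probability, can be sketched as a logistic combination of annotation values. The annotation names, weights, and bias below are illustrative assumptions, not the fitted PRF model from the thesis.

```python
import math

# Hypothetical annotation weights: quantitative annotations (e.g. chromatin
# accessibility signal, proximity to a TSS, conservation) are combined through
# a logistic function into a per-variant probability of regulatory function.
WEIGHTS = {"atac_signal": 1.2, "tss_proximity": 0.8, "conservation": 0.5}
BIAS = -2.0  # baseline log-odds: most variants are not regulatory

def prf_score(annotations):
    """Logistic combination of quantitative annotation values for one variant."""
    z = BIAS + sum(WEIGHTS[name] * value for name, value in annotations.items())
    return 1.0 / (1.0 + math.exp(-z))

variant = {"atac_signal": 2.1, "tss_proximity": 1.4, "conservation": 0.3}
print(round(prf_score(variant), 3))
```

Using the quantitative annotation values directly, rather than thresholding them to binary indicators, is what the chapter reports as improving predictive power; in practice the weights would be learned from eQTL training data rather than fixed by hand.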
In Chapter 4, I used PRF scores to identify relevant cell types and to fine map potential causal variants using summary association statistics in six complex traits. By examining individual loci in detail, I showed how the enrichments contributing to a high PRF score are transparent, which can help to distinguish plausible causal variant predictions from model misspecification.
Wellcome Trust Sanger Institute
Computational Methods for the Analysis of Genomic Data and Biological Processes
In recent decades, new technologies have made remarkable progress in helping to understand biological systems. Rapid advances in genomic profiling techniques such as microarrays or high-throughput sequencing have brought new opportunities and challenges in the fields of computational biology and bioinformatics. Such genetic sequencing techniques allow large amounts of data to be produced, whose analysis and cross-integration could provide a complete view of organisms. As a result, it is necessary to develop new techniques and algorithms that carry out an analysis of these data with reliability and efficiency. This Special Issue collected the latest advances in the field of computational methods for the analysis of gene expression data, and, in particular, the modeling of biological processes. Here we present eleven works selected to be published in this Special Issue due to their interest, quality, and originality.
The Technological Emergence of AutoML: A Survey of Performant Software and Applications in the Context of Industry
As with most technical fields, there exists a delay between fundamental academic research and practical industrial uptake. Whilst some sciences have robust and well-established processes for commercialisation, such as the pharmaceutical practice of regimented drug trials, other fields face transitory periods in which fundamental academic advancements diffuse gradually into the space of commerce and industry. For the still relatively young field of Automated/Autonomous Machine Learning (AutoML/AutonoML), that transitory period is under way, spurred on by a burgeoning interest from broader society. Yet, to date, little research has been undertaken to assess the current state of this dissemination and its uptake. Thus, this review makes two primary contributions to knowledge around this topic. Firstly, it provides the most up-to-date and comprehensive survey of existing AutoML tools, both open-source and commercial. Secondly, it motivates and outlines a framework for assessing whether an AutoML solution designed for real-world application is 'performant'; this framework extends beyond the limitations of typical academic criteria, considering a variety of stakeholder needs and the human-computer interactions required to service them. Thus, additionally supported by an extensive assessment and comparison of academic and commercial case-studies, this review evaluates mainstream engagement with AutoML in the early 2020s, identifying obstacles and opportunities for accelerating future uptake.