
    Epidemiological Prediction using Deep Learning

    Department of Mathematical Sciences
    Accurate, real-time prediction of epidemic disease plays a significant role in the health system and is of great importance for policy making, vaccine distribution and disease control. Since the SIR model introduced by Kermack and McKendrick in the early 1900s, researchers have developed various mathematical models to forecast the spread of disease. Despite these attempts, epidemic prediction remains an open scientific problem because current models lack flexibility or perform poorly. Owing to the temporal and spatial aspects of epidemiological data, the problem fits into the category of time-series forecasting. To capture both aspects of the data, this paper proposes a combination of recent deep learning models and applies it to ILI (influenza-like illness) data in the United States. Specifically, a graph convolutional network (GCN) is used to capture the geographical features of the U.S. regions and a gated recurrent unit (GRU) is used to capture the temporal dynamics of ILI. The results were compared with deep learning models proposed by other researchers, demonstrating that the proposed model outperforms the previous methods.
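
As a rough illustration of the spatial component mentioned in this abstract, the sketch below applies one graph-convolution step in the commonly used normalised form ReLU(D^-1/2 (A + I) D^-1/2 X W) to a toy region graph. It is not the paper's implementation: the function names, the two-region adjacency matrix and the feature values are all hypothetical, and the GRU that would consume one such output per week is omitted.

```python
import math


def normalized_adjacency(adj):
    """Return D^-1/2 (A + I) D^-1/2 for a square adjacency matrix."""
    n = len(adj)
    # Add self-loops so each region keeps its own signal.
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    # Degree of each node in the self-looped graph.
    deg = [sum(row) for row in a_tilde]
    return [[a_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain-Python matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One graph-convolution step followed by ReLU."""
    a_hat = normalized_adjacency(adj)
    z = matmul(matmul(a_hat, features), weights)
    return [[max(0.0, v) for v in row] for row in z]


# Hypothetical example: two neighbouring regions, one ILI feature each.
toy_adj = [[0, 1], [1, 0]]
toy_features = [[1.0], [3.0]]
toy_weights = [[1.0]]  # identity weight, so the output is pure neighbourhood averaging
smoothed = gcn_layer(toy_adj, toy_features, toy_weights)
```

With an identity weight the layer simply averages each region's value with its neighbour's, which is the sense in which a GCN mixes geographic information before any temporal model sees it.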

    Mining Host-Pathogen Interactions


    Joint learning from multiple information sources for biological problems

    Thanks to technological advancements, more and more biological data have been generated in recent years. Data availability offers unprecedented opportunities to look at the same problem from multiple aspects. It also unveils a more global view of the problem that takes into account the intricate interplay between the involved molecules/entities. Nevertheless, biological datasets are biased, limited in quantity, and contain many false-positive samples. Such challenges often drastically degrade the performance of a predictive model on unseen data and thus limit its applicability in real biological studies. Human learning is a multi-stage process in which we usually start with simple things. Through knowledge accumulated over time, our cognitive ability extends to more complex concepts. Children learn to speak simple words before being able to formulate sentences. Similarly, being able to speak correct sentences supports learning to speak correct and meaningful paragraphs. In general, knowledge acquired from related learning tasks helps boost our learning capability in the current task. Motivated by this phenomenon, in this thesis we study supervised machine learning models for bioinformatics problems that can improve their performance by exploiting multiple related knowledge sources. More specifically, we are concerned with ways to enrich the supervised models’ knowledge base with publicly available related data to enhance the computational models’ prediction performance. Our work shares common ground with existing work in multimodal learning, multi-task learning, and transfer learning, although it differs in some cases. Besides the proposed architectures, we present large-scale experimental setups with consensus evaluation metrics, along with the creation and release of large datasets, to showcase our approaches’ superiority.
Moreover, we add case studies with detailed analyses in which we make no simplifying assumptions, to demonstrate the systems’ utility in realistic application scenarios. Finally, we develop and make available an easy-to-use website for non-expert users to query the model’s predictions, to facilitate field experts’ assessments and adaptation. We believe that our work serves as one of the first steps in bridging the gap between “Computer Science” and “Biology” and that it will open a new era of fruitful collaboration between computer scientists and biological field experts.

    An explainable model of host genetic interactions linked to COVID-19 severity

    We employed a multifaceted computational strategy to identify the genetic factors contributing to increased risk of severe COVID-19 infection from a Whole Exome Sequencing (WES) dataset of a cohort of 2000 Italian patients. We coupled a stratified k-fold screening, to rank variants more associated with severity, with the training of multiple supervised classifiers, to predict severity based on screened features. Feature importance analysis from tree-based models allowed us to identify 16 variants with the highest support which, together with age and gender covariates, were found to be most predictive of COVID-19 severity. When tested on a follow-up cohort, our ensemble of models predicted severity with high accuracy (ACC = 81.88%; AUCROC = 96%; MCC = 61.55%). Our model recapitulated a vast literature of emerging molecular mechanisms and genetic factors linked to COVID-19 response and extends previous landmark Genome-Wide Association Studies (GWAS). It revealed a network of interplaying genetic signatures converging on established immune system and inflammatory processes linked to viral infection response. It also identified additional processes cross-talking with immune pathways, such as GPCR signaling, which might offer additional opportunities for therapeutic intervention and patient stratification. Publicly available PheWAS datasets revealed that several variants were significantly associated with phenotypic traits such as "Respiratory or thoracic disease", supporting their link with COVID-19 severity outcome.
    A multifaceted computational strategy identifies 16 genetic variants contributing to increased risk of severe COVID-19 infection from a Whole Exome Sequencing dataset of a cohort of Italian patients.
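
For readers unfamiliar with two of the metrics quoted in this abstract, the sketch below shows how accuracy (ACC) and the Matthews correlation coefficient (MCC) are computed from a binary confusion matrix. The function names and the counts are hypothetical and chosen only for illustration; they are not the study's data and do not reproduce its reported figures.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Ranges from -1 (total disagreement) to +1 (perfect prediction).
    Returns 0.0 when a marginal count is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts: 40 severe cases caught, 10 missed,
# 45 mild cases correctly labelled, 5 mislabelled as severe.
example_acc = accuracy(tp=40, tn=45, fp=5, fn=10)
example_mcc = mcc(tp=40, tn=45, fp=5, fn=10)
```

MCC uses all four cells of the confusion matrix, so it tends to be lower than accuracy on the same predictions and is less flattering when the classes are imbalanced.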

    Expression profiles of switch-like genes accurately classify tissue and infectious disease phenotypes in model-based classification

    Background: Large-scale compilation of gene expression microarray datasets across diverse biological phenotypes provided a means of gathering a priori knowledge in the form of identification and annotation of bimodal genes in the human and mouse genomes. These switch-like genes make up 15% of known human genes and are enriched with genes coding for extracellular and membrane proteins. It is of interest to determine the prediction potential of bimodal genes for class discovery in large-scale datasets.
    Results: Use of a model-based clustering algorithm accurately classified more than 400 microarray samples into 19 different tissue types on the basis of bimodal gene expression. Bimodal expression patterns were also highly effective in differentiating between infectious diseases in model-based clustering of microarray data. Supervised classification with feature selection restricted to switch-like genes also recognized tissue-specific and infectious-disease-specific signatures in independent test datasets reserved for validation. Determination of "on" and "off" states of switch-like genes in various tissues and diseases allowed for the identification of activated/deactivated pathways. Activated switch-like genes in neural, skeletal muscle and cardiac muscle tissue tend to have tissue-specific roles. A majority of activated genes in infectious disease are involved in processes related to the immune response.
    Conclusion: Switch-like bimodal gene sets capture genome-wide signatures from microarray data in health and infectious disease. A subset of bimodal genes coding for extracellular and membrane proteins is associated with tissue specificity, indicating a potential role for them as biomarkers provided that expression is altered in the onset of disease. Furthermore, we provide evidence that bimodal genes are involved in temporally and spatially active mechanisms, including tissue-specific functions and the response of the immune system to invading pathogens.

    A Computational Framework for Host-Pathogen Protein-Protein Interactions

    Infectious diseases cause millions of illnesses and deaths every year and raise serious health concerns worldwide. How to monitor and cure infectious diseases has become a prevalent and intractable problem. Since host-pathogen interactions are considered the key infection processes at the molecular level, a large body of research has focused on them in order to understand infection mechanisms and develop novel therapeutic solutions. For years, the continuous development of technologies in biology has benefitted wet-lab experiments, including small-scale biochemical, biophysical and genetic experiments and large-scale methods (for example, yeast two-hybrid analysis and cryogenic electron microscopy). As a result of decades of effort, there has been an explosive accumulation of biological data, including multi-omics data such as genomics and proteomics data. An initial review of omics data is therefore conducted in Chapter 2, which presents recent updates in ‘omics’ studies, focusing particularly on proteomics and genomics. With high-throughput technologies, the amount of ‘omics’ data, including genomics and proteomics data, has grown even further. An upsurge of interest in data analytics for bioinformatics comes as no surprise to researchers from a variety of disciplines. Specifically, the astonishing rate at which genomics and proteomics data are generated leads researchers into the realm of ‘Big Data’ research. Chapter 2 is thus developed to provide an update on the omics background and the state-of-the-art developments in the omics area, with a focus on genomics data, from the perspective of big data analytics.

    Disease diagnosis in smart healthcare: Innovation, technologies and applications

    To promote sustainable development, the smart city implies a global vision that merges artificial intelligence, big data, decision making, information and communication technology (ICT), and the internet of things (IoT). Population ageing is one issue to which researchers, companies and governments should devote effort by developing innovative smart healthcare technologies and applications. In this paper, the topic of disease diagnosis in smart healthcare is reviewed. Typical emerging optimization algorithms and machine learning algorithms are summarized. Evolutionary optimization, stochastic optimization and combinatorial optimization are covered. Owing to the fact that there are plenty of applications in healthcare, four applications in the field of disease diagnosis are considered, all of which were also listed among the top 10 causes of global death in 2015: cardiovascular diseases, diabetes mellitus, Alzheimer’s disease and other forms of dementia, and tuberculosis. In addition, challenges in the deployment of disease diagnosis in healthcare are discussed.

    Computational Methods for Omics Sequence Data with Focus on Non-Model Organisms

    Sequence data are the backbone of many biological research areas, including but not limited to genomics, proteomics and proteogenomics. Sequence acquisition is facilitated by a wide selection of advanced technologies such as Next Generation Sequencing and Mass Spectrometry. These high-throughput methods produce substantial volumes of data at decreasing cost in money and time. These volumes of data render manual processing impossible and therefore require state-of-the-art computational methods for adequate analysis and interpretation. In proteogenomics, the potential of combining omics methods to improve sequence quality and availability is frequently emphasized, in particular for non-model organisms. In this thesis, we highlight and address several challenges in the “life cycle” of omics sequence data, from genome sequence acquisition through integrated evaluation to extensive utilization of comprehensive sequence collections. We describe several methods with applications in different omics areas and emphasize means of potential integrative analysis. First, we introduce a method for de novo assembly contig quality ranking based on machine learning. In doing so, we demonstrate its particular potential for application to metagenomic sequence data, which usually feature a variety of previously sequenced as well as unsequenced, non-model organisms. Next, we elaborate on the availability of target sequences in the databases considered for taxonomic classification of tandem MS spectra. Here, the effect of different sequence sources and different search strategies on taxonomic depth is taken into account. Finally, we introduce a novel approach for extensive taxonomic classification by iteratively processing recent and comprehensive protein sequence databases. We discuss diverse possibilities as well as the limits of our methods with respect to the currently available public data.
Thereby, we illustrate potential benefits of the presented methods for non-model organisms.