1,865 research outputs found

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

    Interpretable machine learning for genomics

    High-throughput technologies such as next-generation sequencing allow biologists to observe cell function with unprecedented resolution, but the resulting datasets are too large and complicated for humans to understand without the aid of advanced statistical methods. Machine learning (ML) algorithms, which are designed to automatically find patterns in data, are well suited to this task. Yet these models are often so complex as to be opaque, leaving researchers with few clues about underlying mechanisms. Interpretable machine learning (iML) is a burgeoning subdiscipline of computational statistics devoted to making the predictions of ML models more intelligible to end users. This article is a gentle and critical introduction to iML, with an emphasis on genomic applications. I define relevant concepts, motivate leading methodologies, and provide a simple typology of existing approaches. I survey recent examples of iML in genomics, demonstrating how such techniques are increasingly integrated into research workflows. I argue that iML solutions are required to realize the promise of precision medicine. However, several open challenges remain. I examine the limitations of current state-of-the-art tools and propose a number of directions for future research. While the horizon for iML in genomics is wide and bright, continued progress requires close collaboration across disciplines.

    Comparisons of seven algorithms for pathway analysis using the WTCCC Crohn's Disease dataset

    Background: Though rooted in genomic expression studies, pathway analysis for genome-wide association studies (GWAS) has gained increasing popularity, since it has the potential to discover hidden disease pathogenic mechanisms by combining statistical methods with biological knowledge. Generally, algorithms or programs proposed recently can be categorized by different types of input data, null hypothesis or counts of analysis stages. Due to complexity caused by SNP, gene and pathway relationships, re-sampling strategies like permutation are always utilized to derive an empirical distribution for test statistics for evaluating the significance of candidate pathways. However, evaluation of these algorithms on real GWAS datasets and real biological pathway databases needs to be addressed before we apply them widely with confidence. Findings: Two algorithms which use summary statistics from GWAS as input were implemented in KGG, a novel and user-friendly software tool for GWAS pathway analysis. Comparisons of these two algorithms as well as the other five selected algorithms were conducted by analyzing the WTCCC Crohn's Disease dataset utilizing the MsigDB canonical pathways. As a result of using permutation to obtain empirical p-value, most of these methods could control Type I error rate well, although some are conservative. However, the methods varied greatly in terms of power and running time, with the PLINK truncated set-based test being the most powerful and KGG being the fastest. Conclusions: Raw data-based algorithms, such as those implemented in PLINK, are preferable for GWAS pathway analysis as long as computational capacity is available. It may be worthwhile to apply two or more pathway analysis algorithms on the same GWAS dataset, since the methods differ greatly in their outputs and might provide complementary findings for the studied complex disease.
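
The abstract above refers to permutation-derived empirical p-values. As a minimal sketch of that general idea, the Python below permutes case/control labels and counts how often the permuted statistic reaches the observed one. The mean-difference statistic, the function names, and the toy scores are assumptions made for illustration; they are not the statistics used by KGG, PLINK, or the other compared algorithms.

```python
import random


def empirical_p_value(observed, null_stats):
    """Empirical p-value with the +1 correction: (1 + #{null >= observed}) / (1 + n)."""
    exceed = sum(1 for s in null_stats if s >= observed)
    return (1 + exceed) / (1 + len(null_stats))


def permutation_test(scores, labels, n_perm=1000, seed=0):
    """Permute case/control labels and compare a mean-difference statistic.

    The statistic is a hypothetical stand-in for a pathway-level score.
    """
    rng = random.Random(seed)

    def stat(lbls):
        cases = [s for s, l in zip(scores, lbls) if l == 1]
        controls = [s for s, l in zip(scores, lbls) if l == 0]
        return sum(cases) / len(cases) - sum(controls) / len(controls)

    observed = stat(labels)
    shuffled = list(labels)
    null_stats = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between scores and labels
        null_stats.append(stat(shuffled))
    return observed, empirical_p_value(observed, null_stats)


if __name__ == "__main__":
    # Toy data: three "cases" with high scores, three "controls" with low scores.
    obs, p = permutation_test([5, 6, 7, 1, 2, 3], [1, 1, 1, 0, 0, 0], n_perm=200)
    print(f"observed statistic = {obs:.2f}, empirical p = {p:.3f}")
```

The +1 in numerator and denominator keeps the estimate away from exactly zero, which is a common convention when the number of permutations is finite.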

    Use of data mining and artificial intelligence to derive public health evidence from large datasets

    This thesis explores the use of data mining and AI-tailored frameworks for extracting public health evidence from large health datasets. The research presented in this thesis demonstrates the potential of these tools for automating and simplifying the data mining process, and for providing valuable insights into various public health issues. In Paper I, we used data mining and natural language processing to analyze the characteristics of genomic research on non-communicable diseases (NCDs) from the GWAS Catalog (2005 to 2022). We found that the majority of research institutions leading the work are often US-based and the majority of first, senior and all authors were male. The vast majority of complex trait GWAS has been performed in European ancestry populations, with cohorts and scientists predominantly located in medium-to-high socioeconomically ranked countries. This lack of diversity in both the data and the authorship of GWAS research has potential implications for the generalizability of genetic discoveries and the development of future interventions. In Paper II, we analyzed data collected through the app-based COVID Symptom Study in Sweden. We then created a symptom-based model to estimate the individual probability of symptomatic COVID-19 and employed this to estimate daily regional COVID-19 prevalence. We also used these data to predict next-week COVID-19 hospital admissions and compared the result to a model based on case notifications. We found that the symptom-based model had a lower median absolute percentage error during the first wave of the pandemic and that the model was transferable to an English dataset. The findings of this study demonstrate the feasibility of large-scale syndromic surveillance and the potential for population-based participatory surveillance initiatives in future pandemics and epidemics. In Paper III, we used data from over 500,000 participants in the COVID Symptom Study to investigate the impact of obesity and diabetes on the symptoms and duration of long-COVID. Using advanced data mining techniques, we found that individuals with higher BMI and diabetes had a higher burden of symptoms during the initial COVID-19 infection and a prolonged duration of long-COVID symptoms. We also found that vaccination had a protective effect against both COVID-19 symptoms and long-COVID symptoms in these at-risk groups. Our results demonstrate the disproportionate impact of COVID-19 on certain populations and the utility of app-based syndromic surveillance in providing timely and accurate information on the spread and impact of the virus.
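
Paper II in the abstract above compares models by median absolute percentage error. The sketch below shows how that metric is usually computed. The function name and the admission counts are hypothetical, and the thesis may define or apply the metric differently (for example, per region or per week).

```python
from statistics import median


def median_absolute_percentage_error(actual, predicted):
    """Median of |actual - predicted| / |actual|, expressed in percent.

    Observations with actual == 0 are skipped because the ratio is undefined.
    """
    errors = [
        100.0 * abs(a - p) / abs(a)
        for a, p in zip(actual, predicted)
        if a != 0
    ]
    if not errors:
        raise ValueError("no non-zero actual values to evaluate")
    return median(errors)


if __name__ == "__main__":
    # Hypothetical weekly admission counts, for illustration only.
    actual = [100, 200, 400]
    predicted = [110, 180, 400]
    print(median_absolute_percentage_error(actual, predicted))
```

Using the median rather than the mean makes the summary less sensitive to a few weeks with very large relative errors.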

    The eMERGE Network: A consortium of biorepositories linked to electronic medical records data for conducting genomic studies

    Introduction: The eMERGE (electronic MEdical Records and GEnomics) Network is an NHGRI-supported consortium of five institutions to explore the utility of DNA repositories coupled to Electronic Medical Record (EMR) systems for advancing discovery in genome science. eMERGE also includes a special emphasis on the ethical, legal and social issues related to these endeavors. Organization: The five sites are supported by an Administrative Coordinating Center. Setting of network goals is initiated by working groups: (1) Genomics, (2) Informatics, and (3) Consent & Community Consultation, which also includes active participation by investigators outside the eMERGE funded sites, and (4) Return of Results Oversight Committee. The Steering Committee, comprised of site PIs and representatives and NHGRI staff, meets three times per year, once per year with the External Scientific Panel. Current progress: The primary site-specific phenotypes for which samples have undergone genome-wide association study (GWAS) genotyping are cataract and HDL, dementia, electrocardiographic QRS duration, peripheral arterial disease, and type 2 diabetes. A GWAS is also being undertaken for resistant hypertension in ≈2,000 additional samples identified across the network sites, to be added to data available for samples already genotyped. Funded by ARRA supplements, secondary phenotypes have been added at all sites to leverage the genotyping data, and hypothyroidism is being analyzed as a cross-network phenotype. Results are being posted in dbGaP. Other key eMERGE activities include evaluation of the issues associated with cross-site deployment of common algorithms to identify cases and controls in EMRs, data privacy of genomic and clinically-derived data, developing approaches for large-scale meta-analysis of GWAS data across five sites, and a community consultation and consent initiative at each site. Future activities: Plans are underway to expand the network in diversity of populations and incorporation of GWAS findings into clinical care. Summary: By combining advanced clinical informatics, genome science, and community consultation, eMERGE represents a first step in the development of data-driven approaches to incorporate genomic information into routine healthcare delivery.

    Action detection using a neural network elucidates the genetics of mouse grooming behavior.

    Automated detection of complex animal behaviors remains a challenging problem in neuroscience, particularly for behaviors that consist of disparate sequential motions. Grooming is a prototypical stereotyped behavior that is often used as an endophenotype in psychiatric genetics. Here, we used mouse grooming behavior as an example and developed a general purpose neural network architecture capable of dynamic action detection at human observer-level performance and operating across dozens of mouse strains with high visual diversity. We provide insights into the amount of human annotated training data that are needed to achieve such performance. We surveyed grooming behavior in the open field in 2457 mice across 62 strains, determined its heritable components, conducted GWAS to outline its genetic architecture, and performed PheWAS to link human psychiatric traits through shared underlying genetics. Our general machine learning solution that automatically classifies complex behaviors in large datasets will facilitate systematic studies of behavioral mechanisms.

    An overview of data integration in neuroscience with focus on Alzheimer's Disease

    This work represents the first attempt to provide an overview of how to face data integration as the result of a dialogue between neuroscientists and computer scientists. Indeed, data integration is fundamental for studying complex multifactorial diseases, such as the neurodegenerative diseases. This work aims at warning readers of common pitfalls and critical issues in both the medical and data science fields. In this context, we define a road map for data scientists when they first approach the issue of data integration in the biomedical domain, highlighting the challenges that inevitably emerge when dealing with heterogeneous, large-scale and noisy data and proposing possible solutions. Here, we discuss data collection and statistical analysis, usually seen as parallel and independent processes, as cross-disciplinary activities. Finally, we provide an exemplary application of data integration to address Alzheimer's Disease (AD), which is the most common multifactorial form of dementia worldwide. We critically discuss the largest and most widely used datasets in AD, and demonstrate how the emergence of machine learning and deep learning methods has had a significant impact on knowledge of the disease, particularly with a view to early AD diagnosis.

    The Genetic Interacting Landscape of 63 Candidate Genes in Major Depressive Disorder: An Explorative Study

    Background: Genetic contributions to major depressive disorder (MDD) are thought to result from multiple genes interacting with each other. Different procedures have been proposed to detect such interactions. Which approach is best for explaining the risk of developing disease is unclear. This study sought to elucidate the genetic interaction landscape in candidate genes for MDD by conducting a SNP-SNP interaction analysis using an exhaustive search through 3,704 SNP-markers in 1,732 cases and 1,783 controls provided by the GAIN MDD study. We used three different methods to detect interactions: two logistic regression models (multiplicative and additive) and one data mining and machine learning (MDR) approach. Results: Although none of the interactions survived correction for multiple comparisons, the results provide important information for future genetic interaction studies in complex disorders. Among the 0.5% most significant observations, none had been reported previously for risk to MDD. Within this group of interactions, less than 0.03% would have been detectable based on a main effect approach or an a priori algorithm. We evaluated correlations among the three different models and conclude that all three algorithms detected the same interactions to a low degree. Although the top interactions had a surprisingly large effect size for MDD (e.g. additive dominant model Puncorrected = 9.10E-9 with attributable proportion (AP) value = 0.58 and multiplicative recessive model with Puncorrected = 6.95E-5 with odds ratio (OR estimated from β3) value = 4.99), the area under the curve (AUC) estimates were low (< 0.54). Moreover, the population attributable fraction (PAF) estimates were also low (< 0.15). Conclusions: We conclude that the top interactions on their own did not explain much of the genetic variance of MDD. The different statistical interaction methods we used in the present study did not identify the same pairs of interacting markers. Genetic interaction studies may uncover previously unsuspected effects that could provide novel insights into MDD risk, but much larger sample sizes are needed before this strategy can be powerfully applied.
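
The abstract above reports an odds ratio "estimated from β3" and an attributable proportion (AP) without giving formulas. The sketch below uses the textbook definitions, which are an assumption here and not confirmed by the source: the multiplicative interaction OR is exp(β3) for the product term of a logistic model, and the AP due to interaction is (OR11 - OR10 - OR01 + 1) / OR11, with odds ratios standing in for relative risks. All function names and the example odds ratios are hypothetical; only the value 4.99 is taken from the abstract.

```python
import math


def interaction_odds_ratio(beta3):
    """Multiplicative interaction odds ratio from the product-term coefficient."""
    return math.exp(beta3)


def attributable_proportion(or10, or01, or11):
    """Attributable proportion due to interaction on the additive scale.

    or10 / or01: odds ratios for exposure to one factor only.
    or11: odds ratio for joint exposure, relative to the doubly unexposed group.
    """
    return (or11 - or10 - or01 + 1.0) / or11


if __name__ == "__main__":
    # A coefficient of ln(4.99) reproduces the OR quoted in the abstract.
    print(round(interaction_odds_ratio(math.log(4.99)), 2))
    # Hypothetical odds ratios chosen only to show the arithmetic.
    print(attributable_proportion(or10=1.5, or01=1.5, or11=4.0))
```

An AP of zero means the joint effect equals the sum of the separate effects, so there is no interaction on the additive scale.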