3,112 research outputs found

    Machine Learning Methods To Identify Hidden Phenotypes In The Electronic Health Record

    Get PDF
    The widespread adoption of Electronic Health Records (EHRs) means an unprecedented amount of patient treatment and outcome data is available to researchers. Research is a tertiary priority in the EHR, where the priorities are patient care and billing. Because of this, the data is not standardized or formatted in a manner easily adapted to machine learning approaches. Data may be missing for a large variety of reasons ranging from individual input styles to differences in clinical decision making, for example, which lab tests to issue. Few patients are annotated at a research quality, limiting sample size and presenting a moving gold standard. Patient progression over time is key to understanding many diseases but many machine learning algorithms require a snapshot, at a single time point, to create a usable vector form. In this dissertation, we develop new machine learning methods and computational workflows to extract hidden phenotypes from the Electronic Health Record (EHR). In Part 1, we use a semi-supervised deep learning approach to compensate for the low number of research quality labels present in the EHR. In Part 2, we examine and provide recommendations for characterizing and managing the large amount of missing data inherent to EHR data. In Part 3, we present an adversarial approach to generate synthetic data that closely resembles the original data while protecting subject privacy. We also introduce a workflow to enable reproducible research even when data cannot be shared. In Part 4, we introduce a novel strategy to first extract sequential data from the EHR and then demonstrate the ability to model these sequences with deep learning

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Get PDF
    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges as well as exacerbates the ones associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues

    Functional characterization of single amino acid variants

    Get PDF
    Single amino acid variants (SAVs) are one of the main causes of Mendelian disorders, and play an important role in the development of many complex diseases. At the same time, they are the most common kind of variation affecting coding DNA, without generally presenting any damaging effect. With the advent of next generation sequencing technologies, the detection of these variants in patients and the general population is easier than ever, but the characterization of the functional effects of each variant remains an open challenge. It is our objective in this work to tackle this problem by developing machine learning based in silico SAVs pathology predictors. Having the PMut classic predictor as a starting point, we have rethought the entire supervised learning pipeline, elaborating new training sets, features and classifiers. PMut2017 is the first result of these efforts, a new general-purpose predictor based on SwissVar and trained on 12 different conservation scores. Its performance, evaluated bothby cross-validation and different blind tests, was in line with the best predictors published to date. Continuing our efforts in search for more accurate predictors, especially for those cases were general predictors tend to fail, we developed PMut-S, a suite of 215 protein-specific predictors. Similar to PMut in nature, Pmut-S introduced the use of co-evolution conservation features and balanced training sets, and showed improved performance, specially for those proteins that were more commonly misclassified by PMut. Comparing PMut-S to other specific predictors we proved that it is possible to train specific predictors using a unique automated pipeline and match the results of most gene specific predictors released to date. The implementation of the machine learning pipeline of both PMut and PMut-S was released as an open source Python module: PyMut, which bundles functions implementing the features computation and selection, classifier training and evaluation, plots drawing, among others. Their predictions were also made available in a rich web portal, which includes a precomputed repository with analyses of more than 700 million variants on over 100,000 human proteins, together with relevant contextual information such as 3D visualizationsof protein structures, links to databases, functional annotations, and more.Les mutacions puntuals d’aminoàcids són la principal causa de moltes malalties mendelianes, i juguen un paper important en el desenvolupament de moltes malalties complexes. Alhora, són el tipus de variant més comuna que afecta l’ADN codificant de proteïnes, sense provocar, en general, cap efecte advers. Amb l’adveniment de la seqüenciació de nova generació, la detecció d’aquestes variants en pacients i en la població general és més fàcil que mai, però la caracterització dels efectes funcionals de cada variant segueix sent un repte. El nostre objectiu en aquest treball és abordar aquest problema desenvolupant predictors de patologia in silico basats en l’aprenentatge automàtic. Prenent el predictor clàssic PMut com a punt de partida, hem repensat tot el procés d’aprenentatge supervisat, elaborant nous conjunts d’entrenament, descriptors i classificadors. PMut2017 és el primer resultat d’aquests esforços, un nou predictor basat en SwissVar i entrenat amb 12 mètriques de conservació de seqüència. La seva precisió, mesurada mitjançant validació creuada i amb tests cecs, s’ha mostrar en línia amb els millors predictors publicats a dia d’avui. Continuant els nostres esforços en la cerca de predictors més acurats, hem desenvolupat PMut-S, un conjunt de 215 predictors específics per cada proteïna. Similar a PMut en la seva concepció, PMut-S introdueix l’ús de descriptors basats en la coevolució i conjunts d’entrenament balancejats, millorant el rendiment de PMut2017 en 0.1 punts del coeficient de correlació de Matthews. Comparant PMut-S a d’altres predictors específics hem provat que és possible entrenar predictors específics seguint un únic procediment automatitzat i assolir uns resultats tan bon com els de la majoria de predictors específics publicats. La implementació del procediment d’aprenentatge automàtic tant de PMut com de PMut-S ha sigut publicat com a un mòdul de Python de codi obert: PyMut, el qual inclou les funcions que implementen el càlcul dels descriptors i la seva selecció, l’entrenament i avaluació dels classificadors, el dibuix de diverses gràfiques... Les prediccions també estan disponibles en un portal web que inclou un repositori precalculat amb els anàlisis de més de 700 milions de variants en més de 100 mil proteïnes humanes, junt a rellevant informació de context com visualitzacions 3D de les proteïnes, enllaços a bases de dades, anotacions funcionals i molt més

    A genetic ensemble approach for gene-gene interaction identification

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>It has now become clear that gene-gene interactions and gene-environment interactions are ubiquitous and fundamental mechanisms for the development of complex diseases. Though a considerable effort has been put into developing statistical models and algorithmic strategies for identifying such interactions, the accurate identification of those genetic interactions has been proven to be very challenging.</p> <p>Methods</p> <p>In this paper, we propose a new approach for identifying such gene-gene and gene-environment interactions underlying complex diseases. This is a hybrid algorithm and it combines genetic algorithm (GA) and an ensemble of classifiers (called genetic ensemble). Using this approach, the original problem of SNP interaction identification is converted into a data mining problem of combinatorial feature selection. By collecting various single nucleotide polymorphisms (SNP) subsets as well as environmental factors generated in multiple GA runs, patterns of gene-gene and gene-environment interactions can be extracted using a simple combinatorial ranking method. Also considered in this study is the idea of combining identification results obtained from multiple algorithms. A novel formula based on pairwise <it>double fault </it>is designed to quantify the degree of complementarity.</p> <p>Conclusions</p> <p>Our simulation study demonstrates that the proposed genetic ensemble algorithm has comparable identification power to Multifactor Dimensionality Reduction (MDR) and is slightly better than Polymorphism Interaction Analysis (PIA), which are the two most popular methods for gene-gene interaction identification. More importantly, the identification results generated by using our genetic ensemble algorithm are highly complementary to those obtained by PIA and MDR. Experimental results from our simulation studies and real world data application also confirm the effectiveness of the proposed genetic ensemble algorithm, as well as the potential benefits of combining identification results from different algorithms.</p

    Machine learning for microbial ecology: predicting interactions and identifying their putative mechanisms

    Get PDF
    Microbial communities are key components of Earth’s ecosystems and they play important roles in human health and industrial processes. These communities and their functions can strongly depend on the diverse interactions between constituent species, posing the question of how such interactions can be predicted, measured and controlled. This challenge is particularly relevant for the many practical applications enabled by the rising field of synthetic microbial ecology, which includes the design of microbiome therapies for human diseases. Advances in sequencing technologies and genomic databases provide valuable datasets and tools for studying inter-microbial interactions, but the capacity to characterize the strength and mechanisms of interactions between species in large consortia is still an unsolved challenge. In this thesis, I show how machine learning methods can be used to help address these questions. The first portion of my thesis work was focused on predicting the outcome of pairwise interactions between microbial species. By integrating genomic information and observed experimental data, I used machine learning algorithms to explore the predictive relationship between single-species traits and inter-species interaction phenotypes. I found that organismal traits (e.g. annotated functions of genomic elements) are sufficient to predict the qualitative outcome of interactions between microbes. I also found that the relative fraction of possible experiments needed to build acceptable models drastically shrinks as the combinatorial space grows. In the second part of my thesis work, I developed an algorithmic method for identifying putative interaction mechanisms by scoring combinations of variables that random forest uses in order to predict interaction outcomes. I applied this method to a study of the human microbiome and identified a previously unreported combination of microbes that are strongly associated with Crohn’s disease. In the last part of my thesis, I utilized a regression approach to first identify and then quantify interactions between microbial species relevant to community function. The work I present in this dissertation provides a general framework for understanding the myriad interactions that occur in natural and synthetic microbial consortia
    • …
    corecore