750 research outputs found

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability issues.

    Clustering and Alignment of Polymorphic Sequences for HLA-DRB1 Genotyping

    Located on chromosome 6p21, classical human leukocyte antigen (HLA) genes are highly polymorphic. HLA alleles associate with a variety of phenotypes, such as narcolepsy, autoimmunity, and immunologic response to infectious disease. Moreover, high resolution genotyping of these loci is critical to achieving long-term survival of allogeneic transplants. Development of methods to obtain high resolution analysis of HLA genotypes will lead to improved understanding of how select alleles contribute to human health and disease risk. Genomic DNAs were obtained from a cohort of n = 383 subjects recruited as part of an Ulcerative Colitis study and analyzed for HLA-DRB1. HLA genotypes were determined using sequence specific oligonucleotide probes and by next-generation sequencing using the Roche/454 GSFLX instrument. The Clustering and Alignment of Polymorphic Sequences (CAPSeq) software application was developed to analyze next-generation sequencing data. The application generates HLA sequence specific 6-digit genotype information from next-generation sequencing data using MUMmer to align sequences and the R package diffusionMap to classify sequences into their respective allelic groups. Incorporating bootstrap aggregating (bagging) to aid in sorting sequences into allele classes improved genotyping accuracy. With 60 bagging iterations, CAPSeq genotypes showed high concordance with 4-digit genotypes characterized by sequence specific oligonucleotide probes, matching at 759 out of 766 (99.1%) alleles. © 2013 Ringquist et al.
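    The bagging step mentioned in the abstract above can be illustrated with a minimal sketch. It is not the CAPSeq implementation: the base classifier here is a toy nearest-centroid rule on a single numeric feature, and the names `nearest_centroid_fit`, `nearest_centroid_predict`, and `bagged_predict` are hypothetical. Only the general idea (resample with replacement, classify, take the majority vote) and the iteration count of 60 come from the text.

```python
import random
from collections import Counter


def nearest_centroid_fit(samples):
    """Toy base classifier (an assumption): mean feature value per label."""
    sums, counts = Counter(), Counter()
    for value, label in samples:
        sums[label] += value
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in counts}


def nearest_centroid_predict(centroids, value):
    """Return the label whose centroid is closest to `value`."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


def bagged_predict(samples, value, iterations=60, seed=0):
    """Bootstrap aggregating: majority vote over classifiers fit on resamples."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    votes = Counter()
    for _ in range(iterations):
        # Draw a bootstrap resample of the same size, with replacement.
        resample = [rng.choice(samples) for _ in samples]
        centroids = nearest_centroid_fit(resample)
        votes[nearest_centroid_predict(centroids, value)] += 1
    return votes.most_common(1)[0][0]


# Hypothetical one-dimensional coordinates for two allele groups "A" and "B".
training = [(0.10, "A"), (0.20, "A"), (0.15, "A"),
            (0.90, "B"), (1.00, "B"), (0.95, "B")]
label_low = bagged_predict(training, 0.12)
label_high = bagged_predict(training, 0.97)
```

    In this toy setting the vote mainly guards against unlucky resamples, for example one that happens to contain a single allele group.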

    Computational methods to analyze molecular determinants behind phenotypes

    A phenotype is a collection of an organism's observable features that can be characterized both at the level of the individual and at the level of single cells. Phenotypes are largely determined by molecular processes, which also explain their inheritance and plasticity. Some of the molecular background of phenotypes can be characterized by inherited genetic variations and alterations in gene expression. High-throughput measurement technologies enable the measurement of molecular determinants in cells. However, these technologies produce remarkably large data sets, and the research questions have become increasingly complex. Thus computational methods are needed to discover the molecular mechanisms behind phenotypes. In many cases, analysis of the molecular determinants that contribute to a phenotype proceeds by first identifying putative candidates using a priori information and high-throughput measurements. Further analysis can then focus on the most promising molecules. In many cases, the aim is to identify relevant markers or targets from a set of candidate molecules. Biomedical studies often result in a long list of candidate genes, and interpreting these candidates requires information on their context in cell functions. This context information can give insight into synergistic effects of the molecular machinery in cells when the functions of individual molecules do not explain the observed phenotype. In addition, the context information can be used to generate candidates. One of the methods in this thesis is a computational data integration method that links candidate genes from molecular pathways to genetic variants. It uses publicly available biological knowledge bases to systematically create a functional context for candidate genes. This approach is especially important when studying cancer, which depends on complex molecular signaling. 
Genotypes associated with inherited disease predispositions have been studied successfully in the past; however, traditional methods are not applicable in a wide variety of analysis conditions. Thus, this thesis introduces a method that uses haplotype sharing to identify genetic loci inherited by multiple distantly related individuals. It is flexible and can be used in various settings, even with a very limited number of samples. Increasing the number of biological replicates in gene expression analysis increases the reliability of the results. In many cases, however, the number of samples is limited. Therefore, pooling gene expression data from multiple published studies can increase the understanding of the molecular background behind cell types. This is shown in this thesis by an analysis that identifies gene expression differences between two cell types using publicly available gene expression samples from previous studies. Finally, when candidate molecules are available to characterize phenotypes, they can be compiled into biomarkers. In many cases, a combination of multiple molecules serves as a better biomarker than a single molecule. This thesis also includes a machine learning approach that is used to discover a classifier that predicts the phenotype.

    Joint assembly and genetic mapping of the Atlantic horseshoe crab genome reveals ancient whole genome duplication

    Horseshoe crabs are marine arthropods with a fossil record extending back approximately 450 million years. They exhibit remarkable morphological stability over their long evolutionary history, retaining a number of ancestral arthropod traits, and are often cited as examples of "living fossils." As arthropods, they belong to the Ecdysozoa, an ancient super-phylum whose sequenced genomes (including insects and nematodes) have thus far shown more divergence from the ancestral pattern of eumetazoan genome organization than cnidarians, deuterostomes, and lophotrochozoans. However, much of ecdysozoan diversity remains unrepresented in comparative genomic analyses. Here we use a new strategy of combined de novo assembly and genetic mapping to examine the chromosome-scale genome organization of the Atlantic horseshoe crab Limulus polyphemus. We constructed a genetic linkage map of this 2.7 Gbp genome by sequencing the nuclear DNA of 34 wild-collected, full-sibling embryos and their parents at a mean redundancy of 1.1x per sample. The map includes 84,307 sequence markers and 5,775 candidate conserved protein coding genes. Comparison to other metazoan genomes shows that the L. polyphemus genome preserves ancestral bilaterian linkage groups, and that a common ancestor of modern horseshoe crabs underwent one or more ancient whole genome duplications (WGDs) ~300 MYA, followed by extensive chromosome fusion.

    Computational methods to improve genome assembly and gene prediction

    DNA sequencing is used to read the nucleotides composing the genetic material that forms individual organisms. As 2nd generation sequencing technologies offering high throughput at a feasible cost have matured, sequencing has permeated nearly all areas of biological research. By a combination of large-scale projects led by consortiums and smaller endeavors led by individual labs, the flood of sequencing data will continue, which should provide major insights into how genomes produce physical characteristics, including disease, and evolve. To realize this potential, computer science is required to develop the bioinformatics pipelines to efficiently and accurately process and analyze the data from large and noisy datasets. Here, I focus on two crucial bioinformatics applications: the assembly of a genome from sequencing reads and protein-coding gene prediction. In genome assembly, we form large contiguous genomic sequences from the short sequence fragments generated by current machines. Starting from the raw sequences, we developed software called Quake that corrects sequencing errors more accurately than previous programs by using coverage of k-mers and probabilistic modeling of sequencing errors. My experiments show correcting errors with Quake improves genome assembly and leads to the detection of more polymorphisms in re-sequencing studies. For post-assembly analysis, we designed a method to detect a particular type of mis-assembly where the two copies of each chromosome in diploid genomes diverge. We found thousands of examples in each of the chimpanzee, cow, and chicken public genome assemblies that created false segmental duplications. Shotgun sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to both discover unknown microbes and explore complex environments. 
We developed software called Scimm that clusters metagenomic sequences based on composition in an unsupervised fashion more accurately than previous approaches. Finally, we extended an approach for predicting protein-coding genes on whole genomes to metagenomic sequences by adding new discriminative features and augmenting the task with taxonomic classification and clustering of the sequences. The program, called Glimmer-MG, predicts genes more accurately than all previous methods. Adding a model for sequencing errors, which also allows the program to predict insertions and deletions, significantly improves accuracy on error-prone sequences.
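    The k-mer coverage idea that the abstract above attributes to Quake can be sketched in a few lines. This is only an illustration under simplifying assumptions, not Quake's algorithm: it uses plain integer counts and a fixed threshold, with no probabilistic error model, and the function names `count_kmers` and `untrusted_positions` as well as the cutoff `min_coverage` are hypothetical.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def untrusted_positions(read, counts, k, min_coverage):
    """Start indices of k-mers in `read` seen fewer than `min_coverage` times.

    Low-coverage k-mers are likely to overlap a sequencing error, because a
    true genomic k-mer is expected to recur across many overlapping reads.
    """
    return [i for i in range(len(read) - k + 1)
            if counts[read[i:i + k]] < min_coverage]


# Five error-free copies of a read plus one copy with a substitution at index 4.
reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
counts = count_kmers(reads, k=4)
suspect = untrusted_positions("ACGTTCGT", counts, k=4, min_coverage=3)
```

    Here every 4-mer that overlaps the substituted base is rare, so the flagged start positions cluster around index 4. That clustering is the signal a corrector could use to localize the error.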