
    The EM Algorithm and the Rise of Computational Biology

    In the past decade computational biology has grown from a cottage industry with a handful of researchers to an attractive interdisciplinary field, catching the attention and imagination of many quantitatively minded scientists. Of interest to us is the key role played by the EM algorithm during this transformation. We survey the use of the EM algorithm in a few important computational biology problems surrounding the "central dogma" of molecular biology: from DNA to RNA and then to proteins. Topics of this article include sequence motif discovery, protein sequence alignment, population genetics, evolutionary models and mRNA expression microarray data analysis. Comment: Published at http://dx.doi.org/10.1214/09-STS312 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)
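The E-step/M-step cycle surveyed above can be illustrated on a toy latent-variable problem. The sketch below is not from the article; the two-coin setup and function name are illustrative, but the structure (soft assignment of each observation to a hidden component, then weighted re-estimation) is the same one that underlies EM-based motif discovery:

```python
def em_two_coins(draws, n_flips, iters=200):
    """EM for a mixture of two biased coins (toy illustration).

    draws: list of head counts, each from n_flips tosses of one of two
    coins with unknown biases. Returns the estimated biases.
    """
    theta_a, theta_b = 0.6, 0.4  # arbitrary asymmetric starting point
    for _ in range(iters):
        num_a = den_a = num_b = den_b = 0.0
        for h in draws:
            # E-step: responsibility of coin A for this draw
            la = theta_a ** h * (1 - theta_a) ** (n_flips - h)
            lb = theta_b ** h * (1 - theta_b) ** (n_flips - h)
            w = la / (la + lb)
            num_a += w * h
            den_a += w * n_flips
            num_b += (1 - w) * h
            den_b += (1 - w) * n_flips
        # M-step: re-estimate each bias from its weighted head counts
        theta_a, theta_b = num_a / den_a, num_b / den_b
    return theta_a, theta_b
```

Run on draws generated by coins of bias roughly 0.83 and 0.17 (e.g. `[8, 9, 8, 2, 1, 2]` heads out of 10 flips each), the estimates converge near those values without ever observing which coin produced which draw.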

    Inferring genome-scale rearrangement phylogeny and ancestral gene order: a Drosophila case study

    A simple, fast, and biologically-inspired computational approach to infer genome-scale rearrangement phylogeny and ancestral gene order has been developed and applied to eight Drosophila genomes, providing insights into evolutionary chromosomal dynamics.

    Finding regions of aberrant DNA copy number associated with tumor phenotype

    DNA copy number alterations are a hallmark of cancer. Understanding their role in tumor progression can help improve diagnosis, prognosis and therapy selection for cancer patients. High-resolution, genome-wide measurements of DNA copy number changes for large cohorts of tumors are currently available, owing to technologies like microarray-based array comparative genomic hybridization (array-CGH). In this thesis, we present a computational pipeline for statistical analysis of tumor cohorts, which can help extract relevant patterns of copy number aberrations and infer their association with various phenotypic indicators. The main challenges are the instability of classification models due to the high dimensionality of the arrays compared to the small number of tumor samples, as well as the large correlations between copy number estimates measured at neighboring loci. We show that the feature ranking given by several widely used methods for feature selection is biased due to the large correlations between features. In order to correct for the bias and instability of the feature ranking, we introduce methods for consensus segmentation of the set of arrays. We present three algorithms for consensus segmentation, based on identifying recurrent DNA breakpoints or DNA regions of constant copy number profile. The segmentation constitutes the basis for computing a set of super-features, corresponding to the regions. We use the super-features for supervised classification and compare the models to baseline models trained on probe data. We validate the methods by training models to predict the phenotype of breast cancer and neuroblastoma tumors. We show that the multivariate segmentation affords higher model stability, generally improves prediction accuracy and facilitates model interpretation. One of our most important biological results concerns the classification of neuroblastoma tumors.
    We show that patients belonging to different age subgroups are characterized by distinct copy number patterns, with the largest discrepancy when the subgroups are defined as older or younger than 16-18 months. We thereby confirm the recommendation for a higher age cutoff than 12 months (current clinical practice) for differential diagnosis of neuroblastoma.
    [Translated from the German abstract:] The abnormal multiplicity of certain DNA segments (copy number aberrations) is one of the hallmarks of cancer. Understanding the role of this feature in tumor growth could contribute substantially to improving cancer diagnosis, prognosis and therapy, and thus help in selecting individualized therapies. Microarray-based technologies such as array comparative genomic hybridization (array-CGH) make it possible to generate high-resolution, genome-wide copy number maps of tumor tissues. The subject of this thesis is the development of a software pipeline for the statistical analysis of tumor cohorts that makes it possible to derive relevant patterns of abnormal copy numbers and associate them with diverse phenotypic traits. This is done with machine learning methods for classification and feature selection, with a focus on the interpretability of the learned models (regularized linear methods as well as decision-tree-based models). The main challenges lie in the high dimensionality of the data, set against a comparatively small number of measured tumor samples, and in the high correlation between copy numbers measured at neighboring genomic regions. As a consequence, the results of feature selection depend strongly on the choice of training set, which severely limits reproducibility across different clinical data sets.
    This thesis shows that the feature rankings produced by several widely used methods are strongly biased as a result of high correlation coefficients between individual predictors. To correct for this bias and the instability of the feature ranking, we introduce a dimensionality-reducing step into our pipeline that consists of jointly segmenting the arrays in a multivariate fashion. We present three algorithms for this multivariate segmentation, based on the identification of recurrent DNA breakpoints or of genomic regions with constant copy number profiles. By aggregating the DNA copy number values within each region, the multivariate segmentation forms the basis for computing a smaller set of 'super-features'. Compared to classification approaches operating on individual array probes, supervised classification based on the super-features improves both the interpretability and the stability of the models. We validate the methods by training prediction models on breast cancer and neuroblastoma data sets. Here we show that the multivariate segmentation step yields increased model stability without loss of predictive accuracy. The dimensionality of the problem is reduced substantially (up to 200-fold fewer features), which makes multivariate segmentation not only a suitable tool for phenotype prediction but also a useful preprocessing step for later integrative analyses with other data types. Model interpretability is improved as well, enabling the identification of important relations between copy number changes and phenotype.
    For example, we show that a co-amplification in the direct neighborhood of the ERBB2 gene locus is a highly informative predictor for distinguishing inflammatory from non-inflammatory breast cancers, confirming the hypothesis common in the literature that the size of an amplicon is linked to the cancer subtype. In the case of neuroblastoma tumors, we show that subgroups defined by patient age can be characterized by copy number patterns. This is possible in particular when an age threshold of 16 to 18 months is used to define the groups, for which the highest prediction accuracy is also obtained. We thus provide further evidence for the recommendation to use a threshold higher than twelve months for the differential diagnosis of neuroblastoma.
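The dimensionality-reduction step described above, collapsing probe-level copy numbers into region-level 'super-features', can be sketched in a few lines. This is a minimal illustration, assuming segment boundaries have already been produced by one of the consensus segmentation algorithms; `super_features` is a hypothetical helper, not code from the thesis:

```python
import numpy as np

def super_features(X, boundaries):
    """Collapse probe-level copy number estimates into region 'super-features'.

    X: (samples x probes) matrix of log-ratios, probes in genomic order.
    boundaries: sorted probe indices at which shared segments start
    (first entry 0), e.g. the output of a consensus segmentation step.
    Returns a (samples x regions) matrix of per-region mean log-ratios.
    """
    edges = list(boundaries) + [X.shape[1]]
    # Average the correlated probes inside each shared segment
    return np.column_stack(
        [X[:, s:e].mean(axis=1) for s, e in zip(edges[:-1], edges[1:])]
    )
```

With, say, 200 probes per region this reduces the feature space roughly 200-fold, which is the mechanism behind the improved model stability reported above: downstream classifiers see one feature per recurrent region instead of many highly correlated probes.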

    A Novel Computational Framework for Transcriptome Analysis with RNA-seq Data

    The advance of high-throughput sequencing technologies and their application to mRNA transcriptome sequencing (RNA-seq) have enabled comprehensive and unbiased profiling of the landscape of transcription in a cell. To address current limitations in the accuracy and scalability of transcriptome analysis, a novel computational framework has been developed for large-scale RNA-seq datasets with no dependence on transcript annotations. Starting directly from raw reads, a probabilistic approach is first applied to infer the best transcript fragment alignments from paired-end reads. Empowered by the identification of alternative splicing modules, the framework then performs precise and efficient differential analysis at automatically detected alternative splicing variants, which circumvents the need for full transcript reconstruction and quantification. Beyond the scope of classical group-wise analysis, a clustering scheme is further described for mining prominent consistency among samples in transcription, breaking the restriction of presumed grouping. The performance of the framework has been demonstrated in a series of simulation studies and on real datasets, including The Cancer Genome Atlas (TCGA) breast cancer analysis. These successful applications suggest an unprecedented opportunity to use differential transcription analysis to reveal variations in the mRNA transcriptome in response to cellular differentiation or the effects of disease.

    Integration and mining of malaria molecular, functional and pharmacological data: how far are we from a chemogenomic knowledge space?

    The organization and mining of malaria genomic and post-genomic data is strongly motivated by the necessity to predict and characterize new biological targets and new drugs. Biological targets are sought in a biological space designed from the genomic data of Plasmodium falciparum, but also using the millions of genomic data points from other species. Drug candidates are sought in a chemical space containing the millions of small molecules stored in public and private chemolibraries. Data management should therefore be as reliable and versatile as possible. In this context, we examined five aspects of the organization and mining of malaria genomic and post-genomic data: 1) the comparison of protein sequences, including compositionally atypical malaria sequences, 2) the high-throughput reconstruction of molecular phylogenies, 3) the representation of biological processes, particularly metabolic pathways, 4) versatile methods to integrate genomic data, biological representations and functional profiling obtained from X-omic experiments after drug treatments, and 5) the determination and prediction of protein structures and their molecular docking with drug candidate structures. Progress toward a grid-enabled chemogenomic knowledge space is discussed. Comment: 43 pages, 4 figures, to appear in Malaria Journal

    Novel Sequence-Based Method for Identifying Transcription Factor Binding Sites in Prokaryotic Genomes

    Computational techniques for microbial genomic sequence analysis are becoming increasingly important. With next-generation sequencing technology and the Human Microbiome Project underway, current sequencing capacity is significantly greater than the speed at which organisms of interest can be experimentally probed. We have developed a method that primarily uses available sequence data to determine prokaryotic transcription factor binding specificities. The prototypical prokaryotic transcription factor (TF) contains a helix-turn-helix (HTH) fold and binds DNA as a homodimer, leading to palindromic motif specificities. The connection between a TF and its promoter is based on the autoregulation phenomenon observed in E. coli: approximately 55% of the TFs analyzed were estimated to be autoregulated, and our preliminary analysis using RegulonDB indicates that this value increases to 79% if one considers neighboring operons. Given a TF family of interest, it is necessary to find the relevant TF proteins and their associated genomes. Due to the scale-free network topology of prokaryotic systems, many transcriptional regulators regulate only one or a few operons, so within a single genome there is not enough sequence-based signal to determine the binding site using standard computational methods. Therefore, multiple bacterial genomes are used to overcome this lack of signal within a single genome. We use a distance-based criterion to define operon boundaries and their respective promoters. Several TF-DNA crystal structures are then used to determine the residues that interact with the DNA. These key residues are the basis for the TF comparison metric, the assumption being that similar residues should impart similar DNA binding specificities. After defining sets of TF clusters using this metric, their respective promoters are used as input to a motif finding procedure.
    This method has been tested on the LacI and TetR TF families with successful results. On external validation sets, the specificity of prediction is approximately 80%. These results are important for developing methods to define the DNA binding preferences of TF protein residues, known as the “recognition code”. This “recognition code” would allow computational design and prediction of novel DNA-binding specificities, enabling protein engineering and synthetic biology applications.
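The motif-finding step that the clustered promoters feed into is not specified in detail above; a standard building block for it is a position weight matrix (PWM) built from aligned binding sites and scanned along each promoter. The sketch below uses hypothetical helper names and log-odds scores against a uniform background, purely to illustrate the idea:

```python
import math

def pwm(sites, pseudocount=1.0):
    """Build a position weight matrix (log2-odds vs uniform background)
    from a list of aligned, equal-length binding sites."""
    width = len(sites[0])
    matrix = []
    for i in range(width):
        counts = {b: pseudocount for b in "ACGT"}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2((counts[b] / total) / 0.25) for b in "ACGT"})
    return matrix

def best_hit(seq, matrix):
    """Slide the PWM along a promoter; return (best score, best offset)."""
    w = len(matrix)
    return max(
        (sum(matrix[i][seq[o + i]] for i in range(w)), o)
        for o in range(len(seq) - w + 1)
    )
```

For example, a PWM built from `["TTGACA", "TTGACA", "TTGTCA"]` and scanned over `"GGGTTGACAGG"` recovers the embedded site at offset 3 with a positive log-odds score.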

    Somatic mutations and tissue turnover in glioblastoma and hematopoiesis

    Somatic mutations accumulate in tissues primarily through cell divisions. This observation opens the opportunity to use somatic mutations as clonal markers for inferring the past dynamics of cell turnover during tissue growth and homeostasis. In this thesis, I develop mathematical approaches to the inference problems that are stimulated by deep genome sequencing data of malignant growth and physiological tissue turnover. In the first part of this thesis I reconstruct the evolutionary history of adult glioblastoma, a highly aggressive brain cancer, prior to and after standard therapy. To this end, I develop a likelihood-based multinomial model that jointly infers genetic subclones and their phylogenetic relationships from whole genome sequencing data. Applied to 21 sample pairs from primary and recurrent glioblastomas, the model infers a common path of early tumorigenesis characterized by three pervasive copy number changes on chromosome 7 (gain), chromosome 9p (loss) and chromosome 10 (loss). TERT promoter mutations are subclonal in one third of the tumor pairs and are thus placed at a later stage of tumorigenesis. Our data indicate that recurrent tumors typically re-grow from multiple subclones of the primary tumor with no evidence for a 'resistance genotype' induced by therapy. Combining the results from phylogenetic inference with population dynamics models of tumor growth, I estimate that glioblastomas originate several years prior to initial diagnosis but reach detectable sizes only after TERT promoter mutations stabilized cellular survival. This project provides new insights into the evolutionary history of glioblastoma that may ultimately aid early diagnosis. In the second part I analyze the mutation frequency distribution in normal tissues. To this end, I extend existing theory on mutation accumulation in exponentially growing tissues to a two-stage situation of initial embryonic expansion and subsequent homeostasis during adulthood. 
    Based on stochastic simulations I show that the theoretical framework recovers the average mutation frequency spectrum in stem cell populations. Whole genome sequencing data from murine granulocytes and human leukocytes from subjects of different ages without diagnosed leukemia confirm the model prediction in the majority of cases but reveal an unexpectedly high mutational burden in a smaller subgroup. These cases were associated with one or several leukemic driver mutations, suggesting that perturbed hematopoiesis or pre-leukemic expansions caused the deviation of the mutation frequency spectrum from neutrality. The comprehensive analysis of mutation frequency spectra in normal and perturbed hematopoiesis may aid the understanding of tumor initiation in vivo.
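The accumulation of neutral mutations during exponential growth, which underlies the mutation frequency spectra discussed above, can be mimicked with a very small stochastic simulation. This is an illustrative toy model (synchronous divisions and a crude integer spread instead of a true Poisson number of mutations per daughter), not the two-stage growth-then-homeostasis framework developed in the thesis:

```python
import random
from collections import Counter

def grow_and_mutate(generations=10, mu=2, seed=1):
    """Toy model of neutral mutation accumulation during exponential growth.

    One founder cell divides synchronously for `generations` rounds; each
    daughter acquires roughly mu private mutations (crude integer spread
    instead of a Poisson draw). Returns (Counter of mutation id -> number
    of final cells carrying it, final population size).
    """
    rng = random.Random(seed)
    next_id = 0
    cells = [frozenset()]  # each cell = set of mutation ids it carries
    for _ in range(generations):
        new_cells = []
        for c in cells:
            for _ in range(2):  # two daughters per division
                k = rng.randint(mu - 1, mu + 1)
                new_cells.append(c | frozenset(range(next_id, next_id + k)))
                next_id += k
        cells = new_cells
    freq = Counter()
    for c in cells:
        for m in c:
            freq[m] += 1
    return freq, len(cells)
```

Under neutral growth the resulting spectrum is dominated by low-frequency variants (early mutations are few but reach high frequency; late mutations are many but rare), roughly the 1/f baseline against which an elevated mutational burden would stand out.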

    Modeling correlation in binary count data with application to fragile site identification

    Available fragile site identification software packages (FSM and FSM3) assume that all chromosomal breaks occur independently. However, under a Mendelian model of inheritance, homozygosity at fragile loci implies pairwise correlation between homologous sites. We construct correlation models for chromosomal breakage data in situations where either partitioned break count totals (per-site single-break and double-break totals) are known or only overall break count totals are known. We derive a likelihood ratio test and Neyman's C(α) test for correlation between homologs when partitioned break count totals are known, and outline a likelihood ratio test for correlation using only break count totals. Our simulation studies indicate that the C(α) test using partitioned break count totals outperforms the other two tests for correlation in terms of both power and level. These studies further suggest that the power for detecting correlation is low when only break count totals are reported. Results of the C(α) test for correlation applied to chromosomal breakage data from 14 human subjects indicate that detection of correlation between homologous fragile sites is problematic due to sparseness of breakage data. Simulation studies of the FSM and FSM3 algorithms using parameter values typical for fragile site data demonstrate that neither algorithm is significantly affected by fragile site correlation. Comparison of simulated fragile site misclassification rates in the presence of zero-breakage data supports previous studies (Olmsted 1999) suggesting that FSM has lower false-negative rates and FSM3 has lower false-positive rates.
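The partitioned-count likelihood ratio and C(α) tests above are too involved to reproduce here, but the core idea of testing for dependence between homologous breaks can be illustrated with a plain likelihood ratio (G) test of independence on a 2×2 table of per-cell outcomes. This is a simplified stand-in, assuming break indicators for both homologs are available per cell:

```python
import math

def g_test_2x2(table):
    """Likelihood ratio (G) test of independence for a 2x2 table.

    table = [[n00, n01], [n10, n11]]: counts of cells by break status on
    homolog 1 (rows) and homolog 2 (columns). G is asymptotically
    chi-square with 1 degree of freedom under independence.
    """
    n = sum(sum(row) for row in table)
    g = 0.0
    for i in (0, 1):
        for j in (0, 1):
            observed = table[i][j]
            expected = sum(table[i]) * (table[0][j] + table[1][j]) / n
            if observed > 0:
                g += 2 * observed * math.log(observed / expected)
    return g
```

A balanced table gives G near zero, while an excess of double-break and no-break cells (the signature of correlated homologs) pushes G well above the 3.84 critical value at the 5% level.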

    Probabilistic analysis of the human transcriptome with side information

    Understanding the functional organization of genetic information is a major challenge in modern biology. Following the initial publication of the human genome sequence in 2001, advances in high-throughput measurement technologies and efficient sharing of research material through community databases have opened up new views on the study of living organisms and the structure of life. In this thesis, novel computational strategies have been developed to investigate a key functional layer of genetic information, the human transcriptome, which regulates the function of living cells through protein synthesis. The key contributions of the thesis are general exploratory tools for high-throughput data analysis that have provided new insights into cell-biological networks, cancer mechanisms and other aspects of genome function. A central challenge in functional genomics is that high-dimensional genomic observations are associated with high levels of complex and largely unknown sources of variation. By combining statistical evidence across multiple measurement sources with the wealth of background information in genomic data repositories, it has been possible to resolve some of the uncertainties associated with individual observations and to identify functional mechanisms that could not be detected from individual measurement sources. Statistical learning and probabilistic models provide a natural framework for such modeling tasks. Open source implementations of the key methodological contributions have been released to facilitate further adoption of the developed methods by the research community. Comment: Doctoral thesis. 103 pages, 11 figures