51 research outputs found

    Data structures and algorithms for analysis of alternative splicing with RNA-Seq data


    USING MACHINE LEARNING TECHNIQUES FOR FINDING MEANINGFUL TRANSCRIPTS IN PROSTATE CANCER PROGRESSION

    Prostate cancer is one of the most common types of cancer among Canadian men. Next-generation sequencing with RNA-Seq can be valuable in studying cancer, since it provides large amounts of data as a source of information about biomarkers. For these reasons, we have chosen RNA-Seq data on prostate cancer progression for our study. In this research, we propose a new method for finding transcripts that can be used as genomic features. To this end, we gathered a very large number of transcripts. Many of these transcripts are not relevant, and we filter them out by applying a feature selection algorithm. The results are then passed to a machine learning classifier, such as a support vector machine, which is used to classify different stages of prostate cancer. Finally, we identify potential transcripts associated with prostate cancer progression. Ideally, these transcripts can be used to improve diagnosis, treatment, and drug development.
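The filtering step described in this abstract can be illustrated with a small sketch. The abstract does not name the feature selection algorithm, so the signal-to-noise filter below is a stand-in chosen for illustration, and the transcript names and expression values are invented. The classification step (for example, a support vector machine) is left out; it would be trained on the selected transcripts only.

```python
from statistics import mean, pstdev


def signal_to_noise(values_a, values_b):
    """Absolute difference of class means, scaled by the summed class spreads."""
    diff = abs(mean(values_a) - mean(values_b))
    spread = pstdev(values_a) + pstdev(values_b)
    if spread == 0:
        # Constant within both classes: perfectly separated or identical.
        return float("inf") if diff > 0 else 0.0
    return diff / spread


def select_transcripts(expression, labels, k):
    """Return the k transcript names that best separate the two labels.

    expression: dict mapping transcript name -> list of per-sample values
    labels: list of 0/1 stage labels, one per sample (hypothetical encoding)
    """
    scores = {}
    for name, values in expression.items():
        group0 = [v for v, y in zip(values, labels) if y == 0]
        group1 = [v for v, y in zip(values, labels) if y == 1]
        scores[name] = signal_to_noise(group0, group1)
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data, invented for illustration: 3 transcripts measured in 6 samples.
labels = [0, 0, 0, 1, 1, 1]
expression = {
    "tx_informative": [1.0, 1.2, 0.9, 5.1, 4.8, 5.3],
    "tx_noisy": [2.0, 7.0, 4.0, 3.0, 6.0, 5.0],
    "tx_flat": [3.0, 3.0, 3.0, 3.0, 3.0, 3.0],
}
top = select_transcripts(expression, labels, k=1)
```

A filter of this kind scores each transcript independently, which keeps it cheap on very large transcript sets but ignores interactions between transcripts.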

    Improving the Performance and Precision of Bioinformatics Algorithms

    Recent advances in biotechnology have enabled scientists to generate and collect huge amounts of biological experimental data. Software tools for analyzing both genomic (DNA) and proteomic (protein) data with high speed and accuracy have thus become very important in modern biological research. This thesis presents several techniques for improving the performance and precision of bioinformatics algorithms used by biologists. Improvements in both the speed and cost of automated DNA sequencers have allowed scientists to sequence the DNA of an increasing number of organisms. One way biologists can take advantage of this genomic DNA data is to use it in conjunction with expressed sequence tag (EST) and cDNA sequences to find genes and their splice sites. This thesis describes ESTmapper, a tool designed to use an eager write-only top-down (WOTD) suffix tree to efficiently align DNA sequences against known genomes. Experimental results show that ESTmapper can be much faster than previous techniques for aligning and clustering DNA sequences, and produces alignments of comparable or better quality. Peptide identification by tandem mass spectrometry (MS/MS) is becoming the dominant high-throughput proteomics workflow for protein characterization in complex samples. Biologists currently rely on protein database search engines to identify peptides producing experimentally observed mass spectra. This thesis describes two approaches for improving peptide identification precision using statistical machine learning. HMMatch (HMM MS/MS Match) is a hidden Markov model approach to spectral matching, in which many examples of a peptide fragmentation spectrum are summarized in a generative probabilistic model that captures the consensus and variation of each peak's intensity. Experimental results show that HMMatch can identify many peptides missed by traditional spectral matching and search engines. 
PepArML (Peptide Identification Arbiter by Machine Learning) is a machine-learning-based framework for improving the precision of peptide identification. It uses classification algorithms to combine spectral features and scores from multiple search engines in a single model-free framework that can be trained in an unsupervised manner. Experimental results show that PepArML can improve the sensitivity of peptide identification for several synthetic protein mixtures compared with individual search engines.
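The idea of indexing a genome so that query sequences can be located quickly can be sketched as follows. This is not ESTmapper's method: the abstract describes a write-only top-down suffix tree, whereas the sketch swaps in a plain suffix array with binary search, which answers the same exact-match question in a few lines. The function names and the toy genome are assumptions made for illustration.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order.

    Naive construction by sorting suffix strings; fine for a toy example,
    far too slow and memory-hungry for a real genome.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start positions of every exact match of pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


genome = "ACGTACGTTACG"  # toy sequence, invented for illustration
sa = build_suffix_array(genome)
hits = find_occurrences(genome, sa, "ACG")
```

Exact matches found this way would typically serve as seeds that a full aligner then extends across mismatches and introns.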

    Methods for Epigenetic Analyses from Long-Read Sequencing Data

    Epigenetics, particularly the study of DNA methylation, is a cornerstone field for our understanding of human development and disease. DNA methylation has been included in the "hallmarks of cancer" due to its important function as a biomarker and its contribution to carcinogenesis and cancer cell plasticity. Long-read sequencing technologies, such as the Oxford Nanopore Technologies platform, have advanced the study of structural variations, while at the same time allowing direct measurement of DNA methylation on the same reads. With this, new avenues of analysis have opened up, such as long-range allele-specific methylation analysis, methylation analysis on structural variations, or relating nearby epigenetic modalities on the same read to one another. Basecalling and methylation calling of Nanopore reads are computationally expensive tasks that require complex machine learning architectures. Read-level methylation calls require different approaches to data management and analysis than those developed for methylation frequencies measured from short-read technologies or array data. Because DNA methylation calls are 2-dimensional, associated with both a read and a genomic position and carrying methylation caller uncertainties, they are much more costly to store than 1-dimensional methylation frequencies. Methods for storage, retrieval, and analysis of such data therefore require careful consideration. Downstream analysis tasks, such as methylation segmentation or differential methylation calling, can potentially benefit from read information and allow uncertainty propagation. These avenues had not been considered in existing tools. In my work, I explored the potential of long-read DNA methylation analysis and tackled some of the challenges of data management and downstream analysis using state-of-the-art software architecture and machine learning methods. 
I defined a storage standard for reference-anchored and read-assigned DNA methylation calls, including methylation calling uncertainties and read annotations such as haplotype or sample information. This storage container is defined as a schema for the Hierarchical Data Format version 5 (HDF5), includes an index for rapid access to genomic coordinates, and is optimized for parallel computing with even load balancing. It further includes a Python API for creation, modification, and data access, as well as a command line interface with convenience functions for extracting important quality statistics. Furthermore, I developed software solutions for the segmentation and differential methylation testing of DNA methylation calls from Nanopore sequencing. This implementation takes advantage of the performance benefits provided by my high-performance storage container. It includes a Bayesian methylome segmentation algorithm which allows for the consensus instance segmentation of multiple sample- and/or haplotype-assigned DNA methylation profiles, while considering methylation calling uncertainties. Based on this segmentation, the software can then perform differential methylation testing and provides a large number of options for statistical testing and multiple testing correction. I benchmarked all tools on both simulated and publicly available real data, and show the performance benefits compared to previously existing and concurrently developed solutions. Next, I applied the methods to a cancer study on a chromothriptic cancer sample from a patient with Sonic Hedgehog Medulloblastoma. Here I report regulatory genomic regions differentially methylated before and after treatment, allele-specific methylation in the tumor, as well as methylation on chromothriptic structures. Finally, I developed specialized methylation callers for the combined DNA methylation profiling of CpG, GpC, and context-free adenine methylation. 
These callers can be used to measure chromatin accessibility in a NOMe-seq-like setup, showing the potential of long-read sequencing for the profiling of transcription factor co-binding. In conclusion, this thesis presents and subsequently benchmarks new algorithmic and infrastructural solutions for the analysis of DNA methylation data from long-read sequencing.
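The contrast between read-level calls with uncertainties and a 1-dimensional methylation frequency can be made concrete with a short sketch. It assumes that each call is a tuple of read identifier, genomic position, and a log-likelihood ratio (positive values favoring methylation), and that calls below a confidence cutoff are ignored when computing a frequency. The tuple layout, the cutoff of 2.0, and the function names are illustrative assumptions, not the storage schema or API described in the thesis.

```python
import math
from collections import defaultdict


def llr_to_probability(llr):
    """Map a log-likelihood ratio (methylated vs. unmethylated) to a
    methylation probability, assuming equal prior odds."""
    return 1.0 / (1.0 + math.exp(-llr))


def site_methylation(calls, min_abs_llr=2.0):
    """Collapse read-level calls into per-position summaries.

    calls: iterable of (read_id, position, llr) tuples
    Returns {position: (frequency, mean_probability)}, where frequency is
    the share of confident calls that are methylated (None if no call at
    the position is confident) and mean_probability averages over all
    calls, keeping the uncertain ones in play.
    """
    by_site = defaultdict(list)
    for _read_id, position, llr in calls:
        by_site[position].append(llr)

    summary = {}
    for position, llrs in by_site.items():
        confident = [x for x in llrs if abs(x) >= min_abs_llr]
        if confident:
            frequency = sum(x > 0 for x in confident) / len(confident)
        else:
            frequency = None
        mean_probability = sum(llr_to_probability(x) for x in llrs) / len(llrs)
        summary[position] = (frequency, mean_probability)
    return summary


# Toy calls, invented for illustration.
calls = [
    ("read1", 100, 5.0),
    ("read2", 100, -4.0),
    ("read3", 100, 0.5),   # uncertain call, excluded from the frequency
    ("read1", 120, 3.0),
    ("read2", 120, 2.5),
]
summary = site_methylation(calls)
```

Collapsing to a frequency discards both the read identity (needed for haplotype-aware analysis) and the per-call uncertainty, which is the information the read-level representation is meant to preserve.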

    Novel Algorithm Development for ‘Next Generation’ Sequencing Data Analysis

    In recent years, the decreasing cost of ‘next generation’ sequencing (NGS) has spawned numerous applications for interrogating whole genomes and transcriptomes in research, diagnostic and forensic settings. While the innovations in sequencing have been explosive, the development of scalable and robust bioinformatics software and algorithms for the analysis of new types of data generated by these technologies has struggled to keep up. As a result, large volumes of NGS data available in public repositories are severely underutilised, despite providing a rich resource for data mining applications. Indeed, the bottleneck in genome and transcriptome sequencing experiments has shifted from data generation to bioinformatics analysis and interpretation. This thesis focuses on the development of novel bioinformatics software to bridge the gap between data availability and interpretation. The work is split between two core topics: computational prioritisation/identification of disease gene variants, and identification of RNA N6-adenosine methylation from sequencing data. The first chapter briefly discusses the emergence and establishment of NGS technology as a core tool in biology and its current applications and perspectives. Chapter 2 introduces the problem of variant prioritisation in the context of Mendelian disease, where tens of thousands of potential candidates are generated by a typical sequencing experiment. Novel software developed for candidate gene prioritisation is described that utilises data mining of tissue-specific gene expression profiles (Chapter 3). The second part investigates an alternative approach to candidate variant prioritisation by leveraging functional and phenotypic descriptions of genes and diseases from multiple biomedical domain ontologies (Chapter 4). Chapter 5 discusses N6-adenosine methylation, a recently re-discovered post-transcriptional modification of RNA. 
The core of the chapter describes novel software developed for transcriptome-wide detection of this epitranscriptomic mark from sequencing data. Chapter 6 presents a case study application of the software, reporting the previously uncharacterised RNA methylome of Kaposi’s Sarcoma Herpes Virus. The chapter further discusses a putative novel N6-methyladenosine RNA-binding protein and its possible roles in the progression of viral infection.

    Global gene expression profiling of healthy human brain and its application in studying neurological disorders

    The human brain is the most complex structure known to mankind, and one of the greatest challenges in modern biology is to understand how it is built and organized. The power of the brain arises from its variety of cells and structures, and ultimately from where and when different genes are switched on and off throughout the brain tissue. In other words, brain function depends on the precise regulation of gene expression in its sub-anatomical structures. However, our understanding of the complexity and dynamics of the transcriptome of the human brain is still incomplete. To address this need, we designed a gene expression model that accurately defines the consistent blueprint of the brain transcriptome, thereby identifying the core brain-specific transcriptional processes conserved across individuals. Functionally characterizing this model would provide profound insights into the transcriptional landscape, biological pathways and the expression distribution of neurotransmitter systems. In this dissertation, we developed an expression model by capturing the similarly expressed gene patterns across congruently annotated brain structures in six individual brains, using data from the Allen Brain Atlas (ABA). We found that 84% of genes are expressed in at least one of the 190 brain structures. By employing hierarchical clustering, we were able to show that distinct structures of a larger brain region can cluster together while still retaining their expression identity. Further, weighted correlation network analysis identified 19 robust modules of coexpressed genes in the brain that demonstrated a wide range of functional associations. Since signatures of local phenomena can be masked by larger signatures, we performed local analysis on each distinct brain structure. Pathway and gene ontology enrichment analysis on these structures showed striking enrichment for brain-region-specific processes. 
We also mapped the structural distribution of the expression profiles of genes associated with major neurotransmission systems in the human brain. In addition, we postulated that gene expression in healthy brain tissue can be used to predict potential genes involved in a neurological disorder in the absence of data from diseased tissues. To this end, we developed a supervised classification model, which achieved an accuracy of 84% and an AUC (Area Under the Curve) of 0.81 from ROC plots, for predicting autism-implicated genes using the healthy expression model as the baseline. This study represents the first use of healthy brain gene expression to predict genes implicated in autism, and this generic methodology can be applied to predict genes involved in other neurological disorders.
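For readers unfamiliar with the AUC figure quoted in this abstract, the sketch below shows one standard way to compute it: the area under the ROC curve equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The labels and scores are invented, and nothing here reproduces the classifier or the data of the dissertation.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    labels: list of 0/1 class labels (1 = positive, e.g. disease-implicated)
    scores: list of classifier scores, higher meaning more likely positive
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Toy example, invented for illustration.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

The pairwise form is quadratic in the number of examples; for large gene sets a rank-based computation gives the same value more efficiently.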

    Deletions and Functional Assessment of Exonic Variants affecting Splicing in Genes underlying Inherited Retinal Dystrophies

    The present dissertation comprises research about the molecular etiology of certain Mendelian disorders characterized by retinal malfunction and visual impairment. I performed a detailed investigation of de novo mutation events entailing genomic deletions or the conversion to pathogenic haplotypes in exon 3 of OPN1LW and OPN1MW underlying the occurrence of Blue Cone Monochromacy. Yet, the focus of this thesis goes beyond variant identification and includes functional characterization to assess putative pathogenic variants and establish a robust genotype-phenotype correlation. Exonic variants were assessed at the transcript level, specifically with respect to splicing efficiency in terms of exon retention or exon skipping. Namely, the exonic variant c.1684C>T/p.Arg562Trp in Pde6a induces in-frame exon skipping in >60% of murine retinal transcripts. The homologous missense mutation in a human RP patient exerts a second pathomechanism at the protein level, reducing the enzymatic activity of PDE6A. Comparisons with analogous mutations in ortholog and paralog genes underscore the potential influence of synonymous variants on splicing. 
In this sense, other than single exonic variants, I retrospectively characterized the effect of haplotypes confined to exon 3 of OPN1LW and OPN1MW in Blue Cone Monochromacy patients on splicing efficiency. Nine out of twelve haplotypes individually assessed by a semi-quantitative minigene splicing assay resulted in a fraction of ≄20% of aberrantly spliced transcripts. To explore the full breadth of exon 3 haplotype induced splicing defects I developed a parallelized minigene assay leveraging the newest sequencing technologies to quantify for more than 200 haplotypes both the extent of splicing defect exerted by each haplotype and the overall effect of each exonic variant within the haplotype. These experiments showed that c.532A>G and c.538T>G in exon 3 of OPN1LW/MW are the two variants with the highest impact on exon retention during transcript splicing. An RNA-pulldown assay including the haplotype region identified the hnRNPF splicing factor as a putative candidate, which binds to guanosine triplets created by the variants c.532A>G and c.538T>G
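The ≥20% criterion used in this abstract amounts to simple arithmetic on the amounts of correctly and aberrantly spliced transcript per haplotype. The sketch below assumes these amounts are available as two non-negative numbers per haplotype (for example, read counts from a sequencing-based assay); the haplotype names and counts are invented, and the function names are hypothetical.

```python
def aberrant_fraction(correct, aberrant):
    """Fraction of transcripts that are aberrantly spliced."""
    total = correct + aberrant
    if total == 0:
        raise ValueError("no transcripts observed for this haplotype")
    return aberrant / total


def flag_haplotypes(counts, threshold=0.20):
    """Mark haplotypes whose aberrant fraction reaches the threshold.

    counts: dict mapping haplotype name -> (correct, aberrant)
    """
    return {
        name: aberrant_fraction(correct, aberrant) >= threshold
        for name, (correct, aberrant) in counts.items()
    }


# Toy counts, invented for illustration.
counts = {
    "hap_A": (950, 50),    # 5% aberrant
    "hap_B": (600, 400),   # 40% aberrant
    "hap_C": (750, 250),   # 25% aberrant
}
flags = flag_haplotypes(counts)
```

With many haplotypes measured in parallel, the same per-haplotype fractions could also be compared between haplotypes that differ in a single variant to estimate that variant's overall effect.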

    Resolving Biological Trajectories in Single-cell Data using Feature Selection and Multi-modal Integration

    Single-cell technologies can readily measure the expression of thousands of molecular features from individual cells undergoing dynamic biological processes, such as cellular differentiation, immune response, and disease progression. While computational trajectory inference methods and RNA velocity approaches have been developed to study how subtle changes in gene or protein expression impact cell fate decision-making, identifying characteristic features that drive continuous biological processes remains difficult due to the inherent biological and technical challenges associated with single-cell data. Here, we developed two data representation-based approaches for improving inference of cellular dynamics. First, we present DELVE, an unsupervised feature selection method for identifying a representative subset of dynamically expressed molecular features that resolve cellular trajectories in noisy data. In contrast to previous work, DELVE uses a bottom-up approach to mitigate the effect of unwanted sources of variation confounding inference and models cell states from dynamic feature modules that constitute core regulatory complexes. Using simulations, single-cell RNA sequencing data, and iterative immunofluorescence imaging data in the context of cell cycle and cellular differentiation, we demonstrate that DELVE selects genes or proteins that more accurately characterize cell populations and improve the recovery of cell type transitions. Next, we present the first task-oriented benchmarking study that investigates integration of temporal gene expression modalities for dynamic cell state prediction. We benchmark ten multi-modal integration approaches on ten datasets spanning different biological contexts, sequencing technologies, and species. 
This study illustrates how temporal gene expression modalities can be optimally combined to improve inference of cellular trajectories and more accurately predict sample-associated perturbation and disease phenotypes. Lastly, we illustrate an application of these approaches and perform an integrative analysis of gene expression and RNA velocity data to study the crosstalk between signaling pathways that govern the mesendoderm fate decision during directed definitive endoderm differentiation. Results of this study suggest that lineage-specific, temporally expressed genes within the primitive streak may serve as potential targets for increasing definitive endoderm efficiency. Collectively, this work uses scalable data-driven approaches to effectively manage the inherent biological and technical challenges associated with single-cell data in order to improve inference of cellular dynamics.