13 research outputs found

    Unsupervised and semi-supervised training methods for eukaryotic gene prediction

    This thesis describes new gene finding methods for eukaryotic gene prediction. The current methods for deriving model parameters for gene prediction algorithms are based on curated or experimentally validated sets of genes or gene elements. Building these training sets often requires time and additional expert effort, especially for species in the initial stages of genome sequencing. Unsupervised training allows determination of model parameters from anonymous genomic sequence. Given the ever-growing rate of eukaryotic genome sequencing, unsupervised training is of critical practical importance. Three distinct training procedures are developed for a diverse group of eukaryotic species. GeneMark-ES is developed for species with strong donor and acceptor site signals, such as Arabidopsis thaliana, Caenorhabditis elegans and Drosophila melanogaster. The second version of the algorithm, GeneMark-ES-2, introduces an enhanced intron model to better describe the gene structure of fungal species, which possess relatively weak donor and acceptor splice sites and a well-conserved branch point signal. GeneMark-LE, a semi-supervised training approach, is designed for eukaryotic species with a small number of introns. The results indicate that the developed unsupervised training methods perform well both in comparison with other training methods and as estimated from the set of genes supported by EST-to-genome alignments. Analysis of novel genomes reveals interesting biological findings and shows that the current set of annotated fungal genomes contains several candidate under-annotated and over-annotated species. Ph.D. Committee Chair: Mark Borodovsky; Committee Member: Jung H. Choi; Committee Member: King Jordan; Committee Member: Leonid Bunimovich; Committee Member: Yury Chernoff

    Computing the likelihood of sequence segmentation under Markov modelling

    I tackle the problem of partitioning a sequence into homogeneous segments, where homogeneity is defined by a set of Markov models. The problem is to study the likelihood that a sequence is divided into a given number of segments. Here, the moments of this likelihood are computed through an efficient algorithm. Unlike methods involving Hidden Markov Models, this algorithm does not require transition probabilities between the models. Among many possible uses of the likelihood, I present a maximum a posteriori probability criterion to predict the number of homogeneous segments into which a sequence can be divided, and an application of this method to finding CpG islands.
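To make the quantity in this abstract concrete, the sketch below evaluates the log-likelihood of a sequence under one fixed segmentation, with one first-order Markov model per segment. It is an illustration only, not the paper's moment-computation algorithm. The function name `segmentation_log_likelihood`, the `UNIFORM` model and the dictionary layout of each model are assumptions made for this example.

```python
import math

BASES = "ACGT"

# Hypothetical background model: every base and every transition equally likely.
# A model is a pair (initial probabilities, transition probabilities).
UNIFORM = (
    {b: 0.25 for b in BASES},
    {a: {b: 0.25 for b in BASES} for a in BASES},
)


def segmentation_log_likelihood(seq, boundaries, models):
    """Log-likelihood of seq when segment i, seq[boundaries[i]:boundaries[i+1]],
    is generated independently by the first-order Markov model models[i]."""
    if len(boundaries) != len(models) + 1:
        raise ValueError("need exactly one model per segment")
    total = 0.0
    for i, (initial, transition) in enumerate(models):
        segment = seq[boundaries[i]:boundaries[i + 1]]
        if not segment:
            raise ValueError("every segment must contain at least one symbol")
        # First symbol of the segment is drawn from the initial distribution.
        total += math.log(initial[segment[0]])
        # Every later symbol is drawn conditionally on its predecessor.
        for prev, cur in zip(segment, segment[1:]):
            total += math.log(transition[prev][cur])
    return total
```

Comparing this value across candidate segmentations, for example with a hypothetical CpG-rich model assigned to one segment, shows which boundaries the models favour. No transition probabilities between models are needed, matching the contrast with Hidden Markov Models drawn in the abstract.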

    Reconstructing the energy landscape of a distribution from Monte Carlo samples

    Defining the energy function as the negative logarithm of the density, we explore the energy landscape of a distribution via the tree of sublevel sets of its energy. This tree represents the hierarchy among the connected components of the sublevel sets. We propose ways to annotate the tree so that it provides information on both topological and statistical aspects of the distribution, such as the local energy minima (local modes), their local domains and volumes, and the barriers between them. We develop a computational method to estimate the tree and reconstruct the energy landscape from Monte Carlo samples simulated over a wide energy range of a distribution. This method can be applied to an arbitrary distribution on a space with defined connectedness. We test the method on multimodal distributions and posterior distributions to show that our estimated trees are accurate compared to theoretical values. When used to perform Bayesian inference of DNA sequence segmentation, this approach reveals much more information than the standard approach based on marginal posterior distributions. Comment: Published at http://dx.doi.org/10.1214/08-AOAS196 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
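The two definitions that this abstract relies on can be written out as follows. The symbols $\pi$, $E$ and $S(e)$ are notation chosen for this note and may differ from the paper's own.

```latex
% Energy as the negative log-density, and its sublevel set at level e
E(x) = -\log \pi(x), \qquad
S(e) = \{\, x : E(x) \le e \,\}.

% Sublevel sets are nested, so connected components can only appear or merge
% as the level e increases; this nesting is what yields a tree.
e_1 \le e_2 \;\Longrightarrow\; S(e_1) \subseteq S(e_2).
```

Under these definitions a new connected component appears at the energy of each local minimum (local mode), and two components merge at the energy of the barrier between them.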

    Computational Characterization of 3′ Splice Variants in the GFAP Isoform Family

    Glial fibrillary acidic protein (GFAP) is an intermediate filament (IF) protein specific to central nervous system (CNS) astrocytes. It has been the subject of intense interest due to its association with neurodegenerative diseases, and because of growing evidence that IF proteins not only modulate cellular structure, but also cellular function. Moreover, GFAP has a family of splicing isoforms apparently more complex than that of other CNS IF proteins, consistent with it possessing a range of functional and structural roles. The gene consists of 9 exons, and to date all isoforms associated with 3′ end splicing have been identified from modifications within intron 7, resulting in the generation of exon 7a (GFAPδ/ε) and 7b (GFAPκ). To better understand the nature and functional significance of variation in this region, we used a Bayesian multiple change-point approach to identify conserved regions. This is the first successful application of this method to a single gene – it has previously only been used in whole-genome analyses. We identified several highly or moderately conserved regions throughout the intron 7/7a/7b regions, including untranslated regions and regulatory features, consistent with the biology of GFAP. Several putative unconfirmed features were also identified, including a possible new isoform. We then integrated multiple computational analyses on both the DNA and protein sequences from the mouse, rat and human, showing that the major isoform, GFAPα, has highly conserved structure and features across the three species, whereas the minor isoforms GFAPδ/ε and GFAPκ have low conservation of structure and features at the distal 3′ end, both relative to each other and relative to GFAPα. The overall picture suggests distinct and tightly regulated functions for the 3′ end isoforms, consistent with complex astrocyte biology. The results illustrate a computational approach for characterising splicing isoform families, using both DNA and protein sequences.

    Statistical analysis of high-throughput sequencing count data

    All of the work presented in this thesis grew out of collaborations with other researchers. For each chapter, I briefly summarize my contribution and acknowledge the contributions of others. Chapter 2 represents a conceptual framework for modeling read counts using various distributions. These ideas grew out of conversations with Ho-Ryun Chung at the Max Planck Institute for Molecular Genetics (MPIMG) in Berlin and Simon Anders at the European Molecular Biology Laboratories (EMBL) in Heidelberg. Chapter 3 was published in Statistical Applications in Genetics and Molecular Biology [1]. The idea for detecting copy number variants in exome-enriched sequencing data was proposed by Stefan Haas, and various methods were tested and evaluated with Alena van Bommel. My contribution was developing the hidden Markov model, implementing the software and testing the performance. I wish to acknowledge the X-linked intellectual disabilities project team at MPIMG including H.-Hilger Ropers, Vera Kalscheuer, Ruping Sun, Anne-Katrin Emde, Wei Chen, Hao Hu and Tomasz Zemojtel, who provided helpful discussions. Chapter 4 resulted from a 5-month visit to the group of Wolfgang Huber at EMBL in Heidelberg. Simon Anders proposed the idea of incorporating priors for dispersion and log fold change into the DESeq framework. My contribution was to implement these new statistical methods as a new package, DESeq2, with closer integration with core Bioconductor packages. I would like to acknowledge all the members of the Huber group for helpful discussions. Chapter 5 resulted from a collaboration with the Transcriptional Regulation Group of Sebastiaan Meijsing at the MPIMG. I would like to thank Stephan Starick, who initially proposed to investigate the interaction between glucocorticoid receptor and the chromatin landscape. My contribution was the statistical analysis presented in the chapter. Sebastiaan Meijsing provided valuable feedback during the evolution of the project.
I wish to acknowledge the contributions of Morgane Thomas-Chollier, Katja Borzym, Sam Cooper and Ho-Ryun Chung.

    STED Nanoscopy to Illuminate New Avenues in Cancer Research – From Live Cell Staining and Direct Imaging to Decisive Preclinical Insights for Diagnosis and Therapy

    Molecular imaging is established as an indispensable tool in various areas of cancer research, ranging from basic cancer biology and preclinical research to clinical trials and medical practice. In particular, the field of fluorescence imaging has experienced exceptional progress during the last three decades with the development of various in vivo technologies. Within this field, fluorescence microscopy is primarily of experimental use since it is especially qualified for addressing the fundamental questions of molecular oncology. As stimulated emission depletion (STED) nanoscopy combines the highest spatial and temporal resolutions with live specimen compatibility, it is best-suited for real-time investigations of the differences in the molecular machineries of malignant and normal cells to eventually translate the acquired knowledge into increased diagnostic and therapeutic efficacy. This thesis presents the application of STED nanoscopy to two acute topics in cancer research of direct or indirect clinical interest. The first project has investigated the structure of telomeres, the ends of the linear eukaryotic chromosomes, in intact human cells at the nanoscale. To protect genome integrity, a telomere can mask the chromosome end by folding back and sequestering its single-stranded 3’-overhang in an upstream part of the double-stranded DNA repeat region. The formed t-loop structure has so far only been visualized by electron microscopy and fluorescence nanoscopy with cross-linked mammalian telomeric DNA after disruption of cell nuclei and spreading. For the first time, this work demonstrates the existence of t-loops within their endogenous nuclear environment in intact human cells. The identification of further telomere conformations has laid the groundwork for distinguishing cancerous cells that use different telomere maintenance mechanisms based on their individual telomere populations by a combined STED nanoscopy and deep learning approach. 
The population difference was essentially attributed to the promyelocytic leukemia (PML) protein that significantly perturbs the organization of a subpopulation of telomeres towards an open conformation in cancer cells that employ a telomerase-independent, alternative telomere lengthening mechanism. Elucidating the nanoscale topology of telomeres and associated proteins within the nucleus has provided new insight into telomere structure-function relationships relevant for understanding the deregulation of telomere maintenance in cancer cells. After understanding the molecular foundations, this newly gained knowledge can be exploited to develop novel or refined diagnostic and treatment strategies. The second project has characterized the intracellular distribution of recently developed prostate cancer tracers. These novel prostate-specific membrane antigen (PSMA) inhibitors have revolutionized the treatment regimen of prostate cancer by enabling targeted imaging and therapy approaches. However, the exact internalization mechanism and the subcellular fate of these tracers have remained elusive. By combining STED nanoscopy with a newly developed non-standard live cell staining protocol, this work confirmed cell surface clustering of the targeted membrane antigen upon PSMA inhibitor binding, subsequent clathrin-dependent endocytosis and endosomal trafficking of the antigen-inhibitor complex. PSMA inhibitors accumulate in prostate cancer cells at clinically relevant time points, but strikingly and in contrast to the targeted antigen itself, they eventually distribute homogeneously in the cytosol. This project has revealed the subcellular fate of PSMA/PSMA inhibitor complexes for the first time and provides crucial knowledge for the future application of these tracers including the development of new strategies in the field of prostate cancer diagnostics and therapeutics.
Relying on the photostability and biocompatibility of the applied fluorophores, the performance of live cell STED nanoscopy in the field of cancer research is boosted by the development of improved fluorophores. The third project in this thesis introduces a biocompatible, small molecule near-infrared dye suitable for live cell STED imaging. By the application of a halogen dance rearrangement, a dihalogenated fluorinatable pyridinyl rhodamine could be synthesized at high yield. The option of subsequent radiolabeling combined with excellent optical properties and a non-toxic profile renders this dye an appropriate candidate for medical and bioimaging applications. Providing an intrinsic and highly specific mitochondrial targeting ability, the radiolabeled analogue is suggested as a vehicle for multimodal (positron emission tomography and optical imaging) medical imaging of mitochondria for cancer diagnosis and therapeutic approaches in patients and biopsy tissue. The absence of cytotoxicity is not only a crucial prerequisite for clinically used fluorophores. To guarantee the generation of meaningful data mirroring biological reality, the absence of cytotoxicity is likewise a decisive property of dyes applied in live cell STED nanoscopy. The fourth project in this thesis proposes a universal approach for cytotoxicity testing based on characterizing the influence of the compound of interest on the proliferation behavior of human cell lines using digital holographic cytometry. By applying this approach to recently developed live cell STED compatible dyes, pronounced cytotoxic effects could be excluded. Looking more closely, some of the tested dyes slightly altered cell proliferation, so this project provides guidance on the right choice of dye for the least invasive live cell STED experiments. Ultimately, live cell STED data should be exploited to extract as much biological information as possible. 
However, some information might be partially hidden by image degradation due to the dynamics of living samples and the deliberate choice of rather conservative imaging parameters in order to preserve sample viability. The fifth project in this thesis presents a novel image restoration method in a Bayesian framework that simultaneously performs deconvolution, denoising as well as super-resolution, to restore images suffering from noise with mixed Poisson-Gaussian statistics. Established deconvolution or denoising methods that consider only one type of noise generally do not perform well on images degraded significantly by mixed noise. The newly introduced method was validated with live cell STED telomere data proving that the method can compete with state-of-the-art approaches. Taken together, this thesis demonstrates the value of an integrated approach for STED nanoscopy imaging studies. A coordinated workflow including sample preparation, image acquisition and data analysis provided a reliable platform for deriving meaningful conclusions for current questions in the field of cancer research. Moreover, this thesis emphasizes the strength of iteratively adapting the individual components in the operational chain and it particularly points towards those components that, if further improved, optimize the significance of the final results rendering live cell STED nanoscopy even more powerful.
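For readers unfamiliar with the term, one common textbook form of a mixed Poisson-Gaussian observation model is given below. It is an assumption added for orientation and is not taken from the thesis, whose exact formulation may differ. Here $x$ is the unknown image, $H$ the blur (point spread function) operator, $\alpha$ a detector gain and $\sigma^2$ the variance of the additive read-out noise; all of these symbols are chosen for this note.

```latex
% Observed pixel i: scaled photon count plus additive Gaussian noise
y_i = \alpha\, z_i + \varepsilon_i, \qquad
z_i \sim \mathrm{Poisson}\big((Hx)_i\big), \qquad
\varepsilon_i \sim \mathcal{N}(0, \sigma^2).

% Resulting mean and variance, with z_i and \varepsilon_i independent
\mathbb{E}[y_i] = \alpha\,(Hx)_i, \qquad
\mathrm{Var}[y_i] = \alpha^2 (Hx)_i + \sigma^2.
```

The variance has a signal-dependent part and a constant part, which is one way to see why methods that model only a single noise type can struggle on such data, as the abstract notes.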

    Segmenting eukaryotic genomes with the generalized Gibbs sampler

    Eukaryotic genomes display segmental patterns of variation in various properties, including GC content and degree of evolutionary conservation. DNA segmentation algorithms are aimed at identifying statistically significant boundaries between such segments. Such algorithms may provide a means of discovering new classes of functional elements in eukaryotic genomes. This paper presents a model and an algorithm for Bayesian DNA segmentation and considers the feasibility of using it to segment whole eukaryotic genomes. The algorithm is tested on a range of simulated and real DNA sequences, and the following conclusions are drawn. Firstly, the algorithm correctly identifies non-segmented sequence, and can thus be used to reject the null hypothesis of uniformity in the property of interest. Secondly, estimates of the number and locations of change-points produced by the algorithm are robust to variations in algorithm parameters and initial starting conditions and correspond to real features in the data. Thirdly, the algorithm is successfully used to segment human chromosome 1 according to GC content, thus demonstrating the feasibility of Bayesian segmentation of eukaryotic genomes. The software described in this paper is available from the author's website (www.uq.edu.au/~uqjkeith/) or upon request to the author.
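As a rough illustration of what a change-point in GC content means, the sketch below finds the single split of a sequence that most improves a two-segment Bernoulli likelihood over a one-segment one. This is a deliberately simplified maximum-likelihood stand-in, not the Bayesian model or the generalized Gibbs sampler of the paper. All function names are hypothetical.

```python
import math


def is_gc(base):
    """True if the base is G or C (case-insensitive)."""
    return base in "GCgc"


def bernoulli_log_lik(gc, n):
    """Maximised log-likelihood of observing gc G/C bases among n bases."""
    if n == 0 or gc == 0 or gc == n:
        return 0.0  # the fitted probability is 0 or 1, so the likelihood is 1
    p = gc / n
    return gc * math.log(p) + (n - gc) * math.log(1 - p)


def best_single_change_point(seq):
    """Return (position, gain): the split seq[:position] / seq[position:] with
    the largest log-likelihood gain over treating seq as one uniform segment.
    Returns (None, 0.0) when no split improves the likelihood."""
    n = len(seq)
    total_gc = sum(is_gc(b) for b in seq)
    baseline = bernoulli_log_lik(total_gc, n)
    best_pos, best_gain = None, 0.0
    left_gc = 0
    for pos in range(1, n):
        left_gc += is_gc(seq[pos - 1])  # GC count of seq[:pos]
        gain = (bernoulli_log_lik(left_gc, pos)
                + bernoulli_log_lik(total_gc - left_gc, n - pos)
                - baseline)
        if gain > best_gain:
            best_pos, best_gain = pos, gain
    return best_pos, best_gain
```

For a sequence made of an AT-only block followed by a GC-only block, the reported position is the block boundary. A homogeneous sequence yields no split, a crude analogue of the abstract's first conclusion about non-segmented sequence.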

    Ultrasensitive detection of Toxocara canis excretory-secretory antigens by a nanobody electrochemical magnetosensor assay

    Human Toxocariasis (HT) is a zoonotic disease caused by the migration of the larval stage of the roundworm Toxocara canis in the human host. Despite being the most cosmopolitan helminthiasis worldwide, its diagnosis is elusive. Currently, the detection of specific IgG immunoglobulins against the Toxocara Excretory-Secretory Antigens (TES), combined with clinical and epidemiological criteria, is the only strategy to diagnose HT. Cross-reactivity with other parasites and the inability to distinguish between past and active infections are the main limitations of this approach. Here, we present a sensitive and specific novel strategy to detect and quantify TES, aiming to identify active cases of HT. High specificity is achieved by making use of nanobodies (Nbs), recombinant single variable domain antibodies obtained from camelids, which due to their small molecular size (15 kDa) can recognize hidden epitopes not accessible to conventional antibodies. High sensitivity is attained by the design of an electrochemical magnetosensor with an amperometric readout, with all components of the assay mixed in one single step. Through this strategy, 10-fold higher sensitivity than a conventional sandwich ELISA was achieved. The assay reached a limit of detection of 2 and 15 pg/ml in PBST20 0.05% or serum, spiked with TES, respectively. These limits of detection are sufficient to detect clinically relevant toxocaral infections. Furthermore, our nanobodies showed no cross-reactivity with antigens from Ascaris lumbricoides or Ascaris suum. This is, to our knowledge, the most sensitive method to detect and quantify TES so far, and has great potential to significantly improve diagnosis of HT. Moreover, the characteristics of our electrochemical assay are promising for the development of point-of-care diagnostic systems using nanobodies as a versatile and innovative alternative to antibodies.
The next step will be the validation of the assay in clinical and epidemiological contexts.