13 research outputs found
Unsupervised and semi-supervised training methods for eukaryotic gene prediction
This thesis describes new gene finding methods for eukaryotic gene prediction. The current methods for deriving model parameters for gene prediction algorithms are based on curated or experimentally validated sets of genes or gene elements. Assembling these training sets often requires time and additional expert effort, especially for species that are in the initial stages of genome sequencing. Unsupervised training allows determination of model parameters from anonymous genomic sequence. The practical applicability of unsupervised training is critical given the ever-growing rate of eukaryotic genome sequencing.
Three distinct training procedures are developed for a diverse group of eukaryotic species. GeneMark-ES is developed for species with strong donor and acceptor site signals, such as Arabidopsis thaliana, Caenorhabditis elegans and Drosophila melanogaster. The second version of the algorithm, GeneMark-ES-2, introduces an enhanced intron model to better describe the gene structure of fungal species, which possess relatively weak donor and acceptor splice sites and a well conserved branch point signal. GeneMark-LE, a semi-supervised training approach, is designed for eukaryotic species with a small number of introns.
The results indicate that the developed unsupervised training methods perform well, both in comparison with other training methods and as assessed on the set of genes supported by EST-to-genome alignments.
Analysis of novel genomes reveals interesting biological findings and shows that several candidate under-annotated and over-annotated fungal species are present in the current set of annotated fungal genomes. Ph.D. Committee Chair: Mark Borodovky; Committee Member: Jung H. Choi; Committee Member: King Jordan; Committee Member: Leonid Bunimovich; Committee Member: Yury Chernof
Computing the likelihood of sequence segmentation under Markov modelling
I tackle the problem of partitioning a sequence into homogeneous segments,
where homogeneity is defined by a set of Markov models. The problem is to study
the likelihood that a sequence is divided into a given number of segments.
Here, the moments of this likelihood are computed through an efficient
algorithm. Unlike methods involving Hidden Markov Models, this algorithm does
not require probability transitions between the models. Among many possible
usages of the likelihood, I present a maximum a posteriori probability
criterion to predict the number of homogeneous segments into which a sequence
can be divided, and an application of this method to find CpG islands.
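The quantity discussed in this abstract can be illustrated with a small sketch. The code below is not the paper's algorithm. It is a simplified dynamic program, under stated assumptions, that computes the first moment (the mean) of the likelihood over all ways of cutting a sequence into exactly k segments. The assumptions are: zeroth-order models instead of general Markov models, a uniform weight on every boundary placement and model assignment, and adjacent segments allowed to share a model. The function names `logsumexp` and `mean_log_likelihood` are hypothetical.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))

def mean_log_likelihood(seq, models, max_k):
    """Return {k: log of the likelihood averaged over all k-segment splits}.

    Assumes max_k <= len(seq) and strictly positive symbol probabilities.
    """
    n, n_models = len(seq), len(models)
    # prefix[m][i] = log-likelihood of seq[:i] under zeroth-order model m
    prefix = []
    for model in models:
        acc = [0.0]
        for symbol in seq:
            acc.append(acc[-1] + math.log(model[symbol]))
        prefix.append(acc)

    def segment(a, b):
        # log of the likelihood of seq[a:b] summed over all model choices
        return logsumexp([p[b] - p[a] for p in prefix])

    result = {}
    prev = [0.0] + [-math.inf] * n  # zero segments cover zero symbols
    for k in range(1, max_k + 1):
        cur = [-math.inf] * (n + 1)
        for i in range(k, n + 1):
            # the last segment is seq[a:i]; the first a symbols use k-1 segments
            cur[i] = logsumexp([prev[a] + segment(a, i) for a in range(k - 1, i)])
        # number of (boundary placement, model assignment) configurations
        n_configs = math.comb(n - 1, k - 1) * n_models ** k
        result[k] = cur[n] - math.log(n_configs)
        prev = cur
    return result
```

With a uniform prior over k, the value of k that maximizes the returned quantity plays the role of a maximum a posteriori estimate of the number of segments in this simplified setting. The double loop costs O(k n^2) segment evaluations, so the sketch is only practical for short sequences.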
Reconstructing the energy landscape of a distribution from Monte Carlo samples
Defining the energy function as the negative logarithm of the density, we
explore the energy landscape of a distribution via the tree of sublevel sets of
its energy. This tree represents the hierarchy among the connected components
of the sublevel sets. We propose ways to annotate the tree so that it provides
information on both topological and statistical aspects of the distribution,
such as the local energy minima (local modes), their local domains and volumes,
and the barriers between them. We develop a computational method to estimate
the tree and reconstruct the energy landscape from Monte Carlo samples
simulated at a wide energy range of a distribution. This method can be applied
to any arbitrary distribution on a space with defined connectedness. We test
the method on multimodal distributions and posterior distributions to show that
our estimated trees are accurate compared to theoretical values. When used to
perform Bayesian inference of DNA sequence segmentation, this approach reveals
much more information than the standard approach based on marginal posterior
distributions. Comment: Published at http://dx.doi.org/10.1214/08-AOAS196 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
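The idea of sublevel sets of the energy can be made concrete with a one-dimensional toy example. The sketch below does not reproduce the paper's estimator, which works from Monte Carlo samples. It assumes a known density evaluated on a regular grid (an equal-weight mixture of two unit-variance Gaussians centered at -2 and +2), and it treats neighboring grid points as connected. The names `normal_pdf`, `energy` and `sublevel_components` are hypothetical.

```python
import math

def normal_pdf(x, mean):
    """Density of a unit-variance Gaussian centered at `mean`."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def energy(x):
    """Energy = negative log density of an assumed two-mode mixture."""
    density = 0.5 * normal_pdf(x, -2.0) + 0.5 * normal_pdf(x, 2.0)
    return -math.log(density)

def sublevel_components(xs, level):
    """Return (start, end) intervals of grid points with energy <= level.

    Consecutive grid points are treated as connected, so each maximal run
    of points inside the sublevel set is one connected component.
    """
    components, current = [], []
    for x in xs:
        if energy(x) <= level:
            current.append(x)
        elif current:
            components.append((current[0], current[-1]))
            current = []
    if current:
        components.append((current[0], current[-1]))
    return components

# Regular grid on [-6, 6] with spacing 0.1
grid = [i / 10 for i in range(-60, 61)]
for level in (1.0, 2.0, 3.0):
    print(level, sublevel_components(grid, level))
```

As the level rises, the sublevel set is first empty, then splits into one component around each local energy minimum, and finally merges into a single component once the level exceeds the barrier between the two modes. Recording where components appear and merge is what yields the tree structure described in the abstract.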
Computational Characterization of 3′ Splice Variants in the GFAP Isoform Family
Glial fibrillary acidic protein (GFAP) is an intermediate filament (IF) protein specific to central nervous system (CNS) astrocytes. It has been the subject of intense interest due to its association with neurodegenerative diseases, and because of growing evidence that IF proteins not only modulate cellular structure, but also cellular function. Moreover, GFAP has a family of splicing isoforms apparently more complex than that of other CNS IF proteins, consistent with it possessing a range of functional and structural roles. The gene consists of 9 exons, and to date all isoforms associated with 3′ end splicing have been identified from modifications within intron 7, resulting in the generation of exon 7a (GFAPδ/ε) and 7b (GFAPκ). To better understand the nature and functional significance of variation in this region, we used a Bayesian multiple change-point approach to identify conserved regions. This is the first successful application of this method to a single gene – it has previously only been used in whole-genome analyses. We identified several highly or moderately conserved regions throughout the intron 7/7a/7b regions, including untranslated regions and regulatory features, consistent with the biology of GFAP. Several putative unconfirmed features were also identified, including a possible new isoform. We then integrated multiple computational analyses on both the DNA and protein sequences from the mouse, rat and human, showing that the major isoform, GFAPα, has highly conserved structure and features across the three species, whereas the minor isoforms GFAPδ/ε and GFAPκ have low conservation of structure and features at the distal 3′ end, both relative to each other and relative to GFAPα. The overall picture suggests distinct and tightly regulated functions for the 3′ end isoforms, consistent with complex astrocyte biology. The results illustrate a computational approach for characterising splicing isoform families, using both DNA and protein sequences
Probabilistic Modeling for Whole Metagenome Profiling
To address the shortcomings of existing Markov model implementations in handling large amounts of metagenomic data with comparable or better classification accuracy, we developed a new algorithm based on a pseudo-count supplemented standard Markov model (SMM), which leverages the power of higher order models to more robustly classify reads at different taxonomic levels. Assessment on simulated metagenomic datasets demonstrated that, overall, SMM was more accurate than the interpolated methods in classifying reads to their respective taxa at all ranks. Higher order SMMs (9th order or greater) also outperformed BLAST alignments in assigning taxonomic labels to metagenomic reads at different taxonomic ranks (genus and higher) on tests that masked the read-originating species (genome models) in the database. Similar results were obtained by masking at other taxonomic ranks in order to simulate the plausible scenarios of non-representation of the source of a read at different taxonomic levels in the genome database. The performance gap became more pronounced at higher taxonomic levels. To eliminate contamination in datasets and to further improve our alignment-free approach, we developed a new framework based on a genome segmentation and clustering algorithm. This framework allowed removal of adapter sequences and contaminant DNA, as well as generation of clusters of similar segments, which were then used to sample representative read fragments to constitute training datasets. The parameters of a logistic regression model were learnt from these training datasets using a Bayesian optimization procedure. This allowed us to establish thresholds for classifying metagenomic reads by SMM. This led to the development of a Python-based frontend that combines our SMM algorithm with the logistic regression optimization, named POSMM (Python Optimized Standard Markov Model). POSMM provides a much-needed alternative to metagenome profiling programs.
Our algorithm, which builds the genome models on the fly and thus obviates the need to build a database, complements alignment-based classification and can thus be used in concert with alignment-based classifiers to raise the bar in metagenome profiling.
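The general technique named in this abstract, a fixed-order Markov model with pseudo-counts used to score reads against candidate genomes, can be sketched as follows. This is not the POSMM implementation. It is a minimal illustration that assumes sequences contain only the letters A, C, G and T, uses an add-one pseudo-count by default, and omits the logistic regression thresholds described above. The function names `train_markov_model`, `score_read` and `classify` are hypothetical.

```python
import math
from collections import defaultdict
from itertools import product

ALPHABET = "ACGT"

def train_markov_model(genome, order, pseudocount=1.0):
    """Return {context: {base: log P(base | context)}} with pseudo-counts.

    Assumes `genome` contains only A, C, G and T.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(order, len(genome)):
        counts[genome[i - order:i]][genome[i]] += 1.0
    model = {}
    # enumerate every possible context so unseen ones still get a probability
    for context in map("".join, product(ALPHABET, repeat=order)):
        total = sum(counts[context].values()) + pseudocount * len(ALPHABET)
        model[context] = {
            base: math.log((counts[context][base] + pseudocount) / total)
            for base in ALPHABET
        }
    return model

def score_read(read, model, order):
    """Log-likelihood of the read given its first `order` bases."""
    return sum(model[read[i - order:i]][read[i]] for i in range(order, len(read)))

def classify(read, models, order):
    """Return the name of the genome model with the highest read score."""
    return max(models, key=lambda name: score_read(read, models[name], order))

# Toy genomes, far shorter than real ones; order 2 keeps the table small
models = {
    "at_rich": train_markov_model("AT" * 50, 2),
    "gc_rich": train_markov_model("GC" * 50, 2),
}
print(classify("ATATATAT", models, 2))
```

The pseudo-count keeps every transition probability strictly positive, so a read containing a context never seen in a genome is penalized instead of receiving a zero likelihood. The table has 4^order contexts, which is why the memory footprint grows quickly for the 9th order and higher models mentioned above.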
Statistical analysis of high-throughput sequencing count data
All of the work presented in this thesis grew out of collaborations with other researchers. For each chapter, I briefly summarize my contribution and acknowledge the contributions of others. Chapter 2 represents a conceptual framework for modeling read counts using various distributions. These ideas grew out of conversations with Ho-Ryun Chung at the Max Planck Institute for Molecular Genetics (MPIMG) in Berlin and Simon Anders at the European Molecular Biology Laboratories (EMBL) in Heidelberg. Chapter 3 was published in Statistical Applications in Genetics and Molecular Biology [1]. The idea for detecting copy number variants in exome-enriched sequencing data was proposed by Stefan Haas, and various methods were tested and evaluated with Alena van Bommel. My contribution was developing the hidden Markov model, implementing the software and testing the performance. I wish to acknowledge the X-linked intellectual disabilities project team at MPIMG including H.-Hilger Ropers, Vera Kalscheuer, Ruping Sun, Anne-Katrin Emde, Wei Chen, Hao Hu and Tomasz Zemojtel, who provided helpful discussions. Chapter 4 resulted from a 5-month visit to the group of Wolfgang Huber at EMBL in Heidelberg. Simon Anders proposed the idea of incorporating priors for dispersion and log fold change into the DESeq framework. My contribution was to implement these new statistical methods as a new package DESeq2, with closer integration with core Bioconductor packages. I would like to acknowledge all the members of the Huber group for helpful discussions. Chapter 5 resulted from a collaboration with the Transcriptional Regulation Group of Sebastiaan Meijsing at the MPIMG. I would like to thank Stephan Starick who initially proposed to investigate the interaction between glucocorticoid receptor and the chromatin landscape. My contribution was the statistical analysis presented in the chapter. Sebastiaan Meijsing provided valuable feedback during the evolution of the project.
I wish to acknowledge the contributions of Morgane Thomas-Chollier, Katja Borzym, Sam Cooper and Ho-Ryun Chung.
STED Nanoscopy to Illuminate New Avenues in Cancer Research – From Live Cell Staining and Direct Imaging to Decisive Preclinical Insights for Diagnosis and Therapy
Molecular imaging is established as an indispensable tool in various areas of cancer research, ranging from basic cancer biology and preclinical research to clinical trials and medical practice. In particular, the field of fluorescence imaging has experienced exceptional progress during the last three decades with the development of various in vivo technologies. Within this field, fluorescence microscopy is primarily of experimental use since it is especially qualified for addressing the fundamental questions of molecular oncology. As stimulated emission depletion (STED) nanoscopy combines the highest spatial and temporal resolutions with live specimen compatibility, it is best-suited for real-time investigations of the differences in the molecular machineries of malignant and normal cells to eventually translate the acquired knowledge into increased diagnostic and therapeutic efficacy.
This thesis presents the application of STED nanoscopy to two acute topics in cancer research of direct or indirect clinical interest. The first project has investigated the structure of telomeres, the ends of the linear eukaryotic chromosomes, in intact human cells at the nanoscale. To protect genome integrity, a telomere can mask the chromosome end by folding back and sequestering its single-stranded 3’-overhang in an upstream part of the double-stranded DNA repeat region. The formed t-loop structure has so far only been visualized by electron microscopy and fluorescence nanoscopy with cross-linked mammalian telomeric DNA after disruption of cell nuclei and spreading. For the first time, this work demonstrates the existence of t-loops within their endogenous nuclear environment in intact human cells. The identification of further telomere conformations has laid the groundwork for distinguishing cancerous cells that use different telomere maintenance mechanisms based on their individual telomere populations by a combined STED nanoscopy and deep learning approach. The population difference was essentially attributed to the promyelocytic leukemia (PML) protein that significantly perturbs the organization of a subpopulation of telomeres towards an open conformation in cancer cells that employ a telomerase-independent, alternative telomere lengthening mechanism. Elucidating the nanoscale topology of telomeres and associated proteins within the nucleus has provided new insight into telomere structure-function relationships relevant for understanding the deregulation of telomere maintenance in cancer cells.
After understanding the molecular foundations, this newly gained knowledge can be exploited to develop novel or refined diagnostic and treatment strategies. The second project has characterized the intracellular distribution of recently developed prostate cancer tracers. These novel prostate-specific membrane antigen (PSMA) inhibitors have revolutionized the treatment regimen of prostate cancer by enabling targeted imaging and therapy approaches. However, the exact internalization mechanism and the subcellular fate of these tracers have remained elusive. By combining STED nanoscopy with a newly developed non-standard live cell staining protocol, this work confirmed cell surface clustering of the targeted membrane antigen upon PSMA inhibitor binding, subsequent clathrin-dependent endocytosis and endosomal trafficking of the antigen-inhibitor complex. PSMA inhibitors accumulate in prostate cancer cells at clinically relevant time points, but strikingly and in contrast to the targeted antigen itself, they eventually distribute homogenously in the cytosol. This project has revealed the subcellular fate of PSMA/PSMA inhibitor complexes for the first time and provides crucial knowledge for the future application of these tracers including the development of new strategies in the field of prostate cancer diagnostics and therapeutics.
Relying on the photostability and biocompatibility of the applied fluorophores, the performance of live cell STED nanoscopy in the field of cancer research is boosted by the development of improved fluorophores. The third project in this thesis introduces a biocompatible, small molecule near-infrared dye suitable for live cell STED imaging. By the application of a halogen dance rearrangement, a dihalogenated fluorinatable pyridinyl rhodamine could be synthesized at high yield. The option of subsequent radiolabeling combined with excellent optical properties and a non-toxic profile renders this dye an appropriate candidate for medical and bioimaging applications. Providing an intrinsic and highly specific mitochondrial targeting ability, the radiolabeled analogue is suggested as a vehicle for multimodal (positron emission tomography and optical imaging) medical imaging of mitochondria for cancer diagnosis and therapeutic approaches in patients and biopsy tissue.
The absence of cytotoxicity is not only a crucial prerequisite for clinically used fluorophores. To guarantee the generation of meaningful data mirroring biological reality, the absence of cytotoxicity is likewise a decisive property of dyes applied in live cell STED nanoscopy. The fourth project in this thesis proposes a universal approach for cytotoxicity testing based on characterizing the influence of the compound of interest on the proliferation behavior of human cell lines using digital holographic cytometry. By applying this approach to recently developed live cell STED compatible dyes, pronounced cytotoxic effects could be excluded. Looking more closely, some of the tested dyes slightly altered cell proliferation, so this project provides guidance on the right choice of dye for the least invasive live cell STED experiments.
Ultimately, live cell STED data should be exploited to extract as much biological information as possible. However, some information might be partially hidden by image degradation due to the dynamics of living samples and the deliberate choice of rather conservative imaging parameters in order to preserve sample viability. The fifth project in this thesis presents a novel image restoration method in a Bayesian framework that simultaneously performs deconvolution, denoising and super-resolution to restore images suffering from noise with mixed Poisson-Gaussian statistics. Established deconvolution or denoising methods that consider only one type of noise generally do not perform well on images degraded significantly by mixed noise. The newly introduced method was validated with live cell STED telomere data, showing that the method can compete with state-of-the-art approaches.
Taken together, this thesis demonstrates the value of an integrated approach for STED nanoscopy imaging studies. A coordinated workflow including sample preparation, image acquisition and data analysis provided a reliable platform for deriving meaningful conclusions for current questions in the field of cancer research. Moreover, this thesis emphasizes the strength of iteratively adapting the individual components in the operational chain, and it particularly points towards those components that, if further improved, optimize the significance of the final results, rendering live cell STED nanoscopy even more powerful.
Segmenting eukaryotic genomes with the generalized Gibbs sampler
Eukaryotic genomes display segmental patterns of variation in various properties, including GC content and degree of evolutionary conservation. DNA segmentation algorithms are aimed at identifying statistically significant boundaries between such segments. Such algorithms may provide a means of discovering new classes of functional elements in eukaryotic genomes. This paper presents a model and an algorithm for Bayesian DNA segmentation and considers the feasibility of using it to segment whole eukaryotic genomes. The algorithm is tested on a range of simulated and real DNA sequences, and the following conclusions are drawn. Firstly, the algorithm correctly identifies non-segmented sequence, and can thus be used to reject the null hypothesis of uniformity in the property of interest. Secondly, estimates of the number and locations of change-points produced by the algorithm are robust to variations in algorithm parameters and initial starting conditions and correspond to real features in the data. Thirdly, the algorithm is successfully used to segment human chromosome 1 according to GC content, thus demonstrating the feasibility of Bayesian segmentation of eukaryotic genomes. The software described in this paper is available from the author's website (www.uq.edu.au/~uqjkeith/) or upon request to the author.
Ultrasensitive detection of Toxocara canis excretory-secretory antigens by a nanobody electrochemical magnetosensor assay
Peer reviewed.
Human Toxocariasis (HT) is a zoonotic disease caused by the migration
of the larval stage of the roundworm Toxocara canis in the human host.
Despite being the most cosmopolitan helminthiasis worldwide, its
diagnosis is elusive. Currently, the detection of specific immunoglobulins
IgG against the Toxocara Excretory-Secretory Antigens (TES), combined
with clinical and epidemiological criteria is the only strategy to diagnose
HT. Cross-reactivity with other parasites and the inability to distinguish
between past and active infections are the main limitations of this
approach. Here, we present a sensitive and specific novel strategy to
detect and quantify TES, aiming to identify active cases of HT. High
specificity is achieved by making use of nanobodies (Nbs), recombinant
single variable domain antibodies obtained from camelids, that due to
their small molecular size (15kDa) can recognize hidden epitopes not
accessible to conventional antibodies. High sensitivity is attained by the
design of an electrochemical magnetosensor with an amperometric readout
with all components of the assay mixed in one single step. Through
this strategy, 10-fold higher sensitivity than a conventional sandwich
ELISA was achieved. The assay reached limits of detection of 2 and 15
pg/ml in PBST20 0.05% and serum spiked with TES, respectively. These
limits of detection are sufficient to detect clinically relevant toxocaral
infections. Furthermore, our nanobodies showed no cross-reactivity
with antigens from Ascaris lumbricoides or Ascaris suum. This is, to our
knowledge, the most sensitive method to detect and quantify TES so far,
and has great potential to significantly improve diagnosis of HT. Moreover,
the characteristics of our electrochemical assay are promising for the
development of point of care diagnostic systems using nanobodies as a
versatile and innovative alternative to antibodies. The next step will be the
validation of the assay in clinical and epidemiological contexts.