17 research outputs found

    PIntron: a Fast Method for Gene Structure Prediction via Maximal Pairings of a Pattern and a Text

    Current computational methods for exon-intron structure prediction from a cluster of transcript (EST, mRNA) data do not exhibit the time and space efficiency necessary to process large clusters of more than 20,000 ESTs and genes longer than 1Mb. Guaranteeing both accuracy and efficiency seems to be a computational goal that is far from achieved, since accuracy is strictly related to exploiting the inherent redundancy of information present in a large cluster. We propose a fast method for the problem that combines two ideas: a novel algorithm of provably small time complexity for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those that allow the inference of splice site junctions highly confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings, which are sequences obtained from paths of a graph structure, called the Embedding Graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the size of P, of T and of the output. PIntron, the software tool implementing our methodology, is able to process in a few seconds some critical genes that are not manageable by other gene structure prediction tools. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when compared with ENCODE data. Detailed experimental data, additional results and PIntron software are available at http://www.algolab.eu/PIntron
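    As a rough illustration of the "maximal pairings" mentioned in the abstract above, the sketch below enumerates common substrings of an EST P and a genomic sequence T that cannot be extended to the left or to the right. This is a naive quadratic-time scan written only for clarity; it is not the linear-time procedure the abstract describes, and the function name `maximal_pairings` and the `min_len` parameter are assumptions introduced for this example.

```python
def maximal_pairings(pattern, text, min_len=1):
    """Return (p_start, t_start, length) triples for common substrings of
    pattern and text that cannot be extended in either direction.

    Naive illustrative scan, not the linear-time algorithm of the paper.
    """
    pairings = []
    m, n = len(pattern), len(text)
    for p in range(m):
        for t in range(n):
            if pattern[p] != text[t]:
                continue
            # Skip matches that could be extended one character to the left.
            if p > 0 and t > 0 and pattern[p - 1] == text[t - 1]:
                continue
            # Extend to the right as far as the characters keep matching.
            length = 0
            while (p + length < m and t + length < n
                   and pattern[p + length] == text[t + length]):
                length += 1
            if length >= min_len:
                pairings.append((p, t, length))
    return pairings


# Toy example: an "EST" made of two blocks separated by "AA" in the "genome".
print(maximal_pairings("ACGTTT", "ACGAATTT", min_len=3))
```

    In this toy input the two reported pairings correspond to the two blocks of the pattern that flank the extra "AA" in the text, which is the kind of evidence a spliced alignment would chain together.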

    ASPIC: a novel method to predict the exon-intron structure of a gene that is optimally compatible to a set of transcript sequences

    BACKGROUND: Currently available methods to predict splice sites are mainly based on the independent and progressive alignment of transcript data (mostly ESTs) to the genomic sequence. Apart from often being computationally expensive, this approach is vulnerable to several problems, hence the need to develop novel strategies. RESULTS: We propose a method, based on a novel multiple genome-EST alignment algorithm, for the detection of splice sites. To avoid the limitations of splice site prediction (mainly, over-predictions) due to independent single EST alignments to the genomic sequence, our approach performs a multiple alignment of transcript data to the genomic sequence based on the combined analysis of all available data. We recast the problem of predicting constitutive and alternative splicing as an optimization problem, where the optimal multiple transcript alignment minimizes the number of exons and hence of splice site observations. We have implemented a splice site predictor based on this algorithm in the software tool ASPIC (Alternative Splicing PredICtion). It is distinguished from other methods based on BLAST-like tools by the incorporation of entirely new ad hoc procedures for accurate and computationally efficient transcript alignment, and it adopts dynamic programming for the refinement of intron boundaries. ASPIC also provides the minimal set of non-mergeable transcript isoforms compatible with the detected splicing events. The ASPIC web resource is dynamically interconnected with the Ensembl and Unigene databases and also implements an upload facility. CONCLUSION: Extensive benchmarking shows that ASPIC outperforms other existing methods in the detection of novel splicing isoforms and in the minimization of over-predictions. ASPIC also requires a lower computation time for processing a single gene and an EST cluster. The ASPIC web resource is available at

    Insights into protein-RNA complexes from computational analyses of iCLIP experiments

    RNA-binding proteins (RBPs) are the primary regulators of all aspects of posttranscriptional gene regulation. In order to understand how RBPs perform their function, it is important to identify their binding sites. Recently, new techniques have been developed to employ high-throughput sequencing to study protein-RNA interactions in vivo, including the individual-nucleotide resolution UV crosslinking and immunoprecipitation (iCLIP). iCLIP identifies sites of protein-RNA crosslinking with nucleotide resolution in a transcriptome-wide manner. It is composed of over 60 steps, which can be modified, but it is not clear how variations in the method affect the assignment of RNA binding sites. This is even more pertinent given that several variants of iCLIP have been developed. A central question of my research is how to correctly assign binding sites to RBPs using the data produced by iCLIP and similar techniques. I first focused on the technical analyses and solutions for the iCLIP method. I examined cDNA deletions and crosslink-associated motifs to show that the starts of cDNAs are appropriate to assign the crosslink sites in all variants of CLIP, including iCLIP, eCLIP and irCLIP. I also showed that the non-coinciding cDNA-starts are caused by technical conditions in the iCLIP protocol that may lead to sequence constraints at cDNA-ends in the final cDNA library. I also demonstrated the importance of fully optimizing the RNase and purification conditions in iCLIP to avoid these cDNA-end constraints. Next, I developed CLIPo, a computational framework that assesses various features of iCLIP data to provide quality control standards which reveals how technical variations between experiments affect the specificity of assigned binding sites. I used CLIPo to compare multiple PTBP1 experiments produced by iCLIP, eCLIP and irCLIP, to reveal major effects of sequence constraints at cDNA-ends or starts, cDNA length distribution and non-specific contaminants. Moreover, I assessed how the variations between these methods influence the mechanistic conclusions. Thus, CLIPo presents the quality control standards for transcriptome-wide assignment of protein-RNA binding sites. I continued with analyses of RBP complexes by using data from spliceosome-iCLIP. This method simultaneously detects crosslink sites of small nuclear ribonucleoproteins (snRNPs) and auxiliary splicing factors on pre-mRNAs. I demonstrated that the high resolution of spliceosome-iCLIP allows for distinction between multiple proximal RNA binding sites, which can be valuable for transcriptome-wide studies of large ribonucleoprotein complexes. Moreover, I showed that spliceosome-iCLIP can experimentally identify over 50,000 human branch points. In summary, I detected technical biases from iCLIP data, and demonstrated how such biases can be avoided, so that cDNA-starts appropriately assign the RNA binding sites. CLIPo analysis proved a useful quality control tool that evaluates data specificity across different methods, and I applied it to iCLIP, irCLIP and ENCODE eCLIP datasets. I presented how spliceosome-iCLIP data can be used to study the splicing machinery on pre-mRNAs and how to use constrained cDNAs from spliceosome-iCLIP data to identify branch points on a genome-wide scale. Taken together, these studies provide new insights into the field of RNA biology and can be used for future studies of iCLIP and related methods
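    To make the idea of assigning crosslink sites from cDNA starts concrete, here is a minimal sketch that counts, per position and strand, how many aligned cDNAs begin there. It assumes reads are given as 0-based, half-open intervals (BED-like tuples) and that the crosslink site is taken as the nucleotide immediately upstream of the cDNA start, which is a commonly used convention but is not stated in the abstract. The function name `crosslink_counts` and the `offset` parameter are hypothetical and are not part of CLIPo or any published tool.

```python
from collections import Counter


def crosslink_counts(reads, offset=1):
    """Count assumed crosslink sites derived from cDNA start positions.

    reads: iterable of (chrom, start, end, strand), 0-based half-open.
    offset: how many nucleotides upstream of the cDNA start the crosslink
            site is placed (assumption for this sketch; default 1).
    """
    counts = Counter()
    for chrom, start, end, strand in reads:
        if strand == "+":
            # cDNA starts at `start`; upstream means a smaller coordinate.
            site = start - offset
        else:
            # On the minus strand the cDNA starts at `end - 1`;
            # upstream means a larger coordinate.
            site = end - 1 + offset
        counts[(chrom, site, strand)] += 1
    return counts


reads = [
    ("chr1", 100, 140, "+"),
    ("chr1", 100, 130, "+"),
    ("chr1", 50, 90, "-"),
]
for key, n in sorted(crosslink_counts(reads).items()):
    print(key, n)
```

    Reads of different lengths that share a start collapse onto the same site, which is why sequence constraints at cDNA ends or starts, as discussed in the abstract, would shift the resulting counts.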

    Integrative Transcriptomic Analysis of Long Intergenic Non-Coding RNAs in Cancer.

    Ph.D. Thesis. University of Hawaiʻi at Mānoa 2017

    Proceedings of the 7th Sound and Music Computing Conference

    Proceedings of the SMC2010 - 7th Sound and Music Computing Conference, July 21st - July 24th 2010

    Evolutionary genomics : statistical and computational methods

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward