    An expectation-maximization algorithm for probabilistic reconstructions of full-length isoforms from splice graphs.

    Reconstructing full-length transcript isoforms from sequence fragments (such as ESTs) is a major interest and challenge for bioinformatic analysis of pre-mRNA alternative splicing. This problem has been formulated as finding traversals across the splice graph, which is a directed acyclic graph (DAG) representation of gene structure and alternative splicing. In this manuscript we introduce a probabilistic formulation of the isoform reconstruction problem, and provide an expectation-maximization (EM) algorithm for its maximum likelihood solution. Using a series of simulated data sets and expressed sequences from real human genes, we demonstrate that our EM algorithm can correctly handle various situations of fragmentation and coupling in the input data. Our work establishes a general probabilistic framework for splice graph-based reconstructions of full-length isoforms.
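The EM idea in this abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes the candidate isoforms (splice-graph traversals) are already enumerated, that each fragment is compatible with a known, non-empty set of isoforms, and that fragment length and position effects can be ignored. The function name `em_isoform_proportions` and the toy input are hypothetical.

```python
def em_isoform_proportions(compat, n_isoforms, n_iter=200, tol=1e-10):
    """Estimate isoform proportions by EM (illustrative sketch only).

    compat: list of non-empty sets; compat[f] holds the indices of the
            candidate isoforms that fragment f is compatible with.
    """
    # Start from a uniform guess over the candidate isoforms.
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each fragment among its compatible isoforms
        # in proportion to the current estimates.
        for iso_set in compat:
            total = sum(theta[i] for i in iso_set)
            for i in iso_set:
                counts[i] += theta[i] / total
        # M-step: the new proportions are the normalized expected counts.
        new_theta = [c / len(compat) for c in counts]
        delta = max(abs(a - b) for a, b in zip(theta, new_theta))
        theta = new_theta
        if delta < tol:
            break
    return theta


# Toy input (assumed): two isoforms, four fragments, one of them ambiguous.
fragments = [{0}, {0}, {0, 1}, {1}]
theta = em_isoform_proportions(fragments, 2)
print(theta)  # converges toward roughly [0.667, 0.333]
```

The ambiguous fragment is assigned fractionally, so the estimate settles where the proportions agree with their own expected counts.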

    Needed for completion of the human genome: hypothesis driven experiments and biologically realistic mathematical models

    With the sponsorship of "Fundacio La Caixa" we met in Barcelona, November 21st and 22nd, to analyze the reasons why, after the completion of the human genome sequence, the identification of all protein-coding genes and their variants remains a distant goal. Here we report on our discussions and summarize some of the major challenges that need to be overcome in order to complete the human gene catalog. Comment: Report and discussion resulting from the "Fundacio La Caixa" gene finding meeting held November 21 and 22, 2003, in Barcelona.

    PIntron: a Fast Method for Gene Structure Prediction via Maximal Pairings of a Pattern and a Text

    Current computational methods for exon-intron structure prediction from a cluster of transcript (EST, mRNA) data do not exhibit the time and space efficiency necessary to process large clusters of more than 20,000 ESTs and genes longer than 1Mb. Guaranteeing both accuracy and efficiency seems to be a computational goal that is still far from being achieved, since accuracy is strictly related to exploiting the inherent redundancy of information present in a large cluster. We propose a fast method for the problem that combines two ideas: a novel algorithm of proved small time complexity for computing spliced alignments of a transcript against a genome, and an efficient algorithm that exploits the inherent redundancy of information in a cluster of transcripts to select, among all possible factorizations of EST sequences, those that allow the inference of splice site junctions highly confirmed by the input data. The EST alignment procedure is based on the construction of maximal embeddings, which are sequences obtained from paths of a graph structure, called the Embedding Graph, whose vertices are the maximal pairings of a genomic sequence T and an EST P. The procedure runs in time linear in the size of P, T and of the output. PIntron, the software tool implementing our methodology, is able to process in a few seconds some critical genes that are not manageable by other gene structure prediction tools. At the same time, PIntron exhibits high accuracy (sensitivity and specificity) when compared with ENCODE data. Detailed experimental data, additional results and PIntron software are available at http://www.algolab.eu/PIntron

    A convex pseudo-likelihood framework for high dimensional partial correlation estimation with convergence guarantees

    Sparse high dimensional graphical model selection is a topic of much interest in modern day statistics. A popular approach is to apply l1-penalties to either (1) parametric likelihoods or (2) regularized regression/pseudo-likelihoods, with the latter having the distinct advantage that they do not explicitly assume Gaussianity. As none of the popular methods proposed for solving pseudo-likelihood based objective functions have provable convergence guarantees, it is not clear if corresponding estimators exist or are even computable, or if they actually yield correct partial correlation graphs. This paper proposes a new pseudo-likelihood based graphical model selection method that aims to overcome some of the shortcomings of current methods, but at the same time retain all their respective strengths. In particular, we introduce a novel framework that leads to a convex formulation of the partial covariance regression graph problem, resulting in an objective function composed of quadratic forms. The objective is then optimized via a coordinate-wise approach. The specific functional form of the objective function facilitates rigorous convergence analysis leading to convergence guarantees, an important property that cannot be established using standard results when the dimension is larger than the sample size, as is often the case in high dimensional applications. These convergence guarantees ensure that estimators are well-defined under very general conditions, and are always computable. In addition, the approach yields estimators that have good large sample properties and also respect symmetry. Furthermore, application to simulated/real data, timing comparisons and numerical convergence are demonstrated. We also present a novel unifying framework that places all graphical pseudo-likelihood methods as special cases of a more general formulation, leading to important insights.
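As a rough illustration of coordinate-wise optimization of an l1-penalized quadratic objective, the sketch below minimizes 0.5 * x^T A x - b^T x + lam * ||x||_1 for a symmetric positive definite matrix A. This generic objective stands in for the paper's pseudo-likelihood and is an assumption, not the method proposed in the abstract; the names `soft_threshold` and `coordinate_descent` are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def coordinate_descent(A, b, lam, n_sweeps=100):
    """Minimize 0.5*x'Ax - b'x + lam*||x||_1 one coordinate at a time.

    A is assumed symmetric with a strictly positive diagonal.
    """
    n = len(b)
    x = [0.0] * n
    for _ in range(n_sweeps):
        for j in range(n):
            # Linear term seen by coordinate j with all others held fixed.
            r = b[j] - sum(A[j][k] * x[k] for k in range(n) if k != j)
            # Exact one-dimensional minimizer: soft-threshold, then rescale.
            x[j] = soft_threshold(r, lam) / A[j][j]
    return x


# Toy problem (assumed values): the penalty zeroes out the second coordinate.
A = [[2.0, 1.0], [1.0, 2.0]]
b = [3.0, 0.5]
print(coordinate_descent(A, b, lam=1.0))  # [1.0, 0.0]
```

Each coordinate update has a closed form because the objective restricted to one coordinate is a quadratic plus an absolute value, which is what makes the coordinate-wise approach cheap per step.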

    Computational models for inferring biochemical networks

    Biochemical networks are of great practical importance. Understanding of the interactions of biological compounds in cells has been advanced by numerous bioinformatics projects, which have contributed a vast amount of biological information. The construction of biochemical systems (systems of chemical reactions), which include both topology and kinetic constants of the chemical reactions, is NP-hard and is a well-studied systems biology problem. In this paper, we propose a hybrid architecture, which combines genetic programming and simulated annealing in order to generate and optimize both the topology (the network) and the reaction rates of a biochemical system. Simulations and analysis of an artificial model and three real models (two models and the noisy version of one of them) show promising results for the proposed method. The Romanian National Authority for Scientific Research, CNDI–UEFISCDI, Project No. PN-II-PT-PCCA-2011-3.2-0917

    ASPIC: a novel method to predict the exon-intron structure of a gene that is optimally compatible to a set of transcript sequences

    BACKGROUND: Currently available methods to predict splice sites are mainly based on the independent and progressive alignment of transcript data (mostly ESTs) to the genomic sequence. Apart from often being computationally expensive, this approach is vulnerable to several problems, hence the need to develop novel strategies. RESULTS: We propose a method, based on a novel multiple genome-EST alignment algorithm, for the detection of splice sites. To avoid limitations of splice site prediction (mainly, over-predictions) due to independent single EST alignments to the genomic sequence, our approach performs a multiple alignment of transcript data to the genomic sequence based on the combined analysis of all available data. We recast the problem of predicting constitutive and alternative splicing as an optimization problem, where the optimal multiple transcript alignment minimizes the number of exons and hence of splice site observations. We have implemented a splice site predictor based on this algorithm in the software tool ASPIC (Alternative Splicing PredICtion). It is distinguished from other methods based on BLAST-like tools by the incorporation of entirely new ad hoc procedures for accurate and computationally efficient transcript alignment, and it adopts dynamic programming for the refinement of intron boundaries. ASPIC also provides the minimal set of non-mergeable transcript isoforms compatible with the detected splicing events. The ASPIC web resource is dynamically interconnected with the Ensembl and Unigene databases and also implements an upload facility. CONCLUSION: Extensive benchmarking shows that ASPIC outperforms other existing methods in the detection of novel splicing isoforms and in the minimization of over-predictions. ASPIC also requires a lower computation time for processing a single gene and an EST cluster. The ASPIC web resource is available at

    Current advances in systems and integrative biology

    Systems biology has gained a tremendous amount of interest in the last few years. This is partly due to the realization that traditional approaches focusing only on a few molecules at a time cannot describe the impact of aberrant or modulated molecular environments across a whole system. Furthermore, a hypothesis-driven study aims to prove or disprove its postulations, whereas a hypothesis-free systems approach can yield an unbiased and novel testable hypothesis as an end-result. This latter approach forgoes assumptions which predict how a biological system should react to an altered microenvironment within a cellular context, across a tissue or impacting on distant organs. Additionally, re-use of existing data by systematic data mining and re-stratification, one of the cornerstones of integrative systems biology, is also gaining attention. While tremendous efforts using a systems methodology have already yielded excellent results, it is apparent that a lack of suitable analytic tools and purpose-built databases poses a major bottleneck in applying a systematic workflow. This review addresses the current approaches used in systems analysis and obstacles often encountered in large-scale data analysis and integration which tend to go unnoticed, but have a direct impact on the final outcome of a systems approach. Its wide applicability, ranging from basic research, disease descriptors, pharmacological studies, to personalized medicine, makes this emerging approach well suited to address biological and medical questions where conventional methods are not ideal.

    A bioinformatic analysis identifies circadian expression of splicing factors and time-dependent alternative splicing events in the HD-MY-Z cell line

    The circadian clock regulates key cellular processes and its dysregulation is associated with several pathologies including cancer. Although the transcriptional regulation of gene expression by the clock machinery is well described, the role of the clock in the regulation of post-transcriptional processes, including splicing, remains poorly understood. In the present work, we investigated the putative interplay between the circadian clock and splicing in a cancer context. For this, we applied a computational pipeline to identify oscillating genes and alternatively spliced transcripts in time-course high-throughput data sets from normal cells and tissues, and cancer cell lines. We investigated the temporal phenotype of clock-controlled genes and splicing factors, and evaluated their impact on alternative splice patterns in the Hodgkin Lymphoma cell line HD-MY-Z. Our data point to a connection between clock-controlled genes and splicing factors, which correlates with temporal alternative splicing in several genes in the HD-MY-Z cell line. These include the genes DPYD, SS18, VIPR1 and IRF4, involved in metabolism, cell cycle, apoptosis and proliferation. Our results highlight a role for the clock as a temporal regulator of alternative splicing, which may impact malignancy in this cellular model.