
    Weighted Minimum-Length Rearrangement Scenarios

    Get PDF
    We present the first known model of genome rearrangement with an arbitrary real-valued weight function on the rearrangements. It is based on the dominant model for the mathematical and algorithmic study of genome rearrangement, Double Cut and Join (DCJ). Our objective function is the sum or product of the weights of the DCJs in an evolutionary scenario, and the function can be minimized or maximized. If the likelihood of observing an independent DCJ were estimated based on biological conditions, for example, then this objective function could be the likelihood of observing the independent DCJs together in a scenario. We present an O(n^4)-time dynamic programming algorithm solving the Minimum Cost Parsimonious Scenario (MCPS) problem for co-tailed genomes with n genes (or syntenic blocks). Combining this with our previous work on MCPS yields a polynomial-time algorithm for general genomes. The key theoretical contribution is a novel link between parsimonious DCJ (or 2-break) scenarios and quadrangulations of a regular polygon. To demonstrate that our algorithm is fast enough to handle biological data, we run it on syntenic blocks constructed for Human paired with Chimpanzee, Gibbon, Mouse, and Chicken. We argue that the Human and Gibbon pair is a particularly interesting model for the study of weighted genome rearrangements.
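    As a rough illustration of the objective function described above (not the authors' algorithm), a scenario's additive cost is the sum of its DCJ weights, and a multiplicative objective such as the likelihood of independent DCJs can be maximized equivalently through the sum of log-weights. The weights below are made up; a minimal Python sketch:

```python
import math

# Hypothetical per-operation weights for a DCJ scenario; in practice these
# would come from a biologically estimated likelihood of each rearrangement.
scenario_weights = [0.8, 0.5, 0.9, 0.7]

# Additive objective: total cost of the scenario is the sum of DCJ weights.
additive_cost = sum(scenario_weights)

# Multiplicative objective: e.g. the likelihood of observing the DCJs
# independently. Maximizing the product is equivalent to maximizing the
# sum of log-weights, which keeps the objective additive.
likelihood = math.prod(scenario_weights)
log_likelihood = sum(math.log(w) for w in scenario_weights)

print(f"sum of weights     = {additive_cost:.3f}")
print(f"product of weights = {likelihood:.4f}")
print(f"sum of log-weights = {log_likelihood:.4f}")
```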

    Breaking Good: Accounting For Fragility Of Genomic Regions In Rearrangement Distance Estimation

    Get PDF
    Models of evolution by genome rearrangements are prone to two types of flaws: One is to ignore the diversity of susceptibility to breakage across genomic regions, and the other is to suppose that susceptibility values are given. Without necessarily supposing their precise localization, we call "solid" the regions that are improbably broken by rearrangements and "fragile" the regions outside solid ones. We propose a model of evolution by inversions where breakage probabilities vary across fragile regions and over time. It contains as a particular case the uniform breakage model on the nucleotidic sequence, where breakage probabilities are proportional to fragile region lengths. This is very different from the frequently used pseudouniform model where all fragile regions have the same probability to break. Estimations of rearrangement distances based on the pseudouniform model completely fail on simulations with the truly uniform model. On pairs of amniote genomes, we show that identifying coding genes with solid regions yields incoherent distance estimations, especially with the pseudouniform model, and to a lesser extent with the truly uniform model. This incoherence is solved when we coestimate the number of fragile regions with the rearrangement distance. The estimated number of fragile regions is surprisingly small, suggesting that a minority of regions are recurrently used by rearrangements. Estimations for several pairs of genomes at different divergence times are in agreement with a slowly evolvable colocalization of active genomic regions in the cell. Funding: FAPESP [2013/25084-2]; French Agence Nationale de la Recherche (ANR) [ANR-10-BINF-01-01]; European FP7 ICT programme EVOEVO.
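    The contrast between the truly uniform and pseudouniform breakage models can be illustrated with a small sketch (our own illustration with made-up region lengths, not the authors' estimation procedure): under the truly uniform model a fragile region breaks with probability proportional to its length, while under the pseudouniform model every fragile region is equally likely to break.

```python
import random

# Hypothetical fragile-region lengths (in bp); values are made up for illustration.
fragile_lengths = [1200, 300, 300, 5000, 150, 50]

def draw_breakpoint_truly_uniform(lengths, rng):
    """Truly uniform model: a region breaks with probability proportional
    to its length, as under uniform breakage along the nucleotide sequence."""
    return rng.choices(range(len(lengths)), weights=lengths, k=1)[0]

def draw_breakpoint_pseudouniform(lengths, rng):
    """Pseudouniform model: every fragile region is equally likely to break,
    regardless of its length."""
    return rng.randrange(len(lengths))

rng = random.Random(0)
n = 100_000
truly = [0] * len(fragile_lengths)
pseudo = [0] * len(fragile_lengths)
for _ in range(n):
    truly[draw_breakpoint_truly_uniform(fragile_lengths, rng)] += 1
    pseudo[draw_breakpoint_pseudouniform(fragile_lengths, rng)] += 1

for i, length in enumerate(fragile_lengths):
    print(f"region {i} ({length:>5} bp): truly uniform {truly[i]/n:.3f}, "
          f"pseudouniform {pseudo[i]/n:.3f}")
```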

    Efficient Algorithms for Prokaryotic Whole Genome Assembly and Finishing

    Get PDF
    De-novo genome assembly from DNA fragments is primarily based on sequence overlap information. In addition, mate-pair or paired-end reads provide linking information for joining gaps and bridging repeat regions. Genome assemblers in general assemble long contiguous sequences (contigs) using both overlapping reads and linked reads until the assembly runs into an ambiguous repeat region. These contigs are further bridged into scaffolds using linked read information. However, errors can be made in both phases of assembly due to a high error threshold for overlap acceptance and to linking based on too few mate reads. Identical as well as similar repeat regions can often cause errors in overlap and mate-pair evidence. In addition, the problem of setting the correct threshold to minimize errors and optimize assembly of reads is not trivial and often requires a time-consuming trial-and-error process to obtain optimal results. This typical trial-and-error process with multiple assemblers is computationally intensive and inefficient, especially when users must learn how to use a wide variety of assemblers, many of which run serially, require long execution times, and may not return usable or accurate results. Further, we show that the comparison of assembly results may not provide users with a clear winner under all circumstances. Therefore, we propose a novel scaffolding tool, Correlative Algorithm for Repeat Placement (CARP), capable of joining short, low-error contigs using mate-pair reads, computationally resolved repeat structures, and synteny with one or more reference organisms. The CARP tool requires a set of repeat sequences, such as insertion sequences (IS), that can be found computationally without assembling the genome. Development of methods to identify such repeat regions directly from raw sequence reads or draft genomes led to the development of the ISQuest software package. ISQuest identifies bacterial ISs and their sequence elements—inverted and direct repeats—in raw read data or contigs using flexible search parameters. ISQuest is capable of finding ISs in hundreds of partially assembled genomes within hours, making it a valuable high-throughput tool for a global search of IS and repeat elements. The CARP tool matches very low-error contigs with strong overlap using the ambiguous partial repeat sequences at the contig ends, annotated using the repeat sequences discovered by ISQuest. These matches are verified by synteny with the genomes of one or more reference organisms. We show that the CARP tool can be used to verify regions with low mate-pair evidence, independently find new joins, and significantly reduce the number of scaffolds. Finally, we demonstrate a novel viewer that presents to the user the computationally derived joins along with the evidence used to make them. The viewer allows the user to independently assess their confidence in the joins made by the finishing tools and make an informed decision about whether to invest the resources necessary to confirm a particular portion of the assembly. Further, we allow users to manually record join evidence, re-order contigs, and track the assembly finishing process.
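    The joining idea behind this kind of scaffolding can be sketched as follows (a hypothetical illustration, not the CARP implementation): two contigs are proposed as a join when the annotated repeat at the end of one matches the repeat at the start of the other and enough mate pairs link them.

```python
from collections import namedtuple

# A minimal, hypothetical representation of assembly contigs: each contig end is
# annotated with the repeat (e.g. an IS element) it terminates in, if any.
Contig = namedtuple("Contig", ["name", "left_repeat", "right_repeat"])

contigs = [
    Contig("ctg1", None, "IS256"),
    Contig("ctg2", "IS256", None),
    Contig("ctg3", None, "IS3"),
]

# Hypothetical mate-pair link counts between contig ends.
mate_support = {("ctg1", "ctg2"): 7, ("ctg1", "ctg3"): 1}

MIN_MATES = 3  # minimum mate-pair evidence required to accept a join

def propose_joins(contigs, mate_support, min_mates):
    """Propose joins where the right end of one contig and the left end of
    another carry the same repeat annotation and mate pairs support the join."""
    joins = []
    for a in contigs:
        for b in contigs:
            if a is b or a.right_repeat is None:
                continue
            if a.right_repeat == b.left_repeat:
                support = mate_support.get((a.name, b.name), 0)
                if support >= min_mates:
                    joins.append((a.name, b.name, a.right_repeat, support))
    return joins

for a, b, repeat, support in propose_joins(contigs, mate_support, MIN_MATES):
    print(f"join {a} -> {b} via {repeat} ({support} mate pairs)")
```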

    An approach to improved microbial eukaryotic genome annotation

    Full text link
    New sequencing technologies have considerably accelerated the rate at which genomic data is being generated. One ongoing challenge is the accurate structural annotation of those novel genomes once sequenced and assembled, in particular if the organism does not have close relatives with well-annotated genomes. Whole-transcriptome sequencing (RNA-Seq) and assembly—both of which share similarities to whole-genome sequencing and assembly, respectively—have been shown to dramatically increase the accuracy of gene annotation. Read coverage, inferred splice junctions and assembled transcripts can provide valuable information about gene structure. Several annotation pipelines have been developed to automate structural annotation by incorporating information from RNA-Seq, as well as protein sequence similarity data, with the goal of reaching the accuracy of an expert curator. Annotation pipelines follow a similar workflow. The first step is to identify repetitive regions to prevent misinformed sequence alignments and gene predictions.
The next step is to construct a database of evidence from experimental data such as RNA-Seq mapping and assembly, and protein sequence alignments, which are used to inform the generalised Hidden Markov Models of gene prediction software. The final step is to consolidate sequence alignments and gene predictions into a high-confidence consensus set. Thus, automated pipelines are complex, and therefore susceptible to incomplete and erroneous use of information, which can poison gene predictions and consensus model building. Here, we present an improved approach to microbial eukaryotic genome annotation. Its conception was based on identifying and mitigating potential sources of error and bias that are present in available pipelines. Our approach has two main aspects. The first is to create a more complete and diverse set of extrinsic evidence to better inform gene predictions. The second is to use extrinsic evidence in tandem with predictions such that the influence of their respective biases in the consensus gene models is reduced. We benchmarked our new tool against three known pipelines, showing significant gains in gene, transcript, exon and intron sensitivity and specificity in the genome annotation of microbial eukaryotes
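    The generic three-stage workflow described above can be sketched as a skeleton (placeholders only; the stage functions stand in for real repeat maskers, aligners and gene predictors and are not part of any specific pipeline named here):

```python
# A minimal sketch of the generic three-stage annotation workflow described
# above. The stage functions are placeholders, not real tools.

def mask_repeats(genome: str) -> str:
    """Stage 1: identify and mask repetitive regions (no-op placeholder)."""
    return genome  # a real pipeline would soft-mask repeats before alignment

def collect_evidence(masked_genome: str) -> dict:
    """Stage 2: gather extrinsic evidence (RNA-Seq alignments, assembled
    transcripts, protein alignments) used to inform GHMM gene predictors."""
    return {"rna_seq": [], "transcripts": [], "proteins": []}

def build_consensus(masked_genome: str, evidence: dict, predictions: list) -> list:
    """Stage 3: consolidate evidence and ab initio predictions into a
    high-confidence consensus gene set (placeholder)."""
    return predictions

genome = "ACGT" * 10
masked = mask_repeats(genome)
evidence = collect_evidence(masked)
consensus = build_consensus(masked, evidence, predictions=[])
print(f"{len(consensus)} consensus gene models")
```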

    Breaking Good: Accounting for Fragility of Genomic Regions in Rearrangement Distance Estimation

    Get PDF
    Models of evolution by genome rearrangements are prone to two types of flaws: One is to ignore the diversity of susceptibility to breakage across genomic regions, and the other is to suppose that susceptibility values are given. Without necessarily supposing their precise localization, we call “solid” the regions that are improbably broken by rearrangements and “fragile” the regions outside solid ones. We propose a model of evolution by inversions where breakage probabilities vary across fragile regions and over time. It contains as a particular case the uniform breakage model on the nucleotidic sequence, where breakage probabilities are proportional to fragile region lengths. This is very different from the frequently used pseudouniform model where all fragile regions have the same probability to break. Estimations of rearrangement distances based on the pseudouniform model completely fail on simulations with the truly uniform model. On pairs of amniote genomes, we show that identifying coding genes with solid regions yields incoherent distance estimations, especially with the pseudouniform model, and to a lesser extent with the truly uniform model. This incoherence is solved when we coestimate the number of fragile regions with the rearrangement distance. The estimated number of fragile regions is surprisingly small, suggesting that a minority of regions are recurrently used by rearrangements. Estimations for several pairs of genomes at different divergence times are in agreement with a slowly evolvable colocalization of active genomic regions in the cell.

    An algebraic model for inversion and deletion in bacterial genome rearrangement

    Get PDF
    Reversals are a major contributor to variation among bacterial genomes, with studies suggesting that reversals involving small numbers of regions are more likely than larger reversals. Deletions may arise in bacterial genomes through the same biological mechanism as reversals, and hence a model that incorporates both is desirable. However, while reversal distances between genomes have been well studied, there has yet to be a model which accounts for the combination of deletions and short reversals. To account for both of these operations, we introduce an algebraic model that utilises partial permutations. This leads to an algorithm for calculating the minimum distance to the most recent common ancestor of two bacterial genomes evolving by short reversals and deletions. The algebraic model makes the existing short reversal models more complete and realistic by including deletions, and also introduces new algebraic tools into evolutionary distance problems.
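    The two operations can be illustrated on a toy signed permutation of regions (an illustration of what inversions and deletions do, not the authors' partial-permutation algebra):

```python
# A toy illustration of the two operations discussed above on a signed
# permutation of regions.

def invert(genome, i, j):
    """Reverse the segment genome[i:j] and flip the sign (orientation)
    of every region in it, modelling a (short) inversion."""
    segment = [-g for g in reversed(genome[i:j])]
    return genome[:i] + segment + genome[j:]

def delete(genome, i, j):
    """Remove the segment genome[i:j], modelling a deletion."""
    return genome[:i] + genome[j:]

genome = [1, 2, 3, 4, 5]          # regions in ancestral order and orientation
genome = invert(genome, 1, 3)     # short inversion of regions 2 and 3
print(genome)                     # [1, -3, -2, 4, 5]
genome = delete(genome, 3, 4)     # deletion of region 4
print(genome)                     # [1, -3, -2, 5]
```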

    Epigenetics of complex traits and diseases

    Get PDF
    Thousands of genetic and epigenetic variants have been identified for many common diseases including cancer through genome-wide association studies (GWAS) and epigenome-wide association studies (EWAS). To advance the complex interpretation of both GWAS and EWAS results, I developed new software tools (FORGE2 and eFORGE) for the analysis and interpretation of GWAS and EWAS data, respectively. Both tools determine the cell type-specific regulatory component of a set of target regions (either GWAS-identified genetic variants or EWAS-identified differentially methylated positions). This is achieved by detecting enrichment of overlap with histone mark peaks or DNase I hypersensitive sites across hundreds of tissues, primary cell types, and cell lines from the ENCODE, Roadmap Epigenomics, and BLUEPRINT projects. Application of both tools to publicly available datasets identified novel disease-relevant cell types for many common diseases, a stem cell-like signature in cancer EWAS, and also demonstrated the ability to detect cell-composition effects for EWAS performed on heterogeneous tissues. To complement these bioinformatics efforts and validate selected variants predicted by FORGE2, eFORGE and additional analyses, I performed conformation capture using 4C-seq to fine-map the 3D context of the genomic regions involved, uncovering novel interactions for autoimmunity-associated variants and IKZF3
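    The kind of overlap-enrichment test performed by such tools can be sketched as follows (toy coordinates and a simple permutation null; this is not the FORGE2/eFORGE implementation): count how many target regions overlap a peak set and compare against the overlap counts of randomly placed regions.

```python
import random

def overlaps(region, peaks):
    """Return True if the (start, end) region overlaps any peak interval."""
    start, end = region
    return any(start < p_end and p_start < end for p_start, p_end in peaks)

def overlap_count(regions, peaks):
    return sum(overlaps(r, peaks) for r in regions)

rng = random.Random(42)
chrom_len = 1_000_000
# Toy peak set: one 500-bp peak every 10 kb.
peaks = [(s, s + 500) for s in range(0, chrom_len, 10_000)]
# Toy target regions deliberately placed inside peaks.
targets = [(s + 200, s + 300) for s in (0, 10_000, 20_000, 30_000, 40_000)]

observed = overlap_count(targets, peaks)

# Null distribution: overlap counts of randomly placed region sets of the same size.
n_perm = 200
null_counts = []
for _ in range(n_perm):
    background = [(s, s + 100) for s in
                  (rng.randrange(0, chrom_len - 100) for _ in targets)]
    null_counts.append(overlap_count(background, peaks))

p_value = (1 + sum(c >= observed for c in null_counts)) / (n_perm + 1)
print(f"observed overlaps = {observed}, empirical p = {p_value:.3f}")
```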

    Workflow-based systematic design of high throughput genome annotation

    No full text
    The genus Eimeria belongs to the phylum Apicomplexa, which includes many obligate intracellular protozoan parasites of humans and livestock. E. tenella is one of seven species that infect the domestic chicken and cause the intestinal disease coccidiosis, which is economically important to the poultry industry. E. tenella is highly pathogenic and is often used as a model species for studies of Eimeria biology. In this PhD thesis, a comprehensive annotation system named "WAGA" (Workflow-based Automatic Genome Annotation) was built and applied to the E. tenella genome. InforSense KDE and its BioSense plug-in (products of the InforSense Company) were the core software used to build the workflows. Workflows were made by integrating individual bioinformatics tools into a single platform. Each workflow was designed to provide a standalone service for a particular task. Three major workflows were developed based on the genomic resources currently available for E. tenella: EST-based gene construction, HMM-based gene prediction, and protein-based annotation. Finally, a combining workflow was built on top of the individual ones to generate a set of automatic annotations using all of the available information. The overall system and its three major components were deployed as web servers that are fully tuneable and reusable for end users. WAGA does not require users to have programming skills or knowledge of the underlying algorithms or mechanisms of its low-level components. E. tenella was the target genome here, and all results obtained were displayed in GBrowse. A sample of the results was selected for experimental validation. For evaluation purposes, WAGA was also applied to another apicomplexan parasite, Plasmodium falciparum, the causative agent of human malaria, which has been extensively annotated. The results obtained were compared with the gene predictions of PHAT, a gene finder designed for and used in the P. falciparum genome project.

    COMPUTER METHODS FOR PRE-MICRORNA SECONDARY STRUCTURE PREDICTION

    Get PDF
    This thesis presents a new algorithm to predict pre-microRNA secondary structure. Accurate prediction of pre-microRNA secondary structure is important in miRNA informatics. Based on a recently proposed model for RNA secondary structure prediction, nucleotide cyclic motifs (NCM), we propose and implement a Modified NCM (MNCM) model with a physics-based scoring strategy to tackle the problem of pre-microRNA folding. Our tool, microRNAfold, is implemented using a globally optimal algorithm built from bottom-up locally optimal solutions. Studying the functions of multiple genes and predicting the secondary structure of multiple related microRNAs is particularly meaningful, since many polygenic traits in animals and plants are controlled by more than a single gene. We propose a parallel algorithm based on a master-slave architecture to predict the secondary structure from an input sequence. The experimental results show that our algorithm is able to produce the optimal secondary structure of polycistronic microRNAs, and the speedups of our parallel algorithm follow the trend of the theoretical speedups. Conserved secondary structures are likely to be functional, and secondary structural characteristics that are shared between endogenous pre-miRNAs may contribute to efficient biogenesis, so identifying conserved secondary structures and characteristics in RNA is an important research problem. Once such characteristics are extracted from the secondary structures of RNAs, the corresponding patterns or rules can be identified and applied. We propose to use conserved microRNA characteristics in two ways: to improve prediction through a knowledge base, and to distinguish real microRNAs from pseudo microRNAs. Through statistical analysis of classification performance, we verify that the conserved characteristics extracted from microRNA secondary structures are sufficiently discriminative. Gene suppression is a powerful tool for functional genomics and for eliminating specific gene products. However, current gene suppression vectors can only be used to silence a single gene at a time. We therefore design an efficient polycistronic microRNA vector, and a web-based tool allows users to design their own microRNA vectors online.
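    The master-slave parallelisation described above can be sketched with a small master-worker example (the scoring function is a placeholder, not the NCM/MNCM physics-based scoring):

```python
from multiprocessing import Pool

# A minimal master-worker sketch: the master splits candidate structures among
# worker processes, each worker scores its share, and the master keeps the best.

def score_structure(candidate):
    """Placeholder score: count paired positions (brackets) in a
    dot-bracket string; higher is treated as better here."""
    return sum(c in "()" for c in candidate), candidate

candidates = [
    "((((....))))",
    "((......))..",
    "(((......)))",
    ".(((....))).",
]

if __name__ == "__main__":
    with Pool(processes=2) as pool:
        scored = pool.map(score_structure, candidates)
    best_score, best_structure = max(scored)
    print(f"best structure {best_structure} with score {best_score}")
```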

    Bioinformatics

    Get PDF
    This book is divided into different research areas relevant to Bioinformatics, such as biological networks, next-generation sequencing, high-performance computing, molecular modeling, structural bioinformatics, and intelligent data analysis. Each book section introduces the basic concepts and then explains its application to problems of great relevance, so both novice and expert readers can benefit from the information and research works presented here.