Weighted Minimum-Length Rearrangement Scenarios
We present the first known model of genome rearrangement with an arbitrary real-valued weight function on the rearrangements. It is based on the dominant model for the mathematical and algorithmic study of genome rearrangement, Double Cut and Join (DCJ). Our objective function is the sum or product of the weights of the DCJs in an evolutionary scenario, and the function can be minimized or maximized. If the likelihood of observing an independent DCJ was estimated based on biological conditions, for example, then this objective function could be the likelihood of observing the independent DCJs together in a scenario. We present an O(n^4)-time dynamic programming algorithm solving the Minimum Cost Parsimonious Scenario (MCPS) problem for co-tailed genomes with n genes (or syntenic blocks). Combining this with our previous work on MCPS yields a polynomial-time algorithm for general genomes. The key theoretical contribution is a novel link between the parsimonious DCJ (or 2-break) scenarios and quadrangulations of a regular polygon. To demonstrate that our algorithm is fast enough to treat biological data, we run it on syntenic blocks constructed for Human paired with Chimpanzee, Gibbon, Mouse, and Chicken. We argue that the Human and Gibbon pair is a particularly interesting model for the study of weighted genome rearrangements
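The link between the product and sum objectives can be made explicit with a short worked equation. This is a sketch under an added assumption that the abstract does not make: every weight w(ρ_i) is a probability in (0, 1], so its logarithm is defined. The symbols S, ρ_i and k (a scenario, its i-th DCJ, and its length) are notation introduced here for illustration.

```latex
% Assumption: S = (\rho_1, \dots, \rho_k) is a scenario and 0 < w(\rho_i) \le 1.
\[
  \arg\max_{S} \prod_{i=1}^{k} w(\rho_i)
  \;=\;
  \arg\min_{S} \sum_{i=1}^{k} \bigl(-\log w(\rho_i)\bigr)
\]
% The logarithm is strictly increasing, so maximizing the product of
% likelihoods is equivalent to minimizing the sum of the costs -\log w(\rho_i).
```

Under that assumption, a maximum-likelihood scenario can be sought with the same machinery as a minimum-cost one by using the costs -log w(ρ_i).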
Breaking Good: Accounting For Fragility Of Genomic Regions In Rearrangement Distance Estimation
Models of evolution by genome rearrangements are prone to two types of flaws: One is to ignore the diversity of susceptibility to breakage across genomic regions, and the other is to suppose that susceptibility values are given. Without necessarily supposing their precise localization, we call "solid" the regions that are improbably broken by rearrangements and "fragile" the regions outside solid ones. We propose a model of evolution by inversions where breakage probabilities vary across fragile regions and over time. It contains as a particular case the uniform breakage model on the nucleotidic sequence, where breakage probabilities are proportional to fragile region lengths. This is very different from the frequently used pseudouniform model where all fragile regions have the same probability to break. Estimations of rearrangement distances based on the pseudouniform model completely fail on simulations with the truly uniform model. On pairs of amniote genomes, we show that identifying coding genes with solid regions yields incoherent distance estimations, especially with the pseudouniform model, and to a lesser extent with the truly uniform model. This incoherence is solved when we coestimate the number of fragile regions with the rearrangement distance. The estimated number of fragile regions is surprisingly small, suggesting that a minority of regions are recurrently used by rearrangements. Estimations for several pairs of genomes at different divergence times are in agreement with a slowly evolvable colocalization of active genomic regions in the cell.
Volume 8, issue 5, pages 1427-1439. Funding: Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP) [2013/25084-2]; French Agence Nationale de la Recherche (ANR) [ANR-10-BINF-01-01]; ICT FP7 European programme EVOEVO.
Efficient Algorithms for Prokaryotic Whole Genome Assembly and Finishing
De-novo genome assembly from DNA fragments is primarily based on sequence overlap information. In addition, mate-pair reads or paired-end reads provide linking information for joining gaps and bridging repeat regions. Genome assemblers in general assemble long contiguous sequences (contigs) using both overlapping reads and linked reads until the assembly runs into an ambiguous repeat region. These contigs are further bridged into scaffolds using linked read information. However, errors can be made in both phases of assembly due to a high error threshold for overlap acceptance and linking based on too few mate reads. Identical as well as similar repeat regions can often cause errors in overlap and mate-pair evidence. In addition, the problem of setting the correct threshold to minimize errors and optimize assembly of reads is not trivial and often requires a time-consuming trial-and-error process to obtain optimal results. The typical trial-and-error process with multiple assemblers can be computationally intensive and is very inefficient, especially when users must learn how to use a wide variety of assemblers, many of which may be serial, require long execution times, and may not return usable or accurate results. Further, we show that the comparison of assembly results may not provide the users with a clear winner under all circumstances. Therefore, we propose a novel scaffolding tool, Correlative Algorithm for Repeat Placement (CARP), capable of joining short low-error contigs using mate-pair reads, computationally resolved repeat structures and synteny with one or more reference organisms. The CARP tool requires a set of repeat sequences, such as insertion sequences (IS), that can be found computationally without assembling the genome. Development of methods to identify such repeating regions directly from raw sequence reads or draft genomes led to the development of the ISQuest software package.
ISQuest identifies bacterial ISs and their sequence elements (inverted and direct repeats) in raw read data or contigs using flexible search parameters. ISQuest is capable of finding ISs in hundreds of partially assembled genomes within hours, making it a valuable high-throughput tool for a global search of IS and repeat elements.
The CARP tool matches very low-error contigs with strong overlap using the ambiguous partial repeat sequence at the ends of each contig, annotated using the repeat sequences discovered by ISQuest. These matches are verified by synteny with genomes of one or more reference organisms. We show that the CARP tool can be used to verify regions with low mate-pair evidence, independently find new joins and significantly reduce the number of scaffolds. Finally, we demonstrate a novel viewer that presents to the user the computationally derived joins along with the evidence used to make the joins. The viewer allows the user to independently assess their confidence in the joins made by the finishing tools and make an informed decision of whether to invest the resources necessary to confirm a particular portion of the assembly. Further, we allow users to manually record join evidence, re-order contigs, and track the assembly finishing process
An approach to improved microbial eukaryotic genome annotation
New DNA sequencing technologies have accelerated the rate at which genomic data are
generated. However, once genomes are sequenced and assembled, an ongoing challenge
is the accurate structural annotation of these new genomic sequences. By
sequencing and assembling the transcriptome (RNA-Seq) of the same organism, the accuracy of
genome annotation can be improved, because RNA-Seq reads and assembled
transcripts provide precise information about gene structure. Several current
bioinformatics pipelines incorporate information from RNA-Seq as well as
protein sequence similarity data to automate the structural annotation of a
genome so that its quality approaches that of expert annotation. The
pipelines generally follow a similar workflow. First, repetitive regions are
identified to avoid distorting sequence alignments and gene predictions.
Second, a database is built containing experimental data such as
alignments of sequence reads, transcripts and proteins, which informs
gene predictions based on generalised Hidden Markov Models. The final step
is to consolidate the sequence alignments and gene predictions into a
high-quality consensus. However, existing pipelines are complex and therefore susceptible to bias and
errors, which can poison gene predictions and the construction of consensus
models. We have developed an improved approach for the annotation of microbial
eukaryotic genomes. Our approach has two main aspects. The first focuses
on creating the most complete and diverse set of extrinsic evidence possible to better
inform gene predictions. The second concerns building the consensus gene
model using extrinsic evidence and HMM predictions such that the influence
of their potential biases is reduced. Comparison of our new tool with three popular
pipelines demonstrates significant gains in sensitivity and specificity of gene,
transcript, exon and intron models in the structural annotation of microbial
eukaryotic genomes.
New sequencing technologies have considerably accelerated the rate at which genomic data is
being generated. One ongoing challenge is the accurate structural annotation of those novel
genomes once sequenced and assembled, in particular if the organism does not have close
relatives with well-annotated genomes. Whole-transcriptome sequencing (RNA-Seq) and
assembly—both of which share similarities to whole-genome sequencing and assembly,
respectively—have been shown to dramatically increase the accuracy of gene annotation. Read
coverage, inferred splice junctions and assembled transcripts can provide valuable information
about gene structure. Several annotation pipelines have been developed to automate structural
annotation by incorporating information from RNA-Seq, as well as protein sequence similarity
data, with the goal of reaching the accuracy of an expert curator. Annotation pipelines follow a
similar workflow. The first step is to identify repetitive regions to prevent misinformed sequence
alignments and gene predictions. The next step is to construct a database of evidence from
experimental data such as RNA-Seq mapping and assembly, and protein sequence alignments,
which are used to inform the generalised Hidden Markov Models of gene prediction software.
The final step is to consolidate sequence alignments and gene predictions into a high-confidence
consensus set. Thus, automated pipelines are complex, and therefore susceptible to incomplete
and erroneous use of information, which can poison gene predictions and consensus model
building. Here, we present an improved approach to microbial eukaryotic genome annotation.
Its conception was based on identifying and mitigating potential sources of error and bias that
are present in available pipelines. Our approach has two main aspects. The first is to create a
more complete and diverse set of extrinsic evidence to better inform gene predictions. The
second is to use extrinsic evidence in tandem with predictions such that the influence of their
respective biases in the consensus gene models is reduced. We benchmarked our new tool
against three known pipelines, showing significant gains in gene, transcript, exon and intron
sensitivity and specificity in the genome annotation of microbial eukaryotes
Breaking Good: Accounting for Fragility of Genomic Regions in Rearrangement Distance Estimation
Models of evolution by genome rearrangements are prone to two types of flaws: One is to ignore the diversity of susceptibility to breakage across genomic regions, and the other is to suppose that susceptibility values are given. Without necessarily supposing their precise localization, we call “solid” the regions that are improbably broken by rearrangements and “fragile” the regions outside solid ones. We propose a model of evolution by inversions where breakage probabilities vary across fragile regions and over time. It contains as a particular case the uniform breakage model on the nucleotidic sequence, where breakage probabilities are proportional to fragile region lengths. This is very different from the frequently used pseudo uniform model where all fragile regions have the same probability to break. Estimations of rearrangement distances based on the pseudo uniform model completely fail on simulations with the truly uniform model. On pairs of amniote genomes, we show that identifying coding genes with solid regions yields incoherent distance estimations, especially with the pseudo uniform model, and to a lesser extent with the truly uniform model. This incoherence is solved when we coestimate the number of fragile regions with the rearrangement distance. The estimated number of fragile regions is surprisingly small, suggesting that a minority of regions are recurrently used by rearrangements. Estimations for several pairs of genomes at different divergence times are in agreement with a slowly evolvable colocalization of active genomic regions in the cell
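The difference between the two breakage models named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `model` argument, and the simplification of drawing a single broken region at a time are all assumptions made here for illustration.

```python
import random


def breakage_probabilities(lengths, model="uniform"):
    """Return one breakage probability per fragile region.

    model="uniform":       probability proportional to the region length
    model="pseudouniform": every fragile region is equally likely to break
    (Hypothetical helper; the names are not taken from the paper.)
    """
    if not lengths or any(x <= 0 for x in lengths):
        raise ValueError("lengths must be a non-empty list of positive numbers")
    if model == "uniform":
        total = sum(lengths)
        return [x / total for x in lengths]
    if model == "pseudouniform":
        return [1 / len(lengths)] * len(lengths)
    raise ValueError(f"unknown model: {model}")


def sample_broken_region(lengths, model="uniform", rng=None):
    """Draw the index of one fragile region hit by a breakage event."""
    rng = rng or random.Random()
    probs = breakage_probabilities(lengths, model)
    return rng.choices(range(len(lengths)), weights=probs, k=1)[0]


# Two fragile regions of 1 and 3 nucleotides (made-up lengths).
uniform_probs = breakage_probabilities([1, 3], "uniform")
pseudo_probs = breakage_probabilities([1, 3], "pseudouniform")
```

With these made-up lengths the uniform model breaks the longer region three times as often as the shorter one, while the pseudouniform model treats both alike, which is the contrast the abstract draws.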
An algebraic model for inversion and deletion in bacterial genome rearrangement
Reversals are a major contributor to variation among bacterial genomes, with
studies suggesting that reversals involving small numbers of regions are more
likely than larger reversals. Deletions may arise in bacterial genomes through
the same biological mechanism as reversals, and hence a model that incorporates
both is desirable. However, while reversal distances between genomes have been
well studied, there has yet to be a model which accounts for the combination of
deletions and short reversals.
To account for both of these operations, we introduce an algebraic model that
utilises partial permutations. This leads to an algorithm for calculating the
minimum distance to the most recent common ancestor of two bacterial genomes
evolving by short reversals and deletions. The algebraic model makes the
existing short reversal models more complete and realistic by including
deletions, and also introduces new algebraic tools into evolutionary distance
problems.
Comment: 19 pages, 10 figures
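The two operations this abstract combines can be sketched on a toy genome. This is a minimal illustration, not the algebraic model of the paper: it assumes an unsigned, linear list of region labels (ignoring circularity and orientation), and the function names are hypothetical.

```python
def reverse_segment(genome, start, end):
    """Return a copy of genome with the regions genome[start:end] reversed."""
    if not 0 <= start < end <= len(genome):
        raise ValueError("need 0 <= start < end <= len(genome)")
    return genome[:start] + genome[start:end][::-1] + genome[end:]


def delete_region(genome, index):
    """Return a copy of genome with the region at position index removed."""
    if not 0 <= index < len(genome):
        raise ValueError("index out of range")
    return genome[:index] + genome[index + 1:]


genome = [1, 2, 3, 4, 5]
after_reversal = reverse_segment(genome, 1, 3)   # short reversal of two adjacent regions
after_deletion = delete_region(after_reversal, 3)  # deletion of a single region
```

Here the short reversal swaps regions 2 and 3, and the deletion then removes region 4, leaving a genome with fewer regions than its ancestor, which is why a representation that tolerates missing regions is needed.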
Epigenetics of complex traits and diseases
Thousands of genetic and epigenetic variants have been identified for many common diseases including cancer through genome-wide association studies (GWAS) and epigenome-wide association studies (EWAS). To advance the complex interpretation of both GWAS and EWAS results, I developed new software tools (FORGE2 and eFORGE) for the analysis and interpretation of GWAS and EWAS data, respectively. Both tools determine the cell type-specific regulatory component of a set of target regions (either GWAS-identified genetic variants or EWAS-identified differentially methylated positions). This is achieved by detecting enrichment of overlap with histone mark peaks or DNase I hypersensitive sites across hundreds of tissues, primary cell types, and cell lines from the ENCODE, Roadmap Epigenomics, and BLUEPRINT projects. Application of both tools to publicly available datasets identified novel disease-relevant cell types for many common diseases, a stem cell-like signature in cancer EWAS, and also demonstrated the ability to detect cell-composition effects for EWAS performed on heterogeneous tissues. To complement these bioinformatics efforts and validate selected variants predicted by FORGE2, eFORGE and additional analyses, I performed conformation capture using 4C-seq to fine-map the 3D context of the genomic regions involved, uncovering novel interactions for autoimmunity-associated variants and IKZF3
Workflow-based systematic design of high throughput genome annotation
The genus Eimeria belongs to the phylum Apicomplexa, which includes many obligate intra-cellular protozoan parasites of man and livestock. E. tenella is one of seven species that infect
the domestic chicken and cause the intestinal disease coccidiosis, which is economically important
for the poultry industry. E. tenella is highly pathogenic and is often used as a model species for
studies of Eimeria biology. In this PhD thesis, a comprehensive annotation system named
"WAGA" (Workflow-based Automatically Genome Annotation) was built and applied to
the E. tenella genome. InforSense KDE and its BioSense plug-in (products of the InforSense
Company) were the core software used to build the workflows.
Workflows were made by integrating individual bioinformatics tools into a single platform.
Each workflow was designed to provide a standalone service for a particular task. Three major
workflows were developed based on the genomic resources currently available for E. tenella.
These were EST-based gene construction, HMM-based gene prediction and protein-based
annotation. Finally, a combining workflow was built to sit above the individual ones to generate
a set of automatic annotations using all of the available information. The overall system and
its three major components were deployed as web servers that are fully tuneable and reusable
for end users. WAGA does not require users to have programming skills or knowledge of the
underlying algorithms or mechanisms of its low level components.
E. tenella was the target genome here and all the results obtained were displayed by GBrowse.
A sample of the results was selected for experimental validation. For evaluation purposes, WAGA
was also applied to another Apicomplexa parasite, Plasmodium falciparum, the causative agent
of human malaria, which has been extensively annotated. The results obtained were compared
with gene predictions of PHAT, a gene finder designed for and used in the P. falciparum genome
project
COMPUTER METHODS FOR PRE-MICRORNA SECONDARY STRUCTURE PREDICTION
This thesis presents a new algorithm to predict pre-microRNA secondary structure. Accurate prediction of pre-microRNA secondary structure is important in miRNA informatics. Building on nucleotide cyclic motifs (NCM), a recently proposed model for predicting RNA secondary structure, we propose and implement a Modified NCM (MNCM) model with a physics-based scoring strategy to tackle the problem of pre-microRNA folding. Our tool, microRNAfold, is implemented using a globally optimal algorithm built bottom-up from locally optimal solutions.
It has been shown that studying the functions of multiple genes and predicting the secondary structure of multiple related microRNAs is more important and meaningful, since many polygenic traits in animals and plants are controlled by more than a single gene. We propose a parallel algorithm based on the master-slave architecture to predict the secondary structure from an input sequence. The experimental results show that our algorithm is able to produce the optimal secondary structure of polycistronic microRNAs. The trend of the speedups of our parallel algorithm matches that of the theoretical speedups.
Conserved secondary structures are likely to be functional, and secondary structural characteristics that are shared between endogenous pre-miRNAs may contribute toward efficient biogenesis. Identifying conserved secondary structure and conserved characteristics in RNA is therefore an important research field. Once the characteristics are extracted from the secondary structures of RNAs, the corresponding patterns or rules can be derived and used.
We propose to use the conserved microRNA characteristics in two ways: to improve prediction through a knowledge base, and to distinguish real microRNAs from pseudo microRNAs. Through statistical analysis of the classification performance, we verify that the conserved characteristics extracted from the secondary structures of microRNAs are precise enough.
Gene suppression is a powerful tool for functional genomics and for the elimination of specific gene products. However, current gene suppression vectors can only be used to silence a single gene at a time. We therefore design an efficient polycistronic microRNA vector, together with a web-based tool that allows users to design their own microRNA vectors online
Bioinformatics
This book is divided into different research areas relevant in Bioinformatics such as biological networks, next generation sequencing, high performance computing, molecular modeling, structural bioinformatics and intelligent data analysis. Each book section introduces the basic concepts and then explains its application to problems of great relevance, so both novice and expert readers can benefit from the information and research works presented here