9 research outputs found
Back-translation for discovering distant protein homologies in the presence of frameshift mutations
Background: Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are additionally involved in the divergence, homology detection becomes difficult even at the DNA level.

Results: We developed a novel method to infer distant homology relations between two proteins that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences that have the best-scoring alignment under a scoring system designed to reflect the most probable evolutionary process. Our implementation is freely available at http://bioinfo.lifl.fr/path/.

Conclusions: Our approach makes it possible to uncover evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples.
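
To make the idea of "putative DNA sequences" concrete, here is a minimal Python sketch. It is not the authors' graph representation or scoring system. It assumes the standard genetic code and uses hypothetical helper names (back_translation_layers, count_back_translations, translate). Each amino acid is back-translated to its set of codons, so a protein becomes a sequence of codon layers, and the number of putative DNA sequences is the product of the layer sizes. The last function shows how shifting the reading frame by one base changes the translated protein.

```python
from itertools import product
from math import prod

# Standard genetic code, bases ordered T, C, A, G (third base varies fastest).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Inverse map: amino acid -> list of codons that encode it.
AA_TO_CODONS = {}
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS.setdefault(aa, []).append(codon)


def back_translation_layers(protein):
    """Hypothetical helper: one layer of candidate codons per amino acid."""
    return [AA_TO_CODONS[aa] for aa in protein]


def count_back_translations(protein):
    """Number of distinct DNA sequences that translate to this protein."""
    return prod(len(layer) for layer in back_translation_layers(protein))


def translate(dna, frame=0):
    """Translate DNA starting at the given frame offset; drop trailing bases."""
    usable = dna[frame:]
    codons = (usable[i:i + 3] for i in range(0, len(usable) - 2, 3))
    return "".join(CODON_TO_AA[c] for c in codons)


if __name__ == "__main__":
    print(count_back_translations("ML"))   # Met has 1 codon, Leu has 6
    print(translate("ATGGCCAAA"))          # reading frame 0
    print(translate("ATGGCCAAA", frame=1)) # same DNA, shifted by one base
```

The layered view keeps memory linear in the protein length even though the number of putative DNA sequences grows exponentially, which is the motivation for a compact representation.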
Designing Efficient Spaced Seeds for SOLiD Read Mapping
The advent of high-throughput sequencing technologies constituted
a major advance in genomic studies, offering new prospects in a
wide range of applications. We propose a rigorous and flexible algorithmic
solution to mapping SOLiD color-space reads to a reference genome. The
solution relies on an advanced method of seed design that uses a faithful
probabilistic model of read matches and, on the other hand, a novel
seeding principle especially adapted to read mapping. Our method can
handle both lossy and lossless frameworks and is able to distinguish, at
the level of seed design, between SNPs and reading errors. We illustrate
our approach by several seed designs and demonstrate their efficiency
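
As an illustration of why seed design can tell SNPs from reading errors, here is a minimal Python sketch. It assumes the usual two-base color encoding (the color of a di-nucleotide is the XOR of the 2-bit codes A=0, C=1, G=2, T=3) and ignores the leading primer base of real SOLiD reads. The seed pattern "1001" and the helper names are hypothetical, chosen for the example rather than taken from the paper. In color space, an isolated SNP changes two adjacent colors, whereas a single reading error changes only one.

```python
# 2-bit base codes; the color of a di-nucleotide is the XOR of its two codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def to_colors(dna):
    """Encode a DNA string as the colors of its overlapping di-nucleotides."""
    return [BASE_CODE[x] ^ BASE_CODE[y] for x, y in zip(dna, dna[1:])]


def seed_matches(window_a, window_b, seed):
    """Spaced-seed test: positions marked '1' must agree, '0' are don't-care."""
    return all(a == b for a, b, s in zip(window_a, window_b, seed) if s == "1")


reference = "ACGTAC"
with_snp = "ACTTAC"  # G -> T substitution at index 2

ref_colors = to_colors(reference)
snp_colors = to_colors(with_snp)

# Indices where the two color sequences differ: the SNP hits two adjacent colors.
changed = [i for i, (a, b) in enumerate(zip(ref_colors, snp_colors)) if a != b]

if __name__ == "__main__":
    print(ref_colors, snp_colors, changed)
    # A contiguous seed fails on the SNP; a seed with two adjacent
    # don't-care positions still produces a hit.
    print(seed_matches(ref_colors[:4], snp_colors[:4], "1111"))
    print(seed_matches(ref_colors[:4], snp_colors[:4], "1001"))
```

A seed whose don't-care positions come in adjacent pairs tolerates SNPs, while isolated don't-care positions tolerate single-color reading errors.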
De nouvelles méthodes pour l'alignement des séquences biologiques
Biological sequence alignment is a fundamental technique in bioinformatics. It consists of identifying series of similar (conserved) characters that appear in the same order in both sequences, and ultimately deducing a set of modifications (substitutions, insertions and deletions) involved in the transformation of one sequence into the other. This technique allows one to infer, based on sequence similarity, whether two or more biological sequences are potentially homologous, i.e. whether they share a common ancestor, thus enabling the understanding of sequence evolution. This thesis addresses sequence comparison problems in two different contexts: homology detection and high-throughput DNA sequencing. The goal of this work is to develop sensitive alignment methods that provide solutions to the following two problems: i) the detection of hidden protein homologies by protein sequence comparison, when the divergence is caused by frameshift mutations, and ii) mapping short SOLiD reads (sequences of overlapping di-nucleotides encoded as colors) to a reference genome. In both cases, the same general idea is applied: to implicitly compare DNA sequences in order to detect changes occurring at this level, while manipulating, in practice, other representations (protein sequences, sequences of di-nucleotide codes) that provide additional information and thus help to improve the similarity search. The aim is to design and implement exact and heuristic alignment methods, along with scoring schemes, adapted to these scenarios.
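
For readers unfamiliar with alignment as a set of substitutions, insertions and deletions, here is a minimal Python sketch of textbook global alignment (Needleman-Wunsch) by dynamic programming. It is a generic illustration, not one of the methods developed in the thesis, and the scores (match=1, mismatch=-1, gap=-2) are arbitrary assumptions.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # match or substitution
                score[i - 1][j] + gap,      # deletion from a
                score[i][j - 1] + gap,      # insertion into a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(global_align("ACGT", "AGT"))
```

The gap character "-" in the output marks an insertion or deletion; columns with differing letters mark substitutions.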
Back-translation for discovering distant protein homologies in the presence of frameshift mutations
Background: Frameshift mutations in protein-coding DNA sequences produce a drastic change in the resulting protein sequence, which prevents classic protein alignment methods from revealing the proteins' common origin. Moreover, when a large number of substitutions are additionally involved in the divergence, homology detection becomes difficult even at the DNA level. Results: We developed a novel method to infer distant homology relations between two proteins that accounts for frameshift and point mutations that may have affected the coding sequences. We design a dynamic programming alignment algorithm over memory-efficient graph representations of the complete set of putative DNA sequences of each protein, with the goal of determining the two putative DNA sequences that have the best-scoring alignment under a scoring system designed to reflect the most probable evolutionary process. Our implementation is freely available at http://bioinfo.lifl.fr/path/. Conclusions: Our approach makes it possible to uncover evolutionary information that is not captured by traditional alignment methods, which is confirmed by biologically significant examples.
Designing efficient spaced seeds for SOLiD read mapping
The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications. We propose a rigorous and flexible algorithmic solution to mapping SOLiD color-space reads to a reference genome. The solution relies on an advanced method of seed design that uses a faithful probabilistic model of read matches and, on the other hand, a novel seeding principle especially adapted to read mapping. Our method can handle both lossy and lossless frameworks and is able to distinguish, at the level of seed design, between SNPs and reading errors. We illustrate our approach by several seed designs and demonstrate their efficiency
Seed design framework for mapping SOLiD reads
The advent of high-throughput sequencing technologies constituted a major advance in genomic studies, offering new prospects in a wide range of applications. We propose a rigorous and flexible algorithmic solution to mapping SOLiD color-space reads to a reference genome. The solution relies on an advanced method of seed design that uses a faithful probabilistic model of read matches and, on the other hand, a novel seeding principle especially adapted to read mapping. Our method can handle both lossy and lossless frameworks and is able to distinguish, at the level of seed design, between SNPs and reading errors. We illustrate our approach by several seed designs and demonstrate their efficiency