5 research outputs found

    A comparative study of sequence analysis tools in computational biology

    A biomolecular object, such as a deoxyribonucleic acid (DNA), a ribonucleic acid (RNA) or a protein molecule, is made up of a long chain of subunits. A protein is represented as a sequence built from 20 different amino acids, each denoted by a letter. Different amino acid sequences can generate similar structural domains in proteins in a vast number of ways. By contrast, the structure of DNA, made up of only four different nucleotide building blocks that occur in two pairs, is relatively simple, regular, and predictable. Biomolecular sequence alignment, or string search, is a central and challenging task in many areas of science and information processing. It involves identifying one-to-one correspondences between subunits of different sequences. The efficiency of an alignment algorithm or tool depends on several important factors, including scoring systems, alignment statistics, database redundancy, and sequence repetitiveness. Sequence motifs are derived from multiple alignments and can be used to examine individual sequences or an entire database for subtle patterns. With motifs, it is sometimes possible to detect distant relationships that cannot be demonstrated by comparing primary sequences alone. A more comprehensive approach to efficient string search builds a small, representative set of motifs and uses it as a screening database, with automatic masking of matching query subsequences. This technology is still under development, but recent studies indicate that a representative set of only 1,000 to 3,000 sequences may suffice, and such a database can be searched in seconds.

    Sim4cc: a cross-species spliced alignment program

    Advances in sequencing technologies have accelerated the sequencing of new genomes, far outpacing the generation of gene and protein resources needed to annotate them. Direct comparison and alignment of existing cDNA sequences from a related species is an effective and readily available means to determine genes in the new genomes. Current spliced alignment programs are inadequate for comparing sequences between different species, owing to their low sensitivity and splice junction accuracy. A new spliced alignment tool, sim4cc, overcomes problems in the earlier tools by incorporating three new features: universal spaced seeds, which increase sensitivity and allow comparisons between species at various evolutionary distances, together with powerful splice signal models and evolutionarily aware alignment techniques, which improve the accuracy of gene models. When tested on vertebrate comparisons at diverse evolutionary distances, sim4cc had significantly higher sensitivity than existing alignment programs, more than 10% higher than the closest competitor for some comparisons, while being comparable in speed to its predecessor, sim4. Sim4cc can be used in one-to-one or one-to-many comparisons of genomic and cDNA sequences, and can also be effectively incorporated into a high-throughput annotation engine, as demonstrated by the mapping of 64,000 Fagus grandifolia 454 ESTs and unigenes to the poplar genome.
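
    To illustrate the general idea behind the spaced seeds mentioned in this abstract, the following is a minimal sketch and not sim4cc's implementation. A spaced seed is a pattern of "care" (1) and "don't care" (0) positions; two windows form a hit when they agree at every care position, so a hit can tolerate substitutions at the don't-care positions. The function names and the pattern "1101" are hypothetical choices made only for this example.

```python
from collections import defaultdict


def seed_key(seq, start, pattern):
    """Collect the characters of seq at the 'care' (1) positions of the pattern."""
    return "".join(seq[start + k] for k, c in enumerate(pattern) if c == "1")


def spaced_seed_hits(cdna, genome, pattern="1101"):
    """Return (cdna_pos, genome_pos) pairs whose windows agree at all care positions.

    The pattern is an illustrative assumption, not one of sim4cc's seeds.
    """
    span = len(pattern)
    # Index every genome window by its spaced-seed key.
    index = defaultdict(list)
    for g in range(len(genome) - span + 1):
        index[seed_key(genome, g, pattern)].append(g)
    # Look up every cDNA window in the index.
    hits = []
    for c in range(len(cdna) - span + 1):
        for g in index.get(seed_key(cdna, c, pattern), []):
            hits.append((c, g))
    return hits
```

    With the sequences "ACGT" and "ACTT", which differ at the third base, the spaced pattern "1101" reports a hit at (0, 0), whereas the contiguous pattern "1111" reports none. This tolerance to mismatches is the property that makes spaced seeds more sensitive for diverged sequences.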
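
    Parallel "Smith-Waterman" on a many-core architecture for searches in sequence databases (original title: "Smith-Waterman" paralelo en arquitectura de many-core para búsquedas en bases de datos de secuencias)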

    87 pp. Master's thesis supervised by Sergio Gálvez Rojas, with co-supervisors Oswaldo Trelles Salazar and Gabriel Dorado Pérez. This work develops an algorithm called MC64-S3W (MultiCore 64 – Sequence Search Smith-Waterman) that performs the local alignment of a query sequence against a database of large nucleic acid sequences (between 80 and 260 kilobases) on a many-core hardware architecture. The ability to perform large alignments (obtaining the optimal local alignment) on a many-core architecture is therefore one of the distinguishing features of this work. The thesis justifies the time savings achieved by parallelizing several simultaneous alignments and presents a comparative study with other widely cited parallel implementations, such as the CUDASW++ algorithm. A comparison with BLAST is also included. The work concludes with a review of the state of the art in comparing nucleic acid and peptide sequences to determine their degree of similarity, both from an algorithmic standpoint and from the standpoint of biological studies that reference alignments of large sequences.
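
    For readers unfamiliar with the local alignment that this thesis parallelizes, the following is a minimal serial sketch of Smith-Waterman scoring with a linear gap penalty. It is not MC64-S3W: the function names, the scoring values (match 2, mismatch -1, gap -2) and the sequential database loop are assumptions made for illustration only, and the abstract does not state the scoring scheme actually used.

```python
def smith_waterman_score(query, target, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score and the cell (i, j) where it ends.

    Scoring values are illustrative assumptions, not those of MC64-S3W.
    """
    prev = [0] * (len(target) + 1)  # previous row of the scoring matrix
    best, best_pos = 0, (0, 0)
    for i, q in enumerate(query, start=1):
        curr = [0] * (len(target) + 1)
        for j, t in enumerate(target, start=1):
            diag = prev[j - 1] + (match if q == t else mismatch)
            up = prev[j] + gap        # gap in the target
            left = curr[j - 1] + gap  # gap in the query
            # Clamping at 0 lets an alignment restart anywhere (local alignment).
            curr[j] = max(0, diag, up, left)
            if curr[j] > best:
                best, best_pos = curr[j], (i, j)
        prev = curr
    return best, best_pos


def search_database(query, database):
    """Score the query against every database entry and sort best first."""
    hits = [(smith_waterman_score(query, seq)[0], name)
            for name, seq in database.items()]
    return sorted(hits, reverse=True)
```

    Because each database entry is scored independently, the loop in `search_database` is the natural place to distribute work across cores, which is the kind of parallelism over simultaneous alignments that the abstract describes.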

    Modeling and comparison of gene structure (original title: Modélisation et comparaison de la structure de gènes)

    Bioinformatics is a multidisciplinary research field at the crossroads of several disciplines: biology, medicine, mathematics, statistics, chemistry, physics and computer science. It aims to design and apply statistical and computational models and tools to advance knowledge in biology and related sciences. In this context, understanding the function and evolution of genes is the subject of many bioinformatics studies. These studies are mostly based on gene comparison, and in particular on the alignment of genomic sequences. However, when computing genomic sequence alignments, existing methods rely solely on sequence similarity and do not take gene structure into account. An alignment that accounts for the structure of the sequences offers the opportunity to improve its accuracy, as well as the results of methods built on such alignments. This hypothesis frames the objective of this doctoral thesis: to propose models that account for gene structure when aligning the sequences of gene families. Through this thesis, we have contributed to scientific knowledge by developing biological sequence alignment models that integrate information on the coding and splicing structure of the sequences. We proposed an algorithm and a new scoring function for aligning DNA coding sequences (CDS) that take the length of translation frameshifts into account. We also proposed an algorithm for aligning pairs of sequences from a gene family while considering their splicing structures. We further developed an algorithm for assembling pairwise spliced alignments into multiple sequence alignments. Finally, we developed a tool for visualizing multiple spliced alignments of gene families. In this thesis, we have highlighted the importance, and demonstrated the usefulness, of taking the structure of the input sequences into account when computing their alignment.

    A tool for aligning very similar DNA sequences

    No full text