An efficient algorithm to locate all locally optimal alignments between two sequences allowing for gaps

Author: Barton Geoffrey J.
Publication venue: 'Oxford University Press (OUP)'
Publication date: 01/01/1993

This paper appeared in Computer Applications in the Biosciences (1993), 9. An efficient algorithm is described to locate locally optimal alignments between two sequences allowing for insertions and deletions. The algorithm is based on that of Smith and Waterman (J. Mol. Biol., 147, 195–197, 1981), which returns the single best local alignment. However, the algorithm described here permits all non-intersecting locally optimal alignments to be determined in a single pass through the comparison matrix. The algorithm simplifies the location of repeats, multiple domains and shuffled motifs and is fast enough to be used on a conventional workstation to scan large sequence databanks.
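For orientation, the Smith and Waterman recurrence that the abstract builds on can be sketched as follows. This is only the classic single-best-score version with a linear gap penalty, not the all-alignments algorithm the paper describes; the function name and the scoring values (match 2, mismatch -1, gap -2) are assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best local alignment score, (i, j) cell where it ends).

    Scoring values are illustrative assumptions, not taken from the paper.
    """
    prev = [0] * (len(b) + 1)          # previous row of the comparison matrix
    best, best_pos = 0, (0, 0)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero, so an
            # alignment may start anywhere in either sequence.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            if curr[j] > best:
                best, best_pos = curr[j], (i, j)
        prev = curr
    return best, best_pos


if __name__ == "__main__":
    print(smith_waterman_score("TTACGTT", "GGACGGG"))
```

Only two rows of the matrix are kept in memory, which is enough to report the best score and its end cell; recovering the alignment itself would need a traceback over the full matrix.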

CiteSeerX

Crossref

University of Dundee Online Publications

In silico identification of candidate MECP2 targets and quantitative analysis in rett syndrome

Author: Onat Onur Emre
Publication venue: Bilkent University
Publication date: 01/01/2006

Rett syndrome (RTT) is an X-linked neuro-developmental disorder seen exclusively in girls during childhood. It is one of the most common causes of mental retardation, with an incidence rate of 1/10,000-1/15,000. Mutations in the MECP2 gene have been described as a common cause of RTT. MECP2 is a transcriptional repressor that regulates gene expression. It is not fully understood which MECP2 targets are affected in RTT and therefore contribute to disease pathogenesis. Researchers have approached the problem from two directions: (a) global expression profile analysis and (b) candidate gene analysis. Global expression profile analysis revealed that a limited number of genes, including those on the X-chromosome, are de-regulated. Candidate gene analysis studies showed that loss of imprinting, as exemplified by DLX5, could also contribute to disease pathogenesis. We hypothesize that X-chromosome inactivation (XCI) is an important physiological epigenetic mechanism that could be involved in Rett pathogenesis. We predicted a MECP2 binding motif by a distinctive bioinformatic approach. Using this algorithm, we searched for candidate MECP2 target genes on the X-chromosome and in the whole genome. The genes FHL1 and MPP1, whose interactions with MECP2 were heuristically displayed, were predicted by our algorithm. We identified more than 100 genes on the X-chromosome. Ten genes from the list were selected according to their MECP2 binding homology score and X-inactivation status. To test this hypothesis, we analyzed these genes with quantitative RT-PCR. We expect to identify the key genes that potentially contribute to RTT pathogenesis via disturbances in X-chromosome inactivation. Onat, Onur Emre, Ph.D.

Bilkent University Institutional Repository

Efficient homology search for genomic sequence databases

Author: Cameron M
Publication venue: RMIT University
Publication date: 01/01/2006

Genomic search tools can provide valuable insights into the chemical structure, evolutionary origin and biochemical function of genetic material. A homology search algorithm compares a protein or nucleotide query sequence to each entry in a large sequence database and reports alignments with highly similar sequences. The exponential growth of public data banks such as GenBank has necessitated the development of fast, heuristic approaches to homology search. The versatile and popular blast algorithm, developed by researchers at the US National Center for Biotechnology Information (NCBI), uses a four-stage heuristic approach to efficiently search large collections for analogous sequences while retaining a high degree of accuracy. Despite an abundance of alternative approaches to homology search, blast remains the only method to offer fast, sensitive search of large genomic collections on modern desktop hardware. As a result, the tool has found widespread use with millions of queries posed each day. A significant investment of computing resources is required to process this large volume of genomic searches and a cluster of over 200 workstations is employed by the NCBI to handle queries posed through the organisation's website. As the growth of sequence databases continues to outpace improvements in modern hardware, blast searches are becoming slower each year and novel, faster methods for sequence comparison are required. In this thesis we propose new techniques for fast yet accurate homology search that result in significantly faster blast searches. First, we describe improvements to the final, gapped alignment stages where the query and sequences from the collection are aligned to provide a fine-grain measure of similarity. We describe three new methods for aligning sequences that roughly halve the time required to perform this computationally expensive stage. 
Next, we investigate improvements to the first stage of search, where short regions of similarity between a pair of sequences are identified. We propose a novel deterministic finite automaton data structure that is significantly smaller than the codeword lookup table employed by ncbi-blast, resulting in improved cache performance and faster search times. We also discuss fast methods for nucleotide sequence comparison. We describe novel approaches for processing sequences that are compressed using the byte packed format already utilised by blast, where four nucleotide bases from a strand of DNA are stored in a single byte. Rather than decompress sequences to perform pairwise comparisons, our innovations permit sequences to be processed in their compressed form, four bases at a time. Our techniques roughly halve average query evaluation times for nucleotide searches with no effect on the sensitivity of blast. Finally, we present a new scheme for managing the high degree of redundancy that is prevalent in genomic collections. Near-duplicate entries in sequence data banks are highly detrimental to retrieval performance; however, existing methods for managing redundancy are both slow, requiring almost ten hours to process the GenBank database, and crude, because they simply purge highly similar sequences to reduce the level of internal redundancy. We describe a new approach for identifying near-duplicate entries that is roughly six times faster than the most successful existing approaches, and a novel approach to managing redundancy that reduces collection size and search times but still provides accurate and comprehensive search results. Our improvements to blast have been integrated into our own version of the tool. We find that our innovations more than halve average search times for nucleotide and protein searches, and have no significant effect on search accuracy.
Given the enormous popularity of blast, this represents a very significant advance in computational methods to aid life science research.
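
The byte-packed idea mentioned in the abstract, four nucleotide bases per byte, can be illustrated with a minimal sketch. The 2-bit codes (A=0, C=1, G=2, T=3), the zero padding of a short final byte, and all function names below are assumptions for illustration; this is not the thesis's method or blast's actual implementation. It only shows why a single byte comparison can test four bases at once.

```python
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}   # assumed 2-bit encoding
BASES = "ACGT"


def pack(seq):
    """Pack a DNA string into bytes, four bases per byte (first base in the high bits)."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODES[base]
        byte <<= 2 * (4 - len(chunk))      # zero-pad a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(packed, length):
    """Recover the first `length` bases from packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BASES[(byte >> shift) & 3])
    return "".join(bases[:length])


def count_equal_bytes(p, q):
    """Count aligned byte positions where all four packed bases agree."""
    return sum(x == y for x, y in zip(p, q))


if __name__ == "__main__":
    p, q = pack("ACGTACGT"), pack("ACGTTTTT")
    print(count_equal_bytes(p, q))
```

Each equal byte confirms four matching bases with one comparison, without unpacking either sequence. Because the final byte is zero-padded, the original sequence length has to be stored alongside the packed data.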

RMIT Research Repository