187 research outputs found

    Sequence Assembly

    We describe an efficient method for assembling short reads into long sequences. In this method, a hashing technique is used to compute overlaps between short reads, allowing base mismatches in the overlaps. Then an overlap graph is constructed, with each vertex representing a read and each edge representing an overlap. The overlap graph is explored by graph algorithms to find unique paths of reads representing contigs. The consensus sequence of each contig is constructed by computing alignments of multiple reads without gaps. This strategy has been implemented as a short read assembly program called PCAP.Solexa. We also describe how to use PCAP.Solexa in assembly of short reads
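
    The overlap step described above can be illustrated with a small sketch. It is not PCAP.Solexa's implementation: the function name `find_overlaps`, its parameters, and the choice of an exact k-mer seed at the start of each overlap are assumptions made for illustration. Reads are indexed by their leading k-mer in a hash table, and each candidate suffix-prefix overlap is then checked without gaps, tolerating a bounded number of base mismatches.

```python
from collections import defaultdict


def find_overlaps(reads, k=4, min_len=6, max_mismatches=1):
    """Return {(i, j): overlap_length} where a suffix of reads[i]
    matches a prefix of reads[j] with at most max_mismatches mismatches.

    Illustrative assumption: the first k bases of the overlap must match
    exactly, because they serve as the hash seed.
    """
    # Hash table: leading k-mer -> indices of reads that start with it.
    prefix_index = defaultdict(list)
    for j, read in enumerate(reads):
        if len(read) >= k:
            prefix_index[read[:k]].append(j)

    overlaps = {}
    for i, read in enumerate(reads):
        # Scanning left to right reports the longest overlap per pair first.
        for start in range(len(read) - min_len + 1):
            seed = read[start:start + k]
            for j in prefix_index.get(seed, []):
                if j == i or (i, j) in overlaps:
                    continue
                suffix = read[start:]
                if len(suffix) > len(reads[j]):
                    continue  # suffix would run past the end of read j
                # Gap-free comparison of the suffix with the prefix of read j.
                mismatches = sum(a != b for a, b in zip(suffix, reads[j]))
                if mismatches <= max_mismatches:
                    overlaps[(i, j)] = len(suffix)
    return overlaps
```

    Each key `(i, j)` of the returned dictionary corresponds to a directed edge from read `i` to read `j` in an overlap graph of the kind the abstract describes; building contigs from that graph is not shown here.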

    Bioinformatics Support of Genome Sequencing Projects

    The genome of an organism is the book of life. It encodes the complete set of genetic instructions for the development of the organism. The structure of a genome is a linear sequence of nucleotides. Determination of the sequence of a genome lays the foundation for understanding biology at the molecular level. With the current biotechnology, it is a challenging task to determine the sequence of a genome. A sequencing machine can read the sequence of a piece of DNA for up to 1000 bp (base pairs). However, genomes are very large. For example, the genome of the bacterium E. coli is about 4 Mb (million base pairs) in size, the genome of the nematode C. elegans is 100 Mb in size, and the human genome is 3 Gb in size. The inability to produce long sequences by sequencing machines requires that long sequences be produced from short sequence reads. A shotgun sequencing strategy is widely used to determine the sequence of a long segment of DNA. In this strategy, multiple copies of the DNA segment are randomly cut into small pieces. The sequence of each piece is read by an automated sequencing machine. The sequence of the large DNA segment is reconstructed by a computer program from short sequence reads. The sequence assembly problem is to assemble short reads into long sequences. What makes the sequence assembly problem non-trivial is that there is no information about how short sequence reads are ordered with respect to the DNA segment

    Horizontal transfer generates genetic variation in an asexual pathogen

    There are major gaps in the understanding of how genetic variation is generated in the asexual pathogen Verticillium dahliae. On the one hand, V. dahliae is a haploid organism that reproduces clonally. On the other hand, single-nucleotide polymorphisms and chromosomal rearrangements were found between V. dahliae strains. Lineage-specific (LS) regions comprising about 5% of the genome are highly variable between V. dahliae strains. Nonetheless, it is unknown whether horizontal gene transfer plays a major role in generating genetic variation in V. dahliae. Here, we analyzed a previously sequenced V. dahliae population of nine strains from various geographical locations and hosts. We found highly homologous elements in LS regions of each strain; LS regions of V. dahliae strain JR2 are much richer in highly homologous elements than the core genome. In addition, we discovered, in LS regions of JR2, several structural forms of nonhomologous recombination, and two or three homologous sequence types of each form, with almost each sequence type present in an LS region of another strain. A large section of one of the forms is known to be horizontally transferred between V. dahliae strains. We unexpectedly found that 350 kilobases of dynamic LS regions were much more conserved than the core genome between V. dahliae and a closely related species (V. albo-atrum), suggesting that these LS regions were horizontally transferred recently. Our results support the view that genetic variation in LS regions is generated by horizontal transfer between strains, and by chromosomal reshuffling reported previously

    Association between Experiences and Representations: Memory, Dreaming, Dementia and Consciousness

    The mechanisms underlying major aspects of the human brain remain a mystery. It is unknown how verbal episodic memory is formed and integrated with sensory episodic memory. There is no consensus on the function and nature of dreaming. Here we present a theory for governing neural activity in the human brain. The theory describes the mechanisms for building memory traces for entities and explains how verbal memory is integrated with sensory memory. We infer that a core function of dreaming is to move charged particles such as calcium ions from the hippocampus to association areas to primary areas. We link a high level of calcium ion concentrations to Alzheimer's disease. We present a more precise definition of consciousness. Our results are a step forward in understanding the function and health of the human brain and provide the public with ways to keep a healthy brain

    Bio‐sequence comparison and applications

    The structure of a genome is a linear sequence of nucleotides that encodes genes and regulatory elements. Genes are homologous if they are related by divergence from a common ancestor (Attwood 2000). Homologous genes perform the same or similar functions. The sequences of homologous genes in related organisms are usually similar. For example, the sequences of homologous genes in humans and mice are 85 percent similar on average (Makalowski et al. 1996). If a new genomic DNA sequence is very similar to the sequence of a gene whose function is known, it is very likely that the genomic DNA sequence contains a gene and its function is similar to the function of the known gene. If a new genomic DNA sequence is highly similar to a cDNA sequence, then the genomic DNA sequence contains a gene and the structure of the gene can be found by aligning the two sequences. Thus methods for comparing sequences are very useful for understanding the structures and functions of genes in a genome. This chapter focuses on methods for comparing two sequences, which often serve as a basis for multiple sequence comparison methods, a topic for the next chapter
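
    As a concrete instance of comparing two sequences, the sketch below computes the score of an optimal global alignment by textbook dynamic programming (Needleman-Wunsch with a linear gap penalty). The abstract does not say which algorithms the chapter covers, so this is a generic illustration only; the function name and the scoring values (`match=1`, `mismatch=-1`, `gap=-2`) are assumptions.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of an optimal global alignment of strings a and b.

    Keeps only two rows of the dynamic-programming matrix, so memory
    grows with len(b) rather than with len(a) * len(b).
    """
    # Row 0: aligning the empty prefix of a against prefixes of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            # Substitution (match or mismatch) on the diagonal.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Gap in b (skip a[i-1]) or gap in a (skip b[j-1]).
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev[len(b)]
```

    For example, `global_alignment_score("ACGT", "AGT")` aligns three matching bases and opens one gap, giving 3 - 2 = 1 under the assumed scores.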

    Dynamic use of multiple parameter sets in sequence alignment

    The level of conservation between two homologous sequences often varies among sequence regions; functionally important domains are more conserved than the remaining regions. Thus, multiple parameter sets should be used in alignment of homologous sequences with a stringent parameter set for highly conserved regions and a moderate parameter set for weakly conserved regions. We describe an alignment algorithm to allow dynamic use of multiple parameter sets with different levels of stringency in computation of an optimal alignment of two sequences. The algorithm dynamically considers various candidate alignments, partitions each candidate alignment into sections, and determines the most appropriate set of parameter values for each section of the alignment. The algorithm and its local alignment version are implemented in a computer program named GAP4. The local alignment algorithm in GAP4, that in its predecessor GAP3, and an ordinary local alignment program SIM were evaluated on 257 716 pairs of homologous sequences from 100 protein families. On 168 475 of the 257 716 pairs (a rate of 65.4%), alignments from GAP4 were more statistically significant than alignments from GAP3 and SIM

    Methods for Comparing a DNA Sequence with a Protein Sequence

    We describe two methods for constructing an optimal global alignment of, and an optimal local alignment between, a DNA sequence and a protein sequence. The alignment model of the methods addresses the problems of frameshifts and introns in the DNA sequence. The methods require computer memory proportional to the sequence lengths, so they can rigorously process very long sequences. The simplified versions of the methods were implemented as computer programs named NAP and LAP. The experimental results demonstrate that the programs are sensitive and powerful tools for finding genes by DNA-protein sequence homology

    HPC: Hierarchical phylogeny construction

    Rapid improvements in DNA sequencing technology have resulted in long genome sequences for a large number of similar isolates with a wide range of single nucleotide polymorphism (SNP) rates, where some isolates can have thousands of times lower SNP rates than others. Genome sequences of this kind are a challenge to existing methods for construction of phylogenetic trees. We address the issues by developing a hierarchical approach to phylogeny construction. In this method, the construction is performed at multiple levels, where at each level, groups of isolates with similar levels of similarity are identified and their phylogenetic trees are constructed. Time savings are achieved by using a sufficiently large number of columns from the input alignment, instead of all its columns. Our results show that the new approach is 20-60 times more efficient than existing programs and more accurate in situations where highly similar isolates have a wide range of SNP rates
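
    The time-saving idea of using a subset of alignment columns can be sketched as follows. This covers only the column-sampling step, not the hierarchical grouping or tree construction, and it is not the HPC program's code: the function name `sampled_distances`, the uniform random sampling, and the use of a simple mismatch fraction as the distance are all assumptions made for illustration.

```python
import random


def sampled_distances(alignment, n_columns, seed=0):
    """Estimate pairwise distances from a random subset of alignment columns.

    alignment: dict mapping isolate name -> aligned sequence (equal lengths).
    Returns {(name_a, name_b): fraction of sampled columns that differ}.
    """
    if n_columns <= 0:
        raise ValueError("n_columns must be positive")
    names = sorted(alignment)
    length = len(alignment[names[0]])
    if any(len(alignment[name]) != length for name in names):
        raise ValueError("all aligned sequences must have the same length")

    # Draw the column subset once so every pair is measured on the same columns.
    rng = random.Random(seed)
    columns = rng.sample(range(length), min(n_columns, length))

    distances = {}
    for x in range(len(names)):
        for y in range(x + 1, len(names)):
            a, b = alignment[names[x]], alignment[names[y]]
            differing = sum(a[c] != b[c] for c in columns)
            distances[(names[x], names[y])] = differing / len(columns)
    return distances
```

    When `n_columns` is at least the alignment length, every column is used and the estimate equals the exact mismatch fraction; smaller values trade accuracy for speed.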

    MAP2: multiple alignment of syntenic genomic sequences

    We describe a multiple alignment program named MAP2 based on a generalized pairwise global alignment algorithm for handling long, different intergenic and intragenic regions in genomic sequences. The MAP2 program produces an ordered list of local multiple alignments of similar regions among sequences, where different regions between local alignments are indicated by reporting only similar regions. We propose two similarity measures for the evaluation of the performance of MAP2 and existing multiple alignment programs. Experimental results produced by MAP2 on four real sets of orthologous genomic sequences show that MAP2 rarely missed a block of transitively similar regions and that MAP2 never produced a block of regions that are not transitively similar. Experimental results by MAP2 on six simulated data sets show that MAP2 found the boundaries between similar and different regions precisely. This feature is useful for finding conserved functional elements in genomic sequences. The MAP2 program is freely available in source code form at for academic use

    A method for finding single-nucleotide polymorphisms with allele frequencies in sequences of deep coverage

    BACKGROUND: The allele frequencies of single-nucleotide polymorphisms (SNPs) are needed to select an optimal subset of common SNPs for use in association studies. Sequence-based methods for finding SNPs with allele frequencies may need to handle thousands of sequences from the same genome location (sequences of deep coverage). RESULTS: We describe a computational method for finding common SNPs with allele frequencies in single-pass sequences of deep coverage. The method enhances a widely used program named PolyBayes in several aspects. We present results from our method and PolyBayes on eighteen data sets of human expressed sequence tags (ESTs) with deep coverage. The results indicate that our method used almost all single-pass sequences in computation of the allele frequencies of SNPs. CONCLUSION: The new method is able to handle single-pass sequences of deep coverage efficiently. Our work shows that it is possible to analyze sequences of deep coverage by using pairwise alignments of the sequences with the finished genome sequence, instead of multiple sequence alignments
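
    The closing point, that allele frequencies can be derived from pairwise alignments against the finished genome sequence rather than from a multiple alignment, can be illustrated with a minimal sketch. It is not the published method: the function name `allele_frequencies`, the gap-free `(offset, read)` placement format, and the `min_minor_freq` threshold are assumptions, and the sketch does not model sequencing errors.

```python
from collections import Counter


def allele_frequencies(reference, placements, min_minor_freq=0.1):
    """Report candidate SNP positions with per-base allele frequencies.

    reference: finished genome sequence (used here only for coordinates).
    placements: list of (offset, read) pairs, each read placed gap-free
        on the reference starting at 0-based position offset.
    Returns {position: {base: frequency}} for positions where the second
    most common base reaches min_minor_freq.
    """
    # One base counter per reference position, filled read by read.
    counts = [Counter() for _ in reference]
    for offset, read in placements:
        for k, base in enumerate(read):
            pos = offset + k
            if 0 <= pos < len(reference):
                counts[pos][base] += 1

    candidates = {}
    for pos, counter in enumerate(counts):
        total = sum(counter.values())
        if total == 0:
            continue  # no read covers this position
        freqs = {base: n / total for base, n in counter.items()}
        ranked = sorted(freqs.values(), reverse=True)
        if len(ranked) >= 2 and ranked[1] >= min_minor_freq:
            candidates[pos] = freqs
    return candidates
```

    Because each read contributes to the counters independently, the cost grows linearly with the number of reads, which is what makes thousands of sequences per location tractable in this simplified setting.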