Assembly programs align nucleotide sequences to each other based on sequence similarity. Because each assembly algorithm relies on thresholds to decide which sequences are similar enough to align, every algorithm will inevitably join some sequences incorrectly and fail to join others that belong together. An algorithm that performs well on one dataset might fail badly on another. Assembly algorithms are being challenged by increasingly diverse biological questions, including EST clustering, genotyping, and comparative genomics, and by problems inherent to certain datasets, such as repetitive DNA.

We are re-engineering Phrap to improve its performance and utility by optimizing the core algorithms and developing a framework to store, manipulate, and view sequence data. XML-formatted hints and constraints will instruct the core alignment program on how parts of the data, or the dataset as a whole, can be handled in individualized ways. We have re-engineered Phrap so that alignment can incorporate information about mate pairs: reads sequenced from the same template, which therefore have a known order and orientation with respect to each other. We are also using mate pair information to create larger scaffold structures with known gap sizes between contigs.

2 Mate pairs

Characteristics of mate pair reads:
- reads sequenced from the same template DNA
- known order and orientation (facing in, facing out, or facing the same direction) between reads
- known range of separation between read 5' ends

[Figure labels: template; mate pairs]

Mate pair distance and orientation information also allows scaffolds to be built.
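The mate pair characteristics above can be illustrated with a minimal sketch that checks whether two placed reads respect an expected orientation and separation range, and that estimates the gap between two contigs linked by a facing-in pair. The names `Read`, `orientation`, `satisfies_constraint`, and `estimate_gap`, as well as the coordinate convention, are assumptions made for illustration; they are not taken from Phrap.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Read:
    """A placed read (hypothetical representation, not Phrap's)."""
    five_prime: int  # coordinate of the read's 5' end
    strand: int      # +1: extends toward higher coordinates, -1: toward lower


def orientation(a: Read, b: Read) -> str:
    """Classify a pair as 'in', 'out', or 'same' (facing the same direction)."""
    if a.strand == b.strand:
        return "same"
    left = min(a, b, key=lambda r: r.five_prime)
    # Facing in: the leftmost read points right, so its mate points back at it.
    return "in" if left.strand == 1 else "out"


def satisfies_constraint(a: Read, b: Read, expected: str,
                         min_sep: int, max_sep: int) -> bool:
    """True if orientation matches and the 5'-to-5' distance is in range."""
    separation = abs(a.five_prime - b.five_prime)
    return orientation(a, b) == expected and min_sep <= separation <= max_sep


def estimate_gap(contig1_len: int, a_pos: int, b_pos: int,
                 separation: int) -> int:
    """Estimate the gap between two contigs linked by a facing-in pair.

    Assumes read a lies in contig 1 with its 5' end at a_pos, read b lies
    in contig 2 with its 5' end at b_pos, and `separation` is the expected
    5'-to-5' distance on the template.
    """
    covered_in_contig1 = contig1_len - a_pos  # from a's 5' end to contig end
    covered_in_contig2 = b_pos                # from contig start to b's 5' end
    return separation - covered_in_contig1 - covered_in_contig2


if __name__ == "__main__":
    a, b = Read(100, 1), Read(2100, -1)
    print(orientation(a, b))
    print(satisfies_constraint(a, b, "in", 1500, 2500))
    print(estimate_gap(contig1_len=5000, a_pos=4200, b_pos=700, separation=2000))
```

In practice the separation is known only as a range, so a gap estimate of this kind would carry the same uncertainty; a negative value would suggest that the two contigs overlap.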