Canonical, Stable, General Mapping using Context Schemes
Motivation: Sequence mapping is the cornerstone of modern genomics. However,
most existing sequence mapping algorithms are insufficiently general.
Results: We introduce context schemes: a method that allows the unambiguous
recognition of a reference base in a query sequence by testing the query for
substrings from an algorithmically defined set. Context schemes only map when
there is a unique best mapping, and define this criterion uniformly for all
reference bases. Mappings under context schemes can also be made stable, so
that extension of the query string (e.g. by increasing read length) will not
alter the mapping of previously mapped positions. Context schemes are general
in several senses. They natively support the detection of arbitrary complex,
novel rearrangements relative to the reference. They can scale over orders of
magnitude in query sequence length. Finally, they are trivially extensible to
more complex reference structures, such as graphs, that incorporate additional
variation. We demonstrate empirically the existence of high performance context
schemes, and present efficient context scheme mapping algorithms.
Availability and Implementation: The software test framework created for this
work is available from
https://registry.hub.docker.com/u/adamnovak/sequence-graphs/.
Contact: [email protected]
Supplementary Information: Six supplementary figures and one supplementary
section are available with the online version of this article.
Comment: Submission for Bioinformatic
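The uniqueness criterion described in the abstract above (a position is mapped only when there is a single best mapping) can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes, for illustration only, that the "context" of a base is any fixed-length substring of length `k` covering it, and that a query position is mapped only when every uniquely occurring substring covering it agrees on the same reference position. The function name `unique_context_map` and the parameter `k` are hypothetical.

```python
def unique_context_map(reference, query, k):
    """Toy mapping: query index -> reference index, or None if unmapped.

    Assumption (not from the paper): contexts are exact k-mers.
    """
    # Index every k-mer of the reference by its start positions.
    index = {}
    for i in range(len(reference) - k + 1):
        index.setdefault(reference[i:i + k], []).append(i)

    mapping = [None] * len(query)
    conflict = [False] * len(query)
    for j in range(len(query) - k + 1):
        hits = index.get(query[j:j + k], [])
        if len(hits) != 1:
            continue  # absent or ambiguous context: gives no evidence
        start = hits[0]
        for off in range(k):
            q, target = j + off, start + off
            if mapping[q] is None and not conflict[q]:
                mapping[q] = target
            elif mapping[q] != target:
                # Two unique contexts disagree: leave the position unmapped.
                mapping[q] = None
                conflict[q] = True
    return mapping
```

For example, `unique_context_map("ACGTTGCA", "CGTT", 3)` maps all four query positions, while a query drawn from a repetitive reference such as `"AAAA"` stays unmapped because no context is unique.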
Detecting Coevolution in and among Protein Domains
Correlated changes of nucleic or amino acids have provided strong information about the structures and interactions of molecules. Despite the rich literature in coevolutionary sequence analysis, previous methods often have to trade off between generality, simplicity, phylogenetic information, and specific knowledge about interactions. Furthermore, despite the evidence of coevolution in selected protein families, a comprehensive screening of coevolution among all protein domains is still lacking. We propose an augmented continuous-time Markov process model for sequence coevolution. The model can handle different types of interactions, incorporates phylogenetic information and sequence substitution, has only one extra free parameter, and requires no knowledge about interaction rules. We apply this model to large-scale screenings on the entire protein domain database (Pfam). Strikingly, with 0.1 trillion tests executed, the majority of the inferred coevolving protein domains are functionally related, and the coevolving amino acid residues are spatially coupled. Moreover, many of the coevolving positions are located at functionally important sites of proteins and protein complexes, such as the subunit linkers of superoxide dismutase, the tRNA binding sites of ribosomes, the DNA binding region of RNA polymerase, and the active and ligand binding sites of various enzymes. The results suggest that sequence coevolution manifests structural and functional constraints of proteins. The intricate relations between sequence coevolution and various selective constraints are worth pursuing at a deeper level.
CGHub: Kick-starting the Worldwide Genome Web
The University of California, Santa Cruz (UCSC) is under contract with the National Cancer Institute (NCI) to construct and operate the Cancer Genomics Hub (CGHub), a nation-scale library and user portal for cancer genomics data. This contract covers growth of the library to 5 petabytes. The NCI programs that feed into the library currently produce about 20 terabytes of data each month. We discuss the receiver-driven file transfer mechanism Annai GeneTorrent (GT) for use with the library. Annai GT uses multiple TCP streams from multiple computers at the library site to parallelize genome downloads. We review our performance experience with the new transfer mechanism and also explain additions to the transfer protocol to support the security required in handling patient cancer genomics data.
Very Special Languages and Representations of Recursively Enumerable Languages Via Computation Histories ; CU-CS-177-80
A method of encoding the computation histories of a wide class of machines is introduced and used to derive several representation theorems for the class of recursively enumerable languages. In particular, it is demonstrated that any recursively enumerable language K ⊂ Σ* can be represented as K = ΦΣ(R ∩ D1 ⧢ D2), where D1 and D2 are fixed semi-Dyck languages, ⧢ is the shuffle operation, R is a regular language depending on K, and ΦΣ is a weak identity homomorphism. This result is the natural analog for the recursively enumerable languages of the Chomsky-Schützenberger representation of the context-free languages.
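As a small illustration of the shuffle of two semi-Dyck languages mentioned above, the sketch below tests membership under two assumptions that are not stated in the abstract: each semi-Dyck language is the set of balanced strings over a single bracket pair, and the two pairs use disjoint symbols. Under those assumptions a word lies in the shuffle exactly when its projection onto each bracket pair is balanced. The function names and the default bracket symbols are hypothetical.

```python
def is_semi_dyck(word, open_sym, close_sym):
    """True if word is a balanced string over one bracket pair."""
    depth = 0
    for sym in word:
        if sym == open_sym:
            depth += 1
        elif sym == close_sym:
            depth -= 1
            if depth < 0:
                return False  # a close bracket with no matching open
        else:
            return False  # symbol outside this bracket alphabet
    return depth == 0


def in_shuffle_of_dycks(word, pair1=("(", ")"), pair2=("[", "]")):
    """Membership in the shuffle of two one-pair semi-Dyck languages.

    Assumes the two bracket alphabets are disjoint.
    """
    if set(pair1) & set(pair2):
        raise ValueError("bracket alphabets must be disjoint")
    proj1 = [s for s in word if s in pair1]
    proj2 = [s for s in word if s in pair2]
    if len(proj1) + len(proj2) != len(word):
        return False  # word contains a symbol from neither alphabet
    return is_semi_dyck(proj1, *pair1) and is_semi_dyck(proj2, *pair2)
```

Note that `"([)]"` is accepted: it is an interleaving of `"()"` and `"[]"` even though it is not properly nested as a two-pair bracket string, which is what distinguishes the shuffle from a Dyck language over both pairs.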