Search CORE

635 research outputs found

Canonical, Stable, General Mapping using Context Schemes

Author: Haussler David
Novak Adam
Paten Benedict
Rosen Yohei
Publication venue: 'Oxford University Press (OUP)'
Publication date: 11/06/2015

Motivation: Sequence mapping is the cornerstone of modern genomics. However, most existing sequence mapping algorithms are insufficiently general. Results: We introduce context schemes: a method that allows the unambiguous recognition of a reference base in a query sequence by testing the query for substrings from an algorithmically defined set. Context schemes only map when there is a unique best mapping, and define this criterion uniformly for all reference bases. Mappings under context schemes can also be made stable, so that extension of the query string (e.g. by increasing read length) will not alter the mapping of previously mapped positions. Context schemes are general in several senses. They natively support the detection of arbitrarily complex, novel rearrangements relative to the reference. They can scale over orders of magnitude in query sequence length. Finally, they are trivially extensible to more complex reference structures, such as graphs, that incorporate additional variation. We demonstrate empirically the existence of high-performance context schemes, and present efficient context scheme mapping algorithms. Availability and Implementation: The software test framework created for this work is available from https://registry.hub.docker.com/u/adamnovak/sequence-graphs/. Contact: [email protected] Supplementary Information: Six supplementary figures and one supplementary section are available with the online version of this article. Comment: Submission for Bioinformatic
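
The idea of mapping a query base only when substring evidence points to a single reference base can be illustrated with a toy example. The sketch below is not the authors' algorithm or their context scheme definition; it assumes a deliberately simple rule (a query substring of at least `min_len` characters counts as a context only if it occurs exactly once in the reference), and the names `occurrences`, `map_query`, and `min_len` are hypothetical.

```python
from collections import defaultdict
from typing import Optional


def occurrences(reference: str, pattern: str) -> list:
    """Return every start index at which pattern occurs in reference."""
    starts = []
    i = reference.find(pattern)
    while i != -1:
        starts.append(i)
        i = reference.find(pattern, i + 1)
    return starts


def map_query(reference: str, query: str, min_len: int = 3) -> list:
    """Map each query position to a reference position, or None.

    Assumed toy rule: a query substring is usable evidence only if it
    occurs exactly once in the reference. A query position is mapped
    only when all usable substrings covering it agree on one target.
    """
    candidates = defaultdict(set)
    n = len(query)
    for start in range(n):
        for end in range(start + min_len, n + 1):
            hits = occurrences(reference, query[start:end])
            if not hits:
                break  # longer substrings from this start cannot match either
            if len(hits) == 1:
                offset = hits[0] - start
                for q in range(start, end):
                    candidates[q].add(q + offset)
    result: list = []
    for q in range(n):
        targets = candidates.get(q, set())
        mapped: Optional[int] = next(iter(targets)) if len(targets) == 1 else None
        result.append(mapped)
    return result


reference = "ACGTTGCAAC"
print(map_query(reference, "GTTGC"))  # every base has unique supporting substrings
print(map_query(reference, "GTTGA"))  # the final, mismatching base stays unmapped
print(map_query("ACGACG", "ACG"))     # ambiguous in the reference, so nothing maps
```

This brute-force version is quadratic in the query length and is meant only to show the "map only when unambiguous" criterion; it makes no claim about stability under query extension or about the efficient algorithms the abstract mentions.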

arXiv.org e-Print Archive

Crossref

PubMed Central

eScholarship - University of California

Model Completeness of an Algebra of Languages ; CU-CS-178-80

Author: Haussler David
Publication venue: CU Scholar
Publication date: 01/03/1980

CU Scholar Institutional Repository

Detecting Coevolution in and among Protein Domains

Author: Andrey Rzhetsky
Chen-Hsiang Yeang
David Haussler
Publication venue: Public Library of Science
Publication date: 01/01/2007

Correlated changes of nucleic or amino acids have provided strong information about the structures and interactions of molecules. Despite the rich literature in coevolutionary sequence analysis, previous methods often have to trade off between generality, simplicity, phylogenetic information, and specific knowledge about interactions. Furthermore, despite the evidence of coevolution in selected protein families, a comprehensive screening of coevolution among all protein domains is still lacking. We propose an augmented continuous-time Markov process model for sequence coevolution. The model can handle different types of interactions, incorporate phylogenetic information and sequence substitution, has only one extra free parameter, and requires no knowledge about interaction rules. We apply this model to large-scale screenings on the entire protein domain database (Pfam). Strikingly, with 0.1 trillion tests executed, the majority of the inferred coevolving protein domains are functionally related, and the coevolving amino acid residues are spatially coupled. Moreover, many of the coevolving positions are located at functionally important sites of proteins/protein complexes, such as the subunit linkers of superoxide dismutase, the tRNA binding sites of ribosomes, the DNA binding region of RNA polymerase, and the active and ligand binding sites of various enzymes. The results suggest that sequence coevolution manifests structural and functional constraints of proteins. The intricate relations between sequence coevolution and various selective constraints are worth pursuing at a deeper level.

CiteSeerX

Crossref

Directory of Open Access Journals

PubMed Central

CGHub: Kick-starting the Worldwide Genome Web

Author: Diekhans Mark
Haussler David
Maltbie Dan
Wilks Christopher
Publication venue: 'Proceedings of the Asia-Pacific Advanced Network'
Publication date: 10/06/2013

The University of California, Santa Cruz (UCSC) is under contract with the National Cancer Institute (NCI) to construct and operate the Cancer Genomics Hub (CGHub), a nation-scale library and user portal for cancer genomics data. This contract covers growth of the library to 5 petabytes. The NCI programs that feed into the library currently produce about 20 terabytes of data each month. We discuss the receiver-driven file transfer mechanism Annai GeneTorrent (GT) for use with the library. Annai GT uses multiple TCP streams from multiple computers at the library site to parallelize genome downloads. We review our performance experience with the new transfer mechanism and also explain additions to the transfer protocol to support the security required in handling patient cancer genomics data.
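
The general idea of splitting one download across several parallel streams can be sketched as follows. This is not GeneTorrent's protocol or API: it is a minimal stand-in that assumes an in-memory `bytes` object in place of the remote file and worker threads in place of TCP streams, and the names `split_ranges`, `fetch_range`, and `parallel_download` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_ranges(total: int, parts: int) -> list:
    """Split [0, total) into at most `parts` contiguous (start, end) ranges."""
    if total == 0:
        return []
    size = -(-total // parts)  # ceiling division
    return [(s, min(s + size, total)) for s in range(0, total, size)]


def fetch_range(source: bytes, start: int, end: int) -> bytes:
    """Stand-in for one stream fetching one byte range (assumption: a real
    client would read this range over its own network connection)."""
    return source[start:end]


def parallel_download(source: bytes, streams: int = 4) -> bytes:
    """Fetch all ranges concurrently and reassemble them in order."""
    ranges = split_ranges(len(source), streams)
    with ThreadPoolExecutor(max_workers=streams) as pool:
        # pool.map yields results in input order, so chunks join correctly
        chunks = pool.map(lambda r: fetch_range(source, r[0], r[1]), ranges)
        return b"".join(chunks)


data = bytes(range(256)) * 10
assert parallel_download(data, streams=4) == data
```

The receiver decides how the file is divided and which ranges to request, which is the sense in which such a scheme is receiver-driven; the security additions mentioned in the abstract are outside the scope of this sketch.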

Proceedings of the Asia-Pacific Advanced Network

Very Special Languages and Representations of Recursively Enumerable Languages Via Computation Histories ; CU-CS-177-80

Author: Haussler David
Zeiger Paul
Publication venue: CU Scholar
Publication date: 01/04/1980

A method of encoding the computation histories of a wide class of machines is introduced and used to derive several representation theorems for the class of recursively enumerable languages. In particular, it is demonstrated that any recursively enumerable language $K \subset \Sigma^*$ can be represented as $K = \Phi_\Sigma(R \cap D_1 \shuffle D_2)$, where $D_1$ and $D_2$ are fixed semi-Dyck languages, $\shuffle$ is the shuffle operation, $R$ is a regular language depending on $K$, and $\Phi_\Sigma$ is a weak identity homomorphism. This result is the natural analog for the recursively enumerable languages of the Chomsky-Schützenberger representation of the context-free languages.
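
To make the shuffle of two semi-Dyck languages concrete, the sketch below tests membership for one illustrative choice. It assumes, beyond what the abstract states, that the two languages are the balanced-bracket languages over the disjoint alphabets `()` and `[]`. Under that assumption a word lies in the shuffle exactly when its projection onto each alphabet is balanced. The function names are hypothetical.

```python
def in_semi_dyck(word: str, open_sym: str, close_sym: str) -> bool:
    """Check that word is a balanced string over one bracket pair."""
    depth = 0
    for ch in word:
        if ch == open_sym:
            depth += 1
        elif ch == close_sym:
            depth -= 1
            if depth < 0:
                return False  # a closer appeared before its opener
        else:
            return False  # symbol outside this alphabet
    return depth == 0


def in_shuffle_of_dycks(word: str) -> bool:
    """Membership in the shuffle of the '()' and '[]' bracket languages.

    Assumption: the alphabets are disjoint, so a word is an interleaving
    of one word from each language iff both projections are balanced.
    """
    first = "".join(ch for ch in word if ch in "()")
    second = "".join(ch for ch in word if ch in "[]")
    if len(first) + len(second) != len(word):
        return False  # word contains a symbol from neither alphabet
    return in_semi_dyck(first, "(", ")") and in_semi_dyck(second, "[", "]")


print(in_shuffle_of_dycks("([)]"))  # interleaved but each projection is balanced
print(in_shuffle_of_dycks("(]"))    # neither projection is balanced
```

The word `([)]` shows why the shuffle is more permissive than a single two-bracket Dyck language: the two bracket types may cross, since each is only required to be balanced on its own.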

CU Scholar Institutional Repository

Elsevier - Publisher Connector