87 research outputs found

Compression of next-generation sequencing reads aided by highly efficient de novo assembly

Author: Jones Daniel C.
Katze Michael G.
Peng Xinxia
Ruzzo Walter L.
Publication date: 01/01/2012

We present Quip, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats. In addition to implementing reference-based compression, we have developed, to our knowledge, the first assembly-based compressor, using a novel de novo assembly algorithm. A probabilistic data structure is used to dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined with statistical compression of read identifiers, quality scores, alignment information, and sequences, effectively collapsing very large datasets to less than 15% of their original size with no loss of information. Availability: Quip is freely available under the BSD license from http://cs.washington.edu/homes/dcjones/quip
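The idea of storing read sequences as positions within assembled contigs can be illustrated with a minimal sketch. This is not Quip's implementation: it assumes a single contig, exact matches only, and hypothetical helper names (`encode_reads`, `decode_reads`); reads that do not occur in the contig are kept as literals so the round trip stays lossless.

```python
from typing import List, Tuple, Union

# An encoded read is either (offset, length) into the contig or the literal read.
Encoded = Union[Tuple[int, int], str]


def encode_reads(contig: str, reads: List[str]) -> List[Encoded]:
    """Replace each read that occurs exactly in the contig by (offset, length)."""
    encoded: List[Encoded] = []
    for read in reads:
        pos = contig.find(read)  # -1 when the read is not a substring
        encoded.append((pos, len(read)) if pos >= 0 else read)
    return encoded


def decode_reads(contig: str, encoded: List[Encoded]) -> List[str]:
    """Invert encode_reads: rebuild every read from the contig or the literal."""
    reads: List[str] = []
    for item in encoded:
        if isinstance(item, tuple):
            pos, length = item
            reads.append(contig[pos:pos + length])
        else:
            reads.append(item)
    return reads
```

A pair of small integers is typically cheaper to store than the read itself, which is where the saving comes from once the contig is shared by many reads.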

arXiv.org e-Print Archive

CiteSeerX

Transcripts with in silico predicted RNA structure are enriched everywhere in the mouse brain

Author: Jan Gorodkin
Michael J Hawrylycz
Stefan E Seemann
Susan M Sunkin
Walter L Ruzzo
Publication venue: Springer Nature
Publication date: 01/01/2012

BACKGROUND: Post-transcriptional control of gene expression is mostly conducted by specific elements in untranslated regions (UTRs) of mRNAs, in collaboration with specific binding proteins and RNAs. In several well characterized cases, these RNA elements are known to form stable secondary structures. RNA secondary structures also may have major functional implications for long noncoding RNAs (lncRNAs). Recent transcriptional data has indicated the importance of lncRNAs in brain development and function. However, no methodical efforts to investigate this have been undertaken. Here, we aim to systematically analyze the potential for RNA structure in brain-expressed transcripts. RESULTS: By comprehensive spatial expression analysis of the adult mouse in situ hybridization data of the Allen Mouse Brain Atlas, we show that transcripts (coding as well as non-coding) associated with in silico predicted structured probes are highly and significantly enriched in almost all analyzed brain regions. Functional implications of these RNA structures and their role in the brain are discussed in detail along with specific examples. We observe that mRNAs with a structure prediction in their UTRs are enriched for binding, transport and localization gene ontology categories. In addition, after manual examination we observe agreement between RNA binding protein interaction sites near the 3’ UTR structures and correlated expression patterns. CONCLUSIONS: Our results show a potential use for RNA structures in expressed coding as well as noncoding transcripts in the adult mouse brain, and describe the role of structured RNAs in the context of intracellular signaling pathways and regulatory networks. Based on this data we hypothesize that RNA structure is widely involved in transcriptional and translational regulatory mechanisms in the brain and ultimately plays a role in brain function.

Springer - Publisher Connector

PubMed Central

A Marfan syndrome gene expression phenotype in cultured skin fibroblasts

Author: Emond Mary
Francke Uta
Jaeger Jochen C
Milewicz Dianna M
Morale Cecile Z
Mulvihill Eileen R
Ruzzo Walter L
Schwartz Stephen M
Yao Zizhen
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Marfan syndrome (MFS) is a heritable connective tissue disorder caused by mutations in the fibrillin-1 gene. This syndrome constitutes a significant identifiable subtype of aortic aneurysmal disease, accounting for over 5% of ascending and thoracic aortic aneurysms. RESULTS: We used spotted membrane DNA macroarrays to identify genes whose altered expression levels may contribute to the phenotype of the disease. Our analysis of 4132 genes identified a subset with significant expression differences between skin fibroblast cultures from unaffected controls versus cultures from affected individuals with known fibrillin-1 mutations. Subsequently, 10 genes were chosen for validation by quantitative RT-PCR. CONCLUSION: Differential expression of many of the validated genes was associated with MFS samples when an additional group of unaffected and MFS affected subjects were analyzed (p-value < 3 × 10⁻⁶ under the null hypothesis that expression levels in cultured fibroblasts are unaffected by MFS status). An unexpected observation was the range of individual gene expression. In unaffected control subjects, expression ranges exceeding 10 fold were seen in many of the genes selected for qRT-PCR validation. The variation in expression in the MFS affected subjects was even greater.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The identification and functional annotation of RNA structures conserved in vertebrates

Author: Bang-Berthelsen Claus H
Christensen-Dalsgaard Mikkel
Garde Christian
Gorodkin Jan
Hansen Claus
Mirza Aashiq Hussain
Nielsen Henrik
Pociot Flemming
Ruzzo Walter L.
Seemann Ernst Stefan
Tommerup Niels
Torarinsson Elfar
Workman Christopher T
Yao Zizhen
Publication venue: Cold Spring Harbor Laboratory
Publication date: 01/01/2017

Copenhagen University Research Information System

Online Research Database In Technology

A Regression-based K nearest neighbor algorithm for gene function prediction from heterogeneous data

Author: Walter L Ruzzo
Zizhen Yao
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: As a variety of functional genomic and proteomic techniques become available, there is an increasing need for functional analysis methodologies that integrate heterogeneous data sources. METHODS: In this paper, we address this issue by proposing a general framework for gene function prediction based on the k-nearest-neighbor (KNN) algorithm. The choice of KNN is motivated by its simplicity, flexibility to incorporate different data types and adaptability to irregular feature spaces. A weakness of traditional KNN methods, especially when handling heterogeneous data, is that performance is subject to the often ad hoc choice of similarity metric. To address this weakness, we apply regression methods to infer a similarity metric as a weighted combination of a set of base similarity measures, which helps to locate the neighbors that are most likely to be in the same class as the target gene. We also suggest a novel voting scheme to generate confidence scores that estimate the accuracy of predictions. The method gracefully extends to multi-way classification problems. RESULTS: We apply this technique to gene function prediction according to three well-known Escherichia coli classification schemes suggested by biologists, using information derived from microarray and genome sequencing data. We demonstrate that our algorithm dramatically outperforms the naive KNN methods and is competitive with support vector machine (SVM) algorithms for integrating heterogeneous data. We also show that by combining different data sources, prediction accuracy can improve significantly. CONCLUSION: Our extension of KNN with automatic feature weighting, multi-class prediction, and probabilistic inference enhances prediction accuracy significantly while remaining efficient, intuitive and flexible. This general framework can also be applied to similar classification problems involving heterogeneous datasets.
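The core idea of ranking neighbors by a weighted combination of base similarity measures can be sketched as follows. This is an illustrative sketch, not the paper's method: the weights are passed in as given (the abstract says they are inferred by regression, which is omitted here), the confidence score is simply the winning vote fraction rather than the paper's voting scheme, and all names (`combined_similarity`, `knn_predict`) are hypothetical. The training set is assumed to be non-empty.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Similarity = Callable[[Vector, Vector], float]


def combined_similarity(x: Vector, y: Vector,
                        measures: List[Similarity],
                        weights: List[float]) -> float:
    """Weighted sum of base similarity measures (weights assumed given)."""
    return sum(w * m(x, y) for m, w in zip(measures, weights))


def knn_predict(target: Vector,
                train: List[Tuple[Vector, str]],
                measures: List[Similarity],
                weights: List[float],
                k: int = 3) -> Tuple[str, float]:
    """Return (majority label, vote fraction) among the k most similar items."""
    # Rank training items from most to least similar to the target.
    ranked = sorted(
        train,
        key=lambda item: combined_similarity(target, item[0], measures, weights),
        reverse=True,
    )
    neighbors = ranked[:k]
    votes = Counter(label for _, label in neighbors)
    label, count = votes.most_common(1)[0]
    return label, count / len(neighbors)
```

Each base measure can come from a different data source (for example expression profiles versus sequence-derived features), so the weights control how much each source influences which neighbors are chosen.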

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

How accurately is ncRNA aligned within whole-genome multiple alignments?

Author: Adrienne X Wang
Martin Tompa
Walter L Ruzzo
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Multiple alignment of homologous DNA sequences is of great interest to biologists since it provides a window into evolutionary processes. At present, the accuracy of whole-genome multiple alignments, particularly in noncoding regions, has not been thoroughly evaluated. RESULTS: We evaluate the alignment accuracy of certain noncoding regions using noncoding RNA alignments from Rfam as a reference. We inspect the MULTIZ 17-vertebrate alignment from the UCSC Genome Browser for all the human sequences in the Rfam seed alignments. In particular, we find 638 instances of chimeric and partial alignments to human noncoding RNA elements, of which at least 225 can be improved by straightforward means. As a byproduct of our procedure, we predict many novel instances of known ncRNA families that are suggested by the alignment. CONCLUSION: MULTIZ does a fairly accurate job of aligning these genomes in these difficult regions. However, our experiments indicate that better alignments exist in some regions.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central