Columbia University Academic Commons

SICTIN: Rapid footprinting of massively parallel sequencing data

Author: A Barski
A Rada-Iglesias
A Valouev
C Dingwall
C Jiang
Claes Wadelius
DE Schones
DS Johnson
E Birney
G Robertson
H Bao
H Hou
H Li
Jan Komorowski
JM Lin
MY Tolstorukov
R Andersson
RM Kuhn
Robin Andersson
S Oberdoerffer
Stefan Enroth
TJP Hubbard
TS Mikkelsen
W Huang
Z Wang
Publication venue: BioMed Central
Publication date: 01/01/2010

BACKGROUND: Massively parallel sequencing allows genome-wide, hypothesis-free investigation of, for instance, transcription factor binding sites or histone modifications. Although detailed nucleotide-resolution information can easily be generated, biological insight often requires a more general view of patterns (footprints) over distinct genomic features such as transcription start sites, exons or repetitive regions. Constructing these footprints is, however, a time-consuming task. METHODS: The presented software generates a binary representation of the signals, enabling fast and scalable lookup. This representation allows footprints to be generated in mere minutes on a desktop computer. Several input formats are accepted, e.g. the SAM format, BED files and the UCSC wiggle track. CONCLUSIONS: Hypothesis-free investigation of genome-wide interactions allows for biological data mining at a scale never before seen. Until recently, analysis of sequencing data has focused mainly on signal patterns around transcriptional start sites, which are manageable in number. Today, the focus is shifting to a wider perspective, and numerous genomic features are being studied. To this end, we provide a system that allows fast querying on the order of hundreds of thousands of features.
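As an illustration of what a footprint is, the sketch below averages a per-base signal in a fixed window around a set of feature anchors (for example transcription start sites). It is a minimal sketch, not the SICTIN implementation: the in-memory dictionary stands in for the precomputed binary signal representation, and the names `footprint`, `signal`, `features` and `flank` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple


def footprint(signal: Dict[str, Sequence[float]],
              features: List[Tuple[str, int, str]],
              flank: int) -> List[float]:
    """Average signal in a window of +/- flank bases around each feature anchor.

    signal   -- per-base values per chromosome (assumed in-memory stand-in
                for a precomputed signal file)
    features -- (chromosome, 0-based anchor position, strand) tuples
    """
    width = 2 * flank + 1
    totals = [0.0] * width
    used = 0
    for chrom, pos, strand in features:
        track = signal.get(chrom)
        if track is None or pos - flank < 0 or pos + flank >= len(track):
            continue  # skip features whose window falls off the chromosome
        window = list(track[pos - flank: pos + flank + 1])
        if strand == "-":
            window.reverse()  # orient every window 5' -> 3'
        for i, value in enumerate(window):
            totals[i] += value
        used += 1
    # Mean over the features that contributed; all zeros if none did.
    return [t / used for t in totals] if used else totals


if __name__ == "__main__":
    toy_signal = {"chr1": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]}
    toy_features = [("chr1", 3, "+"), ("chr1", 6, "-"), ("chr1", 0, "+")]
    print(footprint(toy_signal, toy_features, flank=2))
```

Because each lookup is a plain slice at a known offset, the cost per feature depends only on the window width, which is the property a fixed-width binary signal file would also provide.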

Publikationer från Uppsala Universitet

Digitala Vetenskapliga Arkivet - Academic Archive On-line

Integrating sequence and structural biology with DAS.

Author: A Andreeva
A Kryshtafovych
A Prlić
A Zemla
Andreas Kähäri
Andreas Prlić
B Coessens
D Hull
DL Wheeler
Eugene Kulesha
H Sugawara
I Cases
IN Shindyalov
J Moult
JM Fernández
JR Macías
L Holm
L Käll
L Stein
LD Stein
M Clamp
M Tagari
MD Wilkinson
NJ Mulder
P Jones
P Lackner
PI Olason
PN Seibel
R Bose
RD Dowell
RD Finn
Robert D Finn
RT Fielding
S Pillai
Thomas A Down
Tim JP Hubbard
TJP Hubbard
Publication venue: BMC Bioinformatics
Publication date: 12/09/2007

BACKGROUND: The Distributed Annotation System (DAS) is a network protocol for exchanging biological data. It is frequently used to share annotations of genomes and protein sequences. RESULTS: Here we present several extensions to the current DAS 1.5 protocol. These provide new commands to share alignments and three-dimensional molecular structure data, add the possibility of registering and discovering DAS servers, and provide a convention for supplying different types of data plots. We present examples of web sites and applications that use the new extensions. We operate a public registry of DAS sources, which now includes entries for more than 250 distinct sources. CONCLUSION: Our DAS extensions are essential for the management of the growing number of services and the exchange of diverse biological data sets. In addition, the extensions allow new types of applications to be developed and scientific questions to be addressed. The registry of DAS sources is available at http://www.dasregistry.org. RIGHTS: This article is licensed under the BioMed Central licence at http://www.biomedcentral.com/about/license, which is similar to the 'Creative Commons Attribution Licence'. In brief, you may copy, distribute, and display the work; make derivative works; or make commercial use of the work, under the following conditions: the original author must be given credit; for any reuse or distribution, it must be made clear to others what the license terms of this work are.

Apollo (Cambridge)

King's Research Portal

Formalization of taxon-based constraints to detect inconsistencies in annotation and ontology development

Author: A Millard
C Mungall
Christopher J Mungall
CJ Mungall
E Camon
Emily C Dimmer
H Mi
I Vastrik
J Day-Richter
J Wielemaker
Jennifer I Deegan
JI Clark
L Evans
M Courtot
Reference Genome Group of the Gene Ontology Consortium
S Carbon
S Hunter
The Gene Ontology Consortium
The UniProt Consortium
TJP Hubbard
W Kuśnierczyk
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: The Gene Ontology project supports categorization of gene products according to their location of action, the molecular functions that they carry out, and the processes that they are involved in. Although the ontologies are intentionally developed to be taxon neutral, and to cover all species, there are inherent taxon specificities in some branches. For example, the process 'lactation' is specific to mammals and the location 'mitochondrion' is specific to eukaryotes. The lack of an explicit formalization of these constraints can lead to errors and inconsistencies in automated and manual annotation. Results: We have formalized the taxonomic constraints implicit in some GO classes, and specified these at various levels in the ontology. We have also developed an inference system that can be used to check for violations of these constraints in annotations. Using the constraints in conjunction with the inference system, we have detected and removed errors in annotations and improved the structure of the ontology. Conclusions: Detection of inconsistencies in taxon-specificity enables gradual improvement of the ontologies, the annotations, and the formalized constraints. This is progressively improving the quality of our data. The full system is available for download, and new constraints or proposed changes to constraints can be submitted online at https://sourceforge.net/tracker/?atid=605890&group_id=36855.
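The constraint check described in this abstract can be illustrated with a toy example. The sketch below assumes a single kind of constraint ("term may only be used within taxon X"), attached at any level of the term hierarchy and inherited by descendant terms. All data structures, names and records here are hypothetical and are not taken from the Gene Ontology or from the authors' inference system.

```python
from typing import Dict, List, Optional, Set, Tuple

# Toy hierarchies; every identifier below is illustrative only.
TERM_PARENTS: Dict[str, List[str]] = {
    "milk secretion": ["lactation"],
    "lactation": [],
    "mitochondrial matrix": ["mitochondrion"],
    "mitochondrion": [],
}
TAXON_PARENT: Dict[str, Optional[str]] = {
    "human": "mammals",
    "mammals": "eukaryotes",
    "yeast": "eukaryotes",
    "E. coli": "bacteria",
    "eukaryotes": None,
    "bacteria": None,
}
# Assumed constraint form: the term is valid only inside the named taxon.
ONLY_IN_TAXON: Dict[str, str] = {
    "lactation": "mammals",
    "mitochondrion": "eukaryotes",
}


def term_closure(term: str) -> Set[str]:
    """Return the term together with all of its ancestor terms."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(TERM_PARENTS.get(current, []))
    return seen


def taxon_lineage(taxon: str) -> Set[str]:
    """Return the taxon together with all of its ancestor taxa."""
    lineage: Set[str] = set()
    current: Optional[str] = taxon
    while current is not None:
        lineage.add(current)
        current = TAXON_PARENT.get(current)
    return lineage


def violations(annotations: List[Tuple[str, str, str]]) -> List[Tuple[str, str, str]]:
    """Return the (gene, term, taxon) annotations that break an inherited constraint."""
    bad = []
    for gene, term, taxon in annotations:
        lineage = taxon_lineage(taxon)
        for ancestor in term_closure(term):
            required = ONLY_IN_TAXON.get(ancestor)
            if required is not None and required not in lineage:
                bad.append((gene, term, taxon))
                break
    return bad
```

Specifying a constraint once on a high-level term and letting it propagate to descendants is what makes it possible to flag an annotation such as a bacterial gene product located in the mitochondrial matrix without writing a rule for every child term.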

Public Library of Science (PLOS)

Statistical Modeling of Transcription Factor Binding Affinities Predicts Regulatory Interactions

Author: B Ren
C Dieterich
C Tuerk
D Galas
D Johnson
DC King
F Gao
G Courtois
Helge G. Roider
HG Roider
M Fried
Martin Vingron
Michael Levitt
O Berg
P von Hippel
R Staden
S Coles
S Rahmann
SE Girardin
T Hisamatsu
Thomas Manke
TI Lee
TJP Hubbard
V Matys
W Feller
WE Johnson
Z Bar-Joseph
Publication venue: Public Library of Science
Publication date: 01/01/2008

Recent experimental and theoretical efforts have highlighted the fact that binding of transcription factors to DNA can be more accurately described by continuous measures of their binding affinities, rather than a discrete description in terms of binding sites. While the binding affinities can be predicted from a physical model, it is often desirable to know the distribution of binding affinities for specific sequence backgrounds. In this paper, we present a statistical approach to derive the exact distribution for sequence models with fixed GC content. We demonstrate that the affinity distribution of almost all known transcription factors can be effectively parameterized by a class of generalized extreme value distributions. Moreover, this parameterization also describes the affinity distribution for sequence backgrounds with variable GC content, such as human promoter sequences. Our approach is applicable to arbitrary sequences and all transcription factors with known binding preferences that can be described in terms of a motif matrix. The statistical treatment also provides a proper framework to directly compare transcription factors with very different affinity distributions. This is illustrated by our analysis of human promoters with known binding sites, for many of which we could identify the known regulators as those with the highest affinity. The combination of physical model and statistical normalization provides a quantitative measure which ranks transcription factors for a given sequence, and which can be compared directly with large-scale binding data. Its successful application to human promoter sequences serves as an encouraging example of how the method can be applied to other sequences.

Public Library of Science (PLOS)

MPG.PuRe

Functional Annotation and Identification of Candidate Disease Genes by Computational Analysis of Normal Tissue Gene Expression Data

Author: C Lefevre
CJ Wolfe
F Pagani
Fabio Rosa
Ferdinando Di Cunto
J Quackenbush
J Sulko
J van Helden
K Wutz
Laura Miozzi
Lorenzo Silengo
M Ashburner
M Michaelides
MA van Driel
MB Eisen
N Udar
NA Faustino
O Steinlein
OL Griffith
Oliver Hofmann
PA Krakowiak
Paolo Provero
PO Brown
R Edgar
RB Roth
Rosario Michael Piro
SM Garvey
T Barrett
TJP Hubbard
Ugo Ala
W Zhang
Publication venue: Public Library of Science
Publication date: 01/01/2008

Background: High-throughput gene expression data can predict gene function through the "guilt by association" principle: coexpressed genes are likely to be functionally associated. Methodology/Principal Findings: We analyzed publicly available expression data on normal human tissues. The analysis is based on the integration of data obtained with two experimental platforms (microarrays and SAGE) and of various measures of dissimilarity between expression profiles. The building blocks of the procedure are the Ranked Coexpression Groups (RCGs), small sets of tightly coexpressed genes which are analyzed in terms of functional annotation. Functionally characterized RCGs are selected by means of the majority rule and used to predict new functional annotations. Functionally characterized RCGs are enriched in groups of genes associated with similar phenotypes. We exploit this fact to find new candidate disease genes for many OMIM phenotypes of unknown molecular origin. Conclusions/Significance: We predict new functional annotations for many human genes, showing that the integration of different data sets and coexpression measures significantly improves the scope of the results. Combining gene expression data, functional annotation and known phenotype-gene associations we provide candidate genes for several geneti
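The "majority rule" step mentioned in this abstract can be sketched as follows. This is one plausible reading, under the assumption that a group counts as functionally characterized when a single term annotates more than half of its already annotated genes; the exact rule and thresholds used by the authors are not given in the abstract, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Set


def majority_term(group: List[str],
                  annotations: Dict[str, Set[str]]) -> Optional[str]:
    """Return a term carried by more than half of the annotated genes, if any."""
    annotated = [g for g in group if annotations.get(g)]
    if not annotated:
        return None
    counts = Counter(t for g in annotated for t in annotations[g])
    term, count = counts.most_common(1)[0]
    # Assumed threshold: strict majority of the annotated genes in the group.
    return term if count > len(annotated) / 2 else None


def predict(group: List[str],
            annotations: Dict[str, Set[str]]) -> Dict[str, str]:
    """Propose the majority term for every gene in the group that lacks it."""
    term = majority_term(group, annotations)
    if term is None:
        return {}
    return {g: term for g in group if term not in annotations.get(g, set())}
```

For a group of four coexpressed genes in which two of the three annotated genes carry the same term, `predict` proposes that term for the remaining two genes; a group with no majority yields no prediction.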

CiteSeerX

Institutional Research Information System University of Turin

Genome Expression Pathway Analysis Tool – Analysis and visualization of microarray gene expression data under genomic, proteomic and metabolic context

Author: A Rosenwald
AA Alizadeh
AI Saeed
B Mlecnik
B Zhang
BM Bolstad
C von Mering
F Al-Shahrour
Gene Ontology Consortium
GJ Dennis
GK Smyth
J Rainer
JM Vaquerizas
Julia C Engelmann
Jörg Schultz
M Kanehisa
M Kapushesky
M Kotera
M Masseroli
M Pelizzola
Markus Weniger
O Troyanskaya
P Khatri
P Lichter
P Shannon
R Gentleman
R Shamir
S Bea
SW Doniger
TJP Hubbard
W Huber
YH Yang
Publication venue: BioMed Central
Publication date: 01/06/2007

Background: Regulation of gene expression is relevant to many areas of biology and medicine, in the study of treatments, diseases, and developmental stages. Microarrays can be used to measure the expression level of thousands of mRNAs at the same time, allowing insight into or comparison of different cellular conditions. The data derived from microarray experiments are high-dimensional and often noisy, and interpreting the results can be intricate. Although programs for the statistical analysis of microarray data exist, most of them do not integrate analysis results with biological interpretation. Results: We have developed GEPAT, the Genome Expression Pathway Analysis Tool, which offers analysis of gene expression data in a genomic, proteomic and metabolic context. We provide an integration of statistical methods for data import and data analysis together with a biological interpretation for subsets of probes or single probes on the chip. GEPAT imports various types of oligonucleotide and cDNA array data formats. Different normalization methods can be applied to the data; data annotation is performed afterwards. After import, GEPAT offers various statistical data analysis methods, such as hierarchical, k-means and PCA clustering, a linear-model-based t-test, or chromosomal profile comparison. The results of the analysis can be interpreted by enrichment of biological terms, pathway analysis or interaction networks. Different biological databases are included to provide information on each probe on the chip. GEPAT does not impose a linear workflow, but allows any subset of probes and samples to be used as the starting point for a new data analysis. GEPAT relies on established data analysis packages, offers a modular approach for easy extension, and can be run on a computer grid to support a large number of users. It is freely available under the LGPL open source license for academic and commercial users at http://gepat.sourceforge.net. Conclusion: GEPAT is a modular, scalable and professional-grade software package integrating analysis and interpretation of microarray gene expression data. An installation available for academic users can be found at http://gepat.bioapps.biozentrum.uni-wuerzburg.de.

University of Regensburg Publication Server

Multiple organism algorithm for finding ultraconserved elements

Author: A Sandelin
A Siepel
A Woolfe
AL Delcher
B Ma
CF Cheung
D Gusfield
D Lawson
EA Glazov
EH Margulies
G Bejerano
Greg Madey
HW Mewes
JC Venter
JZ Ni
LD Stein
M Brudno
MI Abouelhoda
N Bray
Neil F Lobo
P Ferragina
RA Holt
S Kurtz
S Schwartz
Scott Christley
SF Altschul
T Tran
TJP Hubbard
U Manber
WJ Kent
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Ultraconserved elements are nucleotide or protein sequences with 100% identity (no mismatches, insertions, or deletions) in the same organism or between two or more organisms. Studies indicate that these conserved regions are associated with micro RNAs, mRNA processing, development and transcription regulation. The identification and characterization of these elements among genomes is necessary for the further understanding of their functionality. Results: We describe an algorithm and provide freely available software which can find all of the ultraconserved sequences between genomes of multiple organisms. Our algorithm takes a combinatorial approach that finds all sequences without requiring the genomes to be aligned. The algorithm is significantly faster than BLAST and is designed to handle very large genomes efficiently. We ran our algorithm on several large comparative analyses to evaluate its effectiveness; one compared 17 vertebrate genomes, where we find 123 ultraconserved elements longer than 40 bp shared by all of the organisms, and another compared the human body louse, Pediculus humanus humanus, against itself and select insects to find thousands of non-coding, potentially functional sequences. Conclusion: Whole genome comparative analysis for multiple organisms is both feasible and desirable in our search for biological knowledge. We argue that bioinformatic programs should be forward thinking by assuming analysis on multiple (and possibly large) genomes in the design and implementation of algorithms. Our algorithm shows how a compromise design with a trade-off of disk space versus memory space allows for efficient computation while only requiring modest computer resources, and at the same time providing benefits not available with other software.
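To make the notion of alignment-free exact matching concrete, the sketch below finds the maximal substrings of a minimum length that occur identically in every input sequence. It uses a simple seed-and-extend approach with substring tests, which is a swapped-in technique suitable only for toy inputs: it is not the authors' algorithm, it ignores reverse complements, and it does not reproduce their disk-versus-memory design. The function name and parameters are hypothetical.

```python
from typing import List, Set


def shared_exact_matches(seqs: List[str], min_len: int) -> List[str]:
    """Maximal substrings of length >= min_len present in every sequence."""
    ref, others = seqs[0], seqs[1:]

    def kmers(s: str) -> Set[str]:
        return {s[i:i + min_len] for i in range(len(s) - min_len + 1)}

    # Seed step: k-mers of the minimum length that every sequence contains.
    common = kmers(ref)
    for s in others:
        common &= kmers(s)

    # Extend step: grow each seed to the right in the reference for as long
    # as the longer string still occurs in all other sequences.
    results: Set[str] = set()
    for i in range(len(ref) - min_len + 1):
        if ref[i:i + min_len] not in common:
            continue
        j = i + min_len
        while j < len(ref) and all(ref[i:j + 1] in s for s in others):
            j += 1
        results.add(ref[i:j])

    # Keep only matches that are not contained in a longer reported match.
    return sorted(m for m in results
                  if not any(m != o and m in o for o in results))


if __name__ == "__main__":
    toy = ["AAACGTACGTTT", "GGACGTACGGG", "CCCACGTACGA"]
    print(shared_exact_matches(toy, min_len=5))
```

The seed step already shows why no alignment is needed: a set intersection over fixed-length substrings locates every candidate region regardless of where it sits in each genome.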

Gitools: Analysis and Visualisation of Genomic Data Using Interactive Heat-Maps

Author: A Floratos
A Mascarell-Creus
A Sturn
A Subramanian
AI Saeed
B Usadel
BR Zeeberg
Christian Perez-Llamas
D Smedley
DW Huang
G Gundem
I Ferreiro
I Medina
J Chen
J Hou
JN Weinstein
M Ashburner
M Hall
M Kanehisa
M Kapushesky
M Reich
MA Sartor
MC Whitlock
N Lopez-Bigas
Nuria Lopez-Bigas
P Pavlidis
R Shamir
S Holm
Stein Aerts
TJP Hubbard
V Rodilla
Y Benjamini
Publication venue: Public Library of Science
Publication date: 01/01/2011

Intuitive visualization of data and results is very important in genomics, especially when many conditions are to be analyzed and compared. Heat-maps have proven very useful for the representation of biological data. Here we present Gitools (http://www.gitools.org), an open-source tool to perform analyses and visualize data and results as interactive heat-maps. Gitools contains data import systems from several sources (i.e. IntOGen, Biomart, KEGG, Gene Ontology), which facilitate the integration of novel data with previous knowledge.

CiteSeerX