
161 research outputs found

Detailed estimation of bioinformatics prediction reliability through the Fragmented Prediction Performance Plots

Author: A Tramontano
B Rost
D Frishman
FC Bernstein
HM Berman
IH Witten
JA Cuff
O Carugo
Oliviero Carugo
PY Chou
Uniprot Consortium
VA Simossis
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: An important yet rather neglected question related to bioinformatics predictions is how much data is needed to allow reliable predictions. Bioinformatics predictions are usually validated through a series of figures of merit, such as sensitivity and precision, and little attention is paid to the fact that their performance may depend on the amount of data used to make the predictions themselves. Results: Here I describe a tool, named Fragmented Prediction Performance Plot (FPPP), which monitors the relationship between the prediction reliability and the amount of information underlying the predictions themselves. Three examples of FPPPs are presented to illustrate their principal features. In one example, the reliability becomes independent, over a certain threshold, of the amount of data used to predict protein features, and the intrinsic reliability of the predictor can be estimated. In the other two cases, on the contrary, the reliability strongly depends on the amount of data used to make the predictions and, thus, the intrinsic reliability of the two predictors cannot be determined. Only in the first example is it thus possible to fully quantify the prediction performance. Conclusion: It is thus highly advisable to use FPPPs to determine the performance of any new bioinformatics prediction protocol, in order to fully quantify its prediction power and to allow comparisons between two or more predictors based on different types of data.
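
The idea in this abstract, that figures of merit may vary with the amount of data behind each prediction, can be illustrated with a minimal sketch. It groups predictions into bins by data amount and reports sensitivity and precision per bin. The record layout `(amount, predicted, actual)`, the fixed-width binning, and both function names are assumptions made for illustration; the paper's actual FPPP construction may differ.

```python
from collections import defaultdict


def sensitivity_precision(pairs):
    """Return (sensitivity, precision) for a list of (predicted, actual) booleans.

    A value is None when its denominator is zero.
    """
    tp = sum(1 for p, a in pairs if p and a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    precision = tp / (tp + fp) if (tp + fp) else None
    return sensitivity, precision


def performance_by_data_amount(records, bin_width):
    """Group (amount, predicted, actual) records into bins of `bin_width`
    (hypothetical binning scheme) and score each bin separately."""
    bins = defaultdict(list)
    for amount, predicted, actual in records:
        bin_start = (amount // bin_width) * bin_width
        bins[bin_start].append((predicted, actual))
    return {start: sensitivity_precision(pairs)
            for start, pairs in sorted(bins.items())}
```

Plotting the per-bin values against the bin start would show whether reliability levels off above some amount of data or keeps changing with it.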

Crossref

Archivio Istituzionale della Ricerca - Università degli Studi di Pavia

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

interPopula: a Python API to access the HapMap Project dataset

Author: B Peng
B Rhead
D Rios
D Smedley
F Hsu
F Rousset
GA Thorisson
IH Consortium
J Akey
JD Hunter
JE Stajich
JL Kelley
LD Stein
PJA Cock
SA Tishkoff
TE Oliphant
Tiago Antao
V Curwen
VJ Carey
Publication venue: BioMed Central
Publication date: 01/12/2010

Background: The HapMap project is a publicly available catalogue of common genetic variants that occur in humans, currently including several million SNPs across 1115 individuals spanning 11 different populations. This important database does not provide any programmatic access to the dataset; furthermore, no standard relational database interface is provided. Results: interPopula is a Python API to access the HapMap dataset. interPopula provides integration facilities with both the Python software ecosystem (e.g. Biopython and matplotlib) and other relevant human population datasets (e.g. Ensembl gene annotation and UCSC Known Genes). A set of guidelines and code examples to address possible inconsistencies across heterogeneous data sources is also provided. Conclusions: interPopula is a straightforward and flexible Python API that facilitates the construction of scripts and applications that require access to the HapMap dataset.

LSTM Online Archive

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Purifying Selection in Deeply Conserved Human Enhancers Is More Consistent than in Coding Sequences

Author: A Eyre-Walker
A Kasprzyk
A Siepel
A Todorova
A Woolfe
AB Singleton
AL Hughes
AR Boyko
Arnar Palsson
AS Ethayathulla
D Boffelli
DA Tagle
DG Torgerson
Dilrini R. De Silva
DJ Epstein
DL Halligan
E Berezikov
F Butter
G Bejerano
G Elgar
G Piganeau
GD Stormo
GG Loots
GK McEwen
GR Abecasis
GR Ritchie
Greg Elgar
H Li
HJ Parker
I Dubchak
I Keller
IH Consortium
JA Drake
JJ Cai
JM Bras
K Tamura
LA Lettice
M Claussnitzer
M Kasowski
M Spivakov
MA Antezana
MA DePristo
MB Hammer
P Flicek
R McDaniell
R Sachidanandam
RD Dowell
RD Hernandez
Richard Nichols
RJ Guerreiro
S Asthana
S Benko
S Katzman
S Minovitsky
SB Hedges
W McLaren
W Stephan
XJ Mu
YY Teo
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2014

(c) 2014 De Silva et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Queen Mary Research Online

FigShare

Genome Assembly Has a Major Impact on Gene Content: A Comparison of Annotation in Two Bos Taurus Assemblies

Author: Alexander Souvorov
AV Zimin
DA Wheeler
DL Wheeler
E Pennisi
IH Consortium
J Wang
JC Venter
K Eilbeck
K Liolios
L Florea
Liliana Florea
M Clamp
M Nowrousian
M Pertea
MC Schatz
Najib M. El-Sayed
R Li
RA Gibbs
SF Altschul
SL Salzberg
Steven L. Salzberg
TD Wu
Theodore S. Kalbfleisch
WJ Kent
WR Pearson
Publication venue: Public Library of Science
Publication date: 22/06/2011

Gene and SNP annotation are among the first and most important steps in analyzing a genome. As the number of sequenced genomes continues to grow, a key question is: how does the quality of the assembled sequence affect the annotations? We compared the gene and SNP annotations for two different Bos taurus genome assemblies built from the same data but with significant improvements in the later assembly. The same annotation software was used for annotating both sequences. While some annotation differences are expected even between high-quality assemblies such as these, we found that a staggering 40% of the genes (>9,500) varied significantly between assemblies, due in part to the availability of new gene evidence but primarily to genome mis-assembly events and local sequence variations. For instance, although the later assembly is generally superior, 660 protein coding genes in the earlier assembly are entirely missing from the later genome's annotation, and approximately 3,600 (15%) of the genes have complex structural differences between the two assemblies. In addition, 12–20% of the predicted proteins in both assemblies have relatively large sequence differences when compared to their RefSeq models, and 6–15% of bovine dbSNP records are unrecoverable in the two assemblies. Our findings highlight the consequences of genome assembly quality on gene and SNP annotation and argue for continued improvements in any draft genome sequence. We also found that tracking a gene between different assemblies of the same genome is surprisingly difficult, due to the numerous changes, both small and large, that occur in some genes. As a side benefit, our analyses helped us identify many specific loci for improvement in the Bos taurus genome assembly.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Accuracy of genome-wide imputation of untyped markers and impacts on statistical power for association studies

Author: AL Price
B Servin
C Tian
CA Anderson
CJ Willer
Consortium IH
EE Schadt
Eric E Schadt
Eugene Chudin
I Pe'er
J Marchini
JB Veyrieras
JC Barrett
JD Storey
Joshua McElwee
JZ Li
K Hao
KA Frazer
Ke Hao
P Scheet
PF Sullivan
S Doss
SR Browning
Y Guan
Y Li
YF Pei
Z Yu
Z Zhao
Publication venue: BioMed Central
Publication date: 01/01/2009

Crossref

Springer - Publisher Connector

PubMed Central

Genome-Wide Mapping of Copy Number Variation in Humans: Comparative Analysis of High Resolution Array Platforms

Author: A Abyzov
Alexander E. Urban
Alexej Abyzov
AS Hinrichs
AW Pang
B Schuster-Böckler
C Alkan
C Curtis
D Pinto
DA Oldridge
DF Conrad
DM Altshuler
DT Miller
GH Perry
Gil Ast
H Matsuzaki
H Park
I Ionita-Laza
I Jarick
IH Consortium
JM Kidd
JO Korbel
Mark Gerstein
ME Hurles
Michael Snyder
N Craddock
P Medvedev
P Stankiewicz
PJ Hastings
R Redon
Rajini R. Haraksingh
RE Mills
RM Durbin
SA McCarroll
T Tucker
Y Hasin
Publication venue: Public Library of Science
Publication date: 30/11/2011

Accurate and efficient genome-wide detection of copy number variants (CNVs) is essential for understanding human genomic variation, genome-wide CNV association type studies, cytogenetics research and diagnostics, and independent validation of CNVs identified from sequencing-based technologies. Numerous array-based platforms for CNV detection exist, utilizing array Comparative Genome Hybridization (aCGH), Single Nucleotide Polymorphism (SNP) genotyping or both. We have quantitatively assessed the abilities of twelve leading genome-wide CNV detection platforms to accurately detect Gold Standard sets of CNVs in the genome of HapMap CEU sample NA12878, and found significant differences in performance. The technologies analyzed were the NimbleGen 4.2 M, 2.1 M and 3×720 K Whole Genome and CNV focused arrays, the Agilent 1×1 M CGH and High Resolution and 2×400 K CNV and SNP+CGH arrays, the Illumina Human Omni1Quad array and the Affymetrix SNP 6.0 array. The Gold Standards used were a 1000 Genomes Project sequencing-based set of 3997 validated CNVs and an ultra high-resolution aCGH-based set of 756 validated CNVs. We found that sensitivity, total number, size range and breakpoint resolution of CNV calls were highest for CNV focused arrays. Our results are important for cost-effective CNV detection and validation for both basic and clinical applications.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Bayesian estimation of genomic copy number with single nucleotide polymorphism genotyping arrays

Author: Alejandro Villagran
Beibei Guo
C Fernandez
Caleb Davis
Ching Lau
D Pinkel
G Toruner
GR Bignell
H Willenbrock
IH Consortium
J Huang
J Wang
JC Marioni
Jian Wang
K Wang
L Rabiner
L Winchester
M Hutter
M Tipping
Marina Vannucci
N Metropolis
OM Rueda
P Broet
P Green
PM Rancoita
R Pique-Regi
R Redon
Rudy Guerra
S Colella
S Knuutila
S Richardson
SJ Diskin
Tsz-Kwong Man
W Hastings
WR Lai
X Zhao
Y Nannya
Publication venue: BioMed Central
Publication date: 01/12/2010

Background: The identification of copy number aberration in the human genome is an important area in cancer research. We develop a model for determining genomic copy numbers using high-density single nucleotide polymorphism genotyping microarrays. The method is based on a Bayesian spatial normal mixture model with an unknown number of components corresponding to true copy numbers. A reversible jump Markov chain Monte Carlo algorithm is used to implement the model and perform posterior inference. Results: The performance of the algorithm is examined on both simulated and real cancer data, and it is compared with the popular CNAG algorithm for copy number detection. Conclusions: We demonstrate that our Bayesian mixture model performs at least as well as the hidden Markov model based CNAG algorithm and in certain cases does better. One of the added advantages of our method is the flexibility of modeling normal cell contamination in tumor samples.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

A data mining approach for classifying DNA repair genes into ageing-related or non-ageing-related

Author: A Budovsky
A Seluanov
AA Freitas
Alex A Freitas
AP Bradley
B Karlsson
BP Best
C Kenyon
Chimpanzee Sequencing and Analysis Consortium
D Kipling
D Szafron
DEL Promislow
EC Friedberg
EN Chautard
FV Rassool
IH Witten
João Pedro de Magalhães
JP de Magalhaes
K Ariyoshi
L Ferrarini
M Prelog
MA Harris
Mvd Ven
N Cristianini
N Hosaka
Olga Vasieva
P Hasty
P Mombaerts
R Arking
R Tacutu
RD Wood
S Beneke
S Burmaa
SE James
SL Rabinowe
T Hruz
TSK Prasad
YJ Ju
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: The ageing of the worldwide population means there is a growing need for research on the biology of ageing. DNA damage is likely a key contributor to the ageing process and elucidating the role of different DNA repair systems in ageing is of great interest. In this paper we propose a data mining approach, based on classification methods (decision trees and Naive Bayes), for analysing data about human DNA repair genes. The goal is to build classification models that allow us to discriminate between ageing-related and non-ageing-related DNA repair genes, in order to better understand their different properties. Results: The main patterns discovered by the classification methods are as follows: (a) the number of protein-protein interactions was a predictor of DNA repair proteins being ageing-related; (b) the use of predictor attributes based on protein-protein interactions considerably increased predictive accuracy of attributes based on Gene Ontology (GO) annotations; (c) GO terms related to "response to stimulus" seem reasonably good predictors of ageing-relatedness for DNA repair genes; (d) interaction with the XRCC5 (Ku80) protein is a strong predictor of ageing-relatedness for DNA repair genes; and (e) DNA repair genes with a high expression in T lymphocytes are more likely to be ageing-related. Conclusions: The above patterns are broadly integrated in an analysis discussing relations between Ku, the non-homologous end joining DNA repair pathway, ageing and lymphocyte development. These patterns and their analysis support non-homologous end joining double strand break repair as central to the ageing-relatedness of DNA repair genes. Our work also showcases the use of protein interaction partners to improve accuracy in data mining methods and our approach could be applied to other ageing-related pathways.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Kent Academic Repository

Inference of Relationships in Population Data Using Identity-by-Descent and Identity-by-State

Author: A Gusev
AB Olshen
AG Clark
BL Browning
BS Weir
C Cotterman
C Tian
CW Chiang
D Reich
David B. Allison
DT Bishop
ED Roberson
Elisha D. O. Roberson
Eric L. Stevens
G Coop
G McVean
GR Abecasis
Greg Heckenberg
HM Kang
IH Consortium
IT Jolliffe
J Chen
J Novembre
Jonathan Pevsner
Joseph D. Baugher
K Bryc
KA Frazer
KE Lohmueller
MA Abdulla
NA Rosenberg
NL Sobreira
O Lao
PC Sham
PE Lundmark
RM Durbin
RN Gutenkunst
S Purcell
S Xu
SB Gabriel
SR Browning
TA Manolio
Thomas J. Downey
TJ Pemberton
W Lee
X Gao
Publication venue: Public Library of Science
Publication date: 01/01/2011

It is an assumption of large, population-based datasets that samples are annotated accurately whether they correspond to known relationships or unrelated individuals. These annotations are key for a broad range of genetics applications. While many methods are available to assess relatedness that involve estimates of identity-by-descent (IBD) and/or identity-by-state (IBS) allele-sharing proportions, we developed a novel approach that estimates IBD0, 1, and 2 based on observed IBS within windows. When combined with genome-wide IBS information, it provides an intuitive and practical graphical approach with the capacity to analyze datasets with thousands of samples without prior information about relatedness between individuals or haplotypes. We applied the method to a commonly used Human Variation Panel consisting of 400 nominally unrelated individuals. Surprisingly, we identified identical, parent-child, and full-sibling relationships and reconstructed pedigrees. In two instances non-sibling pairs of individuals in these pedigrees had unexpected IBD2 levels, as well as multiple regions of homozygosity, implying inbreeding. This combined method allowed us to distinguish related individuals from those having atypical heterozygosity rates and determine which individuals were outliers with respect to their designated population. Additionally, it becomes increasingly difficult to identify distant relatedness using genome-wide IBS methods alone. However, our IBD method further identified distant relatedness between individuals within populations, supported by the presence of megabase-scale regions lacking IBS0 across individual chromosomes. We benchmarked our approach against the hidden Markov model of a leading software package (PLINK), showing improved calling of distantly related individuals, and we validated it using a known pedigree from a clinical study. The application of this approach could improve genome-wide association, linkage, heterozygosity, and other population genomics studies that rely on SNP genotype data.
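
To make "observed IBS within windows" concrete, here is a minimal sketch that tallies IBS0, IBS1 and IBS2 proportions for one pair of individuals in fixed-size SNP windows. It assumes genotypes are coded as allele counts (0, 1 or 2) with `None` for missing calls, and that windows are defined by SNP count; both choices and the function names are illustrative assumptions. The step from windowed IBS to IBD0/1/2 estimates described in the abstract is not reproduced here.

```python
def ibs_state(g1, g2):
    """Alleles shared identical-by-state for two genotypes coded as 0/1/2 allele counts."""
    return 2 - abs(g1 - g2)


def windowed_ibs(geno_a, geno_b, window):
    """Return, per window of `window` consecutive SNPs, a tuple of
    (IBS0, IBS1, IBS2) proportions, or None if every call in the window is missing."""
    if len(geno_a) != len(geno_b):
        raise ValueError("genotype vectors must have the same length")
    results = []
    for start in range(0, len(geno_a), window):
        counts = [0, 0, 0]
        pairs = zip(geno_a[start:start + window], geno_b[start:start + window])
        for a, b in pairs:
            if a is None or b is None:  # skip missing genotype calls
                continue
            counts[ibs_state(a, b)] += 1
        total = sum(counts)
        results.append(tuple(c / total for c in counts) if total else None)
    return results
```

A long run of windows with an IBS0 proportion of zero is the kind of signal the abstract associates with relatedness, although a real analysis would need many more SNPs per window than this toy layout suggests.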

CiteSeerX

Public Library of Science (PLOS)

Crossref

PubMed Central

Cutoff Scanning Matrix (CSM): structural classification and function prediction by protein inter-residue distance patterns

Author: A Stark
AG Murzin
B Delaunay
C Chothia
CA Orengo
Carlos H da Silveira
CH da Silveira
CH Ding
D del Castillo-Negrete
D Lee
Douglas EV Pires
HB Shen
HM Berman
IH Witten
J Cheng
JA Barker
JD Watson
K Goyal
L Eldén
L Holm
M Ashburner
M Babor
M Punta
MA Alvarez
Marcelo M Santoro
Marcos A dos Santos
MW Berry
P Jain
PD Dobson
R Kolodny
RA Laskowski
Raquel C de Melo-Minardi
RD Finn
S Shazman
SC Deerwester
SD Brown
SE Brenner
TU Consortium
V Soundararajan
Wagner Meira
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: The unforgiving pace of growth of available biological data has increased the demand for efficient and scalable paradigms, models and methodologies for automatic annotation. In this paper, we present a novel structure-based protein function prediction and structural classification method: Cutoff Scanning Matrix (CSM). CSM generates feature vectors that represent distance patterns between protein residues. These feature vectors are then used as evidence for classification. Singular value decomposition is used as a preprocessing step to reduce dimensionality and noise. The aspect of protein function considered in the present work is enzyme activity. A series of experiments was performed on datasets based on Enzyme Commission (EC) numbers and mechanistically different enzyme superfamilies as well as other datasets derived from SCOP release 1.75. RESULTS: CSM was able to achieve a precision of up to 99% after SVD preprocessing for a database derived from manually curated protein superfamilies and up to 95% for a dataset of the 950 most-populated EC numbers. Moreover, we conducted experiments to verify our ability to assign SCOP class, superfamily, family and fold to protein domains. An experiment using the whole set of domains found in the latest SCOP version yielded high levels of precision and recall (up to 95%). Finally, we compared our structural classification results with those in the literature to place this work into context. Our method was capable of significantly improving the recall of a previous study while preserving a compatible precision level. CONCLUSIONS: We showed that the patterns derived from CSMs could effectively be used to predict protein function and thus help with automatic function annotation. We also demonstrated that our method is effective in structural classification tasks. These facts reinforce the idea that the pattern of inter-residue distances is an important component of family structural signatures. Furthermore, singular value decomposition provided a consistent increase in precision and recall, which makes it an important preprocessing step when dealing with noisy data.
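
One plausible reading of "feature vectors that represent distance patterns between protein residues" is sketched below: for a series of distance cutoffs, count the residue pairs lying within each cutoff. The abstract does not specify the details, so the use of one 3D point per residue, the cutoff range and step, and the function name are all assumptions. The singular value decomposition step is omitted.

```python
import math
from itertools import combinations


def cutoff_scanning_vector(coords, d_min, d_max, step):
    """Build one feature vector for a structure.

    coords: list of (x, y, z) tuples, one representative point per residue
            (hypothetical choice, e.g. an alpha carbon).
    Returns a list whose i-th entry is the number of residue pairs whose
    distance is <= d_min + i * step, for cutoffs up to d_max.
    """
    distances = [math.dist(p, q) for p, q in combinations(coords, 2)]
    n_steps = int(round((d_max - d_min) / step))
    vector = []
    for i in range(n_steps + 1):
        cutoff = d_min + i * step
        vector.append(sum(1 for d in distances if d <= cutoff))
    return vector
```

Stacking one such vector per protein gives a matrix with one row per structure, which is the kind of input a dimensionality-reduction step and a classifier could then consume.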

Crossref

Springer - Publisher Connector

PubMed Central

University of Melbourne Institutional Repository