484 research outputs found

Annotation of the Drosophila melanogaster euchromatic genome: a systematic review

BACKGROUND: The recent completion of the Drosophila melanogaster genomic sequence to high quality and the availability of a greatly expanded set of Drosophila cDNA sequences, aligning to 78% of the predicted euchromatic genes, afforded FlyBase the opportunity to significantly improve genomic annotations. We made the annotation process more rigorous by inspecting each gene visually, utilizing a comprehensive set of curation rules, requiring traceable evidence for each gene model, and comparing each predicted peptide to SWISS-PROT and TrEMBL sequences. RESULTS: Although the number of predicted protein-coding genes in Drosophila remains essentially unchanged, the revised annotation significantly improves gene models, resulting in structural changes to 85% of the transcripts and 45% of the predicted proteins. We annotated transposable elements and non-protein-coding RNAs as new features, and extended the annotation of untranslated (UTR) sequences and alternative transcripts to include more than 70% and 20% of genes, respectively. Finally, cDNA sequence provided evidence for dicistronic transcripts, neighboring genes with overlapping UTRs on the same DNA sequence strand, alternatively spliced genes that encode distinct, non-overlapping peptides, and numerous nested genes. CONCLUSIONS: Identification of so many unusual gene models not only suggests that some mechanisms for gene regulation are more prevalent than previously believed, but also underscores the complex challenges of eukaryotic gene prediction. At present, experimental data and human curation remain essential to generate high-quality genome annotations

Harvard University - DASH

Springer

Springer - Publisher Connector

PubMed Central

The Dawn of Open Access to Phylogenetic Data

Author: A Gelman
A Stoltzfus
AA Alsheikh-Ali
AJ Drummond
AJ Moore
Andrew F. Magee
BP Blackburne
Brian R. Moore
BT Drew
C Notredame
CJ Savage
D Rabosky
DA Morrison
DG Roche
E Evangelou
HA Piwowar
J Hughes
J Leebens-Mack
JD Thompson
JM Wicherts
KM Wong
L Rieseberg
M Plummer
MA Suchard
MAF Noor
MC Whitlock
MD Rausher
Michael R. May
MJ Donoghue
MJ Sanderson
MK Uyenoyama
OG Pybus
RM O'brien
S Kullback
SJ Ceci
SP Brooks
T Vines
TH Vines
TJ Vision
William J. Murphy
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2014

The scientific enterprise depends critically on the preservation of and open access to published data. This basic tenet applies acutely to phylogenies (estimates of evolutionary relationships among species). Increasingly, phylogenies are estimated from increasingly large, genome-scale datasets using increasingly complex statistical methods that require increasing levels of expertise and computational investment. Moreover, the resulting phylogenetic data provide an explicit historical perspective that critically informs research in a vast and growing number of scientific disciplines. One such use is the study of changes in rates of lineage diversification (speciation - extinction) through time. As part of a meta-analysis in this area, we sought to collect phylogenetic data (comprising nucleotide sequence alignment and tree files) from 217 studies published in 46 journals over a 13-year period. We document our attempts to procure those data (from online archives and by direct request to corresponding authors), and report results of analyses (using Bayesian logistic regression) to assess the impact of various factors on the success of our efforts. Overall, complete phylogenetic data for ~60% of these studies are effectively lost to science. Our study indicates that phylogenetic data are more likely to be deposited in online archives and/or shared upon request when: (1) the publishing journal has a strong data-sharing policy; (2) the publishing journal has a higher impact factor; and (3) the data are requested from faculty rather than students. Although the situation appears dire, our analyses suggest that it is far from hopeless: recent initiatives by the scientific community -- including policy changes by journals and funding agencies -- are improving the state of affairs.

arXiv.org e-Print Archive

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

BioWarehouse: a bioinformatics database warehouse toolkit

Author: Gupta Priyanka
Karp Peter D
Lee Thomas J
Pouliot Yannick
Stringer-Calvert David WJ
Tenenbaum Jessica D
Wagner Valerie
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: This article addresses the problem of interoperation of heterogeneous bioinformatics databases. RESULTS: We introduce BioWarehouse, an open source toolkit for constructing bioinformatics database warehouses using the MySQL and Oracle relational database managers. BioWarehouse integrates its component databases into a common representational framework within a single database management system, thus enabling multi-database queries using the Structured Query Language (SQL) but also facilitating a variety of database integration tasks such as comparative analysis and data mining. BioWarehouse currently supports the integration of a pathway-centric set of databases including ENZYME, KEGG, and BioCyc, and in addition the UniProt, GenBank, NCBI Taxonomy, and CMR databases, and the Gene Ontology. Loader tools, written in the C and JAVA languages, parse and load these databases into a relational database schema. The loaders also apply a degree of semantic normalization to their respective source data, decreasing semantic heterogeneity. The schema supports the following bioinformatics datatypes: chemical compounds, biochemical reactions, metabolic pathways, proteins, genes, nucleic acid sequences, features on protein and nucleic-acid sequences, organisms, organism taxonomies, and controlled vocabularies. As an application example, we applied BioWarehouse to determine the fraction of biochemically characterized enzyme activities for which no sequences exist in the public sequence databases. The answer is that no sequence exists for 36% of enzyme activities for which EC numbers have been assigned. These gaps in sequence data significantly limit the accuracy of genome annotation and metabolic pathway prediction, and are a barrier for metabolic engineering. Complex queries of this type provide examples of the value of the data warehousing approach to bioinformatics research. 
CONCLUSION: BioWarehouse embodies significant progress on the database integration problem for bioinformatics

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The Abundance of Short Proteins in the Mammalian Proteome

Author: Alistair R Forrest
Bill Pavan
Chikatoshi Kai
Ehsan Nourbakhsh
John Hancock
Judith Blake
Jun Kawai
Ken C Pang
Lisa Stubbs
Martin C Frith
Piero Carninci
Sean M Grimmond
Timothy L Bailey
Yoshihide Hayashizaki
Publication venue: Public Library of Science
Publication date: 01/01/2006

Short proteins play key roles in cell signalling and other processes, but their abundance in the mammalian proteome is unknown. Current catalogues of mammalian proteins exhibit an artefactual discontinuity at a length of 100 aa, so that protein abundance peaks just above this length and falls off sharply below it. To clarify the abundance of short proteins, we identify proteins in the FANTOM collection of mouse cDNAs by analysing synonymous and non-synonymous substitutions with the computer program CRITICA. This analysis confirms that there is no real discontinuity at length 100. Roughly 10% of mouse proteins are shorter than 100 aa, although the majority of these are variants of proteins longer than 100 aa. We identify many novel short proteins, including a “dark matter” subset containing ones that lack detectable homology to other known proteins. Translation assays confirm that some of these novel proteins can be translated and localised to the secretory pathway

Crossref

Directory of Open Access Journals

PubMed Central

University of Melbourne Institutional Repository

University of Queensland eSpace

Literature classification for semi-automated updating of biological knowledgebases

Author: Brusic Vladimir
Kudahl Ulrich Johan
Olsen Lars Rønn
Winther Ole
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2013

BACKGROUND: As the output of biological assays increases in resolution and volume, the body of specialized biological data, such as functional annotations of gene and protein sequences, enables extraction of higher-level knowledge needed for practical application in bioinformatics. Whereas common types of biological data, such as sequence data, are extensively stored in biological databases, functional annotations, such as immunological epitopes, are found primarily in semi-structured formats or free text embedded in primary scientific literature. RESULTS: We defined and applied a machine learning approach for literature classification to support updating of TANTIGEN, a knowledgebase of tumor T-cell antigens. Abstracts from PubMed were downloaded and classified as either "relevant" or "irrelevant" for database update. Training and five-fold cross-validation of a k-NN classifier on 310 abstracts yielded classification accuracy of 0.95, thus showing significant value in support of data extraction from the literature. CONCLUSION: We here propose a conceptual framework for semi-automated extraction of epitope data embedded in scientific literature using principles from text mining and machine learning. The addition of such data will aid in the transition of biological databases to knowledgebases.

Crossref

Springer - Publisher Connector

Copenhagen University Research Information System

PubMed Central

Online Research Database In Technology

Correlation-based methods for data cleaning, with application to biological databases

Author: KOH LIE YONG
Publication date: 25/09/2007

Ph.D. (Doctor of Philosophy)

ScholarBank@NUS

Macromolecular Databases – A Background of Bioinformatics

Author: Donatella Verbanac
Dubravko Jelić
Tibor Toth
Publication venue: Faculty of Food Technology and Biotechnology, University of Zagreb
Publication date: 01/01/2003

Directory of Open Access Journals

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia