194 research outputs found

Automatic discovery of cross-family sequence features associated with protein function

Author: Brameier Markus
Haan Josien
Krings Andrea
MacCallum Robert M
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Methods for predicting protein function directly from amino acid sequences are useful tools in the study of uncharacterised protein families and in comparative genomics. Until now, this problem has been approached using machine learning techniques that attempt to predict membership, or otherwise, to predefined functional categories or subcellular locations. A potential drawback of this approach is that the human-designated functional classes may not accurately reflect the underlying biology, and consequently important sequence-to-function relationships may be missed. RESULTS: We show that a self-supervised data mining approach is able to find relationships between sequence features and functional annotations. No preconceived ideas about functional categories are required, and the training data is simply a set of protein sequences and their UniProt/Swiss-Prot annotations. The main technical aspect of the approach is the co-evolution of amino acid-based regular expressions and keyword-based logical expressions with genetic programming. Our experiments on a strictly non-redundant set of eukaryotic proteins reveal that the strongest and most easily detected sequence-to-function relationships are concerned with targeting to various cellular compartments, which is an area already well studied both experimentally and computationally. Of more interest are a number of broad functional roles which can also be correlated with sequence features. These include inhibition, biosynthesis, transcription and defence against bacteria. Despite substantial overlaps between these functions and their corresponding cellular compartments, we find clear differences in the sequence motifs used to predict some of these functions. For example, the presence of polyglutamine repeats appears to be linked more strongly to the "transcription" function than to the general "nuclear" function/location. 
CONCLUSION: We have developed a novel and useful approach for knowledge discovery in annotated sequence data. The technique is able to identify functionally important sequence features and does not require expert knowledge. By viewing protein function from a sequence perspective, the approach is also suitable for discovering unexpected links between biological processes, such as the recently discovered role of ubiquitination in transcription.

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

A genome-wide survey for prion-regulated miRNAs associated with cholesterol homeostasis

Author: Ann-Christin Schmädicke
Dirk Motzkus
Hermann M Schätzl
Judith Montag
Markus Brameier
Sabine Gilch
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

Crossref

Springer - Publisher Connector

Genome-wide comparative analysis of microRNAs in three non-human primates

Author: A Nahvi
B Zhang
D Bartel
D Gusfield
E Berezikov
E Lai
I Hofacker
J Brown
J Hertel
J Nam
J Yue
L He
L Lim
M Brameier
M Legendre
M Saunders
M Weber
Markus Brameier
R Raaum
S Altschul
S Griffiths-Jones
U Ohler
V Ambros
V Baev
X Wang
Y Altuvia
Publication venue: BioMed Central
Publication date: 01/01/2010

Crossref

Springer - Publisher Connector

PubMed Central

Evolving DNA motifs to predict GeneChip probe performance

Author: AP Harrison
BJ Ross
DJ Montana
F Naef
GJ Upton
HG Beyer
JR Koza
M Brameier
M O'Neill
MA Stalteri
ML Wong
NJ Radcliff
PA Whigham
RI McKay
T Barrett
T Bäck
T Handstad
WB Langdon
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2009

Background: Affymetrix High Density Oligonucleotide Arrays (HDONA) simultaneously measure expression of thousands of genes using millions of probes. We use correlations between measurements for the same gene across 6685 human tissue samples from NCBI's GEO database to indicate the quality of individual HG-U133A probes. Low correlation indicates a poor probe. Results: Regular expressions can be automatically created from a Backus-Naur form (BNF) context-free grammar using strongly typed genetic programming. Conclusion: The automatically produced motif is better at predicting poor DNA sequences than an existing human-generated RE, suggesting runs of Cytosine and Guanine and mixtures should all be avoided. © 2009 Langdon and Harrison; licensee BioMed Central Ltd

University of Essex Research Repository

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

UCL Discovery

PubMed Central

Prediction of nuclear proteins using SVM and HMM models

Author: A Garg
A Heddad
A Pierleoni
BW Matthews
C Dingwall
C Guda
D Xie
Gajendra PS Raghava
KC Chou
M Bhasin
M Brameier
M Cokol
M Kumar
M Rashid
Manish Kumar
O Emanuelsson
P Baldi
R Nair
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: The nucleus, a highly organized organelle, plays an important role in cellular homeostasis. Nuclear proteins are crucial for chromosomal maintenance/segregation, gene expression, RNA processing/export, and many other processes. Several methods have been developed for predicting nuclear proteins in the past. The aim of the present study is to develop a new method for predicting nuclear proteins with higher accuracy. Results: All modules were trained and tested on a non-redundant dataset and evaluated using five-fold cross-validation. Firstly, Support Vector Machine (SVM) based modules were developed using amino acid and dipeptide compositions and achieved a Matthews correlation coefficient (MCC) of 0.59 and 0.61 respectively. Secondly, we developed SVM modules using split amino acid compositions (SAAC) and achieved a maximum MCC of 0.66. Thirdly, a hidden Markov model (HMM) based module/profile was developed for searching exclusively nuclear and non-nuclear domains in a protein. Finally, a hybrid module was developed by combining the SVM module and the HMM profile and achieved an MCC of 0.87 with an accuracy of 94.61%. This method performs better than the existing methods when evaluated on blind/independent datasets. Our method estimated 31.51%, 21.89%, 26.31%, 25.72% and 24.95% of the proteins as nuclear proteins in the Saccharomyces cerevisiae, Caenorhabditis elegans, Drosophila melanogaster, mouse and human proteomes respectively. Based on the above modules, we have developed a web server, NpPred, for predicting nuclear proteins (http://www.imtech.res.in/raghava/nppred/). Conclusion: This study describes a highly accurate method for predicting nuclear proteins. An SVM module has been developed for the first time using SAAC for predicting nuclear proteins, where the amino acid compositions of the N-terminus and of the remaining protein were computed separately. In addition, our study is the first report in which exclusively nuclear and non-nuclear domains have been identified and used for predicting nuclear proteins. The performance of the method improved further when both approaches were combined.
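As a rough illustration of two ideas named in this abstract, the Python sketch below computes a split amino acid composition (the N-terminal segment and the remainder are composed separately) and the Matthews correlation coefficient from confusion counts. The 25-residue N-terminal length and the function names are assumptions made for illustration; they are not taken from the paper, and no SVM or HMM is trained here.

```python
import math

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(seq):
    """Fraction of each standard amino acid in seq (20 values)."""
    if not seq:
        return [0.0] * len(AMINO_ACIDS)
    return [seq.count(aa) / len(seq) for aa in AMINO_ACIDS]

def split_composition(seq, n_term=25):
    """Composition of the N-terminal part followed by that of the rest.

    n_term=25 is an assumed cut-off chosen for this sketch only.
    """
    return composition(seq[:n_term]) + composition(seq[n_term:])

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denom
```

A 40-value vector from `split_composition` could serve as input to any standard classifier; `mcc` ranges from -1 (total disagreement) to 1 (perfect prediction).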

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

SPAM detection: Naïve bayesian classification and RPN expression-based LGP approaches compared

Author: A Guven
A Khorsi
AH Gandomi
AW Burks
C Sangeetha
Carlton Downey
CL Hamblin
E Stamatatos
GV Cormack
I Kononenko
J Pearl
L Hirsch
Lorrie Faith Cranor
M Basavaraju
M Brameier
M Matsumoto
M Zhang
PE Bennett
S Mukkamala
VA Yatsko
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 26/07/2016

An investigation is performed of a machine learning algorithm and the Bayesian classifier in the spam-filtering context. The paper shows the advantage of the use of Reverse Polish Notation (RPN) expressions with feature extraction compared to the traditional Naïve Bayesian classifier used for spam detection assuming the same features. The performance of the two is investigated using a public corpus and a recent private spam collection, concluding that the system based on RPN LGP (Linear Genetic Programming) gave better results compared to two popularly used open source Bayesian spam filters. © Springer International Publishing Switzerland 2016
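To make the notion of an RPN expression over extracted features concrete, here is a minimal Python sketch of a stack-based evaluator. The feature names (`links`, `caps`), the operator set and the threshold are hypothetical; the sketch only evaluates a fixed expression and does not reproduce the paper's LGP system or its evolutionary search.

```python
import operator

# Binary operators allowed in the illustrative expressions.
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def eval_rpn(tokens, features):
    """Evaluate a Reverse Polish Notation expression.

    tokens   -- list of strings: operators, feature names or numbers
    features -- dict mapping feature names to numeric values
    """
    stack = []
    for tok in tokens:
        if tok in OPS:
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[tok](left, right))
        elif tok in features:
            stack.append(features[tok])
        else:
            stack.append(float(tok))  # numeric constant
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]

def is_spam(tokens, features, threshold=0.0):
    """Classify as spam when the expression value exceeds the threshold."""
    return eval_rpn(tokens, features) > threshold
```

For example, `["links", "2", "*", "caps", "+"]` corresponds to the infix expression `links * 2 + caps`.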

Crossref

Institutional repository of Tomas Bata University Library

FunSimMat: a comprehensive functional similarity database

Author: A. Schlicker
Brameier
Camon
Dowell
Draghici
Finn
Franke
Freudenberg
Froehlich
Gene Ontology Consortium
Letunic
Lin
Liu
Lord
Lu
M. Albrecht
Perez-Iratxeta
Pu
Ramirez
Rossi
Schlicker
Sen
Suthram
Wu
Zhang
Publication venue: Oxford University Press
Publication date: 01/01/2008

Functional similarity based on Gene Ontology (GO) annotation is used in diverse applications like gene clustering, gene expression data analysis, protein interaction prediction and evaluation. However, there exists no comprehensive resource of functional similarity values although such a database would facilitate the use of functional similarity measures in different applications. Here, we describe FunSimMat (Functional Similarity Matrix, http://funsimmat.bioinf.mpi-inf.mpg.de/), a large new database that provides several different semantic similarity measures for GO terms. It offers various precomputed functional similarity values for proteins contained in UniProtKB and for protein families in Pfam and SMART. The web interface allows users to efficiently perform both semantic similarity searches with GO terms and functional similarity searches with proteins or protein families. All results can be downloaded in tab-delimited files for use with other tools. An additional XML–RPC interface gives automatic online access to FunSimMat for programs and remote services

Crossref

PubMed Central

MPG.PuRe

Ab initio identification of human microRNAs based on structure motifs

Author: A Rodriguez
A Sewer
C Xue
Carsten Wiuf
D Bartel
D Gusfield
E Berezikov
E Bonnet
E Lai
I Bentwich
I Hofacker
J Han
J Krol
J Nam
L He
L Lim
M Brameier
M Legendre
M Weber
Markus Brameier
P Jiang
P Saetrom
S Altschul
S Baskerville
S Griffiths-Jones
S Helvik
S Kwang Loong
S Ying
T Gingeras
U Ohler
V Ambros
W Ritchie
X Wang
Y Altuvia
Y Grad
Y Zeng
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: MicroRNAs (miRNAs) are short, non-coding RNA molecules that are directly involved in post-transcriptional regulation of gene expression. The mature miRNA sequence binds to more or less specific target sites on the mRNA. Both their small size and sequence specificity make the detection of completely new miRNAs a challenging task. This cannot be based on sequence information alone, but requires structure information about the miRNA precursor. Unlike comparative genomics approaches, ab initio approaches are able to discover species-specific miRNAs without known sequence homology. Results: MiRPred is a novel method for ab initio prediction of miRNAs by genome scanning that relies only on (predicted) secondary structure to distinguish miRNA precursors from other similar-sized segments of the human genome. We apply a machine learning technique, called linear genetic programming, to develop special classifier programs which include multiple regular expressions (motifs) matched against the secondary structure sequence. Special attention is paid to scanning issues. The classifiers are trained on fixed-length sequences as these occur when shifting a window in regular steps over a genome region. Various statistical and empirical evidence is collected to validate the correctness of and increase confidence in the predicted structures. Among other things, we propose a new criterion to select miRNA candidates with a higher stability of folding that is based on the number of matching windows around their genome location. An ensemble of 16 motif-based classifiers achieves 99.9 percent specificity, with sensitivity remaining at an acceptably high level, when requiring all classifiers to agree on a positive decision. A low false positive rate is considered more important than a low false negative rate when searching larger genome regions for unknown miRNAs. 117 new miRNAs have been predicted close to known miRNAs on human chromosome 19. All candidate structures match the free energy distribution of miRNA precursors, which is significantly shifted towards lower free energies. We employed a human EST library and found that around 75 percent of the candidate sequences are likely to be transcribed, with around 35 percent located in introns. Conclusion: Our motif finding method is at least competitive with state-of-the-art feature-based methods for ab initio miRNA discovery. In doing so, it requires less previous knowledge about miRNA precursor structures, while programs and motifs allow a more straightforward interpretation and extraction of the acquired knowledge.
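The scanning scheme described in this abstract (fixed-length windows shifted in regular steps, motifs matched against a secondary-structure string, and a positive call only when every classifier agrees) can be sketched roughly as follows in Python. The dot-bracket encoding, the two regular expressions, and the window size and step are hypothetical stand-ins; they are not MiRPred's evolved motifs or parameters.

```python
import re

def windows(seq, size, step):
    """Yield (start, substring) for fixed-length windows over seq."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]

def unanimous(window, patterns):
    """True only if every motif matches somewhere in the window."""
    return all(p.search(window) for p in patterns)

def scan(structure, patterns, size, step):
    """Start positions of windows accepted by all motifs."""
    return [start for start, win in windows(structure, size, step)
            if unanimous(win, patterns)]

# Hypothetical motifs on a dot-bracket structure string:
# a run of at least five opening and at least five closing brackets.
EXAMPLE_PATTERNS = [re.compile(r"\({5,}"), re.compile(r"\){5,}")]
```

Requiring unanimity trades sensitivity for specificity, which fits the abstract's stated preference for a low false positive rate when scanning large regions.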

Crossref

Springer

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Infectious Disease Ontology

Technological developments have resulted in tremendous increases in the volume and diversity of the data and information that must be processed in the course of biomedical and clinical research and practice. Researchers are at the same time under ever greater pressure to share data and to take steps to ensure that data resources are interoperable. The use of ontologies to annotate data has proven successful in supporting these goals and in providing new possibilities for the automated processing of data and information. In this chapter, we describe different types of vocabulary resources and emphasize those features of formal ontologies that make them most useful for computational applications. We describe current uses of ontologies and discuss future goals for ontology-based computing, focusing on its use in the field of infectious diseases. We review the largest and most widely used vocabulary resources relevant to the study of infectious diseases and conclude with a description of the Infectious Disease Ontology (IDO) suite of interoperable ontology modules that together cover the entire infectious disease domain

PhilPapers

CiteSeerX

Crossref

Inducing Diverse Decision Forests with Genetic Programming

Author: E. Bauer
E. Cantú-Paz
I. Falco De
J. Eggermont
J.R. Koza
J.R. Quinlan
L. Breiman
M. Brameier
R.E. Marmelstein
S.K. Murthy
T.K. Ho
Y. Freund
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2005

Crossref