2,740 research outputs found

Human pol II promoter prediction: time series descriptors and machine learning

Author: Gangal Rajeev
Sharma Pankaj
Publication venue: Oxford University Press
Publication date: 01/01/2005

Although several in silico promoter prediction methods have been developed to date, they are still limited in predictive performance. The limitations stem from the difficulty of selecting features that distinguish promoters from non-promoters and from the limited generalization ability of the machine-learning algorithms. In this paper we attempt to define a novel approach that uses unique descriptors and machine-learning methods to recognize eukaryotic polymerase II promoters. Non-linear time series descriptors are combined with non-linear machine-learning algorithms, such as the support vector machine (SVM), to discriminate between promoter and non-promoter regions. The basic idea is to use descriptors that do not depend on the primary DNA sequence and that provide a clear distinction between promoter and non-promoter regions. The classification model, built on a set of 1000 promoter and 1500 non-promoter sequences, showed a 10-fold cross-validation accuracy of 87%, and on an independent test set it had an accuracy >85% in both promoter and non-promoter identification. This approach correctly identified all 20 experimentally verified promoters of human chromosome 22. The high sensitivity and selectivity indicate that n-mer frequencies along with non-linear time series descriptors, such as Lyapunov component stability and Tsallis entropy, and supervised machine-learning methods, such as SVMs, can be useful in the identification of pol II promoters.
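
Two of the descriptors named in the abstract above, n-mer frequencies and Tsallis entropy, can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say which n-mer length or entropic index q was used, nor how the entropy was computed from a sequence, so the function names, k=2, q=2.0, and the choice to take the entropy of the n-mer distribution are all hypothetical.

```python
import math
from collections import Counter


def kmer_frequencies(seq, k=3):
    """Relative frequency of each overlapping k-mer in a DNA sequence."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    counts = Counter(kmers)
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def tsallis_entropy(probabilities, q=2.0):
    """Tsallis entropy S_q = (1 - sum(p_i ** q)) / (q - 1)."""
    probabilities = list(probabilities)
    if q == 1.0:
        # In the limit q -> 1 the Tsallis entropy reduces to Shannon entropy.
        return -sum(p * math.log(p) for p in probabilities if p > 0)
    return (1.0 - sum(p ** q for p in probabilities)) / (q - 1.0)


# Illustrative sequence only; k and q are assumptions, not values from the paper.
freqs = kmer_frequencies("TATAAAGGCGCGTATAAA", k=2)
entropy = tsallis_entropy(freqs.values(), q=2.0)
```

A feature vector for a classifier such as an SVM could then concatenate the n-mer frequencies with scalar descriptors like this entropy.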

CiteSeerX

Crossref

PubMed Central

SiteSeek: Post-translational modification analysis using adaptive locality-effective kernel methods and new profiles

Author: A Radzicka
A Zanzoni
AA Salamov
Albert Y Zomaya
B Amos
B Rost
B Zhang
BA Ballif
Bing Bing Zhou
C Li
CJC Burgess
D Frishman
DT Larose
F Diella
G Horváth
G Rose
GD Rose
H Hu
H Kim
HH Jang
HM Berman
J Liu
J Shawe-Taylor
JC Obenauer
JH Kim
K Koliba
K Lin
L Graves
L Johnson
LA Pinna
LM Iakoucheva
M Hjerrild
M Mann
M Scholz
MA Kramer
MB Yaffe
MJ Korenberg
MJ Zvelebil
N Blom
NL Daly
P Baldi
P Cohen
Paul D Yoo
R David
R Lohmann
RD King
RE Schapire
SA Beausoleil
SB Ficarro
T Hunter
TG Dietterich
W Hardle
Y Freund
Y Xue
Yung Shwen Ho
Z Songyang
ZR Yang
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Post-translational modifications have a substantial influence on the structure and functions of proteins. Post-translational phosphorylation is one of the most common modifications that occur in intracellular proteins. Accurate prediction of protein phosphorylation sites is of great importance for understanding diverse cellular signalling processes in both humans and animals. In this study, we propose a new machine-learning-based protein phosphorylation site predictor, SiteSeek. SiteSeek is trained using a novel compact evolutionary and hydrophobicity profile to detect possible protein phosphorylation sites for a target sequence. The newly proposed method proves to be more accurate and exhibits much more stable predictive performance than existing phosphorylation site predictors.
Results: The performance of the proposed model was compared with nine existing machine learning models and four widely known phosphorylation site predictors on the newly proposed PS-Benchmark_1 dataset, contrasting their accuracy, sensitivity, specificity and correlation coefficient. SiteSeek showed better predictive performance, with 86.6% accuracy, 83.8% sensitivity, 92.5% specificity and a correlation coefficient of 0.77 on the four main kinase families (CDK, CK2, PKA, and PKC).
Conclusion: The newly proposed methods used in SiteSeek were shown to be useful for identifying protein phosphorylation sites, as SiteSeek performed much better than widely known predictors on the newly built PS-Benchmark_1 dataset.
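
The four measures reported in the abstract above (accuracy, sensitivity, specificity and correlation coefficient) are all functions of a binary confusion matrix. The sketch below assumes the correlation coefficient is the Matthews correlation coefficient, which the abstract does not state explicitly; the function name and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Metrics from confusion-matrix counts; assumes both classes are present."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)  # fraction of true sites that are recovered
    specificity = tn / (tn + fp)  # fraction of non-sites that are rejected
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Matthews correlation coefficient; defined as 0 when the denominator is 0.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Made-up counts for illustration only.
example = binary_metrics(tp=40, fp=10, tn=40, fn=10)
```

Unlike accuracy, the correlation coefficient stays informative when the positive and negative classes are unbalanced, which is why it is commonly reported alongside the other three.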

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Faktorizacija matrik nizkega ranga pri učenju z večjedrnimi metodami (Low-rank matrix factorization in multiple kernel learning)

Author: Stražar Martin
Publication date: 01/10/2018

The increased rate of data collection, storage, and availability results in a corresponding interest in data analyses and predictive models based on the simultaneous inclusion of multiple data sources. This tendency is ubiquitous in practical applications of machine learning, including recommender systems, social network analysis, finance and computational biology. The heterogeneity and size of typical datasets call for simultaneous dimensionality reduction and inference from multiple data sources in a single model. Matrix factorization and multiple kernel learning models are two general approaches that satisfy this goal. This work focuses on two specific goals, namely i) finding interpretable, non-overlapping (orthogonal) data representations through matrix factorization and ii) regression with multiple kernels through the low-rank approximation of the corresponding kernel matrices, providing non-linear outputs and interpretation of kernel selection. The motivation for the models and algorithms designed in this work stems from RNA biology and the rich complexity of protein-RNA interactions. Although the regulation of RNA fate happens at many levels, bringing in various possible data views, we show how different questions can be answered directly through constraints in the model design. We have developed an integrative orthogonality nonnegative matrix factorization (iONMF) to integrate multiple data sources and discover non-overlapping, class-specific RNA binding patterns of varying strengths. We show that the integration of multiple data sources improves the predictive accuracy of retrieval of RNA binding sites and report on a number of inferred protein-specific patterns, consistent with experimentally determined properties. Kernel methods are a principled way to extend linear models to non-linear settings.
Multiple kernel learning enables modelling with different data views, but is limited by the quadratic computation and storage complexity of the kernel matrix. Considerable savings in time and memory can be expected if kernel approximation and multiple kernel learning are performed simultaneously. We present the Mklaren algorithm, which achieves this goal via Incomplete Cholesky Decomposition, where the selection of basis functions is based on Least-angle regression, resulting in linear complexity both in the number of data points and kernels. Considerable savings in approximation rank are observed compared with general kernel matrix decompositions, and the rank is comparable to that of methods specialized to particular kernel function families. The principal advantages of Mklaren are independence of kernel function form, robust inducing point selection and the ability to use different kernels in different regions of both continuous and discrete input spaces, such as numeric vector spaces, strings or trees, providing a platform for bioinformatics. In summary, we design novel models and algorithms based on matrix factorization and kernel learning, combining regression, insights into the domain of interest by identifying relevant patterns, kernels and inducing points, while scaling to millions of data points and data views.
With the accelerating collection, organization, and availability of data, there is a growing need for predictive models based on simultaneous learning from multiple data sources. Concrete applications span machine learning, recommender systems, social networks, finance, and computational biology. The heterogeneity and size of typical datasets drive the development of methods for simultaneous size reduction (compression) and inference from multiple data sources in a joint model. Matrix factorization and kernel methods are two general tools that make this goal achievable.
This work focuses on two specific goals: i) finding interpretable, non-overlapping representations of patterns in data by means of orthogonal matrix factorization, and ii) supervised simultaneous factorization of multiple kernel matrices, which enables modelling of non-linear responses and interpretation of the importance of different data sources. The motivation for the models and algorithms developed in this work stems from RNA biology and the rich complexity of interactions between proteins and RNA molecules in the cell. Although RNA regulation takes place at several levels, which gives rise to multiple data sources or views, many properties of regulation can be uncovered through constraints imposed at the modelling stage. We present a simultaneous matrix factorization procedure constrained so that individual patterns in the data do not overlap, that is, they are independent or orthogonal. In practice, this means that distinct, non-overlapping modes of RNA regulation by different proteins can be discovered. Including multiple data sources improves predictive accuracy when predicting potential binding sites of an individual RNA-binding protein. The patterns discovered from the data are comparable with experimentally determined protein properties and include short nucleotide sequences on RNA, cooperative binding with other proteins, RNA structural properties, and functional annotation. Classical matrix factorization methods are typically based on linear models of the data. Kernel methods are one way of extending matrix factorization models to non-linear responses. Multiple kernel learning enables learning from multiple data sources but is limited by computational complexity that is quadratic in the number of data examples. We remove this limitation with suitable approximations in the computation of the kernel matrices.
To this end, we improve on existing methods by simultaneously computing an approximation of the kernel matrices and their linear combination that models a given target response. We achieve this with the Mklaren method (Multiple kernel learning based on Least-angle regression), which combines Incomplete Cholesky Decomposition with Least-angle regression. The design of the algorithm leads to time and space requirements that are linear both in the number of data examples and in the number of kernel functions. Besides its computational scaling, the main advantage of the procedure is its generality, that is, its independence of the kernel functions used. Different, general kernel functions can thus be used to model different parts of the input space, which may be continuous or discrete, for example vector spaces, spaces of character strings, and other data structures, which is convenient for bioinformatics. In this work we thus develop algorithms based on simultaneous matrix factorization and kernel methods, treat models of linear and non-linear regression, and interpret the data domain by identifying important kernels and data examples, while the methods can be run on millions of data examples and sources.
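
The low-rank kernel approximation that the abstract builds on can be illustrated with a plain pivoted Incomplete Cholesky Decomposition. The sketch below is not Mklaren: it picks each pivot greedily by the largest residual diagonal entry, whereas the abstract describes selection guided by Least-angle regression against the target. The one-dimensional RBF kernel, the function names, and the tolerance are assumptions made for illustration.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel on scalars (assumed here; any kernel function would do)."""
    return math.exp(-gamma * (x - y) ** 2)


def incomplete_cholesky(points, kernel, rank, tol=1e-12):
    """Return (factor, pivots) with K ~= factor @ factor.T, K[i][j] = kernel(p_i, p_j).

    Only one kernel column per step is evaluated, so the full n x n
    matrix is never stored.
    """
    n = len(points)
    residual = [kernel(p, p) for p in points]  # diagonal of K minus the approximation
    factor = [[0.0] * rank for _ in range(n)]
    pivots = []
    for j in range(rank):
        # Greedy pivot: the point that is currently approximated worst.
        pivot = max(range(n), key=lambda t: residual[t])
        if residual[pivot] <= tol:
            break  # the approximation is already numerically exact
        pivots.append(pivot)
        root = math.sqrt(residual[pivot])
        for t in range(n):
            k_tp = kernel(points[t], points[pivot])
            explained = sum(factor[t][m] * factor[pivot][m] for m in range(j))
            factor[t][j] = (k_tp - explained) / root
        for t in range(n):
            residual[t] -= factor[t][j] ** 2
    return factor, pivots


pts = [0.0, 0.5, 1.0, 2.0]
factor, pivots = incomplete_cholesky(pts, rbf_kernel, rank=2)
```

Each step costs n kernel evaluations plus O(n·rank) arithmetic, so time and memory grow linearly with the number of data points for a fixed rank, which is the property the abstract relies on.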

Repository of the University of Ljubljana

Protein Secondary Structure Prediction Using Support Vector Machines, Neural Networks and Genetic Algorithms

Author: Reyaz-Ahmed Anjum B
Publication venue: ScholarWorks @ Georgia State University
Publication date: 03/05/2007

Bioinformatics techniques for protein secondary structure prediction mostly depend on the information available in the amino acid sequence. Support vector machines (SVM) have shown strong generalization ability in a number of application areas, including protein structure prediction. In this study, a new sliding window scheme with multiple windows is introduced to form the protein data for training and testing the SVM. An orthogonal encoding scheme coupled with the BLOSUM62 matrix is used to make the prediction. First, the predictions of binary classifiers using multiple windows are compared with those of the single window scheme; the results show that a single window does not perform well in all cases. Two new classifiers are introduced for effective tertiary classification. These new classifiers use neural networks and genetic algorithms to optimize the accuracy of the tertiary classifier. The accuracy levels of the new architectures are determined and compared with other studies. The tertiary architecture is better than most available techniques.
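
The sliding-window, orthogonal (one-hot) encoding mentioned in the abstract above can be sketched as follows. This covers only a single window and omits the BLOSUM62 component and the multi-window scheme the study introduces; the window length of 5, the padding symbol, and the function names are assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD = "-"  # hypothetical symbol for window positions beyond the chain ends
ALPHABET = AMINO_ACIDS + PAD


def one_hot(residue):
    """Orthogonal encoding: a 21-long vector with a single 1 (20 residues + padding)."""
    vec = [0] * len(ALPHABET)
    vec[ALPHABET.index(residue)] = 1  # raises ValueError for unknown residues
    return vec


def encode_windows(sequence, window=5):
    """One feature vector per residue, built from the window centred on it."""
    if window % 2 == 0:
        raise ValueError("window length must be odd so it has a centre residue")
    half = window // 2
    padded = PAD * half + sequence + PAD * half
    encoded = []
    for center in range(len(sequence)):
        features = []
        for residue in padded[center:center + window]:
            features.extend(one_hot(residue))
        encoded.append(features)
    return encoded


rows = encode_windows("ACDEF", window=5)  # 5 vectors of length 5 * 21
```

Each row would be paired with the secondary-structure label of its centre residue when training a classifier such as an SVM.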

ScholarWorks @ Georgia State University

PMeS: Prediction of Methylation Sites Based on Enhanced Feature Encoding Scheme

Author: A Shukla
A Suzuki
AJ Bannister
APL Snijders
B Xiao
BM Turner
C Cortes
C Pang
C Teyssier
CC Chang
CNI Pang
D Plewczynski
DM Shien
DS Johnson
FG Mastronardi
GE Crooks
H Chen
J Sayegh
JF Couture
Jian-Ding Qiu
JJ Gao
JL Fauchere
JL Shao
JL Xu
JM Aleta
KM Daily
L Nanni
LH Dong
LL Hu
M Kiledjian
ME Rudbeck
MR Stallcup
MT Bedford
N Dolzhanskaya
Niall James Haslam
R Predel
RA Varier
Ru-Ping Liang
S Ahmad
S Niu
S Pahlich
Shao-Ping Shi
Sheng-Bao Suo
Shu-Yun Huang
T Rögnvaldsson
TS Rögnvaldsson
VD Longo
VN Lapko
WK Paik
WL Wooderchak
WZ Li
X Chen
Xing-Yu Sun
ZH Zhang
Publication venue: Public Library of Science
Publication date: 15/06/2012

Protein methylation is predominantly found on lysine and arginine residues, and carries many important biological functions, including gene regulation and signal transduction. Given their important involvement in gene expression, protein methylation and its regulatory enzymes are implicated in a variety of human disease states such as cancer, coronary heart disease and neurodegenerative disorders. Thus, identification of methylation sites can be very helpful for drug design against various related diseases. In this study, we developed a method called PMeS to improve the prediction of protein methylation sites based on an enhanced feature encoding scheme and a support vector machine. The enhanced feature encoding scheme was composed of the sparse property coding, normalized van der Waals volume, position weight amino acid composition and accessible surface area. PMeS achieved a promising performance with a sensitivity of 92.45%, a specificity of 93.18%, an accuracy of 92.82% and a Matthews correlation coefficient of 85.69% for arginine, as well as a sensitivity of 84.38%, a specificity of 93.94%, an accuracy of 89.16% and a Matthews correlation coefficient of 78.68% for lysine in 10-fold cross-validation. Compared with other existing methods, PMeS provides better predictive performance and greater robustness. It can be anticipated that PMeS might be useful to guide future experiments needed to identify potential methylation sites in proteins of interest. The online service is available at http://bioinfo.ncu.edu.cn/inquiries_PMeS.aspx

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

FigShare

Cascaded multi-view canonical correlation (CaMCCo) for early diagnosis of Alzheimer's disease via fusion of clinical, imaging and omic Features

Author: Ances Beau
Carroll Maria
et al
Franklin Erin
Mintun Mark
Morris John
Oliver Angela
Schneider Stacy
Shaw Leslie
Publication venue: Digital Commons@Becker
Publication date: 01/01/2017

Digital Commons@Becker

The importance of physicochemical characteristics and nonlinear classifiers in determining HIV-1 protease specificity

Author: Manning Timmy
Walsh Paul
Publication venue: Informa UK Limited
Publication date: 04/12/2015

This paper reviews recent research relating to the application of bioinformatics approaches to determining HIV-1 protease specificity, outlines outstanding issues, and presents a new approach to addressing these issues. Leading machine learning theory for the problem currently suggests that the direct encoding of the physicochemical properties of the amino acid substrates is not required for optimal performance. A number of amino acid encoding approaches that incorporate potentially relevant physicochemical properties of the substrate are identified and evaluated using a nonlinear, task-decomposition-based neuroevolution algorithm. The results are compared against a recent benchmark set by a nonlinear classifier using only amino acid sequence and identity information. Ensembles of these nonlinear classifiers using the physicochemical properties of the substrate are demonstrated to consistently outperform the recently published state-of-the-art linear support vector machine based approach in out-of-sample evaluations.

SWORD (Cork Inst. of Technology)

Machine learning in the analysis of biomolecular simulations

Author: Kaptan Shreyas
Vattulainen Ilpo
Publication date: 31/12/2022

Machine learning has rapidly become a key method for the analysis and organization of large-scale data in all scientific disciplines. In life sciences, the use of machine learning techniques is a particularly appealing idea since the enormous capacity of computational infrastructures generates terabytes of data through millisecond simulations of atomistic and molecular-scale biomolecular systems. Due to this explosion of data, the automation, reproducibility, and objectivity provided by machine learning methods are highly desirable features in the analysis of complex systems. In this review, we focus on the use of machine learning in biomolecular simulations. We discuss the main categories of machine learning tasks, such as dimensionality reduction, clustering, regression, and classification used in the analysis of simulation data. We then introduce the most popular classes of techniques involved in these tasks for the purpose of enhanced sampling, coordinate discovery, and structure prediction. Whenever possible, we explain the scope and limitations of machine learning approaches, and we discuss examples of applications of these techniques.

Helsingin yliopiston digitaalinen arkisto

An Exhaustive Shape-Based Approach for Proteins' Secondary, Tertiary and Quaternary Structures Indexing, Retrieval and Docking

Author: Eric Paquet
Herna L. Viktor
Publication venue: IntechOpen
Publication date: 20/04/2012

IntechOpen