Search CORE

16 research outputs found

Physicochemical property distributions for accurate and rapid pairwise protein homology detection

Author: A Ben-Hur
A Kumar
AG Murzin
AR Shah
B Liu
BJ Webb-Robertson
Bobbie-Jo M Webb-Robertson
C Leslie
Christopher S Oehmen
CS Leslie
H Rangwala
H Saigo
I Jung
I Melvin
J Weston
Kyle G Ratuiste
L Liao
NH Anderson
QW Dong
R Kuang
S Hochreiter
SF Altschul
T Damoulas
T Lingner
TF Smith
WS Noble
Y Hou
Y Yang
Y Yuan
Publication venue: BioMed Central
Publication date: 01/01/2010
Field of study

Abstract. Background: The challenge of remote homology detection is that many evolutionarily related sequences have very little similarity at the amino acid level. Kernel-based discriminative methods, such as support vector machines (SVMs), that use vector representations of sequences derived from sequence properties have been shown to be more accurate than traditional approaches for remote homology detection. Results: We introduce a new method for feature vector representation based on the physicochemical properties of the primary protein sequence. A distribution of physicochemical property scores is assembled from the 4-mers of the sequence and normalized against the null distribution of the property over all possible 4-mers. With this approach, there is little computational cost associated with transforming the protein into feature space, and overall performance in remote homology detection is comparable with current state-of-the-art methods. We demonstrate that the features can be used for pairwise remote homology detection with improved accuracy versus sequence-based methods such as BLAST and other feature-based methods of similar computational cost. Conclusions: A protein feature method based on physicochemical properties is a viable approach for extracting features in a computationally inexpensive manner while retaining the sensitivity of SVM protein homology detection. Furthermore, identifying features that can be used for generic pairwise homology detection in lieu of family-based homology detection is important for applications such as large database searches and comparative genomics.
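
The feature construction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The choices below are assumptions made for illustration only: the Kyte-Doolittle hydropathy scale as the single property, scoring a 4-mer as the sum of its residue values, z-score normalization against the null distribution, and a 10-bin histogram. The function names are hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values (one illustrative property; an assumption,
# the abstract does not say which physicochemical properties are used).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
K = 4  # k-mer length named in the abstract


def null_mean_std(scale, k=K):
    """Mean and standard deviation of the summed property over all 20**k k-mers."""
    values = list(scale.values())
    mean1 = sum(values) / len(values)
    var1 = sum((v - mean1) ** 2 for v in values) / len(values)
    # With every k-mer weighted equally, positions are independent,
    # so per-residue means and variances simply add up.
    return k * mean1, math.sqrt(k * var1)


def property_distribution(seq, scale=HYDROPATHY, k=K, n_bins=10, z_range=3.0):
    """Histogram of null-normalized k-mer scores, returned as a feature vector."""
    mu, sigma = null_mean_std(scale, k)
    counts = [0] * n_bins
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if any(aa not in scale for aa in kmer):
            continue  # skip k-mers containing non-standard residues
        z = (sum(scale[aa] for aa in kmer) - mu) / sigma
        # Map z in [-z_range, z_range) onto a bin index, clipping the tails.
        pos = (z + z_range) / (2 * z_range) * n_bins
        counts[min(n_bins - 1, max(0, int(pos)))] += 1
        total += 1
    return [c / total for c in counts] if total else [0.0] * n_bins


features = property_distribution("MKTAYIAKQRQISFVKSHFSRQ")
```

Each sequence maps to a fixed-length vector regardless of its length, which is what lets the vectors feed a kernel method such as an SVM. Concatenating histograms for several properties would be a natural extension of this sketch.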

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Dimensionality reduction for click-through rate prediction: Dense versus sparse representation

Author: Fruergaard Bjarne Ørum
Hansen Lars Kai
Hansen Toke Jansen
Publication venue
Publication date: 01/01/2013
Field of study

In online advertising, display ads are increasingly placed through real-time auctions in which the winning advertiser gets to serve the ad. This is called real-time bidding (RTB). In RTB, auctions have very tight time constraints, on the order of 100 ms. Mechanisms for bidding intelligently, such as clickthrough rate prediction, therefore need to be sufficiently fast. In this work, we propose to use dimensionality reduction of the user-website interaction graph to produce simplified features of users and websites that can be used as predictors of clickthrough rate. We demonstrate that the Infinite Relational Model (IRM), used as a dimensionality reduction, offers predictive performance comparable to conventional dimensionality reduction schemes while achieving the most economical usage of features and the fastest computations at run-time. For applications such as real-time bidding, where fast database I/O and few computations are key to success, we thus recommend using IRM-based features as predictors to exploit the recommender effects from bipartite graphs. Comment: Presented at the Probabilistic Models for Big Data workshop at NIPS 201

arXiv.org e-Print Archive

Online Research Database In Technology

Protein Remote Homology Detection Based on an Ensemble Learning Approach

Author
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2016
Field of study

Crossref

A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis

Author: A Ben-Hur
A Floratos
AR Shah
B Qian
B Rost
B-J Webb-Robertson
Bin Liu
C Leslie
CG Nevill-Manning
CS Leslie
H Ogul
H Rangwala
H Saigo
I Rigoutsos
J Bellegarda
J Shawe-Taylor
K Karplus
L Holm
L Liao
Lei Lin
M Ganapathiraju
M Gribskov
Q Dong
Qiwen Dong
QJ Su
QW Dong
R Kuang
S Henikoff
SE Brenner
SE Dowd
SF Altschul
T Damoulas
T Håndstad
T Jaakkola
T Lingner
TF Smith
TK Landauer
TL Bailey
VN Vapnik
WR Pearson
WS Noble
Xiaolong Wang
Xuan Wang
Y Hou
Y Yang
Publication venue: BioMed Central
Publication date: 01/01/2008
Field of study

Abstract. Background: Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machines (SVMs) are the most effective and accurate methods for solving these problems. A key step in improving the performance of SVM-based methods is to find a suitable representation of protein sequences. Results: In this paper, a novel building block of proteins called Top-n-grams is presented, which contains the evolutionary information extracted from protein sequence frequency profiles. The frequency profiles are calculated from the multiple sequence alignments produced by PSI-BLAST and converted into Top-n-grams. The protein sequences are transformed into fixed-dimension feature vectors by counting the occurrences of each Top-n-gram. The training vectors are used to train SVM classifiers, which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams and latent semantic analysis (LSA), an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results than related methods. Conclusion: The method based on Top-n-grams significantly outperforms methods based on many other building blocks, including N-grams, patterns, motifs and binary profiles. Therefore, the Top-n-gram is a good building block of protein sequences and can be widely used in many tasks of computational biology, such as sequence alignment, the prediction of domain boundaries, the design of knowledge-based potentials and the prediction of protein binding sites.
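
As a rough illustration of the representation described in this abstract, the sketch below derives one building block per profile position and counts occurrences. It is one reading of the abstract, not the published method: it assumes a Top-n-gram is the n most frequent amino acids at a position, ordered by descending frequency, and that the frequency profile is already available as a list of per-position dictionaries. The function names and the toy profile are hypothetical, and the PSI-BLAST and LSA steps are omitted.

```python
from collections import Counter


def top_n_grams(profile, n=2):
    """Return one Top-n-gram per profile column.

    `profile` is a list of dicts mapping amino acid letters to their
    frequencies at that position (assumed input format).
    """
    grams = []
    for column in profile:
        # Rank residues by descending frequency; break ties alphabetically
        # so that the output is deterministic.
        ranked = sorted(column, key=lambda aa: (-column[aa], aa))
        grams.append("".join(ranked[:n]))
    return grams


def occurrence_vector(profile, n=2):
    """Count how often each Top-n-gram occurs along the sequence."""
    return Counter(top_n_grams(profile, n))


# Toy three-column profile, invented purely for illustration.
toy_profile = [
    {"A": 0.6, "G": 0.3, "S": 0.1},
    {"L": 0.5, "I": 0.4, "V": 0.1},
    {"A": 0.7, "G": 0.2, "S": 0.1},
]
counts = occurrence_vector(toy_profile, n=2)
```

Indexing the counts over a fixed vocabulary of Top-n-grams would give the fixed-dimension vector the abstract mentions; a `Counter` is used here only to keep the sketch short.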

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

PDNAsite: identification of DNA-binding site from protein sequence by incorporating spatial and sequence context

Author: A Bochkarev
AN Bullock
AP Bradley
B Liu
C Yan
CA BDavey
CC Chang
CO Pabo
EW Stawiski
H Tjong
HM Berman
IB Kuznetsov
J Wu
JA Swets
KL Griffith
L Wang
M Ptashne
M Radlinska
M Terribilini
MY Gutfreund
N Bhardwaj
NM Luscombe
P Ozbek
QW Dong
R Liu
R Xu
RD Kornberg
S Ahmad
S Hwang
SY Ho
T Li
W Kabsch
X Ma
X Zhao
Y Ofran
YC Chen
Z Yuan
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016
Field of study

Protein-DNA interactions are involved in many fundamental biological processes essential for cellular function. Most existing computational approaches use only the sequence context of the target residue for prediction. In the present study, for each target residue, we applied both the spatial context and the sequence context to construct the feature space. Subsequently, Latent Semantic Analysis (LSA) was applied to remove redundancies in the feature space. Finally, a predictor (PDNAsite) was developed by integrating a support vector machine (SVM) classifier with ensemble learning. Results on the PDNA-62 and PDNA-224 datasets demonstrate that features extracted from the spatial context provide more information than those from the sequence context, and that combining them yields a further performance gain. An analysis of the number of binding sites in the spatial context of the target site indicates that interactions between neighboring binding sites are important for protein-DNA recognition and binding ability. A comparison between the proposed PDNAsite method and existing methods indicates that PDNAsite outperforms most of them and is a useful tool for DNA-binding site identification. A web server for our predictor (http://hlt.hitsz.edu.cn:8080/PDNAsite/) is freely accessible to the biological research community.

The Hong Kong Polytechnic University Pao Yue-kong Library

Crossref

PolyU Institutional Repository

PubMed Central

Aston Publications Explorer

A discriminative method for family-based protein remote homology detection that combines inductive logic programming and propositional models

Author: A Andreeva
A Ben-Hur
A Karwath
A Shah
Alessandra Carbone
B Liu
B Qian
B Webb-Robertson
C Ferreira
C Leslie
D Higgins
F Wilcoxon
G Yona
Gerson Zaverucha
H Rangwala
H Saigo
J Bernardes
J Davis
J Gough
J Quinlan
J Soeding
J Weston
Juliana S Bernardes
L De Raedt
L Dehaspe
L Liao
N Shan-Hwei
Q Dong
Q Su
R Agrawal
R Hughey
R King
R Kuang
R Sadreyev
S Altschul
S Brenner
S Eddy
S Kawashima
S Lee
T Handstad
T Jaakkola
T Lingner
U Syed
V Alexandrov
V Atalay
Y Hou
Publication venue: BioMed Central
Publication date: 01/01/2011
Field of study

Abstract. Background: Remote homology detection is a hard computational problem. Most approaches have trained computational models using either full protein sequences or multiple sequence alignments (MSA), including all positions. However, for proteins in the "twilight zone", only some segments of the sequences (motifs) are conserved. We introduce a novel logical representation that allows us to represent physico-chemical properties of sequences, conserved amino acid positions and conserved physico-chemical positions in the MSA. From this, Inductive Logic Programming (ILP) finds the most frequent patterns (motifs) and uses them to train propositional models, such as decision trees and support vector machines (SVMs). Results: We use the SCOP database to perform our experiments, evaluating protein recognition within the same superfamily. Our results show that our methodology, when using SVM, performs significantly better than some state-of-the-art methods and comparably to others. In addition, our method provides a comprehensible set of logical rules that can help to understand what determines a protein function. Conclusions: The strategy of selecting only the most frequent patterns is effective for remote homology detection. This is possible through a suitable first-order logical representation of homologous properties, and through a set of frequent patterns, found by an ILP system, that summarizes essential features of protein functions.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

HAL-Inserm

PubMed Central

Deep learning-based k(cat) prediction enables improved enzyme-constrained model reconstruction

Author: Chen Yu
Engqvist Martin
Kerkhoven Eduard
Li Feiran
Li Gang
Lu Hongzhong
Nielsen Jens B
Yuan Le
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2022
Field of study

Enzyme turnover numbers (k(cat)) are key to understanding cellular metabolism, proteome allocation and physiological diversity, but experimentally measured k(cat) data are sparse and noisy. Here we provide a deep learning approach (DLKcat) for high-throughput k(cat) prediction for metabolic enzymes from any organism merely from substrate structures and protein sequences. DLKcat can capture k(cat) changes for mutated enzymes and identify amino acid residues with a strong impact on k(cat) values. We applied this approach to predict genome-scale k(cat) values for more than 300 yeast species. Additionally, we designed a Bayesian pipeline to parameterize enzyme-constrained genome-scale metabolic models from predicted k(cat) values. The resulting models outperformed the corresponding original enzyme-constrained genome-scale metabolic models from previous pipelines in predicting phenotypes and proteomes, and enabled us to explain phenotypic differences. DLKcat and the enzyme-constrained genome-scale metabolic model construction pipeline are valuable tools to uncover global trends of enzyme kinetics and physiological diversity, and to further elucidate cellular metabolism on a large scale

Chalmers Research

Protein Remote Homology Detection Based on an Ensemble Learning Approach

Author: Bingquan Liu
Dong Huang
Junjie Chen
Publication venue
Publication date: 06/03/2020
Field of study

Protein remote homology detection is one of the central problems in bioinformatics. Although several computational methods have been proposed, the problem is still far from solved. In this paper, an ensemble classifier for protein remote homology detection, called SVM-Ensemble, is proposed with a weighted voting strategy. SVM-Ensemble combines three basic classifiers based on different feature spaces: Kmer, ACC, and SC-PseAAC. These features capture the characteristics of proteins from different perspectives, incorporating both the sequence composition and the sequence-order information along the protein sequences. Experimental results on a widely used benchmark dataset show that the proposed SVM-Ensemble clearly improves predictive performance for protein remote homology detection and outperforms other state-of-the-art methods.
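
The weighted voting strategy named in this abstract can be sketched in a few lines. This is a generic weighted vote, not the published SVM-Ensemble: the abstract does not say how the weights are chosen, so the weights and predictions below are invented for illustration, and the function name is hypothetical. Only the three feature-space names (Kmer, ACC, SC-PseAAC) come from the abstract.

```python
def weighted_vote(predictions, weights):
    """Combine per-classifier labels (+1 or -1) into one ensemble label.

    `predictions` and `weights` are dicts keyed by classifier name.
    """
    if predictions.keys() != weights.keys():
        raise ValueError("every classifier needs exactly one weight")
    # Each classifier pushes the score toward its own label,
    # in proportion to its weight.
    score = sum(weights[name] * label for name, label in predictions.items())
    return 1 if score >= 0 else -1


# Hypothetical outputs of the three base classifiers for one test protein.
predictions = {"Kmer": 1, "ACC": -1, "SC-PseAAC": -1}

# Hypothetical weights; in practice they might reflect validation accuracy.
weights = {"Kmer": 0.5, "ACC": 0.2, "SC-PseAAC": 0.2}

label = weighted_vote(predictions, weights)
```

With these invented weights the single Kmer vote outweighs the other two, which is the difference between weighted voting and a plain majority vote.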

CiteSeerX

Distributed nonnegative matrix factorization for web-scale dyadic data analysis on mapreduce

Author: Chao Liu
Hung-chih Yang
Jinliang Fan
Li-wei He
Yi-min Wang
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2010
Field of study

The Web abounds with dyadic data that keeps growing every second. Previous work has repeatedly shown the usefulness of extracting the interaction structure inside dyadic data [21, 9, 8]. A commonly used tool for extracting the underlying structure is matrix factorization, whose fame was further boosted by the Netflix challenge [26]. When we tried to replicate the same success on real-world Web dyadic data, we were seriously challenged by the scalability of available tools. In this paper, we therefore report our efforts on scaling up the nonnegative matrix factorization (NMF) technique. We show that by carefully partitioning the data and arranging the computations to maximize data locality and parallelism, factorizing a tens-of-millions by hundreds-of-millions matrix with billions of nonzero cells can be accomplished within tens of hours. This result effectively assures practitioners of the scalability of NMF on Web-scale dyadic data.
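
For readers unfamiliar with NMF, the following single-machine sketch shows the factorization itself. It uses the standard Lee-Seung multiplicative updates for the squared-error objective, which is an assumption: the abstract does not name the update rule, and the paper's partitioning and MapReduce scheme is not reproduced here. The function names and the toy matrix are hypothetical.

```python
import random


def matmul(a, b):
    """Dense matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def frobenius_error(v, w, h):
    """Squared Frobenius norm of V - WH."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factor nonnegative V (m x n) into W (m x rank) and H (rank x n)."""
    rng = random.Random(seed)
    m, n = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(m)]
    h = [[rng.random() + 0.1 for _ in range(n)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise.
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[a][j] * num[a][j] / (den[a][j] + eps) for j in range(n)]
             for a in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise.
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][a] * num[i][a] / (den[i][a] + eps) for a in range(rank)]
             for i in range(m)]
    return w, h


# Toy dyadic matrix (for example, users by items), invented for illustration.
V = [[5, 3, 0, 1], [4, 0, 0, 1], [1, 1, 0, 5], [1, 0, 0, 4], [0, 1, 5, 4]]
W, H = nmf(V, rank=2)
```

Because the updates only multiply by nonnegative ratios, W and H stay nonnegative throughout. The matrix products in each update are the parts a distributed implementation would need to partition.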

CiteSeerX

Crossref