
1,594 research outputs found

Toward a multilevel representation of protein molecules: comparative approaches to the aggregation/folding propensity problem

Author: Giuliani Alessandro
Livi Lorenzo
Rizzi Antonello
Publication venue: 'Elsevier BV'
Publication date: 29/04/2015

This paper builds upon the fundamental work of Niwa et al. [34], which provides the unique possibility to analyze the relative aggregation/folding propensity of the elements of the entire Escherichia coli (E. coli) proteome in a cell-free standardized microenvironment. The hardness of the problem comes from the superposition of the driving forces of intra- and inter-molecule interactions, and it is mirrored by evidence that single-point mutations can shift a protein from a folding to an aggregation phenotype [10]. Here we apply several state-of-the-art classification methods from the field of structural pattern recognition, with the aim of comparing different representations of the same proteins gathered from the Niwa et al. database; such representations include sequences and labeled (contact) graphs enriched with chemico-physical attributes. Through this comparison, we are also able to identify some interesting general properties of proteins. Notably, (i) we suggest a threshold around 250 residues discriminating "easily foldable" from "hardly foldable" molecules, consistent with other independent experiments, and (ii) we highlight the relevance of contact graph spectra for folding behavior discrimination and characterization of the E. coli solubility data. The soundness of the experimental results presented in this paper is supported by the statistically relevant relationships discovered between the chemico-physical description of proteins and the developed cost matrix of substitution used in the various discrimination systems.

arXiv.org e-Print Archive

Archivio della ricerca- Università di Roma La Sapienza

Hidden Markov models Incorporating fuzzy measures and integrals for protein sequence identification and alignment

Author: Bidargaddi Niranjan
Chetty Madhu
Kamruzzaman Joarder
Publication venue: 'Elsevier BV'
Publication date: 01/01/2008

Profile hidden Markov models (HMMs) based on classical HMMs have been widely applied for protein sequence identification. The formulation of the forward and backward variables in profile HMMs is made under the statistical independence assumption of probability theory. We propose a fuzzy profile HMM to overcome the limitations of that assumption and to achieve an improved alignment for protein sequences belonging to a given family. The proposed model fuzzifies the forward and backward variables by incorporating Sugeno fuzzy measures and Choquet integrals, thus further extending the generalized HMM. Based on the fuzzified forward and backward variables, we propose a fuzzy Baum-Welch parameter estimation algorithm for profiles. The strong correlations and the sequence preference involved in protein structures make this fuzzy-architecture-based model a suitable candidate for building profiles of a given family, since the fuzzy set can handle uncertainties better than classical methods.
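
For orientation, the classical forward variable that this abstract proposes to fuzzify can be sketched as follows. This is a minimal sketch of the standard forward recursion for a plain discrete HMM, not the authors' fuzzy profile HMM; the two-state model and all of its parameters (START, TRANS, EMIT) are made up for illustration.

```python
def forward(obs, start, trans, emit):
    """Return P(obs) under a discrete HMM using the forward recursion.

    obs   : list of observed symbol indices
    start : start[j]    = P(first state is j)
    trans : trans[i][j] = P(next state is j | current state is i)
    emit  : emit[j][o]  = P(symbol o | state j)
    """
    n_states = len(start)
    # alpha[j] = P(o_1 .. o_t, state at time t is j)
    alpha = [start[j] * emit[j][obs[0]] for j in range(n_states)]
    for symbol in obs[1:]:
        # Additive step: sum over all predecessor states i.
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n_states)) * emit[j][symbol]
            for j in range(n_states)
        ]
    return sum(alpha)


# Hypothetical toy model: 2 hidden states, 2 observable symbols.
START = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]

likelihood = forward([0, 1, 0], START, TRANS, EMIT)
```

The sum over predecessor states is the additive step that, according to the abstract, the fuzzy model replaces with Choquet integrals taken with respect to Sugeno fuzzy measures.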

Elsevier - Publisher Connector

Federation ResearchOnline

PubMed Central

Application of compression-based distance measures to protein sequence classification: a methodological study

Author: András Kocsor
Attila Kertész-Farkas
László Kaján
Sándor Pongor
Publication date: 29/11/2005

Motivation: Distance measures built on the notion of text compression have been used for the comparison and classification of entire genomes and mitochondrial genomes. The present study was undertaken in order to explore their utility in the classification of protein sequences. Results: We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith–Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, 3-phosphoglycerate kinase sequences selected from archaean, bacterial and eukaryotic species, as well as low- and high-complexity sequence segments of the human proteome. CBM values show a dependence on the length and the complexity of the sequences compared. In classification tasks, CBMs performed especially well on distantly related proteins, where the performance of a combined measure, constructed from a CBM and a BLAST score, approached or even slightly exceeded that of the Smith–Waterman algorithm and two hidden Markov model-based algorithms.
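
As an illustration of what a compression-based distance looks like in practice, the sketch below computes the widely used normalized compression distance. It is a minimal sketch under stated assumptions: it uses Python's zlib (DEFLATE) as a stand-in compressor rather than the compressors named in the abstract, the exact distance formula used in the study may differ, and the two sequences are random toy strings, not real proteins.

```python
import random
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed data (maximum compression)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two sequences."""
    bx, by = x.encode("ascii"), y.encode("ascii")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)  # size of the concatenation
    return (cxy - min(cx, cy)) / max(cx, cy)


# Toy data (assumption): two unrelated random strings over the 20 amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
rng = random.Random(0)
SEQ_A = "".join(rng.choice(AMINO_ACIDS) for _ in range(200))
SEQ_B = "".join(rng.choice(AMINO_ACIDS) for _ in range(200))

d_same = ncd(SEQ_A, SEQ_A)   # close to 0: the second copy compresses almost for free
d_diff = ncd(SEQ_A, SEQ_B)   # larger: the sequences share little structure
```

Because compressed sizes depend on sequence length and complexity, distances of this kind inherit the dependence that the abstract reports.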

Open Access Repository

Machine Learning and Integrative Analysis of Biomedical Big Data.

Author: Choi Howard
Chung Neo Christopher
Mirza Bilal
Ping Peipei
Wang Jie
Wang Wei
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., genome) is analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and advancements in precision medicine. However, data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to effectively and efficiently perform integrative analysis of biomedical data acquired from diverse modalities. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: curse of dimensionality, data heterogeneity, missing data, class imbalance and scalability issues.

Multidisciplinary Digital Publishing Institute

Ezid

Directory of Open Access Journals

eScholarship - University of California

Gene ontology based transfer learning for protein subcellular localization

Author: Suyu Mei
Wang Fei
Shuigeng Zhou
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Prediction of protein subcellular localization generally involves many complex factors, and using only one or two aspects of data information may not tell the true story. For this reason, some recent predictive models are deliberately designed to integrate multiple heterogeneous data sources for exploiting multi-aspect protein feature information. Gene ontology, hereinafter referred to as GO, uses a controlled vocabulary to depict biological molecules or gene products in terms of biological process, molecular function and cellular component. With the rapid expansion of annotated protein sequences, gene ontology has become a general protein feature that can be used to construct predictive models in computational biology. Existing models generally either concatenate the GO terms into a flat binary vector or apply majority-vote-based ensemble learning for protein subcellular localization, neither of which can estimate the individual discriminative abilities of the three aspects of gene ontology. Results: In this paper, we propose a Gene Ontology Based Transfer Learning Model (GO-TLM) for large-scale protein subcellular localization. The model transfers the signature-based homologous GO terms to the target proteins, and further constructs a reliable learning system to reduce the adverse effect of potentially false GO terms that result from evolutionary divergence. We derive three GO kernels from the three aspects of gene ontology to measure the GO similarity of two proteins, and derive two other spectrum kernels to measure the similarity of two protein sequences. We use simple non-parametric cross validation to explicitly weigh the discriminative abilities of the five kernels, such that the time and space complexities are greatly reduced compared with the complicated semi-definite programming and semi-infinite linear programming. The five kernels are then linearly merged into one single kernel for protein subcellular localization. We evaluate GO-TLM performance against three baseline models, MultiLoc, MultiLoc-GO and Euk-mPLoc, on the benchmark datasets that the baseline models adopted. 5-fold cross validation experiments show that GO-TLM achieves a substantial accuracy improvement over the baseline models: 80.38% against 67.40% for Euk-mPLoc, an increase of 12.98%; 96.65% and 96.27% against 89.60% and 89.60% for MultiLoc-GO, an increase of 7.05% and 6.67% on the MultiLoc plant and MultiLoc animal datasets, respectively; and 97.14%, 95.90% and 96.85% against 83.70%, 90.10% and 85.70% for MultiLoc-GO, an increase of 13.44%, 5.8% and 11.15% on the BaCelLoc plant, BaCelLoc fungi and BaCelLoc animal datasets, respectively. For the BaCelLoc independent sets, GO-TLM achieves 81.25%, 80.45% and 79.46% on the BaCelLoc plant holdout, BaCelLoc fungi holdout and BaCelLoc animal holdout datasets, respectively, compared with 76%, 60.00% and 73.00% for the baseline model MultiLoc-GO, an increase of 5.25%, 20.45% and 6.46%, respectively. Conclusions: Since direct homology-based GO term transfer may be prone to introducing noise and outliers to the target protein, we design an explicitly weighted kernel learning system (called Gene Ontology Based Transfer Learning Model, GO-TLM) to transfer to the target protein the known knowledge about related homologous proteins, which can reduce the risk of outliers and share knowledge between homologous proteins, and thus achieve better predictive performance for protein subcellular localization. Cross validation and independent test experimental results show that the homology-based GO term transfer and explicitly weighing the GO kernels substantially improve the prediction performance.
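
The idea of a spectrum kernel and of linearly merging several kernels can be sketched as below. This is a minimal illustration, not the GO-TLM implementation: the k-mer length, the toy sequences and the kernel weights are hypothetical, and in the paper the weights come from cross validation rather than being fixed by hand.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s: str, t: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of two sequences."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(cs[m] * ct[m] for m in cs.keys() & ct.keys())


def combine_kernels(kernel_values, weights):
    """Linear combination of kernel values with explicit weights."""
    return sum(w * v for w, v in zip(weights, kernel_values))


# Hypothetical toy example: two spectrum kernels (k=2 and k=3) merged with
# made-up weights; a full model would also include the three GO kernels.
s, t = "ABCABC", "ABC"
values = [spectrum_kernel(s, t, k=2), spectrum_kernel(s, t, k=3)]
merged = combine_kernels(values, weights=[0.25, 0.75])
```

A non-negative weighted sum of valid kernels is itself a valid kernel, which is why the merged value can be passed directly to a kernel classifier such as an SVM.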

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Recognition of short functional motifs in protein sequences

Author: Prytuliak Roman
Publication venue: Ludwig-Maximilians-Universität München
Publication date: 22/06/2018

The main goal of this study was to develop a method for computational de novo prediction of short linear motifs (SLiMs) in protein sequences that would provide advantages over existing solutions for the users. The users are typically biological laboratory researchers who want to elucidate a function of a protein that is possibly mediated by a short motif. Such a function can be subcellular localization, secretion, post-translational modification or degradation of proteins. Conducting such studies only with experimental techniques is often associated with high costs and risks of uncertainty. Preliminary prediction of putative motifs with computational methods, which are fast and much less expensive, makes it possible to generate hypotheses and therefore to plan experiments in a more directed and efficient way. To meet this goal, I have developed HH-MOTiF, a web-based tool for de novo discovery of SLiMs in a set of protein sequences. While working on the project, I have also detected patterns in sequence properties of certain SLiMs that make their de novo prediction easier. As some of these patterns are not yet described in the literature, I am sharing them in this thesis. While evaluating and comparing motif prediction results, I have identified conceptual gaps in theoretical studies, as well as in existing practical solutions, for comparing two sets of positional data annotating the same set of biological sequences. To close these gaps and to be able to carry out in-depth performance analyses of HH-MOTiF in comparison to other predictors, I have developed a corresponding statistical method, SLALOM (for StatisticaL Analysis of Locus Overlap Method). It is currently available as a standalone command line tool.

Unsupervised Intrusion Detection with Cross-Domain Artificial Intelligence Methods

Author: San Miguel Carrasco Rafael
Publication date: 01/01/2021

Cybercrime is a major concern for corporations, business owners, governments and citizens, and it continues to grow in spite of increasing investments in security and fraud prevention. The main challenges in this research field are detecting unknown attacks and reducing the false positive ratio. The aim of this research work was to target both problems by leveraging four artificial intelligence techniques. The first technique is a novel unsupervised learning method based on skip-gram modeling. It was designed, developed and tested against a public dataset with popular intrusion patterns. A high accuracy and a low false positive rate were achieved without prior knowledge of attack patterns. The second technique is a novel unsupervised learning method based on topic modeling. It was applied to three related domains (network attacks, payments fraud, IoT malware traffic). A high accuracy was achieved in the three scenarios, even though the malicious activity differs significantly from one domain to another. The third technique is a novel unsupervised learning method based on deep autoencoders, with feature selection performed by a supervised method, random forest. The results obtained showed that this technique can outperform other similar techniques. The fourth technique is based on an MLP neural network, and is applied to alert reduction in fraud prevention. This method automates manual reviews previously done by human experts, without significantly impacting accuracy.

e_Buah - Biblioteca Digital de la Universidad de Alcalá

Multiconstrained gene clustering based on generalized projections

Author: Jia Zeng
Shanfeng Zhu
Alan Wee-Chung Liew
Hong Yan
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Gene clustering for annotating gene functions is one of the fundamental issues in bioinformatics. The best clustering solution is often regularized by multiple constraints such as gene expressions, Gene Ontology (GO) annotations and gene network structures. How to integrate multiple constraints into an optimal clustering solution remains an unsolved problem. Results: We propose a novel multiconstrained gene clustering (MGC) method within the generalized projection onto convex sets (POCS) framework used widely in image reconstruction. Each constraint is formulated as a corresponding set. The generalized projector iteratively projects the clustering solution onto these sets in order to find a consistent solution, included in the intersection set, that satisfies all constraints. Compared with previous MGC methods, POCS can integrate multiple constraints of different natures without distorting the original constraints. To evaluate the clustering solution, we also propose a new performance measure, referred to as Gene Log Likelihood (GLL), that considers genes having more than one function and hence belonging to more than one cluster. Comparative experimental results show that our POCS-based gene clustering method outperforms current state-of-the-art MGC methods. Conclusions: The POCS-based MGC method can successfully combine multiple constraints of different natures for gene clustering. Also, the proposed GLL is an effective performance measure for soft clustering solutions.
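
The core POCS idea, repeatedly projecting a candidate solution onto each constraint set until it lies in their intersection, can be sketched on a toy problem. This is a minimal sketch and not the authors' generalized projector: the two constraint sets (a box and a fixed-sum hyperplane, loosely resembling soft cluster memberships that lie in [0, 1] and sum to 1) and the starting vector are assumptions chosen for illustration.

```python
def project_box(x, lo=0.0, hi=1.0):
    """Euclidean projection onto the box lo <= x_i <= hi."""
    return [min(max(v, lo), hi) for v in x]


def project_sum(x, total=1.0):
    """Euclidean projection onto the hyperplane sum(x) == total."""
    shift = (total - sum(x)) / len(x)
    return [v + shift for v in x]


def pocs(x, n_iter=200):
    """Alternate the two projections; for convex sets with a non-empty
    intersection the iterates converge to a point in that intersection."""
    for _ in range(n_iter):
        x = project_sum(project_box(x))
    return x


# Hypothetical starting point that violates both constraints.
memberships = pocs([2.0, -1.0, 0.5])
```

For this starting point the iterates approach (0.75, 0, 0.25), which satisfies both constraints at once; each additional constraint in a POCS scheme contributes one more projection step inside the loop.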

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Hidden Markov model and Chapman Kolmogrov for protein structures prediction from images

Author: Amira S. Ashour
João Manuel R. S. Tavares
Linkon Chowdhury
Md. Sarwar Kamal
Mohammad Ibrahim Khan
Nilanjan Dey
Publication venue: 'Elsevier BV'
Publication date: 01/06/2017

Protein structure prediction and analysis are important for accurately assessing the functionality of living organs. Several protein structure prediction methods use neural networks (NNs). However, the hidden Markov model is more interpretable and effective for biological data analysis than the NN. It employs statistical data analysis to enhance the prediction accuracy. The current work proposes a protein prediction approach from protein images based on the hidden Markov model and the Chapman-Kolmogorov equation. Initially, a preprocessing stage binarizes the protein images using the Otsu technique in order to convert each protein image into a binary matrix. Subsequently, two counting algorithms, namely flood fill and Warshall, are employed to classify the protein structures. Finally, the hidden Markov model and the Chapman-Kolmogorov equation are applied to the classified structures to predict the protein structure. The execution time and algorithmic performance are measured to evaluate the primary, secondary and tertiary protein structure prediction.
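
For readers unfamiliar with the Chapman-Kolmogorov equation, the sketch below checks it numerically for a small Markov chain: the (m+n)-step transition matrix equals the product of the m-step and n-step matrices. This is a minimal illustration only; the three states and the transition probabilities are made up and are not taken from the paper.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def n_step(p, n):
    """n-step transition matrix P^n of a Markov chain with one-step matrix p."""
    size = len(p)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = matmul(result, p)
    return result


# Hypothetical one-step transition matrix over three states
# (for example helix, strand, coil); each row sums to 1.
P = [
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
]

# Chapman-Kolmogorov: P^(2+3) should equal P^2 multiplied by P^3.
lhs = n_step(P, 5)
rhs = matmul(n_step(P, 2), n_step(P, 3))
max_gap = max(abs(lhs[i][j] - rhs[i][j]) for i in range(3) for j in range(3))
```

Here max_gap stays at floating-point rounding level, which is the numerical counterpart of the identity; how the paper combines this equation with the hidden Markov model is not specified in the abstract.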

Crossref

Repositório Aberto da Universidade do Porto