Public Research Access Institutional Repository and Information Exchange

FigShare

Development of Computational Techniques for Regulatory DNA Motif Identification Based on Big Biological Data

Author: Yang Jinyu
Publication venue: Open PRAIRIE: Open Public Research Access Institutional Repository and Information Exchange
Publication date: 01/01/2017

Accurate identification of regulatory DNA motifs (hereafter, motifs) plays a fundamental role in elucidating transcriptional regulatory mechanisms in a cell and can strongly support regulatory network construction for both prokaryotic and eukaryotic organisms. Next-generation sequencing techniques generate a huge amount of biological data for motif identification. Specifically, chromatin immunoprecipitation followed by high-throughput DNA sequencing (ChIP-seq) enables researchers to identify motifs on a genome scale. Recently, technological improvements have allowed DNA structural information to be obtained in a high-throughput manner, which can provide four DNA shape features. DNA shape has been found to complement genomic sequence in predicting transcription factor (TF)-DNA binding specificity with traditional machine learning models. Recent studies have demonstrated that deep learning (DL), especially the convolutional neural network (CNN), enables identification of motifs directly from DNA sequence. Although numerous algorithms and tools have been proposed and developed in this field, (1) the lack of intuitive and integrative web servers impedes effective use of emerging algorithms and tools; (2) DNA shape has not been integrated with DL; and (3) existing DL models still suffer from high false positive and false negative rates in motif identification. This thesis focuses on developing an integrated web server for motif identification based on DNA sequences supplied either by users or by built-in databases. This web server allows further motif-related analysis and Cytoscape-like network interpretation and visualization. We then proposed a DL framework for both sequence and shape motif identification from ChIP-seq data using a binomial distribution strategy. This framework can accept different combinations of DNA sequence and DNA shape as input. Finally, we developed a gated convolutional neural network (GCNN) for capturing motif dependencies among long DNA sequences. Results show that our web server provides more comprehensive motif analysis functionality than existing web servers. The DL framework can identify motifs using an optimized threshold and reveals the strong predictive power of DNA shape in TF-DNA binding specificity. The identified sequence and shape motifs can contribute to the interpretation of TF-DNA binding mechanisms. Additionally, GCNN improves TF-DNA binding specificity prediction over CNN on most of the datasets.
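The abstract above does not spell out how its GCNN gates a convolution. As a minimal sketch, assuming the common "gated linear unit" form (one convolution's output multiplied by the sigmoid of a second convolution), the idea can be illustrated on a one-hot encoded DNA sequence. The kernels, the gate bias, and the toy sequence are illustrative assumptions, not values from the thesis.

```python
import math

BASES = "ACGT"

def one_hot(seq, scale=1.0):
    """Encode a DNA string as rows of length 4 (A, C, G, T); unknown bases become all zeros."""
    return [[scale if base == b else 0.0 for b in BASES] for base in seq.upper()]

def conv1d(x, kernel, bias=0.0):
    """Valid 1D cross-correlation of an (L x 4) input with a (k x 4) kernel."""
    k = len(kernel)
    out = []
    for start in range(len(x) - k + 1):
        total = bias
        for i in range(k):
            for j in range(4):
                total += x[start + i][j] * kernel[i][j]
        out.append(total)
    return out

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gated_conv1d(x, value_kernel, gate_kernel, gate_bias=0.0):
    """Assumed gating form: value convolution times sigmoid of a gate convolution."""
    values = conv1d(x, value_kernel)
    gates = conv1d(x, gate_kernel, gate_bias)
    return [v * sigmoid(g) for v, g in zip(values, gates)]

# Toy example: both kernels are hand-set to respond to the 4-mer "TATA".
sequence = one_hot("GGCTATAAGC")
activations = gated_conv1d(
    sequence,
    value_kernel=one_hot("TATA"),             # counts matching bases per window
    gate_kernel=one_hot("TATA", scale=4.0),   # gate opens only near a full match
    gate_bias=-14.0,
)
best_position = activations.index(max(activations))
```

In a trained network both kernels would be learned; the hand-set kernels here only show how the gate suppresses partial matches while passing the full match.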

Species abundance information improves sequence taxonomy classification accuracy.

Author: Bokulich Nicholas A
Caporaso J Gregory
Huttley Gavin A
Kaehler Benjamin D
Knight Rob
McDonald Daniel
Publication venue: eScholarship, University of California
Publication date: 01/10/2019

Popular naive Bayes taxonomic classifiers for amplicon sequences assume that all species in the reference database are equally likely to be observed. We demonstrate that classification accuracy degrades linearly with the degree to which that assumption is violated, and in practice it is always violated. By incorporating environment-specific taxonomic abundance information, we demonstrate a significant increase in species-level classification accuracy across common sample types. At the species level, overall average error rates decline from 25% to 14%, which compares favourably with the error rates that existing classifiers achieve at the genus level (16%). Our findings indicate that for most practical purposes, the assumption that reference species are equally likely to be observed is untenable. q2-clawback provides a straightforward alternative for samples from common environments.
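To make the role of the prior concrete, here is a minimal sketch of a k-mer multinomial naive Bayes classifier in which the uniform class prior can be swapped for abundance-based weights. It does not reproduce q2-clawback or any QIIME 2 API; the function names, the toy reference sequences, and the prior values are all assumptions for illustration.

```python
import math
from collections import Counter

def kmers(seq, k=3):
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def train(reference, k=3):
    """reference: dict taxon -> list of sequences. Returns per-taxon k-mer
    log-likelihoods with add-one smoothing over the shared vocabulary."""
    vocab = {km for seqs in reference.values() for s in seqs for km in kmers(s, k)}
    model = {}
    for taxon, seqs in reference.items():
        counts = Counter(km for s in seqs for km in kmers(s, k))
        total = sum(counts.values()) + len(vocab)
        model[taxon] = {km: math.log((counts[km] + 1) / total) for km in vocab}
    return model

def classify(seq, model, priors=None, k=3):
    """Return the taxon with the highest log posterior.
    priors=None reproduces the uniform assumption criticised in the abstract."""
    taxa = list(model)
    if priors is None:
        priors = {t: 1.0 / len(taxa) for t in taxa}
    scores = {}
    for taxon in taxa:
        score = math.log(priors[taxon])
        for km in kmers(seq, k):
            if km in model[taxon]:          # k-mers outside the vocabulary are skipped
                score += model[taxon][km]
        scores[taxon] = score
    return max(scores, key=scores.get)

# Two near-identical toy reference taxa and a read that barely separates them.
reference = {"taxon_a": ["ACGTACGTAC"], "taxon_b": ["ACGTACGTAG"]}
model = train(reference)
read = "ACGTACGT"
uniform_call = classify(read, model)
weighted_call = classify(read, model, priors={"taxon_a": 0.1, "taxon_b": 0.9})
```

With uniform priors the read goes to taxon_a on a small likelihood margin; once the hypothetical environment says taxon_b is nine times more abundant, the call flips to taxon_b.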

Repository for Publications and Research Data

eScholarship - University of California

A multilayer network approach for guiding drug repositioning in neglected diseases

Author: Berenstein Ariel José
Chernomoretz Ariel
Fernandez Aguero Maria Jose
Magariños María Paula
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2016

Drug development for neglected diseases has historically been hampered by a lack of market incentives. The advent of public domain resources containing chemical information from high-throughput screenings is changing the landscape of drug discovery for these diseases. In this work we took advantage of data from extensively studied organisms such as human, mouse, E. coli and yeast, among others, to develop a novel integrative network model to prioritize and identify candidate drug targets in neglected pathogen proteomes, and bioactive drug-like molecules. We modeled genomic (proteins) and chemical (bioactive compounds) data as a multilayer weighted network graph that takes advantage of bioactivity data across 221 species, chemical similarities between 1.7 × 10^5 compounds and several functional relations among 1.67 × 10^5 proteins. These relations comprised orthology, sharing of protein domains, and shared participation in defined biochemical pathways. We showcase the application of this network graph to the problem of prioritizing new candidate targets, based on the information available in the graph for known compound-target associations. We validated this strategy by performing a cross-validation procedure for known mouse and Trypanosoma cruzi targets and showed that our approach outperforms classic alignment-based approaches. Moreover, our model provides additional flexibility, as two different network definitions could be considered, yielding in both cases qualitatively different but sensible candidate targets. We also showcase the application of the network to suggest targets for orphan compounds that are active against Plasmodium falciparum in high-throughput screens. In this case our approach provided a reduced prioritization list of target proteins for the query molecules and showed the ability to propose new testable hypotheses for each compound.
Moreover, we found that some predictions highlighted by our network model were supported by independent experimental validations found post facto in the literature.

Affiliation: Berenstein, Ariel José. Fundación Instituto Leloir; Argentina. Universidad de Buenos Aires. Facultad de Ingeniería. Departamento de Física; Argentina
Affiliation: Magariños, María Paula. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata. Instituto de Investigaciones Biotecnológicas. Instituto de Investigaciones Biotecnológicas "Dr. Raúl Alfonsín" (sede Chascomús). Universidad Nacional de San Martín. Instituto de Investigaciones Biotecnológicas. Instituto de Investigaciones Biotecnológicas "Dr. Raúl Alfonsín" (sede Chascomús); Argentina
Affiliation: Chernomoretz, Ariel. Fundación Instituto Leloir; Argentina. Universidad de Buenos Aires. Facultad de Ingeniería. Departamento de Física; Argentina
Affiliation: Fernandez Aguero, Maria Jose. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - La Plata. Instituto de Investigaciones Biotecnológicas. Instituto de Investigaciones Biotecnológicas "Dr. Raúl Alfonsín" (sede Chascomús). Universidad Nacional de San Martín. Instituto de Investigaciones Biotecnológicas. Instituto de Investigaciones Biotecnológicas "Dr. Raúl Alfonsín" (sede Chascomús); Argentina
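The abstract for this record describes prioritizing targets from known compound-target associations in a weighted multilayer graph. A minimal sketch of one way to do that, assuming a simple neighbour-voting ("guilt by association") score rather than the authors' actual scoring scheme, is shown here. All node names, edge weights, and the function name are hypothetical.

```python
def prioritize_targets(query, compound_similarity, known_targets, protein_links):
    """Rank proteins for a query compound.

    compound_similarity: compound -> {similar compound: similarity weight}
    known_targets:       compound -> list of proteins it is known to act on
    protein_links:       protein  -> {related protein: relation weight}
                         (e.g. orthology, shared domain, shared pathway)
    """
    scores = {}
    for neighbour, sim in compound_similarity.get(query, {}).items():
        for target in known_targets.get(neighbour, ()):
            # Direct evidence: a similar compound hits this protein.
            scores[target] = scores.get(target, 0.0) + sim
            # Indirect evidence: proteins functionally related to that target.
            for related, weight in protein_links.get(target, {}).items():
                scores[related] = scores.get(related, 0.0) + sim * weight
    # Highest score first; names break ties so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

# Toy graph: an orphan compound, two similar compounds with known targets in
# model organisms, and pathogen proteins linked to those targets.
compound_similarity = {"orphan": {"cmpd_a": 0.9, "cmpd_b": 0.4}}
known_targets = {"cmpd_a": ["human_P1"], "cmpd_b": ["yeast_P2"]}
protein_links = {
    "human_P1": {"pathogen_X": 0.8},
    "yeast_P2": {"pathogen_Y": 0.5, "pathogen_X": 0.2},
}
ranking = prioritize_targets("orphan", compound_similarity, known_targets, protein_links)
ranked_names = [name for name, _ in ranking]
```

Filtering the ranked list to proteins of the pathogen of interest would give the kind of reduced prioritization list the abstract mentions.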

Public Library of Science (PLOS)

CONICET Digital

Springer - Publisher Connector

PPIcons: identification of protein-protein interaction sites in selected organisms

Author: Brijesh K. Sriwastava
Dariusz Plewczynski
Subhadip Basu
Ujjwal Maulik
Publication venue: Springer Nature
Publication date: 01/01/2013

The physico-chemical properties of interaction interfaces play a crucial role in the characterization of protein–protein interactions (PPI). In silico prediction of participating amino acids helps to identify interface residues for further experimental verification using mutational analysis, or inhibition studies that screen a library of ligands against a given protein. Given the unbound structure of a protein and the fact that it forms a complex with another known protein, the objective of this work is to identify the residues that are involved in the interaction. We attempt to predict interaction sites in protein complexes using the local composition of amino acids together with their physico-chemical characteristics. The local sequence segments (LSS) are extracted from the protein sequences using a sliding window of 21 amino acids. The list of LSSs is passed to the support vector machine (SVM) predictor, which identifies interacting residue pairs considering their inter-atom distances. We analyzed three model organisms, Escherichia coli, Saccharomyces cerevisiae and Homo sapiens, with 40, 123 and 33 hetero-complexes respectively. A unified multi-organism PPI meta-predictor was also developed by combining the training databases of these organisms. The PPIcons interface residue prediction method achieves an area under the ROC curve (AUC) of 0.82, 0.75, 0.72 and 0.76 for the aforementioned organisms and the meta-predictor respectively. Electronic supplementary material: The online version of this article (doi:10.1007/s00894-013-1886-9) contains supplementary material, which is available to authorized users.
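The sliding-window step described in this abstract can be sketched as follows. The window length of 21 comes from the abstract; centring each window on a residue and padding the sequence ends with "X" are assumptions, since the abstract does not say how termini are handled.

```python
def local_sequence_segments(sequence, window=21, pad="X"):
    """Return one window-length segment per residue, centred on that residue.

    The ends are padded with `pad` so that terminal residues also get a
    full-length segment (an assumed convention).
    """
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    padded = pad * half + sequence + pad * half
    return [padded[i:i + window] for i in range(len(sequence))]

# Toy 30-residue sequence (not a real protein).
protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLI"
segments = local_sequence_segments(protein)
```

Each segment could then be turned into composition and physico-chemical features before being passed to a classifier such as an SVM.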

GOPred: GO Molecular Function Prediction by Combined Classifiers

Author: A Arampatzis
A Bairoch
A Ben-Hur
A Fernandes
A Sokolov
A Yildiz
AH Liu
B Vogelstein
BE Engelhardt
BO Bodemann
BYM Cheng
C Altay
C Pasquier
C Zhai
CS Leslie
CZ Cai
D Demos
DMA Martin
DT Holloway
F Wilcoxon
H Hasumi
I Friedberg
I Melvin
J Kittler
JG Shanahan
JTL Wang
K Blekas
L Jensen
MN Wass
Niall James Haslam
O Sasson
OS Sarac
P Rice
PA McChesney
R Eisner
R Karchin
R Schwanbeck
RD King
Rengul Cetin-Atalay
RO Duda
S Tanaka
SF Altschul
SS Hannenhalli
SY Sohn
T Cover
T Hawkins
V Costa
V Kunik
Volkan Atalay
WR Gilks
WW Colby
X Wang
Y Guermeur
Y jig Cho
Ömer Sinan Saraç
Publication venue: Public Library of Science
Publication date: 01/01/2010

Functional protein annotation is an important matter for in vivo and in silico biology. Several computational methods have been proposed that make use of a wide range of features such as motifs, domains, homology, structure and physicochemical properties. No single method performs best in all functional classification problems, because the information obtained using any of these features depends on the function to be assigned to the protein. In this study, we present a novel approach that combines different methods to better represent protein function. First, we formulated the function annotation problem as a classification problem defined on 300 different Gene Ontology (GO) terms from the molecular function aspect. We presented a method to form positive and negative training examples while taking into account the directed acyclic graph (DAG) structure and evidence codes of GO. We applied three different methods and their combinations. Results show that combining different methods improves prediction accuracy in most cases. The proposed method, GOPred, is available as an online computational annotation tool (http://kinaz.fen.bilkent.edu.tr/gopred).
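The abstract says training examples are formed with the GO DAG structure and evidence codes in mind, but not how. One plausible scheme, given here purely as an assumption, treats proteins annotated to the term or any descendant as positives, sets aside proteins annotated only to ancestors (they may be unobserved positives), and uses the rest as negatives, after discarding annotations with untrusted evidence codes. The toy hierarchy and term labels are simplified placeholders, not real GO identifiers.

```python
def reachable(start, edges):
    """All nodes reachable from `start` by following `edges` (excluding start)."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def build_examples(term, children, annotations,
                   trusted=frozenset({"EXP", "IDA", "IMP"})):
    """Split proteins into (positives, negatives) for one term.

    children:    parent term -> list of child terms (the DAG)
    annotations: iterable of (protein, term, evidence_code)
    """
    parents = {}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, set()).add(parent)
    below = reachable(term, children) | {term}
    above = reachable(term, parents)

    terms_by_protein = {}
    for protein, go_term, evidence in annotations:
        if evidence in trusted:               # drop e.g. electronic (IEA) annotations
            terms_by_protein.setdefault(protein, set()).add(go_term)

    positives, negatives = set(), set()
    for protein, terms in terms_by_protein.items():
        if terms & below:
            positives.add(protein)
        elif not terms & above:               # ancestor-only proteins are set aside
            negatives.add(protein)
    return positives, negatives

toy_dag = {
    "molecular_function": ["catalytic", "binding"],
    "catalytic": ["kinase"],
    "kinase": ["protein_kinase"],
}
toy_annotations = [
    ("P1", "protein_kinase", "EXP"),  # descendant of the target term
    ("P2", "catalytic", "IDA"),       # ancestor only: ambiguous
    ("P3", "binding", "IMP"),         # unrelated branch
    ("P4", "kinase", "IEA"),          # untrusted evidence, ignored
]
positives, negatives = build_examples("kinase", toy_dag, toy_annotations)
```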

CiteSeerX

Bilkent University Institutional Repository

OpenMETU (Middle East Technical University)

Prediction of conformational B-cell epitopes from 3D structures by random forests with a distance-based feature

Author: AS Kolaskar
B Rost
DR Flower
EA Emini
G Riddick
G Walter
HR Ansari
Hua Zou
J Chen
J Huang
J Larsen
J Mintseris
J Pellequer
J Ponomarenko
J Sollner
J Sun
J Wu
JM Parker
Juan Liu
JV Ponomarenko
L Breiman
M Sikić
Mark Hall
Meng Zhao
MH Van Regenmortel
MJ Blythe
MJ Sweredoski
ND Rubinstein
P Jain
PA Karplus
PH Andersen
R Liu
S Liang
S Saha
SR Comeau
W Kabsch
Wen Zhang
Xinghuo Ye
Y El-Manzalawy
Yi Xiong
ZP Liu
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: Antigen-antibody interactions are key events in the immune system and provide important clues to immune processes and responses. In these interactions, the specific sites on antigens that are directly bound by B-cell-produced antibodies are known as B-cell epitopes. Epitope identification is an active topic in bioinformatics because of its potential use in epitope-based drug design. Although most B-cell epitopes are discontinuous (or conformational), insufficient effort has been put into conformational epitope prediction, and the performance of existing methods is far from satisfactory.

Results: To develop a high-accuracy model, we focus on several aspects that may affect prediction performance: the impact of interior residues, the different contributions of adjacent residues, and imbalanced data that contain many more non-epitope residues than epitope residues. We address these issues with three strategies. First, a 'thick surface patch' is introduced in place of the 'surface patch' to describe the local spatial context of each surface residue, so that the impact of interior residues is considered. A comparison between the thick surface patch and the surface patch shows that interior residues contribute to the recognition of epitopes. Second, the distance distributions of non-epitope patches and epitope patches differ with statistical significance, so an adjacent residue distance feature is presented that reflects the unequal contributions of adjacent residues to the location of binding sites. Third, a bootstrapping and voting procedure is adopted to deal with the imbalanced dataset. Based on these ideas, we propose a new method to identify conformational B-cell epitopes from 3D structures by combining conventional features with the proposed feature, using the random forest (RF) algorithm as the classification engine. The experiments show that our method can predict conformational B-cell epitopes with high accuracy. Evaluated by leave-one-out cross validation (LOOCV), our method achieves a mean AUC of 0.633 on the benchmark bound dataset and a mean AUC of 0.654 on the benchmark unbound dataset. When compared with state-of-the-art prediction models in the independent test, our method demonstrates comparable or better performance.

Conclusions: Our method is demonstrated to be effective for the prediction of conformational epitopes. Based on the study, we develop a tool to predict conformational epitopes from 3D structures, available at http://code.google.com/p/my-project-bpredictor/downloads/list.
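The bootstrapping and voting procedure for imbalanced data mentioned in this abstract can be sketched as below. To keep the example dependency-free, a nearest-centroid rule stands in for the random forest base learner used in the paper, and the feature vectors are made-up two-dimensional points; the function names are hypothetical.

```python
import random

def centroid(rows):
    return [sum(column) / len(rows) for column in zip(*rows)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def train_balanced_ensemble(epitope, non_epitope, n_models=11, seed=0):
    """Train one base model per balanced bootstrap sample.

    Each sample draws, with replacement, as many non-epitope examples as there
    are epitope examples, so every base model sees a 1:1 class ratio.
    """
    rng = random.Random(seed)
    size = len(epitope)
    models = []
    for _ in range(n_models):
        pos = [rng.choice(epitope) for _ in range(size)]
        neg = [rng.choice(non_epitope) for _ in range(size)]
        models.append((centroid(pos), centroid(neg)))  # stand-in base learner
    return models

def vote(models, features):
    """Majority vote: True if most base models call the residue an epitope."""
    yes = sum(
        1 for pos_c, neg_c in models
        if squared_distance(features, pos_c) < squared_distance(features, neg_c)
    )
    return yes > len(models) / 2

# Imbalanced toy data: 3 epitope residues versus 12 non-epitope residues.
epitope = [[0.9, 0.8], [1.0, 0.7], [0.8, 0.9]]
non_epitope = [[0.1 * (i % 4), 0.1 * (i // 4)] for i in range(12)]
models = train_balanced_ensemble(epitope, non_epitope)
call_near_epitopes = vote(models, [0.95, 0.85])
call_near_background = vote(models, [0.05, 0.10])
```

Using an odd number of base models avoids tied votes.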

Springer - Publisher Connector

Protein Remote Homology Detection Based on an Ensemble Learning Approach

Author
Publication venue: Hindawi Limited
Publication date: 01/01/2016

'Unite and conquer': enhanced prediction of protein subcellular localization by integrating multiple specialized tools

Author: A Bulashevska
A Krogh
C Andreoli
C Guda
CS Yu
E Badidi
E Frank
GE Tusnady
Gertraud Burger
H Bannai
H Shatkay
HB Shen
I Small
JL Heazlewood
JR Quinlan
JY Shi
KC Chou
KJ Park
L Kall
M Bhasin
M Boden
MG Claros
MS Scott
N Pfanner
N Wiedemann
O Emanuelsson
P Donnes
QB Gao
S Džeroski
S Hua
S Matsuda
SHB Chou KC
T Hirokawa
T Zhang
W Li
X Xiao
Y Huang
Yao Qing Shen
YD Cai
YL Chen
YX Pan
Z Lu
Publication venue: BioMed Central
Publication date: 01/01/2007

Background: Knowing the subcellular location of proteins provides clues to their function as well as to the interconnectivity of biological processes. Dozens of tools are available for predicting protein location in the eukaryotic cell. Each tool performs well on certain data sets, but their predictions often disagree for a given protein. Since the individual tools each have particular strengths, we set out to integrate them in a way that optimally exploits their potential. The method we present here is applicable to various subcellular locations, but is tailored for predicting whether or not a protein is localized in mitochondria. Knowledge of the mitochondrial proteome is relevant to understanding the role of this organelle in global cellular processes.

Results: To develop a method for enhanced prediction of subcellular localization, we integrated the outputs of available localization prediction tools using several strategies, and tested the performance of each strategy with known mitochondrial proteins. The accuracy obtained (up to 92%) far surpasses that of the individual tools. The method of integration proved crucial to performance. For the prediction of mitochondrion-located proteins, integration via a two-layer decision tree clearly outperforms simpler methods, as it allows emphasis of biologically relevant features such as the mitochondrial targeting peptide and transmembrane domains.

Conclusion: We developed an approach that enhances the prediction accuracy for mitochondrial proteins by uniting the strengths of specialized tools. The combination of machine-learning-based integration with biological expert knowledge leads to improved performance. This approach also alleviates the conundrum of how to choose between conflicting predictions. Our approach is easy to implement, and is applicable to predicting subcellular locations other than mitochondria, as well as other biological features. For a trial of our approach, we provide a web service for mitochondrial protein prediction (named YimLOC), which can be accessed through the AnaBench suite at http://anabench.bcm.umontreal.ca/anabench/. The source code is provided in Additional file 2 (1471-2105-8-420-S2.pdf), which contains scripts for the online server YimLOC. Note that these scripts cover only the ready-to-use STACK-mem-DT described in the main text; they do not include the training process.
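The two-layer integration idea in this abstract can be illustrated with a small hand-written cascade. This is not the STACK-mem-DT model, which is learned from data: the tool names, the thresholds, and the rule structure below are all hypothetical and only show how a second layer can weigh tool agreement against features such as a targeting peptide or transmembrane domains.

```python
def layer_one(tool_calls):
    """First layer: fraction of individual tools that predict 'mitochondrial'.

    tool_calls: dict of tool name -> bool (True means mitochondrial).
    """
    return sum(tool_calls.values()) / len(tool_calls)

def layer_two(support, has_targeting_peptide, n_transmembrane):
    """Second layer: a tiny decision tree over the first-layer output and two
    biological features. Thresholds are illustrative placeholders."""
    if has_targeting_peptide:
        return support >= 0.3      # a targeting peptide needs little extra support
    if n_transmembrane > 0:
        return support >= 0.5      # membrane proteins need a majority of tools
    return support >= 0.8          # otherwise require near-unanimous agreement

def predict_mitochondrial(tool_calls, has_targeting_peptide, n_transmembrane):
    return layer_two(layer_one(tool_calls), has_targeting_peptide, n_transmembrane)

# Conflicting tool outputs for one hypothetical protein.
calls = {"tool_a": True, "tool_b": False, "tool_c": False}
with_peptide = predict_mitochondrial(calls, has_targeting_peptide=True, n_transmembrane=0)
without_peptide = predict_mitochondrial(calls, has_targeting_peptide=False, n_transmembrane=0)
```

The same conflicting tool outputs lead to different calls depending on the biological features, which is the behaviour a plain majority vote cannot provide.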

Springer - Publisher Connector