12 research outputs found

    Assessing the effects of data selection and representation on the development of reliable E. coli sigma 70 promoter region predictors

    As the number of sequenced bacterial genomes increases, rapid and reliable tools for annotating functional elements (e.g., transcriptional regulatory elements) become increasingly desirable. Promoters are key regulatory elements that recruit the transcriptional machinery by binding a variety of regulatory proteins known as sigma factors. Identifying promoter regions is very challenging because these regions do not adhere to specific sequence patterns or motifs and are difficult to determine experimentally. Machine learning represents a promising and cost-effective approach to the computational identification of prokaryotic promoter regions. However, the quality of the predictors depends on several factors, including: i) training data; ii) data representation; iii) classification algorithms; iv) evaluation procedures. In this work, we create several variants of E. coli promoter data sets and use them to experimentally examine the effect of these factors on the predictive performance of E. coli σ70 promoter models. Our results suggest that under some combinations of the first three factors, a prediction model may perform very well in cross-validation experiments while performing drastically worse on independent test data. This emphasizes the importance of evaluating promoter region predictors on independent test data, which corrects for the over-optimistic performance that cross-validation may estimate. Our analysis of the tested models shows that good prediction models tend to perform well regardless of how the non-promoter data was obtained, whereas poor prediction models appear more sensitive to the choice of non-promoter sequences. Interestingly, the best-performing sequence-based classifiers outperform the best-performing structure-based classifiers in both cross-validation and independent test evaluations. Finally, we propose a meta-predictor method combining two top-performing sequence-based and structure-based classifiers and compare its performance with some of the state-of-the-art E. coli σ70 promoter prediction methods.
    Funding: NPRP grant No. 4-1454-1-233 from the Qatar National Research Fund (a member of Qatar Foundation).
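    For concreteness, here is a minimal sketch (assuming Python with scikit-learn; all names are illustrative, not the authors' code) of the kind of protocol the abstract describes: 4-mer count features, a Naive Bayes classifier, and AUC measured both by cross-validation and on an independent test set, so the two estimates can be compared.

```python
# Sketch: k-mer features + Naive Bayes, evaluated by CV and on an
# independent test set. Assumes scikit-learn; names are illustrative.
from itertools import product

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
from sklearn.metrics import roc_auc_score

def kmer_counts(seq, k=4):
    """Count occurrences of every DNA k-mer in a sequence."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    counts = np.zeros(len(kmers))
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip windows with ambiguous bases
            counts[j] += 1
    return counts

def evaluate(train_seqs, y_train, test_seqs, y_test, k=4):
    X_train = np.array([kmer_counts(s, k) for s in train_seqs])
    X_test = np.array([kmer_counts(s, k) for s in test_seqs])
    model = MultinomialNB()

    # AUC estimated by 10-fold cross-validation on the training data ...
    cv_auc = cross_val_score(model, X_train, y_train,
                             cv=10, scoring="roc_auc").mean()

    # ... versus AUC on a truly independent test set. A large gap between
    # the two is exactly the over-optimism the abstract warns about.
    model.fit(X_train, y_train)
    test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return cv_auc, test_auc
```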

    AUC scores for the Naive Bayes classifier with 4-mer features (NB_4-mer) trained on each of seven versions of the CV data and, in each case, tested on all seven versions of the independent test data.

    Each row corresponds to a training set; each column corresponds to a test set.
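    A short sketch of how such a cross-matrix can be assembled, continuing the example above (loading of the data-set variants is assumed: cv_sets and test_sets map variant names such as "Mixed", "Random", and "Coding" to feature/label pairs):

```python
# Train the same NB_4-mer model on each CV variant and score it on every
# independent test variant, yielding the 7x7 AUC table.
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import roc_auc_score

def auc_matrix(cv_sets, test_sets):
    table = {}
    for train_name, (X_tr, y_tr) in cv_sets.items():
        model = MultinomialNB().fit(X_tr, y_tr)
        table[train_name] = {
            test_name: roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
            for test_name, (X_te, y_te) in test_sets.items()
        }
    return table  # rows: training set, columns: test set
```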

    AUC scores for selected classifiers (trained using CV_Mixed data) tested on different versions of the independent test set (e.g., TS_Random and TS_Coding).

    See the Methods section of the article (http://www.plosone.org/article/info:doi/10.1371/journal.pone.0119721#sec002) for more information about these test sets. For each data set, the rank of each classifier is shown in parentheses.

    AUC scores for the Naive Bayes classifier with DNID features (NB_DNID) trained on each of seven versions of the CV data and, in each case, tested on all seven versions of the independent test data.

    Each row corresponds to a training set; each column corresponds to a test set.

    Grading scale for classifiers based on their AUC scores.

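    The table's actual cut-offs are not reproduced in this listing; the sketch below only shows the general shape of such a grading function, with purely illustrative thresholds:

```python
# Map an AUC score to a qualitative grade. The cut-offs here are
# hypothetical placeholders, not the paper's published scale.
def grade(auc, scale=((0.9, "excellent"), (0.8, "good"),
                      (0.7, "fair"), (0.6, "poor"))):
    for cutoff, label in scale:
        if auc >= cutoff:
            return label
    return "fail"
```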

    Performance comparison of BacPP, IPMD, and two variable-window Z-curve models (VWZ1 and VWZ2, trained using Dataset-1 and Dataset-2, respectively) with four selected classifiers (NB_DNID, RF100_M7, HMM, and the meta-predictor) on the TS_Mixed independent test set.

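    The abstracts in this listing do not spell out the meta-predictor's combination rule; a simple assumed stand-in is averaging the positive-class probabilities of the two base classifiers:

```python
# Combine a sequence-based and a structure-based classifier by averaging
# their predicted promoter probabilities. Simple averaging is an assumed
# stand-in; the paper's actual combination rule may differ.
def meta_predict_proba(seq_model, struct_model, X_seq, X_struct):
    """X_seq and X_struct hold the same instances under the two
    feature representations (sequence-based and structure-based)."""
    p_seq = seq_model.predict_proba(X_seq)[:, 1]
    p_struct = struct_model.predict_proba(X_struct)[:, 1]
    return (p_seq + p_struct) / 2.0
```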

    Summary of cross-validation data sets.


    Parallelizing exact motif finding algorithms on multi-core

    The motif finding problem is one of the important and challenging problems in bioinformatics. A variety of sequential algorithms have been proposed to find exact motifs, but their running times remain impractical due to the high computational complexity of motif finding. In this paper we parallelize three efficient sequential algorithms: HEPPMSprune, PMS5, and PMS6. We implement the algorithms on a dual quad-core machine using OpenMP and measure the performance of each algorithm. Our experiments on simulated data show that: (1) parallel PMS6 is faster than the other algorithms on challenging instances, while parallel HEPPMSprune is faster than the other algorithms on most solvable instances; (2) the scalability of parallel HEPPMSprune is linear for all instances, while the scalability of parallel PMS5 and PMS6 is linear on challenging instances only; (3) HEPPMSprune uses less memory than the other algorithms.
    Funding: NPRP Grant No. 4-1454-1-233 from the Qatar National Research Fund (a member of Qatar Foundation).
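    The paper's parallelization uses OpenMP on the existing sequential implementations; for language consistency with the sketches above, the snippet below illustrates the same core idea — partitioning the candidate-motif space into independent chunks that run on separate cores — using Python's multiprocessing instead. It is a brute-force exact (l, d) planted-motif search, far slower than the PMS algorithms, but the parallel structure (independent chunks, results merged at the end) is the same.

```python
# Brute-force exact (l, d) motif search, parallelized by splitting the
# 4^l candidate l-mers across worker processes. Illustrative only; the
# paper parallelizes HEPPMSprune/PMS5/PMS6 with OpenMP.
from itertools import product
from multiprocessing import Pool

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def occurs_within_d(motif, seq, d):
    """True if seq contains an l-mer within Hamming distance d of motif."""
    l = len(motif)
    return any(hamming(motif, seq[i:i + l]) <= d
               for i in range(len(seq) - l + 1))

def check_chunk(args):
    candidates, sequences, d = args
    return [m for m in candidates
            if all(occurs_within_d(m, s, d) for s in sequences)]

def find_motifs(sequences, l, d, workers=8):
    candidates = ["".join(p) for p in product("ACGT", repeat=l)]
    step = len(candidates) // workers + 1
    chunks = [(candidates[i:i + step], sequences, d)
              for i in range(0, len(candidates), step)]
    with Pool(workers) as pool:
        results = pool.map(check_chunk, chunks)  # chunks are independent
    return [m for part in results for m in part]
```

    Because the chunks share no state, speedup is limited mainly by load imbalance across chunks, which mirrors the scalability behavior the abstract reports. On platforms that spawn rather than fork processes, the call to find_motifs should sit under an `if __name__ == "__main__":` guard.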