32 research outputs found

Software defect prediction: do different classifiers find the same defects?

Author: AT Mısırlı
B Turhan
C Catal
C Seiffert
C Soares
D Gray
D Gray
David Bowes
DH Wolpert
E Arisholm
H Chen
I Witten
IH Laradji
Jean Petrić
K Elish
L Briand
L Madeyski
M D’Ambros
M Shepperd
M Shepperd
M Shepperd
MA Hall
N Fenton
NV Chawla
R Malhotra
S Lessmann
T Hall
T Khoshgoftaar
T Menzies
Tracy Hall
U Fayyad
W Chen
Y Zhou
Z Sun
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2017

Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

During the last 10 years, hundreds of different defect prediction models have been published. The performance of the classifiers used in these models is reported to be similar, with models rarely performing above the predictive performance ceiling of about 80% recall. We investigate the individual defects that four classifiers predict and analyse the level of prediction uncertainty produced by these classifiers. We perform a sensitivity analysis to compare the performance of Random Forest, Naïve Bayes, RPart and SVM classifiers when predicting defects in NASA, open source and commercial datasets. The defect predictions that each classifier makes are captured in a confusion matrix and the prediction uncertainty of each classifier is compared. Despite similar predictive performance values for these four classifiers, each detects different sets of defects. Some classifiers are more consistent in predicting defects than others. Our results confirm that a unique subset of defects can be detected by specific classifiers. However, while some classifiers are consistent in the predictions they make, other classifiers vary in their predictions. Given our results, we conclude that classifier ensembles with decision-making strategies not based on majority voting are likely to perform best in defect prediction.

Peer reviewed. Final published version.
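The abstract's central comparison, whether classifiers with similar recall flag the same defective modules, can be illustrated with plain set arithmetic. The following is a minimal sketch and not the authors' analysis: the module IDs, the predictions, and the helper names (`recall`, `unique_hits`, `majority_vote`) are all hypothetical, chosen only to show how equal recall can hide different sets of detected defects and how majority voting can discard defects that only one classifier finds.

```python
from collections import Counter

# Hypothetical ground truth: IDs of modules that are actually defective.
actual_defects = {1, 2, 3, 4, 5, 6, 7, 8}

# Hypothetical predictions: modules each classifier flags as defective.
predictions = {
    "RandomForest": {1, 2, 3, 4, 5, 9},
    "NaiveBayes": {1, 2, 3, 6, 7, 10},
    "RPart": {1, 2, 4, 5, 6, 11},
    "SVM": {1, 2, 3, 4, 8, 12},
}


def recall(predicted, actual):
    """Fraction of actual defects that appear in the predicted set."""
    return len(predicted & actual) / len(actual)


def unique_hits(name, predictions, actual):
    """True defects flagged by classifier `name` and by no other classifier."""
    others = set().union(*(p for n, p in predictions.items() if n != name))
    return (predictions[name] & actual) - others


def majority_vote(predictions):
    """Modules flagged by strictly more than half of the classifiers."""
    votes = Counter(m for flagged in predictions.values() for m in flagged)
    threshold = len(predictions) / 2
    return {m for m, v in votes.items() if v > threshold}


if __name__ == "__main__":
    for name, flagged in predictions.items():
        print(name, recall(flagged, actual_defects),
              sorted(unique_hits(name, predictions, actual_defects)))
    union = set().union(*predictions.values())
    print("majority vote recall:", recall(majority_vote(predictions), actual_defects))
    print("union recall:", recall(union, actual_defects))
```

With these made-up inputs every classifier has the same recall, yet two of them each find a defect the others miss, and the majority vote recovers fewer true defects than the union of all predictions. That is the kind of effect the abstract points to when it argues against majority voting, though the real study works from confusion matrices over actual datasets.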

Crossref

Springer - Publisher Connector

Lancaster E-Prints

University of Hertfordshire Research Archive

Hybrid Models Identified a 12-Gene Signature for Lung Cancer Prognosis and Chemoresponse Prediction

Author: A Bhattacharjee
A Potti
A Subramanian
AC Borczuk
AH Bild
B Emir
Dajie Luo
DG Beer
Ebrahim Sabbagh
HY Chen
IH Witten
IH Witten
J Subramanian
James Denvir
Jörg Hoheisel
K Shedden
KJ Livak
L Guo
L Hood
LJ van 't Veer
M Raponi
MA Hall
Mitchell
MJ van de Vijver
MS Pepe
Nancy Lan Guo
NL Guo
ON Ikediobi
PC Boutros
PC Hoffman
Rebecca Raese
S Paik
SG Baker
SK Lau
T Naruke
UT Shankavaram
Val Vallyathan
VG Tusher
Vincent Castranova
WS Dalton
Y Lu
Y Ma
Y Ma
Ying-Wooi Wan
Yong Qian
Z Wu
Publication venue: Public Library of Science
Publication date: 01/01/2010

Lung cancer remains the leading cause of cancer-related deaths worldwide. The recurrence rate ranges from 35-50% among early stage non-small cell lung cancer patients. To date, there is no fully-validated and clinically applied prognostic gene signature for personalized treatment. From genome-wide mRNA expression profiles generated on 256 lung adenocarcinoma patients, a 12-gene signature was identified using combinatorial gene selection methods, and a risk score algorithm was developed with Naïve Bayes. The 12-gene model generates significant patient stratification in the training cohort HLM & UM (n = 256; log-rank P = 6.96e-7) and two independent validation sets, MSK (n = 104; log-rank P = 9.88e-4) and DFCI (n = 82; log-rank P = 2.57e-4), using Kaplan-Meier analyses. This gene signature also stratifies stage I and IB lung adenocarcinoma patients into two distinct survival groups (log-rank P<0.04). The 12-gene risk score is more significant (hazard ratio = 4.19, 95% CI: [2.08, 8.46]) than other commonly used clinical factors except tumor stage (III vs. I) in multivariate Cox analyses. The 12-gene model is more accurate than previously published lung cancer gene signatures on the same datasets. Furthermore, this signature accurately predicts chemoresistance/chemosensitivity to Cisplatin, Carboplatin, Paclitaxel, Etoposide, Erlotinib, and Gefitinib in NCI-60 cancer cell lines (P<0.017). The identified 12 genes exhibit curated interactions with major lung cancer signaling hallmarks in functional pathway analysis. The expression patterns of the signature genes have been confirmed in RT-PCR analyses of independent tumor samples. The results demonstrate the clinical utility of the identified gene signature in prognostic categorization.
With this 12-gene risk score algorithm, early stage patients at high risk for tumor recurrence could be identified for adjuvant chemotherapy, whereas stage I and II patients at low risk could be spared the toxic side effects of chemotherapeutic drugs.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

The Research Repository @ WVU (West Virginia University)

A P2P Botnet detection scheme based on decision tree and adaptive multilayer neural networks

Author: A Dries
A Nigrin
A Shiravi
AK Jain
C Ludl
C-F Tsai
G Fedynyshyn
H Jiang
H Li
H Nguyen
IH Witten
J Felix
J Zhang
K Wang
K-S Han
L Breiman
Li Zhang
M Hall
M Robnik-Šikonja
M. A. Hossain
MA Razi
Mohammad Alauthaman
Nauman Aslam
P Putten Van der
P Wang
P-N Tan
R Babak
RA Rodríguez-Gómez
Rafe Alasem
S Shin
SRSC Silva
T Holz
T Zhang
W Lu
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2016

In recent years, Botnets have been adopted as a popular means of carrying and spreading malicious code on the Internet. This code paves the way for many fraudulent activities, including spam mail, distributed denial-of-service attacks and click fraud. While many Botnets are set up using a centralized communication architecture, peer-to-peer (P2P) Botnets can adopt a decentralized architecture, using an overlay network to exchange command and control data, which makes their detection even more difficult. This work presents a method of P2P Bot detection based on an adaptive multilayer feed-forward neural network in cooperation with decision trees. A classification and regression tree is applied as a feature selection technique to select relevant features. With these features, a multilayer feed-forward neural network training model is created using a resilient back-propagation learning algorithm. A comparison of feature set selection based on the decision tree, principal component analysis and the ReliefF algorithm indicated that the neural network model with decision-tree-based feature selection has better identification accuracy along with lower rates of false positives. The usefulness of the proposed approach is demonstrated by conducting experiments on real network traffic datasets. In these experiments, an average detection rate of 99.08% with a false positive rate of 0.75% was observed.

Northumbria University Research Portal

Crossref

Springer - Publisher Connector

Teesside University's Research Repository

Occupancy Classification of Position Weight Matrix-Inferred Transcription Factor Binding Sites

Author: A Barski
A Valouev
Aaron Cohen
B Lenhard
CC Chang
D Karolchik
DH Wolpert
DL Daniels
E Roulet
FN Jensen
G Cooper
G Pavesi
G Robertson
GC Prendergast
GD Stormo
Gregory Yochum
Hollis Wright
IH Witten
Indra Neil Sarkar
J Cohen
JE Darnell
Kemal Sönmez
KI Zeller
KJ Won
M Tompa
MA Hall
N Friedman
ND Heintzman
OJ Sansom
P Hatzis
PJ Collins
Q Sun
R Staden
S Cawley
S Sinha
Shannon McWeeney
SL Schreiber
TL Bailey
TY Roh
UM Fayyad
V Matys
VN Vapnik
Y Chen
YJ Shann
Publication venue: Public Library of Science
Publication date: 04/11/2011

BACKGROUND: Computational prediction of Transcription Factor Binding Sites (TFBS) from sequence data alone is difficult and error-prone. Machine learning techniques utilizing additional environmental information about a predicted binding site (such as distances from the site to particular chromatin features) to determine its occupancy/functionality class show promise as methods to achieve more accurate prediction of true TFBS in silico. We evaluate the Bayesian Network (BN) and Support Vector Machine (SVM) machine learning techniques on four distinct TFBS data sets and analyze their performance. We describe the features that are most useful for classification and compare these feature sets between the factors. RESULTS: Our results demonstrate good performance of classifiers both on TFBS for transcription factors used for initial training and on TFBS for other factors in cross-classification experiments. We find distances to chromatin modifications (specifically, histone modification islands), as well as distances between such modifications, to be effective predictors of TFBS occupancy, though the impact of individual predictors is largely TF specific. In our experiments, Bayesian network classifiers outperform SVM classifiers. CONCLUSIONS: Our results demonstrate good performance of machine learning techniques on the problem of occupancy classification, and demonstrate that effective classification can be achieved using distances to chromatin features. We additionally demonstrate that cross-classification of TFBS is possible, suggesting the possibility of constructing a generalizable occupancy classifier capable of handling TFBS for many different transcription factors.

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

Predicting Positive p53 Cancer Rescue Regions Using Most Informative Positive (MIP) Active Learning

Author: A Friedler
A Petitjean
A Ventura
AC Joerger
AC Martin
AL Cuff
AN Bullock
AR Fersht
BG Buchanan
CL Brooks
DA Case
DA Cohn
EF Pettersen
F Francois
F Glaser
G Dantas
G. Wesley Hatfield
IH Witten
J Feng
James M. Briggs
JM Lambert
JS Huston
K Otsuka
Kirsty Salmon
L Itti
Linda Hall
Lydia Ho
M Hollstein
M Saar-Tsechansky
MA Hearst
N Roy
NE Sharpless
NG Karaguler
P Baldi
Peter Kaiser
PV Nikolova
R Jones
Richard H. Lathrop
RJ Fox
RK Brachmann
RK Brachmann
Roberta Baronio
S Kato
S Lain
SA Danziger
SA Danziger
Samuel A. Danziger
SM Leach
TE Baroni
VJ Bykov
W Wang
W Xue
Y Cho
Publication venue: Public Library of Science
Publication date: 01/01/2009

Many protein engineering problems involve finding mutations that produce proteins with a particular function. Computational active learning is an attractive approach to discover desired biological activities. Traditional active learning techniques have been optimized to iteratively improve classifier accuracy, not to quickly discover biologically significant results. We report here a novel active learning technique, Most Informative Positive (MIP), which is tailored to biological problems because it seeks novel and informative positive results. MIP active learning differs from traditional active learning methods in two ways: (1) it preferentially seeks Positive (functionally active) examples; and (2) it may be effectively extended to select gene regions suitable for high throughput combinatorial mutagenesis. We applied MIP to discover mutations in the tumor suppressor protein p53 that reactivate mutated p53 found in human cancers. This is an important biomedical goal because p53 mutants have been implicated in half of all human cancers, and restoring active p53 in tumors leads to tumor regression. MIP found Positive (cancer rescue) p53 mutants in silico using 33% fewer experiments than traditional non-MIP active learning, with only a minor decrease in classifier accuracy. Applying MIP to in vivo experimentation yielded immediate Positive results. Ten different p53 mutations found in human cancers were paired in silico with all possible single amino acid rescue mutations, from which MIP was used to select a Positive Region predicted to be enriched for p53 cancer rescue mutants. 
In vivo assays showed that the predicted Positive Region: (1) had significantly more (p<0.01) new strong cancer rescue mutants than control regions (Negative, and non-MIP active learning); (2) had slightly more new strong cancer rescue mutants than an Expert region selected for purely biological considerations; and (3) rescued for the first time the previously unrescuable p53 cancer mutant P152L.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Sequence-Based Prediction of Type III Secreted Proteins

The type III secretion system (TTSS) is a key mechanism for host cell interaction used by a variety of bacterial pathogens and symbionts of plants and animals including humans. The TTSS represents a molecular syringe with which the bacteria deliver effector proteins directly into the host cell cytosol. Despite the importance of the TTSS for bacterial pathogenesis, recognition and targeting of type III secreted proteins has up until now been poorly understood. Several hypotheses are discussed, including an mRNA-based signal, a chaperone-mediated process, or an N-terminal signal peptide. In this study, we systematically analyzed the amino acid composition and secondary structure of N-termini of 100 experimentally verified effector proteins. Based on this, we developed a machine-learning approach for the prediction of TTSS effector proteins, taking into account N-terminal sequence features such as frequencies of amino acids, short peptides, or residues with certain physico-chemical properties. The resulting computational model revealed a strong type III secretion signal in the N-terminus that can be used to detect effectors with sensitivity of ∼71% and selectivity of ∼85%. This signal seems to be taxonomically universal and conserved among animal pathogens and plant symbionts, since we could successfully detect effector proteins if the respective group was excluded from training. The application of our prediction approach to 739 complete bacterial and archaeal genome sequences resulted in the identification of between 0% and 12% putative TTSS effector proteins. Comparison of effector proteins with orthologs that are not secreted by the TTSS showed no clear pattern of signal acquisition by fusion, suggesting convergent evolutionary processes shaping the type III secretion signal. The newly developed program EffectiveT3 (http://www.chlamydiaedb.org) is the first universal in silico prediction program for the identification of novel TTSS effectors.
Our findings will facilitate further studies on and improve our understanding of type III secretion and its role in pathogen–host interactions.
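The N-terminal sequence features named in the abstract (frequencies of amino acids and of short peptides) can be sketched as simple counting over a sequence prefix. This is an illustrative sketch only, not the EffectiveT3 implementation: the function names, the default window of 25 residues, and the example sequence are hypothetical, and the abstract does not state which window length or peptide length the authors used.

```python
from collections import Counter

# The 20 standard amino acids, one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def n_terminal_composition(sequence, window=25):
    """Relative frequency of each standard amino acid in the first
    `window` residues. The window length is an assumed parameter."""
    n_term = sequence[:window].upper()
    if not n_term:
        return {aa: 0.0 for aa in AMINO_ACIDS}
    counts = Counter(n_term)
    return {aa: counts[aa] / len(n_term) for aa in AMINO_ACIDS}


def n_terminal_peptides(sequence, k=2, window=25):
    """Relative frequency of each overlapping peptide of length `k`
    observed in the first `window` residues."""
    n_term = sequence[:window].upper()
    kmers = [n_term[i:i + k] for i in range(len(n_term) - k + 1)]
    if not kmers:
        return {}
    counts = Counter(kmers)
    return {kmer: c / len(kmers) for kmer, c in counts.items()}


if __name__ == "__main__":
    # Made-up sequence fragment, used only to exercise the functions.
    example = "MSSSTPKA"
    print(n_terminal_composition(example, window=4))
    print(n_terminal_peptides(example, k=2, window=4))
```

Feature vectors of this kind could be passed to any standard classifier; which classifier and which physico-chemical groupings the authors used is not stated in the abstract.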

Crossref

University of Strathclyde Institutional Repository

University of Birmingham Research Portal

Directory of Open Access Journals

PubMed Central

Permanent Hosting, Archiving and Indexing of Digital Resources and Assets

PuSH