Assessing the Statistical Significance of the Achieved Classification Error of Classifiers Constructed using Serum Peptide Profiles, and a Prescription for Random Sampling Repeated Studies for Massive High-Throughput Genomic and Proteomic Studies

By James Lyons-Weiler, Richard Pelikan, Herbert J Zeh, David C Whitcomb, David E Malehorn, William L Bigbee and Milos Hauskrecht


Peptide profiles generated using SELDI/MALDI time-of-flight mass spectrometry provide a promising source of patient-specific information with high potential impact on the early detection and classification of cancer and other diseases. The new profiling technology comes, however, with numerous challenges and concerns. Particularly important are concerns about the reproducibility of classification results and their significance. In this work we describe a computational validation framework, called PACE (Permutation-Achieved Classification Error), that lets us assess, for a given classification model, the significance of the Achieved Classification Error (ACE) on the profile data. The framework computes the performance statistic of the classifier on the true data samples and checks whether it is consistent with the behavior of the classifier on the same data with randomly reassigned class labels. A statistically significant ACE increases our belief that a discriminative signal was found in the data. The advantage of PACE analysis is that it can be easily combined with any classification model and is relatively easy to interpret. PACE analysis does not, however, protect researchers against confounding in the experimental design or against other sources of systematic or random error. We use PACE analysis to assess the significance of classification results we have achieved on a number of published data sets. The results show that many of these data sets indeed possess a signal that leads to a statistically significant ACE.
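The permutation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the nearest-centroid classifier, the repeated random train/test splits, the add-one p-value correction, and all function and parameter names (`nearest_centroid_error`, `achieved_error`, `pace_p_value`, `n_permutations`, `n_splits`, `test_frac`) are assumptions chosen only to keep the example self-contained. Any classifier could be substituted for the one used here.

```python
import random


def nearest_centroid_error(train_X, train_y, test_X, test_y):
    """Fit per-class centroids on the training split; return the test error rate."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_X, train_y) if y == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]

    def predict(x):
        # Assign the class whose centroid is closest in squared Euclidean distance.
        return min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
        )

    wrong = sum(predict(x) != y for x, y in zip(test_X, test_y))
    return wrong / len(test_y)


def achieved_error(X, y, n_splits, rng, test_frac=0.3):
    """Mean test error over repeated random train/test splits (stand-in for ACE)."""
    n = len(y)
    n_test = max(1, int(n * test_frac))
    errors = []
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        test, train = idx[:n_test], idx[n_test:]
        errors.append(
            nearest_centroid_error(
                [X[i] for i in train], [y[i] for i in train],
                [X[i] for i in test], [y[i] for i in test],
            )
        )
    return sum(errors) / len(errors)


def pace_p_value(X, y, n_permutations=200, n_splits=10, seed=0):
    """Compare the error on true labels with errors under randomly reassigned labels."""
    rng = random.Random(seed)
    ace = achieved_error(X, y, n_splits, rng)
    null_errors = []
    for _ in range(n_permutations):
        shuffled = list(y)
        rng.shuffle(shuffled)  # break any link between profiles and class labels
        null_errors.append(achieved_error(X, shuffled, n_splits, rng))
    # Fraction of permutations that do at least as well as the true labels;
    # the add-one correction (an assumption here) keeps the estimate above zero.
    p = (1 + sum(e <= ace for e in null_errors)) / (1 + n_permutations)
    return ace, p
```

Calling `pace_p_value(X, y)` with `X` as a list of feature vectors and `y` as the matching class labels returns the mean error on the true labels and an estimated p-value. A small p-value suggests the achieved error is unlikely under random labelling. As the abstract notes, it says nothing about confounding in the experimental design.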

Topics: Original Research
Publisher: Libertas Academica
Journal: Cancer Informatics
Provided by: PubMed Central

