Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values

Authors: Marko Nicholas, Razzaghi Talayeh, Roderick Oleg, Safro Ilya
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 07/04/2016

This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: it is noisy, has missing entries, and has imbalanced classes of interest, all of which lead to serious bias in predictive modeling. Since standard data mining methods often produce poor performance measures, we argue for the development of specialized techniques of data preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework that combines the cost-sensitive SVM with the expectation-maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values, as well as on real data in health applications, and show that our multilevel SVM-based method produces faster, more accurate, and more robust classification results.

Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625
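
The cost-sensitive SVM mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common formulation in which each class gets its own misclassification penalty, here set inversely proportional to class frequency. The function names, the weighting heuristic, and the toy data are illustrative assumptions, not taken from the paper, and the multilevel framework and imputation step are not shown.

```python
from collections import Counter


def class_weights(labels, base_c=1.0):
    """Assumed heuristic: penalty C_k = base_c * n / (n_classes * n_k),
    so the rarer class receives the larger penalty."""
    counts = Counter(labels)
    n = len(labels)
    return {k: base_c * n / (len(counts) * c) for k, c in counts.items()}


def weighted_svm_objective(w, b, xs, ys, weights):
    """Primal objective of a linear cost-sensitive SVM:
    0.5 * ||w||^2 + sum_i C_{y_i} * max(0, 1 - y_i * (w . x_i + b)),
    with labels y_i in {-1, +1}."""
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for x, y in zip(xs, ys):
        margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
        loss += weights[y] * max(0.0, 1.0 - margin)
    return reg + loss


# Toy imbalanced data (hypothetical): three positives, one negative.
xs = [(2.0,), (1.5,), (3.0,), (-0.5,)]
ys = [1, 1, 1, -1]
weights = class_weights(ys)
objective = weighted_svm_objective((1.0,), 0.0, xs, ys, weights)
print(weights, objective)
```

With these weights, a margin violation on the single minority example costs three times as much as the same violation on a majority example, which is the effect a cost-sensitive formulation is meant to achieve on imbalanced data.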

arXiv.org e-Print Archive

Directory of Open Access Journals

FigShare

Cost-Sensitive Classification: Empirical Evaluation of a Hybrid Genetic Decision Tree Induction Algorithm

Author: Turney P. D.
Publication date: 01/01/1995

This paper introduces ICET, a new algorithm for cost-sensitive classification. ICET uses a genetic algorithm to evolve a population of biases for a decision tree induction algorithm. The fitness function of the genetic algorithm is the average cost of classification when using the decision tree, including both the costs of tests (features, measurements) and the costs of classification errors. ICET is compared here with three other algorithms for cost-sensitive classification - EG2, CS-ID3, and IDX - and also with C4.5, which classifies without regard to cost. The five algorithms are evaluated empirically on five real-world medical datasets. Three sets of experiments are performed. The first set examines the baseline performance of the five algorithms on the five datasets and establishes that ICET performs significantly better than its competitors. The second set tests the robustness of ICET under a variety of conditions and shows that ICET maintains its advantage. The third set looks at ICET's search in bias space and discovers a way to improve the search.

Comment: See http://www.jair.org/ for any accompanying file
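
The fitness measure described in the abstract, average cost per case including both test costs and error costs, can be sketched as follows. This is a simplified illustration under stated assumptions: each test is charged once per case, the classifier is a hand-written stub standing in for a decision tree, and all names, costs, and cases are hypothetical. The genetic search over biases is not shown.

```python
def average_cost(cases, classify, test_costs, error_costs):
    """Average cost per case: cost of the tests actually performed
    plus the cost of the resulting classification (zero if correct)."""
    total = 0.0
    for features, actual in cases:
        predicted, tests_used = classify(features)
        # Assumption: a test is paid for once per case, even if reused.
        total += sum(test_costs[t] for t in set(tests_used))
        total += error_costs[(actual, predicted)]
    return total / len(cases)


def classify(features):
    """Hypothetical stand-in for a decision tree: run the cheap test
    first and order the expensive one only when it is inconclusive."""
    tests = ["blood"]
    if features["blood"] > 5:
        return "sick", tests
    tests.append("xray")
    return ("sick" if features["xray"] > 0 else "healthy"), tests


test_costs = {"blood": 5.0, "xray": 20.0}
error_costs = {
    ("sick", "sick"): 0.0,
    ("healthy", "healthy"): 0.0,
    ("sick", "healthy"): 100.0,  # actual sick, predicted healthy
    ("healthy", "sick"): 30.0,   # actual healthy, predicted sick
}
cases = [
    ({"blood": 7, "xray": 1}, "sick"),
    ({"blood": 3, "xray": 0}, "healthy"),
    ({"blood": 2, "xray": 0}, "sick"),
    ({"blood": 6, "xray": 0}, "healthy"),
]
fitness = average_cost(cases, classify, test_costs, error_costs)
print(fitness)
```

A lower value is better: a tree that orders fewer expensive tests without making costly errors scores well, which is the trade-off a cost-sensitive learner has to balance.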

arXiv.org e-Print Archive

CiteSeerX

NRC Publications Archive

CogPrints Cognitive Sciences Eprint Archive

Application of Multiple imputation in Analysis of missing data in a study of Health-related quality of life

Author: Zhu Chunming
Publication date: 29/06/2011

When a new treatment has efficacy similar to that of standard therapy in medical or social studies, health-related quality of life (HRQL) becomes the main concern of health care professionals and can be the basis for decisions in patient management. The National Surgical Adjuvant Breast and Bowel Project (NSABP) C-06 clinical trial compared two therapies for the treatment of colon cancer: intravenous (IV) fluorouracil (FU) plus Leucovorin (LV), and oral uracil/ftorafur (UFT) plus LV. However, a high proportion of the HRQL measurements were missing: only 481 (59.8%) UFT patients and 421 (52.4%) FU patients submitted the forms at all time points. Ignoring missing data often leads to inefficient and sometimes biased estimates. The primary objective of this thesis is to evaluate the impact of missing data on the estimated treatment effect. In this thesis, we analyzed the HRQL data with missing values by multiple imputation. Both model-based and nearest-neighbor hot-deck imputation methods were applied. Confidence intervals for the estimated treatment effect were generated based on the pooled imputation analysis. The results based on multiple imputation indicated that missing data did not introduce major bias in the earlier analyses. However, multiple imputation was worthwhile, since most estimates from the imputed datasets are more efficient than those from the incomplete data. These findings have public health importance: they have implications for the development of health policies and the planning of interventions to improve the health-related quality of life of patients with colon cancer.
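
The pooled analysis mentioned in the abstract can be illustrated with Rubin's rules, the standard way to combine estimates across imputed datasets. Whether the thesis used exactly this procedure is an assumption, and the numbers below are invented for illustration only.

```python
from math import sqrt
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Combine per-imputation estimates and their variances:
    pooled estimate = mean of the estimates,
    total variance  = within + (1 + 1/m) * between."""
    m = len(estimates)
    pooled = mean(estimates)
    within = mean(variances)       # average sampling variance
    between = variance(estimates)  # sample variance across imputations
    total = within + (1 + 1 / m) * between
    return pooled, total


# Hypothetical treatment-effect estimates from m = 3 imputed datasets.
estimates = [1.0, 2.0, 3.0]
variances = [0.5, 0.5, 0.5]
pooled, total_var = pool_rubin(estimates, variances)
print(pooled, sqrt(total_var))  # pooled estimate and its standard error
```

The between-imputation term is what carries the uncertainty due to the missing values. A confidence interval would then be formed from the pooled estimate and its standard error using a t reference distribution.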

D-Scholarship@Pitt