Identifying Target Populations for Screening or Not Screening Using Logic Regression
Colorectal cancer remains a significant public health concern despite the fact that effective screening procedures exist and that the disease is treatable when detected at early stages. Numerous risk factors for colon cancer have been identified, but none is very predictive alone. We sought to determine whether there are certain combinations of risk factors that distinguish well between cases and controls, and that could be used to identify subjects at particularly high or low risk of the disease in order to target screening. Using data from the Seattle site of the Colorectal Cancer Family Registry (C-CFR), we fit logic regression models to combine risk factor information. Logic regression is a methodology that identifies subsets of the population, described by Boolean combinations of binary coded risk factors. This method is well suited to situations in which interactions between many variables result in differences in disease risk. Neither the logic regression models nor the stepwise logistic regression models fit for comparison resulted in criteria that could be used to direct subjects to screening. However, we believe that our novel statistical approach could be useful in settings where risk factors do discriminate between cases and controls, and we illustrate this with a simulated dataset.
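As a rough illustration of what "a subset of the population described by a Boolean combination of binary coded risk factors" can mean, the sketch below evaluates one fixed Boolean rule on toy data and compares the case rate inside and outside the resulting subgroup. The risk factor names, the rule, and the data are all hypothetical, and the sketch does not attempt the search over rules that logic regression itself performs.

```python
from typing import Callable, Dict, List, Tuple

Subject = Dict[str, bool]


def subgroup_case_rates(
    subjects: List[Subject],
    is_case: List[int],
    rule: Callable[[Subject], bool],
) -> Tuple[float, float]:
    """Return (case rate where the rule holds, case rate where it does not)."""
    in_cases = in_total = out_cases = out_total = 0
    for subject, case in zip(subjects, is_case):
        if rule(subject):
            in_total += 1
            in_cases += case
        else:
            out_total += 1
            out_cases += case
    rate_in = in_cases / in_total if in_total else float("nan")
    rate_out = out_cases / out_total if out_total else float("nan")
    return rate_in, rate_out


# Hypothetical binary risk factors; not taken from the C-CFR data.
def example_rule(s: Subject) -> bool:
    return (s["family_history"] and s["smoker"]) or not s["nsaid_use"]


toy_subjects = [
    {"family_history": True, "smoker": True, "nsaid_use": True},
    {"family_history": True, "smoker": False, "nsaid_use": True},
    {"family_history": False, "smoker": False, "nsaid_use": False},
    {"family_history": False, "smoker": True, "nsaid_use": True},
    {"family_history": True, "smoker": True, "nsaid_use": False},
    {"family_history": False, "smoker": False, "nsaid_use": True},
]
toy_is_case = [1, 0, 1, 0, 0, 0]  # 1 = case, 0 = control (made-up labels)

rate_in, rate_out = subgroup_case_rates(toy_subjects, toy_is_case, example_rule)
```

A rule would be useful for targeting screening only if the two rates differ substantially on data not used to choose the rule.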
Statistical methods to evaluate disease outcome diagnostic accuracy of multiple biomarkers with application to HIV and TB research.
Doctor of Philosophy in Statistics. University of KwaZulu-Natal, Pietermaritzburg 2015.
One challenge in clinical medicine is the correct diagnosis of disease. Medical researchers
invest considerable time and effort in improving the accuracy of disease diagnosis, and
diagnostic tests are consequently important components of modern medical practice. The receiver
operating characteristic (ROC) curve is a statistical tool commonly used for describing the
discriminatory accuracy and performance of a diagnostic test. A popular summary index of
discriminatory accuracy is the area under the ROC curve (AUC). In medical research, scientists
are often simultaneously evaluating hundreds of biomarkers. A critical challenge is the combination
of biomarkers into models that give insight into disease. In infectious disease, biomarkers
are often also evaluated in the microorganism or virus causing infection, adding more
complexity to the analysis. In addition to providing an improved understanding of factors
associated with infection and disease development, combinations of relevant markers are important
to the diagnosis and treatment of disease. Taken together, this extends the role of the
statistical analyst and presents many novel and major challenges. This thesis discusses some
of the various strategies and issues in using statistical data analysis to address the diagnosis
problem of selecting and combining multiple markers to estimate the predictive accuracy of
test results. We also consider different methodologies to address missing data and to improve
the predictive accuracy in the presence of incomplete data.
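For a single continuous marker, the empirical AUC can be read as the proportion of (diseased, non-diseased) pairs in which the diseased subject has the higher marker value, with ties counted as one half. The sketch below computes it that way; the function name and the example values are illustrative, and it assumes that higher marker values indicate disease.

```python
from typing import Sequence


def auc_mann_whitney(scores_pos: Sequence[float], scores_neg: Sequence[float]) -> float:
    """Empirical AUC: fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct pair.
    Assumes higher scores indicate the positive (diseased) class.
    """
    if not scores_pos or not scores_neg:
        raise ValueError("both groups must contain at least one score")
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Illustrative marker values only.
example_auc = auc_mann_whitney([0.9, 0.8, 0.4], [0.5, 0.3, 0.2])
```

An AUC of 0.5 corresponds to a marker with no discriminatory ability and 1.0 to perfect separation.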
The thesis is divided into five parts. The first part is an introduction to the theory behind
the methods that we used in this work. The second part places emphasis on the so-called
classic ROC analysis, which is applied to cross sectional data. The main aim of this chapter
is to address the problem of how to select and combine multiple markers and evaluate
the appropriateness of certain techniques used in estimating the area under the ROC curve
(AUC). Logistic regression models offer a simple method for combining markers. We applied
resampling methods to adjust for over-fitting associated with model selection. We simulated
several multivariate models to evaluate the performance of the resampling approaches in this
setting. We applied these methods to data collected from a study of tuberculosis immune
reconstitution inflammatory syndrome (TB-IRIS) in Cape Town, South Africa. Baseline levels
of five biomarkers were evaluated and we used this dataset to evaluate whether a combination
of these biomarkers could accurately discriminate between TB-IRIS and non-TB-IRIS patients,
by applying AUC analysis and resampling methods.
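One common resampling adjustment for over-fitting from model selection is a bootstrap optimism correction. The sketch below is an assumption about the general idea, not the thesis's procedure: in place of a logistic regression combination it uses a deliberately simple selection step (pick the single marker with the highest apparent AUC), repeats that selection in each bootstrap sample, and subtracts the average optimism from the apparent AUC. All function names are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Empirical AUC; ties count one half. Labels are 1 (case) or 0 (control)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def best_marker(rows: List[List[float]], labels: List[int]) -> Tuple[int, float]:
    """Selection step: index and apparent AUC of the best single marker."""
    aucs = [auc([r[j] for r in rows], labels) for j in range(len(rows[0]))]
    j = max(range(len(aucs)), key=lambda k: aucs[k])
    return j, aucs[j]


def optimism_corrected_auc(
    rows: List[List[float]], labels: List[int], n_boot: int = 200, seed: int = 0
) -> Tuple[float, float]:
    """Return (apparent AUC, optimism-corrected AUC) for the selected marker."""
    rng = random.Random(seed)
    n = len(rows)
    _, apparent = best_marker(rows, labels)
    optimism: List[float] = []
    while len(optimism) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        b_labels = [labels[i] for i in idx]
        if len(set(b_labels)) < 2:
            continue  # skip resamples that lack cases or controls
        b_rows = [rows[i] for i in idx]
        # Redo the selection inside the bootstrap sample ...
        j, auc_boot = best_marker(b_rows, b_labels)
        # ... then test the selected marker on the original data.
        auc_orig = auc([r[j] for r in rows], labels)
        optimism.append(auc_boot - auc_orig)
    return apparent, apparent - sum(optimism) / n_boot
```

The key design point is that the selection step is repeated inside every bootstrap sample, so the correction reflects the optimism introduced by selecting a model as well as by fitting it.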
The third part is concerned with a time dependent ROC analysis with event-time outcome
and comparative analysis of the techniques applied to incomplete covariates. Three different
methods are assessed and investigated, namely mean imputation, nearest neighbor hot deck
imputation and multivariate imputation by chained equations (MICE). These methods were used
together with bootstrap and cross-validation to estimate the time dependent AUC using a
non-parametric approach and a Cox model. We simulated several models to evaluate the
performance of the resampling approaches and imputation methods. We applied the above
methods to a real data set.
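The two simpler imputation methods named above can be sketched as follows. This is a minimal illustration under stated assumptions: missing values are encoded as `None`, hot deck donors are restricted to complete cases, and the nearest donor is found by unscaled squared Euclidean distance over the recipient's observed columns. MICE is not sketched here, and the function names are hypothetical.

```python
from typing import List, Optional

Row = List[Optional[float]]


def mean_impute(rows: List[Row]) -> List[List[float]]:
    """Replace each missing value with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if v is None else v for j, v in enumerate(r)] for r in rows]


def hot_deck_impute(rows: List[Row]) -> List[List[float]]:
    """Fill missing values from the nearest complete-case donor.

    Distance is the squared Euclidean distance over the columns observed in
    the recipient row (an assumption; covariates would usually be scaled first).
    """
    donors = [r for r in rows if all(v is not None for v in r)]
    if not donors:
        raise ValueError("hot deck imputation needs at least one complete row")
    imputed = []
    for r in rows:
        observed_cols = [j for j, v in enumerate(r) if v is not None]
        if len(observed_cols) == len(r):
            imputed.append(list(r))
            continue
        donor = min(donors, key=lambda d: sum((d[j] - r[j]) ** 2 for j in observed_cols))
        imputed.append([donor[j] if v is None else v for j, v in enumerate(r)])
    return imputed


# Toy covariate matrix with missing entries (made-up values).
toy_rows: List[Row] = [[1.0, 10.0], [2.0, 20.0], [3.0, None], [None, 12.0]]
mean_filled = mean_impute(toy_rows)
hot_deck_filled = hot_deck_impute(toy_rows)
```

Mean imputation shrinks the variability of the imputed covariate, whereas hot deck imputation only ever inserts values that were actually observed.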
The fourth part is concerned with applying more advanced variable selection methods to predict
the survival of patients using time dependent ROC analysis. The least absolute shrinkage and
selection operator (LASSO) Cox model is applied to estimate the bootstrap cross-validated, .632
and .632+ bootstrap AUCs for a TBM/HIV data set from KwaZulu-Natal in South Africa. We
also suggest the use of ridge-Cox regression to estimate the AUC and two level bootstrapping
to estimate the variances for AUC, in addition to evaluating these suggested methods.
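The .632 and .632+ bootstrap estimators combine an apparent (resubstitution) estimate with an out-of-bag bootstrap estimate. The sketch below adapts the usual error-rate formulation to AUC, taking 0.5 as the no-information value; this adaptation and the function names are assumptions for illustration and may differ from the exact form used in the thesis.

```python
def auc_632(apparent: float, out_of_bag: float) -> float:
    """.632 estimate: fixed weights 0.368 (apparent) and 0.632 (out-of-bag)."""
    return 0.368 * apparent + 0.632 * out_of_bag


def auc_632_plus(apparent: float, out_of_bag: float, no_info: float = 0.5) -> float:
    """.632+ estimate: the out-of-bag weight grows with relative over-fitting.

    Assumes AUC values where higher is better and `no_info` is the AUC of an
    uninformative model (0.5).
    """
    # Do not let the out-of-bag AUC fall below the no-information value.
    oob = max(out_of_bag, no_info)
    if apparent > no_info and apparent > oob:
        # Relative over-fitting rate: 0 = none, 1 = apparent AUC is pure optimism.
        rate = (apparent - oob) / (apparent - no_info)
    else:
        rate = 0.0
    weight = 0.632 / (1.0 - 0.368 * rate)
    return (1.0 - weight) * apparent + weight * oob
```

With no over-fitting the two estimators coincide, and as the relative over-fitting rate approaches 1 the .632+ estimate moves toward the out-of-bag AUC.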
The last part of the research is an application study using genetic HIV data from rural
KwaZulu-Natal to evaluate the sequence of ambiguities as a biomarker to predict recent infection
in HIV patients.
USING THE EXPONENTIAL TILTING MODEL AND THE MONOTONIC DENSITY RATIO MODEL TO FIND THE EMPIRICAL DISTRIBUTION FOR PARAMETERS
Master's. MASTER OF SCIENC
Mass spectrometry data mining for cancer detection
Early detection of cancer is crucial for successful intervention strategies. Mass spectrometry-based high-throughput proteomics is recognized as a major breakthrough in cancer detection. Many machine learning methods have been used to construct classifiers based on mass spectrometry data for discriminating between cancer stages, yet the classifiers so constructed generally lack biological interpretability. To better assist clinical use, a key step is to discover "biomarker signature profiles", i.e. combinations of a small number of protein biomarkers that strongly discriminate between cancer states.
This dissertation introduces two innovative algorithms to automatically search for a signature and to construct a high-performance signature-based classifier for cancer discrimination tasks based on mass spectrometry data, such as data acquired by MALDI or SELDI techniques. Our first algorithm assumes that homogeneous groups of mass spectra can be modeled by (unknown) Gibbs distributions to generate an optimal signature and an associated signature-based classifier by robust log-likelihood analysis; our second algorithm uses a stochastic optimization algorithm to search for two lists of biomarkers, and then constructs a signature-based classifier.
To support these two algorithms theoretically, this dissertation also studies the empirical probability distributions of mass spectrometry data and implements the actual fitting of Markov random fields to these high-dimensional distributions. We have validated our two signature discovery algorithms on several mass spectrometry datasets related to ovarian cancer and colorectal cancer patient groups. For these cancer discrimination tasks, our algorithms have yielded better classification performance than existing machine learning algorithms and, in addition, have generated more interpretable explicit signatures. Mathematics, Department o