4 research outputs found

    Improved k-means clustering using principal component analysis and imputation methods for breast cancer dataset

    Data mining techniques are used to analyse patterns in data sets in order to derive useful information. Grouping data into clusters is one of the essential processes in data manipulation, and K-means is one of the most popular and efficient clustering methods. However, K-means has difficulty analysing high-dimensional data sets that contain missing values, and previous studies have shown that high feature dimensionality poses further problems for K-means clustering. To handle missing values, an imputation method is needed to minimise the effect of incomplete high-dimensional data on the K-means clustering process. This research studies the effect of imputation algorithms and dimensionality reduction techniques on the performance of K-means clustering. Three imputation methods are implemented for missing value estimation: K-nearest neighbours (KNN), Local Least Squares (LLS), and Bayesian Principal Component Analysis (BPCA). Principal Component Analysis (PCA) is a dimension reduction method that removes unnecessary attributes from high-dimensional data sets. Hence, a hybrid of PCA and K-means (PCA K-means) is proposed to give a better clustering result. The experiments were performed on the Wisconsin Breast Cancer data set. With LLS imputation, the proposed hybrid PCA K-means outperformed standard K-means clustering on the breast cancer data set in terms of clustering accuracy (0.29%) and computing time (95.76%).
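
The pipeline described in this abstract (impute missing values, reduce dimensionality with PCA, then cluster with K-means) can be illustrated with a minimal sketch. This is not the authors' implementation: it uses plain KNN imputation rather than LLS or BPCA, keeps only the first principal component (found by power iteration), and runs on made-up toy values rather than the Wisconsin Breast Cancer data. The function names `knn_impute`, `first_principal_scores`, and `kmeans_1d` are hypothetical.

```python
import math

def knn_impute(rows, k=2):
    """Replace None with the mean of that feature over the k nearest rows.
    Distance uses only the features that both rows have observed."""
    def dist(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    filled = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, value in enumerate(row):
            if value is None:
                # Candidate donors: other rows where feature j is observed.
                donors = sorted(
                    (dist(row, other), other[j])
                    for m, other in enumerate(rows)
                    if m != i and other[j] is not None
                )[:k]
                new_row[j] = sum(v for _, v in donors) / len(donors)
        filled.append(new_row)
    return filled

def first_principal_scores(rows, iterations=200):
    """Project centred rows onto the leading eigenvector of the covariance
    matrix, estimated by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(d)]
           for a in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        vec = [sum(cov[a][b] * vec[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))
        vec = [v / norm for v in vec]
    return [sum(c[j] * vec[j] for j in range(d)) for c in centred]

def kmeans_1d(values, k=2, iterations=50):
    """Lloyd's algorithm on 1-D values; centres start evenly spaced
    between the minimum and maximum (an assumption, requires k >= 2)."""
    lo, hi = min(values), max(values)
    centres = [lo + (hi - lo) * c / (k - 1) for c in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: abs(v - centres[c]))
                  for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centres[c] = sum(members) / len(members)
    return labels

# Toy data (invented): two well-separated groups, two missing entries.
rows = [
    [1.0, 2.0, 1.5],
    [1.2, None, 1.4],
    [0.9, 2.1, 1.6],
    [8.0, 9.0, 8.5],
    [8.2, 9.1, None],
    [7.9, 8.8, 8.4],
]
filled = knn_impute(rows, k=2)
scores = first_principal_scores(filled)
labels = kmeans_1d(scores, k=2)
print(labels)
```

Clustering on the projected scores instead of the full feature matrix is where a saving in computing time would come from, since each distance evaluation touches fewer dimensions.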

    Statistical and computational methods for addressing heterogeneity in genomic data

    Heterogeneity describes any variability across different datasets. In genomic studies that profile gene expression levels, heterogeneity is ubiquitous and may bring challenges to the integrative analysis of multiple datasets. Thus, considerable effort is needed to understand and address the impact of heterogeneity. In this dissertation, I have developed novel statistical models and computational software for this purpose. I derived reference-batch ComBat and ComBat-Seq, two improved models based on the state-of-the-art method, ComBat, for addressing one particular type of heterogeneity known as “batch effects”. I showed their benefits compared to existing methods in several data types and situations, and implemented these models in publicly available software. Then, I created systematic simulations to explore the impact of common study heterogeneity on the independent validation of genomic prediction models, showing that the most identifiable sources of heterogeneity are not the primary ones affecting the validation of genomic predictors. Finally, I adapted a solution using cross-study ensemble learning to train predictors with generalizable independent performance, to address the unwanted impact of batch effects on prediction. I compared this new framework with the traditional approach for batch correction, showing that cross-study learning may provide a more robust-performing model in independent validation. Results in this dissertation provide insights and guidelines for working with heterogeneous gene expression profiling datasets in practice, and encourage further investigation into understanding and addressing heterogeneity in genomic studies.
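
To make the idea of adjusting batches toward a reference batch concrete, here is a deliberately simplified sketch. It is not ComBat or reference-batch ComBat: it performs only a per-gene location and scale adjustment and omits the empirical Bayes shrinkage and covariate modelling of the real method. The function name `align_to_reference` and the expression values are invented for illustration.

```python
from statistics import mean, stdev

def align_to_reference(values, batches, reference):
    """Rescale one gene's expression so every non-reference batch matches
    the reference batch's mean and standard deviation.
    The reference batch itself is left unchanged."""
    ref = [v for v, b in zip(values, batches) if b == reference]
    ref_mean, ref_sd = mean(ref), stdev(ref)
    adjusted = list(values)
    for batch in set(batches) - {reference}:
        idx = [i for i, b in enumerate(batches) if b == batch]
        batch_vals = [values[i] for i in idx]
        b_mean, b_sd = mean(batch_vals), stdev(batch_vals)
        for i in idx:
            # Standardise within the batch, then map onto the reference scale.
            adjusted[i] = ref_mean + (values[i] - b_mean) * ref_sd / b_sd
    return adjusted

# Invented values for a single gene measured in two batches.
values = [5.0, 6.0, 7.0, 15.0, 17.0, 19.0]
batches = ["A", "A", "A", "B", "B", "B"]
adjusted = align_to_reference(values, batches, reference="A")
print(adjusted)
```

Leaving the reference batch untouched means a predictor trained on that batch does not need retraining when new batches are aligned to it.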

    Evaluating the utility of gene expression data from patient-matched samples for studying breast cancer

    Breast cancer is a heterogeneous disease with distinct subtypes and many different clinical presentations. Neoadjuvant therapy of breast cancer offers a window of opportunity to study translational changes in tumours as a result of treatment alone and may help to identify tumour response status. Pairs of samples collected from different sites or sequentially from the same individual can potentially provide additional prognostic information for the risk stratification of breast cancer. Here, we seek to aggregate multiple studies of valuable, multi-sampled, patient-matched cohorts for meta-analysis to check for an enhanced ability to make new and significant findings about the underlying mechanisms of tumour treatment response. Multiple sequentially-matched datasets of pre- and on-treatment matched primary tumour and lymph node samples were collected and examined for differentially expressed genes and pathways indicative of pathological response. Machine learning methods were applied to identify biomarkers of response from the on-treatment samples, and profiling comparisons were made to assess the additional value of matched patient samples to accurately predict risk. Lastly, five sequentially sampled datasets were aggregated for meta-analysis by combining the normalised pre- to on-treatment expression level differences to identify commonalities in the response to therapy across both endocrine and chemotherapy treatment strategies. The gene, AAGAB, was identified through iterative differential analysis, and was found to be 78% accurate in validation for the prediction of pathological complete response in neoadjuvant chemotherapy treated breast cancer. AAGAB demonstrated significant separation of patient survival curves (log rank p = 0.0036), and the on-treatment samples more accurately reflected the patient risk than the pretreatment samples. 
    Matched lymph node tissue of primary breast cancer was more successful at capturing the patient’s risk of recurrence than the primary biopsy, correctly identifying 83% (10/12) of the recurring patients compared to 25% (3/12) in the primary. Underlying differential expression analysis also showed a considerable number of high-profile breast cancer genes over-represented in the lymph node. Aggregation of multiple sequential studies resulted in low post-integration concordance values with the reference patient data (<30% profiling agreement), and is not recommended for this type of analysis. However, combining the pairwise change values for gene expression level data was successful, and resulted in the creation of highly accurate models for predicting patient response (F1 accuracy score, 0.92) as well as the identification of potential common escape pathways to breast cancer therapies. Analysis of the matched pre- and on-treatment samples revealed the intrinsic value of multiple on-treatment biopsies. These samples offer valuable new targets for biomarker identification that show significant increases in accuracy for the prediction of response and long-term outcome in neoadjuvant chemotherapy. Additional sampling of involved metastatic lymph nodes also improves the prognostic capabilities for clinicians by providing a potentially more accurate view of the per-patient risk profile. Lastly, the pairwise expression change values show the direction of tumour change, which can be used to create new models for the prediction and classification of patient risk and for furthering our understanding of the mechanisms behind patient non-response.
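
The pairwise change values and the F1 score mentioned in this abstract can be sketched as follows. Everything here is illustrative: the expression values, the response labels, and the threshold of -1.0 are invented, and the one-gene threshold rule stands in for whatever models the study actually trained.

```python
def pairwise_change(pre, on):
    """Per-patient change in expression: on-treatment minus pre-treatment."""
    return [o - p for p, o in zip(pre, on)]

def f1_score(truth, predicted):
    """Harmonic mean of precision and recall for boolean labels."""
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Invented expression of one gene in four patients, before and on treatment.
pre = [8.0, 7.5, 9.0, 6.0]
on = [5.0, 7.4, 6.5, 6.2]
change = pairwise_change(pre, on)

# Hypothetical rule: call a patient a responder if expression drops by
# more than 1.0 unit between the two matched samples.
predicted = [c < -1.0 for c in change]
truth = [True, False, False, False]  # invented response labels
score = f1_score(truth, predicted)
print(change, predicted, round(score, 3))
```

Working with the per-patient difference rather than the raw values removes each patient's own baseline, which is one reason matched samples can be combined across studies more easily than absolute expression levels.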