2,032 research outputs found

    Assessing a feature's trustworthiness and two approaches to feature selection

    University of Technology, Sydney. Faculty of Engineering and Information Technology.
    Improvements in technology have led to a relentless deluge of information that current data mining approaches have trouble dealing with. An extreme example of this is a problem domain referred to as “non-classical”. Non-classical problems fail to fulfill the requirements of statistical theory: that the number of instances in the sample set be much greater than the number of dimensions. Non-classical problems are mainly characterized by many dimensions (or features) and few noise-affected samples. Microarray technology provides one source of non-classical problems, typically producing data sets with a dimensionality exceeding ten thousand and containing just a few hundred instances. A risk with such a data set is building a model that is significantly influenced by coincidental correlations between the inputs (the model’s features) and the output. A classical strategy for managing this risk is to reduce the dimensionality without significantly affecting the correlation between the remaining features and the model’s output. However, this strategy does not explicitly consider the impact of poor data quality (or noise) and of having few data samples. To actively manage noise, a feature selection strategy is needed that considers not only the correlation between the features and the output, but also the quality of the features. It is proposed that feature quality, or simply the feature’s “trustworthiness”, should be incorporated within feature selection. As the trustworthiness of a feature increases, the ability to accurately extract the underlying structure of the data is also expected to increase. Another characteristic of non-classical problems is significant feature redundancy (where information provided within one dimension is also present in one or more other dimensions). This research postulates that the use of feature trustworthiness and redundancy provides an opportunity to actively reduce the noise associated with the selected feature set, while still finding features that are well correlated with the model’s output. Two fundamental contributions are provided by this thesis: the notion of feature “trustworthiness” and a way of integrating trustworthiness within feature selection. Trustworthiness provides a flexible approach for evaluating the quality of a feature’s sample data and, in certain cases, the quality of the test data. This flexibility encourages the use of prior knowledge about the specific problem and, in particular, about how the quality of the data is best estimated. Traditionally, feature selection implicitly assumes that every instance of data supplied by preprocessing has the same quality. Trustworthiness also provides an opportunity to incorporate a measure of the changes applied to the data set as a result of data cleaning. Using results from computational learning theory, a theoretical justification was constructed that shows the difficulty of building an accurate model for a non-classical problem. The justification shows how a modest data quality problem can result in insufficient sample data to permit successful learning, and how selecting less noisy data, or sufficiently trustworthy features, can enable successful learning with the available data points.
    This thesis presents two methodologies that incorporate a measure of data quality within feature selection: one methodology uses only training data, while the other also incorporates test data when evaluating feature trustworthiness. The two methodologies are contrasted with each other and with a traditional feature selection methodology that does not consider data quality. A number of data sets were used to test these methodologies, the main ones being synthetic data, childhood leukaemia and chronic fatigue syndrome. In most cases the three feature selection methodologies achieved similar accuracy; however, there were clear differences in the features selected by each. Heat maps used to visualize how clearly the selected features separate the class labels showed dramatic differences: the two methodologies that incorporate trustworthiness provided a clearer separation, while the traditional methodology was substantially inferior and appeared to be heavily influenced by artifacts. Gene Set Enrichment Analysis (GSEA), a widely used resource for evaluating the biological meaningfulness of gene sets (Subramanian, Tamayo, Mootha, Mukherjee, Ebert, Gillette, Paulovich, Pomeroy, Golub, Lander, and Mesirov, 2005), showed that the two proposed methodologies selected genes that were more biologically meaningful than those selected by a traditional feature selection methodology. The experiments also evaluated the sensitivity of trustworthiness to differences in the data set. By evaluating the trustworthiness of every feature, it was shown that considerable changes occurred across data folds. This result agrees with findings in the literature, such as Ein-Dor, Kela, Getz, Givol, and Domany (2005), and provides one explanation for the difficulty of modeling non-classical problems.
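    As an illustration of the general idea only (not the thesis's actual algorithms), the sketch below shows a greedy filter that weights each feature's correlation with the class label by a per-feature "trustworthiness" score and penalises redundancy with features already selected. The trust scores, the redundancy penalty, and all names and parameters are assumptions made for the example.

```python
# Illustrative sketch only: greedy selection scoring each candidate feature by
# trust-weighted relevance minus a redundancy penalty.  The trustworthiness
# values are assumed to be supplied externally (e.g. from replicate quality).
import numpy as np

def trust_weighted_selection(X, y, trust, n_select=10, redundancy_weight=0.5):
    """X: (samples, features); y: class labels; trust: (features,) scores in [0, 1]."""
    n_features = X.shape[1]
    # Relevance: absolute Pearson correlation between each feature and the label.
    y_centered = y - y.mean()
    Xc = X - X.mean(axis=0)
    relevance = np.abs(Xc.T @ y_centered) / (
        np.linalg.norm(Xc, axis=0) * np.linalg.norm(y_centered) + 1e-12)
    selected, candidates = [], list(range(n_features))
    for _ in range(min(n_select, n_features)):
        best, best_score = None, -np.inf
        for j in candidates:
            # Redundancy: mean absolute correlation with features already chosen.
            if selected:
                r = np.abs(np.corrcoef(X[:, selected + [j]].T)[-1, :-1]).mean()
            else:
                r = 0.0
            score = trust[j] * relevance[j] - redundancy_weight * r
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
        candidates.remove(best)
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(60, 200))           # few samples, many features
    y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)
    trust = rng.uniform(0.2, 1.0, size=200)  # assumed per-feature quality scores
    print(trust_weighted_selection(X, y, trust, n_select=5))
```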

    Evolutionary algorithms and weighting strategies for feature selection in predictive data mining

    The improvements in Deoxyribonucleic Acid (DNA) microarray technology mean that thousands of genes can be profiled simultaneously in a quick and efficient manner. DNA microarrays are increasingly being used for prediction and early diagnosis in cancer treatment. Feature selection and classification play a pivotal role in this process. The correct identification of an informative subset of genes may directly lead to putative drug targets, and these genes can also be used as an early diagnosis or predictive tool. However, the large number of features (many thousands) present in a typical dataset poses a formidable barrier to feature selection efforts. Many approaches have been presented in the literature for feature selection in such datasets. Most of them use classical statistical approaches (e.g. correlation). Classical statistical approaches, although fast, are incapable of detecting non-linear interactions between features of interest. Evolutionary Algorithms (EAs), by their nature, are capable of taking non-linear interactions into account. Therefore, EAs are very promising for feature selection in such datasets. It has been shown that dimensionality reduction increases the efficiency of feature selection in large and noisy datasets such as DNA microarray data. The two-phase Evolutionary Algorithm/k-Nearest Neighbours (EA/k-NN) algorithm is a promising approach that carries out initial dimensionality reduction as well as feature selection and classification. This thesis further investigates the two-phase EA/k-NN algorithm and introduces an adaptive weights scheme for the k-Nearest Neighbours (k-NN) classifier. It also introduces a novel weighted centroid classification technique and a correlation-guided mutation approach. Results show that the weighted centroid approach is capable of out-performing the EA/k-NN algorithm across five large biomedical datasets. The thesis also identifies promising new areas of research that would complement the techniques introduced and investigated.
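    To convey the flavour of a weighted centroid classifier driven by an evolutionary search, the sketch below tunes a feature-weight vector with a toy (1+1) evolutionary loop. It is not the thesis's two-phase EA/k-NN algorithm or its adaptive weighting scheme; the classifier, fitness measure, and all parameters are illustrative assumptions.

```python
# Toy sketch: weighted-centroid classification with feature weights evolved by
# a (1+1) evolutionary loop scored on resubstitution accuracy.
import numpy as np

def fit_centroids(X, y):
    classes = np.unique(y)
    centroids = np.array([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids

def weighted_centroid_predict(X, w, classes, centroids):
    # Squared distance to each class centroid, with per-feature weights
    # acting as relevance factors.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2 * w).sum(axis=2)
    return classes[d.argmin(axis=1)]

def evolve_weights(X, y, generations=300, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    classes, centroids = fit_centroids(X, y)
    w = np.ones(X.shape[1])
    best = (weighted_centroid_predict(X, w, classes, centroids) == y).mean()
    for _ in range(generations):
        candidate = np.clip(w + rng.normal(scale=sigma, size=w.shape), 0.0, None)
        acc = (weighted_centroid_predict(X, candidate, classes, centroids) == y).mean()
        if acc >= best:          # accept equal-or-better mutants
            w, best = candidate, acc
    return w, best

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(100, 50))
    y = (X[:, 3] - X[:, 7] > 0).astype(int)
    w, acc = evolve_weights(X, y)
    print("resubstitution accuracy:", round(acc, 3))
    print("highest-weighted features:", np.argsort(w)[-5:])
```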

    A statistical framework for testing functional categories in microarray data

    Ready access to emerging databases of gene annotation and functional pathways has shifted assessments of differential expression in DNA microarray studies from single genes to groups of genes with shared biological function. This paper takes a critical look at existing methods for assessing the differential expression of a group of genes (functional category), and provides some suggestions for improved performance. We begin by presenting a general framework, in which the set of genes in a functional category is compared to the complementary set of genes on the array. The framework includes tests for overrepresentation of a category within a list of significant genes, and methods that consider continuous measures of differential expression. Existing tests are divided into two classes. Class 1 tests assume that gene-specific measures of differential expression are independent, despite overwhelming evidence of positive correlation. Analytic and simulated results are presented that demonstrate Class 1 tests are strongly anti-conservative in practice. Class 2 tests account for gene correlation, typically through array permutation, which by construction has proper Type I error control for the induced null. However, both Class 1 and Class 2 tests use a null hypothesis that all genes have the same degree of differential expression. We introduce a more sensible and general (Class 3) null under which the profile of differential expression is the same within the category and its complement. Under this broader null, Class 2 tests are shown to be conservative. We propose standard bootstrap methods for testing against the Class 3 null and demonstrate that they provide valid Type I error control and more power than array permutation in simulated datasets and real microarray experiments. (Published in the Annals of Applied Statistics, http://dx.doi.org/10.1214/07-AOAS146, by the Institute of Mathematical Statistics, http://www.imstat.org; journal site: http://www.imstat.org/aoas/.)
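    For orientation, the sketch below implements a generic competitive gene-set test in the "Class 2" style described above: the observed statistic compares mean per-gene t-statistics between a category and its complement, and significance is assessed by permuting array labels, which preserves inter-gene correlation. It does not reproduce the paper's proposed bootstrap procedure, and the synthetic data and category are assumptions for illustration.

```python
# Generic competitive gene-set test via array (sample-label) permutation.
import numpy as np

def gene_t_stats(expr, labels):
    """expr: (genes, arrays); labels: 0/1 group membership per array."""
    a, b = expr[:, labels == 0], expr[:, labels == 1]
    na, nb = a.shape[1], b.shape[1]
    se = np.sqrt(a.var(axis=1, ddof=1) / na + b.var(axis=1, ddof=1) / nb)
    return (a.mean(axis=1) - b.mean(axis=1)) / (se + 1e-12)

def category_test(expr, labels, in_category, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    def stat(lab):
        t = gene_t_stats(expr, lab)
        # Difference in mean differential expression: category vs complement.
        return t[in_category].mean() - t[~in_category].mean()
    observed = stat(labels)
    perm = np.array([stat(rng.permutation(labels)) for _ in range(n_perm)])
    p_value = (np.abs(perm) >= abs(observed)).mean()
    return observed, p_value

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    expr = rng.normal(size=(500, 20))        # 500 genes, 20 arrays
    labels = np.array([0] * 10 + [1] * 10)
    in_category = np.zeros(500, dtype=bool)
    in_category[:25] = True                  # hypothetical 25-gene category
    expr[:25, labels == 1] += 1.0            # shift the category in group 1
    print(category_test(expr, labels, in_category))
```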

    Computational models and approaches for lung cancer diagnosis

    The success of treatment of patients with cancer depends on establishing an accurate diagnosis. To this end, the aim of this study is to develop novel lung cancer diagnostic models. New algorithms are proposed to analyse the biological data and extract knowledge that assists in achieving accurate diagnosis results.

    Identifying genomic signatures for predicting breast cancer outcomes

    Predicting the risk of recurrence in breast cancer patients is a critical task in clinics. Recent developments in DNA microarrays have fostered tremendous advances in molecular diagnosis and prognosis of breast cancer.
    The first part of our study was based on a novel approach of considering the level of genomic instability as one of the most powerful predictors of clinical outcome. A systematic technique was presented to explore whether there is a linkage between the degree of genomic instability, gene expression patterns, and clinical outcomes by considering the following hypotheses: first, the degree of genomic instability is reflected by an aneuploidy-specific gene signature; second, this signature is robust and allows prediction of clinical outcomes in breast cancer. The first hypothesis was tested by gene expression profiling of 48 breast tumors with varying degrees of genomic instability. A supervised machine learning approach employing a combination of feature selection algorithms was used to identify a 12-gene genomic instability signature from a set of 7657 genes. The second hypothesis was tested by performing patient stratification on published breast cancer datasets using the genomic instability signature. The results showed that patients with genomically stable breast carcinomas had considerably longer disease-free survival times than those with genomically unstable tumors. The gene signature generated significant patient stratification with distinct relapse-free and overall survival (log-rank tests; p < 0.05; n = 469). It was independent of clinical-pathological parameters and provided additional prognostic information within the sub-groups defined by each of them.
    The importance of selecting patients at high risk of recurrence for more aggressive therapy was addressed in the second part of the study, given that breast cancer patients with advanced-stage disease receive chemotherapy but only half of them benefit from it. The FDA recently approved the first gene test for cancer, MammaPrint, for node-negative primary breast cancer. Oncotype DX is a commercially available gene test for tamoxifen-treated, node-negative, and estrogen receptor-positive breast cancer. These signatures are specific to early-stage breast cancers. A population-based approach to the molecular prognosis of breast cancer is needed for more rational therapy for breast cancer patients. A 28-gene expression signature was identified in our previous study using a population-based approach. Using this signature, a patient-stratification scheme was developed by employing the nearest centroid classification algorithm. It generated a significant stratification with distinct relapse-free survival (log-rank tests; p < 0.05; n = 1337) and overall survival (log-rank tests; p < 0.05; n = 806), based on transcriptional profiles produced on a diverse range of microarray platforms. This molecular classification scheme could enable physicians to make treatment decisions based on the specific characteristics of a patient and their tumor, rather than on population statistics. It could further refine subgroups defined by traditional clinical-pathological parameters into prognostic risk groups. It was unclear whether a common gene set could predict a poor outcome in breast and ovarian cancer, the most common malignancies in women. The 28-gene signature generated significant prognostic categorization in ovarian cancers (log-rank tests; p < 0.0001; n = 124), thus confirming the clinical applicability of the gene signature for predicting breast and ovarian cancer recurrence.
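    A minimal sketch of nearest-centroid risk stratification of the kind described above is given below. The expression data, outcome labels, and gene signature are synthetic stand-ins generated for the example; the study's 12-gene and 28-gene signatures are not reproduced here.

```python
# Nearest-centroid stratification sketch: "good" and "poor" prognosis centroids
# are fitted over a fixed gene signature, and new samples are assigned to the
# closer centroid.  All data below are synthetic placeholders.
import numpy as np

def fit_centroids(expr, outcome, signature):
    """expr: (samples, genes); outcome: 0 = good prognosis, 1 = poor; signature: gene indices."""
    sig = expr[:, signature]
    return {c: sig[outcome == c].mean(axis=0) for c in (0, 1)}

def stratify(expr, signature, centroids):
    sig = expr[:, signature]
    d_good = np.linalg.norm(sig - centroids[0], axis=1)
    d_poor = np.linalg.norm(sig - centroids[1], axis=1)
    return (d_poor < d_good).astype(int)   # 1 = assigned to the poor-prognosis group

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    expr = rng.normal(size=(300, 1000))                    # 300 tumours, 1000 genes
    signature = rng.choice(1000, size=28, replace=False)   # stand-in 28-gene signature
    outcome = rng.integers(0, 2, size=300)
    poor = np.where(outcome == 1)[0]
    expr[np.ix_(poor, signature)] += 0.8                   # shift poor-outcome tumours on the signature
    centroids = fit_centroids(expr[:200], outcome[:200], signature)
    risk_group = stratify(expr[200:], signature, centroids)
    print("agreement with held-out outcomes:", round((risk_group == outcome[200:]).mean(), 2))
```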

    Medical Image Data and Datasets in the Era of Machine Learning-Whitepaper from the 2016 C-MIMI Meeting Dataset Session.

    At the first annual Conference on Machine Intelligence in Medical Imaging (C-MIMI), held in September 2016, a conference session on medical image data and datasets for machine learning identified multiple issues. The common theme from attendees was that everyone participating in medical image evaluation with machine learning is data starved. There is an urgent need to find better ways to collect, annotate, and reuse medical imaging data. Unique domain issues with medical image datasets require further study, development, and dissemination of best practices and standards, and a coordinated effort among medical imaging domain experts, medical imaging informaticists, government and industry data scientists, and interested commercial, academic, and government entities. High-level attributes of reusable medical image datasets suitable to train, test, validate, verify, and regulate ML products should be better described. NIH and other government agencies should promote and, where applicable, enforce access to medical image datasets. We should improve communication among medical imaging domain experts, medical imaging informaticists, academic clinical and basic science researchers, government and industry data scientists, and interested commercial entities.

    Permutation Tests for Classification

    We introduce and explore an approach to estimating the statistical significance of classification accuracy, which is particularly useful in scientific applications of machine learning where the high dimensionality of the data and the small number of training examples render most standard convergence bounds too loose to yield a meaningful guarantee of the generalization ability of the classifier. Instead, we estimate the statistical significance of the observed classification accuracy, or the likelihood of observing such accuracy by chance due to spurious correlations of the high-dimensional data patterns with the class labels in the given training set. We adopt permutation testing, a non-parametric technique previously developed in classical statistics for hypothesis testing in the generative setting (i.e., comparing two probability distributions). We demonstrate the method on real examples from neuroimaging studies and DNA microarray analysis and suggest a theoretical analysis of the procedure that relates the asymptotic behavior of the test to the existing convergence bounds.
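    The sketch below outlines the permutation procedure in its simplest form: cross-validated accuracy is computed with the true labels and then with many label-permuted copies of the data, and the p-value is the fraction of permutations whose accuracy matches or exceeds the observed one. The nearest-centroid classifier and the synthetic data are stand-ins chosen for brevity, not the classifiers used in the paper.

```python
# Permutation test for classification accuracy on high-dimensional, small-sample data.
import numpy as np

def cv_accuracy(X, y, n_folds=5):
    """Cross-validated accuracy of a simple nearest-centroid classifier."""
    idx = np.arange(len(y))
    folds = np.array_split(idx, n_folds)
    correct = 0
    for fold in folds:
        train = np.setdiff1d(idx, fold)
        classes = np.unique(y[train])
        cents = np.array([X[train][y[train] == c].mean(axis=0) for c in classes])
        d = ((X[fold][:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
        correct += (classes[d.argmin(axis=1)] == y[fold]).sum()
    return correct / len(y)

def permutation_p_value(X, y, n_perm=500, seed=0):
    rng = np.random.default_rng(seed)
    observed = cv_accuracy(X, y)
    null = np.array([cv_accuracy(X, rng.permutation(y)) for _ in range(n_perm)])
    # Add-one correction keeps the p-value strictly positive.
    return observed, (1 + (null >= observed).sum()) / (n_perm + 1)

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    X = rng.normal(size=(40, 300))            # few examples, many dimensions
    y = np.array([0] * 20 + [1] * 20)
    X[y == 1, :5] += 1.0                      # weak real signal in 5 features
    print(permutation_p_value(X, y, n_perm=200))
```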

    Optimally splitting cases for training and testing high dimensional classifiers

    Background: We consider the problem of designing a study to develop a predictive classifier from high dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set, and how this proportion impacts the mean squared error (MSE) of the prediction accuracy estimate.
    Results: We develop a non-parametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study to better understand the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts.
    Conclusions: By applying these approaches to a number of synthetic and real microarray datasets we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy, with higher accuracy and smaller n resulting in more cases assigned to the training set. The commonly used strategy of allocating 2/3 of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e. 85% or greater full dataset accuracy). In general, we recommend use of our non-parametric resampling approach for determining the optimal split; this approach can be applied to any dataset, using any predictor development method, to determine the best split.
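    The simulation sketch below illustrates the MSE criterion discussed above: for each candidate training proportion, a dataset is repeatedly split, a simple classifier is trained on the training part, and the squared error of the test-set accuracy estimate is measured against the accuracy computed on a large independent pool. It is not the paper's non-parametric resampling algorithm; the classifier, synthetic data, and candidate proportions are assumptions for illustration.

```python
# Simulation sketch of how the training/test split proportion affects the MSE
# of the accuracy estimate, using a nearest-centroid classifier as a stand-in.
import numpy as np

def nearest_centroid(X_train, y_train, X_test):
    classes = np.unique(y_train)
    cents = np.array([X_train[y_train == c].mean(axis=0) for c in classes])
    d = ((X_test[:, None, :] - cents[None, :, :]) ** 2).sum(axis=2)
    return classes[d.argmin(axis=1)]

def mse_for_proportion(X, y, X_pool, y_pool, p_train, n_reps=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    n_train = int(round(p_train * n))
    errors = []
    for _ in range(n_reps):
        perm = rng.permutation(n)
        tr, te = perm[:n_train], perm[n_train:]
        estimate = (nearest_centroid(X[tr], y[tr], X[te]) == y[te]).mean()
        # "True" accuracy of this trained classifier on a large independent pool.
        true_acc = (nearest_centroid(X[tr], y[tr], X_pool) == y_pool).mean()
        errors.append((estimate - true_acc) ** 2)
    return float(np.mean(errors))

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    def make(n):
        X = rng.normal(size=(n, 200))
        y = rng.integers(0, 2, size=n)
        X[y == 1, :10] += 0.7                 # moderate signal in 10 features
        return X, y
    X, y = make(100)                          # the study sample
    X_pool, y_pool = make(5000)               # stand-in for the underlying population
    for p in (0.5, 0.6, 2 / 3, 0.8):
        print(f"train fraction {p:.2f}: MSE = {mse_for_proportion(X, y, X_pool, y_pool, p):.4f}")
```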