46 research outputs found

    SED, a normalization free method for DNA microarray data analysis

    BACKGROUND: Analysis of DNA microarray data usually begins with a normalization step in which the intensities of different arrays are adjusted to the same scale so that intensity levels from different arrays can be compared with one another. Both simple methods based on total array intensity and more complex methods that depend on the "local intensity level" have been developed, and some of them are widely used. Much less developed are methods that bypass the normalization step and therefore yield results that are not confounded by potential normalization errors. RESULTS: Instead of focusing on the raw intensity levels, we developed a new method for microarray data analysis that maps each gene's expression intensity level to a high-dimensional space of SEDs (Signs of Expression Difference), the signs of the expression intensity difference between a given gene and every other gene on the array. Since SEDs are unchanged under any monotonically increasing transformation of intensity levels, the SED-based method is normalization free. When tested on a multi-class tumor classification problem, simple Naive Bayes and Nearest Neighbor methods using the SED approach gave results comparable with normalized intensity-based algorithms. Furthermore, a high percentage of classifiers based on a single gene's SED gave good classification results, suggesting that SED does capture essential information from the intensity levels. CONCLUSION: The results of testing this new method on multi-class tumor classification problems suggest that the SED-based, normalization-free method of microarray data analysis is feasible and promising.
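
    The SED mapping described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `sed_vector` and `sed_matrix`, the toy intensity values, and the particular rescaling are all assumptions chosen for illustration. The sketch only shows that the sign pattern of a single array survives an order-preserving rescaling of its intensities.

```python
import math

def sed_vector(intensities, gene):
    """Signs (+1, 0, -1) of the difference between one gene and every other gene
    on the same array. `gene` is an index into `intensities` (hypothetical helper)."""
    x = intensities[gene]
    return [(x > y) - (x < y) for j, y in enumerate(intensities) if j != gene]

def sed_matrix(intensities):
    """SED vectors for all genes on one array (hypothetical helper)."""
    return [sed_vector(intensities, g) for g in range(len(intensities))]

# Toy intensities for four genes on a single array (made-up numbers).
raw = [120.0, 35.5, 980.0, 410.0]

# An arbitrary increasing transformation, standing in for a normalization step.
rescaled = [math.log2(3.0 * v + 10.0) for v in raw]

# The sign patterns are identical before and after rescaling.
print(sed_matrix(raw) == sed_matrix(rescaled))
```

    Because only the ordering of intensities within an array enters the representation, any classifier built on these sign vectors sees the same input regardless of how the array was scaled.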

    COMPARING VARIOUS MACHINE LEARNING METHODS FOR PREDICTION OF PATIENT REVISIT INTENTION: A CASE STUDY

    Many techniques have been proposed for the analysis of customer intention, from surveys to statistical models. During the last few years, different machine learning approaches have been applied successfully to customer-centric decision-making problems. In this study, we conduct a comparative assessment of the performance of ten widely used machine learning methods (logistic regression, multilayer perceptron, support vector machines, IBk linear NN search, KStar, locally weighted learning, decision stump, C4.5, random tree, and reduced error pruning tree) with the aim of suggesting appropriate machine learning techniques for the patient revisit intention prediction problem. Experimental results reveal that the C4.5 decision tree is the best predictive model, since it has the highest overall average accuracy and a very low percentage of both Type I and Type II errors, closely followed by locally weighted learning and the decision stump, whereas the logistic regression and IBk linear NN search algorithms appear to be the worst in terms of average accuracy and Type II error. In addition, the random tree and IBk linear NN search algorithms appear to be the worst in terms of Type I error.

    A jackknife-like method for classification and uncertainty assessment of multi-category tumor samples using gene expression information

    BACKGROUND: The use of gene expression profiling for the classification of human cancer tumors has been widely investigated. Previous studies were successful in distinguishing several tumor types in binary problems. As there are over a hundred types of cancers, and potentially even more subtypes, it is essential to develop multi-category methodologies for molecular classification for any meaningful practical application. RESULTS: A jackknife-based supervised learning method called the paired-samples test algorithm (PST), coupled with a binary classification model based on linear regression, was proposed and applied to two well-known and challenging datasets consisting of 14 (GCM dataset) and 9 (NCI60 dataset) tumor types. The results showed that the proposed method improved the prediction accuracy of the test samples for the GCM dataset, especially when the t-statistic was used in the primary feature selection. For the NCI60 dataset, the application of PST improved prediction accuracy when the number of genes used was relatively small (100 or 200). These improvements made the binary classification method more robust to the gene selection mechanism and to the number of genes used. The overall prediction accuracies were competitive with the most accurate results obtained by several previous studies on the same datasets with other methods. Furthermore, the relative confidence R(T) provided a unique insight into the sources of the uncertainty shown in the statistical classification and the potential variants within the same tumor type. CONCLUSION: We proposed a novel bagging method for the classification and uncertainty assessment of multi-category tumor samples using gene expression information. Its strengths were demonstrated in the application to two benchmark datasets.

    Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data

    BACKGROUND: Using DNA microarrays, we have developed two novel models for tumor classification and target gene prediction. First, gene expression profiles are summarized by optimally selected Self-Organizing Maps (SOMs), followed by tumor sample classification by Fuzzy C-means clustering. Then, the prediction of marker genes is accomplished by either manual feature selection (visualizing the weighted/mean SOM component plane) or automatic feature selection (by pair-wise Fisher's linear discriminant). RESULTS: The proposed models were tested on four published datasets: (1) leukemia, (2) colon cancer, (3) brain tumors, and (4) NCI cancer cell lines. The models gave class predictions with markedly reduced error rates compared to other class prediction approaches, and the importance of feature selection in microarray data analysis was also emphasized. CONCLUSIONS: Our models identify marker genes with predictive potential, often better than other available methods in the literature. The models are potentially useful for medical diagnostics and may reveal some insights into cancer classification. Additionally, we illustrated two limitations in tumor classification from microarray data related to the biology underlying the data, in terms of (1) the class size of the data and (2) the internal structure of classes. These limitations are not specific to the classification models used.
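
    For readers unfamiliar with Fuzzy C-means, the following sketch shows the standard membership update for one sample given fixed cluster centers. It is the textbook formula, not code from the paper: the function name `fcm_memberships`, the fuzzifier value `m=2.0`, and the toy coordinates are assumptions, and the SOM summarization step described in the abstract is not modeled here.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of `point` in each cluster:
    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), with m > 1 the fuzzifier."""
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with one or more centers belongs fully to them.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]

# Toy example: one sample and two cluster centers in a 2-D feature space.
u = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)])
print(u)
```

    Unlike a hard assignment, each sample receives a graded membership in every cluster, and the memberships sum to one.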

    Instance-based concept learning from multiclass DNA microarray data

    BACKGROUND: Various statistical and machine learning methods have been successfully applied to the classification of DNA microarray data. Simple instance-based classifiers such as nearest neighbor (NN) approaches perform remarkably well in comparison to more complex models, and are currently experiencing a renaissance in the analysis of data sets from biology and biotechnology. While binary classification of microarray data has been extensively investigated, studies involving multiclass data are rare. The question remains open whether there exists a significant difference in performance between NN approaches and more complex multiclass methods. Comparative studies in this field commonly assess different models based on their classification accuracy only; however, this approach lacks the rigor needed to draw reliable conclusions and is inadequate for testing the null hypothesis of equal performance. Comparing novel classification models to existing approaches requires focusing on the significance of differences in performance. RESULTS: We investigated the performance of instance-based classifiers, including an NN classifier able to assign a degree of class membership to each sample. This model alleviates a major problem of conventional instance-based learners, namely the lack of confidence values for predictions. The model translates the distances to the nearest neighbors into 'confidence scores'; the higher the confidence score, the closer the considered instance is to a pre-defined class. We applied the models to three real gene expression data sets and compared them with state-of-the-art methods for classifying microarray data of multiple classes, assessing performance using a statistical significance test that took into account the data resampling strategy. Simple NN classifiers performed as well as, or significantly better than, their more intricate competitors. CONCLUSION: Given its highly intuitive underlying principles (simplicity, ease of use, and robustness), the k-NN classifier complemented by a suitable distance-weighting regime constitutes an excellent alternative to more complex models for multiclass microarray data sets. Instance-based classifiers using weighted distances are not limited to microarray data sets, but are likely to perform competitively in classifications of high-dimensional biological data sets such as those generated by high-throughput mass spectrometry.
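
    The idea of turning neighbor distances into per-class confidence scores can be sketched as follows. The abstract does not give the scoring formula, so the inverse-distance weighting used here is an assumption standing in for the authors' weighting regime; the function name `knn_confidence`, the parameter `eps`, and the toy data are likewise hypothetical.

```python
import math
from collections import defaultdict

def knn_confidence(train, query, k=3, eps=1e-9):
    """Distance-weighted k-NN returning a confidence score per class.
    `train` is a list of (feature_tuple, label) pairs. Each of the k nearest
    neighbors votes with weight 1 / (distance + eps); scores are normalized
    to sum to 1. The weighting scheme is an illustrative assumption."""
    neighbors = sorted((math.dist(x, query), label) for x, label in train)[:k]
    weights = defaultdict(float)
    for d, label in neighbors:
        weights[label] += 1.0 / (d + eps)
    total = sum(weights.values())
    return {label: w / total for label, w in weights.items()}

# Toy two-class data set with two features per sample (made-up values).
train = [((0.0, 0.0), "A"), ((0.0, 1.0), "A"), ((5.0, 5.0), "B"), ((6.0, 5.0), "B")]
scores = knn_confidence(train, (0.2, 0.1), k=3)
predicted = max(scores, key=scores.get)
print(predicted, scores)
```

    A plain majority vote would report only the predicted label; the normalized weights additionally indicate how strongly the neighborhood supports that label.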

    Multiclass classification of microarray data samples with a reduced number of genes

    BACKGROUND: Multiclass classification of microarray data samples with a reduced number of genes is a rich and challenging problem in Bioinformatics research. The problem gets harder as the number of classes is increased. In addition, the performance of most classifiers is tightly linked to the effectiveness of mandatory gene selection methods. Critical to gene selection is the availability of estimates about the maximum number of genes that can be handled by any classification algorithm. Lack of such estimates may lead to either computationally demanding explorations of a search space with thousands of dimensions or classification models based on gene sets of unrestricted size. In the former case, unbiased but possibly overfitted classification models may arise. In the latter case, biased classification models unable to support statistically significant findings may be obtained. RESULTS: A novel bound on the maximum number of genes that can be handled by binary classifiers in binary mediated multiclass classification algorithms of microarray data samples is presented. The bound suggests that high-dimensional binary output domains might favor the existence of accurate and sparse binary mediated multiclass classifiers for microarray data samples. CONCLUSIONS: A comprehensive experimental work shows that the bound is indeed useful to induce accurate and sparse multiclass classifiers for microarray data samples.

    Sample phenotype clusters in high-density oligonucleotide microarray data sets are revealed using Isomap, a nonlinear algorithm

    BACKGROUND: Life processes are determined by the organism's genetic profile and multiple environmental variables. However, the interaction between these factors is inherently non-linear [1]. Microarray data is one representation of the nonlinear interactions among genes and between genes and environmental factors. Still, most microarray studies use linear methods for the interpretation of nonlinear data. In this study, we apply Isomap, a nonlinear method of dimensionality reduction, to analyze three independent large Affymetrix high-density oligonucleotide microarray data sets. RESULTS: Isomap discovered low-dimensional structures embedded in the Affymetrix microarray data sets. These structures correspond to and help to interpret biological phenomena present in the data. This analysis provides examples of temporal, spatial, and functional processes revealed by the Isomap algorithm. In a spinal cord injury data set, Isomap discovers the three main modalities of the experiment: location and severity of the injury and the time elapsed after the injury. In a multiple tissue data set, Isomap discovers a low-dimensional structure that corresponds to anatomical locations of the source tissues. This model is capable of describing low- and high-resolution differences in the same model, such as kidney-vs.-brain and differences between the nuclei of the amygdala, respectively. In a high-throughput drug screening data set, Isomap discovers the monocytic and granulocytic differentiation of myeloid cells and maps several chemical compounds on the two-dimensional model. CONCLUSION: Visualization of Isomap models provides useful tools for exploratory analysis of microarray data sets. In most instances, Isomap models explain more of the variance present in the microarray data than PCA or MDS. Finally, Isomap is a promising new algorithm for class discovery and class prediction in high-density oligonucleotide data sets.

    GEDI: a user-friendly toolbox for analysis of large-scale gene expression data

    BACKGROUND: Several mathematical and statistical methods have been proposed in the last few years to analyze microarray data. Most of those methods involve complicated formulas and software implementations that require advanced computer programming skills. Researchers from other areas may experience difficulties when attempting to use those methods in their research. Here we present a user-friendly toolbox which allows large-scale gene expression analysis to be carried out by biomedical researchers with limited programming skills. RESULTS: Here, we introduce a user-friendly toolbox called GEDI (Gene Expression Data Interpreter), an extensible, open-source, and freely available tool that we believe will be useful to a wide range of laboratories, and to researchers with no background in Mathematics and Computer Science, allowing them to analyze their own data by applying both classical and advanced approaches developed and recently published by Fujita et al. CONCLUSION: GEDI is an integrated user-friendly viewer that combines the state-of-the-art SVR, DVAR and SVAR algorithms, previously developed by us. It facilitates the application of SVR, DVAR and SVAR beyond the mathematical formulas presented in the corresponding publications, and allows one to better understand the results by means of the available visualizations. Both running the statistical methods and visualizing the results are carried out within the graphical user interface, rendering these algorithms accessible to the broad community of researchers in Molecular Biology. This research was supported by FAPESP, CAPES, CNPq, FINEP and PRP-USP.

    Using Fuzzy Patterns for Gene Selection and Data Reduction on Microarray Data

    The advent of DNA microarray technology has supplied a large volume of data to many fields like machine learning and data mining. Intelligent support is essential for managing and interpreting this great amount of information. One of the well-known constraints specifically related to microarray data is the large number of genes in comparison with the small number of available experiments. In this context, the ability to design methods capable of overcoming current limitations of state-of-the-art algorithms is crucial to the development of successful applications. In this paper we demonstrate how a supervised fuzzy pattern algorithm can be used to perform DNA microarray data reduction over real data. The benefits of our method can be employed to find biologically significant insights relating to meaningful genes in order to improve previous successful techniques. Experimental results on acute myeloid leukemia diagnosis show the effectiveness of the proposed approach.