
    Inverse Projection Representation and Category Contribution Rate for Robust Tumor Recognition

    Sparse representation based classification (SRC) methods have achieved remarkable results. SRC, however, still suffers from requiring enough training samples, insufficient use of test samples, and instability of representation. In this paper, a stable inverse projection representation based classification (IPRC) is presented to tackle these problems by effectively using test samples. An inverse projection representation (IPR) is first proposed, and its feasibility and stability are analyzed. A classification criterion named category contribution rate is constructed to match the IPR and complete classification. Moreover, a statistical measure is introduced to quantify the stability of representation-based classification methods. Based on the IPRC technique, a robust tumor recognition framework is presented by interpreting microarray gene expression data, where a two-stage hybrid gene selection method is introduced to select informative genes. Finally, a functional analysis of the candidate pathogenicity-related genes is given. Extensive experiments on six public tumor microarray gene expression datasets demonstrate that the proposed technique is competitive with state-of-the-art methods. Comment: 14 pages, 19 figures, 10 tables

    A Review on Dimension Reduction Techniques in Data Mining

    Real-world data such as images and speech signals are high-dimensional, requiring many dimensions to represent them. Higher-dimensional data make it more complex to detect and exploit the relationships among terms. Dimensionality reduction is a technique for reducing the complexity of analyzing high-dimensional data. Many methodologies are used to find the critical dimensions of a dataset, significantly reducing the number of dimensions of the original input data. Dimensionality reduction methods are of two types: feature extraction and feature selection. Feature extraction is a distinct form of dimensionality reduction that extracts important features from the input dataset. Two approaches are available for dimensionality reduction: supervised and unsupervised. One purpose of this survey is to provide an adequate comprehension of the dimensionality reduction techniques that currently exist and to introduce the applicability of any one of the prescribed methods, which depends on the given set of parameters and varying conditions. This paper surveys the schemes most commonly used for dimensionality reduction, mainly on high-dimensional datasets. A comparative analysis of the surveyed methodologies is also given, on the basis of which the best methodology for a certain type of dataset can be chosen. Keywords: Data Mining, Dimensionality Reduction, Clustering, feature selection; curse of dimensionality; critical dimensio

    Integrated smoothed location model and data reduction approaches for multi variables classification

    The Smoothed Location Model is a classification rule that deals with a mixture of continuous and binary variables simultaneously. This rule discriminates groups in a parametric form using the conditional distribution of the continuous variables given each pattern of the binary variables. To conduct a practical classification analysis, the objects must first be sorted into the cells of a multinomial table generated from the binary variables. Then, the parameters in each cell are estimated using the sorted objects. However, in many situations the estimated parameters are poor if the number of binary variables is large relative to the sample size. A large number of binary variables creates too many empty multinomial cells, leading to a high sparsity problem and exceedingly poor performance of the constructed rule. In the worst case, the rule cannot be constructed at all. To overcome these shortcomings, this study proposes new strategies to extract adequate variables that contribute to optimum performance of the rule. Combinations of two extraction techniques are introduced, namely 2PCA and PCA+MCA, with new cutpoints of eigenvalue and total variance explained, to determine the adequate extracted variables that lead to the minimum misclassification rate. The outcomes of these extraction techniques are used to construct the smoothed location models, which then produce two new classification approaches called 2PCALM and 2DLM. Numerical evidence from simulation studies shows no significant difference in misclassification rate between the extraction techniques for normal and non-normal data. Nevertheless, both proposed approaches are slightly affected by non-normal data and severely affected by highly overlapping groups. Investigations on some real data sets show that the two approaches are competitive with, and better than, other existing classification methods. The overall findings reveal that both proposed approaches can be considered an improvement to the location model and alternatives to other classification methods, particularly in handling mixed variables with a large number of binary variables
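
The sparsity problem described in this abstract can be illustrated with a short sketch. It is not the authors' method: the helper names `cell_index` and `cell_occupancy` and the toy data are hypothetical. It only shows how b binary variables generate 2**b multinomial cells and how quickly cells go empty when the sample is small.

```python
from collections import Counter

def cell_index(binary_values):
    """Map a tuple of 0/1 values to a cell number in 0 .. 2**b - 1."""
    index = 0
    for bit in binary_values:
        index = index * 2 + bit
    return index

def cell_occupancy(samples, n_binary):
    """Return (objects per occupied cell, number of empty cells)."""
    counts = Counter(cell_index(s) for s in samples)
    n_cells = 2 ** n_binary
    empty = n_cells - len(counts)
    return counts, empty

# Hypothetical toy data: 6 objects, 3 binary variables, hence 8 cells.
samples = [(0, 0, 0), (0, 0, 1), (0, 0, 1), (1, 1, 0), (1, 1, 0), (1, 1, 1)]
counts, empty = cell_occupancy(samples, n_binary=3)
print(dict(counts))  # objects per occupied cell
print(empty)         # cells with no objects at all
```

With these six objects, half of the eight cells receive no objects, so no cell-specific parameters could be estimated for them without smoothing or variable extraction.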

    A multiple-filter-multiple-wrapper approach to gene selection and microarray data classification

    Filters and wrappers are two prevailing approaches for gene selection in microarray data analysis. Filters make use of statistical properties of each gene to represent its discriminating power between different classes. The computation is fast but the predictions are inaccurate. Wrappers make use of a chosen classifier to select genes by maximizing classification accuracy, but the computational burden is formidable. Filters and wrappers have been combined in previous studies to maximize the classification accuracy for a chosen classifier with respect to a filtered set of genes. The drawback of this single-filter-single-wrapper (SFSW) approach is that the classification accuracy depends on the choice of the specific filter and wrapper. In this paper, a multiple-filter-multiple-wrapper (MFMW) approach is proposed that makes use of multiple filters and multiple wrappers to improve the accuracy and robustness of the classification, and to identify potential biomarker genes. Experiments based on six benchmark data sets show that the MFMW approach outperforms SFSW models (generated by all combinations of filters and wrappers used in the corresponding MFMW model) in all cases and for all six data sets. Some of the MFMW-selected genes have been confirmed by other studies to be biomarkers or to contribute to the development of particular cancers. © 2006 IEEE.
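
As a rough illustration of the single-filter-single-wrapper baseline that this abstract contrasts with MFMW, the sketch below first ranks genes with a filter and then runs a greedy wrapper over the filtered set. The choices here are assumptions, not taken from the paper: a signal-to-noise filter score, a nearest-centroid classifier scored by leave-one-out accuracy, and the hypothetical names `snr_score`, `loo_accuracy`, and `filter_then_wrap`.

```python
from statistics import mean, pstdev

def snr_score(values, labels):
    """Filter score: |mean0 - mean1| / (sd0 + sd1) for one gene."""
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    return abs(mean(a) - mean(b)) / (pstdev(a) + pstdev(b) + 1e-12)

def loo_accuracy(X, labels, genes):
    """Wrapper criterion: leave-one-out accuracy of nearest centroid."""
    correct = 0
    for i in range(len(X)):
        best_label, best_dist = None, float("inf")
        for c in sorted(set(labels)):
            rows = [X[j] for j in range(len(X)) if j != i and labels[j] == c]
            centroid = [mean(r[g] for r in rows) for g in genes]
            dist = sum((X[i][g] - m) ** 2 for g, m in zip(genes, centroid))
            if dist < best_dist:
                best_label, best_dist = c, dist
        correct += best_label == labels[i]
    return correct / len(X)

def filter_then_wrap(X, labels, n_filtered):
    """Keep the top n_filtered genes by filter score, then add genes
    greedily while the wrapper accuracy strictly improves."""
    n_genes = len(X[0])
    scores = [snr_score([row[g] for row in X], labels) for g in range(n_genes)]
    candidates = sorted(range(n_genes), key=lambda g: -scores[g])[:n_filtered]
    selected, best = [], 0.0
    improved = True
    while improved:
        improved = False
        for g in candidates:
            if g in selected:
                continue
            acc = loo_accuracy(X, labels, selected + [g])
            if acc > best:
                best, best_gene, improved = acc, g, True
        if improved:
            selected.append(best_gene)
    return selected, best

# Hypothetical toy expression matrix: 6 samples x 4 genes, two classes.
X = [
    [1.0, 5.0, 3.0, 2.0],
    [1.2, 4.0, 3.1, 2.5],
    [0.9, 6.0, 2.9, 1.5],
    [3.0, 5.5, 3.0, 2.2],
    [3.1, 4.5, 3.2, 1.8],
    [2.9, 5.0, 2.8, 2.4],
]
labels = [0, 0, 0, 1, 1, 1]
selected, best = filter_then_wrap(X, labels, n_filtered=2)
print(selected, best)
```

In this toy matrix only gene 0 separates the two classes, so the wrapper stops after selecting it. An MFMW-style extension would repeat the filtering with several scores and the wrapping with several classifiers, then combine the results; how the paper combines them is not specified in the abstract.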

    Bioinformatics framework for genotyping microarray data analysis

    Functional genomics is a flourishing science enabled by recent technological breakthroughs in high-throughput instrumentation and microarray data analysis. Genotyping microarrays establish the genotypes of DNA sequences containing single nucleotide polymorphisms (SNPs), and can help biologists probe the functions of different genes and/or construct complex gene interaction networks. The enormous amount of data from these experiments makes it infeasible to obtain accurate and reliable results by manual processing in daily routines. Advanced algorithms as well as an integrated software toolkit are needed to perform reliable and fast data analysis. The author developed a Matlab™-based software package, called TIMDA (a Toolkit for Integrated Genotyping Microarray Data Analysis), for fully automatic, accurate and reliable genotyping microarray data analysis. The author also developed new algorithms for image processing and genotype-calling. The modular design of TIMDA allows satisfactory extensibility and maintainability. TIMDA is open source (URL: http://timda.SF.net) and can be easily customized by users to meet their particular needs. The quality and reproducibility of results in image processing and genotype-calling and the ease of customization indicate that TIMDA is a useful package for genomics research

    Effective Prostate Cancer Detection using Enhanced Particle Swarm Optimization Algorithm with Random Forest on the Microarray Data

    Prostate Cancer (PC) is the leading cause of mortality among males; therefore, an effective system is required for identifying sensitive biomarkers for early recognition. The objective of the research is to find potential biomarkers for characterizing the different types of PC. In this article, the PC-related genes are acquired from the Gene Expression Omnibus (GEO) database. Then, gene selection is carried out using an enhanced Particle Swarm Optimization (PSO) algorithm to select the active genes related to PC. In the enhanced PSO algorithm, the interval-Newton approach is included to keep the search space adaptive by varying the swarm diversity, which helps the local search. The selected active genes are fed to a random forest classifier for the classification of PC (high and low risk). In the experimental investigation, the proposed model achieved an overall classification accuracy of 96.71%, which is better than traditional models such as naïve Bayes, support vector machine and neural network

    Comparing Prediction Accuracy for Supervised Techniques in Gene Expression Data

    Classification is one of the most important tasks in different applications such as text categorization, tone recognition, image classification, microarray gene expression, protein structure prediction, data classification, etc. Microarray-based gene expression profiling has emerged as an efficient technique for cancer classification, as well as for diagnosis, prognosis, and treatment purposes. The classification of different tumor types is of great significance in cancer diagnosis and drug innovation. One challenging area in the study of gene expression data is the classification of different types of tumors into the correct classes. Diagonal discriminant analysis, regularized discriminant analysis, support vector machines and k-nearest neighbor have been suggested as among the best methods for small sample size situations. The methods are applied to datasets from four recently published cancer gene expression studies. The four publicly available microarray data sets are Leukemia, Lymphoma, SRBCT and Prostate. The performance of the classification techniques has been evaluated according to the percentage of misclassification through hold-out cross validation

    Comparing Prediction Accuracy for Machine Learning and Other Classical Approaches in Gene Expression Data

    Microarray-based gene expression profiling has emerged as an efficient technique for cancer classification, as well as for diagnosis, prognosis, and treatment purposes. The classification of different tumor types is of great significance in cancer diagnosis and drug innovation. Using a large number of genes to classify samples based on a small number of microarrays remains a difficult problem. Feature selection techniques can be used to extract the marker genes that influence the classification accuracy, effectively eliminating the unwanted noisy and redundant genes. Quite a number of methods have been proposed in recent years with promising results, but many issues still need to be addressed and understood. Diagonal discriminant analysis, regularized discriminant analysis, support vector machines and k-nearest neighbor have been suggested as among the best methods for small sample size situations. In this paper, we have compared the performance of different discrimination methods for the classification of tumors based on gene expression data. The methods are applied to datasets from four recently published cancer gene expression studies. The performance of the classification techniques has been evaluated for varying numbers of selected features in terms of misclassification rate using hold-out cross validation. Our study shows that KNN, RDA and SVM with a linear kernel have lower misclassification rates than the other algorithms. Keywords: microarray, gene expression, KNN, DLDA, RDA, SVM
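
The evaluation protocol named in this abstract, a misclassification rate on a held-out split, can be sketched as follows for a k-nearest-neighbor classifier. This is a minimal sketch under stated assumptions: the split fraction, the seed, the value of k, the toy data, and the function names `knn_predict` and `holdout_misclassification` are all hypothetical and do not come from the paper.

```python
import random
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    """Predict the majority label among the k nearest training points
    (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), y)
        for row, y in zip(train_X, train_y)
    )
    votes = Counter(y for _, y in dists[:k])
    return votes.most_common(1)[0][0]

def holdout_misclassification(X, y, test_fraction=0.3, k=3, seed=0):
    """Shuffle once, hold out a fraction for testing, and return the
    fraction of held-out samples that are misclassified."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(round(test_fraction * len(X))))
    test, train = idx[:n_test], idx[n_test:]
    train_X = [X[i] for i in train]
    train_y = [y[i] for i in train]
    errors = sum(knn_predict(train_X, train_y, X[i], k) != y[i] for i in test)
    return errors / n_test

# Hypothetical toy data: two well-separated groups of five samples each.
X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2], [0.0, 0.4],
     [5.0, 5.1], [5.2, 4.9], [4.8, 5.3], [5.1, 5.0], [4.9, 4.7]]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
rate = holdout_misclassification(X, y)
print(rate)
```

A single hold-out split gives a noisy estimate on small microarray datasets, so in practice the split is usually repeated with different seeds and the rates averaged; whether the paper does so is not stated in the abstract.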

    Blood transcriptomics of drug-naïve sporadic Parkinson's disease patients

    BACKGROUND: Parkinson's disease (PD) is a chronic progressive neurodegenerative disorder that is clinically defined in terms of motor symptoms. These are preceded by prodromal non-motor manifestations that prove the systemic nature of the disease. Identifying genes and pathways altered in living patients provides new information on the diagnosis and pathogenesis of sporadic PD. METHODS: Changes in gene expression in the blood of 40 sporadic PD patients and 20 healthy controls ("Discovery set") were analyzed using the Affymetrix platform. Patients were at the onset of motor symptoms and had not yet started any pharmacological treatment. Data analysis was performed by applying Ranking-Principal Component Analysis, PUMA and Significance Analysis of Microarrays. Functional annotations were assigned using GO, DAVID and GSEA to unveil significantly enriched biological processes in the differentially expressed genes. The expression of selected genes was validated using RT-qPCR on samples from an independent cohort of 12 patients and controls ("Validation set"). RESULTS: Gene expression profiling of blood samples discriminates PD patients from healthy controls and identifies differentially expressed genes in blood. The majority of these are also present in dopaminergic neurons of the Substantia Nigra, the key site of neurodegeneration. Together with neuronal apoptosis, lymphocyte activation and mitochondrial dysfunction, already found in previous analyses of PD blood and post-mortem brains, we unveiled transcriptome changes enriched in biological terms related to epigenetic modifications, including chromatin remodeling and methylation. Candidate transcripts such as CBX5, TCF3, MAN1C1 and DOCK10 were validated by RT-qPCR. CONCLUSIONS: Our data support the use of blood transcriptomics to study neurodegenerative diseases. They identify changes in crucial components of the chromatin remodeling and methylation machineries as early events in sporadic PD, suggesting epigenetics as a target for therapeutic intervention