404,737 research outputs found

    Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases

    Full text link
    Recent advances of information technology in biomedical sciences and other applied areas have created numerous large diverse data sets with a high dimensional feature space, which provide us a tremendous amount of information and new opportunities for improving the quality of human life. Meanwhile, great challenges are also created driven by the continuous arrival of new data that requires researchers to convert these raw data into scientific knowledge in order to benefit from it. Association studies of complex diseases using SNP data have become more and more popular in biomedical research in recent years. In this paper, we present a review of recent statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic association studies for complex diseases. The review includes both general feature reduction approaches for high dimensional correlated data and more specific approaches for SNPs data, which include unsupervised haplotype mapping, tag SNP selection, and supervised SNPs selection using statistical testing/scoring, statistical modeling and machine learning methods with an emphasis on how to identify interacting loci.Comment: Published in at http://dx.doi.org/10.1214/07-SS026 the Statistics Surveys (http://www.i-journals.org/ss/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Higher order feature extraction and selection for robust human gesture recognition using CSI of COTS Wi-Fi devices

    Get PDF
    Device-free human gesture recognition (HGR) using commercial o the shelf (COTS) Wi-Fi devices has gained attention with recent advances in wireless technology. HGR recognizes the human activity performed, by capturing the reflections ofWi-Fi signals from moving humans and storing them as raw channel state information (CSI) traces. Existing work on HGR applies noise reduction and transformation to pre-process the raw CSI traces. However, these methods fail to capture the non-Gaussian information in the raw CSI data due to its limitation to deal with linear signal representation alone. The proposed higher order statistics-based recognition (HOS-Re) model extracts higher order statistical (HOS) features from raw CSI traces and selects a robust feature subset for the recognition task. HOS-Re addresses the limitations in the existing methods, by extracting third order cumulant features that maximizes the recognition accuracy. Subsequently, feature selection methods derived from information theory construct a robust and highly informative feature subset, fed as input to the multilevel support vector machine (SVM) classifier in order to measure the performance. The proposed methodology is validated using a public database SignFi, consisting of 276 gestures with 8280 gesture instances, out of which 5520 are from the laboratory and 2760 from the home environment using a 10 5 cross-validation. HOS-Re achieved an average recognition accuracy of 97.84%, 98.26% and 96.34% for the lab, home and lab + home environment respectively. The average recognition accuracy for 150 sign gestures with 7500 instances, collected from five di erent users was 96.23% in the laboratory environment.Taylor's University through its TAYLOR'S PhD SCHOLARSHIP Programmeinfo:eu-repo/semantics/publishedVersio

    Interpretable Deep Learning Methods for Multiview Learning

    Full text link
    Technological advances have enabled the generation of unique and complementary types of data or views (e.g. genomics, proteomics, metabolomics) and opened up a new era in multiview learning research with the potential to lead to new biomedical discoveries. We propose iDeepViewLearn (Interpretable Deep Learning Method for Multiview Learning) for learning nonlinear relationships in data from multiple views while achieving feature selection. iDeepViewLearn combines deep learning flexibility with the statistical benefits of data and knowledge-driven feature selection, giving interpretable results. Deep neural networks are used to learn view-independent low-dimensional embedding through an optimization problem that minimizes the difference between observed and reconstructed data, while imposing a regularization penalty on the reconstructed data. The normalized Laplacian of a graph is used to model bilateral relationships between variables in each view, therefore, encouraging selection of related variables. iDeepViewLearn is tested on simulated and two real-world data, including breast cancer-related gene expression and methylation data. iDeepViewLearn had competitive classification results and identified genes and CpG sites that differentiated between individuals who died from breast cancer and those who did not. The results of our real data application and simulations with small to moderate sample sizes suggest that iDeepViewLearn may be a useful method for small-sample-size problems compared to other deep learning methods for multiview learning

    Comparison of Statistical Testing and Predictive Analysis Methods for Feature Selection in Zero-inflated Microbiome Data

    Get PDF
    Background: Recent advances in next-generation sequencing (NGS) technology enable researchers to collect a large volume of microbiome data. Microbiome data consist of operational taxonomic unit (OTU) count data characterized by zero-inflation, over-dispersion, and grouping structure among the sample. Currently, statistical testing methods based on generalized linear mixed effect models (GLMM) are commonly performed to identify OTUs that are associated with a phenotype such as human diseases or plant traits. There are a number of limitations for statistical testing methods including these two: (1) the validity of p-value/q-value depends sensitively on the correctness of models, and (2) the statistical significance does not necessarily imply predictivity. Statistic testing methods depend on model correctness and attempt to select ”marginally relevant” features, not the most predictive ones. Predictive analysis using methods such as LASSO is an alternative approach for feature selection. To the best of our knowledge, this approach has not been used widely for analyzing microbiome data. Methodology: We use four synthetic datasets simulated from zero-inflated negative binomial distribution and a real human gut microbiome data to compare the feature selection performance of LASSO with the likelihood ratio test methods applied to GLMMs. We also investigate the performance of cross-validation in estimating the out-of-sample predictivity of selected features in zero-inflated data. Results: Our studies with synthetic datasets show that the feature selection performance of LASSO is remarkably excellent in zero-inflated data and is comparable with the likelihood ratio test applied to the true data generating model. The feature selection performance of LASSO is better when the distributions of counts are more differentiated by the phenotype, which is a categorical variable in our synthetic datasets. In addition, we performed LOOCV on the train set and out-of-sample prediction on the test set. The performance of the cross-validatory (CV) predictive measures are very close to the out-of-sample predictivity measures. This indicates that LOOCV predictive metrics provide honest measures of the predictivity of the features selected by LASSO. Therefore, the CV predictive measures are good guidance for choosing cutoffs (shrinkage parameter λ\lambda) in selecting features with LASSO. By contrast, when wrong models are fitted to a dataset, the differences between the q-values and the actual false discovery rates are huge; hence, their q-values are tremendously misleading for selecting features. Our comparison of LASSO and statistical testing methods (likelihood ratio test in our analysis) in the real dataset shows that small q-values do not necessarily imply high predictivity of the selected OTUs. However, the researchers often use q-values to find the predictors. That is why we need to look at q-values carefully. Conclusions: Statistical testing methods perform greatly in zero-inflated datasets on both synthetic and real data. However, a serious model checking should be conducted before we use q-values to choose features. Predictive analysis with LASSO is recommended to supplement q-values for selecting features and for measuring the predictivity of selected features

    Machine Learning Approaches for Biomarker Discovery Using Gene Expression Data

    Get PDF
    Biomarkers are of great importance in many fields, such as cancer research, toxicology, diagnosis and treatment of diseases, and to better understand biological response mechanisms to internal or external intervention. High-throughput gene expression profiling technologies, such as DNA microarrays and RNA sequencing, provide large gene expression data sets which enable data-driven biomarker discovery. Traditional statistical tests have been the mainstream for identifying differentially expressed genes as biomarkers. In recent years, machine learning techniques such as feature selection have gained more popularity. Given many options, picking the most appropriate method for a particular data becomes essential. Different evaluation metrics have therefore been proposed. Being evaluated on different aspects, a method’s varied performance across different datasets leads to the idea of integrating multiple methods. Many integration strategies are proposed and have shown great potential. This chapter gives an overview of the current research advances and existing issues in biomarker discovery using machine learning approaches on gene expression data.publishedVersio

    High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso

    Full text link
    The goal of supervised feature selection is to find a subset of input features that are responsible for predicting output values. The least absolute shrinkage and selection operator (Lasso) allows computationally efficient feature selection based on linear dependency between input features and output values. In this paper, we consider a feature-wise kernelized Lasso for capturing non-linear input-output dependency. We first show that, with particular choices of kernel functions, non-redundant features with strong statistical dependence on output values can be found in terms of kernel-based independence measures. We then show that the globally optimal solution can be efficiently computed; this makes the approach scalable to high-dimensional problems. The effectiveness of the proposed method is demonstrated through feature selection experiments with thousands of features.Comment: 18 page
    • …
    corecore