Statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic study for complex diseases

Author: Kelemen Arpad
Liang Yulan
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 28/03/2008

Recent advances of information technology in biomedical sciences and other applied areas have created numerous large, diverse data sets with a high dimensional feature space, which provide a tremendous amount of information and new opportunities for improving the quality of human life. Meanwhile, the continuous arrival of new data also creates great challenges, requiring researchers to convert these raw data into scientific knowledge in order to benefit from them. Association studies of complex diseases using SNP data have become increasingly popular in biomedical research in recent years. In this paper, we present a review of recent statistical advances and challenges for analyzing correlated high dimensional SNP data in genomic association studies for complex diseases. The review includes both general feature reduction approaches for high dimensional correlated data and more specific approaches for SNP data, which include unsupervised haplotype mapping, tag SNP selection, and supervised SNP selection using statistical testing/scoring, statistical modeling and machine learning methods, with an emphasis on how to identify interacting loci. Comment: Published at http://dx.doi.org/10.1214/07-SS026 in Statistics Surveys (http://www.i-journals.org/ss/) by the Institute of Mathematical Statistics (http://www.imstat.org)

arXiv.org e-Print Archive

Crossref

Statistical Workflow for Feature Selection in Human Metabolomics Data.

Author: Antonelli Joseph
Cheng Susan
Claggett Brian L
Demler Olga V
Deng Katherine
Henglin Mir
Hushcha Pavel V
Jain Mohit
Kim Andy
Kim Nicole
Lagerborg Kim A
Mora Samia
Niiranen Teemu J
Ovsak Gavin
Pereira Alexandre C
Rao Kevin
Tyagi Octavia
Watrous Jeramie D
Publication venue: eScholarship, University of California
Publication date: 01/07/2019

High-throughput metabolomics investigations, when conducted in large human cohorts, represent a potentially powerful tool for elucidating the biochemical diversity underlying human health and disease. Large-scale metabolomics data sources, generated using either targeted or nontargeted platforms, are becoming more common. Appropriate statistical analysis of these complex high-dimensional data will be critical for extracting meaningful results from such large-scale human metabolomics studies. Therefore, we consider the statistical analytical approaches that have been employed in prior human metabolomics studies. Based on the lessons learned and collective experience to date in the field, we offer a step-by-step framework for pursuing statistical analyses of cohort-based human metabolomics data, with a focus on feature selection. We discuss the range of options and approaches that may be employed at each stage of data management, analysis, and interpretation, and offer guidance on the analytical decisions that need to be considered over the course of implementing a data analysis workflow. Certain pervasive analytical challenges facing the field warrant ongoing focused research. Addressing these challenges, particularly those related to analyzing human metabolomics data, will allow for more standardization of, as well as advances in, how research in the field is practiced. In turn, such major analytical advances will lead to substantial improvements in the overall contributions of human metabolomics investigations.

eScholarship - University of California

Higher order feature extraction and selection for robust human gesture recognition using CSI of COTS Wi-Fi devices

Author: Ahmad Hafisoh
Ahmed Hasmath Farhana
Harkat Houda
Narasingamurthi Kulasekharan
Phang Swee King
Vaithilingam Chockalingam
Publication venue: 'MDPI AG'
Publication date: 04/07/2019

Device-free human gesture recognition (HGR) using commercial off-the-shelf (COTS) Wi-Fi devices has gained attention with recent advances in wireless technology. HGR recognizes the human activity performed by capturing the reflections of Wi-Fi signals from moving humans and storing them as raw channel state information (CSI) traces. Existing work on HGR applies noise reduction and transformation to pre-process the raw CSI traces. However, these methods fail to capture the non-Gaussian information in the raw CSI data because they are limited to linear signal representations. The proposed higher order statistics-based recognition (HOS-Re) model extracts higher order statistical (HOS) features from raw CSI traces and selects a robust feature subset for the recognition task. HOS-Re addresses the limitations of the existing methods by extracting third order cumulant features that maximize the recognition accuracy. Subsequently, feature selection methods derived from information theory construct a robust and highly informative feature subset, which is fed as input to a multilevel support vector machine (SVM) classifier in order to measure the performance. The proposed methodology is validated on the public database SignFi, consisting of 276 gestures with 8280 gesture instances, of which 5520 are from the laboratory and 2760 from the home environment, using a 10 × 5 cross-validation. HOS-Re achieved an average recognition accuracy of 97.84%, 98.26% and 96.34% for the lab, home and lab + home environments, respectively. The average recognition accuracy for 150 sign gestures with 7500 instances, collected from five different users, was 96.23% in the laboratory environment. Funding: Taylor's University through its TAYLOR'S PhD SCHOLARSHIP Programme.
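The abstract above does not say how its third order cumulant features are computed, so the following is only a minimal sketch of the general idea. It assumes a real-valued one-dimensional trace, non-negative lags, and the biased sample estimator; the function name `third_order_cumulant` and the toy signals are hypothetical and are not taken from the paper.

```python
def third_order_cumulant(x, tau1, tau2):
    """Biased sample estimate of the third-order cumulant C3(tau1, tau2).

    After the mean is removed, the third-order cumulant equals the
    third-order moment E[x(t) * x(t + tau1) * x(t + tau2)].
    Assumption of this sketch: both lags are non-negative.
    """
    if tau1 < 0 or tau2 < 0:
        raise ValueError("this sketch only handles non-negative lags")
    n = len(x)
    span = n - max(tau1, tau2)            # number of usable time indices
    if span <= 0:
        raise ValueError("lags exceed signal length")
    mean = sum(x) / n
    xc = [v - mean for v in x]            # mean-removed signal
    total = sum(xc[t] * xc[t + tau1] * xc[t + tau2] for t in range(span))
    return total / n                      # biased estimator: divide by n


# Toy signals (hypothetical): one symmetric, one skewed amplitude distribution.
symmetric = [1.0, -1.0, 1.0, -1.0]
skewed = [2.0, -1.0, -1.0]
print(third_order_cumulant(symmetric, 0, 0))  # 0.0
print(third_order_cumulant(skewed, 0, 0))     # 2.0
```

The zero-lag value vanishes for the symmetric signal and is non-zero for the skewed one, which illustrates the kind of asymmetry that second-order (variance-based) statistics cannot see.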

Multidisciplinary Digital Publishing Institute

Sapientia

Interpretable Deep Learning Methods for Multiview Learning

Author: Lu Han
Safo Sandra E
Sun Ju
Wang Hengkang
Publication venue
Publication date: 15/02/2023

Technological advances have enabled the generation of unique and complementary types of data or views (e.g. genomics, proteomics, metabolomics) and opened up a new era in multiview learning research with the potential to lead to new biomedical discoveries. We propose iDeepViewLearn (Interpretable Deep Learning Method for Multiview Learning) for learning nonlinear relationships in data from multiple views while achieving feature selection. iDeepViewLearn combines deep learning flexibility with the statistical benefits of data- and knowledge-driven feature selection, giving interpretable results. Deep neural networks are used to learn view-independent low-dimensional embedding through an optimization problem that minimizes the difference between observed and reconstructed data, while imposing a regularization penalty on the reconstructed data. The normalized Laplacian of a graph is used to model bilateral relationships between variables in each view, thereby encouraging selection of related variables. iDeepViewLearn is tested on simulated data and two real-world datasets, including breast cancer-related gene expression and methylation data. iDeepViewLearn had competitive classification results and identified genes and CpG sites that differentiated between individuals who died from breast cancer and those who did not. The results of our real data application and simulations with small to moderate sample sizes suggest that iDeepViewLearn may be a useful method for small-sample-size problems compared to other deep learning methods for multiview learning.
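As a rough illustration of how a normalized graph Laplacian can encourage related variables to be selected together, the sketch below builds L = I - D^(-1/2) A D^(-1/2) for a small graph and evaluates the quadratic form w^T L w. The abstract does not give the exact penalty used by iDeepViewLearn, so the quadratic form, the function names, and the three-node path graph are all assumptions made for this example.

```python
import math


def normalized_laplacian(adj):
    """Return L = I - D^(-1/2) A D^(-1/2) for a symmetric adjacency matrix.

    Isolated nodes (degree 0) get an all-zero row and column, a common
    convention that is assumed here.
    """
    n = len(adj)
    deg = [sum(row) for row in adj]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in deg]
    lap = []
    for i in range(n):
        row = []
        for j in range(n):
            identity = 1.0 if (i == j and deg[i] > 0) else 0.0
            row.append(identity - inv_sqrt[i] * adj[i][j] * inv_sqrt[j])
        lap.append(row)
    return lap


def laplacian_penalty(weights, lap):
    """Quadratic form w^T L w; small when connected variables have
    similar degree-scaled weights."""
    n = len(weights)
    return sum(weights[i] * lap[i][j] * weights[j]
               for i in range(n) for j in range(n))


# Hypothetical path graph: variable 0 - variable 1 - variable 2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
lap = normalized_laplacian(adj)
smooth = [1.0, math.sqrt(2.0), 1.0]   # proportional to sqrt(degree)
rough = [1.0, 0.0, -1.0]              # neighbours disagree
print(laplacian_penalty(smooth, lap))
print(laplacian_penalty(rough, lap))
```

Weights that vary smoothly over the graph incur almost no penalty, while weights that differ sharply between connected variables are penalized, which is the mechanism by which such a term favours selecting connected variables jointly.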

arXiv.org e-Print Archive

Comparison of Statistical Testing and Predictive Analysis Methods for Feature Selection in Zero-inflated Microbiome Data

Author: Wang Xiaoying
Publication venue: 'University of Saskatchewan Library'
Publication date: 09/04/2019

Background: Recent advances in next-generation sequencing (NGS) technology enable researchers to collect a large volume of microbiome data. Microbiome data consist of operational taxonomic unit (OTU) count data characterized by zero-inflation, over-dispersion, and grouping structure among the samples. Currently, statistical testing methods based on generalized linear mixed effect models (GLMM) are commonly performed to identify OTUs that are associated with a phenotype such as human diseases or plant traits. Statistical testing methods have a number of limitations, including these two: (1) the validity of p-values/q-values depends sensitively on the correctness of models, and (2) statistical significance does not necessarily imply predictivity. Statistical testing methods depend on model correctness and attempt to select “marginally relevant” features, not the most predictive ones. Predictive analysis using methods such as LASSO is an alternative approach for feature selection. To the best of our knowledge, this approach has not been used widely for analyzing microbiome data. Methodology: We use four synthetic datasets simulated from a zero-inflated negative binomial distribution and a real human gut microbiome dataset to compare the feature selection performance of LASSO with the likelihood ratio test methods applied to GLMMs. We also investigate the performance of cross-validation in estimating the out-of-sample predictivity of selected features in zero-inflated data. Results: Our studies with synthetic datasets show that the feature selection performance of LASSO is remarkably good in zero-inflated data and is comparable with the likelihood ratio test applied to the true data generating model. The feature selection performance of LASSO is better when the distributions of counts are more differentiated by the phenotype, which is a categorical variable in our synthetic datasets.
In addition, we performed LOOCV on the training set and out-of-sample prediction on the test set. The cross-validatory (CV) predictive measures are very close to the out-of-sample predictivity measures. This indicates that LOOCV predictive metrics provide honest measures of the predictivity of the features selected by LASSO. Therefore, the CV predictive measures are good guidance for choosing cutoffs (shrinkage parameter $\lambda$) in selecting features with LASSO. By contrast, when wrong models are fitted to a dataset, the differences between the q-values and the actual false discovery rates are huge; hence, their q-values are tremendously misleading for selecting features. Our comparison of LASSO and statistical testing methods (likelihood ratio test in our analysis) in the real dataset shows that small q-values do not necessarily imply high predictivity of the selected OTUs. However, researchers often use q-values to find predictors, which is why q-values need to be examined carefully. Conclusions: Statistical testing methods perform well in zero-inflated datasets on both synthetic and real data. However, serious model checking should be conducted before q-values are used to choose features. Predictive analysis with LASSO is recommended to supplement q-values for selecting features and for measuring the predictivity of selected features.
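To make the role of the shrinkage parameter concrete, here is a minimal sketch of LASSO fitted by coordinate descent on a tiny, hand-made design. It is not the thesis's pipeline: it uses a plain Gaussian (squared-error) loss with no intercept rather than a model for zero-inflated counts, and the function names and toy data are assumptions made for this example. In practice the shrinkage parameter would be chosen by cross-validation, as the abstract recommends, rather than fixed by hand.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator: shrink z toward zero by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/(2n)) * ||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # an all-zero column can never be selected
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * beta[k]
                                      for k in range(p) if k != j))
                for i in range(n)
            ) / n
            beta[j] = soft_threshold(rho, lam) / col_sq[j]
    return beta


# Hypothetical toy data: y depends only on the first of two orthogonal features.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0, 2.0, -2.0, -2.0]

for lam in (0.5, 3.0):
    beta = lasso_coordinate_descent(X, y, lam)
    selected = [j for j, b in enumerate(beta) if b != 0.0]
    print(lam, beta, selected)
```

With the smaller penalty only the informative feature keeps a non-zero coefficient, and with the larger penalty every coefficient is shrunk to zero, so the choice of penalty directly determines which features are selected.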

eCommons@USASK

University of Saskatchewan Research Archive

Machine Learning Approaches for Biomarker Discovery Using Gene Expression Data

Author: Goksøyr Anders
Jonassen Inge
Zhang Xiaokang
Publication venue: 'Exon Publications'
Publication date: 01/01/2021

Biomarkers are of great importance in many fields, such as cancer research, toxicology, and the diagnosis and treatment of diseases, and for better understanding biological response mechanisms to internal or external intervention. High-throughput gene expression profiling technologies, such as DNA microarrays and RNA sequencing, provide large gene expression data sets which enable data-driven biomarker discovery. Traditional statistical tests have been the mainstream approach for identifying differentially expressed genes as biomarkers. In recent years, machine learning techniques such as feature selection have gained more popularity. Given many options, picking the most appropriate method for a particular dataset becomes essential. Different evaluation metrics have therefore been proposed. Because methods are evaluated on different aspects, a method's varied performance across different datasets leads to the idea of integrating multiple methods. Many integration strategies have been proposed and have shown great potential. This chapter gives an overview of the current research advances and existing issues in biomarker discovery using machine learning approaches on gene expression data.

University of Bergen

NORA - Norwegian Open Research Archives

High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso

Author: Bach F.
Cortes C.
Cover T. M.
Eric P. Xing
Fukumizu K.
Leonid Sigal
Li F.
Liu H.
Makoto Yamada
Masaeli M.
Masashi Sugiyama
Nocedal J.
Raskutti G.
Rodriguez-Lujan I.
Schölkopf B.
Seeger M.
Song L.
Tibshirani R.
Tomioka R.
Wittawat Jitkrittum
Xing E. P.
Zhao Z.
Publication venue: 'MIT Press - Journals'
Publication date: 03/01/2019

The goal of supervised feature selection is to find a subset of input features that are responsible for predicting output values. The least absolute shrinkage and selection operator (Lasso) allows computationally efficient feature selection based on linear dependency between input features and output values. In this paper, we consider a feature-wise kernelized Lasso for capturing non-linear input-output dependency. We first show that, with particular choices of kernel functions, non-redundant features with strong statistical dependence on output values can be found in terms of kernel-based independence measures. We then show that the globally optimal solution can be efficiently computed; this makes the approach scalable to high-dimensional problems. The effectiveness of the proposed method is demonstrated through feature selection experiments with thousands of features. Comment: 18 pages
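The abstract does not state the objective function. One way to write a feature-wise kernelized Lasso of the kind it describes is sketched below; every symbol is notation assumed for this sketch: n samples, d features, K^{(k)} a Gram matrix computed from the k-th feature alone, L a Gram matrix computed from the outputs, and Γ the centering matrix.

```latex
\min_{\alpha \in \mathbb{R}^{d}} \;
  \frac{1}{2} \Bigl\| \bar{L} - \sum_{k=1}^{d} \alpha_k \bar{K}^{(k)} \Bigr\|_{\mathrm{F}}^{2}
  + \lambda \, \|\alpha\|_{1}
\quad \text{subject to} \quad \alpha_1, \dots, \alpha_d \ge 0,
\qquad
\bar{K}^{(k)} = \Gamma K^{(k)} \Gamma, \quad
\bar{L} = \Gamma L \Gamma, \quad
\Gamma = I_n - \tfrac{1}{n} \mathbf{1}_n \mathbf{1}_n^{\top}.
```

Under these assumptions the problem is a non-negative Lasso in the coefficients α, hence convex, which is consistent with the abstract's claim that a globally optimal solution can be computed efficiently. Expanding the squared Frobenius norm gives terms in trace(K̄^(k) L̄) that reward features whose centered Gram matrix aligns with that of the output, and terms in trace(K̄^(k) K̄^(l)) that penalize selecting mutually dependent features, matching the description of non-redundant features with strong dependence on the output.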

arXiv.org e-Print Archive

Crossref