79 research outputs found

    The computational hardness of feature selection in strict-pure synthetic genetic datasets

    A common task in knowledge discovery is finding a few features correlated with an outcome in a sea of mostly irrelevant data. This task is particularly formidable in genetic datasets containing thousands to millions of Single Nucleotide Polymorphisms (SNPs) for each individual; the goal here is to find a small subset of SNPs correlated with whether an individual is sick or healthy (labeled data). Although determining a correlation between any given SNP (genotype) and a disease label (phenotype) is relatively straightforward, detecting subsets of SNPs such that the correlation is only apparent when the whole subset is considered seems to be much harder. In this thesis, we study the computational hardness of this problem, in particular for a widely used method of generating synthetic SNP datasets. More specifically, we consider the feature selection problem in datasets generated by "pure and strict" models, such as ones produced by the popular GAMETES software. In these datasets, there is a high correlation between a predefined target set of features (SNPs) and a label; however, any proper subset of the target set appears uncorrelated with the outcome. Our main result is a (linear-time, parameter-preserving) reduction from the well-known Learning Parity with Noise (LPN) problem to feature selection in such pure and strict datasets. This gives us a host of consequences for the complexity of feature selection in this setting. First, not only is it NP-hard (even to approximate), it is also computationally hard on average under a standard cryptographic assumption on the hardness of learning parity with noise; moreover, in general it is as hard for the uniform distribution as for arbitrary distributions, and as hard for random noise as for adversarial noise.
    For the worst-case complexity, we get a tighter parameterized lower bound: even in the non-noisy case, finding a parity of Hamming weight at most k is W[1]-hard when the number of samples is relatively small (logarithmic in the number of features). Finally, and most relevant to the development of feature selection heuristics, by the unconditional hardness of LPN in Kearns' statistical query model, no heuristic that only computes statistics about the samples, rather than considering the samples themselves, can successfully perform feature selection in such pure and strict datasets. This eliminates a large class of common approaches to feature selection.
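The parity structure described in this abstract can be illustrated with a small sketch. It is not the thesis's reduction and not GAMETES output: it assumes binary features (real SNPs usually take three values), and the helper names `make_parity_dataset`, `agreement`, and `brute_force_parity` are hypothetical. The label is the parity of a hidden target set, flipped with some noise probability, so each single feature looks uncorrelated with the label while the full target set predicts it well.

```python
import itertools
import random


def make_parity_dataset(n_samples, n_features, target, noise=0.0, seed=0):
    """Uniform random binary features; label = parity of the target features,
    flipped independently with probability `noise` (an LPN-style toy setup)."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_samples):
        row = [rng.randint(0, 1) for _ in range(n_features)]
        label = sum(row[i] for i in target) % 2
        if rng.random() < noise:
            label ^= 1  # noisy label
        X.append(row)
        y.append(label)
    return X, y


def agreement(X, y, subset):
    """Fraction of samples whose label equals the parity of `subset`."""
    hits = sum((sum(row[i] for i in subset) % 2) == label
               for row, label in zip(X, y))
    return hits / len(X)


def brute_force_parity(X, y, k):
    """Try every size-k subset and keep the one whose parity agreement is
    farthest from 0.5. The number of subsets grows as n choose k."""
    n_features = len(X[0])
    return max(itertools.combinations(range(n_features), k),
               key=lambda s: abs(agreement(X, y, s) - 0.5))


if __name__ == "__main__":
    X, y = make_parity_dataset(2000, 8, target=(1, 4, 6), noise=0.1)
    for i in range(8):
        print("feature", i, "agreement", round(agreement(X, y, (i,)), 3))
    print("recovered subset:", brute_force_parity(X, y, 3))
```

On this toy data, the single-feature agreements stay close to 0.5, and only the exhaustive search over all size-3 subsets finds the target set. That search becomes infeasible as the number of features and the subset size grow, which matches the hardness discussed above.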

    Identifying Candidate Genetic Associations with MRI-Derived AD-Related ROI via Tree-Guided Sparse Learning

    Imaging genetics has attracted significant interest in recent studies. Traditional work has focused on mass-univariate statistical approaches that identify important single nucleotide polymorphisms (SNPs) associated with quantitative traits (QTs) of brain structure or function. More recently, to address the problems of multiple comparisons and weak detection, multivariate analysis methods such as the least absolute shrinkage and selection operator (Lasso) are often used to select the most relevant SNPs associated with QTs. However, one problem with Lasso, as well as with many other feature selection methods for imaging genetics, is that useful prior information, e.g., the hierarchical structure among SNPs, is rarely used to design a more powerful model. In this paper, we propose to identify the associations between candidate genetic features (i.e., SNPs) and magnetic resonance imaging (MRI)-derived measures using a tree-guided sparse learning (TGSL) method. The advantage of our method is that it explicitly models the complex hierarchical structure among the SNPs in the objective function for feature selection. Specifically, motivated by biological knowledge, the hierarchical structures involving gene groups and linkage disequilibrium (LD) blocks as well as individual SNPs are imposed as a tree-guided regularization term in our TGSL model. Experimental studies on simulation data and the Alzheimer's Disease Neuroimaging Initiative (ADNI) data show that our method not only achieves better predictions than competing methods on the MRI-derived measures of AD-related regions of interest (ROIs) (i.e., hippocampus, parahippocampal gyrus, and precuneus), but also identifies sparse SNP patterns at the block level to better guide the biological interpretation.
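The abstract does not state the TGSL objective, so the following is only a sketch of one common form of tree-structured group penalty, not the paper's formulation. It assumes that every tree node (gene, LD block, single SNP) contributes a weighted Euclidean norm of the coefficients it covers. The function name `tree_guided_penalty`, the unit weights, and the four-SNP toy tree are all hypothetical.

```python
import math


def tree_guided_penalty(beta, groups):
    """Sum over tree nodes of weight * L2 norm of the coefficients in that node.

    beta   : list of regression coefficients, one per SNP
    groups : list of (weight, indices) pairs, one per node of the tree
    """
    return sum(weight * math.sqrt(sum(beta[i] ** 2 for i in indices))
               for weight, indices in groups)


# Hypothetical tree over 4 SNPs: one gene, two LD blocks, four leaves.
toy_tree = [
    (1.0, [0, 1, 2, 3]),                          # gene level
    (1.0, [0, 1]), (1.0, [2, 3]),                 # LD-block level
    (1.0, [0]), (1.0, [1]), (1.0, [2]), (1.0, [3]),  # individual SNPs
]

if __name__ == "__main__":
    # All signal sits in the first LD block; the second block adds no penalty.
    print(tree_guided_penalty([3.0, 4.0, 0.0, 0.0], toy_tree))
```

Because a block's norm is zero only when all of its coefficients are zero, a penalty of this shape tends to switch whole LD blocks on or off, which is consistent with the block-level sparsity the abstract reports.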

    Probing the genomic landscape of human sexuality: a critical systematic review of the literature

    Whether human sexuality is the result of nature or nurture (or their complex interplay) represents a hot, often ideologically driven, and highly polarized debate with political and social ramifications, and with varying, conflicting findings reported in the literature. A number of heritability and behavioral genetics studies, including pedigree-based investigations, have hypothesized inheritance patterns of human sexual behaviors. On the other hand, in most twin, adoption, and nuclear family studies, it was not possible to disentangle underlying genetic and shared environmental sources. Furthermore, these studies were not able to estimate the precise extent of genetic loading or to shed light on the number and nature of the putative inherited factors, which remained largely unknown. Molecular genetic studies offer an unprecedented opportunity to overcome these drawbacks by dissecting the molecular basis of human sexuality and allowing a better understanding of its biological roots, if any. However, there exists no systematic review of the molecular genetics of human sexuality. Therefore, we undertook this critical systematic review and appraisal of the literature, with the ambitious aims of filling in these gaps of knowledge, especially from the methodological standpoint, and providing guidance to future studies. Sixteen studies were ultimately retained and reviewed in the present systematic review. Seven studies were linkage studies, four studies utilized the candidate gene approach, and five studies were GWAS investigations. Limitations of these studies and implications for further research are discussed.

    Service-oriented discovery of knowledge : foundations, implementations and applications

    In this thesis we investigate how a popular new approach to distributed computing called service orientation can be used within the field of Knowledge Discovery. We critically investigate its principles and present models for developing within this paradigm. We then apply these models to create a web service called Fantom, which mines subgroups in a ranked list of identifiers, based on their score. The subgroups are described using ontologies, to give the scientist a description in a standardized and familiar language. Finally, Fantom is tested on two different data sets from the field of life sciences: one concerning gene data, the other concerning SNP data. LEI Universiteit Leiden Algorithm