
    Population structure at different minor allele frequency levels

    Inferring population genetic structure from large-scale genotyping of single-nucleotide polymorphisms or variants is an important technique for studying the history and distribution of extant human populations, and it is also an essential tool for adjusting tests of association. However, the structures inferred depend on the minor allele frequency of the variants, which matters greatly when considering the phenotypic association of rare variants. Using the Genetic Analysis Workshop 18 data set for 142 unrelated individuals, which includes genotypes for many rare variants, we study the following hypothesis: the difference in detected structure is the result of a "scale" effect; that is, rare variants are likely to be shared only locally (smaller scale), while common variants can be spread over longer distances. The effect is similar to what happens in kernel principal component analysis as the bandwidth of the kernel is changed. We show how different structures become evident as we consider rare or common variants.
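
    As a rough illustration of the "scale" analogy, here is a minimal sketch in Python, assuming scikit-learn and using synthetic stand-in genotypes rather than the actual GAW18 files: variants are stratified by minor allele frequency and each panel is embedded with RBF kernel PCA, whose bandwidth (gamma) plays the role of the scale parameter described above.

```python
# Minimal sketch, not the authors' pipeline: synthetic genotypes stand in
# for the GAW18 data (142 unrelated individuals).
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
p = rng.uniform(0.001, 0.5, size=5000)                  # per-variant allele frequencies
G = rng.binomial(2, p, size=(142, 5000)).astype(float)  # 0/1/2 allele counts

# Stratify variants by minor allele frequency (MAF).
freq = G.mean(axis=0) / 2
maf = np.minimum(freq, 1 - freq)
rare, common = G[:, maf < 0.05], G[:, maf >= 0.05]

# Rare and common panels can reveal different structure, much as changing
# the RBF bandwidth changes the scale that kernel PCA is sensitive to.
for name, X in (("rare", rare), ("common", common)):
    coords = KernelPCA(n_components=2, kernel="rbf", gamma=1e-3).fit_transform(X)
    print(name, X.shape[1], "variants ->", coords.shape)
```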

    Multilevel Hierarchical Kernel Spectral Clustering for Real-Life Large Scale Complex Networks

    Kernel spectral clustering (KSC) corresponds to a weighted kernel principal component analysis problem in a constrained optimization framework. The primal formulation leads to an eigendecomposition of a centered Laplacian matrix at the dual level. The dual formulation makes it possible to build a model on a representative subgraph of the large-scale network in the training phase, with the model parameters estimated in the validation stage. The KSC model has a powerful out-of-sample extension property, which provides cluster affiliations for unseen nodes of the big data network. In this paper we exploit the structure of the projections in the eigenspace during the validation stage to automatically determine a set of increasing distance thresholds. We use these distance thresholds in the test phase to obtain multiple levels of hierarchy for the large-scale network. The hierarchical structure in the network is determined in a bottom-up fashion. We empirically show that real-world networks have a multilevel hierarchical organization which cannot be detected efficiently by several state-of-the-art large-scale hierarchical community detection techniques such as the Louvain, OSLOM and Infomap methods. We demonstrate a major advantage of our proposed approach, namely the ability to locate good-quality clusters at both the coarser and finer levels of hierarchy, using internal cluster quality metrics on 7 real-life networks. Comment: PLOS ONE, Vol 9, Issue 6, June 201
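
    The following is a hedged sketch of the bottom-up idea only, not the authors' KSC implementation: it substitutes a plain normalized-Laplacian spectral embedding plus Ward linkage for the weighted kernel PCA machinery, and uses the networkx karate-club graph as a stand-in network. Increasing distance thresholds on the dendrogram expose progressively coarser levels of community structure.

```python
# Sketch under stated assumptions: spectral embedding + hierarchical
# linkage as a stand-in for kernel spectral clustering (KSC).
import networkx as nx
from scipy.sparse.linalg import eigsh
from scipy.cluster.hierarchy import linkage, fcluster

G = nx.karate_club_graph()                      # stand-in "large" network
L = nx.normalized_laplacian_matrix(G).astype(float)

# The smallest eigenvectors of the normalized Laplacian embed the nodes.
_, vecs = eigsh(L, k=4, which="SM")
emb = vecs[:, 1:]                               # drop the trivial eigenvector

# Increasing distance thresholds yield multiple levels of hierarchy,
# from fine clusters up to coarse ones, in a bottom-up fashion.
Z = linkage(emb, method="ward")
for t in (0.5, 1.0, 2.0):
    labels = fcluster(Z, t=t, criterion="distance")
    print(f"threshold {t}: {labels.max()} clusters")
```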

    Cleaning foregrounds from single-dish 21 cm intensity maps with Kernel principal component analysis

    The high dynamic range between contaminating foreground emission and the fluctuating 21 cm brightness temperature field is one of the most problematic characteristics of 21 cm intensity mapping data. While these components would ordinarily have distinctive frequency spectra, making it relatively easy to separate them, instrumental effects and calibration errors further complicate matters by modulating and mixing them together. A popular class of foreground cleaning methods comprises unsupervised techniques related to principal component analysis (PCA), which exploit the different shapes and amplitudes of each component's contribution to the covariance of the data in order to segregate the signals. These methods have been shown to be effective at removing foregrounds, while also unavoidably filtering out some of the 21 cm signal. In this paper we examine, for the first time in the context of 21 cm intensity mapping, a generalized method called Kernel PCA, which instead operates on the covariance of non-linear transformations of the data. This allows more flexible functional bases to be constructed, in principle allowing a cleaner separation between the foregrounds and the 21 cm signal to be found. We show that Kernel PCA is effective when applied to simulated single-dish (auto-correlation) 21 cm data under a variety of assumptions about foreground models, instrumental effects, etc. It exhibits a different set of behaviours from PCA, e.g. in terms of sensitivity to the data resolution and smoothing scale, and outperforms it on intermediate to large scales in most scenarios.
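
    As a purely illustrative sketch, assuming scikit-learn and a toy data cube of shape (n_freq, n_pix) with smooth power-law foregrounds, the snippet below treats each pixel's frequency spectrum as a sample, fits Kernel PCA with an RBF kernel, and subtracts the pre-image reconstruction of the leading kernel components as the foreground estimate; all amplitudes and the bandwidth are invented for the toy data, not taken from the paper.

```python
# Toy sketch only: synthetic foregrounds + faint signal, cleaned by
# reconstructing the dominant kernel-PCA modes and subtracting them.
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(1)
nu = np.linspace(0.9, 1.1, 64)[:, None]                   # 64 frequency channels
fg = 50 * nu ** -2.7 * rng.lognormal(0, 0.3, (1, 1024))   # smooth foregrounds
signal = 1e-3 * rng.normal(size=(64, 1024))               # faint 21 cm field
maps = fg + signal

# Samples are pixel spectra; gamma is tuned to the toy data's distance scale.
kpca = KernelPCA(n_components=3, kernel="rbf", gamma=1e-5,
                 fit_inverse_transform=True)
fg_est = kpca.inverse_transform(kpca.fit_transform(maps.T)).T
cleaned = maps - fg_est
print("residual rms:", cleaned.std(), " signal rms:", signal.std())
```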

    Kernel Multivariate Analysis Framework for Supervised Subspace Learning: A Tutorial on Linear and Kernel Multivariate Methods

    Feature extraction and dimensionality reduction are important tasks in many fields of science dealing with signal processing and analysis. The relevance of these techniques is increasing as current sensory devices are developed with ever higher resolution, and problems involving multimodal data sources become more common. A plethora of feature extraction methods are available in the literature, collectively grouped under the field of Multivariate Analysis (MVA). This paper provides a uniform treatment of several methods: Principal Component Analysis (PCA), Partial Least Squares (PLS), Canonical Correlation Analysis (CCA) and Orthonormalized PLS (OPLS), as well as their non-linear extensions derived by means of the theory of reproducing kernel Hilbert spaces. We also review their connections to other methods for classification and statistical dependence estimation, and introduce some recent developments to deal with the extreme cases of large-scale and small-sample problems. To illustrate the wide applicability of these methods in both classification and regression problems, we analyze their performance on a benchmark of publicly available data sets, and pay special attention to specific real applications involving audio processing for music genre prediction and hyperspectral satellite imagery for Earth and climate monitoring.
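
    A minimal sketch of the linear MVA family surveyed here, using scikit-learn stand-ins on synthetic data (OPLS has no scikit-learn implementation and is omitted); the kernel variants arise by replacing inner products with a kernel matrix, as KernelPCA does below.

```python
# Sketch with assumed toy data: the same X is reduced by unsupervised PCA,
# supervised PLS/CCA (which also use Y), and non-linear kernel PCA.
import numpy as np
from sklearn.decomposition import PCA, KernelPCA
from sklearn.cross_decomposition import PLSRegression, CCA

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))                              # input features
Y = X[:, :2] @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(200, 3))

Xp = PCA(n_components=2).fit_transform(X)                   # max variance in X
Xs, Ys = PLSRegression(n_components=2).fit_transform(X, Y)  # max covariance
Xc, Yc = CCA(n_components=2).fit_transform(X, Y)            # max correlation
Xk = KernelPCA(n_components=2, kernel="rbf").fit_transform(X)  # RKHS extension
print(Xp.shape, Xs.shape, Xc.shape, Xk.shape)
```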

    Revisiting Kernelized Locality-Sensitive Hashing for Improved Large-Scale Image Retrieval

    We present a simple but powerful reinterpretation of kernelized locality-sensitive hashing (KLSH), a general and popular method developed in the vision community for performing approximate nearest-neighbor searches in an arbitrary reproducing kernel Hilbert space (RKHS). Our new perspective is based on viewing the steps of the KLSH algorithm in an appropriately projected space, and has several key theoretical and practical benefits. First, it eliminates the problematic conceptual difficulties present in the existing motivation of KLSH. Second, it yields the first formal retrieval performance bounds for KLSH. Third, our analysis reveals two techniques for boosting the empirical performance of KLSH. We evaluate these extensions on several large-scale benchmark image retrieval data sets, and show that our analysis leads to improved recall performance of at least 12%, and sometimes much higher, over the standard KLSH method.
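
    A sketch of the projected-space view, with invented details rather than the paper's reference code: data are mapped into the span of a few sampled anchor points via a Nystrom-style kernel map, after which classical random-hyperplane LSH is applied in that explicit space.

```python
# Hedged sketch: Nystrom projection + sign random projections as a
# stand-in for kernelized LSH; anchors, gamma and bit count are arbitrary.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(3)
GAMMA = 0.01
X = rng.normal(size=(1000, 32))                  # database items
anchors = X[rng.choice(len(X), 64, replace=False)]

# Nystrom map: phi(x) = K^{-1/2} k(anchors, x).
K = rbf_kernel(anchors, anchors, gamma=GAMMA)
vals, vecs = np.linalg.eigh(K + 1e-8 * np.eye(64))
W = vecs @ np.diag(vals ** -0.5) @ vecs.T        # K^{-1/2}

def phi(x):
    return rbf_kernel(np.atleast_2d(x), anchors, gamma=GAMMA) @ W

# Random-hyperplane hashing in the projected space: 16 sign bits per item.
H = rng.normal(size=(64, 16))
codes = phi(X) @ H > 0
qcode = phi(rng.normal(size=32)) @ H > 0
hamming = (codes != qcode).sum(axis=1)           # distance to the query
print("nearest by hash:", np.argsort(hamming)[:5])
```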