Population structure at different minor allele frequency levels
Inferring population genetic structure from large-scale genotyping of single-nucleotide polymorphisms or variants is an important technique for studying the history and distribution of extant human populations, and it is also an important tool for adjusting tests of association. However, the structures inferred depend on the minor allele frequency of the variants, a key consideration when assessing the phenotypic association of rare variants. Using the Genetic Analysis Workshop 18 data set for 142 unrelated individuals, which includes genotypes for many rare variants, we study the following hypothesis: the difference in detected structure is the result of a "scale" effect; that is, rare variants are likely to be shared only locally (smaller scale), while common variants can spread over longer distances. The effect is similar to that of changing the bandwidth of the kernel in kernel principal component analysis. We show how different structures become evident as we consider rare or common variants.
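The bandwidth analogy above can be made concrete with a minimal sketch of RBF kernel PCA. This is a toy illustration on synthetic data, not the workshop analysis; the function name, data shapes, and bandwidth values are all hypothetical.

```python
import numpy as np

def rbf_kernel_pca(X, sigma, n_components=2):
    """Project X onto the leading kernel principal components
    of an RBF kernel with bandwidth sigma (toy sketch)."""
    # Pairwise squared Euclidean distances
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    K = np.exp(-d2 / (2 * sigma**2))
    # Double-center the kernel matrix (PCA in feature space)
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J
    # Eigendecomposition; keep the largest eigenpairs
    vals, vecs = np.linalg.eigh(Kc)
    idx = np.argsort(vals)[::-1][:n_components]
    vals, vecs = vals[idx], vecs[:, idx]
    # Projections of the training points onto the components
    return vecs * np.sqrt(np.maximum(vals, 0))

rng = np.random.default_rng(0)
# Two synthetic "populations" separated in genotype space
X = np.vstack([rng.normal(0, 1, (30, 50)), rng.normal(0.5, 1, (30, 50))])
small = rbf_kernel_pca(X, sigma=2.0)   # small bandwidth: local, rare-variant-like structure
large = rbf_kernel_pca(X, sigma=20.0)  # large bandwidth: broad, common-variant-like structure
print(small.shape, large.shape)  # (60, 2) (60, 2)
```

Varying `sigma` changes which scale of similarity dominates the projections, mirroring how rare versus common variants emphasize local versus broad structure.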
Multilevel Hierarchical Kernel Spectral Clustering for Real-Life Large Scale Complex Networks
Kernel spectral clustering corresponds to a weighted kernel principal
component analysis problem in a constrained optimization framework. The primal
formulation leads to an eigen-decomposition of a centered Laplacian matrix at
the dual level. The dual formulation allows to build a model on a
representative subgraph of the large scale network in the training phase and
the model parameters are estimated in the validation stage. The KSC model has a
powerful out-of-sample extension property which allows cluster affiliation for
the unseen nodes of the big data network. In this paper we exploit the
structure of the projections in the eigenspace during the validation stage to
automatically determine a set of increasing distance thresholds. We use these
distance thresholds in the test phase to obtain multiple levels of hierarchy
for the large scale network. The hierarchical structure in the network is
determined in a bottom-up fashion. We empirically showcase that real-world
networks have multilevel hierarchical organization which cannot be detected
efficiently by several state-of-the-art large scale hierarchical community
detection techniques like the Louvain, OSLOM and Infomap methods. We show a
major advantage of our proposed approach, namely the ability to locate good-quality
clusters at both the coarser and finer levels of the hierarchy, using internal
cluster quality metrics on 7 real-life networks.
Comment: PLOS ONE, Vol 9, Issue 6, June 201
Cleaning foregrounds from single-dish 21 cm intensity maps with Kernel principal component analysis
The high dynamic range between contaminating foreground emission and the fluctuating 21 cm brightness temperature field is one of the most problematic characteristics of 21 cm intensity mapping data. While these components would ordinarily have distinctive frequency spectra, making it relatively easy to separate them, instrumental effects and calibration errors further complicate matters by modulating and mixing them together. A popular class of foreground cleaning methods comprises unsupervised techniques related to principal component analysis (PCA), which exploit the different shapes and amplitudes of each component's contribution to the covariance of the data in order to segregate the signals. These methods have been shown to be effective at removing foregrounds, while also unavoidably filtering out some of the 21 cm signal. In this paper we examine, for the first time in the context of 21 cm intensity mapping, a generalized method called Kernel PCA, which instead operates on the covariance of non-linear transformations of the data. This allows more flexible functional bases to be constructed, in principle allowing a cleaner separation between foregrounds and the 21 cm signal to be found. We show that Kernel PCA is effective when applied to simulated single-dish (auto-correlation) 21 cm data under a variety of assumptions about foreground models, instrumental effects, etc. It presents a different set of behaviours to PCA, e.g. in terms of sensitivity to the data resolution and smoothing scale, outperforming it on intermediate to large scales in most scenarios.
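The baseline PCA cleaning strategy that Kernel PCA generalizes can be sketched in a few lines: project out the leading frequency-frequency covariance modes, which the bright, spectrally smooth foregrounds dominate. This is a toy sketch on fabricated data, not the paper's simulation setup; all shapes, amplitudes, and names are assumptions.

```python
import numpy as np

def pca_clean(data, n_fg):
    """Remove the n_fg leading principal components along the
    frequency axis -- a stand-in for PCA foreground subtraction."""
    # data: (n_freq, n_pix); centre each frequency channel over pixels
    X = data - data.mean(axis=1, keepdims=True)
    # Leading SVD modes capture the smooth, bright foregrounds
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    fg = U[:, :n_fg] @ (U[:, :n_fg].T @ X)
    return X - fg

rng = np.random.default_rng(0)
n_freq, n_pix = 64, 500
nu = np.linspace(0.4, 0.5, n_freq)[:, None]        # toy frequency band
# Rank-1 mock foreground: power law in frequency, random per-pixel amplitude
foreground = 1e3 * (nu / 0.45) ** -2.7 * rng.lognormal(0, 0.5, (1, n_pix))
signal = 1e-1 * rng.normal(size=(n_freq, n_pix))   # mock 21 cm fluctuations
cleaned = pca_clean(foreground + signal, n_fg=1)
# The bright smooth foreground sits in the first mode, so the
# residual amplitude should be close to the signal amplitude.
print(np.std(cleaned) / np.std(signal))
```

Kernel PCA replaces the linear covariance here with the covariance of non-linear transformations of the data, which is what gives it the more flexible functional bases described above.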
Kernel Multivariate Analysis Framework for Supervised Subspace Learning: A Tutorial on Linear and Kernel Multivariate Methods
Feature extraction and dimensionality reduction are important tasks in many
fields of science dealing with signal processing and analysis. The relevance of
these techniques is increasing as current sensory devices are developed with
ever higher resolution, and problems involving multimodal data sources become
more common. A plethora of feature extraction methods are available in the
literature collectively grouped under the field of Multivariate Analysis (MVA).
This paper provides a uniform treatment of several methods: Principal Component
Analysis (PCA), Partial Least Squares (PLS), Canonical Correlation Analysis
(CCA) and Orthonormalized PLS (OPLS), as well as their non-linear extensions
derived by means of the theory of reproducing kernel Hilbert spaces. We also
review their connections to other methods for classification and statistical
dependence estimation, and introduce some recent developments to deal with the
extreme cases of large-scale and low-sized problems. To illustrate the wide
applicability of these methods in both classification and regression problems,
we analyze their performance in a benchmark of publicly available data sets,
and pay special attention to specific real applications involving audio
processing for music genre prediction and hyperspectral satellite images for
Earth and climate monitoring.
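One of the linear MVA methods the tutorial covers, canonical correlation analysis, admits a compact linear-algebra sketch: SVD of the whitened cross-covariance between the two views. This is an illustrative implementation on synthetic data, not the tutorial's code; the regularization value and data model are assumptions.

```python
import numpy as np

def linear_cca(X, Y, n_components=1, reg=1e-6):
    """CCA via SVD of the whitened cross-covariance (standard
    linear-algebra formulation; toy sketch)."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = len(X)
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    def inv_sqrt(C):
        # Inverse matrix square root via eigendecomposition
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T
    M = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(M)
    Wx = inv_sqrt(Cxx) @ U[:, :n_components]
    Wy = inv_sqrt(Cyy) @ Vt[:n_components].T
    return Wx, Wy, s[:n_components]  # s: canonical correlations

rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))                        # shared latent factor
X = z @ rng.normal(size=(1, 5)) + 0.1 * rng.normal(size=(500, 5))
Y = z @ rng.normal(size=(1, 4)) + 0.1 * rng.normal(size=(500, 4))
Wx, Wy, corr = linear_cca(X, Y)
print(round(float(corr[0]), 2))  # close to 1: the shared factor is recovered
```

The kernel variants discussed in the tutorial replace the covariance matrices above with their counterparts in a reproducing kernel Hilbert space.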
Revisiting Kernelized Locality-Sensitive Hashing for Improved Large-Scale Image Retrieval
We present a simple but powerful reinterpretation of kernelized
locality-sensitive hashing (KLSH), a general and popular method developed in
the vision community for performing approximate nearest-neighbor searches in an
arbitrary reproducing kernel Hilbert space (RKHS). Our new perspective is based
on viewing the steps of the KLSH algorithm in an appropriately projected space,
and has several key theoretical and practical benefits. First, it eliminates
the problematic conceptual difficulties that are present in the existing
motivation of KLSH. Second, it yields the first formal retrieval performance
bounds for KLSH. Third, our analysis reveals two techniques for boosting the
empirical performance of KLSH. We evaluate these extensions on several
large-scale benchmark image retrieval data sets, and show that our analysis
leads to improved recall performance of at least 12%, and sometimes much
higher, over the standard KLSH method.
Comment: 15 page
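The flavour of kernelized hashing can be conveyed with a toy sketch: each hash bit is the sign of a random linear functional in the RKHS, represented through kernel evaluations against a set of sampled anchor points. This is a simplified stand-in, not the exact KLSH construction analyzed in the paper; the class name, weighting scheme, and parameters are all hypothetical.

```python
import numpy as np

def rbf(A, B, sigma=1.0):
    """RBF kernel matrix between row sets A and B."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * sigma**2))

class SimpleKernelHash:
    """Toy kernelized hashing: each bit is the sign of a random
    combination of kernel evaluations against m anchor points."""
    def __init__(self, anchors, n_bits=16, sigma=3.0, seed=0):
        rng = np.random.default_rng(seed)
        self.anchors, self.sigma = anchors, sigma
        # Random weights over anchors define each hash hyperplane
        self.W = rng.normal(size=(len(anchors), n_bits))

    def hash(self, X):
        K = rbf(X, self.anchors, self.sigma)
        Kc = K - K.mean(axis=1, keepdims=True)  # crude row centering
        return (Kc @ self.W > 0).astype(np.uint8)

rng = np.random.default_rng(1)
data = rng.normal(size=(200, 10))
lsh = SimpleKernelHash(anchors=data[:50], n_bits=16)
codes = lsh.hash(data)                        # (200, 16) binary codes
q = data[0] + 0.01 * rng.normal(size=10)      # near-duplicate query
qc = lsh.hash(q[None, :])[0]
ham = (codes ^ qc).sum(axis=1)                # Hamming distances to query
print(int(ham[0]))  # near-duplicate collides in (almost) all bits
```

Nearby points in the RKHS tend to agree on most bits, so short binary codes stand in for expensive kernel comparisons at query time.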