On Sequence Clustering and Supervised Dimensionality Reduction

Author: Wang Tiexing
Publication venue: SURFACE at Syracuse University
Publication date: 26/06/2020

This dissertation studies two machine learning problems: 1) clustering of independent and identically generated random sequences, and 2) dimensionality reduction for classification problems. For sequence clustering, the focus is on the large-sample performance of classical clustering algorithms, including the k-medoids algorithm and hierarchical agglomerative clustering (HAC) algorithms. Data sequences are generated from unknown continuous distributions that are assumed to form clusters according to some well-defined distance metrics. The goal is to group data sequences according to their underlying distributions with little or no prior knowledge of either the underlying distributions or the number of clusters. Upper bounds on the clustering error probability are derived for the k-medoids algorithm and a class of HAC algorithms under mild assumptions on the distribution clusters and distance metrics. For both cases, the error probabilities are shown to decay exponentially fast as the number of samples in each data sequence goes to infinity. The obtained error exponent bound has a simple form when either the Kolmogorov-Smirnov distance or the maximum mean discrepancy is used as the distance metric. A tighter upper bound on the error probability of the single-linkage HAC algorithm is derived by taking advantage of the simplified metric updating scheme. Numerical results are provided to validate the analysis. For dimensionality reduction, the focus is on classification problems where label information in the training data can be leveraged for improved learning performance. A supervised dimensionality reduction method maximizing the difference of average projection energy of samples with different labels is proposed. Both synthetic data and WiFi sensing data are used to validate the effectiveness of the proposed method. The numerical results show that the proposed method outperforms existing supervised dimensionality reduction approaches based on Fisher discriminant analysis (FDA) and the Hilbert-Schmidt independence criterion (HSIC). When the kernel trick is applied to all three approaches, the performance of the proposed dimensionality reduction method is comparable to FDA and HSIC and is superior to unsupervised principal component analysis.
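
To make the sequence-clustering setup concrete, here is a minimal sketch, not the dissertation's implementation, of k-medoids applied to data sequences with the two-sample Kolmogorov-Smirnov distance as the metric. The function names, the farthest-first initialization, and the Gaussian toy data are all assumptions made for illustration.

```python
import random
from bisect import bisect_right


def ks_distance(x, y):
    """Two-sample Kolmogorov-Smirnov distance: largest gap between empirical CDFs."""
    xs, ys = sorted(x), sorted(y)
    gap = 0.0
    for t in xs + ys:  # the maximum gap is attained at one of the sample points
        fx = bisect_right(xs, t) / len(xs)
        fy = bisect_right(ys, t) / len(ys)
        gap = max(gap, abs(fx - fy))
    return gap


def farthest_first(dist, k):
    """Hypothetical initialization: greedily pick mutually distant medoids."""
    medoids = [0]
    while len(medoids) < k:
        rest = [i for i in range(len(dist)) if i not in medoids]
        medoids.append(max(rest, key=lambda i: min(dist[i][m] for m in medoids)))
    return medoids


def k_medoids(dist, k, max_iter=100):
    """Alternate nearest-medoid assignment and medoid update on a distance matrix."""
    n = len(dist)
    medoids = farthest_first(dist, k)
    labels = [0] * n
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]
        updated = []
        for c in range(k):
            # Fall back to the old medoid if a cluster ends up empty.
            members = [i for i in range(n) if labels[i] == c] or [medoids[c]]
            # New medoid: the member with the smallest total distance to its cluster.
            updated.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        if updated == medoids:
            break
        medoids = updated
    return labels, medoids


# Toy data (assumed): four sequences each from N(0, 1) and N(3, 1), 200 samples per sequence.
rng = random.Random(0)
sequences = [[rng.gauss(mu, 1.0) for _ in range(200)] for mu in (0, 0, 0, 0, 3, 3, 3, 3)]
dist = [[ks_distance(a, b) for b in sequences] for a in sequences]
labels, medoids = k_medoids(dist, k=2)
print(labels)
```

Because the algorithm only needs a pairwise distance matrix, another metric such as the maximum mean discrepancy could be swapped in for `ks_distance` without changing `k_medoids`.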

Syracuse University Research Facility and Collaborative Environment

Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs

Author: Korenblum Daniel
Publication venue: Public Library of Science (PLoS)
Publication date: 01/01/2018

Laplacian mixture models identify overlapping regions of influence in unlabeled graph and network data in a scalable and computationally efficient way, yielding useful low-dimensional representations. By combining Laplacian eigenspace and finite mixture modeling methods, they provide probabilistic or fuzzy dimensionality reductions or domain decompositions for a variety of input data types, including mixture distributions, feature vectors, and graphs or networks. Provably optimal recovery using the algorithm is shown analytically for a nontrivial class of cluster graphs. Heuristic approximations for scalable high-performance implementations are described and empirically tested. Connections to PageRank and community detection in network analysis demonstrate the wide applicability of this approach. The origins of fuzzy spectral methods, beginning with generalized heat or diffusion equations in physics, are reviewed and summarized. Comparisons to other dimensionality reduction and clustering methods for challenging unsupervised machine learning problems are also discussed. Comment: 13 figures, 35 references

arXiv.org e-Print Archive

Directory of Open Access Journals

SpectralNET – an application for spectral graph analysis and visualization

Authors: Clemons Paul A; Forman Joshua J; Haggarty Stephen J; Schreiber Stuart L
Publication venue: BioMed Central
Publication date: 01/01/2005

BACKGROUND: Graph theory provides a computational framework for modeling a variety of datasets including those emerging from genomics, proteomics, and chemical genetics. Networks of genes, proteins, small molecules, or other objects of study can be represented as graphs of nodes (vertices) and interactions (edges) that can carry different weights. SpectralNET is a flexible application for analyzing and visualizing these biological and chemical networks. RESULTS: Available both as a standalone .NET executable and as an ASP.NET web application, SpectralNET was designed specifically with the analysis of graph-theoretic metrics in mind, a computational task not easily accessible using currently available applications. Users can choose either to upload a network for analysis using a variety of input formats, or to have SpectralNET generate an idealized random network for comparison to a real-world dataset. Whichever graph-generation method is used, SpectralNET displays detailed information about each connected component of the graph, including graphs of degree distribution, clustering coefficient by degree, and average distance by degree. In addition, extensive information about the selected vertex is shown, including degree, clustering coefficient, various distance metrics, and the corresponding components of the adjacency, Laplacian, and normalized Laplacian eigenvectors. SpectralNET also displays several graph visualizations, including a linear dimensionality reduction for uploaded datasets (Principal Components Analysis) and a non-linear dimensionality reduction that provides an elegant view of global graph structure (Laplacian eigenvectors). CONCLUSION: SpectralNET provides an easily accessible means of analyzing graph-theoretic metrics for data modeling and dimensionality reduction. SpectralNET is publicly available as both a .NET application and an ASP.NET web application from . Source code is available upon request.
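
As a rough illustration of the per-vertex quantities this abstract mentions (degree, clustering coefficient, and the graph Laplacian), the following sketch computes them for a small unweighted, undirected graph. It is not SpectralNET code; the helper names and the four-vertex example graph are assumptions.

```python
from itertools import combinations


def adjacency(n, edges):
    """Neighbor sets for an undirected, unweighted graph on vertices 0..n-1."""
    adj = [set() for _ in range(n)]
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbors that are themselves connected."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2.0 * links / (k * (k - 1))


def laplacian(adj):
    """Combinatorial Laplacian L = D - A as a list of rows."""
    n = len(adj)
    return [
        [len(adj[i]) if i == j else (-1 if j in adj[i] else 0) for j in range(n)]
        for i in range(n)
    ]


# Example graph (assumed): a triangle 0-1-2 with a pendant vertex 3 attached to vertex 2.
adj = adjacency(4, [(0, 1), (1, 2), (0, 2), (2, 3)])
degrees = [len(nbrs) for nbrs in adj]
L = laplacian(adj)
print(degrees)
print([round(clustering_coefficient(adj, v), 3) for v in range(4)])
```

The eigenvectors of `L` belonging to its smallest nonzero eigenvalues give the kind of non-linear embedding the abstract refers to; computing them would need a numerical eigensolver, which this sketch leaves out.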

Harvard University - DASH

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Flexible sampling of discrete data correlations without the marginal distributions

Authors: Kalaitzis Alfredo; Silva Ricardo
Publication date: 01/01/2013

Learning the joint dependence of discrete variables is a fundamental problem in machine learning, with many applications including prediction, clustering and dimensionality reduction. More recently, the framework of copula modeling has gained popularity due to its modular parametrization of joint distributions. Among other properties, copulas provide a recipe for combining flexible models for univariate marginal distributions with parametric families suitable for potentially high dimensional dependence structures. More radically, the extended rank likelihood approach of Hoff (2007) bypasses learning marginal models completely when such information is ancillary to the learning task at hand as in, e.g., standard dimensionality reduction problems or copula parameter estimation. The main idea is to represent data by their observable rank statistics, ignoring any other information from the marginals. Inference is typically done in a Bayesian framework with Gaussian copulas, and it is complicated by the fact that this implies sampling within a space where the number of constraints increases quadratically with the number of data points. The result is slow mixing when using off-the-shelf Gibbs sampling. We present an efficient algorithm based on recent advances in constrained Hamiltonian Markov chain Monte Carlo that is simple to implement and does not require paying for a quadratic cost in sample size. Comment: An overhauled version of the experimental section moved to the main paper. Old experimental section moved to supplementary material.
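
The idea of representing data by rank statistics can be sketched in a few lines. The example below is only an illustration of why ranks discard marginal information: any strictly increasing transformation of a variable leaves its ranks unchanged. The `ranks` helper and the toy values are assumptions and are not part of the paper's sampler.

```python
import math


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average
        i = j + 1
    return result


x = [3, 1, 2, 2]                      # discrete data with a tie
x_warped = [math.exp(v) for v in x]   # strictly increasing change of the marginal

print(ranks(x))
print(ranks(x) == ranks(x_warped))
```

Ties are common in discrete data, which is why the helper assigns average ranks instead of assuming all values are distinct.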

arXiv.org e-Print Archive

CiteSeerX

UCL Discovery

Clustering based feature selection using Partitioning Around Medoids (PAM)

Authors: Ismi Dewi Pramudi; Murinto Murinto
Publication venue: Universitas Ahmad Dahlan, Kampus 3
Publication date: 19/05/2020

High-dimensional data contains a large number of features, and processing it requires substantial computational resources in both space and time. Several studies indicate that not all features of high-dimensional data are relevant to the classification result. Dimensionality reduction is therefore needed to improve classifier performance. Several dimensionality reduction techniques have been proposed, including feature selection and feature extraction techniques. Sequential forward feature selection and backward feature selection are greedy feature selection methods. Heuristic approaches have also been applied to feature selection, using the Genetic Algorithm, PSO, and the Forest Optimization Algorithm. PCA is the most well-known feature extraction method; others include multidimensional scaling and linear discriminant analysis. In this work, a different approach is applied to perform feature selection: cluster-analysis-based feature selection using Partitioning Around Medoids (PAM) clustering. Our experimental results show that the classification accuracy obtained when using the feature vectors' medoids to represent the original dataset is high, above 80%.

Journal of Education and Learning (EduLearn)

UAD Journal Management System