6 research outputs found

    Unsupervised Feature Selection Via Orthogonal Basis Clustering and Local Structure Preserving

    Because of the "curse of dimensionality," discarding redundant features and selecting informative ones in high-dimensional data has become a critical problem, and many research studies are dedicated to solving it. Unsupervised feature selection, which requires no prior category information, has gained a prominent place among feature selection techniques for preprocessing high-dimensional data, and it has been applied in many applications related to neural networks and learning systems, e.g., pattern classification. In this article, we propose an efficient method for unsupervised feature selection via orthogonal basis clustering and reliable local structure preserving, referred to as OCLSP. OCLSP combines orthogonal basis clustering with adaptive graph regularization, which simultaneously achieves excellent cluster separation and preserves the local information of the data. We also exploit an efficient alternative optimization algorithm to solve the challenging optimization problem posed by OCLSP, and we theoretically analyze its computational complexity and convergence. Finally, we conduct comprehensive experiments on nine real-world datasets to validate OCLSP. The results show that it outperforms many state-of-the-art unsupervised feature selection methods in clustering accuracy and normalized mutual information, which indicates a strong ability to identify the more important features.

    An Empirical Comparison of Neo4j and TigerGraph Databases for Network Centrality

    Graph databases have recently gained a lot of attention in areas where the relationships between data and the data itself are equally important, such as the semantic web, social networks, and biological networks. A graph database is simply a database designed to store, query, and modify graphs. Recently, several graph database models have been developed. The goal of this research is to evaluate the performance of the two most popular graph databases, Neo4j and TigerGraph, on network centrality metrics including degree centrality, betweenness centrality, closeness centrality, eigenvector centrality, and PageRank. We applied those metrics to a set of real-world networks in both graph databases to measure their performance. Experimental results show that Neo4j outperforms TigerGraph in computing the centrality metrics used in this study, but TigerGraph performs better during the data loading phase.
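
As a rough illustration of two of the metrics named in the abstract above, the sketch below computes degree centrality and PageRank in plain Python on a small undirected graph. It is not the Neo4j or TigerGraph implementation and says nothing about their performance; the function names (`build_adjacency`, `degree_centrality`, `pagerank`), the damping factor of 0.85, and the fixed iteration count are assumptions chosen for the example.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map from (u, v) pairs."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return dict(adj)


def degree_centrality(adj):
    """Degree of each vertex divided by the maximum possible degree (n - 1)."""
    n = len(adj)
    if n <= 1:
        return {v: 0.0 for v in adj}
    return {v: len(nbrs) / (n - 1) for v, nbrs in adj.items()}


def pagerank(adj, damping=0.85, iterations=100):
    """Power-iteration PageRank; assumes every vertex has at least one neighbor."""
    n = len(adj)
    rank = {v: 1.0 / n for v in adj}
    for _ in range(iterations):
        new_rank = {}
        for v in adj:
            # Each neighbor u passes an equal share of its rank to v.
            incoming = sum(rank[u] / len(adj[u]) for u in adj[v])
            new_rank[v] = (1 - damping) / n + damping * incoming
        rank = new_rank
    return rank


# Hypothetical star-shaped network: "a" is connected to every other vertex.
star = build_adjacency([("a", "b"), ("a", "c"), ("a", "d")])
centrality = degree_centrality(star)
ranks = pagerank(star)
```

On the star graph, the hub `a` touches every other vertex, so its degree centrality is the maximum value of 1, and it also receives the largest PageRank score.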

    Efficient Network Domination for Life Science Applications

    With the ever-increasing size of data available to researchers, traditional methods of analysis often cannot scale to match the problems being studied. Often only a subset of variables may be utilized or studied further, motivating the need for techniques that can prioritize variable selection. This dissertation describes the development and application of graph theoretic techniques, particularly the notion of domination, for this purpose. In the first part of this dissertation, algorithms for vertex prioritization in the field of network controllability are studied. Here, the number of solutions to which a vertex belongs is used to classify that vertex and determine its suitability for controlling a network. Novel, efficient, and scalable algorithms are developed and analyzed. Empirical tests demonstrate the improvement of these algorithms over those already established in the literature. The second part of this dissertation concerns the prioritization of genes for loss-of-function allele studies in mice. The International Mouse Phenotyping Consortium leads the initiative to develop a loss-of-function allele for each protein-coding gene in the mouse genome. Only a small proportion of untested genes can be selected for further study. To address the need to prioritize genes, a generalizable data science strategy is developed. This strategy models genes as a gene-similarity graph and selects from it a subset that will be further characterized. Empirical tests demonstrate the method’s utility over that of pseudorandom selection and less computationally demanding methods. Finally, part three addresses the important task of preprocessing in the context of noisy public health data. Many public health databases have been developed to collect, curate, and store a variety of environmental measurements. Idiosyncrasies in these measurements, however, introduce noise into these databases in several ways, including missing, incorrect, outlying, and incompatible data. Beyond noisy data, multiple measurements of similar variables can introduce problems of multicollinearity. Domination is again employed in a novel graph method to handle autocorrelation. Empirical results using the Public Health Exposome dataset are reported. Together, these three parts demonstrate the utility of subset selection via domination when applied to a multitude of data sources from a variety of disciplines in the life sciences.
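
To make the notion of domination concrete: a dominating set is a subset of vertices such that every vertex is either in the subset or adjacent to a member of it. The sketch below uses the standard greedy heuristic, which is not the dissertation's algorithm (the abstract does not specify it) and does not guarantee a minimum-size set. The function names and the example graph are hypothetical, and the adjacency map is assumed to be symmetric with comparable vertex labels.

```python
def greedy_dominating_set(adj):
    """Greedy heuristic: repeatedly pick the vertex that covers the most
    still-undominated vertices (itself plus its neighbors)."""
    undominated = set(adj)
    chosen = []
    while undominated:
        # sorted() makes tie-breaking deterministic (smallest label wins).
        best = max(sorted(adj), key=lambda v: len(({v} | adj[v]) & undominated))
        chosen.append(best)
        undominated -= {best} | adj[best]
    return chosen


def is_dominating(adj, subset):
    """Check that every vertex is in the subset or adjacent to a member."""
    members = set(subset)
    return all(v in members or adj[v] & members for v in adj)


# Hypothetical similarity graph: a path 1 - 2 - 3 - 4 - 5.
path = {1: {2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
selected = greedy_dominating_set(path)
```

On the five-vertex path, the heuristic selects vertices 2 and 4, which together cover every vertex; in a variable-selection setting, the selected vertices would stand in for their neighbors.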