396 research outputs found

    Discovering gene functional relationships using a literature-based NMF model

    The rapid growth of the biomedical literature and genomic information presents a major challenge for determining the functional relationships among genes. Several bioinformatics tools have been developed to extract and identify gene relationships from various biological databases. However, an intuitive user-interface tool that allows the biologist to determine functional relationships among genes is still not available. In this study, we develop a Web-based bioinformatics software environment called FAUN or Feature Annotation Using Nonnegative matrix factorization (NMF) to facilitate both the discovery and classification of functional relationships among genes. Both the computational complexity and parameterization of NMF for processing gene sets are discussed. We tested FAUN on three manually constructed gene document collections, and then used it to analyze several microarray-derived gene sets obtained from studies of the developing cerebellum in normal and mutant mice. FAUN provides utilities for collaborative knowledge discovery and identification of new gene relationships from text streams and repositories (e.g., MEDLINE). It is particularly useful for the validation and analysis of gene associations suggested by microarray experimentation. The FAUN site is publicly available at http://grits.eecs.utk.edu/faun

    Advances in Nonnegative Matrix Decomposition with Application to Cluster Analysis

    Nonnegative Matrix Factorization (NMF) has found a wide variety of applications in machine learning and data mining. NMF seeks to approximate a nonnegative data matrix by a product of several low-rank factorizing matrices, some of which are constrained to be nonnegative. Such additive nature often results in parts-based representation of the data, which is a desired property especially for cluster analysis.  This thesis presents advances in NMF with application in cluster analysis. It reviews a class of higher-order NMF methods called Quadratic Nonnegative Matrix Factorization (QNMF). QNMF differs from most existing NMF methods in that some of its factorizing matrices occur twice in the approximation. The thesis also reviews a structural matrix decomposition method based on Data-Cluster-Data (DCD) random walk. DCD goes beyond matrix factorization and has a solid probabilistic interpretation by forming the approximation with cluster assigning probabilities only. Besides, the Kullback-Leibler divergence adopted by DCD is advantageous in handling sparse similarities for cluster analysis.  Multiplicative update algorithms have been commonly used for optimizing NMF objectives, since they naturally maintain the nonnegativity constraint of the factorizing matrix and require no user-specified parameters. In this work, an adaptive multiplicative update algorithm is proposed to increase the convergence speed of QNMF objectives.  Initialization conditions play a key role in cluster analysis. In this thesis, a comprehensive initialization strategy is proposed to improve the clustering performance by combining a set of base clustering methods. The proposed method can better accommodate clustering methods that need a careful initialization such as the DCD.  The proposed methods have been tested on various real-world datasets, such as text documents, face images, protein, etc. In particular, the proposed approach has been applied to the cluster analysis of emotional data
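
    The multiplicative updates mentioned in this abstract can be illustrated with a minimal sketch. It uses the standard Frobenius-norm update rules for plain two-factor NMF, not the thesis's adaptive algorithm or its QNMF and DCD objectives. The function name `nmf_multiplicative`, its parameters, and the synthetic test matrix are assumptions made for illustration.

```python
import numpy as np

def nmf_multiplicative(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Approximate a nonnegative matrix V by W @ H with W, H >= 0.

    Hypothetical helper: standard multiplicative updates for the
    Frobenius objective ||V - W H||. No step size is needed, and the
    factors stay nonnegative because each update multiplies by a
    ratio of nonnegative terms.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    # Strictly positive random start (assumed initialization scheme).
    W = rng.random((n, rank)) + eps
    H = rng.random((rank, m)) + eps
    errors = []
    for _ in range(n_iter):
        # eps guards against division by zero.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors

# Synthetic nonnegative matrix of rank 3 (illustrative data only).
rng = np.random.default_rng(1)
V = rng.random((20, 3)) @ rng.random((3, 15))
W, H, errors = nmf_multiplicative(V, rank=3)
print(f"first error: {errors[0]:.4f}, last error: {errors[-1]:.4f}")
```

    For clustering, one common convention is to assign each column of V to the row of H with the largest weight, for example `H.argmax(axis=0)`.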

    No Pattern, No Recognition: a Survey about Reproducibility and Distortion Issues of Text Clustering and Topic Modeling

    Extracting knowledge from unlabeled texts using machine learning algorithms can be complex. Document categorization and information retrieval are two applications that may benefit from unsupervised learning (e.g., text clustering and topic modeling), including exploratory data analysis. However, the unsupervised learning paradigm poses reproducibility issues. The initialization can lead to variability depending on the machine learning algorithm. Furthermore, the distortions can be misleading when regarding cluster geometry. Amongst the causes, the presence of outliers and anomalies can be a determining factor. Despite the relevance of initialization and outlier issues for text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology since similar procedures have different terms. The authors describe research opportunities, trends, and open issues. The appendices summarize the theoretical background of the text vectorization, the factorization, and the clustering algorithms that are directly or indirectly related to the reviewed works

    A Channel Ranking And Selection Scheme Based On Channel Occupancy And SNR For Cognitive Radio Systems

    Wireless networks and information traffic have grown exponentially over the last decade, and demand for radio spectrum bandwidth has increased as a result. Recent studies have shown that with the current fixed spectrum allocation (FSA), radio frequency band utilization ranges from 15% to 85%. Therefore, there are spectrum holes that are not utilized all the time by the licensed users, and thus the radio spectrum is inefficiently exploited. To solve the problem of scarcity and inefficient utilization of the spectrum resources, dynamic spectrum access has been proposed as a solution to enable sharing and using available frequency channels. With dynamic spectrum allocation (DSA), unlicensed users can access and use licensed, available channels when primary users are not transmitting. Cognitive Radio technology is one of the next generation technologies that will allow efficient utilization of spectrum resources by enabling DSA. However, dynamic spectrum allocation by a cognitive radio system comes with the challenges of accurately detecting and selecting the best channel based on the channel's availability and quality of service. Therefore, the spectrum sensing and analysis processes of a cognitive radio system are essential to make accurate decisions. Different spectrum sensing techniques and channel selection schemes have been proposed. However, these techniques only consider the spectrum occupancy rate for selecting the best channel, which can lead to erroneous decisions. Other communication parameters, such as the Signal-to-Noise Ratio (SNR), should also be taken into account. Therefore, the spectrum decision-making process of a cognitive radio system must use techniques that consider spectrum occupancy and channel quality metrics to rank channels and select the best option. This thesis aims to develop a utility function based on spectrum occupancy and SNR measurements to model and rank the sensed channels.
An evolutionary algorithm-based SNR estimation technique was developed, which enables adaptively varying key parameters of the existing Eigenvalue-based blind SNR estimation technique. The performance of the improved technique is compared to the existing technique. Results show that the evolutionary algorithm-based estimation performs better than the existing technique. The utility-based channel ranking technique was developed by first defining a channel utility function that takes into account SNR and spectrum occupancy. Different mathematical functions were investigated to appropriately model the utility of SNR and spectrum occupancy rate. A ranking table is provided with the utility values of the sensed channels and compared with the usual occupancy-rate-based channel ranking. According to the results, utility-based channel ranking provides a better basis for making an informed decision by considering both channel occupancy rate and SNR. In addition, the efficiency of several noise cancellation techniques was investigated. These techniques can be employed to remove the impact of noise on the received or sensed signals during the spectrum sensing process of a cognitive radio system. Performance evaluation of these techniques was done using simulations, and the results show that the evolutionary algorithm-based noise cancellation techniques, particle swarm optimization and genetic algorithm, perform better than the regular gradient descent-based technique, which is the least-mean-square algorithm
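
The idea of ranking channels by a utility that combines occupancy and SNR can be sketched as follows. The abstract does not say which mathematical functions the thesis selected, so the sigmoid SNR utility, the linear occupancy term, their product, the parameter values, and the three example channels are all assumptions made for illustration.

```python
import math

def snr_utility(snr_db, midpoint_db=10.0, steepness=0.5):
    """Map SNR in dB to (0, 1) with a sigmoid (assumed functional form)."""
    return 1.0 / (1.0 + math.exp(-steepness * (snr_db - midpoint_db)))

def channel_utility(occupancy, snr_db):
    """Combine availability (1 - occupancy) and SNR utility as a product.

    Hypothetical utility: a channel scores well only if it is both
    frequently idle and of good quality.
    """
    return (1.0 - occupancy) * snr_utility(snr_db)

# Illustrative measurements: name -> (occupancy rate, SNR in dB).
channels = {"ch1": (0.20, 3.0), "ch2": (0.35, 18.0), "ch3": (0.80, 25.0)}

# Ranking by occupancy alone: least occupied channel first.
occupancy_ranking = sorted(channels, key=lambda c: channels[c][0])

# Ranking by utility: highest combined score first.
utility_ranking = sorted(
    channels, key=lambda c: channel_utility(*channels[c]), reverse=True
)

print("occupancy only:", occupancy_ranking)
print("utility based: ", utility_ranking)
```

With these made-up numbers, the least occupied channel (`ch1`) drops to last place once its low SNR is taken into account, which is the kind of difference the abstract attributes to utility-based ranking.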

    Level set segmentation using non-negative matrix factorization with application to brain MRI

    We address the problem of image segmentation using a new deformable model based on the level set method (LSM) and non-negative matrix factorization (NMF). We describe the use of NMF to reduce the dimension of large images from thousands of pixels to a handful of metapixels or regions. In addition, the exact number of regions is discovered using the nuclear norm of the NMF factors. The proposed NMF-LSM characterizes the histogram of the image, calculated over the image blocks, as nonnegative combinations of basic histograms computed using NMF (V ~ W H). The matrix W represents the histograms of the image regions, whereas the matrix H provides the spatial clustering of the regions. NMF-LSM takes into account the bias field present particularly in medical images. We define two local clustering criteria in terms of the NMF factors. The first criterion defines a local intensity clustering property based on the matrix W by computing the average intensity and standard deviation of every region. The second criterion defines a local spatial clustering using the matrix H. The local clustering is then summed over all regions to give a global criterion of image segmentation. In LSM, these criteria define an energy minimized w.r.t. LSFs and the bias field to achieve the segmentation. The proposed method is validated on synthetic binary and gray-scale images, and then applied to real brain MRI images. NMF-LSM provides a general approach for robust region discovery and segmentation in heterogeneous images

    Nonnegative matrix analysis for data clustering and compression

    Nonnegative matrix factorization (NMF) has become an increasingly popular data processing tool in recent years, widely used by various communities including computer vision, text mining and bioinformatics. It is able to approximate each data sample in a data collection by a linear combination of a set of nonnegative basis vectors weighted by nonnegative weights. This often enables meaningful interpretation of the data, motivates useful insights and facilitates tasks such as data compression, clustering and classification. These subsequently lead to various active roles of NMF in data analysis, e.g., dimensionality reduction tool [11, 75], clustering tool [94, 82, 13, 39], feature engine [40], source separation tool [38], etc. Different methods based on NMF are proposed in this thesis: A modification of k-means clustering is chosen as one of the initialisation methods for NMF. Experimental results demonstrate the excellence of this method with improved compression performance. Independent principal component analysis (IPCA), which combines the advantages of both principal component analysis (PCA) and independent component analysis (ICA), has been chosen as an initialisation method for NMF with improved clustering accuracy. We have proposed a new evolutionary optimization strategy for NMF driven by three proposed update schemes in the solution space, namely the NMF rule (or original movement), the firefly rule (or beta movement) and the survival of the fittest rule (or best movement). This proposed update strategy addresses both the clustering and compression problems by using different system objective functions that make use of the clustering and compression quality measurements. A hybrid initialisation approach is used by including the state-of-the-art NMF initialization methods as seed knowledge to increase the rate of convergence.
There is no limitation on the number and the type of the initialization methods used for the proposed optimisation approach. Numerous computer experiments using the benchmark datasets verify the theoretical results and compare the techniques in terms of clustering/compression accuracy. Experimental results demonstrate the excellence of these methods with improved clustering/compression performance. In the application to an EEG dataset, we employed several standard algorithms to cluster preprocessed EEG data. We also explored ensemble clustering to obtain some tight clusters. We can make some statements based on the results obtained: firstly, normalization is necessary for this EEG brain dataset to obtain reasonable clustering; secondly, k-means, k-medoids and HC-Ward provide relatively better clustering results; thirdly, ensemble clustering enables us to tune the tightness of the clusters so that the research can be focused