1,192 research outputs found

    Laplacian Mixture Modeling for Network Analysis and Unsupervised Learning on Graphs

    Laplacian mixture models identify overlapping regions of influence in unlabeled graph and network data in a scalable and computationally efficient way, yielding useful low-dimensional representations. By combining Laplacian eigenspace and finite mixture modeling methods, they provide probabilistic or fuzzy dimensionality reductions or domain decompositions for a variety of input data types, including mixture distributions, feature vectors, and graphs or networks. Provably optimal recovery using the algorithm is shown analytically for a nontrivial class of cluster graphs. Heuristic approximations for scalable high-performance implementations are described and empirically tested. Connections to PageRank and community detection in network analysis demonstrate the wide applicability of this approach. The origins of fuzzy spectral methods, beginning with generalized heat or diffusion equations in physics, are reviewed and summarized. Comparisons to other dimensionality reduction and clustering methods for challenging unsupervised machine learning problems are also discussed. Comment: 13 figures, 35 references

    Speech Emotion Detection Using Machine Learning Techniques

    Communication is the key to expressing one's thoughts and ideas clearly. Among all forms of communication, speech is the most preferred and powerful among humans. The era of the Internet of Things (IoT) is rapidly making more intelligent systems available for everyday use. These applications range from simple wearables and widgets to complex self-driving vehicles and automated systems employed in various fields. Intelligent applications are interactive, require minimal user effort to function, and mostly operate on voice-based input. This creates the need for these computer applications to comprehend human speech completely. A speech percept can reveal information about the speaker, including gender, age, language, and emotion. Several existing speech recognition systems used in IoT applications are integrated with an emotion detection system in order to analyze the emotional state of the speaker. The performance of the emotion detection system can greatly influence the overall performance of the IoT application and can add many advantages to the functionality of these applications. This research presents a speech emotion detection system that improves on an existing system in terms of data, feature selection, and methodology, and that aims to classify speech percepts by emotion more accurately.

    Spectral Ranking and Unsupervised Feature Selection for Point, Collective and Contextual Anomaly Detection

    Anomaly detection problems can be classified into three categories: point anomaly detection, collective anomaly detection, and contextual anomaly detection. Many algorithms have been devised to address anomaly detection of a specific type in various application domains. Nevertheless, the exact type of anomalies to be detected in practice is generally unknown in an unsupervised setting, and most methods in the literature favor one kind of anomaly over the others. Applying an algorithm with an incorrect assumption is unlikely to produce reasonable results. This thesis therefore investigates the possibility of applying a uniform approach that can automatically discover different kinds of anomalies. Specifically, we are primarily interested in Spectral Ranking for Anomalies (SRA) for its potential to detect point anomalies and collective anomalies simultaneously. We show that the spectral optimization in SRA can be viewed as a relaxation of an unsupervised SVM problem under some assumptions. SRA thereby yields a bi-class classification strength measure that can be used to rank point anomalies, along with a normal vs. abnormal classification for identifying collective anomalies. However, for contextual anomaly problems in which different contexts are defined by different feature subsets, SRA and other popular methods are still not sufficient on their own. Accordingly, we propose an unsupervised backward elimination feature selection algorithm, BAHSIC-AD, which uses the Hilbert-Schmidt Independence Criterion (HSIC) to identify the data instances that appear as anomalies in the subset of features that have strong dependence on each other. Finally, we demonstrate the effectiveness of SRA combined with BAHSIC-AD by comparing their performance with other popular anomaly detection methods on a few benchmarks, including both synthetic and real-world datasets.
Our computational results show that, in practice, SRA combined with BAHSIC-AD can be a generally applicable method for detecting different kinds of anomalies.
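As a rough illustration of the dependence measure named in this abstract, the sketch below computes the standard biased empirical HSIC estimate, trace(K H L H) / (n - 1)^2, for two one-dimensional samples. It is not the BAHSIC-AD algorithm itself; the Gaussian kernel, the bandwidth `sigma`, and the function names `gaussian_gram`, `center`, and `hsic` are assumptions made for this example.

```python
import math


def gaussian_gram(xs, sigma=1.0):
    """Gram matrix of a Gaussian (RBF) kernel on 1-D samples (kernel choice is an assumption)."""
    return [[math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2)) for b in xs] for a in xs]


def center(K):
    """Double-center a square matrix, i.e. compute H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row_means = [sum(row) / n for row in K]
    col_means = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [
        [K[i][j] - row_means[i] - col_means[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(xs, ys, sigma=1.0):
    """Biased empirical HSIC: trace(K H L H) / (n - 1)^2."""
    n = len(xs)
    Kc = center(gaussian_gram(xs, sigma))
    Lc = center(gaussian_gram(ys, sigma))
    # trace(K H L H) equals the elementwise inner product of the two centered matrices.
    trace = sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n))
    return trace / (n - 1) ** 2


x = [0.0, 1.0, 2.0, 3.0, 4.0]
dependent = hsic(x, x)            # a variable compared with itself: positive
independent = hsic(x, [7.0] * 5)  # a constant carries no information about x: zero
```

A backward-elimination scheme of the kind the abstract describes would repeatedly score feature subsets with such a measure; the elimination loop is not shown because the abstract does not specify it.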

    A Flexible Outlier Detector Based on a Topology Given by Graph Communities

    CRUE-CSIC transformative agreement. Outlier detection is essential for optimal performance of machine learning methods and statistical predictive models. Detecting outliers is especially decisive in small-sample-size unbalanced problems, since in such settings outliers become highly influential and significantly bias models. These experimental settings are common in medical applications, such as diagnosis of rare pathologies, outcomes of experimental personalized treatments, or pandemic emergencies. In contrast to population-based methods, neighborhood-based local approaches, which compute an outlier score from the neighbors of each sample, are simple, flexible methods that have the potential to perform well in small-sample-size unbalanced problems. A main concern of local approaches is the impact that the computation of each sample's neighborhood has on the method's performance. Most approaches use a distance in the feature space to define a single neighborhood that requires careful selection of several parameters, such as the number of neighbors. This work presents a local approach based on a local measure of the heterogeneity of sample labels in the feature space, considered as a topological manifold. The topology is computed using the communities of a weighted graph encoding mutual nearest neighbors in the feature space. In this way, we provide a set of multiple neighborhoods able to describe the structure of complex spaces without parameter fine-tuning. Extensive experiments on real-world and synthetic data sets show that our approach outperforms both local and global strategies in multi-view and single-view settings.
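To make the notion of a mutual nearest-neighbor graph concrete, here is a minimal sketch that keeps an edge only when two points each list the other among their k nearest neighbors. It is not the authors' method: the fixed `k`, the squared Euclidean distance, the unweighted edges, and the function name `mutual_knn_edges` are assumptions for illustration, and the community-detection step is omitted.

```python
def mutual_knn_edges(points, k):
    """Return sorted edges (i, j), i < j, where i and j are in each other's k nearest neighbors."""
    n = len(points)

    def dist2(a, b):
        # Squared Euclidean distance between two coordinate tuples.
        return sum((u - v) ** 2 for u, v in zip(a, b))

    neighbors = []
    for i in range(n):
        # Rank all other points by distance to point i; ties are broken by index.
        order = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: (dist2(points[i], points[j]), j),
        )
        neighbors.append(set(order[:k]))

    return sorted(
        (i, j)
        for i in range(n)
        for j in neighbors[i]
        if i < j and i in neighbors[j]
    )


# Three nearby points and one far-away point (hypothetical data).
points = [(0.0,), (1.0,), (2.5,), (10.0,)]
edges_k1 = mutual_knn_edges(points, 1)
edges_k2 = mutual_knn_edges(points, 2)
```

With `k = 2`, the far-away point has nearest neighbors but none of them reciprocate, so it ends up with no edges. This is the kind of structural isolation a neighborhood-based outlier score can exploit.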

    Sampling and Subspace Methods for Learning Sparse Group Structures in Computer Vision

    The unprecedented growth of data in volume and dimension has led to an increased number of computationally demanding and data-driven decision-making methods in many disciplines, such as computer vision, genomics, and finance. Research on big data aims to understand and describe trends in massive volumes of high-dimensional data. High volume and dimension are the determining factors in both the computational and time complexity of algorithms. The challenge grows when the data are formed from the union of group structures of different dimensions embedded in a high-dimensional ambient space. To address the problem of high volume, we propose a sampling method referred to as the Sparse Withdrawal of Inliers in a First Trial (SWIFT), which determines the smallest sample size in one grab so that all group structures are adequately represented and discovered with high probability. The key features of SWIFT are: (i) sparsity, which is independent of the population size; (ii) no prior knowledge of the distribution of data or the number of underlying group structures; and (iii) robustness in the presence of an overwhelming number of outliers. We report a comprehensive study of the proposed sampling method in terms of accuracy, functionality, and effectiveness in reducing the computational cost in various applications of computer vision. In the second part of this dissertation, we study dimensionality reduction for multi-structural data. We propose a probabilistic subspace clustering method that unifies soft and hard clustering in a single framework. This is achieved by introducing a delayed association of uncertain points to subspaces of lower dimensions based on a confidence measure. Delayed association yields higher accuracy in clustering subspaces that have ambiguities, e.g., due to intersections and high levels of outliers or noise, and hence leads to more accurate self-representation of the underlying subspaces.
Altogether, this dissertation addresses the key theoretical and practical issues of size and dimension in big data analysis.

    GBMST: An Efficient Minimum Spanning Tree Clustering Based on Granular-Ball Computing

    Most existing clustering methods are based on a single granularity of information, such as the distance and density of each data point. This finest-grained approach is usually inefficient and susceptible to noise. Therefore, we propose a clustering algorithm that combines multi-granularity granular-balls and a minimum spanning tree (MST). We construct coarse-grained granular-balls, and then use the granular-balls and the MST to implement a clustering method based on "large-scale priority", which can greatly reduce the influence of outliers and accelerate the construction of the MST. Experimental results on several data sets demonstrate the power of the algorithm. All code has been released at https://github.com/xjnine/GBMST
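For readers unfamiliar with MST-based clustering, the following sketch shows only the generic idea: build a minimum spanning tree with Kruskal's algorithm and stop before adding the k - 1 longest tree edges, which leaves k connected components. It operates on raw points and is not GBMST; the granular-ball construction is omitted, and the function name `mst_clusters` and the sample data are assumptions for this example.

```python
import math
from itertools import combinations


def mst_clusters(points, k):
    """Cluster points into k groups by running Kruskal's algorithm for n - k merges."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, shortest first (O(n^2) edges; fine for a small illustration).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    merges = 0
    for _, i, j in edges:
        if merges == n - k:
            break  # the remaining MST edges are the k - 1 longest ones; leave them out
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            merges += 1

    # Relabel components as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11)]
labels = mst_clusters(points, 2)
```

The quadratic edge list is exactly the cost that coarser units such as granular-balls are meant to reduce, since the tree is then built over far fewer nodes than there are data points.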

    Uncovering the genetic architecture and metabolic basis of amino acid composition in maize kernels using multi-omics integration

    Seeds are a major source of protein in human and livestock diets. Cereal grains are some of the seeds most consumed by both humans and livestock worldwide, with maize, wheat, and rice alone accounting for ~70% of total cereal production. Maize is one of the major staple crops used for food, feed, and fuel. A mature maize kernel contains a small embryo (10% of its volume) and a large endosperm (~90% of its volume). In terms of composition, the kernel is mostly starch (around 90%), with around 8-10% protein. Nine of the twenty amino acids cannot be synthesized by monogastric animals, including humans, and must be obtained through the diet; these are considered essential amino acids (EAA): lysine, isoleucine, leucine, histidine, methionine, phenylalanine, threonine, tryptophan, and valine. Protein quality is poor in maize endosperm because the primary storage proteins are severely deficient in EAA such as lysine, tryptophan, and methionine. Such deficiencies can be detrimental, since corn provides an important source of protein for food in developing countries and for feed in developed countries such as the U.S. Failure to consume sufficient levels of EAA per day leads to severe malnutrition, even if one's calorie requirements are met. Many attempts to increase EAA levels have demonstrated only limited success, since seeds can rebalance their amino acid composition even when major changes are introduced in their proteome. One possible approach to solving this applied problem is seed EAA biofortification; however, many attempts at this task fall short, which strongly indicates that even though we know most of the metabolic pathways of amino acids, we know very little about their regulation, especially in seeds. Therefore, the first step towards efficient amino acid biofortification is to increase our fundamental understanding of their function, as well as the metabolic regulation and the biology of plant seeds.
Despite tight regulation within any given genotype, seed amino acid composition displays extensive natural variation, which can be used to uncover the genetic basis of and identify new targets for seed amino acid biofortification. Hence, to uncover the genetic architecture of amino acid composition in maize kernels, we used the Goodman-Buckler maize association panel, which consists of 282 diverse maize inbred lines including stiff stalk, non-stiff stalk, tropical and subtropical, sweetcorn, and popcorn lines. I performed a genome-wide association study (GWAS) on both the protein-bound amino acids (PBAA) and the free amino acids (FAA). Although GWAS is widely used to dissect the genetic architecture of complex traits, it often outputs an extensive list of genes, particularly when multiple phenotypic traits are used. To overcome this, I used an integrative multi-omics approach that combines GWAS with co-expression network modules obtained from ten seed-filling stages of B73 to uncover novel key regulatory genes, characterize biological processes, and prioritize the candidate genes involved in shaping the natural variation of amino acid composition. Chapter one of the dissertation is the general introduction and literature review on seed amino acids. It briefly introduces PBAA and FAA, reviews previous attempts to improve seed PBAA and FAA composition, describes how natural variation is used to uncover the genetic architecture of complex traits, including metabolic traits such as amino acids, and finally discusses multi-omics integration for uncovering the genetic basis of complex traits. Chapter two elaborates on the comprehensive genetic basis of PBAA in maize kernels using an integrative analysis of 76 PBAA GWAS with protein co-expression network modules.
Previous studies have shown that manipulation of storage proteins and amino acid pathway genes has contributed to the improvement of quality protein maize; however, my study strongly suggests that, in addition to the manipulation of storage protein and amino acid metabolic genes, specific ribosomal genes along with other translation machinery could be a novel target for seed amino acid biofortification. Chapter three discusses the genetic basis of FAA in maize kernels using an integrative analysis of 109 FAA GWAS with protein co-expression network modules. I present a comprehensive list of SNPs, as well as the candidate genes and several biological processes, including the translational machinery, responsible for shaping the genetic architecture of FAA in seeds. Chapter four includes the conclusions and future work. Maize is an important crop used for both food and feed and possesses great genotypic and phenotypic diversity. The results from my study have validated several previously characterized genes and identified novel key genes that regulate and shape PBAA and FAA in maize kernels, which could be targeted further for amino acid biofortification. Includes bibliographical references.

    Conditional Anomaly Detection with Soft Harmonic Functions

    International audience. In this paper, we consider the problem of conditional anomaly detection, which aims to identify data instances with an unusual response or class label. We develop a new non-parametric approach for conditional anomaly detection based on the soft harmonic solution, with which we estimate the confidence of the label to detect anomalous mislabeling. We further regularize the solution to avoid the detection of isolated examples and examples on the boundary of the distribution support. We demonstrate the efficacy of the proposed method on several synthetic and UCI ML datasets in detecting unusual labels when compared to several baseline approaches. We also evaluate the performance of our method on a real-world electronic health record dataset, where we seek to identify unusual patient-management decisions.