244 research outputs found

    Towards distributed node similarity search on graphs


    Large Scale Nearest Neighbor Search - Theories, Algorithms, and Applications

    We are witnessing a data explosion era, in which huge data sets of billions or more samples represented by high-dimensional feature vectors can be easily found on the Web, in enterprise data centers, in surveillance sensor systems, and so on. On these large scale data sets, nearest neighbor search is fundamental for many applications including content based search/retrieval, recommendation, clustering, graph and social network research, as well as many other machine learning and data mining problems. Exhaustive search is the simplest and most straightforward way to perform nearest neighbor search, but it cannot scale to data sets of the sizes mentioned above. To make large scale nearest neighbor search practical, we need the online search step to be sublinear in the database size, which means offline indexing is necessary. Moreover, to achieve sublinear search time, we usually need to sacrifice some search accuracy, and hence we can often obtain only approximate nearest neighbors instead of exact nearest neighbors. In other words, by large scale nearest neighbor search, we refer to approximate nearest neighbor search methods with sublinear online search time via offline indexing. To some extent, indexing a vector dataset for (sublinear time) approximate search can be achieved by partitioning the feature space into different regions, and mapping each point to its closest regions. There are different kinds of partition structures, for example, tree based partition, hashing based partition, clustering/quantization based partition, etc. From the viewpoint of how the data partition function is generated, the partition methods can be grouped into two main categories: 1. data independent (random) partition such as locality sensitive hashing, randomized trees/forests methods, etc.; 2. data dependent (optimized) partition, such as compact hashing, quantization based indexing methods, and some tree based methods like kd-tree, PCA tree, etc. 
With the offline indexing/partitioning, online approximate nearest neighbor search usually consists of three steps: locate the query region that the query point falls in, obtain candidates, which are the database points in the regions near the query region, and rerank/return candidates. For large scale nearest neighbor search, the key question is: how to design the optimal offline indexing, such that the online search performance is the best, or more specifically, the online search is as fast as possible while meeting a required accuracy? In this thesis, we have studied theories, algorithms, systems and applications for (approximate) nearest neighbor search on large scale data sets, for both indexing with random partition and indexing with learning based partition. Our specific main contributions are: 1. We unify various nearest neighbor search methods into the data partition framework, and provide a general formulation of optimal data partition, which supports the fastest search speed while satisfying a required search accuracy. The formulation is general, and can be used to explain most existing (sublinear) large scale approximate nearest neighbor search methods. 2. For indexing with data-independent partitions, we have developed theories on their lower and upper bounds of time and space complexity, based on the optimal data partition formulation. The bounds are applicable to a general group of methods called Nearest Neighbor Preferred Hashing and Nearest Neighbor Preferred Partition, including locality sensitive hashing, random forest, and many other random hashing methods. Moreover, we also extend the theory to study how to choose the parameters for indexing methods with random partitions. 3. For indexing with data-dependent partitions, we have applied the same formulation to develop a joint optimization approach with two important criteria: nearest neighbor preserving and region size balancing. 
We have applied the joint optimization to different partition structures such as hashing and clustering, and obtained several new nearest neighbor search methods that outperform, or are at least comparable to, state-of-the-art solutions for large scale nearest neighbor search. 4. We have further studied fundamental problems for nearest neighbor search beyond search methods, for example, what is the difficulty of nearest neighbor search on a given data set (independent of search methods)? What data properties affect the difficulty and how? How will the theoretical analysis and algorithm design of the large scale nearest neighbor search problem be affected by the data set difficulty? 5. Finally, we have applied our nearest neighbor search methods to practical applications. We focus on the development of large visual search engines using new indexing methods developed in this thesis. The techniques can be applied to other domains with data-intensive applications and, moreover, be extended to other applications beyond visual search engines, such as large scale machine learning, data mining, and social network analysis.
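
The three online steps described in the abstract above (locate the query region, gather candidates from nearby regions, rerank) can be illustrated with a minimal sketch. It assumes a data-independent partition built from random hyperplanes, one common form of locality sensitive hashing; it is not the thesis's own method, and every name in it (`make_hyperplanes`, `region_of`, `build_index`, `search`) is hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def make_hyperplanes(dim, n_bits, seed=0):
    # Data-independent partition: each random hyperplane splits space in two.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def region_of(point, planes):
    # A region is identified by the sign pattern of the point against all planes.
    return tuple(1 if dot(point, p) >= 0 else 0 for p in planes)


def build_index(points, planes):
    # Offline step: map every database point to the region it falls in.
    index = defaultdict(list)
    for i, pt in enumerate(points):
        index[region_of(pt, planes)].append(i)
    return index


def nearby_regions(code):
    # The query region itself plus all regions that differ in exactly one bit.
    yield code
    for i in range(len(code)):
        yield code[:i] + (1 - code[i],) + code[i + 1:]


def search(query, points, planes, index, k=1):
    code = region_of(query, planes)                 # step 1: locate query region
    candidates = []
    for c in nearby_regions(code):                  # step 2: gather candidates
        candidates.extend(index.get(c, []))
    candidates.sort(key=lambda i: sq_dist(query, points[i]))  # step 3: rerank
    return candidates[:k]
```

With more hyperplanes the regions become smaller, so fewer candidates are reranked per query but true neighbors are more likely to be missed; this is the speed versus accuracy trade-off the abstract refers to.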

    Constrained Shortest Path Computation


    A Review of Classification Problems and Algorithms in Renewable Energy Applications

    Classification problems and the approaches used to solve them constitute one of the fields of machine learning. The application of classification schemes in Renewable Energy (RE) has gained significant attention in the last few years, contributing to the deployment, management and optimization of RE systems. The main objective of this paper is to review the most important classification algorithms applied to RE problems, including both classical and novel algorithms. The paper also provides a comprehensive literature review and discussion of different classification techniques in specific RE problems, including wind speed/power prediction, fault diagnosis in RE systems, power quality disturbance classification and other applications in alternative RE systems. In this way, the paper describes classification techniques and metrics applied to RE problems, making it useful both for researchers dealing with this kind of problem and for practitioners in the field.

    Trust based attachment

    In social systems subject to indirect reciprocity, a positive reputation is key for increasing one's likelihood of future positive interactions. The flow of gossip can amplify the impact of a person's actions on their reputation depending on how widely it spreads across the social network, which leads to a percolation problem. To quantify this notion, we calculate the expected number of individuals, the "audience", who find out about a particular interaction. For a potential donor, a larger audience constitutes higher reputational stakes, and thus a higher incentive, to perform "good" actions in line with current social norms. For a receiver, a larger audience therefore increases the trust that the partner will be cooperative. This idea can be used for an algorithm that generates social networks, which we call trust based attachment (TBA). TBA produces graphs that share crucial quantitative properties with real-world networks, such as high clustering, small-world behavior, and power law degree distributions. We also show that TBA can be approximated by simple friend-of-friend routines based on triadic closure, which are known to be highly effective at generating realistic social network structures. Therefore, our work provides a new justification for triadic closure in social contexts based on notions of trust, gossip, and social information spread. These factors are thus identified as potentially significant influences on how humans form social ties.
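
As a rough illustration of the "friend-of-friend routines based on triadic closure" mentioned in the abstract above, the sketch below grows a graph in which each new node links to a random existing node and then, with some probability, to each of that node's friends. This is a generic textbook-style growth rule, not the paper's TBA algorithm; the function names and the parameter `p_close` are assumptions made for this example.

```python
import random


def grow_friend_of_friend(n_nodes, p_close=0.5, seed=0):
    # Start from a single edge between nodes 0 and 1.
    rng = random.Random(seed)
    adj = {0: {1}, 1: {0}}
    for new in range(2, n_nodes):
        anchor = rng.randrange(new)  # random existing node
        # Triadic closure: each friend of the anchor is also befriended
        # with probability p_close, which closes a triangle.
        friends = [f for f in sorted(adj[anchor]) if rng.random() < p_close]
        adj[new] = set()
        for other in [anchor] + friends:
            adj[new].add(other)
            adj[other].add(new)
    return adj


def count_triangles(adj):
    # Count each triangle once by requiring u < v < w.
    total = 0
    for u in adj:
        for v in adj[u]:
            if v > u:
                total += sum(1 for w in adj[u] & adj[v] if w > v)
    return total
```

Setting `p_close=0.0` yields a tree with no triangles, while larger values close more triangles and so raise clustering, which is the qualitative effect the abstract attributes to triadic closure.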

    Infant Cry Signal Processing, Analysis, and Classification with Artificial Neural Networks

    As a special type of speech and environmental sound, infant cry has been a growing research area over the past two decades, covering infant cry reason classification, pathological infant cry identification, and infant cry detection. In this dissertation, we build a new dataset, explore new feature extraction methods, and propose novel classification approaches to improve infant cry classification accuracy and identify diseases by learning from infant cry signals. We propose a method that generates weighted prosodic features combined with acoustic features for a deep learning model to improve the performance of asphyxiated infant cry identification. The combined feature matrix captures the diversity of variations within infant cries, and the result outperforms all other related studies on asphyxiated baby crying classification. We propose a fast, non-invasive method that uses infant cry signals with convolutional neural network (CNN) based age classification to diagnose abnormal infant vocal tract development as early as 4 months of age. Experiments reveal the pattern and tendency of the vocal tract changes and predict vocal tract abnormality by classifying the cry signals into a younger age category. We propose an approach that generates a hybrid feature set and uses prior knowledge in a multi-stage CNN model for robust infant sound classification. The dominant and auxiliary features within the set help enlarge the coverage while keeping a good resolution for modeling the diversity of variations within infant sound, and the experimental results show encouraging improvements on two related databases. We propose a graph convolutional network (GCN) approach with transfer learning for robust infant cry reason classification. 
Non-fully connected graphs based on the similarities among the relevant nodes are built to consider the short-term and long-term effects of infant cry signals related to inner-class and inter-class messages. With as little as 20% labeled training data, our model outperforms the CNN model trained with 80% labeled training data in both supervised and semi-supervised settings. Lastly, we apply mel-spectrogram decomposition to infant cry classification and propose a fusion method to further improve infant cry classification performance.