3,373 research outputs found

    Large Scale Spectral Clustering Using Approximate Commute Time Embedding

    Spectral clustering is a novel clustering method that can detect complex shapes of data clusters. However, it requires the eigendecomposition of the graph Laplacian matrix, which costs $O(n^3)$ time and is thus not suitable for large-scale systems. Recently, many methods have been proposed to reduce the computation time of spectral clustering. These approximate methods usually involve sampling techniques, by which much of the information in the original data may be lost. In this work, we propose a fast and accurate spectral clustering approach using an approximate commute time embedding, which is similar to the spectral embedding. The method requires neither a sampling technique nor the computation of any eigenvector. Instead, it uses random projection and a linear time solver to find the approximate embedding. Experiments on several synthetic and real datasets show that the proposed approach has better clustering quality and is faster than the state-of-the-art approximate spectral clustering methods.
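
The commute time named in this abstract has a standard closed form, given here as background rather than as the paper's exact formulation. The symbols are assumptions not stated in the listing: $L = D - A$ is the graph Laplacian, $L^{+}$ its Moore-Penrose pseudoinverse, $V_G = \sum_i d_i$ the graph volume, and $e_i$ the $i$-th standard basis vector.

```latex
c_{ij} = V_G \,(e_i - e_j)^{\top} L^{+} (e_i - e_j)
       = \bigl\lVert \sqrt{V_G}\,(L^{+})^{1/2}(e_i - e_j) \bigr\rVert_2^{2}
```

Because $L^{+}$ is positive semidefinite, commute times are squared Euclidean distances between the columns of $\sqrt{V_G}\,(L^{+})^{1/2}$, so those columns can serve as an embedding for a distance-based clusterer. A random projection can then reduce the embedding dimension while approximately preserving pairwise distances, which is presumably the role random projection plays in the method described above.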

    Robust Mobile Visual Recognition System: From Bag of Visual Words to Deep Learning

    With billions of images captured by mobile users every day, automatically recognizing the contents of such images has become a particularly important feature for various mobile apps, including augmented reality, product search, visual-based authentication, etc. Traditionally, a client-server architecture is adopted such that the mobile client sends captured images/video frames to a cloud server, which runs a set of task-specific computer vision algorithms and sends back the recognition results. However, such a scheme may cause problems related to user privacy, network stability/availability and device energy. In this dissertation, we investigate the problem of building a robust mobile visual recognition system that achieves high accuracy, low latency, low energy cost and privacy protection. Generally, we study two broad types of recognition methods: the bag of visual words (BOVW) based retrieval methods, which search for the nearest neighbor image to a query image, and the state-of-the-art deep learning based methods, which recognize a given image using a trained deep neural network. The challenges of deploying BOVW based retrieval methods include: size of the indexed image database, query latency, feature extraction efficiency and re-ranking performance. To address such challenges, we first proposed EMOD, which enables efficient on-device image retrieval on a downloaded context-dependent partial image database. The efficiency is achieved by analyzing the BOVW processing pipeline and optimizing each module with algorithmic improvements. Recent deep learning based recognition approaches have been shown to greatly exceed the performance of traditional approaches. We identify several challenges of applying deep learning based recognition methods in mobile scenarios, namely energy efficiency and privacy protection for real-time visual processing, and mobile visual domain biases. 
Thus, we proposed two techniques to address them: (i) efficiently splitting the workload across heterogeneous computing resources, i.e., mobile devices and the cloud, using our Moca framework, and (ii) using mobile visual domain adaptation as proposed in our collaborative edge-mediated platform DeepCham. Our extensive experiments on large-scale benchmark datasets and off-the-shelf mobile devices show that our solutions provide better results than the state-of-the-art solutions.

    Fast Machine Learning Algorithms for Massive Datasets with Applications in the Biomedical Domain

    The continuous increase in the size of datasets introduces computational challenges for machine learning algorithms. In this dissertation, we cover machine learning algorithms and applications in large-scale data analysis in manufacturing and healthcare. We begin by introducing a multilevel framework to scale the support vector machine (SVM), a popular supervised learning algorithm with a few tunable hyperparameters and highly accurate prediction. The computational complexity of nonlinear SVM is prohibitive on large-scale datasets compared to the linear SVM, which is more scalable for massive datasets. The nonlinear SVM has been shown to produce significantly higher classification quality on complex and highly imbalanced datasets. However, a higher classification quality requires a computationally expensive quadratic programming solver and extra kernel parameters for model selection. We introduce a generalized fast multilevel framework for regular, weighted, and instance weighted SVM that achieves similar or better classification quality compared to state-of-the-art SVM libraries such as LIBSVM. Our framework improves the runtime by more than two orders of magnitude for some of the well-known benchmark datasets. We cover multiple versions of our proposed framework and its implementation in detail. The framework is implemented using the PETSc library, which allows easy integration with scientific computing tasks. Next, we propose an adaptive multilevel learning framework for SVM to reduce the variance between prediction qualities across the levels, improve the overall prediction accuracy, and reduce the runtime. We implement multi-threaded support to speed up the parameter fitting, which results in more than an order of magnitude speed-up. We design an early stopping criterion to avoid extra computational cost once the expected prediction quality is achieved. This approach provides significant speed-up, especially for massive datasets. 
Finally, we propose an efficient low-dimensional feature extraction over massive knowledge networks. Knowledge networks are becoming more popular in the biomedical domain for knowledge representation. Each layer in a knowledge network can store the information from one or multiple sources of data. The relationships between concepts or between layers represent valuable information. The proposed feature engineering approach provides an efficient and highly accurate prediction of the relationship between biomedical concepts on massive datasets. Our proposed approach utilizes semantics and probabilities to reduce the potential search space for the exploration and learning of machine learning algorithms. The calculation of probabilities is highly scalable with the size of the knowledge network. The number of features is fixed and equivalent to the number of relationships or classes in the data. A comprehensive comparison of well-known classifiers such as random forest, SVM, and deep learning over various features extracted from the same dataset provides an overview of the performance and computational trade-offs. Our source code, documentation and parameters will be available at https://github.com/esadr/

    Scalable and interpretable product recommendations via overlapping co-clustering

    We consider the problem of generating interpretable recommendations by identifying overlapping co-clusters of clients and products, based only on positive or implicit feedback. Our approach is applicable to very large datasets because it exhibits almost linear complexity in the input examples and the number of co-clusters. We show, both on real industrial data and on publicly available datasets, that the recommendation accuracy of our algorithm is competitive with that of state-of-the-art matrix factorization techniques. In addition, our technique has the advantage of offering recommendations that are textually and visually interpretable. Finally, we examine how to implement our technique efficiently on Graphical Processing Units (GPUs). Comment: In IEEE International Conference on Data Engineering (ICDE) 201

    How To Model Supernovae in Simulations of Star and Galaxy Formation

    We study the implementation of mechanical feedback from supernovae (SNe) and stellar mass loss in galaxy simulations, within the Feedback In Realistic Environments (FIRE) project. We present the FIRE-2 algorithm for coupling mechanical feedback, which can be applied to any hydrodynamics method (e.g. fixed-grid, moving-mesh, and mesh-less methods), and black hole as well as stellar feedback. This algorithm ensures manifest conservation of mass, energy, and momentum, and avoids imprinting 'preferred directions' on the ejecta. We show that it is critical to incorporate both momentum and thermal energy of mechanical ejecta in a self-consistent manner, accounting for SNe cooling radii when they are not resolved. Using idealized simulations of single SN explosions, we show that the FIRE-2 algorithm, independent of resolution, reproduces converged solutions in both energy and momentum. In contrast, common 'fully-thermal' (energy-dump) or 'fully-kinetic' (particle-kicking) schemes in the literature depend strongly on resolution: when applied at mass resolution >100 solar masses, they diverge by orders-of-magnitude from the converged solution. In galaxy-formation simulations, this divergence leads to orders-of-magnitude differences in galaxy properties, unless those models are adjusted in a resolution-dependent way. We show that all models that individually time-resolve SNe converge to the FIRE-2 solution at sufficiently high resolution. However, in both idealized single-SN simulations and cosmological galaxy-formation simulations, the FIRE-2 algorithm converges much faster than other sub-grid models without re-tuning parameters. Comment: 18 pages, 9 figures (+8 pages, 6 figures in appendices). MNRAS (updated to match published version

    Gossip Algorithms for Distributed Signal Processing

    Gossip algorithms are attractive for in-network processing in sensor networks because they do not require any specialized routing, there is no bottleneck or single point of failure, and they are robust to unreliable wireless network conditions. Recently, there has been a surge of activity in the computer science, control, signal processing, and information theory communities, developing faster and more robust gossip algorithms and deriving theoretical performance guarantees. This article presents an overview of recent work in the area. We describe convergence rate results, which are related to the number of transmitted messages and thus the amount of energy consumed in the network for gossiping. We discuss issues related to gossiping over wireless links, including the effects of quantization and noise, and we illustrate the use of gossip algorithms for canonical signal processing tasks including distributed estimation, source localization, and compression. Comment: Submitted to Proceedings of the IEEE, 29 page
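
To make the idea of gossiping concrete, here is a minimal sketch of randomized pairwise averaging, one common gossip scheme for distributed averaging. It is an illustration under stated assumptions, not code from the article: the function name `pairwise_gossip`, the 5-node ring topology, and the sensor readings are all hypothetical.

```python
import random


def pairwise_gossip(values, edges, rounds, seed=0):
    """Randomized pairwise gossip (illustrative sketch).

    Each round, one edge is picked uniformly at random and its two
    endpoints replace their values with their mutual average. No routing
    and no central node are involved; the sum of all values is preserved.
    """
    rng = random.Random(seed)
    x = list(values)
    for _ in range(rounds):
        i, j = rng.choice(edges)      # one message exchange per round
        avg = (x[i] + x[j]) / 2.0
        x[i] = x[j] = avg
    return x


# Hypothetical example: five sensors on a ring, each holding one reading.
readings = [10.0, 20.0, 30.0, 40.0, 50.0]
ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]
estimates = pairwise_gossip(readings, ring, rounds=500)
```

Each round corresponds to one pairwise exchange, so the number of rounds needed for all estimates to approach the network average is a proxy for the message count, and hence the energy cost, that the convergence rate results above are concerned with.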

    Improving the performance of web service recommenders using semantic similarity

    This paper addresses issues related to recommending Semantic Web Services (SWS) using collaborative filtering (CF). The focus is on reducing the problems arising from data sparsity, one of the main difficulties for CF algorithms. Two CF algorithms are presented and discussed: a memory-based algorithm, using the k-NN method, and a model-based algorithm, using the k-means method. In both algorithms, similarity between users is computed using the Pearson Correlation Coefficient (PCC). One of the limitations of using the PCC in this context is that in those instances where users have not rated items in common it is not possible to compute their similarity. In addition, when the number of common items that were rated is low, the reliability of the computed similarity degree may also be low. To overcome these limitations, the presented algorithms compute the similarity between two users taking into account services that both users accessed and also semantically similar services. Likewise, to predict the rating for a not yet accessed target service, the algorithms consider the ratings that neighbor users assigned to the target service, as is normally the case, while also considering the ratings assigned to services that are semantically similar to the target service. The experiments described in the paper show that this approach has a significantly positive impact on prediction accuracy, particularly when the user-item matrix is sparse. Facultad de Informátic
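
The PCC limitation described in this abstract can be seen in a small sketch of the baseline similarity alone; the paper's semantic extension is not reproduced here. The function name `pearson`, the `min_common` threshold, the choice to take means over co-rated services only, and the sample ratings are all assumptions made for illustration.

```python
from math import sqrt


def pearson(ratings_u, ratings_v, min_common=2):
    """Pearson correlation between two users over co-rated services.

    Returns None when the users share fewer than `min_common` rated
    services, or when either user's co-ratings have zero variance,
    because the coefficient is undefined in those cases.
    """
    common = sorted(set(ratings_u) & set(ratings_v))
    if len(common) < min_common:
        return None
    # Assumption: means are taken over the co-rated services only.
    mean_u = sum(ratings_u[s] for s in common) / len(common)
    mean_v = sum(ratings_v[s] for s in common) / len(common)
    du = [ratings_u[s] - mean_u for s in common]
    dv = [ratings_v[s] - mean_v for s in common]
    denom = sqrt(sum(a * a for a in du)) * sqrt(sum(b * b for b in dv))
    if denom == 0:
        return None
    return sum(a * b for a, b in zip(du, dv)) / denom


# Hypothetical ratings: service id -> rating.
alice = {"s1": 5, "s2": 3, "s3": 4}
bob = {"s1": 4, "s2": 2, "s3": 3}
carol = {"s4": 5}

sim_ab = pearson(alice, bob)    # three co-rated services, similarity defined
sim_ac = pearson(alice, carol)  # no co-rated service, similarity undefined
```

In a sparse user-item matrix most user pairs look like the second case, which is the gap the abstract says is addressed by also counting semantically similar services.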