19,349 research outputs found

    Memory-Efficient Topic Modeling

    Full text link
    As one of the simplest probabilistic topic modeling techniques, latent Dirichlet allocation (LDA) has found many important applications in text mining, computer vision and computational biology. Recent training algorithms for LDA can be interpreted within a unified message passing framework. However, message passing requires storing previous messages with a large amount of memory space, increasing linearly with the number of documents or the number of topics. Therefore, the high memory usage is often a major problem for topic modeling of massive corpora containing a large number of topics. To reduce the space complexity, we propose a novel algorithm without storing previous messages for training LDA: tiny belief propagation (TBP). The basic idea of TBP relates the message passing algorithms with the non-negative matrix factorization (NMF) algorithms, which absorb the message updating into the message passing process, and thus avoid storing previous messages. Experimental results on four large data sets confirm that TBP performs comparably well or even better than current state-of-the-art training algorithms for LDA but with a much less memory consumption. TBP can do topic modeling when massive corpora cannot fit in the computer memory, for example, extracting thematic topics from 7 GB PUBMED corpora on a common desktop computer with 2GB memory.Comment: 20 pages, 7 figure

    DDSL: Efficient Subgraph Listing on Distributed and Dynamic Graphs

    Full text link
    Subgraph listing is a fundamental problem in graph theory and has wide applications in areas like sociology, chemistry, and social networks. Modern graphs can usually be large-scale as well as highly dynamic, which challenges the efficiency of existing subgraph listing algorithms. Recent works have shown the benefits of partitioning and processing big graphs in a distributed system, however, there is only few work targets subgraph listing on dynamic graphs in a distributed environment. In this paper, we propose an efficient approach, called Distributed and Dynamic Subgraph Listing (DDSL), which can incrementally update the results instead of running from scratch. DDSL follows a general distributed join framework. In this framework, we use a Neighbor-Preserved storage for data graphs, which takes bounded extra space and supports dynamic updating. After that, we propose a comprehensive cost model to estimate the I/O cost of listing subgraphs. Then based on this cost model, we develop an algorithm to find the optimal join tree for a given pattern. To handle dynamic graphs, we propose an efficient left-deep join algorithm to incrementally update the join results. Extensive experiments are conducted on real-world datasets. The results show that DDSL outperforms existing methods in dealing with both static dynamic graphs in terms of the responding time

    DualTable: A Hybrid Storage Model for Update Optimization in Hive

    Full text link
    Hive is the most mature and prevalent data warehouse tool providing SQL-like interface in the Hadoop ecosystem. It is successfully used in many Internet companies and shows its value for big data processing in traditional industries. However, enterprise big data processing systems as in Smart Grid applications usually require complicated business logics and involve many data manipulation operations like updates and deletes. Hive cannot offer sufficient support for these while preserving high query performance. Hive using the Hadoop Distributed File System (HDFS) for storage cannot implement data manipulation efficiently and Hive on HBase suffers from poor query performance even though it can support faster data manipulation.There is a project based on Hive issue Hive-5317 to support update operations, but it has not been finished in Hive's latest version. Since this ACID compliant extension adopts same data storage format on HDFS, the update performance problem is not solved. In this paper, we propose a hybrid storage model called DualTable, which combines the efficient streaming reads of HDFS and the random write capability of HBase. Hive on DualTable provides better data manipulation support and preserves query performance at the same time. Experiments on a TPC-H data set and on a real smart grid data set show that Hive on DualTable is up to 10 times faster than Hive when executing update and delete operations.Comment: accepted by industry session of ICDE201

    Fast Ensemble Smoothing

    Full text link
    Smoothing is essential to many oceanographic, meteorological and hydrological applications. The interval smoothing problem updates all desired states within a time interval using all available observations. The fixed-lag smoothing problem updates only a fixed number of states prior to the observation at current time. The fixed-lag smoothing problem is, in general, thought to be computationally faster than a fixed-interval smoother, and can be an appropriate approximation for long interval-smoothing problems. In this paper, we use an ensemble-based approach to fixed-interval and fixed-lag smoothing, and synthesize two algorithms. The first algorithm produces a linear time solution to the interval smoothing problem with a fixed factor, and the second one produces a fixed-lag solution that is independent of the lag length. Identical-twin experiments conducted with the Lorenz-95 model show that for lag lengths approximately equal to the error doubling time, or for long intervals the proposed methods can provide significant computational savings. These results suggest that ensemble methods yield both fixed-interval and fixed-lag smoothing solutions that cost little additional effort over filtering and model propagation, in the sense that in practical ensemble application the additional increment is a small fraction of either filtering or model propagation costs. We also show that fixed-interval smoothing can perform as fast as fixed-lag smoothing and may be advantageous when memory is not an issue

    Efficient Batch Update of Unique Identifiers in a Distributed Hash Table for Resources in a Mobile Host

    Full text link
    Resources in a distributed system can be identified using identifiers based on random numbers. When using a distributed hash table to resolve such identifiers to network locations, the straightforward approach is to store the network location directly in the hash table entry associated with an identifier. When a mobile host contains a large number of resources, this requires that all of the associated hash table entries must be updated when its network address changes. We propose an alternative approach where we store a host identifier in the entry associated with a resource identifier and the actual network address of the host in a separate host entry. This can drastically reduce the time required for updating the distributed hash table when a mobile host changes its network address. We also investigate under which circumstances our approach should or should not be used. We evaluate and confirm the usefulness of our approach with experiments run on top of OpenDHT.Comment: To be presented at the 2010 International Workshop on Cloud Computing, Applications and Technologie

    Variational Hamiltonian Monte Carlo via Score Matching

    Full text link
    Traditionally, the field of computational Bayesian statistics has been divided into two main subfields: variational methods and Markov chain Monte Carlo (MCMC). In recent years, however, several methods have been proposed based on combining variational Bayesian inference and MCMC simulation in order to improve their overall accuracy and computational efficiency. This marriage of fast evaluation and flexible approximation provides a promising means of designing scalable Bayesian inference methods. In this paper, we explore the possibility of incorporating variational approximation into a state-of-the-art MCMC method, Hamiltonian Monte Carlo (HMC), to reduce the required gradient computation in the simulation of Hamiltonian flow, which is the bottleneck for many applications of HMC in big data problems. To this end, we use a {\it free-form} approximation induced by a fast and flexible surrogate function based on single-hidden layer feedforward neural networks. The surrogate provides sufficiently accurate approximation while allowing for fast exploration of parameter space, resulting in an efficient approximate inference algorithm. We demonstrate the advantages of our method on both synthetic and real data problems
    • …
    corecore