Memory-Efficient Topic Modeling
As one of the simplest probabilistic topic modeling techniques, latent
Dirichlet allocation (LDA) has found many important applications in text
mining, computer vision and computational biology. Recent training algorithms
for LDA can be interpreted within a unified message passing framework. However,
message passing requires storing previous messages with a large amount of
memory space, increasing linearly with the number of documents or the number of
topics. Therefore, this high memory usage is often a major problem for topic
modeling of massive corpora containing a large number of topics. To reduce the
space complexity, we propose tiny belief propagation (TBP), a novel algorithm
for training LDA without storing previous messages. The basic idea of TBP is to
relate the message passing algorithms to the non-negative matrix factorization
(NMF) algorithms, which absorb the message update into the message passing
process and thus avoid storing previous messages. Experimental results on four
large data sets confirm that TBP performs comparably to, or even better than,
current state-of-the-art training algorithms for LDA, but with much lower
memory consumption. TBP can perform topic modeling even when massive corpora
cannot fit in the computer memory, for example, extracting thematic topics from
a 7 GB PUBMED corpus on a common desktop computer with 2 GB of memory.
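A minimal sketch of the message-passing-to-NMF connection (an illustration, not the authors' TBP implementation): the multiplicative NMF updates below keep only the two factor matrices in memory, with no per-token message table. The Lee-Seung KL-divergence update rules and all names here are assumptions made for this sketch.

import numpy as np

def tbp_like_nmf(X, n_topics=10, n_iters=100, eps=1e-10, seed=0):
    # X: word-document count matrix (V x D), e.g. from a bag-of-words corpus.
    rng = np.random.default_rng(seed)
    V, D = X.shape
    W = rng.random((V, n_topics))    # topic-word factors
    H = rng.random((n_topics, D))    # document-topic factors
    for _ in range(n_iters):
        # Multiplicative updates for KL-divergence NMF; only W and H
        # persist across iterations -- no stored messages.
        WH = W @ H + eps
        W *= ((X / WH) @ H.T) / (H.sum(axis=1) + eps)
        WH = W @ H + eps
        H *= (W.T @ (X / WH)) / (W.sum(axis=0)[:, None] + eps)
    return W, H

Normalizing the columns of W yields topic-word distributions; the memory footprint is O((V + D) * K) for the factors, independent of the corpus token count.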
DDSL: Efficient Subgraph Listing on Distributed and Dynamic Graphs
Subgraph listing is a fundamental problem in graph theory and has wide
applications in areas like sociology, chemistry, and social networks. Modern
graphs can usually be large-scale as well as highly dynamic, which challenges
the efficiency of existing subgraph listing algorithms. Recent works have shown
the benefits of partitioning and processing big graphs in a distributed system;
however, few works target subgraph listing on dynamic graphs in a distributed
environment. In this paper, we propose an efficient approach,
called Distributed and Dynamic Subgraph Listing (DDSL), which can incrementally
update the results instead of running from scratch. DDSL follows a general
distributed join framework. In this framework, we use a Neighbor-Preserved
storage for data graphs, which takes bounded extra space and supports dynamic
updating. After that, we propose a comprehensive cost model to estimate the I/O
cost of listing subgraphs. Then, based on this cost model, we develop an
algorithm to find the optimal join tree for a given pattern. To handle dynamic
graphs, we propose an efficient left-deep join algorithm to incrementally
update the join results. Extensive experiments are conducted on real-world
datasets. The results show that DDSL outperforms existing methods in dealing
with both static and dynamic graphs in terms of response time.
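As a toy illustration of incremental listing (not the DDSL system itself, whose join trees and distributed Neighbor-Preserved storage are far more involved), the sketch below maintains triangle results under edge insertions: only the delta caused by the new edge is computed, instead of re-listing from scratch.

from collections import defaultdict

adj = defaultdict(set)  # adjacency sets of the data graph

def insert_edge(u, v):
    # New triangles created by edge (u, v) are exactly the common
    # neighbours of u and v at insertion time -- the incremental delta.
    new_triangles = [(u, v, w) for w in adj[u] & adj[v]]
    adj[u].add(v)
    adj[v].add(u)
    return new_triangles

Deletions are symmetric: the triangles containing (u, v) are retracted before the adjacency sets are updated.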
DualTable: A Hybrid Storage Model for Update Optimization in Hive
Hive is the most mature and prevalent data warehouse tool providing a SQL-like
interface in the Hadoop ecosystem. It is successfully used in many Internet
companies and shows its value for big data processing in traditional
industries. However, enterprise big data processing systems, as in Smart Grid
applications, usually require complicated business logic and involve many data
manipulation operations such as updates and deletes. Hive cannot offer
sufficient support for these while preserving high query performance: Hive
using the Hadoop Distributed File System (HDFS) for storage cannot implement
data manipulation efficiently, and Hive on HBase suffers from poor query
performance even though it can support faster data manipulation. There is a
project based on Hive issue Hive-5317 to support update operations, but it has
not been finished in Hive's latest version. Since this ACID-compliant extension
adopts the same data storage format on HDFS, the update performance problem is
not solved.
In this paper, we propose a hybrid storage model called DualTable, which
combines the efficient streaming reads of HDFS and the random write capability
of HBase. Hive on DualTable provides better data manipulation support and
preserves query performance at the same time. Experiments on a TPC-H data set
and on a real smart grid data set show that Hive on DualTable is up to 10 times
faster than Hive when executing update and delete operations.
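The hybrid idea can be pictured with a small sketch (hypothetical, not Hive's DualTable code): scans stream the base table, HDFS-style, while updates and deletes go to a random-access delta, HBase-style, and the two are merged on read.

TOMBSTONE = object()  # marker for a deleted row in the delta

def scan(base_rows, delta):
    # base_rows: (key, row) pairs streamed sequentially (the HDFS side).
    # delta: key -> new row or TOMBSTONE, randomly writable (the HBase side).
    for key, row in base_rows:
        patch = delta.get(key)
        if patch is TOMBSTONE:
            continue                 # row was deleted
        yield key, patch if patch is not None else row

def update(delta, key, new_row):
    delta[key] = new_row             # a cheap random write, no HDFS rewrite

def delete(delta, key):
    delta[key] = TOMBSTONE

An update or delete thus touches one delta entry instead of rewriting an HDFS file, consistent with the speedups the abstract reports.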
Fast Ensemble Smoothing
Smoothing is essential to many oceanographic, meteorological and hydrological
applications. The interval smoothing problem updates all desired states within
a time interval using all available observations. The fixed-lag smoothing
problem updates only a fixed number of states prior to the observation at
the current time. Fixed-lag smoothing is, in general, thought to be
computationally faster than fixed-interval smoothing, and can be an
appropriate approximation for long interval-smoothing problems. In this paper,
we use an ensemble-based approach to fixed-interval and fixed-lag smoothing,
and synthesize two algorithms. The first algorithm produces a linear time
solution to the interval smoothing problem with a fixed factor, and the second
one produces a fixed-lag solution that is independent of the lag length.
Identical-twin experiments conducted with the Lorenz-95 model show that for lag
lengths approximately equal to the error doubling time, or for long intervals,
the proposed methods can provide significant computational savings. These
results suggest that ensemble methods yield both fixed-interval and fixed-lag
smoothing solutions that cost little additional effort over filtering and model
propagation, in the sense that in practical ensemble application the additional
increment is a small fraction of either filtering or model propagation costs.
We also show that fixed-interval smoothing can perform as fast as fixed-lag
smoothing and may be advantageous when memory is not an issue.
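The core ensemble operation can be sketched as follows (a minimal illustration assuming a linear observation operator H and Gaussian observation error R; the paper's fixed-interval and fixed-lag algorithms build on this kind of update, and all names here are assumptions): a past state ensemble is corrected with a current observation through the ensemble cross-covariance, so smoothing reuses quantities the filter already computes.

import numpy as np

def ensemble_smoother_update(Xs, Xc, y, H, R, rng):
    # Xs: ensemble of a past state (n_state x n_ens) to be smoothed.
    # Xc: ensemble at the current observation time; y: observation vector.
    n_ens = Xc.shape[1]
    m = len(y)
    Xs_a = Xs - Xs.mean(axis=1, keepdims=True)   # past-state anomalies
    Yc = H @ Xc                                  # predicted observations
    Yc_a = Yc - Yc.mean(axis=1, keepdims=True)
    P_sy = Xs_a @ Yc_a.T / (n_ens - 1)           # cross-covariance
    P_yy = Yc_a @ Yc_a.T / (n_ens - 1) + R
    K = P_sy @ np.linalg.inv(P_yy)               # smoother gain
    # Perturbed-observation update applied to every past member.
    Y_pert = y[:, None] + rng.multivariate_normal(np.zeros(m), R, n_ens).T
    return Xs + K @ (Y_pert - Yc)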
Efficient Batch Update of Unique Identifiers in a Distributed Hash Table for Resources in a Mobile Host
Resources in a distributed system can be identified using identifiers based
on random numbers. When using a distributed hash table to resolve such
identifiers to network locations, the straightforward approach is to store the
network location directly in the hash table entry associated with an
identifier. When a mobile host contains a large number of resources, this
requires that all of the associated hash table entries be updated when its
network address changes.
We propose an alternative approach where we store a host identifier in the
entry associated with a resource identifier and the actual network address of
the host in a separate host entry. This can drastically reduce the time
required for updating the distributed hash table when a mobile host changes its
network address. We also investigate under which circumstances our approach
should or should not be used. We evaluate and confirm the usefulness of our
approach with experiments run on top of OpenDHT.
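The indirection can be sketched in a few lines (the dict stand-in and function names are hypothetical; OpenDHT's actual interface differs): resource entries map to a stable host identifier, and a single host entry maps that identifier to the current network address.

dht = {}  # stand-in for the distributed hash table

def publish(resource_id, host_id):
    dht[resource_id] = host_id       # resource entry points at the host ID

def set_host_address(host_id, address):
    dht[host_id] = address           # single host entry per mobile host

def resolve(resource_id):
    return dht[dht[resource_id]]     # two lookups instead of one

# When the host moves, one entry changes, however many resources it holds:
set_host_address("host-42", ("10.0.0.7", 5000))

The trade-off the abstract investigates falls out directly: resolution now costs two lookups, so the indirection pays off when address changes are frequent relative to lookups.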
Variational Hamiltonian Monte Carlo via Score Matching
Traditionally, the field of computational Bayesian statistics has been
divided into two main subfields: variational methods and Markov chain Monte
Carlo (MCMC). In recent years, however, several methods have been proposed
based on combining variational Bayesian inference and MCMC simulation in order
to improve their overall accuracy and computational efficiency. This marriage
of fast evaluation and flexible approximation provides a promising means of
designing scalable Bayesian inference methods. In this paper, we explore the
possibility of incorporating variational approximation into a state-of-the-art
MCMC method, Hamiltonian Monte Carlo (HMC), to reduce the required gradient
computation in the simulation of Hamiltonian flow, which is the bottleneck for
many applications of HMC in big data problems. To this end, we use a free-form
approximation induced by a fast and flexible surrogate function based on
single-hidden-layer feedforward neural networks. The surrogate
provides sufficiently accurate approximation while allowing for fast
exploration of parameter space, resulting in an efficient approximate inference
algorithm. We demonstrate the advantages of our method on both synthetic and
real data problems.
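A minimal sketch of where the surrogate enters (assuming grad_surrogate is a cheap fitted approximation of the log-posterior gradient, e.g. from a trained single-hidden-layer network; this is not the paper's code): the leapfrog integrator calls the surrogate instead of the exact gradient, which is the expensive part on large data sets.

def leapfrog(q, p, grad_surrogate, step_size, n_steps):
    # Standard HMC leapfrog, with grad_surrogate replacing the exact
    # log-posterior gradient in every momentum update.
    p = p + 0.5 * step_size * grad_surrogate(q)   # half momentum step
    for _ in range(n_steps - 1):
        q = q + step_size * p                     # full position step
        p = p + step_size * grad_surrogate(q)     # full momentum step
    q = q + step_size * p
    p = p + 0.5 * step_size * grad_surrogate(q)   # final half step
    return q, -p                                  # negate for reversibility

In surrogate-gradient HMC schemes generally, the accept/reject test can still use the exact density, so an imperfect surrogate degrades only proposal quality, not correctness.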