35 research outputs found

    Factoring nonnegative matrices with linear programs

    This paper describes a new approach, based on linear programming, for computing nonnegative matrix factorizations (NMFs). The key idea is a data-driven model for the factorization where the most salient features in the data are used to express the remaining features. More precisely, given a data matrix X, the algorithm identifies a matrix C such that X approximately equals CX and C satisfies some linear constraints. The constraints are chosen to ensure that the matrix C selects features; these features can then be used to find a low-rank NMF of X. A theoretical analysis demonstrates that this approach has guarantees similar to those of the recent NMF algorithm of Arora et al. (2012). In contrast with this earlier work, the proposed method extends to more general noise models and leads to efficient, scalable algorithms. Experiments with synthetic and real datasets provide evidence that the new approach is also superior in practice. An optimized C++ implementation can factor a multigigabyte matrix in a matter of minutes. Comment: 17 pages, 10 figures. Modified theorem statement for robust recovery conditions. Revised proof techniques to make arguments more elementary. Results on robustness when rows are duplicated have been superseded by arxiv.org/1211.668
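
The self-expression model in this abstract (X approximately equals CX, with C selecting features) can be illustrated with a small sketch. It does not solve the linear program; it only builds a toy matrix whose rows are nonnegative mixtures of a few "anchor" rows and checks that a suitable C reproduces X. The helper names (`matmul`, `build_separable`, `selection_matrix`) are hypothetical, and reading the selected features off the nonzero diagonal entries of C is an assumption of this sketch rather than something the abstract states.

```python
def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def build_separable(anchors, weights):
    """Stack the anchor rows on top of rows that are nonnegative mixtures of them."""
    return anchors + matmul(weights, anchors)

def selection_matrix(n_rows, n_anchors, weights):
    """Build C with X = C X: anchors reproduce themselves, other rows reuse the anchors."""
    C = [[0.0] * n_rows for _ in range(n_rows)]
    for i in range(n_anchors):
        C[i][i] = 1.0                      # anchor row i is expressed by itself
    for r, w in enumerate(weights):
        for j, wj in enumerate(w):
            C[n_anchors + r][j] = wj       # mixed row is a combination of anchors
    return C

# Toy data (made up for illustration): 2 anchor rows, 3 mixed rows.
anchors = [[1.0, 0.0, 2.0], [0.0, 3.0, 1.0]]
weights = [[0.5, 0.5], [0.2, 0.8], [1.0, 0.0]]

X = build_separable(anchors, weights)
C = selection_matrix(len(X), len(anchors), weights)
CX = matmul(C, X)

# Largest entrywise gap between X and CX (zero up to rounding in this noiseless toy case).
residual = max(abs(x - y) for rx, ry in zip(X, CX) for x, y in zip(rx, ry))
# Rows with a nonzero diagonal entry in C are the "selected" features in this sketch.
selected = [i for i in range(len(C)) if C[i][i] > 0]
```

In this toy case the selected rows of X give one factor and the corresponding columns of C give the other, which is the sense in which the selected features yield a low-rank NMF of X.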

    Comparing global optimization and default settings of stream-based joins

    One problem encountered in real-time data integration is the join of a continuous incoming data stream with a disk-based relation. In this paper we investigate a stream-based join algorithm, called mesh join (MESHJOIN), and focus on a critical component in the algorithm, called the disk-buffer. In MESHJOIN the size of the disk-buffer varies with the total memory budget, and tuning is required to get the maximum service rate within the limited available memory. Until now there was little data on how the position of the optimum value depends on the memory size, and no performance comparison has been carried out between the optimum and reasonable default sizes for the disk-buffer. To avoid tuning, we propose a reasonable default value for the disk-buffer size with a small and acceptable performance loss. The experimental results validate our arguments.

    Early Grouping Gets the Skew

    We propose a new algorithm for external grouping with large results. Our approach handles skewed data gracefully and considerably lowers the amount of random IO on disk. Contrary to existing grouping algorithms, our new algorithm does not require the optimizer to employ complicated or error-prone procedures for adjusting the parameters prior to query plan execution. We implemented several variants of our algorithm as well as the most commonly used algorithms for grouping and carried out extensive experiments on both synthetic and real data. The results of these experiments reveal the dominance of our approach. In the case of heavily skewed data, we outperform the other algorithms by a factor of two.

    Algorithm Choice For Multiple-Query Evaluation

    Traditional query optimization concentrates on optimizing the execution of each individual query. More recently, it has been observed that by considering a sequence of multiple queries some additional high-level optimizations can be performed. Once these optimizations have been performed, each operation is translated into executable code. The fundamental insight in this paper is that significant improvements can be gained by careful choice of the algorithm to be used for each operation. This choice is not merely based on the efficiency of algorithms for individual operations, but rather on the efficiency of the algorithm choices for the entire multiple-query evaluation. An efficient procedure for automatically optimizing these algorithm choices is given.

    Distributive Join Strategy Based on Tuple Inversion

    In this paper, we propose a new direction for distributive join operations. We assume that there will be a scalable distributed computer system in which many computers (processors) are connected through a communication network, which can be a LAN or part of the Internet, with sufficient bandwidth. A relational database is then distributed across this network of processors. However, in our approach, the distribution of the database is very fine-grained and is based on the Distributed Hash Table (DHT) concept. A tuple of a table is assigned to a specific processor by using a fair hash function applied to its key value. For each joinable attribute, an inverted file list is further generated and distributed again based on the DHT. This pre-distribution is done when the tuple enters the system and therefore does not require any distribution of data tuples on the fly when the join is executed. When a join operation request is broadcast, each processor performs a local join and the results are sent back to a query processor which, in turn, merges the join results and returns them to the user. Note that the distribution of the DHT of the inverted file lists can be either pre-processed or distributed on the fly. If the lists are pre-processed and distributed, they have to be maintained. We evaluate our approach by comparing it empirically to two other approaches: the naive join method and the fully distributed join method. The results show a significantly higher performance of our method for a wide range of possible parameters.
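
The placement scheme described in this abstract (hash each tuple to a processor by key, hash an inverted-list entry to a processor by join-attribute value, then join locally on every processor and merge) can be sketched in a single process. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` and `Cluster` classes, the `node_for` helper, the use of CRC32 as the "fair" hash, and the choice to store only tuple keys in the inverted lists are all hypothetical.

```python
import zlib
from collections import defaultdict

def node_for(value, n_nodes):
    """Map a value to a node index with a deterministic hash (CRC32 is an assumption)."""
    return zlib.crc32(repr(value).encode()) % n_nodes

class Node:
    def __init__(self):
        self.tuples = {}  # (table, key) -> row, placed by hash of the key
        # join value -> table name -> list of tuple keys, placed by hash of the value
        self.inverted = defaultdict(lambda: defaultdict(list))

class Cluster:
    def __init__(self, n_nodes):
        self.nodes = [Node() for _ in range(n_nodes)]

    def insert(self, table, key, row, join_attr):
        """Pre-distribute a tuple and its inverted-list entry when it enters the system."""
        n = len(self.nodes)
        self.nodes[node_for(key, n)].tuples[(table, key)] = row
        value = row[join_attr]
        self.nodes[node_for(value, n)].inverted[value][table].append(key)

    def join(self, left, right):
        """Simulate a broadcast join: every node joins locally, results are merged."""
        merged = []
        for node in self.nodes:
            for value, by_table in node.inverted.items():
                for lk in by_table.get(left, []):
                    for rk in by_table.get(right, []):
                        merged.append((lk, rk, value))
        return sorted(merged)

# Toy data (made up for illustration).
cluster = Cluster(4)
cluster.insert("orders", 1, {"cust": "a"}, "cust")
cluster.insert("orders", 2, {"cust": "b"}, "cust")
cluster.insert("customers", 10, {"cust": "a"}, "cust")
cluster.insert("customers", 11, {"cust": "c"}, "cust")
pairs = cluster.join("orders", "customers")
```

Because entries with equal join values always hash to the same node, matching keys from both tables meet in one inverted list, so no tuples need to move when the join runs. The result here is the list of matching key pairs with their shared join value.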

    HYBRIDJOIN for near-real-time Data Warehousing

    An important component of near-real-time data warehouses is the near-real-time integration layer. One important element in near-real-time data integration is the join of a continuous input data stream with a disk-based relation. For high-throughput streams, stream-based algorithms, such as Mesh Join (MESHJOIN), can be used. However, in MESHJOIN the performance of the algorithm is inversely proportional to the size of the disk-based relation. The Index Nested Loop Join (INLJ) can be set up so that it processes stream input and can deal with intermittences in the update stream, but it has low throughput. This paper introduces a robust stream-based join algorithm called Hybrid Join (HYBRIDJOIN), which combines the two approaches. A theoretical result shows that HYBRIDJOIN is asymptotically as fast as the faster of the two algorithms. The authors present performance measurements of the implementation. In experiments using synthetic data based on a Zipfian distribution, HYBRIDJOIN performs significantly better for typical parameters of the Zipfian distribution, and in general performs in accordance with the theoretical model, while the other two algorithms are unacceptably slow under different settings.