
    A Design Framework for Efficient Distributed Analytics on Structured Big Data

    Distributed analytics architectures often comprise two elements: a compute engine and a storage system. Conventional distributed storage systems usually store data in the form of files or key-value pairs. This abstraction simplifies how the data is accessed and reasoned about by an application developer. However, the separation of compute and storage systems makes it difficult to optimize costly disk and network operations. By design, the storage system is isolated from the workload and its performance requirements, such as block co-location and replication. Furthermore, optimizing fine-grained data access requests becomes difficult because the storage layer is hidden behind such abstractions. Using a clean-slate approach, this thesis proposes a modular distributed analytics system design centered around a unified interface for distributed data objects, named the DDO. The interface couples key mechanisms that utilize storage, memory, and compute resources. This coupling makes it ideal for optimizing data access requests across all memory hierarchy levels with respect to the workload and its performance requirements. In addition to the DDO, a complementary DDO controller implementation manages the logical view of DDOs, their replication, and their distribution across the cluster. A proof-of-concept implementation shows a 3-6x improvement in mean query time on the TPC-H and TPC-DS benchmarks, and more than an order of magnitude improvement in many cases.
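
    As a rough illustration of the kind of interface the thesis argues for, here is a minimal sketch of what a unified distributed-data-object abstraction might look like in Python; the class and method names are hypothetical, not the thesis's actual API.

        from abc import ABC, abstractmethod

        # Hypothetical DDO interface coupling storage, memory, and placement
        # mechanisms behind a single abstraction (all names illustrative only).
        class DDO(ABC):
            @abstractmethod
            def read(self, columns, predicate=None):
                """Serve a fine-grained access request from the cheapest
                memory-hierarchy level (cache, memory, or disk) that holds it."""

            @abstractmethod
            def replicate(self, factor):
                """Adjust the replication factor to match workload requirements."""

            @abstractmethod
            def colocate(self, other):
                """Hint that this DDO's blocks should be placed on the same nodes
                as another DDO's, e.g. to avoid network shuffles during joins."""

        # Hypothetical controller that owns the logical view of DDOs and their
        # replication and distribution across the cluster.
        class DDOController:
            def __init__(self):
                self.registry = {}

            def register(self, name, ddo):
                self.registry[name] = ddo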

    Data Privacy Preservation in Collaborative Filtering Based Recommender Systems

    This dissertation studies data privacy preservation in collaborative filtering based recommender systems and proposes several collaborative filtering models that aim to preserve user privacy from different perspectives. An empirical study on multiple classical recommendation algorithms presents the basic idea of the models and explores their performance on real-world datasets. The algorithms investigated in this study include a popularity based model, an item similarity based model, a singular value decomposition based model, and a bipartite graph model. Top-N recommendations are evaluated to examine prediction accuracy. It is apparent that with more customers' preference data, recommender systems can better profile customers' shopping patterns, which in turn produces product recommendations with higher accuracy. Precautions should therefore be taken to address the privacy issues that arise during data sharing between two vendors. The study shows that matrix factorization techniques are, by their nature, ideal choices for data privacy preservation. In this dissertation, singular value decomposition (SVD) and nonnegative matrix factorization (NMF) are adopted as the fundamental collaborative filtering techniques for making privacy-preserving recommendations. The proposed SVD based model utilizes missing value imputation, a randomization technique, and the truncated SVD to perturb the raw rating data. The NMF based models, namely iAux-NMF and iCluster-NMF, take into account auxiliary information about users and items to aid missing value imputation and privacy preservation. Additionally, these models support efficient incremental data updates. Many online vendors allow people to leave feedback on products, which constitutes users' public preferences. However, because of the connections between users' public and private preferences, if a recommender system fails to distinguish real customers from attackers, the private preferences of real customers can be exposed. This dissertation addresses an attack model in which an attacker holds real customers' partial ratings and tries to obtain their private preferences by cheating recommender systems. To resolve this problem, trustworthiness information is incorporated into NMF based collaborative filtering techniques to detect attackers and make reasonably different recommendations to normal users and attackers. By doing so, users' private preferences can be effectively protected.
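
    For concreteness, below is a minimal sketch of the three-step perturbation pipeline attributed to the SVD based model (imputation, randomization, truncated SVD); the parameter choices are illustrative, not the dissertation's.

        import numpy as np

        def perturb_ratings(R, rank=2, noise_scale=0.1, seed=0):
            """Perturb a ratings matrix R, with NaN marking unrated cells."""
            rng = np.random.default_rng(seed)
            R = R.astype(float)
            # 1. Missing value imputation: fill unrated cells with item means.
            col_means = np.nanmean(R, axis=0)
            R_filled = np.where(np.isnan(R), col_means, R)
            # 2. Randomization: add small Gaussian noise to mask raw ratings.
            R_noisy = R_filled + rng.normal(0.0, noise_scale, R.shape)
            # 3. Truncated SVD: keep only the top-`rank` factors.
            U, s, Vt = np.linalg.svd(R_noisy, full_matrices=False)
            return (U[:, :rank] * s[:rank]) @ Vt[:rank, :]

        # A 4-user x 3-item toy matrix; the perturbed output, not the raw
        # ratings, is what would be shared with the other vendor.
        R = np.array([[5, np.nan, 3],
                      [4, 2, np.nan],
                      [np.nan, 1, 4],
                      [3, 5, np.nan]])
        print(perturb_ratings(R))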

    Efficient Incremental Data Analysis

    Many data-intensive applications require real-time analytics over streaming data. In a growing number of domains -- sensor network monitoring, social web applications, clickstream analysis, high-frequency algorithmic trading, and fraud detection, to name a few -- applications continuously monitor stream events to promptly react to certain data conditions. These applications demand responsive analytics even when faced with high volume and velocity of incoming changes, large numbers of users, and complex processing requirements. Developing a suitable online analytics engine that meets these requirements is challenging. In this thesis, we study techniques for efficient online processing of complex analytical queries, ranging from standard database queries to complex machine learning and digital signal processing workflows. First, we focus on the problem of efficient incremental computation for database queries. We have developed a system, called DBToaster, that compiles declarative queries into high-performance stream processing engines that keep query results (views) fresh at very high update rates. At the heart of our system is a recursive query compilation algorithm that materializes a set of supporting higher-order delta views to achieve a substantially lower view maintenance cost. We study the trade-offs between single-tuple and batch incremental processing in local execution, and we present a novel approach for compiling view maintenance code into data-parallel programs optimized for distributed execution. DBToaster supports millions of complete view refreshes per second for a broad range of queries and outperforms commercial database and stream engines by orders of magnitude. We also study incremental computation for queries written as iterative linear algebra, which can capture many machine learning and scientific calculations. We have developed a framework, called LINVIEW, for capturing deltas of linear algebra programs and understanding their computational cost. Linear algebra operations tend to cause an avalanche effect, where even very local changes to the input matrices spread out and infect all of the intermediate results and the final view, causing incremental view maintenance to lose its performance benefit over re-evaluation. We develop techniques based on matrix factorizations to contain such epidemics of change and to make incremental view maintenance of linear algebra practical and usually substantially cheaper than re-evaluation. We show, both analytically and experimentally, the usefulness of these techniques when applied to standard analytics tasks. Our last research question concerns the integration of general-purpose query processors and domain-specific operations to enable deep data exploration in both online and offline analysis. We advocate a deep integration of signal processing operations and general-purpose query processors. We demonstrate that in-situ processing of tempo-relational and signal data through a unified query language empowers users to express end-to-end workflows more succinctly inside one system while at the same time offering orders of magnitude better performance than existing popular data management systems.
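
    To make the delta idea concrete, here is a toy illustration of incremental view maintenance in the spirit of DBToaster (though in no way its actual code): materializing per-key auxiliary views lets each single-tuple update refresh a join-aggregate result in constant time instead of re-running the query.

        from collections import defaultdict

        # Maintained query: SELECT SUM(r.v * s.v) FROM R, S WHERE R.k = S.k
        class JoinSumView:
            def __init__(self):
                self.sum_r = defaultdict(float)  # view: SUM(r.v) GROUP BY k
                self.sum_s = defaultdict(float)  # view: SUM(s.v) GROUP BY k
                self.result = 0.0                # the maintained top-level view

            def insert_r(self, k, v):
                self.result += v * self.sum_s[k]  # delta of the new tuple
                self.sum_r[k] += v

            def insert_s(self, k, v):
                self.result += v * self.sum_r[k]
                self.sum_s[k] += v

        view = JoinSumView()
        view.insert_r("a", 2.0)
        view.insert_s("a", 3.0)   # result becomes 2*3 = 6
        view.insert_r("a", 1.0)   # result becomes 6 + 1*3 = 9
        print(view.result)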

    Squall: Scalable Real-time Analytics

    Squall is a scalable online query engine that runs complex analytics in a cluster using skew-resilient, adaptive operators. Squall builds on state-of-the-art partitioning schemes and local algorithms, including some of our own. This paper presents an overview of Squall, including some novel join operators. It also presents lessons learned over five years of working on this system and outlines the plan for the proposed system demonstration.
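
    As a hedged sketch of what skew resilience can mean for a partitioned join (key splitting is a standard technique here; nothing below is Squall's actual operator code), tuples of a heavy-hitter key can be spread over several workers, with matching tuples from the other input replicated to all of that key's workers so no join result is lost.

        import hashlib

        def workers_for(key, side, heavy_keys, n_workers, splits=4, salt=0):
            """Return the worker ids a tuple with this join key is sent to."""
            h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
            if key not in heavy_keys:
                return [h % n_workers]            # normal key: one worker
            base = h % n_workers
            if side == "left":                    # heavy key: split the left side
                return [(base + salt % splits) % n_workers]
            # ...and replicate the right side to all of the key's split workers.
            return [(base + i) % n_workers for i in range(splits)]

        print(workers_for("hot", "left", {"hot"}, 8, salt=3))   # one worker
        print(workers_for("hot", "right", {"hot"}, 8))          # four workers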

    A Taylor polynomial expansion line search for large-scale optimization

    In trying to cope with the Big Data deluge, the landscape of distributed computing has changed. Large commodity hardware clusters, typically operating in some form of MapReduce framework, are becoming prevalent for organizations that require both tremendous storage capacity and fault tolerance. However, the high cost of communication can dominate the computation time in large-scale optimization routines in these frameworks. This thesis considers the problem of how to efficiently conduct univariate line searches in commodity clusters in the context of gradient-based batch optimization algorithms, such as the staple limited-memory BFGS (LBFGS) method. In it, a new line search technique is proposed for cases where the underlying objective function is analytic, as in logistic regression and low-rank matrix factorization. The technique approximates the objective function by a truncated Taylor polynomial along a fixed search direction. The coefficients of this polynomial may be computed efficiently in parallel with far less communication than is needed to transmit the high-dimensional gradient vector, after which the polynomial may be minimized with high accuracy in a neighbourhood of the expansion point without distributed operations. This Polynomial Expansion Line Search (PELS) may be invoked iteratively until the expansion point and minimum are sufficiently accurate, and it can provide substantial savings in time and communication costs when multiple iterations of the line search procedure are required. Three applications of the PELS technique are presented herein for important classes of analytic functions: (i) logistic regression (LR), (ii) low-rank matrix factorization (MF) models, and (iii) the feedforward multilayer perceptron (MLP). In addition, for LR and MF, implementations of PELS in the Apache Spark framework for fault-tolerant cluster computing are provided. These implementations conferred significant convergence enhancements to their respective algorithms and will be of interest to Spark and Hadoop practitioners. For instance, the Spark PELS technique reduced the number of iterations and the time required by LBFGS to reach terminal training accuracies for LR models by factors of 1.8--2. Substantial acceleration was also observed for the Nonlinear Conjugate Gradient algorithm for MLP models, which is an interesting case for future study in optimization for neural networks. The PELS technique is applicable to a broad class of models for Big Data processing and large-scale optimization, and it can be a useful component of batch optimization routines.
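
    A minimal single-machine sketch of the core idea, under assumptions about the method: once the derivatives of phi(alpha) = f(x + alpha*d) at alpha = 0 are known (the part the thesis computes in parallel with little communication), the truncated Taylor polynomial can be minimized locally with no further distributed operations.

        import math
        from numpy.polynomial import Polynomial

        def taylor_line_search(derivs, radius=1.0):
            """derivs[k] = phi^(k)(0); minimize the Taylor polynomial on [0, radius]."""
            p = Polynomial([d / math.factorial(k) for k, d in enumerate(derivs)])
            dp = p.deriv()
            # Candidates: real stationary points inside the trust interval,
            # plus the interval endpoints.
            stationary = [r.real for r in dp.roots()
                          if abs(r.imag) < 1e-12 and 0.0 <= r.real <= radius]
            return min(stationary + [0.0, radius], key=p)

        # Example: phi(a) = (a - 0.3)^2 + 1 has derivatives [1.09, -0.6, 2] at 0.
        print(taylor_line_search([1.09, -0.6, 2.0]))  # ~0.3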

    A recommender system for e-retail

    The e-retail sector in South Africa has a significant opportunity to capture a large portion of the country's retail industry. Central to seizing this opportunity is leveraging the advantages that the online setting affords. In particular, the e-retailer can offer an extremely large catalogue of products, far beyond what a traditional retailer is capable of supporting. However, as the catalogue grows, it becomes increasingly difficult for a customer to efficiently discover desirable products. As a consequence, it is important for the e-retailer to develop tools that automatically explore the catalogue for the customer. In this dissertation, we develop a recommender system (RS) whose purpose is to provide suggestions for products that are most likely of interest to a particular customer. There are two primary contributions of this dissertation. First, we describe a set of six characteristics that all effective RSs should possess, namely: accuracy, responsiveness, durability, scalability, model management, and extensibility. Second, we develop an RS that is capable of serving recommendations in an actual e-retail environment. The design of the RS is an attempt to embody the characteristics mentioned above. In addition, to show how the RS supports model selection, we present a proof-of-concept experiment comparing two popular methods for generating recommendations that we implemented for this dissertation, namely implicit matrix factorisation (IMF) and Bayesian personalised ranking (BPR).
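
    For concreteness, below is a generic textbook-style sketch of one BPR stochastic gradient step (not the dissertation's implementation; the hyperparameters are illustrative): each update nudges the factors so that a user's observed item i scores higher than an unobserved item j.

        import numpy as np

        def bpr_step(U, V, u, i, j, lr=0.05, reg=0.01):
            """One SGD step on the sampled triple (user u, positive i, negative j)."""
            x = U[u] @ (V[i] - V[j])        # score difference x_uij
            g = 1.0 / (1.0 + np.exp(x))     # gradient of ln(sigmoid(x)) w.r.t. x
            du, di, dj = U[u].copy(), V[i].copy(), V[j].copy()
            U[u] += lr * (g * (di - dj) - reg * du)
            V[i] += lr * (g * du - reg * di)
            V[j] += lr * (-g * du - reg * dj)

        rng = np.random.default_rng(0)
        U = rng.normal(scale=0.1, size=(100, 16))   # user factors
        V = rng.normal(scale=0.1, size=(500, 16))   # item factors
        bpr_step(U, V, u=3, i=42, j=17)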

    Efficient analysis of large-scale social networks using big-data platforms

    In recent years, the rise of very large, rich-content networks re-ignited interest in complex/social network analysis at the big-data scale, which makes it possible to understand social interactions at large scale while posing computational challenges to earlier work with algorithmic complexity greater than O(n). This thesis analyzes social networks at very large scales to derive important parameters and characteristics in an efficient and effective way using big-data platforms. With the popularization of mobile phones, telecommunication networks have turned into a socially binding medium and enable researchers to analyze social interactions at very large scales. Degree distribution is one of the most important characteristics of social networks. To study degree characteristics and structural properties in large-scale social networks, in this thesis we first gathered a tera-scale dataset of telecommunication call detail records. Using this data, we empirically evaluate several statistical models against the degree distribution of the country's call graph and determine that a Pareto log-normal distribution provides the best fit, despite claims in the literature that a power-law distribution is the best model. We also investigate how network operator, size, density, and location affect the degree distribution, to understand the parameters governing it in social networks. Besides structural property analysis, community identification is of great practical interest for finding highly cohesive subnetworks on different subjects within a social network. In graph theory, k-core is a key metric used to identify subgraphs of high cohesion, also known as the 'dense' regions of a graph. As real-world graphs such as social network graphs grow in size, their contents get richer, and their topologies change dynamically, we are challenged not only to materialize k-core subgraphs once but also to maintain them in order to keep up with continuous updates. These challenges inspired us to propose a new set of distributed algorithms for k-core view construction and maintenance on a horizontally scaling storage and computing platform. Experimental evaluation demonstrated orders-of-magnitude speedups and the advantages of maintaining k-cores incrementally and in batch windows over complete reconstruction approaches. Moreover, the intensity of community engagement can be distinguished at multiple levels, resulting in a multiresolution community representation that has to be maintained over time. We therefore also propose distributed algorithms to construct and maintain multi-k-core graphs, implemented on the scalable big-data platform Apache HBase. Our experimental evaluation demonstrates orders-of-magnitude speedups from maintaining multi-k-cores incrementally over complete reconstruction. Furthermore, we propose a graph-aware cache system designed for distributed graph processing. Experimental results demonstrate up to 15x speedup compared to traditional LRU-based cache systems.
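
    As background for the k-core discussion, here is a single-machine toy version of k-core computation by iterative peeling; the thesis contributes distributed, incrementally maintained variants on Apache HBase, which this sketch does not attempt to reproduce.

        from collections import defaultdict

        def k_core(edges, k):
            """Return the subgraph in which every vertex has degree >= k."""
            adj = defaultdict(set)
            for u, v in edges:
                adj[u].add(v)
                adj[v].add(u)
            changed = True
            while changed:              # peel vertices of degree < k until stable
                changed = False
                for v in [v for v, nbrs in adj.items() if len(nbrs) < k]:
                    for u in adj.pop(v):
                        if u in adj:
                            adj[u].discard(v)
                    changed = True
            return dict(adj)

        edges = [(1, 2), (2, 3), (1, 3), (3, 4)]  # triangle {1,2,3} plus pendant 4
        print(sorted(k_core(edges, 2)))           # [1, 2, 3]: the 2-core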

    Scalable and fault-tolerant data stream processing on multi-core architectures

    With increasing data volumes and velocity, many applications are shifting from the classical “process-after-store” paradigm to a stream processing model: data is produced and consumed as continuous streams. Stream processing captures latency-sensitive applications as diverse as credit card fraud detection and high-frequency trading. These applications are expressed as queries of algebraic operations (e.g., aggregation) over the most recent data using windows, i.e., finite evolving views over the input streams. To guarantee correct results, streaming applications require precise window semantics (e.g., temporal ordering) for operations that maintain state. While high processing throughput and low latency are performance desiderata for stateful streaming applications, achieving both poses challenges. Computing the state of overlapping windows causes redundant aggregation operations: incremental execution (i.e., reusing previous results) reduces latency but prevents parallelization; at the same time, parallelizing window execution for stateful operations with precise semantics demands ordering guarantees and state access coordination. Finally, streams and state must be recovered to produce consistent and repeatable results in the event of failures. Given the rise of shared-memory multi-core CPU architectures and high-speed networking, we argue that it is possible to address these challenges in a single node without compromising window semantics, performance, or fault-tolerance. In this thesis, we analyze, design, and implement stream processing engines (SPEs) that achieve high performance on multi-core architectures. To this end, we introduce new approaches for in-memory processing that address the previous challenges: (i) for overlapping windows, we provide a family of window aggregation techniques that enable computation sharing based on the algebraic properties of aggregation functions; (ii) for parallel window execution, we balance parallelism and incremental execution by developing abstractions for both and combining them into a novel design; and (iii) for reliable single-node execution, we enable strong fault-tolerance guarantees without sacrificing performance by reducing the required disk I/O bandwidth using a novel persistence model. We combine the above to implement an SPE that processes hundreds of millions of tuples per second with sub-second latencies. These results reveal the opportunity to reduce resource and maintenance footprint by replacing cluster-based SPEs with single-node deployments.
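
    One classic example of the computation sharing that associativity alone enables (no inverse function required) is the "two stacks" sliding-window trick sketched below; it illustrates the general flavour of such techniques rather than the thesis's specific operator designs.

        class SlidingAggregate:
            """Sliding-window aggregate for any associative op, amortized O(1)."""
            def __init__(self, op):
                self.op = op
                self.front, self.back = [], []   # stacks of (value, running agg)

            def push(self, v):
                agg = v if not self.back else self.op(self.back[-1][1], v)
                self.back.append((v, agg))

            def pop(self):
                if not self.front:               # flip back stack into front
                    while self.back:
                        v, _ = self.back.pop()
                        agg = v if not self.front else self.op(self.front[-1][1], v)
                        self.front.append((v, agg))
                self.front.pop()

            def query(self):
                parts = [s[-1][1] for s in (self.front, self.back) if s]
                return parts[0] if len(parts) == 1 else self.op(*parts)

        w = SlidingAggregate(max)
        for x in [3, 1, 4, 1, 5]:
            w.push(x)
        w.pop()                                  # evict the oldest element (3)
        print(w.query())                         # max of [1, 4, 1, 5] -> 5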

    Incremental Processing and Optimization of Update Streams

    Over recent years, we have seen an increasing number of applications in networking, sensor networks, cloud computing, and environmental monitoring that monitor, plan, control, and make decisions over data streams from multiple sources. We are interested in extending traditional stream processing techniques to meet the new challenges of these applications. Generally, in order to support genuine continuous query optimization and processing over data streams, we need to systematically understand how to address incremental optimization and processing of update streams for a rich class of queries commonly used in these applications. Our general thesis is that efficient incremental processing and re-optimization of update streams can be achieved by various incremental view maintenance techniques if we cast the problems as incremental view maintenance problems over data streams. We focus on two challenges in the incremental processing of update streams that are not addressed by existing work on stream query processing: incremental processing of transitive closure queries over data streams, and incremental re-optimization of queries. In addition to addressing these specific challenges, we also develop a working prototype system, Aspen, which serves as an end-to-end stream processing system and has been deployed as the foundation for a case study of our SmartCIS application. We validate our solutions both analytically and empirically on top of Aspen, over a variety of benchmark workloads such as the TPC-H and Linear Road benchmarks.
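
    To illustrate the view-maintenance framing, here is a toy sketch of maintaining a transitive-closure view under edge insertions (the standard delta rule); the Aspen prototype operates over streams and is far more general than this fragment.

        def insert_edge(tc, u, v):
            """tc is the set of reachable pairs; add edge (u, v) and its consequences."""
            # Every vertex that reaches u can now reach everything v reaches.
            sources = {a for (a, b) in tc if b == u} | {u}
            targets = {b for (a, b) in tc if a == v} | {v}
            tc |= {(a, b) for a in sources for b in targets}

        tc = set()
        insert_edge(tc, 1, 2)
        insert_edge(tc, 2, 3)
        print(sorted(tc))   # [(1, 2), (1, 3), (2, 3)]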