    Discrete deep learning for fast content-aware recommendation

    The cold-start problem and recommendation efficiency are two crucial challenges for recommender systems. In this paper, we propose a hashing-based deep learning framework called Discrete Deep Learning (DDL) that maps users and items to Hamming space, where a user's preference for an item can be efficiently calculated by Hamming distance; this computation scheme significantly improves the efficiency of online recommendation. In addition, DDL unifies user-item interaction information and item content information to overcome data sparsity and the cold-start problem. More specifically, to integrate content information into the DDL framework, a deep learning model, the Deep Belief Network (DBN), is applied to extract effective item representations from item content. The framework also imposes balance and irrelevance constraints on the binary codes to derive compact but informative codes. Because of the discrete constraints in DDL, we propose an efficient alternating optimization method that iteratively solves a series of mixed-integer programming subproblems. Extensive experiments on two different Amazon datasets evaluate the performance of the DDL framework, and the results demonstrate the superiority of DDL over state-of-the-art methods in terms of online recommendation efficiency and cold-start recommendation accuracy.
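
    The abstract's efficiency claim rests on the fact that, once users and items live in Hamming space, preference scoring reduces to counting differing bits. The sketch below illustrates only that scoring step; the 64-bit codes are random placeholders, not codes learned by DDL.

    import numpy as np

    def hamming_scores(user_code: np.ndarray, item_codes: np.ndarray) -> np.ndarray:
        # user_code: (r,) array of 0/1 bits; item_codes: (n, r) array of 0/1 bits.
        # Smaller Hamming distance means a stronger predicted preference.
        return np.count_nonzero(item_codes != user_code, axis=1)

    rng = np.random.default_rng(0)
    item_codes = rng.integers(0, 2, size=(1000, 64))   # hypothetical 64-bit item codes
    user_code = rng.integers(0, 2, size=64)            # hypothetical 64-bit user code
    top10 = np.argsort(hamming_scores(user_code, item_codes))[:10]  # 10 closest items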

    Manipulating the Capacity of Recommendation Models in Recall-Coverage Optimization

    Traditional approaches in Recommender Systems ignore the problem of long-tail recommendations. There is no systematic approach to control the magnitude of long-tail recommendations generated by the models, and there is not even proper methodology to evaluate the quality of long-tail recommendations. This thesis addresses the long-tail recommendation problem from both the algorithmic and evaluation perspective. We proposed controlling the magnitude of long-tail recommendations generated by models through the manipulation with capacity hyperparameters of learning algorithms, and we dene such hyperparameters for multiple state-of-the-art algorithms. We also summarize multiple such algorithms under the common framework of the score function, which allows us to apply popularity-based regularization to all of them. We propose searching for Pareto-optimal states in the Recall-Coverage plane as the right way to search for long-tail, high-accuracy models. On the set of exhaustive experiments, we empirically demonstrate the corectness of our theory on a mixture of public and industrial datasets for 5 dierent algorithms and their dierent versions.Traditional approaches in Recommender Systems ignore the problem of long-tail recommendations. There is no systematic approach to control the magnitude of long-tail recommendations generated by the models, and there is not even proper methodology to evaluate the quality of long-tail recommendations. This thesis addresses the long-tail recommendation problem from both the algorithmic and evaluation perspective. We proposed controlling the magnitude of long-tail recommendations generated by models through the manipulation with capacity hyperparameters of learning algorithms, and we dene such hyperparameters for multiple state-of-the-art algorithms. We also summarize multiple such algorithms under the common framework of the score function, which allows us to apply popularity-based regularization to all of them. We propose searching for Pareto-optimal states in the Recall-Coverage plane as the right way to search for long-tail, high-accuracy models. On the set of exhaustive experiments, we empirically demonstrate the corectness of our theory on a mixture of public and industrial datasets for 5 dierent algorithms and their dierent versions
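
    As a rough illustration of the Recall-Coverage idea (not code from the thesis), the helper below keeps only the candidate models that are Pareto-optimal in the Recall-Coverage plane, i.e. not dominated in both metrics at once. The (recall, coverage) pairs are invented.

    def pareto_front(points):
        # Return the points not dominated by any other (higher is better in both metrics).
        front = []
        for i, (r1, c1) in enumerate(points):
            dominated = any(
                (r2 >= r1 and c2 >= c1) and (r2 > r1 or c2 > c1)
                for j, (r2, c2) in enumerate(points) if j != i
            )
            if not dominated:
                front.append((r1, c1))
        return front

    candidates = [(0.31, 0.10), (0.29, 0.25), (0.27, 0.40), (0.28, 0.20)]
    print(pareto_front(candidates))   # the dominated (0.28, 0.20) candidate drops out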

    Learning compact hashing codes with complex objectives from multiple sources for large scale similarity search

    Similarity search is a key problem in many real-world applications including image and text retrieval, content reuse detection, and collaborative filtering. The purpose of similarity search is to identify similar data examples given a query example. Due to the explosive growth of the Internet, a huge amount of data such as texts, images, and videos has been generated, making efficient large-scale similarity search increasingly important. Hashing methods have become popular for large-scale similarity search due to their computational and memory efficiency. These hashing methods design compact binary codes to represent data examples so that similar examples are mapped to similar codes. This dissertation addresses five major problems of utilizing supervised information from multiple sources in hashing, each with respect to a different objective. First, we address the problem of incorporating semantic tags by modeling the latent correlations between tags and data examples. More precisely, the hashing codes are learned in a unified semi-supervised framework by simultaneously preserving the similarities between data examples and ensuring tag consistency via a latent factor model. Second, we solve the missing data problem by latent subspace learning from multiple sources. The hashing codes are learned by enforcing data consistency among the different sources. Third, we address the problem of hashing on structured data by graph learning. A weighted graph is constructed based on the structured knowledge from the data, and the hashing codes are then learned by preserving the graph similarities. Fourth, we address the problem of learning hashing codes of high ranking quality by utilizing relevance judgments from users. The hashing code/function is learned by optimizing NDCG, a commonly used non-smooth, non-convex ranking measure. Finally, we deal with the problem of insufficient supervision by active learning. We propose to actively select the most informative data examples and tags in a joint manner, based on the criteria that both the data examples and tags should be maximally uncertain and dissimilar from each other. Extensive experiments on several large-scale datasets demonstrate the superior performance of the proposed approaches over several state-of-the-art hashing methods from different perspectives.
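
    For intuition only, the sketch below uses generic random-hyperplane hashing (a standard LSH baseline, not the learned, supervised codes this dissertation proposes) to show what compact binary codes buy you: each example becomes a short bit vector, and search reduces to Hamming distance. All sizes are arbitrary.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.normal(size=(10_000, 128))         # toy database of 128-d feature vectors
    hyperplanes = rng.normal(size=(128, 64))   # 64 random hyperplanes -> 64-bit codes

    codes = (X @ hyperplanes > 0)              # sign of each projection gives one bit
    query = rng.normal(size=128)
    q_code = (query @ hyperplanes > 0)

    # Hamming distance to every database code; nearest codes approximate nearest vectors.
    dists = np.count_nonzero(codes != q_code, axis=1)
    nearest = np.argsort(dists)[:5]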

    Fast Similarity Graph Construction via Data Sketching Techniques

    Graphs are mathematical structures used to model objects and their pairwise relationships. Thanks to their simple yet expressive abstract representation, they are commonly used to model various types of relations and processes in technological, social, or biological systems and have found numerous applications. A special type of graph is the similarity graph, in which nodes represent entities and an edge connects two nodes if the corresponding entities are similar under some similarity measure. In a typical scenario, raw data about entities are provided as a relational dataset, a matrix, or a tensor, and a similarity graph is built to facilitate graph-based analysis such as node importance, node classification, link prediction, community detection, outlier detection, and more. The ability to construct similarity graphs quickly is therefore important and potentially high-impact, and several approximation techniques have been proposed. In this work, we propose data-sketching-based methods for fast approximate similarity graph construction. Data sketching techniques are applied to the raw data and are designed to achieve desired error guarantees. They can drastically reduce the size of the raw data on which we operate, allowing faster construction and analysis of similarity graphs at the cost of approximate results, a desirable tradeoff for many applications in diverse domains. Through a thorough experimental evaluation, we demonstrate that our sketching methods outperform sensible baselines and competing methods proposed for the problem. First, they are much faster than exact methods while maintaining high accuracy in constructing the similarity graph. Furthermore, our methods achieve significantly higher accuracy than competing methods on generic graph analysis tasks. We demonstrate the effectiveness of our methods on different real-world graph applications.
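
    As one concrete example of the kind of sketch the abstract refers to (not necessarily the one used in the paper), MinHash signatures approximate the Jaccard similarity between entities' item sets, so edges can be added by comparing small signatures instead of the raw data. The hash count, threshold, and entities below are made up.

    import hashlib
    import random

    def _h(salt: int, x: str) -> int:
        # deterministic salted hash of one set element
        return int.from_bytes(hashlib.blake2b(f"{salt}:{x}".encode(), digest_size=8).digest(), "big")

    def minhash_signature(items, num_hashes=64, seed=0):
        salts = random.Random(seed).sample(range(2**32), num_hashes)
        return [min(_h(s, x) for x in items) for s in salts]

    def estimated_jaccard(sig_a, sig_b):
        # the fraction of matching minima estimates the Jaccard similarity
        return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

    entities = {"u1": {"a", "b", "c", "d"}, "u2": {"b", "c", "d", "e"}, "u3": {"x", "y"}}
    sigs = {name: minhash_signature(items) for name, items in entities.items()}

    # connect entities whose estimated similarity clears a (made-up) threshold
    edges = [(u, v) for u in sigs for v in sigs
             if u < v and estimated_jaccard(sigs[u], sigs[v]) >= 0.4]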

    Extending low-rank matrix factorizations for emerging applications

    Low-rank matrix factorizations have become increasingly popular for projecting high-dimensional data into latent spaces of small dimension in order to obtain a better understanding of the data and thus more accurate predictions. In particular, they have been widely applied to important applications such as collaborative filtering and social network analysis. In this thesis, I investigate applications and extensions of the low-rank matrix factorization idea to solve several practically important problems arising from collaborative filtering and social network analysis. A key challenge in recommendation system research is how to effectively profile new users, a problem generally known as cold-start recommendation. In the first part of this work, we extend the low-rank matrix factorization by allowing the latent factors to have more complex structures, namely decision trees, to solve the problem of cold-start recommendation. In particular, we present functional matrix factorization (fMF), a novel cold-start recommendation method that solves the problem of adaptive interview construction based on low-rank matrix factorizations. The second part of this work considers the efficiency of making recommendations over large user and item spaces. Specifically, we address the problem by learning binary codes for collaborative filtering, which can be viewed as restricting the latent factors in low-rank matrix factorizations to be binary vectors that serve as codes for both users and items. In the third part of this work, we investigate applications of low-rank matrix factorizations in social network analysis. Specifically, we propose a convex optimization approach to discover the hidden network of social influence with low-rank and sparse structure by modeling the recurrent events at different individuals as multi-dimensional Hawkes processes, emphasizing the mutually exciting nature of the dynamics of event occurrences. The proposed framework combines the estimation of the mutually exciting process and the low-rank matrix factorization in a principled manner. In the fourth part of this work, we estimate the triggering kernels of the Hawkes process. In particular, we focus on estimating the triggering kernels from an infinite-dimensional functional space via the Euler-Lagrange equation, which can be viewed as applying the idea of low-rank factorizations in the functional space.
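
    For readers unfamiliar with the baseline being extended here, the sketch below factors a toy ratings matrix into low-rank user and item factors via alternating least squares. It is illustrative only and is not the thesis's fMF, binary-code, or Hawkes-process models; all dimensions and ratings are invented.

    import numpy as np

    def als(R, mask, rank=2, reg=0.1, iters=20):
        # Factor R ~= U @ V.T using only the observed entries flagged in mask.
        m, n = R.shape
        rng = np.random.default_rng(0)
        U, V = rng.normal(size=(m, rank)), rng.normal(size=(n, rank))
        I = reg * np.eye(rank)
        for _ in range(iters):
            for u in range(m):                       # solve each user's latent factors
                idx = mask[u]
                U[u] = np.linalg.solve(V[idx].T @ V[idx] + I, V[idx].T @ R[u, idx])
            for i in range(n):                       # solve each item's latent factors
                idx = mask[:, i]
                V[i] = np.linalg.solve(U[idx].T @ U[idx] + I, U[idx].T @ R[idx, i])
        return U, V

    R = np.array([[5., 4., 0., 1.], [4., 0., 0., 1.], [1., 1., 0., 5.], [0., 1., 5., 4.]])
    mask = R > 0                                     # treat zeros as unobserved ratings
    U, V = als(R, mask, rank=2)
    pred = U @ V.T                                   # low-rank reconstruction fills the gaps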

    Secure Outsourced Computation on Encrypted Data

    Homomorphic encryption (HE) is a promising cryptographic technique that supports computation on encrypted data without requiring decryption first. This ability allows sensitive data, such as genomic, financial, or location data, to be outsourced for evaluation to a resourceful third party such as the cloud without compromising data privacy. Basic homomorphic primitives support addition and multiplication on ciphertexts. These primitives can be used to represent essential computations, such as logic gates, which in turn can support more complex functions. We propose the construction of efficient cryptographic protocols as building blocks (e.g., equality, comparison, and counting) that are commonly used in data analytics and machine learning. We explore the use of these building blocks in two privacy-preserving applications. One application leverages our secure prefix-matching algorithm, which builds on top of the equality operation, to process geospatial queries on encrypted locations. The other applies our secure comparison protocol to perform conditional branching in the private evaluation of decision trees. Many outsourced computations require joint evaluation on private data owned by multiple parties. For example, Genome-Wide Association Studies (GWAS) are becoming feasible because of recent advances in genome sequencing technology, but due to the sensitivity of genomic data, the data are encrypted under different keys possessed by different data owners. Computing on ciphertexts encrypted with multiple keys is a non-trivial task. Current solutions often require a joint key setup before any computation (as in threshold HE) or incur large ciphertexts whose size, at best, grows linearly with the number of involved keys (as in multi-key HE). We propose a hybrid approach that combines the advantages of threshold and multi-key HE to support computation on ciphertexts encrypted with different keys while vastly reducing ciphertext size. Moreover, we propose the SparkFHE framework to support large-scale secure data analytics in the cloud. SparkFHE integrates Apache Spark with fully homomorphic encryption to support secure distributed data analytics and machine learning, and it makes two novel contributions: (1) enabling Spark to perform efficient computation on large datasets while preserving user privacy, and (2) accelerating intensive homomorphic computation through parallelization of tasks across clusters of computing nodes. To the best of our knowledge, SparkFHE is the first framework to address these two needs simultaneously.
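
    To make "addition on ciphertexts" concrete, here is a textbook Paillier example (a partially homomorphic scheme, not the threshold, multi-key, or fully homomorphic constructions discussed above) with deliberately tiny, insecure parameters: multiplying two ciphertexts yields an encryption of the sum of the plaintexts.

    import math
    import random

    p, q = 1009, 1013                       # toy primes; real deployments use ~2048-bit moduli
    n, n2 = p * q, (p * q) ** 2
    lam = math.lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)                    # valid because the generator is fixed to g = n + 1

    def encrypt(m: int) -> int:
        while True:
            r = random.randrange(1, n)      # fresh randomness per encryption
            if math.gcd(r, n) == 1:
                return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

    def decrypt(c: int) -> int:
        return ((pow(c, lam, n2) - 1) // n * mu) % n

    a, b = 17, 25
    c_sum = (encrypt(a) * encrypt(b)) % n2  # multiply ciphertexts ...
    assert decrypt(c_sum) == a + b          # ... to add the underlying plaintexts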

    FROM RAW DATA TO PROCESSABLE INFORMATIVE DATA: TRAINING DATA MANAGEMENT FOR BIG DATA ANALYTICS

    Ph.D. (Doctor of Philosophy)