14,196 research outputs found

    k-Nearest Neighbour Classifiers: 2nd Edition (with Python examples)

    Perhaps the most straightforward classifier in the arsenal of machine learning techniques is the Nearest Neighbour Classifier: classification is achieved by identifying the nearest neighbours to a query example and using those neighbours to determine the class of the query. This approach to classification is of particular importance because issues of poor run-time performance are not such a problem these days with the computational power that is available. This paper presents an overview of techniques for Nearest Neighbour classification, focusing on: mechanisms for assessing similarity (distance), computational issues in identifying nearest neighbours, and mechanisms for reducing the dimension of the data. This paper is the second edition of a paper previously published as a technical report. Sections on similarity measures for time-series, retrieval speed-up and intrinsic dimensionality have been added. An Appendix is included providing access to Python code for the key methods. Comment: 22 pages, 15 figures: An updated edition of an older tutorial on kN
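    The idea described in this abstract (find the nearest neighbours of a query, then let them vote on its class) can be sketched in a few lines. This is a minimal illustration, not the code from the paper's appendix: the Euclidean distance, the majority vote, the function names and the toy data are all assumptions made here.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, examples, labels, k=3):
    """Return the majority label among the k training examples nearest to query."""
    # Rank every training example by its distance to the query (brute force).
    ranked = sorted(zip(examples, labels), key=lambda pair: euclidean(query, pair[0]))
    # Count the labels of the k closest examples and take the most common one.
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated clusters labelled "a" and "b".
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9), (3.8, 4.0)]
train_y = ["a", "a", "a", "b", "b", "b"]

prediction = knn_classify((1.1, 1.0), train_X, train_y, k=3)
```

    The brute-force ranking costs one distance computation per training example, which is the run-time concern the abstract refers to.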

    Learning Tree-based Deep Model for Recommender Systems

    Model-based methods for recommender systems have been studied extensively in recent years. In systems with a large corpus, however, the calculation cost for the learnt model to predict all user-item preferences is tremendous, which makes full corpus retrieval extremely difficult. To overcome the calculation barriers, models such as matrix factorization resort to an inner product form (i.e., they model user-item preference as the inner product of user and item latent factors) and to indexes that facilitate efficient approximate k-nearest neighbor searches. However, it still remains challenging to incorporate more expressive interaction forms between user and item features, e.g., interactions through deep neural networks, because of the calculation cost. In this paper, we focus on the problem of introducing arbitrary advanced models to recommender systems with a large corpus. We propose a novel tree-based method which can provide logarithmic complexity w.r.t. corpus size even with more expressive models such as deep neural networks. Our main idea is to predict user interests from coarse to fine by traversing tree nodes in a top-down fashion and making decisions for each user-node pair. We also show that the tree structure can be jointly learnt towards better compatibility with users' interest distribution and hence facilitate both training and prediction. Experimental evaluations with two large-scale real-world datasets show that the proposed method significantly outperforms traditional methods. Online A/B test results on the Taobao display advertising platform also demonstrate the effectiveness of the proposed method in production environments. Comment: Accepted by KDD 201
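    The coarse-to-fine, top-down traversal described in this abstract can be illustrated with a small sketch. Everything concrete here is an assumption for illustration: the layer-wise beam search, the balanced binary tree over item ids, and the hand-written `score` function, which merely stands in for a learnt user-node model.

```python
import heapq


def build_tree(lo, hi, children):
    """Fill `children` with a balanced binary tree over item ids lo..hi-1.
    Each node is the half-open range (lo, hi) of items beneath it."""
    if hi - lo > 1:
        mid = (lo + hi) // 2
        children[(lo, hi)] = [(lo, mid), (mid, hi)]
        build_tree(lo, mid, children)
        build_tree(mid, hi, children)


def beam_search(children, root, score, beam_width=2):
    """Walk the tree level by level, keeping only the best-scoring nodes."""
    frontier, leaves = [root], []
    while frontier:
        candidates = []
        for node in frontier:
            kids = children.get(node, [])
            if kids:
                candidates.extend(kids)   # expand an internal node
            else:
                leaves.append(node)       # a leaf corresponds to a single item
        # Score only the children of surviving nodes, then prune to the beam.
        frontier = heapq.nlargest(beam_width, candidates, key=score)
    return heapq.nlargest(beam_width, leaves, key=score)


# Hypothetical user whose interest is centred on item 5.
TARGET = 5


def score(node):
    """Stand-in for a learnt user-node preference: 0 if the node covers the
    target item, otherwise minus the distance to the node's nearest item."""
    lo, hi = node
    if lo <= TARGET < hi:
        return 0
    return -min(abs(TARGET - lo), abs(TARGET - (hi - 1)))


children = {}
build_tree(0, 8, children)
top_items = beam_search(children, (0, 8), score, beam_width=2)
```

    With a fixed beam width, the number of scored nodes grows with the tree depth rather than with the number of items, which is where the logarithmic complexity mentioned in the abstract comes from.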

    Telepath: Understanding Users from a Human Vision Perspective in Large-Scale Recommender Systems

    Designing an e-commerce recommender system that serves hundreds of millions of active users is a daunting challenge. From a human vision perspective, there are two key factors that affect users' behaviors: items' attractiveness and their matching degree with users' interests. This paper proposes Telepath, a vision-based bionic recommender system model, which understands users from this perspective. Telepath is a combination of a convolutional neural network (CNN), a recurrent neural network (RNN) and deep neural networks (DNNs). Its CNN subnetwork simulates the human vision system to extract key visual signals of items' attractiveness and generate corresponding activations. Its RNN and DNN subnetworks simulate the cerebral cortex to understand users' interest based on the activations generated from browsed items. In practice, the Telepath model has been launched in JD's recommender system and advertising system. For one of the major item recommendation blocks on the JD app, click-through rate (CTR), gross merchandise value (GMV) and orders have increased by 1.59%, 8.16% and 8.71% respectively. For several major ads publishers of the JD demand-side platform, CTR, GMV and return on investment increased by 6.58%, 61.72% and 65.57% respectively with the first launch, and by a further 2.95%, 41.75% and 41.37% respectively with the second launch. Comment: 8 pages, 11 figures, 1 tabl

    Assessment of multi-temporal, multi-sensor radar and ancillary spatial data for grasslands monitoring in Ireland using machine learning approaches

    Accurate inventories of grasslands are important for studies of carbon dynamics, biodiversity conservation and agricultural management. For regions with persistent cloud cover, the use of multi-temporal synthetic aperture radar (SAR) data provides an attractive solution for generating up-to-date inventories of grasslands. This is even more appealing considering the data that will be available from upcoming missions such as Sentinel-1 and ALOS-2. In this study, the performance of three machine learning algorithms, Random Forests (RF), Support Vector Machines (SVM) and the relatively underused Extremely Randomised Trees (ERT), is evaluated for discriminating between grassland types over two large heterogeneous areas of Ireland using multi-temporal, multi-sensor radar and ancillary spatial datasets. A detailed accuracy assessment shows the efficacy of the three algorithms in classifying different types of grasslands. Overall accuracies ≥ 88.7% (with a kappa coefficient of 0.87) were achieved for the single frequency classifications, and maximum accuracies of 97.9% (kappa coefficient of 0.98) for the combined frequency classifications. For most datasets, the ERT classifier outperforms SVM and RF.

    High-dimensional approximate nearest neighbor: k-d Generalized Randomized Forests

    We propose a new data structure, the generalized randomized kd forest, or kgeraf, for approximate nearest neighbor searching in high dimensions. In particular, we introduce new randomization techniques to specify a set of independently constructed trees where search is performed simultaneously, hence increasing accuracy. We omit backtracking, and we optimize distance computations, thus accelerating queries. We release public domain software geraf and we compare it to existing implementations of state-of-the-art methods including BBD-trees, Locality Sensitive Hashing, randomized kd forests, and product quantization. Experimental results indicate that our method would be the method of choice in dimensions around 1,000, and probably up to 10,000, and pointsets of cardinality up to a few hundred thousand or even one million; this range of inputs is encountered in many critical applications today. For instance, we handle a real dataset of 10^6 images represented in 960 dimensions with a query time of less than 1 sec on average and 90% of responses being true nearest neighbors.
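    The general pattern this abstract describes (several independently randomized trees, each descended once without backtracking, with candidates pooled across trees) can be sketched as follows. This is not the kgeraf or geraf implementation: choosing the split dimension uniformly at random, the median threshold, the leaf size and all names are assumptions made for illustration.

```python
import random


def build_tree(points, indices, leaf_size, rng):
    """Build one randomized kd-style tree over the given point indices."""
    if len(indices) <= leaf_size:
        return ("leaf", indices)
    dim = rng.randrange(len(points[0]))            # random split dimension (assumption)
    values = sorted(points[i][dim] for i in indices)
    threshold = values[len(values) // 2]           # median value as split threshold
    left = [i for i in indices if points[i][dim] < threshold]
    right = [i for i in indices if points[i][dim] >= threshold]
    if not left or not right:                      # degenerate split: stop here
        return ("leaf", indices)
    return ("node", dim, threshold,
            build_tree(points, left, leaf_size, rng),
            build_tree(points, right, leaf_size, rng))


def descend(tree, query):
    """Follow a single root-to-leaf path; no backtracking."""
    while tree[0] == "node":
        _, dim, threshold, left, right = tree
        tree = left if query[dim] < threshold else right
    return tree[1]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def forest_query(forest, points, query):
    """Pool the leaf candidates from every tree and return the closest index."""
    candidates = set()
    for tree in forest:
        candidates.update(descend(tree, query))
    return min(candidates, key=lambda i: sq_dist(points[i], query))


# Toy data (hypothetical): 200 random points in 16 dimensions, 4 trees.
rng = random.Random(0)
points = [tuple(rng.random() for _ in range(16)) for _ in range(200)]
forest = [build_tree(points, list(range(len(points))), 8, rng) for _ in range(4)]

nearest = forest_query(forest, points, points[7])
```

    Because each tree is descended only once, the answer is approximate: the true nearest neighbor is found only if it shares a leaf with the query in at least one tree, which is why adding trees tends to raise accuracy.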