33,062 research outputs found

    Ontology of core data mining entities

    Get PDF
    In this article, we present OntoDM-core, an ontology of core data mining entities. OntoDM-core defines themost essential datamining entities in a three-layered ontological structure comprising of a specification, an implementation and an application layer. It provides a representational framework for the description of mining structured data, and in addition provides taxonomies of datasets, data mining tasks, generalizations, data mining algorithms and constraints, based on the type of data. OntoDM-core is designed to support a wide range of applications/use cases, such as semantic annotation of data mining algorithms, datasets and results; annotation of QSAR studies in the context of drug discovery investigations; and disambiguation of terms in text mining. The ontology has been thoroughly assessed following the practices in ontology engineering, is fully interoperable with many domain resources and is easy to extend

    Multi-resolution Tensor Learning for Large-Scale Spatial Data

    Get PDF
    High-dimensional tensor models are notoriously computationally expensive to train. We present a meta-learning algorithm, MMT, that can significantly speed up the process for spatial tensor models. MMT leverages the property that spatial data can be viewed at multiple resolutions, which are related by coarsening and finegraining from one resolution to another. Using this property, MMT learns a tensor model by starting from a coarse resolution and iteratively increasing the model complexity. In order to not "over-train" on coarse resolution models, we investigate an information-theoretic fine-graining criterion to decide when to transition into higher-resolution models. We provide both theoretical and empirical evidence for the advantages of this approach. When applied to two real-world large-scale spatial datasets for basketball player and animal behavior modeling, our approach demonstrate 3 key benefits: 1) it efficiently captures higher-order interactions (i.e., tensor latent factors), 2) it is orders of magnitude faster than fixed resolution learning and scales to very fine-grained spatial resolutions, and 3) it reliably yields accurate and interpretable models

    Learning Multiple Defaults for Machine Learning Algorithms

    Get PDF
    The performance of modern machine learning methods highly depends on their hyperparameter configurations. One simple way of selecting a configuration is to use default settings, often proposed along with the publication and implementation of a new algorithm. Those default values are usually chosen in an ad-hoc manner to work good enough on a wide variety of datasets. To address this problem, different automatic hyperparameter configuration algorithms have been proposed, which select an optimal configuration per dataset. This principled approach usually improves performance, but adds additional algorithmic complexity and computational costs to the training procedure. As an alternative to this, we propose learning a set of complementary default values from a large database of prior empirical results. Selecting an appropriate configuration on a new dataset then requires only a simple, efficient and embarrassingly parallel search over this set. We demonstrate the effectiveness and efficiency of the approach we propose in comparison to random search and Bayesian Optimization

    Factorizing LambdaMART for cold start recommendations

    Full text link
    Recommendation systems often rely on point-wise loss metrics such as the mean squared error. However, in real recommendation settings only few items are presented to a user. This observation has recently encouraged the use of rank-based metrics. LambdaMART is the state-of-the-art algorithm in learning to rank which relies on such a metric. Despite its success it does not have a principled regularization mechanism relying in empirical approaches to control model complexity leaving it thus prone to overfitting. Motivated by the fact that very often the users' and items' descriptions as well as the preference behavior can be well summarized by a small number of hidden factors, we propose a novel algorithm, LambdaMART Matrix Factorization (LambdaMART-MF), that learns a low rank latent representation of users and items using gradient boosted trees. The algorithm factorizes lambdaMART by defining relevance scores as the inner product of the learned representations of the users and items. The low rank is essentially a model complexity controller; on top of it we propose additional regularizers to constraint the learned latent representations that reflect the user and item manifolds as these are defined by their original feature based descriptors and the preference behavior. Finally we also propose to use a weighted variant of NDCG to reduce the penalty for similar items with large rating discrepancy. We experiment on two very different recommendation datasets, meta-mining and movies-users, and evaluate the performance of LambdaMART-MF, with and without regularization, in the cold start setting as well as in the simpler matrix completion setting. In both cases it outperforms in a significant manner current state of the art algorithms
    corecore