
    Transductive Learning for Spatial Data Classification

    Learning classifiers of spatial data presents several issues, such as the heterogeneity of spatial objects, the implicit definition of spatial relationships among objects, spatial autocorrelation, and the abundance of unlabelled data, which potentially conveys a large amount of information. The first three issues are due to the inherent structure of spatial units of analysis, which can be easily accommodated if a (multi-)relational data mining approach is considered. The fourth issue demands the adoption of a transductive setting, which aims to make predictions for a given set of unlabelled data. Transduction is also motivated by the affinity between positive autocorrelation, which typically affects spatial phenomena, and the smoothness assumption that characterizes the transductive setting. In this work, we investigate a relational approach to spatial classification in a transductive setting, and present computational solutions to the main difficulties met in this approach. In particular, a relational upgrade of the naïve Bayes classifier is proposed as a discriminative model, an iterative algorithm is designed for the transductive classification of unlabelled data, and a distance measure between relational descriptions of spatial objects is defined in order to determine the k-nearest neighbors of each example in the dataset. The computational solutions have been tested on two real-world spatial datasets. The transformation of spatial data into a multi-relational representation and the experimental results are reported and discussed.
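    The iterative transductive loop described here can be illustrated with a minimal sketch: labels propagate from labeled to unlabelled examples through nearest neighbors until the assignment stabilises. This is an illustration of the general scheme only, assuming plain vector data and Euclidean distance; the paper's actual method uses a relational naïve Bayes classifier and a distance defined over relational descriptions.

        # Sketch of iterative transductive labelling via k-NN (illustrative;
        # stands in for the paper's relational classifier and distance).
        import numpy as np

        def transduce(X_lab, y_lab, X_unl, k=5, max_iter=20):
            X = np.vstack([X_lab, X_unl])
            y = np.concatenate([y_lab, np.full(len(X_unl), -1)])  # -1 = unlabelled
            n_lab = len(y_lab)
            for _ in range(max_iter):
                changed = False
                for i in range(n_lab, len(X)):
                    d = np.linalg.norm(X - X[i], axis=1)   # distance to all examples
                    d[i] = np.inf
                    nn = np.argsort(d)[:k]                 # k nearest neighbors
                    votes = y[nn][y[nn] >= 0]              # ignore still-unlabelled ones
                    if len(votes):
                        new = np.bincount(votes).argmax()  # majority label
                        if new != y[i]:
                            y[i], changed = new, True
                if not changed:                            # labels have stabilised
                    break
            return y[n_lab:]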

    Data Programming: Creating Large Training Sets, Quickly

    Large labeled training sets are the critical building blocks of supervised learning methods and are key enablers of deep learning techniques. For some applications, creating labeled training sets is the most time-consuming and expensive part of applying machine learning. We therefore propose a paradigm for the programmatic creation of training sets called data programming, in which users express weak supervision strategies or domain heuristics as labeling functions: programs that label subsets of the data, but that are noisy and may conflict. We show that by explicitly representing this training set labeling process as a generative model, we can "denoise" the generated training set, and we establish theoretically that we can recover the parameters of these generative models in a handful of settings. We then show how to modify a discriminative loss function to make it noise-aware, and demonstrate our method over a range of discriminative models including logistic regression and LSTMs. Experimentally, on the 2014 TAC-KBP Slot Filling challenge, we show that data programming would have led to a new winning score, and that applying data programming to an LSTM model yields a TAC-KBP score almost 6 F1 points above a state-of-the-art LSTM baseline (and second place in the competition). Additionally, in initial user studies we observed that data programming may be an easier way for non-experts to create machine learning models when training data is limited or unavailable.
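    The core idea is easy to sketch: labeling functions are ordinary programs that vote on (or abstain from) each example, and their noisy votes are combined into training labels. The heuristics below are hypothetical, and a plain majority vote stands in for the generative model that data programming actually fits to estimate each function's accuracy.

        # Labeling functions vote POS/NEG or abstain; votes are then combined.
        ABSTAIN, NEG, POS = 0, -1, 1

        def lf_mentions_spouse(text):            # hypothetical heuristic
            return POS if "spouse" in text.lower() else ABSTAIN

        def lf_short_sentence(text):             # hypothetical heuristic
            return NEG if len(text.split()) < 5 else ABSTAIN

        LFS = [lf_mentions_spouse, lf_short_sentence]

        def weak_label(text):
            votes = [v for v in (lf(text) for lf in LFS) if v != ABSTAIN]
            if not votes:
                return None                      # no heuristic fired
            return POS if sum(votes) > 0 else NEG  # majority-vote stand-in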

    Semi-generative modelling: learning with cause and effect features

    We consider a case of covariate shift where prior causal inference or expert knowledge has identified some features as effects, and show how this setting, when analysed from a causal perspective, gives rise to a semi-generative modelling framework: P(Y, X_eff | X_cau).
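    The factorisation behind this framework follows from the chain rule; the final simplification assumes a causal graph X_cau → Y → X_eff with no direct edge from X_cau to X_eff, which is one natural reading of the cause/effect split (a sketch of the reasoning, not necessarily the paper's exact assumption):

        \[
        P(Y, X_{\mathrm{eff}} \mid X_{\mathrm{cau}})
          = P(Y \mid X_{\mathrm{cau}})\, P(X_{\mathrm{eff}} \mid Y, X_{\mathrm{cau}})
          = P(Y \mid X_{\mathrm{cau}})\, P(X_{\mathrm{eff}} \mid Y)
        \]

    Under this reading, the mixed form, with a discriminative factor P(Y | X_cau) and a generative factor P(X_eff | Y), is what makes the model "semi-generative".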

    Multi-relational learning, text mining, and semi-supervised learning for functional genomics

    We focus on the problem of predicting functional properties of the proteins corresponding to genes in the yeast genome. Our goal is to study the effectiveness of approaches that utilize all data sources available in this problem setting, including relational data, abstracts of research papers, and unlabeled data. We investigate a propositionalization approach which uses relational gene interaction data. We study the benefit of text classification and information extraction for utilizing a collection of scientific abstracts, and we study transduction and co-training for using unlabeled data. We report both positive and negative results on the investigated approaches. The studied tasks are the KDD Cup tasks of 2001 and 2002. The solutions we describe achieved the highest score for task 2 in 2001, fourth place for task 3 in 2001, and, in 2002, the highest score for one of the two subtasks and third place on the overall task 2.
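    Co-training, one of the techniques studied here, is easy to sketch: two classifiers trained on different feature views (hypothetically, relational interaction features and text features) take turns adding the unlabeled examples they are most confident about to the training set. A minimal sketch assuming scikit-learn and numeric views; a simplified variant, not the exact KDD Cup pipeline described above.

        import numpy as np
        from sklearn.naive_bayes import GaussianNB

        def co_train(Xa, Xb, y, Ua, Ub, rounds=10, per_round=5):
            """Xa/Xb: labeled data in views A/B; Ua/Ub: unlabeled data in both views."""
            for _ in range(rounds):
                if len(Ua) == 0:
                    break
                ca = GaussianNB().fit(Xa, y)
                cb = GaussianNB().fit(Xb, y)
                pa, pb = ca.predict_proba(Ua), cb.predict_proba(Ub)
                conf = np.maximum(pa.max(axis=1), pb.max(axis=1))
                pick = np.argsort(conf)[-per_round:]        # most confident examples
                use_a = pa[pick].max(axis=1) >= pb[pick].max(axis=1)
                new_y = np.where(use_a,                     # label from the surer view
                                 ca.classes_[pa[pick].argmax(axis=1)],
                                 cb.classes_[pb[pick].argmax(axis=1)])
                Xa, Xb = np.vstack([Xa, Ua[pick]]), np.vstack([Xb, Ub[pick]])
                y = np.concatenate([y, new_y])
                keep = np.setdiff1d(np.arange(len(Ua)), pick)
                Ua, Ub = Ua[keep], Ub[keep]
            return GaussianNB().fit(Xa, y), GaussianNB().fit(Xb, y)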

    Random Relational Rules

    In the field of machine learning, methods for learning from single-table data have received much more attention than those for learning from multi-table, or relational, data, which are generally more computationally complex. However, a significant amount of the world's data is relational. This indicates a need for algorithms that can operate efficiently on relational data and exploit the larger body of work produced in the area of single-table techniques. This thesis presents algorithms for learning from relational data that mitigate, to some extent, the complexity normally associated with such learning. All algorithms in this thesis are based on the generation of random relational rules. The assumption is that random rules enable efficient and effective relational learning, and this thesis presents evidence that this is indeed the case. To this end, a system for generating random relational rules is described, and algorithms using these rules are evaluated. These algorithms include direct classification, classification by propositionalisation, clustering, semi-supervised learning and generating random forests. The experimental results show that these algorithms perform competitively with previously published results for the datasets used, while often exhibiting lower runtime than other tested systems. This demonstrates that sufficient information for classification and clustering is retained in the rule generation process and that learning with random rules is efficient. Further applications of random rules are investigated. Propositionalisation allows single-table algorithms for classification and clustering to be applied to the resulting data, reducing the amount of relational processing required. Further results show that techniques for utilising additional unlabeled training data improve the accuracy of classification in the semi-supervised setting. The thesis also develops a novel algorithm for building random forests by making efficient use of random rules to generate trees and leaves in parallel.
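    The propositionalisation idea can be sketched compactly: each random rule is a conjunction of attribute tests over an example's related records, and an example's feature vector records which rules fire. This is an illustration under simplified assumptions (dictionaries standing in for relational tuples); the thesis generates first-order rules over genuinely multi-table data.

        import random

        def random_rule(attrs, n_conds=2, rng=random):
            # attrs maps attribute name -> list of observed values
            conds = []
            for _ in range(n_conds):
                a = rng.choice(list(attrs))              # pick an attribute
                conds.append((a, rng.choice(attrs[a])))  # attribute = value test
            # the rule fires if some related record satisfies every condition
            return lambda records: any(
                all(r.get(a) == v for a, v in conds) for r in records)

        def propositionalise(examples, attrs, n_rules=50, seed=0):
            # examples: one list of related records per example;
            # output: one boolean feature per rule, per example
            rng = random.Random(seed)
            rules = [random_rule(attrs, rng=rng) for _ in range(n_rules)]
            return [[int(rule(recs)) for rule in rules] for recs in examples]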

    Weakly Supervised and Semi-Supervised Deep Learning for Sound Event Detection

    The amount of data produced by media such as Youtube is a gold mine of information for machine learning algorithms, but one that remains out of reach until the information has been refined. Supervised algorithms require each available piece of information to be associated with a label that identifies it and makes it usable. This is tedious, slow, and costly work, performed by human annotators either voluntarily or professionally. However, the amount of information generated each day far exceeds our human annotation capacity, so we must turn to learning methods capable of using information in its raw or lightly processed form. This problem is at the heart of this thesis: the first part exploits so-called "weak" human annotations, and the second part exploits partially annotated data. Polyphonic sound event detection is a difficult problem to solve: sound events overlap, repeat, and vary in the frequency domain, even within a single category. All these difficulties make the annotation task even harder, not only for a human annotator but also for systems trained for classification. Semi-supervised audio classification, in which a substantial part of the dataset has not been annotated, is one of the solutions proposed for the huge quantity of data generated every day. Semi-supervised deep learning methods are numerous and use different mechanisms to implicitly extract information from unannotated data, making it useful and directly usable. The objectives of this thesis are twofold. First, we study and propose weakly supervised approaches to the sound event detection task, developed for our participation in task four of the DCASE international challenge, which provides realistic, weakly labelled audio recordings of domestic scenes. To solve this task, we propose two solutions based on convolutional recurrent neural networks and on statistical assumptions that constrain training. Second, we focus on semi-supervised deep learning, where the majority of the data is unannotated. We compare approaches originally developed for image classification before applying them to audio classification, obtaining a substantial improvement. We show that the most recent approaches can achieve results as good as fully supervised training with access to all the annotations.
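    Among recent semi-supervised approaches of the kind compared here, confidence-based pseudo-labelling with consistency between augmentations is representative and easy to sketch. Below is a minimal FixMatch-style training step in PyTorch, under assumed model and augmentation inputs; an illustration of this family of methods, not this thesis's exact implementation.

        import torch
        import torch.nn.functional as F

        def semi_supervised_step(model, x_lab, y_lab, x_unl_weak, x_unl_strong,
                                 threshold=0.95, lambda_u=1.0):
            # supervised loss on the labeled clips
            sup_loss = F.cross_entropy(model(x_lab), y_lab)
            # pseudo-labels from weakly augmented unlabeled clips
            with torch.no_grad():
                probs = torch.softmax(model(x_unl_weak), dim=1)
                conf, pseudo = probs.max(dim=1)
                mask = (conf >= threshold).float()   # keep confident clips only
            # consistency: strong augmentations must match the pseudo-labels
            unsup = F.cross_entropy(model(x_unl_strong), pseudo, reduction="none")
            return sup_loss + lambda_u * (unsup * mask).mean()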