54 research outputs found

    Reflexive Space. A Constructionist Model of the Russian Reflexive Marker

    This study examines the structure of the Russian Reflexive Marker (-ся/-сь) and offers a usage-based model building on Construction Grammar and a probabilistic view of linguistic structure. Traditionally, reflexive verbs are accounted for relative to non-reflexive verbs. These accounts assume that linguistic structures emerge as pairs. Furthermore, these accounts assume directionality, whereby the semantics and structure of a reflexive verb can be derived from the non-reflexive verb. However, this directionality does not necessarily hold diachronically. Additionally, the semantics and the patterns associated with a particular reflexive verb are not always shared with the non-reflexive verb. Thus, a model is proposed that can accommodate the traditional pairs as well as the possible deviations, without postulating different systems. A random sample of 2000 instances marked with the Reflexive Marker was extracted from the Russian National Corpus; the sample used in this study contains 819 unique reflexive verbs. This study moves away from the traditional pair account and introduces the concept of the Neighbor Verb. A neighbor verb exists for a reflexive verb if the two share the same phonological form excluding the Reflexive Marker. It is claimed here that the Reflexive Marker constitutes a system in Russian, and that the relation between reflexive and neighbor verbs constitutes a cross-paradigmatic relation. Furthermore, the relation between the reflexive and the neighbor verb is argued to be one of symbolic connectivity rather than directionality. Effectively, the relation holding between particular instantiations can vary. The theoretical basis of the present study builds on this assumption. Several new variables are examined in order to systematically model the variability of this symbolic connectivity, specifically the degree and strength of connectivity between items. In usage-based models, the lexicon does not constitute an unstructured list of items.
Instead, items are assumed to be interconnected in a network. This interconnectedness is defined as Neighborhood in this study. Additionally, each verb carves its own niche within the Neighborhood, and this interconnectedness is modeled through rhyme verbs, which constitute the degree of connectivity of a particular verb in the lexicon. The second component of the degree of connectivity concerns the status of a particular verb relative to its rhyme verbs. The connectivity within the neighborhood of a particular verb varies, and this variability is quantified using the Levenshtein distance. The second property of the lexical network is the strength of connectivity between items. Frequency of use has been one of the primary variables used in functional linguistics to probe this. In addition, a new variable called Constructional Entropy is introduced in this study, building on information theory. It is a quantification of the amount of information carried by a particular reflexive verb in one or more argument constructions. The results of the lexical connectivity analysis indicate that the reflexive verbs have statistically greater neighborhood distances than the neighbor verbs. This distributional property can be used to motivate the traditional observation that reflexive verbs tend to have idiosyncratic properties. A set of argument constructions, generalizations over usage patterns, is proposed for the reflexive verbs in this study. In addition to the variables associated with lexical connectivity, a number of variables proposed in the literature are explored and used as predictors in the model. The second part of this study introduces the use of a machine learning algorithm called Random Forests. The performance of the model indicates that it is capable, up to a degree, of disambiguating the proposed argument construction types of the Russian Reflexive Marker. Additionally, a global ranking of the predictors used in the model is offered.
Finally, most construction grammars assume that argument constructions form a network structure. A new method is proposed that establishes a generalization over the argument constructions, referred to as the Linking Construction. In sum, this study explores the structural properties of the Russian Reflexive Marker, and a new model is set forth that can accommodate both the traditional pairs and potential deviations from them in a principled manner.
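The two connectivity measures above, Levenshtein distance and Constructional Entropy, lend themselves to a compact illustration. The Python sketch below is ours, not the thesis's: the function names and toy counts are invented, and it assumes Constructional Entropy is standard Shannon entropy over a verb's construction frequencies.

```python
from collections import Counter
from math import log2

def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                   # deletion
                           cur[j - 1] + 1,                # insertion
                           prev[j - 1] + (ca != cb)))     # substitution
        prev = cur
    return prev[-1]

def constructional_entropy(construction_counts: Counter) -> float:
    """Shannon entropy (bits) of a verb's distribution over argument constructions."""
    total = sum(construction_counts.values())
    return -sum((n / total) * log2(n / total) for n in construction_counts.values())

# Distance between two phonologically close verb stems (markers stripped):
print(levenshtein("мыть", "быть"))  # one substitution
# A verb split evenly over two constructions carries exactly 1 bit of entropy:
print(constructional_entropy(Counter({"intransitive": 10, "passive": 10})))
```

A lower neighborhood distance would here mean denser rhyme-verb connectivity; higher entropy, a less predictable construction profile.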

    Advances in random forests with application to classification

    Thesis (MCom)--Stellenbosch University, 2016. ENGLISH SUMMARY: Since their introduction, random forests have successfully been employed in a vast array of application areas. Fairly recently, a number of algorithms that adhere to Leo Breiman’s definition of a random forest have been proposed in the literature. Breiman’s popular random forest algorithm (Forest-RI), and the related ensemble classification algorithms which followed, form the focus of this study. A review of random forest algorithms developed since the introduction of Forest-RI is given. This includes a novel taxonomy of random forest classification algorithms, based on their sources of randomization and on their deterministic modifications. Also, a visual conceptualization of contributions to random forest algorithms in the literature is provided by means of multidimensional scaling. Towards an analysis of advances in random forest algorithms, the decomposition of the expected prediction error into bias and variance components is considered. In classification, such decompositions are not as straightforward as in the case of using squared-error loss for regression. Hence, various definitions of bias and variance for classification can be found in the literature. Using a particular bias-variance decomposition, an empirical study of ensemble learners, including bagging, boosting and Forest-RI, is presented. From the empirical results, and from insights into the way in which certain mechanisms of random forests affect bias and variance, a novel random forest framework, viz. oblique random rotation forests, is proposed. Although not entirely satisfactory, the framework serves as an example of a heuristic approach towards novel proposals based on bias-variance analyses, instead of an ad hoc approach, as is often found in the literature. The analysis of comparative studies regarding advances in random forest algorithms is also considered.
It is of interest to critically evaluate the conclusions that can be drawn from these studies, and to infer whether novel random forest algorithms are found to significantly outperform Forest-RI. For this purpose, a meta-analysis is conducted in which an evaluation is given of the state of research on random forests, based on all (34) papers that could be found in which a novel random forest algorithm was proposed and compared to already existing random forest algorithms. Using the reported performances in each paper, a novel two-step procedure is proposed, which allows for multiple algorithms to be compared over multiple data sets, and across different papers. The meta-analysis results indicate that weighted voting strategies, and variable weighting in high-dimensional settings, provide significantly improved performances over the performance of Breiman’s popular Forest-RI algorithm.
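The bias-variance decomposition for 0-1 loss discussed above can be illustrated with a small sketch. The Python below follows the general shape of Domingos-style decompositions; the exact definition used in the thesis may differ, and the prediction matrix here is invented for illustration. The "main" prediction is the majority vote over resampled models, bias is disagreement between that main prediction and the truth, and variance is the models' disagreement with their own main prediction.

```python
from collections import Counter

def bias_variance_01(predictions, y_true):
    """Empirical Domingos-style decomposition under 0-1 loss.
    predictions: one list of predicted labels per resampled model.
    Returns (bias, variance), each averaged over the test points."""
    n = len(y_true)
    bias = var = 0.0
    for i in range(n):
        votes = [p[i] for p in predictions]
        main = Counter(votes).most_common(1)[0][0]          # "main" prediction
        bias += (main != y_true[i])                         # systematic error
        var += sum(v != main for v in votes) / len(votes)   # instability
    return bias / n, var / n

# Three bootstrap models' predictions on four test points:
preds = [[0, 1, 1, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
y     =  [0, 1, 1, 1]
b, v = bias_variance_01(preds, y)
print(b, v)  # the last point is systematically wrong; two points are unstable
```

Bagging-style ensembles mainly shrink the variance term, which is one reason such empirical decompositions are informative when comparing boosting, bagging and Forest-RI.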

    Exploiting random projections and sparsity with random forests and gradient boosting methods - Application to multi-label and multi-output learning, random forest model compression and leveraging input sparsity

    Within machine learning, the supervised learning field aims at modeling the input-output relationship of a system from past observations of its behavior. Decision trees characterize the input-output relationship through a series of nested "if-then-else" questions, the testing nodes, leading to a set of predictions, the leaf nodes. Several such trees are often combined for state-of-the-art performance: random forest ensembles average the predictions of randomized decision trees trained independently in parallel, while tree boosting ensembles train decision trees sequentially to refine the predictions made by the previous ones. The emergence of new applications requires supervised learning algorithms that are scalable in terms of computational power and memory space with respect to the number of inputs, outputs, and observations, without sacrificing accuracy. In this thesis, we identify three main areas where decision tree methods could be improved, for which we provide and evaluate original algorithmic solutions: (i) learning over high-dimensional output spaces, (ii) learning with large sample datasets under stringent memory constraints at prediction time, and (iii) learning over high-dimensional sparse input spaces. A first approach to solving learning tasks with a high-dimensional output space, called binary relevance or single target, is to train one decision tree ensemble per output. However, it completely neglects the potential correlations existing between the outputs. An alternative approach, called multi-output decision trees, fits a single decision tree ensemble targeting all the outputs simultaneously, assuming that all outputs are correlated. Nevertheless, both approaches have (i) exactly the same computational complexity and (ii) target extreme output correlation structures.
In our first contribution, we show how to combine random projection of the output space, a dimensionality reduction method, with the random forest algorithm, decreasing the learning time complexity. The accuracy is preserved, and may even be improved by reaching a different bias-variance tradeoff. In our second contribution, we first formally adapt the gradient boosting ensemble method to multi-output supervised learning tasks such as multi-output regression and multi-label classification. We then propose to combine single random projections of the output space with gradient boosting on such tasks, to adapt automatically to the output correlation structure. The random forest algorithm often generates large ensembles of complex models thanks to the availability of a large number of observations. However, the space complexity of such models, proportional to their total number of nodes, is often prohibitive, and therefore these models are not well suited under stringent memory constraints at prediction time. In our third contribution, we propose to compress these ensembles by solving an L1-based regularization problem over the set of indicator functions defined by all their nodes. Some supervised learning tasks have a high-dimensional but sparse input space, where each observation has only a few input variables with non-zero values. Standard decision tree implementations are not well adapted to sparse input spaces, unlike other supervised learning techniques such as support vector machines or linear models. In our fourth contribution, we show how to exploit input space sparsity algorithmically within decision tree methods. Our implementation yields a significant speed-up on both synthetic and real datasets, while leading to exactly the same model.
It also reduces the memory required to grow such models, by exploiting sparse instead of dense memory storage for the input matrix.
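The core idea of the first contribution, projecting a high-dimensional output space onto a few random directions before growing the trees, can be sketched in a few lines. This is a generic Gaussian random projection in plain Python; the thesis's actual projection schemes and parameter choices are not reproduced here, and the dimensions below are illustrative.

```python
import random

def random_projection(Y, q, seed=0):
    """Project an n x d output matrix onto q Gaussian directions (n x q).
    Pairwise distances are approximately preserved (Johnson-Lindenstrauss),
    so trees grown on the projected outputs still see the output geometry."""
    rng = random.Random(seed)
    d = len(Y[0])
    G = [[rng.gauss(0.0, 1.0 / q ** 0.5) for _ in range(q)] for _ in range(d)]
    return [[sum(y[k] * G[k][j] for k in range(d)) for j in range(q)]
            for y in Y]

# 1000-dimensional label vectors compressed to 32 dimensions; a tree ensemble
# would then be trained on Z instead of Y, cutting the per-split cost.
Y = [[random.Random(i).random() for _ in range(1000)] for i in range(5)]
Z = random_projection(Y, q=32)
```

Because split scores on projected outputs approximate split scores on the originals, learning time drops roughly by the ratio d/q while the variance of the projection contributes a different bias-variance tradeoff.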

    A machine learning-remote sensing framework for modelling water stress in Shiraz vineyards

    Thesis (MA)--Stellenbosch University, 2018. ENGLISH ABSTRACT: Water is a limited natural resource and a major environmental constraint for crop production in viticulture. The unpredictability of rainfall patterns, combined with the potentially catastrophic effects of climate change, further compounds water scarcity, presenting dire future scenarios of undersupplied irrigation systems. Major water shortages could lead to devastating losses in grape production, which would negatively affect job security and national income. It is, therefore, imperative to develop management schemes and farming practices that optimise water usage and safeguard grape production. Hyperspectral remote sensing techniques provide a solution for the monitoring of vineyard water status. Hyperspectral data, combined with the quantitative analysis of machine learning ensembles, enables the detection of water-stressed vines, thereby facilitating precision irrigation practices and ensuring quality crop yields. To this end, the thesis set out to develop a machine learning–remote sensing framework for modelling water stress in a Shiraz vineyard. The thesis comprises two components. Component one assesses the utility of terrestrial hyperspectral imagery and machine learning ensembles to detect water-stressed Shiraz vines. The Random Forest (RF) and Extreme Gradient Boosting (XGBoost) ensembles were employed to discriminate between water-stressed and non-stressed Shiraz vines. Results showed that both ensemble learners could effectively discriminate between water-stressed and non-stressed vines. When using all wavebands (p = 176), RF yielded a test accuracy of 83.3% (KHAT = 0.67), with XGBoost producing a test accuracy of 80.0% (KHAT = 0.6). Component two explores semi-automated feature selection approaches and hyperparameter value optimisation to improve the developed framework.
The utility of the Kruskal-Wallis (KW) filter, the Sequential Floating Forward Selection (SFFS) wrapper, and a Filter-Wrapper (FW) approach was evaluated. When using optimised hyperparameter values, an increase in test accuracy ranging from 0.8% to 5.0% was observed for both RF and XGBoost. In general, RF was found to outperform XGBoost. In terms of predictive competency and computational efficiency, the developed FW approach was the most successful feature selection method implemented. The developed machine learning–remote sensing framework warrants further investigation to confirm its efficacy. However, the thesis answered key research questions, with the developed framework providing a point of departure for future studies.
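The KHAT values reported above are Cohen's kappa scores, which correct raw accuracy for chance agreement between predicted and reference labels. A minimal sketch follows; the labels are made up to mirror the reported magnitudes, not taken from the thesis.

```python
def cohens_kappa(y_true, y_pred):
    """KHAT: agreement beyond chance between predicted and reference labels."""
    labels = sorted(set(y_true) | set(y_pred))
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n            # observed agreement
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n)          # chance agreement
              for c in labels)
    return (p_o - p_e) / (1 - p_e)

# Stressed (1) vs non-stressed (0) vines, ten hypothetical test samples:
truth = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
pred  = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
print(round(cohens_kappa(truth, pred), 2))  # 0.6, i.e. 80% accuracy after chance correction
```

With balanced classes, 80% raw accuracy corresponds to KHAT = 0.6, matching the relationship between the XGBoost figures quoted above.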

    An application of machine learning to explore relationships between factors of organisational silence and culture, with specific focus on predicting silence behaviours

    Research indicates that there are many individual reasons why people do not speak up when confronted with situations that may concern them within their working environment. One of the areas that requires more focused research is the role culture plays in why a person may remain silent when such situations arise. The purpose of this study is to use data science techniques to explore the patterns in a data set that would lead a person to engage in organisational silence. The main research question the thesis asks is: is machine learning a tool that social scientists can use, with respect to organisational silence and culture, to augment commonly used statistical analysis approaches in this domain? This study forms part of a larger study being run by the third supervisor of this thesis. A questionnaire was developed by organisational psychologists within this group to collect data covering six traits of silence, as well as cultural and individual attributes, that could be used to determine whether someone would engage in silence or not. This thesis explores three of those cultures to find main effects and interactions between variables that could influence silence behaviours. Data analysis was carried out on data collected in three European countries: Italy, Germany and Poland (n=774). The data analysis comprised (1) exploring the characteristics of the data and determining the validity and reliability of the questionnaire; (2) identifying a suitable classification algorithm which displayed good predictive accuracy and modelled the data well, based on eight already confirmed hypotheses from the organisational silence literature; and (3) investigating newly discovered patterns and interactions within the data that were previously not documented in the silence literature on how culture plays a role in predicting silence. It was found that all the silence constructs showed good validity, with the exception of Opportunistic Silence and Disengaged Silence.
Validation of the cultural dimensions was found to be poor for all constructs when aggregated to the individual level, with the exception of Humane Orientation Organisational Practices, Power Distance Organisational Practices, Humane Orientation Societal Practices and Power Distance Societal Practices. In addition, not all constructs were invariant across countries. For example, a number of constructs showed invariance across the Poland and Germany samples, but failed for the Italian sample. Ten models were trained to identify predictors of a binary variable: engaged in organisational silence. The two most accurate models were chosen for further analysis of the main effects and interactions within the dataset, namely Random Forest (AUC = 0.655) and Conditional Inference Forests (AUC = 0.647). The models confirmed 9 out of 16 of the known relationships, and identified three additional potential interactions within the data that were previously not documented in the silence literature. For example, Climate for Authenticity was discovered to moderate the effect of both Power Distance Societal Practices and Diffident Silence in reducing the probability of someone engaging in silence. This is the first time this instrument has been validated via statistical techniques for suitability for use across cultures. The technique of modelling the silence data using classification algorithms with Partial Dependence Plots is a novel and previously unexplored method of exploring organisational silence. In addition, the results identified new information on how culture plays a role in silence behaviours. The results also highlighted that models such as ensembles, which identify non-linear relationships without making assumptions about the data, together with visualisations depicting the interactions identified by such models, can offer new insights over and above the current toolbox of analysis techniques prevalent in social science research.
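The partial dependence technique used above works by forcing one variable to each value on a grid and averaging the model's predictions over the remaining variables. The sketch below uses a stand-in model and invented variable roles purely for illustration; it is not the thesis's fitted forest.

```python
def partial_dependence(model, X, feature, grid):
    """Average model output when column `feature` is forced to each grid value,
    marginalising over the remaining columns of X."""
    curve = []
    for v in grid:
        preds = [model([v if j == feature else x for j, x in enumerate(row)])
                 for row in X]
        curve.append(sum(preds) / len(preds))
    return curve

# Stand-in "model": probability of silence rises with power distance (feature 0)
# but is dampened by climate for authenticity (feature 1) -- a moderation effect.
model = lambda x: min(1.0, max(0.0, 0.2 + 0.6 * x[0] - 0.5 * x[0] * x[1]))
X = [[0.5, 0.1], [0.5, 0.9], [0.5, 0.5]]
pd_curve = partial_dependence(model, X, feature=0, grid=[0.0, 0.5, 1.0])
print(pd_curve)  # monotonically increasing: higher power distance, more silence
```

Plotting such curves for pairs of variables is how interactions like the Climate for Authenticity moderation described above become visible without assuming a linear model.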

    Supervised Learning under Constraints

    As supervised learning occupies a larger and larger place in our everyday life, it is met with more and more constrained settings. Dealing with those constraints is a key to fostering new progress in the field, expanding ever further the limits of machine learning, a likely necessary step to reach artificial general intelligence. Supervised learning is an inductive paradigm in which time and data are refined into knowledge, in the form of predictive models; models which, it must be conceded, can sometimes be opaque, memory-demanding and energy-consuming. Given this setting, a constraint can mean any number of things. Essentially, a constraint is anything that stands in the way of supervised learning, be it the lack of time, of memory, of data, or of understanding. Additionally, the scope of applicability of supervised learning is so vast it can appear daunting. Usefulness can be found in areas including medical analysis and autonomous driving, areas for which strong guarantees are required. All those constraints (time, memory, data, interpretability, reliability) might somewhat conflict with the traditional goal of supervised learning. In such a case, finding a balance between the constraints and the standard objective is problem-dependent, thus requiring generic solutions. Alternatively, concerns might arise after learning, in which case solutions must be developed under sub-optimal conditions, resulting in constraints adding up. An example of such a situation is trying to enforce reliability once the data is no longer available. After detailing the background in which this thesis integrates itself (what supervised learning is and why it is difficult, which algorithms will be used, and where it lands in the broader scope of knowledge), we will discuss four different scenarios. The first one is about trying to learn a good decision forest model of a limited size, without first learning a large model and then compressing it.
For that, we have developed the Globally Induced Forest (GIF) algorithm, which mixes local and global optimizations to produce accurate predictions under memory constraints in reasonable time. More specifically, the global part allows us to sidestep the redundancy inherent in traditional decision forests. It is shown that the proposed method is more than competitive with standard tree-based ensembles under corresponding constraints, and can sometimes even surpass much larger models. The second scenario corresponds to the example given above: trying to enforce reliability without data. More specifically, the focus is on out-of-distribution (OOD) detection: recognizing samples which do not come from the original distribution the model was learned from. Tackling this problem with an utter lack of data is challenging. Our investigation focuses on image classification with convolutional neural networks. Indicators which can be computed alongside the prediction with little additional cost are proposed. These indicators prove useful, stable and complementary for OOD detection. We also introduce a surprisingly simple, yet effective, summary indicator, shown to perform well across several networks and datasets. It can easily be tuned further as soon as samples become available. Overall, interesting results can be reached in all but the most severe settings, for which it was a priori doubtful to come up with a data-free solution. The third scenario relates to transferring the knowledge of a large model into a smaller one in the absence of data. To do so, we propose to leverage a collection of unlabeled data, which is easy to come by in domains such as image classification. Two schemes are proposed (and then analyzed) to provide optimal transfer. Firstly, we propose a biasing mechanism in the choice of unlabeled data to use, so that the focus is on the more relevant samples.
Secondly, we designed a teaching mechanism, applicable to almost all pairs of large and small networks, which allows for a much better knowledge transfer between the networks. Overall, good results are obtainable in decent time, provided the collection of data actually contains relevant samples. The fourth scenario tackles the problem of interpretability: what knowledge can be gleaned more or less indirectly from data. We discuss two subproblems. The first one is to showcase that GIFs (cf. supra) can be used to derive intrinsically interpretable models. The second consists of a comparative study between methods and types of models (namely decision forests and neural networks) for the specific purpose of quantifying how much each variable matters in a given problem. After a preliminary study on benchmark datasets, the analysis turns to a concrete biological problem: inferring gene regulatory networks from data. An ambivalent conclusion is reached: neural networks can be made to perform better than decision forests at prediction in almost all instances, but struggle to identify the relevant variables in some situations. It would seem that better-motivated methods need to be proposed for neural networks, especially in the face of highly non-linear problems.
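A widely used prediction-time indicator of the kind described for OOD detection is the maximum softmax probability, computable alongside the forward pass at negligible cost. The sketch below is generic, not the thesis's own indicator set; the threshold and logits are illustrative.

```python
from math import exp

def max_softmax(logits):
    """Confidence of a network's prediction: the largest softmax probability.
    Low values flag candidate out-of-distribution inputs."""
    m = max(logits)                       # subtract max for numerical stability
    exps = [exp(z - m) for z in logits]
    return max(exps) / sum(exps)

def flag_ood(logits, threshold=0.7):
    """Flag an input as OOD when the network is insufficiently confident."""
    return max_softmax(logits) < threshold

# A confidently classified in-distribution sample vs. a diffuse, suspicious one:
print(flag_ood([8.0, 1.0, 0.5]))   # peaked distribution: not flagged
print(flag_ood([1.1, 1.0, 0.9]))   # nearly uniform: flagged
```

Once even a handful of OOD samples become available, the threshold can be tuned on them, which matches the observation above that the indicator is easily refined as soon as data appears.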

    Industrial process optimisation with Data Mining techniques: wind turbine maintenance and manufacturing with laser technologies

    In this work, Data Mining techniques are employed to improve the efficiency of two industrial processes: fault diagnosis in wind turbines, and the manufacture of metal parts of complex geometry using laser technologies. The experimental validation of previous studies, which used no cross-validation and overlooked some particularities of the problems analysed, is improved upon. For wind turbine fault diagnosis, the most suitable classification technique for relating vibration measurements to the fault type is identified, and the most appropriate metric for evaluating its accuracy is defined. For the manufacture of metal parts of complex geometry, the most suitable classification technique is determined for predicting the surface quality obtained with laser surface polishing, as well as the regression technique for predicting the errors in the various geometric requirements of parts produced by 3D laser micro-milling. Ministerio de Economía y Competitividad, project TIN-2011-24046.
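The cross-validation missing from the earlier studies is straightforward to wrap around any classifier. Below is a minimal k-fold harness in plain Python; the scoring function is a placeholder, not one of the classifiers from the work above.

```python
import random

def k_fold_indices(n, k, seed=42):
    """Shuffle n sample indices and split them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(evaluate, n, k=10):
    """Average a score over k train/test splits instead of a single holdout."""
    folds = k_fold_indices(n, k)
    scores = []
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(evaluate(train, test))
    return sum(scores) / k

# With a placeholder scorer, the harness just verifies the data handling:
score = cross_validate(lambda train, test: len(test) / (len(train) + len(test)),
                       n=100, k=10)
```

Every sample is tested exactly once, so the averaged score is far less sensitive to one lucky split, which is precisely the weakness this work corrects in its predecessors.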

    UAVs for the Environmental Sciences

    This book gives an overview of the usage of UAVs in the environmental sciences, covering technical basics, data acquisition with different sensors and data processing schemes, and illustrating various examples of application.

    Essentials of Business Analytics
