    Student Modeling From Different Aspects

    With the wide use of online tutoring systems, researchers have become interested in mining the log data of these systems to better understand students. Various aspects of students’ learning have become the focus of study, such as modeling students’ mastery status and affect. Meanwhile, the Randomized Controlled Trial (RCT), an unbiased method for gaining insight into education, has found its way into Intelligent Tutoring Systems. First, people are curious about which settings work better. Second, such a tutoring system, with many students and teachers using it, provides an opportunity to build an RCT infrastructure underlying the system. Given the increasing interest in data mining and RCTs, this thesis focuses on these two aspects. In the first part, we analyze and mine data from ASSISTments, an online tutoring system run by a team at Worcester Polytechnic Institute. Through the data, we try to answer several questions about different aspects of students’ learning. The first question is what matters more to student modeling: skill information or student information. The second question is whether it is necessary to model students’ learning at different opportunity counts. The third question concerns the benefits of using partial credit, rather than binary credit, as a measure of students’ learning in RCTs. The fourth question focuses on the amount of time that students spend Wheel Spinning in the tutoring system. The fifth question studies the tradeoff between the mastery threshold and the time spent in the tutoring system. By answering these five questions, we both propose machine learning methodology that can be applied in educational data mining and present findings from analyzing and mining the data. In the second part, we focused on RCTs within ASSISTments. 
First, we looked at a pilot study of reassessment and relearning, which suggested a better system setting to improve students’ robust learning. Second, we proposed building an infrastructure of learning within ASSISTments, which provides opportunities to improve the whole educational environment.
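The mastery threshold and Wheel Spinning mentioned in this abstract can be illustrated with a minimal sketch. It assumes that mastery means a run of consecutive correct answers and that Wheel Spinning means failing to reach mastery within a fixed number of practice opportunities; the function names and the default values (3 in a row, 10 opportunities) are illustrative assumptions, not taken from the abstract.

```python
def opportunities_to_mastery(responses, threshold=3):
    """Return the 1-based opportunity at which `threshold` consecutive
    correct answers are first reached, or None if that never happens.
    `threshold` is an assumed mastery criterion."""
    streak = 0
    for i, correct in enumerate(responses, start=1):
        streak = streak + 1 if correct else 0
        if streak >= threshold:
            return i
    return None


def is_wheel_spinning(responses, threshold=3, max_opportunities=10):
    """Flag a student who used all `max_opportunities` (assumed cutoff)
    without reaching mastery. Shorter sequences are left undetermined
    and reported as False."""
    window = responses[:max_opportunities]
    reached = opportunities_to_mastery(window, threshold)
    return reached is None and len(responses) >= max_opportunities


history = [1, 0, 1, 1, 1]
print(opportunities_to_mastery(history, threshold=2))  # lower threshold
print(opportunities_to_mastery(history, threshold=3))  # higher threshold
```

Raising `threshold` can only delay the opportunity at which mastery is declared, which is the tradeoff between mastery threshold and time spent that the fifth question refers to.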

    A local feature engineering strategy to improve network anomaly detection

    The dramatic increase in devices and services that has characterized modern societies in recent decades, boosted by the exponential growth of ever faster network connections and the predominant use of wireless connection technologies, has created a crucial security challenge. Anomaly-based intrusion detection systems, which have long been among the most efficient solutions for detecting intrusion attempts on a network, have to face this new and more complicated scenario. Well-known problems, such as the difficulty of distinguishing legitimate activities from illegitimate ones due to their similar characteristics and their high degree of heterogeneity, have become even more complex given the increase in network activity. After providing an extensive overview of the scenario under consideration, this work proposes a Local Feature Engineering (LFE) strategy that addresses these problems through a data preprocessing step that reduces the number of possible network event patterns while improving their characterization. Unlike canonical feature engineering approaches, which take the entire dataset into account, it operates locally in the feature space of each single event. Experiments conducted on real-world data showed that this strategy, which is based on the introduction of new features and the discretization of their values, improves the performance of canonical state-of-the-art solutions.
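A minimal sketch of what a local strategy of this kind could look like is given below. The abstract states only that new features are introduced and values are discretized per event; the particular new features used here (row-wise minimum, maximum, mean and standard deviation), the assumption that inputs are already scaled to [0, 1], and all function names are illustrative choices rather than the authors' method.

```python
from statistics import mean, pstdev


def discretize(value, lo, hi, n_bins):
    """Map a value in [lo, hi] to an integer bin in 0..n_bins-1."""
    if hi <= lo:
        return 0
    ratio = (value - lo) / (hi - lo)
    return min(n_bins - 1, max(0, int(ratio * n_bins)))


def local_features(event, lo=0.0, hi=1.0, n_bins=10):
    """Extend one event with statistics computed from that event alone
    (assumed choice of new features), then discretize every value."""
    extended = list(event) + [min(event), max(event), mean(event), pstdev(event)]
    return [discretize(v, lo, hi, n_bins) for v in extended]


a = local_features([0.0, 0.5, 1.0])
b = local_features([0.01, 0.52, 0.99])
print(a)
print(a == b)  # nearby events can collapse to the same discrete pattern
```

Because each event is transformed using only its own values, no statistic of the whole dataset is needed, and slightly different events may map to the same pattern, which is one way the number of distinct patterns can shrink.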

    Adapting by copying. Towards a sustainable machine learning

    Despite the rapid growth of machine learning in the past decades, deploying automated decision-making systems in practice remains a challenge for most companies. On an average day, data scientists face substantial barriers to putting models into production. Production environments are complex ecosystems, still largely based on on-premise technology, where modifications are time-consuming and costly. Given the rapid pace at which the machine learning environment changes these days, companies struggle to stay up to date with the latest software releases, changes in regulation and the newest market trends. As a result, machine learning often fails to deliver according to expectations. More worryingly, this can result in unwanted risks for users, for the company itself and even for society as a whole, insofar as the negative impact of these risks is perpetuated in time. In this context, adaptation is an instrument that is both necessary and crucial for ensuring a sustainable deployment of industrial machine learning. This dissertation is devoted to developing theoretical and practical tools to enable adaptation of machine learning models in company production environments. More precisely, we focus on devising mechanisms to exploit the knowledge acquired by models to train future generations that are better fit to meet the stringent demands of a changing ecosystem. We introduce copying as a mechanism to replicate the decision behaviour of a model using another that presents differential characteristics, in cases where access to both the model and its training data is restricted. We discuss the theoretical implications of this methodology and show how it can be performed and evaluated in practice. 
Under the conceptual framework of actionable accountability, we also explore how copying can be used to ensure risk mitigation in circumstances where deployment of a machine learning solution results in a negative impact on individuals or organizations.
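The idea of copying can be sketched as follows, under stated assumptions: the original model is reachable only through queries, synthetic points are drawn uniformly from the unit square, and the copy is a 1-nearest-neighbour classifier. The `black_box` rule, the sampling scheme, the choice of copy model and all names are hypothetical stand-ins, not the procedure defined in the dissertation.

```python
import random


def black_box(x):
    """Stand-in for a deployed model we can only query (hypothetical)."""
    return 1 if x[0] + x[1] > 1.0 else 0


def sample_points(n, rng):
    """Draw synthetic inputs; no original training data is used."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def fit_copy(points, labels):
    """Return a 1-nearest-neighbour classifier over the synthetic set."""
    def predict(x):
        nearest = min(
            range(len(points)),
            key=lambda i: (points[i][0] - x[0]) ** 2 + (points[i][1] - x[1]) ** 2,
        )
        return labels[nearest]
    return predict


def fidelity(original, copy, points):
    """Fraction of points on which the copy agrees with the original."""
    return sum(original(p) == copy(p) for p in points) / len(points)


rng = random.Random(0)
synthetic = sample_points(500, rng)
labels = [black_box(p) for p in synthetic]      # label by querying the original
copy_model = fit_copy(synthetic, labels)
score = fidelity(black_box, copy_model, sample_points(500, rng))
print(f"agreement with the original on fresh points: {score:.3f}")
```

The copy is judged by its agreement with the original's decisions rather than by accuracy on ground-truth labels, since in this setting the true labels are assumed to be unavailable.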

    Unsupervised Intrusion Detection with Cross-Domain Artificial Intelligence Methods

    Cybercrime is a major concern for corporations, business owners, governments and citizens, and it continues to grow in spite of increasing investments in security and fraud prevention. The main challenges in this research field are detecting unknown attacks and reducing the false positive rate. The aim of this research work was to target both problems by leveraging four artificial intelligence techniques. The first technique is a novel unsupervised learning method based on skip-gram modeling. It was designed, developed and tested against a public dataset with popular intrusion patterns. A high accuracy and a low false positive rate were achieved without prior knowledge of attack patterns. The second technique is a novel unsupervised learning method based on topic modeling. It was applied to three related domains (network attacks, payment fraud, IoT malware traffic). A high accuracy was achieved in all three scenarios, even though the malicious activity differs significantly from one domain to another. The third technique is a novel unsupervised learning method based on deep autoencoders, with feature selection performed by a supervised method, random forest. The results showed that this technique can outperform other similar techniques. The fourth technique is based on an MLP neural network and is applied to alert reduction in fraud prevention. This method automates manual reviews previously done by human experts, without significantly impacting accuracy.

    Formalization and study of scoring problems in credit risk: reject inference, discretization of variables and interactions, logistic regression trees

    This manuscript deals with model-based statistical learning in the binary classification setting. As an application, credit scoring is examined at length, with special attention to its specificities. Proposed and existing approaches are illustrated on real data from Crédit Agricole Consumer Finance, a financial institution specialized in consumer loans, which financed this PhD through CIFRE funding. First, we consider the so-called reject inference problem, which aims at taking advantage of the information collected on rejected credit applicants for whom no repayment performance can be observed (i.e. unlabelled observations). This industrial problem led to a research one by reinterpreting unlabelled observations as an information loss that can be compensated for by modelling missing data. This interpretation sheds light on existing reject inference methods and allows us to conclude that none of them should be recommended, since they lack the proper modelling assumptions that would make them suitable for classical statistical model selection tools. Next, another industrial problem was tackled: the discretization of continuous features or grouping of levels of categorical features before any modelling step. This is motivated by practical (interpretability) and theoretical (predictive power) reasons. To perform these quantizations, ad hoc heuristics are often used, which are empirical and time-consuming for practitioners. They are seen here as a latent variable problem, bringing us back to a model selection problem. The high combinatorics of this model space necessitated a new cost-effective and automatic exploration strategy, which involves either a particular neural network architecture or a Stochastic-EM algorithm and gives precise statistical guarantees. Third, as an extension to the preceding problem, interactions of covariates may be introduced in the problem in order to improve the predictive performance. 
This task, until now again processed manually by practitioners and highly combinatorial, presents an increased risk of misselecting a "good" model. It is performed here with a Metropolis-Hastings sampling procedure which finds the best interactions in an automatic fashion while retaining its standard convergence properties, so that good predictive performance is guaranteed. Finally, in contrast to the preceding problems, which tackled a particular scorecard, we look at the scoring system as a whole. It generally consists of a tree-like structure composed of many scorecards (each relative to a particular population segment), which is often not optimized but rather imposed by the company's culture and/or history. Again, ad hoc industrial procedures are used, which lead to suboptimal performance. We propose some lines of approach to optimize this logistic regression tree, which result in good empirical performance and new research directions illustrating the predictive strength and interpretability of a mix of parametric and non-parametric models. This manuscript concludes with a discussion of potential scientific obstacles, among which is high dimensionality (in the number of features). The financial industry is indeed investing massively in unstructured data storage, which remains to this day largely unused for credit scoring applications. Using such data will require statistical guarantees to achieve the additional predictive performance that was hoped for.

    On Transforming Reinforcement Learning by Transformer: The Development Trajectory

    The Transformer, originally devised for natural language processing, has also achieved significant success in computer vision. Thanks to its strong expressive power, researchers are investigating ways to apply transformers to reinforcement learning (RL), and transformer-based models have shown their potential on representative RL benchmarks. In this paper, we collect and dissect recent advances in transforming RL by transformer (transformer-based RL, or TRL) in order to explore its development trajectory and future trends. We group existing developments into two categories, architecture enhancement and trajectory optimization, and examine the main applications of TRL in robotic manipulation, text-based games, navigation and autonomous driving. Architecture enhancement methods consider how to apply the powerful transformer structure to RL problems under the traditional RL framework; they model agents and environments much more precisely than deep RL methods, but they are still limited by the inherent defects of traditional RL algorithms, such as bootstrapping and the "deadly triad". Trajectory optimization methods treat RL problems as sequence modeling and train a joint state-action model over entire trajectories under the behavior cloning framework; they are able to extract policies from static datasets and fully use the long-sequence modeling capability of the transformer. Given these advancements, extensions and challenges in TRL are reviewed and proposals about future directions are discussed. We hope that this survey can provide a detailed introduction to TRL and motivate future research in this rapidly developing field.
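To make the sequence-modeling view of trajectory optimization concrete, the sketch below turns a logged trajectory into a flat token sequence. It assumes one common formulation in which each step is represented by a return-to-go, a state and an action; that layout, the undiscounted default and the function names are assumptions for illustration and are not specified in the abstract.

```python
def returns_to_go(rewards, gamma=1.0):
    """Sum of (optionally discounted) future rewards from each step on."""
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def to_token_sequence(states, actions, rewards):
    """Interleave (return-to-go, state, action) per step, the kind of
    sequence a transformer could be trained on from a static dataset."""
    sequence = []
    for r, s, a in zip(returns_to_go(rewards), states, actions):
        sequence.extend([("return", r), ("state", s), ("action", a)])
    return sequence


tokens = to_token_sequence(["s0", "s1"], ["a0", "a1"], [1.0, 2.0])
for token in tokens:
    print(token)
```

A model trained to predict the action tokens in such sequences learns from logged trajectories alone, without bootstrapped value estimates.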

    Active Learning of Classification Models from Enriched Label-related Feedback

    Our ability to learn accurate classification models from data is often limited by the number of available labeled data instances. This limitation is of particular concern when data instances need to be manually labeled by human annotators and when the labeling process carries a significant cost. Recent years have witnessed increased research interest in developing methods, along different directions, that are capable of learning models from a smaller number of examples. One such direction is active learning, which finds the most informative unlabeled instances to be labeled next. Another, more recent direction showing great promise utilizes enriched label-related feedback. In this case, the feedback from the human annotator provides additional information reflecting the relations among possible labels. The cost of such feedback is often negligible compared with the cost of instance review. The enriched label-related feedback may come in different forms. In this work, we propose, develop and study classification models for binary, multi-class and multi-label classification problems that utilize the different forms of enriched label-related feedback. We show that this new feedback can help us improve the quality of classification models compared with the standard class-label feedback. For each of the studied feedback forms, we also develop new active learning strategies for selecting the most informative unlabeled instances that are compatible with the respective feedback form, effectively combining two approaches for reducing the number of required labeled instances. We demonstrate the effectiveness of our new framework on both simulated and real-world datasets.
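As an illustration of what selecting "the most informative unlabeled instances" can mean, the sketch below uses entropy-based uncertainty sampling, a standard active learning heuristic. It is not one of the strategies developed in this work, and the function names and example probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy of a predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def most_informative(pool_probs):
    """Index of the unlabeled instance whose current prediction is the
    most uncertain, i.e. has the highest entropy."""
    return max(range(len(pool_probs)), key=lambda i: entropy(pool_probs[i]))


# Predicted class probabilities for three unlabeled instances (made up).
pool = [[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]]
print(most_informative(pool))
```

The chosen instance is sent to the annotator, the model is retrained with the new label, and the selection is repeated until the labeling budget is spent.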

    Feature selection and personalized modeling on medical adverse outcome prediction

    This thesis addresses medical adverse outcome prediction and is composed of three parts: feature selection, time-to-event prediction and personalized modeling. For feature selection, we propose a three-stage method that is an ensemble of filter, embedded and wrapper selection techniques. We combine them so as to select a set of features that is both stable and predictive while reducing the computational burden. Datasets for two adverse outcome prediction problems, 30-day hip fracture readmission and diabetic retinopathy prognosis, are derived from electronic health records and used to demonstrate the effectiveness of the proposed method. With the selected features, we investigate the application of some classical survival analysis models, namely accelerated failure time models, Cox proportional hazards regression models and mixture cure models, to adverse outcome prediction. Unlike binary classifiers, survival analysis methods consider both the status and the time-to-event information, and they provide more flexibility when we are interested in the occurrence of an adverse outcome in different time windows. Lastly, we introduce the use of personalized modeling (PM) to predict adverse outcomes based on the patients most similar to each query patient. Unlike the commonly used global modeling approach, PM builds the prediction model on a smaller but more similar patient cohort, thus leading to a more individual-based prediction and a customized risk factor profile. Both static and metric-learning distance measures are used to identify the similar patient cohort. We show that PM together with feature selection, using only similar patients, achieves better prediction performance than a one-size-fits-all model that uses data from all available patients.
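The personalized modeling idea can be sketched in a few lines, assuming a static Euclidean distance and the simplest possible local model: the adverse outcome rate among the k most similar patients. The thesis fits prediction models on the similar cohort and also uses learned metrics; the helper names, the value of k and the toy records below are illustrative assumptions.

```python
import math


def k_nearest(query, patients, k):
    """Return the k (features, outcome) records closest to the query,
    using Euclidean distance as an assumed static similarity measure."""
    return sorted(patients, key=lambda rec: math.dist(query, rec[0]))[:k]


def personalized_risk(query, patients, k=3):
    """Estimate risk from the similar cohort only, instead of from
    every available patient."""
    cohort = k_nearest(query, patients, k)
    return sum(outcome for _, outcome in cohort) / len(cohort)


# Toy records: (feature vector, adverse outcome 0/1). Values are made up.
records = [
    ((0.0, 0.0), 0), ((0.1, 0.0), 0), ((0.0, 0.2), 1),
    ((1.0, 1.0), 1), ((0.9, 1.0), 1), ((1.0, 0.8), 1),
]
global_rate = sum(outcome for _, outcome in records) / len(records)
print(global_rate, personalized_risk((0.0, 0.1), records))
```

In this toy data the global rate is the same for every query, while the cohort-based estimate changes with the query patient, which is the contrast with one-size-fits-all modeling that the abstract draws.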