970 research outputs found

    Méthodes statistiques de détection d’observations atypiques pour des données en grande dimension (Statistical methods for detecting outlying observations in high-dimensional data)

    Unsupervised outlier detection is a crucial issue in statistics. In the industrial context of fault detection in particular, the task is of great importance for ensuring high-quality production. With the exponential increase in the number of measurements taken on electronic components, the problem of high-dimensional data arises when searching for outlying observations. To meet this challenge, the company ippon innovation, a specialist in industrial statistics and anomaly detection, partnered with the TSE-R research laboratory by financing this thesis. The first chapter presents the quality control context and the procedures already in place, mainly in companies producing semiconductors for the automotive industry. Because these practices do not meet the new requirements of high-dimensional data, other solutions need to be considered. The remainder of the chapter summarizes existing unsupervised multivariate methods for outlier detection, with particular emphasis on those that handle high-dimensional data. Chapter 2 shows theoretically that the well-known Mahalanobis distance is not well suited to detecting outliers that lie in a low-dimensional subspace when the number of variables is large. In this context, the Invariant Coordinate Selection (ICS) method is introduced as an interesting alternative for highlighting the structure of outlierness. A methodology for selecting only the relevant components is proposed, and its performance is compared with benchmark methods in a simulation study and on real industrial data sets. This new procedure has been implemented in an R package, ICSOutlier, presented in Chapter 3, and in an R Shiny application (package ICSShiny) that makes it more user-friendly. As the number of dimensions increases, multivariate scatter estimators become singular as soon as some variables are collinear or the number of variables exceeds the number of observations. However, the definition of ICS given by Tyler et al. (2009) relies on positive definite scatter estimators. Chapter 4 proposes three different ways of adapting the ICS method to singular scatter matrices and investigates their properties theoretically, with particular attention to affine invariance. Finally, the last chapter is dedicated to the algorithm developed for the company. Although the algorithm is confidential, the chapter presents the main ideas and the challenges, mostly numerical, encountered during its development.
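
    As a reminder of what the Mahalanobis distance mentioned in this abstract computes, here is a minimal sketch for two variables only. It is not the thesis's method and it is not ICS; the function name `mahalanobis_sq_2d` and the toy data are assumptions made purely for illustration.

```python
def mahalanobis_sq_2d(points):
    """Squared Mahalanobis distance of each 2-D point to the sample mean.

    Uses the sample covariance matrix (denominator n - 1) of the same points.
    """
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Entries of the 2x2 sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    if det <= 0:
        # A singular scatter matrix cannot be inverted.
        raise ValueError("singular covariance matrix")
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the closed-form inverse of the 2x2 matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical toy data: six points close to the line y = x, one point off it.
points = [(0, 0.1), (1, 0.9), (2, 2.1), (3, 2.9), (4, 4.1), (5, 4.9), (2.5, 0.0)]
d2 = mahalanobis_sq_2d(points)
most_atypical = d2.index(max(d2))
```

    The last point has unremarkable coordinates taken one at a time but breaks the correlation between the two variables, which is what the distance picks up. The explicit singularity check reflects the issue the abstract raises for collinear variables; the high-dimensional difficulties studied in Chapter 2 are not reproduced by a two-variable sketch.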

    Novelty, distillation, and federation in machine learning for medical imaging

    The practical application of deep learning methods in the medical domain faces many challenges. Pathologies are diverse, and very few examples may be available for rare cases. Where data are collected, they may lie in multiple institutions and cannot be pooled for practical and ethical reasons. Deep learning is powerful for image segmentation problems, but ultimately its output must be interpretable at the patient level. Although clearly not an exhaustive list, these are the three problems tackled in this thesis. To address the rarity of pathology, I investigate novelty detection algorithms to find outliers from normal anatomy. The problem is structured as first finding a low-dimensional embedding and then detecting outliers in that embedding space. I evaluate several unsupervised embedding and outlier detection methods for speed and accuracy. The data consist of Magnetic Resonance Imaging (MRI) for interstitial lung disease, for which healthy and pathological patches are available; only the healthy patches are used in model training. I then explore the clinical interpretability of a model output. I take related work by the Canon team (a model providing voxel-level detection of acute ischemic stroke signs) and deliver the Alberta Stroke Programme Early CT Score (ASPECTS, a measure of stroke severity). The data are acute head computed tomography volumes of suspected stroke patients. I convert from the voxel level to the brain region level and then to the patient level through a series of rules. Due to the real-world clinical complexity of the problem, there are multiple sources of “truth” at each level (voxel, region and patient); I evaluate my results appropriately against these truths. Finally, federated learning is used to train a model on data that are divided between multiple institutions. I introduce a novel evolution of this algorithm, dubbed “soft federated learning”, that avoids the central coordinating authority and takes into account domain shift (covariate shift) and dataset size. I first demonstrate the key properties of these two algorithms on a series of MNIST (handwritten digits) toy problems. Then I apply the methods to the BraTS medical dataset, which contains MRI brain glioma scans from multiple institutions, to compare these algorithms in a realistic setting.
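
    For orientation, the aggregation step of standard federated averaging (often called FedAvg) can be sketched as below. This is the conventional baseline with a central coordinator, not the thesis's "soft federated learning", whose details the abstract does not give. The function name `federated_average` and the numbers are assumptions for illustration.

```python
def federated_average(client_weights, client_sizes):
    """Average per-client parameter vectors, weighting each by its dataset size."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[k] * n for w, n in zip(client_weights, client_sizes)) / total
        for k in range(dim)
    ]


# Hypothetical example: two institutions with 100 and 300 training samples.
client_weights = [[1.0, 0.0], [3.0, 4.0]]
client_sizes = [100, 300]
global_weights = federated_average(client_weights, client_sizes)
```

    Weighting by sample count means the larger institution pulls the shared model further toward its own parameters, which is one simple way dataset size can enter the aggregation.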

    Anomaly detection in smart city wireless sensor networks

    This thesis proposes an intrusion detection platform that reveals attacks on smart city wireless sensor networks (WSN). The platform is designed around the needs of smart city administrators, who require access to a centralized architecture that can manage security alarms in a highly heterogeneous and distributed system. In this thesis, we identify the steps required, from gathering WSN data to running the intrusion detection techniques, and we evaluate whether the procedure is scalable and capable of handling typical smart city data. Moreover, we compare several anomaly detection algorithms and observe that one-class support vector machines are the most suitable multivariate technique for revealing attacks, given the requirements of this context. Finally, we propose a classification scheme to help administrators identify the types of attacks received from the alarms raised.

    Context dependent spectral unmixing.

    A hyperspectral unmixing algorithm that finds multiple sets of endmembers is proposed. The algorithm, called Context Dependent Spectral Unmixing (CDSU), is a local approach that adapts the unmixing to different regions of the spectral space. It is based on a novel objective function that combines context identification and unmixing. This joint objective function models contexts as compact clusters and uses the linear mixing model as the basis for unmixing. Several variations of CDSU that provide additional desirable features are also proposed. First, Context Dependent Spectral Unmixing using the Mahalanobis Distance (CDSUM) offers the advantage of identifying non-spherical clusters in the high-dimensional spectral space. Second, the Cluster and Proportion Constrained Multi-Model Unmixing (CC-MMU and PC-MMU) algorithms use partial supervision information, in the form of cluster or proportion constraints, to guide the search process and narrow the space of possible solutions. The supervision information could be provided by an expert, generated by analyzing the consensus of multiple unmixing algorithms, or extracted from co-located data from a different sensor. Third, Robust Context Dependent Spectral Unmixing (RCDSU) introduces possibilistic memberships into the objective function to reduce the effect of noise and outliers in the data. Finally, the Unsupervised Robust Context Dependent Spectral Unmixing (U-RCDSU) algorithm learns the optimal number of contexts in an unsupervised way. The performance of each algorithm is evaluated using synthetic and real data. We show that the proposed methods can identify meaningful and coherent contexts, and appropriate endmembers within each context. The second main contribution of this thesis is consensus unmixing. This approach exploits the diversity and similarity of the large number of existing unmixing algorithms to identify an accurate and consistent set of endmembers in the data. We run multiple unmixing algorithms using different parameters and combine the resulting unmixing ensemble using consensus analysis. The extracted endmembers are those on which the multiple runs agree. The third main contribution consists of developing subpixel target detectors that rely on the proposed CDSU algorithms to adapt target detection algorithms to different contexts. A local detection statistic is computed for each context, and all scores are then combined to yield a final detection score. Context dependent unmixing provides a better background description and limits target leakage, two essential properties for target detection algorithms.

    Wind Turbine Fault Detection: an Unsupervised vs Semi-Supervised Approach

    The need for renewable energy has been growing in recent years for well-known reasons, and wind power is no exception. Wind turbines are complex and expensive structures that require maintenance. Condition Monitoring Systems that make use of supervised machine learning techniques have recently been studied, and the results are quite promising. Such systems still require the physical presence of professionals, but they offer insight into the operating state of the machine in use, so that maintenance interventions can be decided upon beforehand. Wind turbine failure is not an abrupt process but a gradual one. The main goals of this dissertation are: to compare semi-supervised methods for attacking the problem of automatic recognition of anomalies in wind turbines; to develop an approach combining the Mahalanobis Taguchi System (MTS) with two popular fuzzy partitional clustering algorithms, fuzzy c-means and archetypal analysis, for the purpose of anomaly detection; and finally to develop an experimental protocol for the comparative study of the two types of algorithms. In this work, the algorithms Local Outlier Factor (LOF), Connectivity-based Outlier Factor (COF), Cluster-based Local Outlier Factor (CBLOF), Histogram-based Outlier Score (HBOS), k-nearest-neighbours (k-NN), Subspace Outlier Detection (SOD), Fuzzy c-means (FCM), Archetypal Analysis (AA) and Local Minimum Spanning Tree (LoMST) were explored. The data consisted of SCADA data sets of turbine sensor data, 8 in total, from a wind farm in the north of Portugal. Each data set comprises between 1070 and 1096 data cases characterized by 5 features, for the years 2011, 2012 and 2013. The analysis of the results using 7 different validity measures shows that the CBLOF algorithm obtained the best results in the semi-supervised approach, while LoMST performed best in the unsupervised scenario. The extensions of both FCM and AA produced promising results.

    Reconstruction Error and Principal Component Based Anomaly Detection in Hyperspectral imagery

    The rapid expansion of remote sensing and information collection capabilities demands methods to highlight interesting or anomalous patterns within an overabundance of data. This research addresses this issue for hyperspectral imagery (HSI). Two new reconstruction-based HSI anomaly detectors are outlined: one using principal component analysis (PCA), and the other a form of non-linear PCA called logistic principal component analysis. Two very effective, yet relatively simple, modifications to the autonomous global anomaly detector are also presented, improving algorithm performance and enabling receiver operating characteristic analysis. A novel technique for HSI anomaly detection, dubbed multiple PCA, is introduced and found to perform as well as or better than existing detectors on HYDICE data while using only linear deterministic methods. Finally, a response surface based optimization is performed on algorithm parameters so as to achieve consistent, desired algorithm performance.
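
    The general idea behind reconstruction-based anomaly detection with PCA can be sketched as follows: project each pixel spectrum onto the leading principal axis and score it by the squared error of the reconstruction. This is a generic illustration, not the detectors described in the abstract; keeping a single component, estimating it by power iteration, and the names `first_principal_axis` and `reconstruction_errors` are all assumptions, as is the three-band toy data.

```python
import math


def first_principal_axis(centred, iters=200):
    """Estimate the leading principal axis of mean-centred rows by power iteration."""
    d = len(centred[0])
    # Unnormalised covariance matrix X^T X (scaling does not change the axis).
    cov = [[sum(r[i] * r[j] for r in centred) for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def reconstruction_errors(data):
    """Squared error of reconstructing each row from one principal component."""
    n, d = len(data), len(data[0])
    mean = [sum(r[j] for r in data) / n for j in range(d)]
    centred = [[r[j] - mean[j] for j in range(d)] for r in data]
    axis = first_principal_axis(centred)
    errors = []
    for r in centred:
        score = sum(a * b for a, b in zip(r, axis))          # projection onto the axis
        residual = [a - score * b for a, b in zip(r, axis)]  # part the axis cannot explain
        errors.append(sum(x * x for x in residual))
    return errors


# Hypothetical toy "pixels" with 3 bands: five follow one spectral shape, one does not.
pixels = [[1, 2, 3], [2, 4, 6], [3, 6, 9], [4, 8, 12], [5, 10, 15], [3, 1, 9]]
errors = reconstruction_errors(pixels)
anomaly_index = errors.index(max(errors))
```

    Pixels that follow the dominant spectral shape are reconstructed well, while the last pixel leaves a large residual. A practical detector would still have to choose the number of retained components and a decision threshold, which this sketch leaves out.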