Search CORE

970 research outputs found

Méthodes statistiques de détection d’observations atypiques pour des données en grande dimension

Author: Archimbaud Aurore
Publication venue
Publication date: 26/01/2018
Field of study

La détection d’observations atypiques de manière non-supervisée est un enjeu crucial dans la pratique de la statistique. Dans le domaine de la détection de défauts industriels, cette tâche est d’une importance capitale pour assurer une production de haute qualité. Avec l’accroissement exponentiel du nombre de mesures effectuées sur les composants électroniques, la problématique de la grande dimension se pose lors de la recherche d’anomalies. Pour relever ce challenge, l’entreprise ippon innovation, spécialiste en statistique industrielle et détection d’anomalies, s’est associée au laboratoire de recherche TSE-R en finançant ce travail de thèse. Le premier chapitre commence par présenter le contexte du contrôle de qualité et les différentes procédures déjà mises en place, principalement dans les entreprises de semi-conducteurs pour l’automobile. Comme ces pratiques ne répondent pas aux nouvelles attentes requises par le traitement de données en grande dimension, d’autres solutions doivent être envisagées. La suite du chapitre résume l’ensemble des méthodes multivariées et non supervisées de détection d’observations atypiques existantes, en insistant tout particulièrement sur celles qui gèrent des données en grande dimension. Le Chapitre 2 montre théoriquement que la très connue distance de Mahalanobis n’est pas adaptée à la détection d’anomalies si celles-ci sont contenues dans un sous-espace de petite dimension alors que le nombre de variables est grand.Dans ce contexte, la méthode Invariant Coordinate Selection (ICS) est alors introduite comme une alternative intéressante à la mise en évidence de la structure des données atypiques. Une méthodologie pour sélectionner seulement les composantes d’intérêt est proposée et ses performances sont comparées aux standards habituels sur des simulations ainsi que sur des exemples réels industriels. Cette nouvelle procédure a été mise en oeuvre dans un package R, ICSOutlier, présenté dans le Chapitre 3 ainsi que dans une application R shiny (package ICSShiny) qui rend son utilisation plus simple et plus attractive.Une des conséquences directes de l’augmentation du nombre de dimensions est la singularité des estimateurs de dispersion multivariés, dès que certaines variables sont colinéaires ou que leur nombre excède le nombre d’individus. Or, la définition d’ICS par Tyler et al. (2009) se base sur des estimateurs de dispersion définis positifs. Le Chapitre 4 envisage différentes pistes pour adapter le critère d’ICS et investigue de manière théorique les propriétés de chacune des propositions présentées. La question de l’affine invariance de la méthode est en particulier étudiée. Enfin le dernier chapitre, se consacre à l’algorithme développé pour l’entreprise. Bien que cet algorithme soit confidentiel, le chapitre donne les idées générales et précise les challenges relevés, notamment numériques.The unsupervised outlier detection is a crucial issue in statistics. More specifically, in the industrial context of fault detection, this task is of great importance for ensuring a high quality production. With the exponential increase in the number of measurements on electronic components, the concern of high dimensional data arises in the identification of outlying observations. The ippon innovation company, an expert in industrial statistics and anomaly detection, wanted to deal with this new situation. So, it collaborated with the TSE-R research laboratory by financing this thesis work. The first chapter presents the quality control context and the different procedures mainly used in the automotive industry of semiconductors. However, these practices do not meet the new expectations required in dealing with high dimensional data, so other solutions need to be considered. The remainder of the chapter summarizes unsupervised multivariate methods for outlier detection, with a particular emphasis on those dealing with high dimensional data. Chapter 2 demonstrates that the well-known Mahalanobis distance presents some difficulties to detect the outlying observations that lie in a smaller subspace while the number of variables is large. In this context, the Invariant Coordinate Selection (ICS) method is introduced as an interesting alternative for highlighting the structure of outlierness. A methodology for selecting only the relevant components is proposed. A simulation study provides a comparison with benchmark methods. The performance of our proposal is also evaluated on real industrial data sets. This new procedure has been implemented in an R package, ICSOutlier, presented in Chapter 3, and in an R shiny application (package ICSShiny) that makes it more user-friendly. When the number of dimensions increases, the multivariate scatter matrices turn out to be singular as soon as some variables are collinear or if their number exceeds the number of individuals. However, in the presentation of ICS by Tyler et al. (2009), the scatter estimators are defined as positive definite matrices. Chapter 4 proposes three different ways for adapting the ICS method to singular scatter matrices and theoretically investigates their properties. The question of affine invariance is analyzed in particular. Finally, the last chapter is dedicated to the algorithm developed for the company. Although the algorithm is confidential, the chapter presents the main ideas and the challenges, mostly numerical, encountered during its development

Toulouse Capitole Publications

Novelty, distillation, and federation in machine learning for medical imaging

Author: Daykin Matthew
Publication venue: 'Heriot-Watt University'
Publication date: 01/04/2020
Field of study

The practical application of deep learning methods in the medical domain has many challenges. Pathologies are diverse and very few examples may be available for rare cases. Where data is collected it may lie in multiple institutions and cannot be pooled for practical and ethical reasons. Deep learning is powerful for image segmentation problems but ultimately its output must be interpretable at the patient level. Although clearly not an exhaustive list, these are the three problems tackled in this thesis. To address the rarity of pathology I investigate novelty detection algorithms to find outliers from normal anatomy. The problem is structured as first finding a low-dimension embedding and then detecting outliers in that embedding space. I evaluate for speed and accuracy several unsupervised embedding and outlier detection methods. Data consist of Magnetic Resonance Imaging (MRI) for interstitial lung disease for which healthy and pathological patches are available; only the healthy patches are used in model training. I then explore the clinical interpretability of a model output. I take related work by the Canon team — a model providing voxel-level detection of acute ischemic stroke signs — and deliver the Alberta Stroke Programme Early CT Score (ASPECTS, a measure of stroke severity). The data are acute head computed tomography volumes of suspected stroke patients. I convert from the voxel level to the brain region level and then to the patient level through a series of rules. Due to the real world clinical complexity of the problem, there are at each level — voxel, region and patient — multiple sources of “truth”; I evaluate my results appropriately against these truths. Finally, federated learning is used to train a model on data that are divided between multiple institutions. I introduce a novel evolution of this algorithm — dubbed “soft federated learning” — that avoids the central coordinating authority, and takes into account domain shift (covariate shift) and dataset size. I first demonstrate the key properties of these two algorithms on a series of MNIST (handwritten digits) toy problems. Then I apply the methods to the BraTS medical dataset, which contains MRI brain glioma scans from multiple institutions, to compare these algorithms in a realistic setting

ROS: The Research Output Service. Heriot-Watt University Edinburgh

Méthodes statistiques de détection d’observations atypiques pour des données en grande dimension

Author: Archimbaud Aurore
Publication venue
Publication date: 26/01/2018
Field of study

Toulouse Capitole Publications

Toulouse 1 Capitole Publications

Anomaly detection in smart city wireless sensor networks

Author: Garcia Font Víctor
Publication venue: 'Fundacio per la Universitat Oberta de Catalunya'
Publication date: 01/01/2017
Field of study

Aquesta tesi proposa una plataforma de detecció d’intrusions per a revelar atacs a les xarxes de sensors sense fils (WSN, per les sigles en anglès) de les ciutats intel·ligents (smart cities). La plataforma està dissenyada tenint en compte les necessitats dels administradors de la ciutat intel·ligent, els quals necessiten accés a una arquitectura centralitzada que pugui gestionar alarmes de seguretat en un sistema altament heterogeni i distribuït. En aquesta tesi s’identifiquen els diversos passos necessaris des de la recollida de dades fins a l’execució de les tècniques de detecció d’intrusions i s’avalua que el procés sigui escalable i capaç de gestionar dades típiques de ciutats intel·ligents. A més, es comparen diversos algorismes de detecció d’anomalies i s’observa que els mètodes de vectors de suport d’una mateixa classe (one-class support vector machines) resulten la tècnica multivariant més adequada per a descobrir atacs tenint en compte les necessitats d’aquest context. Finalment, es proposa un esquema per a ajudar els administradors a identificar els tipus d’atacs rebuts a partir de les alarmes disparades.Esta tesis propone una plataforma de detección de intrusiones para revelar ataques en las redes de sensores inalámbricas (WSN, por las siglas en inglés) de las ciudades inteligentes (smart cities). La plataforma está diseñada teniendo en cuenta la necesidad de los administradores de la ciudad inteligente, los cuales necesitan acceso a una arquitectura centralizada que pueda gestionar alarmas de seguridad en un sistema altamente heterogéneo y distribuido. En esta tesis se identifican los varios pasos necesarios desde la recolección de datos hasta la ejecución de las técnicas de detección de intrusiones y se evalúa que el proceso sea escalable y capaz de gestionar datos típicos de ciudades inteligentes. Además, se comparan varios algoritmos de detección de anomalías y se observa que las máquinas de vectores de soporte de una misma clase (one-class support vector machines) resultan la técnica multivariante más adecuada para descubrir ataques teniendo en cuenta las necesidades de este contexto. Finalmente, se propone un esquema para ayudar a los administradores a identificar los tipos de ataques recibidos a partir de las alarmas disparadas.This thesis proposes an intrusion detection platform which reveals attacks in smart city wireless sensor networks (WSN). The platform is designed taking into account the needs of smart city administrators, who need access to a centralized architecture that can manage security alarms in a highly heterogeneous and distributed system. In this thesis, we identify the various necessary steps from gathering WSN data to running the detection techniques and we evaluate whether the procedure is scalable and capable of handling typical smart city data. Moreover, we compare several anomaly detection algorithms and we observe that one-class support vector machines constitute the most suitable multivariate technique to reveal attacks, taking into account the requirements in this context. Finally, we propose a classification schema to assist administrators in identifying the types of attacks compromising their networks

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Tesis Doctorals en Xarxa

The Oberta in open access

Context dependent spectral unmixing.

Author: Jenzri Hamdi
Publication venue: ThinkIR: The University of Louisville\u27s Institutional Repository
Publication date: 01/08/2014
Field of study

A hyperspectral unmixing algorithm that finds multiple sets of endmembers is proposed. The algorithm, called Context Dependent Spectral Unmixing (CDSU), is a local approach that adapts the unmixing to different regions of the spectral space. It is based on a novel function that combines context identification and unmixing. This joint objective function models contexts as compact clusters and uses the linear mixing model as the basis for unmixing. Several variations of the CDSU, that provide additional desirable features, are also proposed. First, the Context Dependent Spectral unmixing using the Mahalanobis Distance (CDSUM) offers the advantage of identifying non-spherical clusters in the high dimensional spectral space. Second, the Cluster and Proportion Constrained Multi-Model Unmixing (CC-MMU and PC-MMU) algorithms use partial supervision information, in the form of cluster or proportion constraints, to guide the search process and narrow the space of possible solutions. The supervision information could be provided by an expert, generated by analyzing the consensus of multiple unmixing algorithms, or extracted from co-located data from a different sensor. Third, the Robust Context Dependent Spectral Unmixing (RCDSU) introduces possibilistic memberships into the objective function to reduce the effect of noise and outliers in the data. Finally, the Unsupervised Robust Context Dependent Spectral Unmixing (U-RCDSU) algorithm learns the optimal number of contexts in an unsupervised way. The performance of each algorithm is evaluated using synthetic and real data. We show that the proposed methods can identify meaningful and coherent contexts, and appropriate endmembers within each context. The second main contribution of this thesis is consensus unmixing. This approach exploits the diversity and similarity of the large number of existing unmixing algorithms to identify an accurate and consistent set of endmembers in the data. We run multiple unmixing algorithms using different parameters, and combine the resulting unmixing ensemble using consensus analysis. The extracted endmembers will be the ones that have a consensus among the multiple runs. The third main contribution consists of developing subpixel target detectors that rely on the proposed CDSU algorithms to adapt target detection algorithms to different contexts. A local detection statistic is computed for each context and then all scores are combined to yield a final detection score. The context dependent unmixing provides a better background description and limits target leakage, which are two essential properties for target detection algorithms

University of Louisville

Wind Turbine Fault Detection: an Unsupervised vs Semi-Supervised Approach

Author: Cabanas Dinis Costa
Publication venue
Publication date: 01/02/2021
Field of study

The need for renewable energy has been growing in recent years for the reasons we all know, wind power is no exception. Wind turbines are complex and expensive structures and the need for maintenance exists. Conditioning Monitoring Systems that make use of supervised machine learning techniques have been recently studied and the results are quite promising. Though, such systems still require the physical presence of professionals but with the advantage of gaining insight of the operating state of the machine in use, to decide upon maintenance interventions beforehand. The wind turbine failure is not an abrupt process but a gradual one. The main goal of this dissertation is: to compare semi-supervised methods to at tack the problem of automatic recognition of anomalies in wind turbines; to develop an approach combining the Mahalanobis Taguchi System (MTS) with two popular fuzzy partitional clustering algorithms like the fuzzy c-means and archetypal analysis, for the purpose of anomaly detection; and finally to develop an experimental protocol to com paratively study the two types of algorithms. In this work, the algorithms Local Outlier Factor (LOF), Connectivity-based Outlier Factor (COF), Cluster-based Local Outlier Factor (CBLOF), Histogram-based Outlier Score (HBOS), k-nearest-neighbours (k-NN), Subspace Outlier Detection (SOD), Fuzzy c-means (FCM), Archetypal Analysis (AA) and Local Minimum Spanning Tree (LoMST) were explored. The data used consisted of SCADA data sets regarding turbine sensorial data, 8 to tal, from a wind farm in the North of Portugal. Each data set comprises between 1070 and 1096 data cases and characterized by 5 features, for the years 2011, 2012 and 2013. The analysis of the results using 7 different validity measures show that, the CBLOF al gorithm got the best results in the semi-supervised approach while LoMST won in the unsupervised scenario. The extension of both FCM and AA got promissing results.A necessidade de produzir energia renovável tem vindo a crescer nos últimos anos pelas razões que todos sabemos, a energia eólica não é excepção. As turbinas eólicas são es truturas complexas e caras e a necessidade de manutenção existe. Sistemas de Condição Monitorizada utilizando técnicas de aprendizagem supervisionada têm vindo a ser estu dados recentemente e os resultados são bastante promissores. No entanto, estes sistemas ainda exigem a presença física de profissionais, mas com a vantagem de obter informa ções sobre o estado operacional da máquina em uso, para decidir sobre intervenções de manutenção antemão. O principal objetivo desta dissertação é: comparar métodos semi-supervisionados para atacar o problema de reconhecimento automático de anomalias em turbinas eólicas; desenvolver um método que combina o Mahalanobis Taguchi System (MTS) com dois mé todos de agrupamento difuso bem conhecidos como fuzzy c-means e archetypal analysis, no âmbito de deteção de anomalias; e finalmente desenvolver um protocolo experimental onde é possível o estudo comparativo entre os dois diferentes tipos de algoritmos. Neste trabalho, os algoritmos Local Outlier Factor (LOF), Connectivity-based Outlier Factor (COF), Cluster-based Local Outlier Factor (CBLOF), Histogram-based Outlier Score (HBOS), k-nearest-neighbours (k-NN), Subspace Outlier Detection (SOD), Fuzzy c-means (FCM), Archetypal Analysis (AA) and Local Minimum Spanning Tree (LoMST) foram explorados. Os conjuntos de dados utilizados provêm do sistema SCADA, referentes a dados sen soriais de turbinas, 8 no total, com origem num parque eólico no Norte de Portugal. Cada um está compreendendido entre 1070 e 1096 observações e caracterizados por 5 caracte rísticas, para os anos 2011, 2012 e 2013. A ánalise dos resultados através de 7 métricas de validação diferentes mostraram que, o algoritmo CBLOF obteve os melhores resultados na abordagem semi-supervisionada enquanto que o LoMST ganhou na abordagem não supervisionada. A extensão do FCM e do AA originou resultados promissores

Repositório da Universidade Nova de Lisboa

Reconstruction Error and Principal Component Based Anomaly Detection in Hyperspectral imagery

Author: Jablonski James A.
Publication venue: AFIT Scholar
Publication date: 14/03/2014
Field of study

The rapid expansion of remote sensing and information collection capabilities demands methods to highlight interesting or anomalous patterns within an overabundance of data. This research addresses this issue for hyperspectral imagery (HSI). Two new reconstruction based HSI anomaly detectors are outlined: one using principal component analysis (PCA), and the other a form of non-linear PCA called logistic principal component analysis. Two very effective, yet relatively simple, modifications to the autonomous global anomaly detector are also presented, improving algorithm performance and enabling receiver operating characteristic analysis. A novel technique for HSI anomaly detection dubbed multiple PCA is introduced and found to perform as well or better than existing detectors on HYDICE data while using only linear deterministic methods. Finally, a response surface based optimization is performed on algorithm parameters such as to affect consistent desired algorithm performance

AFTI Scholar (Air Force Institute of Technology)