970 research outputs found
Statistical methods for outlier detection in high-dimensional data
Unsupervised outlier detection is a crucial issue in statistics. In the industrial context of fault detection, this task is of great importance for ensuring high-quality production. With the exponential increase in the number of measurements on electronic components, the problem of high-dimensional data arises when identifying outlying observations. To address it, the company ippon innovation, a specialist in industrial statistics and anomaly detection, partnered with the TSE-R research laboratory by financing this thesis. The first chapter presents the quality control context and the procedures already in place, mainly in semiconductor companies serving the automotive industry. Because these practices do not meet the new requirements of high-dimensional data, other solutions need to be considered. The remainder of the chapter summarizes existing unsupervised multivariate methods for outlier detection, with particular emphasis on those that handle high-dimensional data. Chapter 2 shows theoretically that the well-known Mahalanobis distance is not suited to detecting outliers that lie in a low-dimensional subspace when the number of variables is large. In this context, the Invariant Coordinate Selection (ICS) method is introduced as an interesting alternative for highlighting the structure of outlyingness. A methodology for selecting only the relevant components is proposed, and its performance is compared with benchmark methods in simulations and on real industrial data sets.
This new procedure has been implemented in an R package, ICSOutlier, presented in Chapter 3, and in an R Shiny application (package ICSShiny) that makes it simpler and more attractive to use. A direct consequence of the growing number of dimensions is that multivariate scatter estimators become singular as soon as some variables are collinear or their number exceeds the number of individuals. However, the definition of ICS by Tyler et al. (2009) relies on positive definite scatter estimators. Chapter 4 proposes three ways of adapting the ICS method to singular scatter matrices and theoretically investigates their properties, in particular the question of affine invariance. Finally, the last chapter is dedicated to the algorithm developed for the company. Although the algorithm is confidential, the chapter presents the main ideas and the challenges, mostly numerical, encountered during its development.
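As a rough illustration of the Mahalanobis distance this abstract refers to, the sketch below computes it for two-dimensional toy data using the closed-form inverse of a 2x2 covariance matrix. The function names and the sample values are hypothetical and are not taken from the thesis or from the ICSOutlier package; the sketch shows neither ICS nor the high-dimensional result of Chapter 2. It does show the singularity problem in miniature: when the covariance determinant is zero, the distance is undefined.

```python
import math


def mean_and_cov_2d(points):
    """Sample mean and 2x2 sample covariance of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), ((sxx, sxy), (sxy, syy))


def mahalanobis_2d(point, mean, cov):
    """Mahalanobis distance of one point, via the closed-form 2x2 inverse."""
    (sxx, sxy), (_, syy) = cov
    det = sxx * syy - sxy * sxy
    if det <= 0:
        # Collinear variables make the covariance singular.
        raise ValueError("covariance matrix is singular; distance undefined")
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


# Hypothetical, strongly correlated toy sample (y is roughly equal to x).
sample = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1)]
mean, cov = mean_and_cov_2d(sample)

on_trend = mahalanobis_2d((6, 6), mean, cov)   # far away, but along the trend
off_trend = mahalanobis_2d((2, 4), mean, cov)  # close by, but off the trend
print(on_trend, off_trend)
```

In this toy sample the off-trend point is closer to the mean in Euclidean terms, yet it receives the larger Mahalanobis distance, because the distance accounts for the correlation between the two variables.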
Novelty, distillation, and federation in machine learning for medical imaging
The practical application of deep learning methods in the medical domain
has many challenges. Pathologies are diverse and very few examples may
be available for rare cases. Where data is collected it may lie in multiple
institutions and cannot be pooled for practical and ethical reasons. Deep
learning is powerful for image segmentation problems but ultimately its output
must be interpretable at the patient level. Although clearly not an exhaustive
list, these are the three problems tackled in this thesis.
To address the rarity of pathology I investigate novelty detection algorithms
to find outliers from normal anatomy. The problem is structured as first finding
a low-dimension embedding and then detecting outliers in that embedding
space. I evaluate for speed and accuracy several unsupervised embedding and
outlier detection methods. Data consist of Magnetic Resonance Imaging (MRI)
for interstitial lung disease for which healthy and pathological patches are
available; only the healthy patches are used in model training.
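The second stage described above (outlier detection in an embedding space, fitted on healthy samples only) can be sketched as follows. This is a minimal illustration, assuming the embeddings have already been computed; the k-nearest-neighbour score, the leave-one-out threshold and all values are hypothetical choices, not the methods evaluated in the thesis.

```python
import math


def knn_novelty_score(train, query, k=3):
    """Mean Euclidean distance from `query` to its k nearest training points."""
    dists = sorted(math.dist(query, t) for t in train)
    return sum(dists[:k]) / k


def fit_threshold(train, k=3, quantile=0.95):
    """Threshold taken from leave-one-out scores of the normal training set."""
    scores = sorted(
        knn_novelty_score(train[:i] + train[i + 1:], x, k)
        for i, x in enumerate(train)
    )
    index = min(len(scores) - 1, int(quantile * len(scores)))
    return scores[index]


# Hypothetical 2-D embeddings of healthy patches only.
healthy = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (-0.1, 0.1), (0.1, -0.1), (0.0, 0.2)]
threshold = fit_threshold(healthy)

near_score = knn_novelty_score(healthy, (0.05, 0.05))  # resembles the training data
far_score = knn_novelty_score(healthy, (3.0, 3.0))     # far from anything seen
print(near_score <= threshold, far_score > threshold)
```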
I then explore the clinical interpretability of a model output. I take related
work by the Canon team, a model providing voxel-level detection of acute
ischemic stroke signs, and deliver the Alberta Stroke Programme Early CT
Score (ASPECTS, a measure of stroke severity). The data are acute head
computed tomography volumes of suspected stroke patients. I convert from
the voxel level to the brain region level and then to the patient level through a
series of rules. Due to the real-world clinical complexity of the problem, there
are multiple sources of "truth" at each level (voxel, region and patient); I
evaluate my results appropriately against these truths.
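The voxel-to-region-to-patient conversion can be pictured with a small sketch. ASPECTS is a 10-point score from which one point is subtracted for each of ten defined regions showing early ischemic change; that part is standard. The region rule used here (a minimum fraction of flagged voxels) and its threshold are purely hypothetical stand-ins for the thesis's own rules, which the abstract does not specify.

```python
# The ten ASPECTS regions: caudate, lentiform nucleus, internal capsule,
# insula, and the cortical regions M1 to M6.
ASPECTS_REGIONS = ("C", "L", "IC", "I", "M1", "M2", "M3", "M4", "M5", "M6")


def region_is_affected(voxel_flags, min_fraction=0.1):
    """Hypothetical rule: a region is affected if enough voxels are flagged."""
    if not voxel_flags:
        return False
    return sum(voxel_flags) / len(voxel_flags) >= min_fraction


def aspects_score(voxels_by_region, min_fraction=0.1):
    """Start at 10 and subtract one point per affected region."""
    affected = [
        name
        for name in ASPECTS_REGIONS
        if region_is_affected(voxels_by_region.get(name, []), min_fraction)
    ]
    return 10 - len(affected), affected


# Hypothetical voxel-level detections (1 = flagged) for three regions.
detections = {
    "M1": [1, 1, 0, 0],
    "I": [0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],  # 1 of 12 voxels: below threshold
    "L": [1, 0, 0, 0, 0],
}
score, affected_regions = aspects_score(detections)
print(score, affected_regions)
```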
Finally, federated learning is used to train a model on data that are divided
between multiple institutions. I introduce a novel evolution of this algorithm,
dubbed "soft federated learning", that avoids the central coordinating
authority and takes into account domain shift (covariate shift) and dataset
size. I first demonstrate the key properties of these two algorithms on a series
of MNIST (handwritten digits) toy problems. Then I apply the methods to the
BraTS medical dataset, which contains MRI brain glioma scans from multiple
institutions, to compare these algorithms in a realistic setting.
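For orientation, the aggregation step of standard federated averaging, in which a central coordinator averages client parameters weighted by local dataset size, can be sketched as below. This is only the conventional baseline; it is not the "soft federated learning" variant introduced in the thesis, whose details the abstract does not give. The function name and numbers are hypothetical.

```python
def federated_average(client_weights, client_sizes):
    """Average client parameter vectors, weighted by local dataset size."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    return [
        sum(w[j] * n for w, n in zip(client_weights, client_sizes)) / total
        for j in range(n_params)
    ]


# Two hypothetical institutions with locally trained parameter vectors.
weights = [[1.0, 2.0], [3.0, 4.0]]
sizes = [100, 300]  # number of local training examples at each institution
global_weights = federated_average(weights, sizes)
print(global_weights)
```

The larger institution pulls the average toward its own parameters, which is one reason dataset size matters when institutions differ.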
Anomaly detection in smart city wireless sensor networks
This thesis proposes an intrusion detection platform that reveals attacks on smart city wireless sensor networks (WSN). The platform is designed around the needs of smart city administrators, who require access to a centralized architecture that can manage security alarms in a highly heterogeneous and distributed system. We identify the steps required, from gathering WSN data to running the intrusion detection techniques, and we evaluate whether the procedure is scalable and capable of handling typical smart city data. Moreover, we compare several anomaly detection algorithms and observe that one-class support vector machines are the most suitable multivariate technique for revealing attacks, given the requirements of this context. Finally, we propose a classification scheme to help administrators identify the types of attacks compromising their networks from the alarms raised.
Context dependent spectral unmixing.
A hyperspectral unmixing algorithm that finds multiple sets of endmembers is proposed. The algorithm, called Context Dependent Spectral Unmixing (CDSU), is a local approach that adapts the unmixing to different regions of the spectral space. It is based on a novel function that combines context identification and unmixing. This joint objective function models contexts as compact clusters and uses the linear mixing model as the basis for unmixing. Several variations of CDSU that provide additional desirable features are also proposed. First, Context Dependent Spectral Unmixing using the Mahalanobis Distance (CDSUM) offers the advantage of identifying non-spherical clusters in the high-dimensional spectral space. Second, the Cluster and Proportion Constrained Multi-Model Unmixing (CC-MMU and PC-MMU) algorithms use partial supervision information, in the form of cluster or proportion constraints, to guide the search process and narrow the space of possible solutions. The supervision information could be provided by an expert, generated by analyzing the consensus of multiple unmixing algorithms, or extracted from co-located data from a different sensor. Third, Robust Context Dependent Spectral Unmixing (RCDSU) introduces possibilistic memberships into the objective function to reduce the effect of noise and outliers in the data. Finally, the Unsupervised Robust Context Dependent Spectral Unmixing (U-RCDSU) algorithm learns the optimal number of contexts in an unsupervised way. The performance of each algorithm is evaluated using synthetic and real data. We show that the proposed methods can identify meaningful and coherent contexts, and appropriate endmembers within each context. The second main contribution of this thesis is consensus unmixing. This approach exploits the diversity and similarity of the large number of existing unmixing algorithms to identify an accurate and consistent set of endmembers in the data.
We run multiple unmixing algorithms using different parameters, and combine the resulting unmixing ensemble using consensus analysis. The extracted endmembers are those on which the multiple runs agree. The third main contribution consists of developing subpixel target detectors that rely on the proposed CDSU algorithms to adapt target detection algorithms to different contexts. A local detection statistic is computed for each context and then all scores are combined to yield a final detection score. The context dependent unmixing provides a better background description and limits target leakage, which are two essential properties for target detection algorithms.
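The linear mixing model that CDSU builds on treats each pixel spectrum as a weighted sum of endmember spectra, with non-negative proportions that sum to one. The sketch below illustrates only that model, for a single global pair of endmembers where the constrained least-squares proportion has a closed form. It does not reproduce CDSU, which learns several endmember sets jointly with the contexts. The spectra and function names are hypothetical.

```python
def mix(endmembers, proportions):
    """Linear mixing model: pixel[b] = sum_k proportions[k] * endmembers[k][b]."""
    bands = len(endmembers[0])
    return [
        sum(p * e[b] for p, e in zip(proportions, endmembers))
        for b in range(bands)
    ]


def unmix_two(pixel, e1, e2):
    """Least-squares proportions for two endmembers, constrained to [0, 1]."""
    diff = [a - b for a, b in zip(e1, e2)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("endmembers must be distinct")
    # Project (pixel - e2) onto (e1 - e2), then clip to enforce the constraints.
    a = sum((x - b) * d for x, b, d in zip(pixel, e2, diff)) / denom
    a = min(1.0, max(0.0, a))
    return a, 1.0 - a


# Hypothetical three-band endmember spectra.
e1 = [0.1, 0.4, 0.8]
e2 = [0.6, 0.5, 0.2]
pixel = mix([e1, e2], [0.3, 0.7])
p1, p2 = unmix_two(pixel, e1, e2)
print(round(p1, 6), round(p2, 6))
```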
Wind Turbine Fault Detection: an Unsupervised vs Semi-Supervised Approach
The need for renewable energy has been growing in recent years, for well-known reasons,
and wind power is no exception. Wind turbines are complex and expensive structures
that require maintenance. Condition Monitoring Systems that make use of
supervised machine learning techniques have recently been studied, and the results are
quite promising. Such systems still require the physical presence of professionals,
but they offer insight into the operating state of the machine in use, so that
maintenance interventions can be decided upon beforehand. Wind turbine failure is not an
abrupt process but a gradual one.
The main goals of this dissertation are: to compare semi-supervised methods for tackling
the problem of automatic recognition of anomalies in wind turbines; to develop an
approach combining the Mahalanobis Taguchi System (MTS) with two popular fuzzy
partitional clustering algorithms, fuzzy c-means and archetypal analysis, for the
purpose of anomaly detection; and finally to develop an experimental protocol to
comparatively study the two types of algorithms.
In this work, the algorithms Local Outlier Factor (LOF), Connectivity-based Outlier
Factor (COF), Cluster-based Local Outlier Factor (CBLOF), Histogram-based Outlier Score
(HBOS), k-nearest-neighbours (k-NN), Subspace Outlier Detection (SOD), Fuzzy c-means
(FCM), Archetypal Analysis (AA) and Local Minimum Spanning Tree (LoMST) were
explored.
The data consisted of SCADA data sets of turbine sensor data, 8 in total, from a wind
farm in the North of Portugal. Each data set comprises between 1070 and 1096 data cases
characterized by 5 features, for the years 2011, 2012 and 2013.
The analysis of the results using 7 different validity measures shows that the CBLOF
algorithm achieved the best results in the semi-supervised approach, while LoMST won in the
unsupervised scenario. The extensions of both FCM and AA gave promising results.
Reconstruction Error and Principal Component Based Anomaly Detection in Hyperspectral imagery
The rapid expansion of remote sensing and information collection capabilities demands methods to highlight interesting or anomalous patterns within an overabundance of data. This research addresses this issue for hyperspectral imagery (HSI). Two new reconstruction-based HSI anomaly detectors are outlined: one using principal component analysis (PCA), and the other a form of non-linear PCA called logistic principal component analysis. Two very effective, yet relatively simple, modifications to the autonomous global anomaly detector are also presented, improving algorithm performance and enabling receiver operating characteristic analysis. A novel technique for HSI anomaly detection dubbed multiple PCA is introduced and found to perform as well as or better than existing detectors on HYDICE data while using only linear deterministic methods. Finally, a response surface based optimization is performed on algorithm parameters so as to achieve consistent, desired algorithm performance.
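The general idea behind reconstruction-based detection with PCA is to project each observation onto the leading principal components, reconstruct it, and score it by the residual. The sketch below is a generic illustration with a single component estimated by power iteration. It is not one of the detectors proposed in this work, and the data, names and iteration count are hypothetical.

```python
import math


def first_principal_component(rows, iterations=200):
    """Mean and leading covariance eigenvector, estimated by power iteration.

    Assumes the all-ones starting vector is not orthogonal to that eigenvector.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def reconstruction_error(row, mean, component):
    """Distance between a row and its projection onto the component line."""
    centered = [x - m for x, m in zip(row, mean)]
    score = sum(c * u for c, u in zip(centered, component))
    residual = [c - score * u for c, u in zip(centered, component)]
    return math.sqrt(sum(x * x for x in residual))


# Hypothetical two-band background pixels lying close to a line.
background = [(1, 2.1), (2, 3.9), (3, 6.0), (4, 8.1), (5, 9.9)]
mean, component = first_principal_component(background)

typical_error = reconstruction_error((2.5, 5.0), mean, component)
anomaly_error = reconstruction_error((3.0, 1.0), mean, component)
print(typical_error < anomaly_error)
```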