
    Modern approaches for evaluating treatment effect heterogeneity from clinical trials and observational data

    In this paper we review recent advances in statistical methods for evaluating the heterogeneity of treatment effects (HTE), including subgroup identification and estimation of individualized treatment regimens, from randomized clinical trials and observational studies. We identify several types of approaches using the features introduced in Lipkovich, Dmitrienko and D'Agostino (2017) that distinguish the recommended principled methods from basic methods for HTE evaluation, which typically rely on rules of thumb and general guidelines and are often referred to as common practices. We discuss the advantages and disadvantages of various principled methods as well as common measures for evaluating their performance. We use simulated data and a case study based on a historical clinical trial to illustrate several new approaches to HTE evaluation.
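
    To make the notion of HTE estimation concrete, the sketch below shows one simple principled approach, a T-learner that fits separate outcome models per treatment arm and takes their difference as an estimate of the conditional average treatment effect. This is an illustrative Python example on simulated data, not one of the specific methods reviewed in the paper; the model choice and data-generating process are assumptions.

```python
# Minimal T-learner sketch for conditional average treatment effect (CATE)
# estimation from a randomized trial. Illustrative only; the model choice and
# the simulated data-generating process are assumptions, not the paper's methods.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def t_learner_cate(X, treatment, y):
    """Fit separate outcome models for treated and control arms and
    return the estimated CATE for every subject."""
    model_treated = GradientBoostingRegressor().fit(X[treatment == 1], y[treatment == 1])
    model_control = GradientBoostingRegressor().fit(X[treatment == 0], y[treatment == 0])
    return model_treated.predict(X) - model_control.predict(X)

# Simulated trial: the treatment effect is heterogeneous in the first covariate.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
treatment = rng.integers(0, 2, size=2000)
y = X[:, 0] + treatment * (X[:, 0] > 0) + rng.normal(scale=0.5, size=2000)

cate_hat = t_learner_cate(X, treatment, y)
print("mean estimated CATE:", cate_hat.mean())
```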

    Matching in Selective and Balanced Representation Space for Treatment Effects Estimation

    A dramatic growth in the availability of observational data is being witnessed in various domains of science and technology, which facilitates the study of causal inference. However, estimating treatment effects from observational data faces two major challenges: missing counterfactual outcomes and treatment selection bias. Matching methods are among the most widely used and fundamental approaches to estimating treatment effects, but existing matching methods perform poorly on data with high-dimensional and complex variables. We propose a feature selection representation matching (FSRM) method based on deep representation learning and matching, which maps the original covariate space into a selective, nonlinear, and balanced representation space, and then conducts matching in the learned representation space. FSRM adopts deep feature selection to minimize the influence of irrelevant variables for estimating treatment effects and incorporates a regularizer based on the Wasserstein distance to learn balanced representations. We evaluate the performance of our FSRM method on three datasets, and the results demonstrate superiority over the state-of-the-art methods. (Comment: Proceedings of the 29th ACM International Conference on Information and Knowledge Management, CIKM '20.)
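
    The core idea of matching in a learned representation space can be sketched as follows: map covariates into a representation, then impute each treated unit's counterfactual outcome from its nearest control neighbour in that space. The sketch below uses a standardized-covariate placeholder for the representation and omits FSRM's deep feature selection and Wasserstein regularizer; the data and function names are hypothetical.

```python
# Sketch of matching in a representation space to estimate the ATT.
# The "representation" here is just standardized covariates; FSRM itself learns
# a selective, balanced representation with a deep network, not reproduced here.
import numpy as np

def nearest_neighbor_att(phi, treatment, y):
    """1-NN matching on representations phi: impute each treated unit's
    counterfactual from its closest control unit, then average the differences."""
    treated, control = np.where(treatment == 1)[0], np.where(treatment == 0)[0]
    diffs = []
    for i in treated:
        d = np.linalg.norm(phi[control] - phi[i], axis=1)
        j = control[np.argmin(d)]
        diffs.append(y[i] - y[j])
    return float(np.mean(diffs))

# Hypothetical observational data with a true treatment effect of 2.0.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
treatment = rng.integers(0, 2, size=500)
y = X[:, 0] + 2.0 * treatment + rng.normal(size=500)

phi = (X - X.mean(axis=0)) / X.std(axis=0)  # placeholder representation
print("estimated ATT:", nearest_neighbor_att(phi, treatment, y))
```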

    Multimodel Approaches for Plasma Glucose Estimation in Continuous Glucose Monitoring. Development of New Calibration Algorithms

    Diabetes Mellitus (DM) embraces a group of metabolic diseases whose main characteristic is the presence of high glucose levels in blood. It is one of the diseases with the largest social and health impact, both for its prevalence and for the consequences of the chronic complications it implies. One of the research lines aimed at improving the quality of life of people with diabetes has a technical focus. It involves several lines of research, including the development and improvement of devices to estimate plasma glucose "online": continuous glucose monitoring systems (CGMS), both invasive and non-invasive. These devices estimate plasma glucose from sensor measurements in compartments alternative to blood. Current commercially available CGMS are minimally invasive and offer an estimation of plasma glucose from measurements in the interstitial fluid. CGMS are a key component of the technical approach to building the artificial pancreas, which aims at closing the loop in combination with an insulin pump. Yet the accuracy of current CGMS is still poor, and it may partly depend on the low performance of the implemented calibration algorithm (CA). In addition, the sensor-to-patient sensitivity differs between patients and also for the same patient over time. It is clear, then, that the development of new, efficient calibration algorithms for CGMS is an interesting and challenging problem. The indirect measurement of plasma glucose through interstitial glucose is a main confounder of CGMS accuracy. Many components take part in the glucose transport dynamics. Indeed, physiology might suggest the existence of different local behaviors in the glucose transport process. For this reason, local modeling techniques may be the best option for the structure of the desired CA. Thus, similar input samples are represented by the same local model. The integration of all of them, considering the input regions where they are valid, is the final model of the whole data set. Clustering is […] Barceló Rico, F. (2012). Multimodel Approaches for Plasma Glucose Estimation in Continuous Glucose Monitoring. Development of New Calibration Algorithms [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/17173
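
    As a rough illustration of the local-modeling idea behind such calibration algorithms, the sketch below clusters a hypothetical sensor feature space and fits one linear plasma-glucose model per cluster, routing new samples to the model of their cluster. It is a minimal sketch under assumed features and a simulated relationship, not the calibration algorithms developed in the thesis.

```python
# Local-model calibration sketch: cluster the sensor feature space, fit a linear
# plasma-glucose model per cluster. Features (raw sensor current and its trend)
# and the simulated relationship are hypothetical placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
sensor = rng.uniform(5, 60, size=1000)    # raw sensor signal (hypothetical units)
trend = rng.normal(scale=1.0, size=1000)  # local signal trend
X = np.column_stack([sensor, trend])
glucose = 2.5 * sensor + 8.0 * trend + rng.normal(scale=10, size=1000)  # reference glucose (mg/dL)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
local_models = {k: LinearRegression().fit(X[kmeans.labels_ == k], glucose[kmeans.labels_ == k])
                for k in range(3)}

def calibrate(x_new):
    """Route each new sample to the local model of its cluster."""
    labels = kmeans.predict(x_new)
    return np.array([local_models[k].predict(x_new[i:i + 1])[0] for i, k in enumerate(labels)])

print(calibrate(X[:5]))
```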

    If interpretability is the answer, what is the question?

    Due to the ability to model even complex dependencies, machine learning (ML) can be used to tackle a broad range of (high-stakes) prediction problems. The complexity of the resulting models comes at the cost of transparency, meaning that it is difficult to understand the model by inspecting its parameters. This opacity is considered problematic since it hampers the transfer of knowledge from the model, undermines the agency of individuals affected by algorithmic decisions, and makes it more challenging to expose non-robust or unethical behaviour. To tackle the opacity of ML models, the field of interpretable machine learning (IML) has emerged. The field is motivated by the idea that if we could understand the model's behaviour -- either by making the model itself interpretable or by inspecting post-hoc explanations -- we could also expose unethical and non-robust behaviour, learn about the data generating process, and restore the agency of affected individuals. IML is not only a highly active area of research, but the developed techniques are also widely applied in both industry and the sciences. Despite the popularity of IML, the field faces fundamental criticism, questioning whether IML actually helps in tackling the aforementioned problems of ML and even whether it should be a field of research in the first place. First and foremost, IML is criticised for lacking a clear goal and, thus, a clear definition of what it means for a model to be interpretable. On a similar note, the meaning of existing methods is often unclear, and thus they may be misunderstood or even misused to hide unethical behaviour. Moreover, estimating conditional-sampling-based techniques poses a significant computational challenge. With the contributions included in this thesis, we tackle these three challenges for IML. We join a range of work by arguing that the field struggles to define and evaluate "interpretability" because incoherent interpretation goals are conflated. However, the different goals can be disentangled such that coherent requirements can inform the derivation of the respective target estimands. We demonstrate this with the examples of two interpretation contexts: recourse and scientific inference. To tackle the misinterpretation of IML methods, we suggest deriving formal interpretation rules that link explanations to aspects of the model and data. In our work, we specifically focus on interpreting feature importance. Furthermore, we collect interpretation pitfalls and communicate them to a broader audience. To efficiently estimate conditional-sampling-based interpretation techniques, we propose two methods that leverage the dependence structure in the data to simplify the estimation problems for Conditional Feature Importance (CFI) and SAGE. A causal perspective proved to be vital in tackling these challenges: first, because IML problems such as algorithmic recourse are inherently causal; second, because causality helps to disentangle the different aspects of model and data and, therefore, to distinguish the insights that different methods provide; and third, because algorithms developed for causal structure learning can be leveraged for the efficient estimation of conditional-sampling-based IML methods.
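
    As a rough illustration of what a conditional-sampling-based importance measure asks, the sketch below permutes a feature within quantile bins of its strongest correlate, so the permutation approximately respects the dependence structure, and reports the resulting loss increase. This is a crude stand-in for the thesis' CFI and SAGE estimators, with simulated data and an arbitrary model choice.

```python
# Crude conditional-feature-importance sketch: permute a feature within quantile
# bins of a correlated covariate, so the permutation roughly preserves the
# dependence structure. Not the thesis' actual CFI/SAGE estimators.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)
n = 2000
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.4 * rng.normal(size=n)  # x2 strongly depends on x1
X = np.column_stack([x1, x2])
y = x1 + rng.normal(scale=0.3, size=n)    # only x1 drives the outcome

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
baseline = mean_squared_error(y, model.predict(X))

def conditional_importance(j, cond, n_bins=10):
    """Loss increase when feature j is permuted within bins of feature `cond`."""
    X_perm = X.copy()
    edges = np.quantile(X[:, cond], np.linspace(0, 1, n_bins + 1)[1:-1])
    bins = np.digitize(X[:, cond], edges)
    for b in np.unique(bins):
        idx = np.where(bins == b)[0]
        X_perm[idx, j] = rng.permutation(X_perm[idx, j])
    return mean_squared_error(y, model.predict(X_perm)) - baseline

print("conditional importance of x2 given x1:", conditional_importance(j=1, cond=0))
```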

    Interpretable Mechanistic and Machine Learning Models for Predicting Cardiac Remodeling from Biochemical and Biomechanical Features

    Biochemical and biomechanical signals drive cardiac remodeling, which alters heart physiology and is the precursor of several cardiac diseases, the leading cause of death for most racial groups in the USA. Reversing cardiac remodeling requires medication and device-assisted treatment such as Cardiac Resynchronization Therapy (CRT), but current interventions produce highly variable responses from patient to patient. Mechanistic modeling and machine learning (ML) approaches can aid diagnosis and therapy selection using various input features. Moreover, 'interpretable' machine learning methods have helped make machine learning models fairer and more suited for clinical application. The overarching objective of this doctoral work is to develop computational models that combine an extensive array of clinically measured biochemical and biomechanical variables to enable more accurate identification of heart failure patients prone to respond positively to therapeutic interventions. In the first aim, we built an ensemble ML classification algorithm using previously acquired data from the SMART-AV CRT clinical trial. Our classification algorithm incorporated 26 patient demographic and medical history variables, 12 biomarker variables, and 18 LV functional variables, yielding correct CRT response prediction in 71% of patients. In the second aim, we employed a machine learning-based method to infer the fibrosis-related gene regulatory network from RNA-seq data from the MAGNet cohort of heart failure patients. This network identified significant interactions between transcription factors and cell synthesis outputs related to cardiac fibrosis, a critical driver of heart failure. Novel filtering methods helped us prioritize the most critical regulatory interactions for mechanistic forward simulations. In the third aim, we developed a logic-based model for the mechanistic network of cardiac fibrosis, integrating the gene regulatory network derived in the second aim into a previously constructed cardiac fibrosis signaling network model. This integrated model implemented biochemical and biomechanical reactions as ordinary differential equations based on normalized Hill functions. The model elucidated the semi-quantitative behavior of cardiac fibrosis signaling complexity by capturing multi-pathway crosstalk and feedback loops. Perturbation analysis predicted the most critical nodes in the mechanistic model. Patient-specific simulations helped identify which biochemical species highly correlate with clinical measures of patient cardiac function.
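
    A minimal sketch of the kind of ensemble classification used in the first aim is shown below: a soft-voting ensemble over tabular clinical features predicting a binary CRT response. The data are synthetic placeholders and the estimator choices are assumptions; this is not the SMART-AV model itself.

```python
# Minimal ensemble-classification sketch for CRT response on tabular clinical
# features. Synthetic placeholder data, not the SMART-AV model or its inputs.
import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 20))  # stand-ins for demographics, biomarkers, LV function
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=300) > 0).astype(int)  # CRT response label

ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000))],
    voting="soft",
)
print("cross-validated accuracy:", cross_val_score(ensemble, X, y, cv=5).mean())
```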

    Causality concepts in machine learning: heterogeneous treatment effect estimation with machine learning & model interpretation with counterfactual and semi-factual explanations

    Over decades, machine learning and causality were two separate research fields that developed independently of each other. It was not until recently that the exchange between the two intensified. This thesis comprises seven articles that contribute novel insights into the utilization of causality concepts in machine learning and highlights how both fields can benefit from one another. One part of this thesis focuses on adapting machine learning algorithms for estimating heterogeneous treatment effects. Specifically, random forest-based methods have proven to be a powerful approach to heterogeneous treatment effect estimation; however, understanding the key elements responsible for that remains an open question. To provide answers, one contribution analyzed which elements of two popular forest-based heterogeneous treatment effect estimators, causal forests and model-based forests, are beneficial in the case of real-valued outcomes. A simulation study reveals that model-based forests' simultaneous split selection based on prognostic and predictive effects is effective for randomized controlled trials, while causal forests' orthogonalization strategy is advantageous for observational data under confounding. Another contribution shows that combining these elements yields a versatile model framework applicable to a wide range of application cases: observational data with diverse outcome types, potentially under different forms of censoring. Another part focuses on two methods that leverage causality concepts to interpret machine learning models: counterfactual explanations and semi-factual explanations. Counterfactual explanations describe minimal changes in a few features required for changing a prediction, while semi-factual explanations describe maximal changes in a few features that do not change a prediction. These insights are valuable because they reveal which features do or do not affect a prediction, and they can help to object against or justify a prediction. The existence of multiple equally good counterfactual explanations and semi-factual explanations for a given instance is often overlooked in the existing literature. This is also pointed out in the first contribution of the second part, which deals with possible pitfalls of interpretation methods, potential solutions, and open issues. To address the multiplicity of counterfactual and semi-factual explanations, two contributions propose methods to generate multiple explanations: the underlying optimization problem was formalized multi-objectively for counterfactual explanations and as a hyperbox search for semi-factual explanations. Both approaches can be easily adapted to other use cases, with another contribution demonstrating how the multi-objective approach can be applied to assess counterfactual fairness. Despite the multitude of counterfactual methods proposed in recent years, the availability of methods for users of the programming language R remains extremely limited. Therefore, another contribution introduces a modular R package that facilitates the application and comparison of multiple counterfactual explanation methods.
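
    To illustrate the counterfactual-explanation concept (a small feature change that flips a prediction), the sketch below runs a naive random search around a query point. The thesis formalizes this as a multi-objective optimization problem and provides an R package; the heuristic, data, and model used here are assumptions for illustration only.

```python
# Naive counterfactual-explanation sketch: random search for a small perturbation
# that flips a classifier's prediction. Illustrates the concept only; the thesis'
# multi-objective method and R package are not reproduced here.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(5)
X = rng.normal(size=(1000, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def counterfactual(x, target, n_tries=2000, scale=1.5):
    """Return the closest random perturbation of x whose prediction equals target."""
    best, best_dist = None, np.inf
    for _ in range(n_tries):
        candidate = x + rng.normal(scale=scale, size=x.shape)
        if model.predict(candidate.reshape(1, -1))[0] == target:
            dist = np.linalg.norm(candidate - x, ord=1)
            if dist < best_dist:
                best, best_dist = candidate, dist
    return best, best_dist

x0 = X[0]
original = model.predict(x0.reshape(1, -1))[0]
cf, dist = counterfactual(x0, target=1 - original)
print("original prediction:", original, "| L1 distance to counterfactual:", round(dist, 2))
```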

    Analytics of Sequential Time Data from Physical Assets

    Data analysis has become a necessity for industry. Working with inherited expertise only has become insufficient, expensive, not easily transferable, and mostly unavailable for new industries and facilities. Data analysis can provide decision-makers with more insight on how to manage their production, maintenance, and personnel. Data collection requires acquisition and storage of observatory information about the state of the different production assets. Data collection usually takes place over time, which results in time series of observations. Depending on the type of data records available, the type of possible analyses will differ. Data labeled with previous human experience, in terms of identifiable faults or fatigue, can be used to build models that perform the expert's task in the future by means of supervised learning. Otherwise, if no human labeling is available, data analysis can provide insights about similar observations or visualize these similarities through unsupervised learning. Both are challenging types of analyses. The challenges are two-fold: the first originates from the data and its adequacy, and the other is selecting the type of analysis, which is a decision made by the analyst. Data challenges are due to the substantial number of unknown sources of variation inherent in the collected data, which may sometimes include human errors. Deciding upon the type of modelling is another issue, as each model has its own assumptions, parameters to tune, and limitations. This thesis proposes four new types of time-series analysis, two of which are supervised, requiring data labelled with certain events such as failures, and two of which are unsupervised, requiring no such labelling. These analysis techniques are tested and applied on various industrial applications, namely road maintenance, bearing outer race failure detection, cutting tool failure prediction, and turbo engine failure prediction. These techniques target minimizing the burden of choice laid on the analyst working with industrial data by providing reliable analysis tools that require fewer choices. This in turn allows different industries to easily make use of their data without requiring much expertise. For prognostic purposes, a proposed modification to the binary Logical Analysis of Data (LAD) classifier is used to adaptively stratify survival curves into long-survivor and short-life sets. This model requires no parameter choices and relies entirely on empirical estimation.
    The proposed Logical Analysis of Survival Curves shows a 27% improvement in prediction accuracy, in terms of mean absolute error, over results obtained by well-known machine learning techniques. The other prognostic model is a new bidirectional Long Short-Term Memory (LSTM) neural network termed the Bidirectional Handshaking LSTM (BHLSTM). This model makes better use of short sequences by making a round pass through the given data. Moreover, the network is trained using a new safety-oriented objective function which forces the network to make safer predictions. Finally, since LSTM is a supervised technique, a novel approach for generating the target Remaining Useful Life (RUL) is proposed, requiring fewer assumptions than previous approaches. The proposed network architecture shows an average 18.75% decrease in the mean absolute error of predictions on the NASA turbo engine dataset. For unsupervised diagnostic purposes, a new technique for providing interpretable clustering is proposed, named Interpretable Clustering for Rule Extraction and Anomaly Detection (IC-READ). Interpretation means that the resulting clusters are formulated using simple conditional logic. This is very important when providing results to non-specialists, especially those in management, and it eases any hardware implementation if required. The proposed technique is also non-parametric, meaning no tuning is required, and it shows an average 20% improvement in cluster purity over other clustering techniques applied to 11 benchmark datasets. The resulting clusters can also be used to build an anomaly detector. The last proposed technique is a whole-series clustering approach for multivariate, variable-length time series using a modified Dynamic Time Warping (DTW) distance. The modified DTW gives higher matches to time series that have similar trends and magnitudes, rather than focusing on either property alone. This technique is also non-parametric and uses hierarchical clustering to group time series in an unsupervised fashion. This can be specifically useful for management when deciding on maintenance scheduling. It is also shown that it can be used along with Kernel Principal Component Analysis (KPCA) to visualize variable-length sequences in two-dimensional plots. The unsupervised techniques can help, in cases where there is a lot of variation within certain classes, to ease the supervised learning task by breaking it into smaller problems of the same nature.
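
    The unsupervised time-series clustering pipeline can be sketched as follows: compute pairwise DTW distances between variable-length sequences and feed them to hierarchical clustering. The sketch below uses classic DTW on synthetic one-dimensional sequences; the thesis' modified DTW, which jointly rewards similar trends and magnitudes, is not reproduced here.

```python
# Variable-length time-series clustering sketch: classic DTW distances fed into
# hierarchical clustering. Synthetic data; not the thesis' modified DTW.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def dtw(a, b):
    """Classic dynamic-programming DTW distance between 1-D sequences a and b."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

rng = np.random.default_rng(6)
series = []
for _ in range(5):                      # rising ramps of varying length
    L = rng.integers(40, 80)
    series.append(np.linspace(0, 1, L) + 0.05 * rng.normal(size=L))
for _ in range(5):                      # flat noisy signals of varying length
    L = rng.integers(40, 80)
    series.append(0.1 * rng.normal(size=L))

n = len(series)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = dtw(series[i], series[j])

labels = fcluster(linkage(squareform(dist), method="average"), t=2, criterion="maxclust")
print("cluster labels:", labels)
```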