
    Statistical Learning with Missing Data

    Statistical learning is a popular family of data analysis methods which has been successfully employed in biomedical research, the social sciences, public safety applications, and most data-dependent areas of research. A major goal of statistical learning methods is to construct rules which predict an outcome y from a set of predictors x, for example, predicting treatment response from a set of pre-treatment biomarkers. Accurate prediction rules for treatment response can guide health care providers in selecting the best treatment options. The support vector machine (SVM) is a statistical learning method profitably employed in a number of research areas such as biomedical computer vision tasks, drug design, and genetics. Because SVMs admit nonlinear prediction rules, they are a natural choice for analyzing data with potentially complex relationships. One drawback of SVMs is their limited means of handling missing data in the training set, yet missing data are ubiquitous in studies of health-related outcomes. In this research, we review the literature on missing data, and we summarize the scenarios in which missing data may bias statistical analysis. We also provide an overview of supervised classification methods, especially those which accommodate missing data. We pay special attention to SVMs, as this family of methods is the focus of our proposed contributions to this body of work. We propose three methods involving SVMs and missing data. The first paper proposes an EM-based solution for constructing SVMs when the training set includes observations with missing covariates. We present the method for continuous covariates, but it is applicable to discrete covariates as well. The second paper proposes weighting methods inspired by weighted estimating equations, also for the purpose of constructing SVMs when the training set includes observations with missing covariates.
The third paper considers scenarios in which class labels are missing or only partially observed, an area of study commonly called semi-supervised learning. We propose an EM-type solution for the semi-supervised learning scenario, and we apply the method to both two-class and multi-class SVMs. In each paper, the proposed methods are demonstrated in the context of a large multi-center observational study of Hepatitis C patients. Doctor of Philosophy.
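The EM-based idea in the first paper can be illustrated with a toy imputation loop: missing covariates are repeatedly re-estimated from the current completed data until the fill-in values stabilize. This is only a sketch with hypothetical data; the actual method couples the E-step to the SVM objective, whereas plain class-conditional means stand in for it here.

```python
# Sketch of an EM-style imputation loop for a training set with missing
# covariates (values of None). Missing entries start at the column mean
# (a crude initial fill-in) and are then iterated toward the mean of the
# other observations in the same class until they stop moving.

def em_impute(X, y, n_iter=20, tol=1e-8):
    """Iteratively replace None entries with class-conditional means."""
    X = [row[:] for row in X]                    # work on a copy
    missing = [(i, j) for i, row in enumerate(X)
               for j, v in enumerate(row) if v is None]
    # initialize missing entries with the overall column mean
    n_cols = len(X[0])
    for j in range(n_cols):
        obs = [row[j] for row in X if row[j] is not None]
        col_mean = sum(obs) / len(obs)
        for row in X:
            if row[j] is None:
                row[j] = col_mean
    # iterate: re-estimate each missing entry from its class, excluding itself
    for _ in range(n_iter):
        shift = 0.0
        for i, j in missing:
            cls_rows = [r for r, lbl in zip(X, y) if lbl == y[i]]
            new_val = (sum(r[j] for r in cls_rows if r is not X[i])
                       / (len(cls_rows) - 1))
            shift = max(shift, abs(new_val - X[i][j]))
            X[i][j] = new_val
        if shift < tol:
            break
    return X
```

With two classes whose second covariate differs sharply, the loop pulls the missing entry toward its own class rather than the pooled mean, which is the behavioral difference the EM treatment buys over single mean imputation.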

    Learning discriminative models with incomplete data

    Thesis (Ph. D.)--Massachusetts Institute of Technology, School of Architecture and Planning, Program in Media Arts and Sciences, 2006. Includes bibliographical references (p. 115-121). Many practical problems in pattern recognition require making inferences using multiple modalities, e.g., sensor data from video, audio, physiological changes, etc. Often in real-world scenarios there can be incompleteness in the training data. There can be missing channels due to sensor failures in multi-sensory data, and many data points in the training set might be unlabeled. Further, instead of having exact labels, we might have easy-to-obtain coarse labels that correlate with the task. Also, there can be labeling errors; for example, human annotation can lead to incorrect labels in the training data. The discriminative paradigm of classification aims to model the classification boundary directly by conditioning on the data points; however, discriminative models cannot easily handle incompleteness since the distribution of the observations is never explicitly modeled. We present a unified Bayesian framework that extends the discriminative paradigm to handle four different kinds of incompleteness. First, a solution based on a mixture of Gaussian processes is proposed for achieving sensor fusion under the problematic conditions of missing channels. Second, the framework addresses incompleteness resulting from partially labeled data using input-dependent regularization. Third, we introduce the located hidden random field (LHRF), which learns finer-level labels when only some easy-to-obtain coarse information is available. Finally, the proposed framework can handle incorrect labels, the fourth case of incompleteness. One of the advantages of the framework is that we can use different models for different kinds of label errors, providing a way to encode prior knowledge about the process.
The proposed extensions are built on top of Gaussian process classification and result in a modular framework where each component is capable of handling a different kind of incompleteness. These modules can be combined in many different ways, resulting in many different algorithms within one unified framework. We demonstrate the effectiveness of the framework on a variety of problems such as multi-sensor affect recognition, image classification, and object detection and segmentation. By Ashish Kapoor. Ph.D.
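As a rough illustration of fusing modalities when a channel drops out, the sketch below averages per-channel class posteriors and renormalizes over whichever channels are present. The thesis accomplishes this with a mixture of Gaussian processes; the fixed per-channel reliability weights and channel names here are purely hypothetical stand-ins.

```python
# Toy posterior fusion under missing channels: each available channel
# contributes its class posterior, weighted by a (hypothetical) reliability;
# missing channels are dropped and the remaining weights renormalized, so
# the fused result is still a proper distribution.

def fuse_posteriors(posteriors, weights):
    """posteriors: dict channel -> class distribution, or None if missing."""
    avail = {c: p for c, p in posteriors.items() if p is not None}
    total_w = sum(weights[c] for c in avail)
    n_classes = len(next(iter(avail.values())))
    fused = [0.0] * n_classes
    for c, p in avail.items():
        w = weights[c] / total_w          # renormalized channel weight
        for k in range(n_classes):
            fused[k] += w * p[k]
    return fused
```

A gating network in a mixture of experts plays the role of `weights`, but makes them input-dependent rather than fixed; the renormalization step is what keeps inference well-defined when a sensor fails.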

    Efficient Data Driven Multi Source Fusion

    Data/information fusion is an integral component of many existing and emerging applications; e.g., remote sensing, smart cars, Internet of Things (IoT), and Big Data, to name a few. While fusion aims to achieve better results than what any one individual input can provide, often the challenge is to determine the underlying mathematics for aggregation suitable for an application. In this dissertation, I focus on the following three aspects of aggregation: (i) efficient data-driven learning and optimization, (ii) extensions and new aggregation methods, and (iii) feature- and decision-level fusion for machine learning with applications to signal and image processing. The Choquet integral (ChI), a powerful nonlinear aggregation operator, is a parametric way (with respect to the fuzzy measure (FM)) to generate a wealth of aggregation operators. The FM has 2^N variables and N(2^(N-1)) monotonicity constraints for N inputs. As a result, learning the ChI parameters from data quickly becomes impractical for most applications. Herein, I propose a scalable learning procedure (which is linear with respect to training sample size) for the ChI that identifies and optimizes only data-supported variables. As such, the computational complexity of the learning algorithm is proportional to the complexity of the solver used. This method also includes an imputation framework to obtain scalar values for data-unsupported (aka missing) variables and a compression algorithm (lossy or lossless) for the learned variables. I also propose a genetic algorithm (GA) to optimize the ChI for non-convex, multi-modal, and/or analytical objective functions. This algorithm introduces two operators that automatically preserve the constraints; therefore, there is no need to explicitly enforce the constraints as is required by traditional GA algorithms. In addition, this algorithm provides an efficient representation of the search space with the minimal set of vertices.
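For reference, the Choquet integral itself is straightforward to evaluate once an FM is given: sort the inputs in descending order and accumulate each input times the increment of the measure over the growing chain of subsets. A minimal sketch, with illustrative source names and measures only:

```python
# Evaluate the (discrete) Choquet integral of input values h with respect
# to a fuzzy measure g. g maps each frozenset of sources to its measure,
# with g(full set) = 1; monotonicity of g is assumed, not checked.

def choquet(h, g):
    """h: dict source -> value; g: dict frozenset -> measure in [0, 1]."""
    order = sorted(h, key=h.get, reverse=True)   # sources by value, descending
    total, prev_g = 0.0, 0.0
    subset = set()
    for s in order:
        subset.add(s)                            # grow the chain of subsets
        g_now = g[frozenset(subset)]
        total += h[s] * (g_now - prev_g)
        prev_g = g_now
    return total
```

With an additive measure the ChI reduces to a weighted mean, and with every proper subset measuring 0 it reduces to the minimum, which is the sense in which one FM parameterizes a wealth of aggregation operators; the 2^N subset values in `g` are exactly the variables the dissertation's learner restricts to data-supported ones.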
Furthermore, I study different strategies for extending the fuzzy integral to missing data, and I propose a goal programming framework to aggregate inputs from heterogeneous sources for ChI learning. Last, my work in remote sensing involves visual-clustering-based band group selection and Lp-norm multiple kernel learning for feature-level fusion in hyperspectral image processing to enhance pixel-level classification.

    Intelligent Data Acquisition for Predictive Modeling in Manufacturing Systems

    Ph.D. dissertation -- Seoul National University, Department of Industrial Engineering, College of Engineering, February 2021. Advisor: Sungzoon Cho (조성준). Predictive modeling is a type of supervised learning to find the functional relationship between the input variables and the output variable. Predictive modeling is used in various ways in manufacturing systems, such as automation of visual inspection, prediction of faulty products, and result estimation of expensive inspections. To build a high-performance predictive model, it is essential to secure high-quality data. However, in manufacturing systems, it is practically impossible to acquire enough data of all the kinds needed for predictive modeling. There are three main difficulties in data acquisition in manufacturing systems. First, labeled data always comes with a cost. In many problems, labeling must be done by experienced engineers, which is costly. Second, due to inspection cost, not all inspections can be performed on all products. Because of time and monetary constraints in the manufacturing system, it is impossible to obtain all the desired inspection results. Third, changes in the manufacturing environment make data acquisition difficult. A change in the manufacturing environment causes a change in the distribution of generated data, making it impossible to obtain enough consistent data. The model then has to be trained with a small amount of data. In this dissertation, we overcome these difficulties in data acquisition through active learning, active feature-value acquisition, and domain adaptation. First, we propose an active learning framework to address the high labeling cost of wafer map pattern classification. This makes it possible to achieve higher performance at a lower labeling cost. Moreover, cost efficiency is further improved by incorporating cluster-level annotation into active learning. For the inspection cost in the fault prediction problem, we propose an active inspection framework.
By selecting products to undergo high-cost inspection with the novel uncertainty estimation method, high performance can be obtained at low inspection cost. To solve the recipe transition problem that frequently occurs in faulty wafer prediction in semiconductor manufacturing, domain adaptation methods are used. Through sequential application of unsupervised domain adaptation and semi-supervised domain adaptation, performance degradation due to recipe transition is minimized. Through experiments on real-world data, it was demonstrated that the proposed methodologies can overcome the data acquisition problems in manufacturing systems and improve the performance of the predictive models.
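The query-selection step at the heart of the proposed active learning framework can be sketched as plain uncertainty sampling: score each unlabeled wafer by the entropy of its predicted class distribution and send the most uncertain ones to an engineer for labeling. The dissertation's uncertainty estimation is more elaborate; the item names and probabilities below are hypothetical.

```python
# Minimal uncertainty sampling: rank an unlabeled pool by predictive
# entropy and return the top items up to the labeling budget.

import math

def entropy(probs):
    """Shannon entropy (nats) of a class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_queries(pool_probs, budget):
    """pool_probs: dict item_id -> predicted class distribution."""
    ranked = sorted(pool_probs, key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:budget]
```

Cluster-level annotation changes only the granularity of this loop: clusters of unlabeled wafers, rather than individual wafers, are scored and sent for a single label, which is where the extra cost efficiency comes from.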
Contents: 1. Introduction; 2. Literature Review (Review of Related Methodologies: Active Learning, Active Feature-value Acquisition, Domain Adaptation; Review of Predictive Modelings in Manufacturing: Wafer Map Pattern Classification, Fault Detection and Classification); 3. Active Learning for Wafer Map Pattern Classification (Problem Description; Proposed Method: System overview, Prediction model, Uncertainty estimation, Query wafer selection, Query wafer labeling, Model update; Experiments: Data description, Experimental design, Results and discussion); 4. Active Cluster Annotation for Wafer Map Pattern Classification (Problem Description; Proposed Method: Clustering of unlabeled data, CNN training with labeled data, Cluster-level uncertainty estimation, Query cluster selection, Cluster-level annotation; Experiments: Data description, Experimental setting, Clustering results, Classification performance, Analysis for label noise); 5. Active Inspection for Fault Prediction (Problem Description; Proposed Method: Active inspection framework, Acquisition based on Expected Prediction Change; Experiments: Data description, Fault prediction models, Experimental design, Results and discussion); 6. Adaptive Fault Detection for Recipe Transition (Problem Description; Proposed Method: Overview, Unsupervised adaptation phase, Semi-supervised adaptation phase; Experiments: Data description, Experimental setting, Performance degradation caused by recipe transition, Effect of unsupervised adaptation, Effect of semi-supervised adaptation); 7. Conclusion (Contributions; Future work). Doctor of Philosophy.

    New Approaches to Mapping Forest Conditions and Landscape Change from Moderate Resolution Remote Sensing Data across the Species-Rich and Structurally Diverse Atlantic Northern Forest of Northeastern North America

    The sustainable management of forest landscapes requires an understanding of the functional relationships between management practices, changes in landscape conditions, and ecological response. This presents a substantial need for spatial information in support of both applied research and adaptive management. Satellite remote sensing has the potential to address much of this need, but forest conditions and patterns of change remain difficult to synthesize over large areas and long time periods. Compounding this problem is error in forest attribute maps and consequent uncertainty in subsequent analyses. The research described in this document is directed at these long-standing problems. Chapter 1 demonstrates a generalizable approach to the characterization of predominant patterns of forest landscape change. Within a ~1.5 Mha northwest Maine study area, a time series of satellite-derived forest harvest maps (1973-2010) served as the basis for grouping landscape units according to time series of cumulative harvest area. Different groups reflected different harvest histories, which were linked to changes in landscape composition and configuration through time series of selected landscape metrics. Time series data resolved differences in landscape change attributable to passage of the Maine Forest Practices Act, a major change in forest policy. Our approach should be of value in supporting empirical landscape research. Perhaps the single most important source of uncertainty in the characterization of landscape conditions is over- or under-representation of class prevalence caused by prediction bias. Systematic error is similarly impactful in maps of continuous forest attributes, where regression dilution or attenuation bias causes the overestimation of low values and the underestimation of high values. In both cases, patterns of error tend to produce more homogeneous characterizations of landscape conditions.
Chapters 2 and 3 present a machine learning method designed to simultaneously reduce systematic and total error in continuous and categorical maps, respectively. By training support vector machines with a multi-objective genetic algorithm, attenuation bias was substantially reduced in regression models of tree species relative abundance (chapter 2), and prediction bias was effectively removed from classification models predicting tree species occurrence and forest disturbance (chapter 3). This approach is generalizable to other prediction problems, other regions, or other geospatial disciplines.
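The two objectives such a multi-objective search might trade off can be made concrete: total error (RMSE) and a systematic-error term measuring how far the slope of predicted on observed values falls from 1, which is the signature of attenuation. A hedged sketch of this two-term fitness, not the dissertation's exact formulation:

```python
# Two fitness terms for a candidate regression model: total error (RMSE)
# and systematic error (|1 - slope| of the predicted-vs-observed fit).
# A slope below 1 means low values are overestimated and high values
# underestimated, i.e. regression dilution / attenuation bias.

def fitness(y_true, y_pred):
    n = len(y_true)
    rmse = (sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n) ** 0.5
    mt = sum(y_true) / n
    mp = sum(y_pred) / n
    # least-squares slope of predicted on observed; 1.0 means no attenuation
    slope = (sum((t - mt) * (p - mp) for t, p in zip(y_true, y_pred))
             / sum((t - mt) ** 2 for t in y_true))
    return rmse, abs(1.0 - slope)
```

A multi-objective genetic algorithm would seek the Pareto front over these two terms, since shrinking predictions toward the mean lowers RMSE while worsening the slope term, and that tension is exactly what a single-objective learner resolves in favor of attenuation.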

    Data driven methods for updating fault detection and diagnosis system in chemical processes

    Modern industrial processes are becoming more complex, and consequently monitoring them has become a challenging task. Fault Detection and Diagnosis (FDD), as a key element of process monitoring, needs to be investigated because of its essential role in decision-making processes. Among available FDD methods, data-driven approaches are currently receiving increasing attention because of their relative simplicity of implementation. Regardless of FDD type, one of the main traits of a reliable FDD system is its ability to be updated when conditions not considered in its initial training appear in the process. These new conditions may emerge either gradually or abruptly, but they have the same level of importance, as in both cases they lead to poor FDD performance. For addressing updating tasks, some methods have been proposed, but mostly outside the research area of chemical engineering. They can be categorized into those dedicated to managing Concept Drift (CD) (which appears gradually) and those that deal with novel classes (which appear abruptly). In addition to lacking clear updating strategies, the available methods reportedly suffer from performance weaknesses and inefficient training times. Accordingly, this thesis is mainly dedicated to data-driven FDD updating in chemical processes. The proposed schemes for handling novel classes of faults are based on unsupervised methods, while for coping with CD both supervised and unsupervised updating frameworks have been investigated. Furthermore, to enhance the functionality of FDD systems, some major methods of data processing, including imputation of missing values, feature selection, and feature extension, have been investigated. The suggested algorithms and frameworks for FDD updating have been evaluated through different benchmarks and scenarios.
As a part of the results, the suggested algorithms for supervised handling of CD surpass the performance of traditional incremental learning with respect to the MGM score (a dimensionless score defined from the weighted F1 score and training time), with up to 50% improvement. This improvement is achieved by proposed algorithms that detect and forget redundant information, as well as properly adjusting the data window for timely updating and retraining of the fault detection system. Moreover, the proposed unsupervised FDD updating framework for dealing with novel faults in static and dynamic process conditions achieves up to 90% in terms of the NPP score (a dimensionless score defined from the number of correctly predicted class assignments). This result relies on an innovative framework that is able to assign samples either to new classes or to available classes by exploiting one-class classification techniques and clustering approaches.
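The window-adjustment idea behind the supervised CD handling, namely detect a rising error rate, forget the oldest (now redundant) part of the training window, then retrain, can be sketched as follows; the class name, thresholds, and window sizes are illustrative, not those of the thesis:

```python
# Drift-aware training window: track recent misclassifications, flag when
# the running error rate crosses a threshold, and shrink the window so
# retraining emphasizes post-drift samples.

from collections import deque

class DriftWindow:
    def __init__(self, max_size=100, error_threshold=0.3, recent=20):
        self.window = deque(maxlen=max_size)   # training samples
        self.errors = deque(maxlen=recent)     # recent 0/1 error flags
        self.error_threshold = error_threshold

    def add(self, sample, was_misclassified):
        self.window.append(sample)
        self.errors.append(1 if was_misclassified else 0)

    def needs_retraining(self):
        """True when the recent error rate suggests concept drift."""
        if not self.errors:
            return False
        return sum(self.errors) / len(self.errors) > self.error_threshold

    def shrink(self, keep_fraction=0.5):
        """Forget the oldest part of the window after drift is confirmed."""
        keep = int(len(self.window) * keep_fraction)
        for _ in range(len(self.window) - keep):
            self.window.popleft()
```

Forgetting redundant pre-drift data both shortens retraining time and stops stale samples from diluting the post-drift decision boundary, which is the trade-off the MGM score (weighted F1 plus training time) is designed to capture.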