252 research outputs found

    Generalizing, Decoding, and Optimizing Support Vector Machine Classification

    The classification of complex data usually requires the composition of processing steps. Here, a major challenge is the selection of optimal algorithms for preprocessing and classification. Nowadays, parts of the optimization process are automated, but expert knowledge and manual work are still required. We present three steps to address this process and ease the optimization. Namely, we take a theoretical view on classical classifiers, provide an approach to interpret the classifier together with the preprocessing, and integrate both into one framework, which enables a semiautomatic optimization of the processing chain and interfaces with numerous algorithms.

    Linear Dimensionality Reduction for Margin-Based Classification: High-Dimensional Data and Sensor Networks

    Low-dimensional statistics of measurements play an important role in detection problems, including those encountered in sensor networks. In this work, we focus on learning low-dimensional linear statistics of high-dimensional measurement data, along with decision rules defined in the low-dimensional space, in the case when the probability density of the measurements and class labels is not given but a training set of samples from this distribution is given. We pose a joint optimization problem for linear dimensionality reduction and margin-based classification, and develop a coordinate descent algorithm on the Stiefel manifold for its solution. Although the coordinate descent is not guaranteed to find the globally optimal solution, its alternating structure crucially enables us to extend it to sensor networks with a message-passing approach requiring little communication. Linear dimensionality reduction prevents overfitting when learning from finite training data. In the sensor network setting, dimensionality reduction not only prevents overfitting but also reduces power consumption due to communication. The learned reduced-dimensional space and decision rule are shown to be consistent, and their Rademacher complexity is characterized. Experimental results are presented for a variety of datasets, including those from existing sensor networks, demonstrating the potential of our methodology in comparison with other dimensionality reduction approaches. Funding: National Science Foundation (U.S.), Graduate Research Fellowship Program; United States Army Research Office (MURI funded through ARO Grant W911NF-06-1-0076); United States Air Force Office of Scientific Research (Award FA9550-06-1-0324); Shell International Exploration and Production B.V.
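    A joint problem of the kind this abstract describes is often written in a form like the one below. This is only a sketch: the symbols, the regularized form, and the choice of loss are assumptions for illustration and are not taken from the abstract.

```latex
\min_{A \in \mathbb{R}^{D \times d},\; f} \;
  \sum_{i=1}^{n} \ell\bigl(y_i \, f(A^{\top} x_i)\bigr) + \lambda \, J(f)
\quad \text{subject to} \quad A^{\top} A = I_d
```

    Here x_i are the D-dimensional measurements, y_i in {-1, +1} are the class labels, \ell is a margin-based loss (for example the hinge loss), J is a regularizer on the decision function f, and d < D is the reduced dimension. The constraint that A has orthonormal columns is what places A on the Stiefel manifold. A coordinate descent scheme would alternate between updating f with A fixed and updating A with f fixed.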

    Plant recognition, detection, and counting with deep learning

    In agricultural and farm management, plant recognition, plant detection, and plant counting systems are crucial. These tasks apply to several applications, for example, plant disease detection, weed detection, fruit harvesting systems, and plant species identification. Plants can be identified by looking at their most discriminating parts, such as a leaf, fruit, flower, bark, or the overall plant, and by considering attributes such as shape, size, or color. However, identifying plant species from field observation can be complicated and time-consuming, and it requires specialized expertise. Computer vision and machine-learning techniques have become ubiquitous and are invaluable for overcoming problems with plant recognition in research. Although these techniques have been of great help, image-based plant recognition is still a challenge. There are several obstacles, such as considerable species diversity, intra-class dissimilarity, inter-class similarity, and blurred source images. Recently, the emergence of deep learning has brought substantial advances in image classification. Deep learning architectures can learn from images and notably increase their predictive accuracy. This thesis provides various techniques, including data augmentation and classification schemes, to improve plant recognition, plant detection, and plant counting systems.

    An investigation on automatic systems for fault diagnosis in chemical processes

    Plant safety is the most important concern of the chemical industries. Process faults can cause economic losses as well as human and environmental damage. Most operational faults are normally considered in the process design phase by applying methodologies such as Hazard and Operability Analysis (HAZOP). However, it should be expected that failures may occur in an operating plant. For this reason, it is of paramount importance that plant operators can promptly detect and diagnose such faults in order to take the appropriate corrective actions. In addition, preventive maintenance needs to be considered in order to increase plant safety. Fault diagnosis has been addressed with both analytic and data-based models, using several techniques and algorithms. However, there is not yet a general fault diagnosis framework that joins detection and diagnosis of faults, whether or not they have been previously recorded. Moreover, even less effort has been devoted to automating and implementing the reported approaches in real practice. Against this background, this thesis proposes a general framework for data-driven Fault Detection and Diagnosis (FDD) that is applicable to, and can be automated in, any industrial scenario in order to maintain plant safety. The main requirement for constructing this system is the existence of historical process data. In this sense, promising methods imported from the Machine Learning field are introduced as fault diagnosis methods. The learning algorithms, used as diagnosis methods, have proved capable of diagnosing not only the modeled faults but also novel faults. Furthermore, Risk-Based Maintenance (RBM) techniques, widely used in the petrochemical industry, are proposed for application as part of preventive maintenance in all industry sectors. The proposed FDD system, together with an appropriate preventive maintenance program, would represent a potential plant safety program to be implemented.
Chapter one presents a general introduction to the thesis topic, as well as the motivation and scope. Chapter two then reviews the state of the art of the related fields. Fault detection and diagnosis methods found in the literature are reviewed, and a taxonomy that joins both Artificial Intelligence (AI) and Process Systems Engineering (PSE) classifications is proposed. The assessment of fault diagnosis with performance indices is also reviewed. Moreover, the chapter sets out the state of the art of Risk Analysis (RA) as a tool for taking corrective actions against faults and of Maintenance Management for preventive actions. Finally, the benchmark case studies against which FDD research is commonly validated are examined in this chapter. The second part of the thesis, comprising chapters three to six, addresses the methods applied during the research work. Chapter three deals with data pre-processing, chapter four with the feature processing stage, and chapter five with the diagnosis algorithms. Chapter six, in turn, introduces the Risk-Based Maintenance techniques for addressing preventive plant maintenance. The third part includes chapter seven, which constitutes the core of the thesis. In this chapter the proposed general FD system is outlined, divided into three steps: diagnosis model construction, model validation, and on-line application. This scheme includes a fault detection module and an Anomaly Detection (AD) methodology for the detection of novel faults. Furthermore, several approaches are derived from this general scheme for continuous and batch processes. The fourth part of the thesis presents the validation of the approaches. Specifically, chapter eight presents the validation of the proposed approaches in continuous processes and chapter nine the validation of the batch process approaches. Chapter ten validates the AD methodology on real batch processes.
First, the methodology is applied to a laboratory-scale heat exchanger and then to a Photo-Fenton pilot plant, which corroborates its potential and success in real practice. Finally, the fifth part, comprising chapter eleven, presents the final conclusions and the main contributions of the thesis. The scientific production achieved during the research period is also listed, and prospects for further work are outlined.
Spanish abstract (translated): Plant safety is the most pressing problem for the chemical industries. A plant failure can cause economic losses and harm to people and the environment. Most operational faults are anticipated at the process design stage by applying Hazard and Operability Analysis (HAZOP) techniques. However, there remains a probability that a fault will arise in an operating plant. For this reason, it is of the utmost importance that a plant be able to detect and diagnose process faults and take appropriate corrective measures to mitigate the effects of the fault and avoid regrettable consequences. Preventive maintenance is therefore also important to increase safety and prevent faults from occurring. Fault diagnosis has been addressed with both analytical and data-based models, using several types of techniques and algorithms. However, no general plant safety system has yet been proposed that combines the detection and diagnosis of faults, whether previously recorded or not. Even fewer methodologies have been reported that can be automated and implemented in real practice. To address the problem of safety in chemical plants, this thesis proposes a general system for fault detection and diagnosis that can be implemented in an automated way in any industry.
The main requirement for building this system is the existence of historical plant data without prior filtering. In this sense, different data-based methods are applied as fault diagnosis methods, mainly those imported from the field of machine learning. These learning techniques have proved capable of detecting and diagnosing not only the modeled or "learned" faults, but also new faults not included in the diagnosis models. In addition, some Risk-Based Maintenance (RBM) techniques that are widely used in the petrochemical industry are also proposed for application in the remaining industrial sectors as part of preventive maintenance. In conclusion, it is proposed to implement, in the not-too-distant future, a general plant safety program that includes the proposed fault detection and diagnosis system together with an appropriate preventive maintenance program. Breaking down the content of the thesis, chapter one presents a general introduction to the topic of this thesis, as well as the motivation for its development and its defined scope. Chapter two sets out the state of the art of the areas related to the thesis topic. Accordingly, the fault detection and diagnosis methods found in the literature are examined in this chapter. A taxonomy of diagnosis methods is also proposed that unifies the classifications proposed in the areas of Artificial Intelligence and Process Engineering. The evaluation of the performance of diagnosis methods in the literature is examined as well. In addition, this chapter reviews and reports the state of the art of "Risk Analysis" and "Maintenance Management" as complementary techniques for taking corrective and preventive measures.
Finally, the case studies regarded as benchmarks in the research field for the application of the proposed system are addressed. The third part includes chapter seven, which constitutes the heart of the thesis. This chapter presents the proposed general fault diagnosis scheme or system. The system is divided into three parts: construction of the diagnosis models, validation of the models, and on-line application. It also includes a fault detection module preceding diagnosis and an anomaly detection methodology for detecting new faults. Finally, several methodologies for continuous and batch processes are derived from this system. The fourth part of this thesis presents the validation of the proposed methodologies. Specifically, chapter eight presents the validation of the methodologies proposed for continuous processes, and chapter nine presents the validation of the methodologies for batch processes. Chapter ten validates the anomaly detection methodology on real batch processes. It is first applied to a laboratory-scale heat exchanger, and its application is then scaled up to a pilot-plant Photo-Fenton process, which corroborates the potential and success of the methodology in real practice. Finally, the fifth part of this thesis, consisting of chapter eleven, is devoted to presenting and reaffirming the final conclusions and the main contributions of the thesis. Future lines of research are also set out, and the work developed and presented during the research period is listed.

    CANCER DETECTION FOR LOW GRADE SQUAMOUS ENTRAEPITHELIAL LESION

    The National Cancer Institute estimates that in 2012 about 577,190 Americans are expected to die of cancer, more than 1,500 people a day. Cancer is the second most common cause of death in the US, accounting for nearly 1 of every 4 deaths. Cancer diagnosis has a very important role in the early detection and treatment of cancer. Automating the cancer diagnosis process can play a very significant role in reducing the number of falsely identified or unidentified cases. The aim of this thesis is to demonstrate different machine learning approaches for cancer detection. Dr. Tawfik, a pathologist from the University of Kansas Medical Center (KUMC), is the inventor of a novel pathology tissue slicer. The data used in this study come from this slicer, which successfully allows semi-automated cancer diagnosis and has the potential to improve patient care. In this study the slides are processed, visual features are computed, and the dataset is built from scratch. After feature extraction, machine learning approaches that have shown their capability of extracting high-level representations from high-dimensional data are applied to the dataset. This study concentrates on Support Vector Machines and Deep Belief Networks (DBN). In the first section, a Support Vector Machine is applied to the dataset. Next, a Deep Belief Network, which is capable of extracting features in an unsupervised manner, is implemented, and the network is fine-tuned with back-propagation. The results show that the DBN can be effective when applied to cytological cancer diagnosis by increasing the accuracy of cancer detection. In the last section, a subset of DBN features is selected and appended to the raw features, and a Support Vector Machine is trained and tested on the result. This shows an improvement over the results of the first section. The study concludes that Deep Belief Networks can be used successfully alongside other leading classification methods for cancer detection.

    Supervised machine learning in psychiatry: towards application in clinical practice

    In recent years, the field of machine learning (often referred to by the more general term artificial intelligence) has grown explosively, and its application has been proposed in nearly all fields, including psychiatry and mental health. This has been motivated by the promise of using machine learning to develop new clinical tools that could help perform personalized predictions and recommendations, ultimately improving the results achievable in psychiatric clinical practice, which still has only limited success in the fight against mental diseases. However, despite this huge interest, there is still a substantial lack of tools in psychiatry that are based on machine learning algorithms. Massimiliano Grassi, in his Ph.D. thesis, investigates the challenges of translating machine learning algorithms into clinical practice and proposes innovative solutions to these challenges. The thesis presents the development and validation of new algorithms for the prediction of the onset of Alzheimer’s disease, the remission of obsessive-compulsive disorder, and the automation of sleep staging in polysomnography, a method to diagnose sleep disorders. The results from these studies demonstrate that the use of machine learning in psychiatric clinical practice is not just a promise, and that it is possible to develop machine learning algorithms that achieve clinically relevant performance even if based solely on information that is easily accessible in the daily clinical routine.

    A novel high-throughput and label-free phenotypic drug screening approach: MALDI-TOF mass spectrometry combined with machine learning strategies

    A renewed and growing interest in phenotypic drug screening approaches is observed in the field of drug discovery, as it has become apparent that target-oriented drug discovery assays have inherent limitations and cannot fulfil the urgent unmet medical need for novel drugs. The shortcomings of target-oriented drug screening assays are especially apparent in the field of antibiotic drug discovery, where target-based approaches largely failed to translate screening hits into clinically relevant drugs. In this thesis, a proteomics-based phenotypic drug screening approach using MALDI-TOF mass spectrometry was developed, which is able to detect sub-lethal stress in bacterial cells provoked by antibiotics. To achieve this, mass spectra of whole cells exposed to known antibiotics at concentrations below the minimal inhibitory concentration (MIC) were used to extract relevant mass spectral peaks with a data-dependent and automated computational pipeline created in the MATLAB environment. Using the selected subset of mass spectral peaks, classification models were trained to recognize general mass spectral responses provoked by unknown drugs in the cellular proteome. Additionally, the classification models proved capable of identifying the mechanisms of action of unknown drugs. To establish and validate the best performing classification modeling procedure, four different feature selection algorithms and nine classification models were analyzed in detail using an Escherichia coli data set composed of over 900 spectra, involving 17 antibiotics with four different mechanisms of action, at concentrations ranging from 1×MIC down to 1/32×MIC in a two-fold dilution series. Four different feature selection approaches were investigated to ensure the extraction of relevant mass spectral data in response to the different antibiotics for classification modeling.
The selection approaches included (1) a random forest of decision trees, (2) sequential forward feature selection, and (3) sequential backward feature selection. Mass spectral peaks selected by two or all three of these feature selection approaches were combined into (4) an aggregated feature set. Classification models were trained for all combinations of nine model types and the four feature sets. Two classification problems were investigated in this thesis. First, a binary classification model was trained to differentiate between affected and non-affected cells based on selected mass spectral peaks. Second, a multi-class model was trained to detect and distinguish between the different antibiotic mechanisms of action, a highly desired drug screening assay characteristic. The combination of these elements yielded 72 models, which were evaluated based on their overall classification accuracy. The overall classification accuracy was determined using internal 10-fold cross-validation and external validation, which was performed with a blind set of 20 drugs. The internal and external validation studies showed that the aggregated feature set in combination with a quadratic support vector machine-based model (Q-SVM) resulted in the best classification performance. For the E. coli data set, the Q-SVM model reached an overall accuracy of 0.92 for internal validation and 0.95 for external validation. Classifying based on the mechanism of action of the antibiotics resulted in a classification accuracy of 0.67 for internal validation and 0.80 for external validation. Furthermore, it was shown that the peak selection method was able to identify relevant, known stress-associated proteins within the aggregated feature sets of both the binary and the mechanism of action model.
After the experimental workflow and the computational pipeline were established based on E. coli data, the method was applied to four different organisms (the Gram-positive bacterium Staphylococcus aureus, the fungi Saccharomyces cerevisiae and Candida albicans, and the human HeLa cancer cell line) and different proteomic responses, to explore the versatility and transferability of the developed screening assay. The applicability of the method was demonstrated by the consistent performance of the classification models generated with the experimental and computational pipeline. This resulted in binary model accuracies between 0.92 and 0.97 for internal validation and between 0.77 and 0.95 for external validation, depending on the assayed organism and data set complexity. For mechanism of action models, accuracies ranged between 0.73 and 0.96 for internal validation and between 0.66 and 0.93 for external validation. The application of the developed assay to different organisms with different drug stressors highlighted several advantageous characteristics of the developed MALDI-TOF MS screening approach. Both the binary and the mechanism of action classification models of S. aureus correctly identified an antibiotic drug (fusidic acid) in the blind test set that had a target binding activity not present in the training data set. This implies that the method can detect novel drugs within a known global mechanism of action for which the model was trained. Moreover, external validation of S. cerevisiae showed that the binary classification model is able to detect an antifungal drug (tavaborole, an antifungal protein synthesis inhibitor) with a mechanism of action that was not present in the training data set. This is a highly desirable property of any phenotypic screening assay, as it shows that the assay allows for the identification of drugs with novel mechanisms of action. Lastly, the proteomic effect of different types of drugs on mammalian cells was explored by using the HeLa cancer cell line.
It was shown that the presented proteomic profiling approach can easily detect several types of drug-induced stresses in HeLa cells, in particular those caused by corticosteroids and tubulin (de)polymerization inhibitors, but is less suitable for distinguishing other drug classes (neurotransmitter antagonists, statins, opioids). Additionally, the application of the assay to HeLa cells demonstrated the ability to detect different types of stresses, such as the cells’ proteomic response to UV exposure or heat shocks. These results pave the way for a possible distinction between apoptosis and necrosis pathways in HeLa cells using the presented MALDI-TOF MS-based method. To conclude, a high-throughput compatible, label-free, MALDI-TOF mass spectrometry-based screening assay is described in this thesis, which measures sub-lethal drug effects on the cellular proteome in a phenotypic and pharmacologically relevant setting. The method was found suitable for whole-cell screening of small libraries of drugs, and showed the ability to distinguish different types of stresses elicited in multiple types of cell cultures. The potential to find new, weakly active drugs within a known mechanism of action, as well as the ability to detect sub-lethal drug responses with new mechanisms of action for which the model was not trained, was demonstrated. The ability to identify novel mechanisms of action in a cell-based screen can be exploited to address the most pressing issues in drug discovery today. In addition, mechanistic information on a drug's activity can be used as a starting point for further target elucidation or to prioritize drug screening hits. The studies performed in this thesis have resulted in a solid foundation for further research that expands the capabilities of the MALDI-TOF MS-based assay in a broad range of phenotypic profiling applications in the drug discovery field.
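    The aggregated feature set described in this abstract (peaks chosen by two or all three selectors) and the reported count of 72 models can be illustrated with a small sketch. The peak labels and the function name `aggregate_features` below are hypothetical placeholders, not values or code from the thesis, which reports using MATLAB.

```python
from collections import Counter


def aggregate_features(selections, min_votes=2):
    """Keep features chosen by at least `min_votes` of the selectors."""
    # Each selector votes once per feature, so duplicates within a list are ignored.
    votes = Counter(f for selected in selections for f in set(selected))
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Hypothetical peak labels picked by each of the three selection approaches.
random_forest = [4365, 5096, 6255, 7274]
forward_selection = [5096, 6255, 9064]
backward_selection = [4365, 6255, 9536]

aggregated = aggregate_features(
    [random_forest, forward_selection, backward_selection]
)

# Model count as described in the abstract:
# 9 model types x 4 feature sets x 2 classification problems.
n_models = 9 * 4 * 2

print(aggregated)
print(n_models)
```

    With these placeholder inputs, only the peaks picked by at least two selectors survive, and the product of the three design factors matches the 72 models stated in the abstract.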

    Predictive decoding of neural data

    In the last five decades the number of techniques available for non-invasive functional imaging has increased dramatically. Researchers today can choose from a variety of imaging modalities that include EEG, MEG, PET, SPECT, MRI, and fMRI. This doctoral dissertation offers a methodology for the reliable analysis of neural data at different levels of investigation. By using statistical learning algorithms, the proposed approach allows single-trial analysis of various neural data by decoding them into variables of interest. Unbiased testing of the decoder on new samples of the data provides a generalization assessment of the reliability of decoding performance. Through subsequent analysis of the constructed decoder's sensitivity, it is possible to identify neural signal components relevant to the task of interest. The proposed methodology accounts for covariance and causality structures present in the signal. This feature makes it more powerful than the conventional univariate methods that currently dominate the neuroscience field. Chapter 2 describes the generic approach toward the analysis of neural data using statistical learning algorithms. Chapter 3 presents an analysis of results from four neural data modalities: extracellular recordings, EEG, MEG, and fMRI. These examples demonstrate the ability of the approach to reveal neural data components that cannot be uncovered with conventional methods. As a further extension of the methodology, Chapter 4 analyzes data from multiple neural data modalities: EEG and fMRI. The reliable mapping of data from one modality into the other provides a better understanding of the underlying neural processes. By allowing the spatial-temporal exploration of neural signals under loose modeling assumptions, it removes potential bias in the analysis of neural data due to otherwise possible forward model misspecification.
The proposed methodology has been formalized into a free and open source Python framework for statistical learning based data analysis. This framework, PyMVPA, is described in Chapter 5.

    A Corpus Driven Computational Intelligence Framework for Deception Detection in Financial Text

    Financial fraud rampages onwards, seemingly uncontained. The annual cost of fraud in the UK is estimated to be as high as £193bn [1]. From a data science perspective, which has hitherto been less explored, this thesis demonstrates how the use of linguistic features to drive data mining algorithms can aid in unravelling fraud. To this end, the spotlight is turned on Financial Statement Fraud (FSF), known to be the costliest type of fraud [2]. A new corpus of 6.3 million words is composed of 102 annual reports/10-K (narrative sections) from firms formally indicted for FSF, juxtaposed with 306 from non-fraud firms of similar size and industrial grouping. Unlike other similar studies, this thesis takes a wide-angled view and extracts a range of features of different categories from the corpus. These linguistic correlates of deception are uncovered using a variety of techniques and tools. Corpus linguistics methodology is applied to extract keywords and to examine linguistic structure. N-grams are extracted to draw out collocations. Readability measurement in financial text is advanced through the extraction of new indices that probe the text at a deeper level. Cognitive and perceptual processes are also picked out. Tone, intention and liquidity are gauged using customised word lists. Linguistic ratios are derived from grammatical constructs and word categories. An attempt is also made to determine ‘what’ was said as opposed to ‘how’. Further, a new module is developed to condense synonyms into concepts. Lastly, frequency counts of keywords unearthed in a previous content analysis study on financial narrative are also used. These features are then used to drive machine learning based classification and clustering algorithms to determine whether they aid in discriminating a fraud firm from a non-fraud firm. The battery of models built typically exceeds a classification accuracy of 70%. The above process is amalgamated into a framework.
The process outlined, driven by empirical data, demonstrates in a practical way how linguistic analysis could aid in fraud detection and also constitutes a unique contribution to deception detection studies.
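    The n-gram extraction step mentioned in this abstract can be sketched as a simple count of adjacent word pairs. The sample sentence, the tokenization rule, and the function name `ngrams` are assumptions for illustration; the thesis does not specify its tooling here.

```python
import re
from collections import Counter


def ngrams(text, n=2):
    """Return the list of word n-grams in `text` as tuples of lowercase tokens."""
    # Simple tokenizer (an assumption): runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Hypothetical snippet standing in for a narrative section of an annual report.
sample = "Net income rose. Net income is expected to rise again."

# Note: this sketch ignores sentence boundaries, so ("rose", "net") is counted too.
counts = Counter(ngrams(sample, 2))
top = counts.most_common(1)

print(top)
```

    Frequent pairs such as the repeated one in this sample are candidates for collocations; a fuller pipeline would normally respect sentence boundaries and compare frequencies between the fraud and non-fraud sub-corpora.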