Importance measures derived from random forests: characterisation and extension
Nowadays new technologies, and especially artificial intelligence, are more and more established in our society.
Big data analysis and machine learning, two sub-fields of artificial intelligence, are at the core of many recent breakthroughs in many application fields (e.g., medicine, communication, finance), including some that are closely tied to our day-to-day life (e.g., social networks, computers, smartphones). In machine learning, significant improvements are usually achieved at the price of increasing computational complexity and thanks to bigger datasets. Currently, cutting-edge models built by the most advanced machine learning algorithms have typically become very efficient and profitable, but also extremely complex. Their complexity is such that these models are commonly seen as black boxes whose predictions or decisions cannot be interpreted or justified. Nevertheless, whether these models are used autonomously or as a simple decision-support tool, they are already being used in applications where health and human life are at stake. Therefore, it is clearly necessary not to blindly trust everything coming out of those models without a detailed understanding of their predictions or decisions.
Accordingly, this thesis aims to improve the interpretability of models built by a specific family of machine learning algorithms, the so-called tree-based methods. Several mechanisms have been proposed to interpret these models, and throughout this thesis we aim to improve their understanding, study their properties, and define their limitations.
An investigation on automatic systems for fault diagnosis in chemical processes
Plant safety is the most important concern of chemical industries. Process faults can cause economic losses as well as human and environmental damage. Most operational faults are considered in the process design phase by applying methodologies such as Hazard and Operability Analysis (HAZOP). However, failures should still be expected in an operating plant. For this reason, it is of paramount importance that plant operators can promptly detect and diagnose such faults in order to take the appropriate corrective actions. In addition, preventive maintenance needs to be considered in order to increase plant safety.
Fault diagnosis has been addressed with both analytical and data-based models, using a variety of techniques and algorithms. However, there is not yet a general fault diagnosis framework that combines detection and diagnosis of faults, whether or not they have been previously recorded. Moreover, even less effort has been devoted to automating the reported approaches and implementing them in real practice.
Against this background, this thesis proposes a general framework for data-driven Fault Detection and Diagnosis (FDD) that can be applied and automated in any industrial scenario in order to maintain plant safety. The main requirement for constructing this system is the existence of historical process data. In this sense, promising methods imported from the Machine Learning field are introduced as fault diagnosis methods. The learning algorithms used as diagnosis methods have proved capable of diagnosing not only the modeled faults but also novel faults. Furthermore, Risk-Based Maintenance (RBM) techniques, widely used in the petrochemical industry, are proposed as part of preventive maintenance in all industry sectors. The proposed FDD system, together with an appropriate preventive maintenance program, would constitute a candidate plant safety program for implementation.
Chapter one presents a general introduction to the thesis topic, as well as its motivation and scope. Chapter two then reviews the state of the art of the related fields. Fault detection and diagnosis methods found in the literature are reviewed, and a taxonomy that unifies the Artificial Intelligence (AI) and Process Systems Engineering (PSE) classifications is proposed. The assessment of fault diagnosis with performance indices is also reviewed. Moreover, the chapter presents the state of the art in Risk Analysis (RA) as a tool for taking corrective actions in response to faults and in Maintenance Management for preventive actions. Finally, the benchmark case studies against which FDD research is commonly validated are examined in this chapter.
The second part of the thesis, comprising chapters three to six, addresses the methods applied during the research work. Chapter three deals with data pre-processing, chapter four with the feature processing stage, and chapter five with the diagnosis algorithms. Chapter six, in turn, introduces the Risk-Based Maintenance techniques for addressing preventive maintenance of the plant. The third part includes chapter seven, which constitutes the core of the thesis. In this chapter the proposed general fault diagnosis system is outlined, divided into three steps: diagnosis model construction, model validation, and on-line application. This scheme includes a fault detection module and an Anomaly Detection (AD) methodology for the detection of novel faults. Furthermore, several approaches are derived from this general scheme for continuous and batch processes. The fourth part of the thesis presents the validation of the approaches. Specifically, chapter eight presents the validation of the proposed approaches in continuous processes and chapter nine the validation of the batch process approaches. Chapter ten applies the AD methodology to real batch processes. First, the methodology is applied to a lab heat exchanger and then to a Photo-Fenton pilot plant, which corroborates its potential and success in real practice. Finally, the fifth part, consisting of chapter eleven, is devoted to the final conclusions and the main contributions of the thesis. The scientific production achieved during the research period is also listed and prospects for further work are outlined.
Plant safety is the most pressing concern for the chemical industries. A plant fault can cause economic losses as well as harm to people and the environment. Most operational faults are anticipated at the process design stage by applying Hazard and Operability Analysis (HAZOP) techniques. However, there remains a probability that a fault will arise in an operating plant.
For this reason, it is of utmost importance that a plant be able to detect and diagnose process faults and take appropriate corrective measures to mitigate the effects of a fault and avoid regrettable consequences. Preventive maintenance is therefore also important to increase safety and prevent faults from occurring.
Fault diagnosis has been addressed with both analytical and data-based models, using various types of techniques and algorithms. However, no general plant safety system has yet been proposed that combines detection and diagnosis of faults, whether previously recorded or not. Even fewer methodologies have been reported that can be automated and implemented in real practice.
To address the problem of safety in chemical plants, this thesis proposes a general system for fault detection and diagnosis that can be implemented in an automated way in any industry. The main requirement for building this system is the existence of historical plant data without prior filtering. In this sense, different data-based methods are applied as fault diagnosis methods, mainly those imported from the field of Machine Learning. These learning techniques have proved capable of detecting and diagnosing not only the modeled or "learned" faults, but also new faults not included in the diagnosis models. In addition, some risk-based maintenance (RBM) techniques that are widely used in the petrochemical industry are also proposed for application in the remaining industrial sectors as part of preventive maintenance. In conclusion, it is proposed to implement, in the not-too-distant future, a general plant safety program that includes the proposed fault detection and diagnosis system together with an adequate preventive maintenance program.
Breaking down the content of the thesis, chapter one presents a general introduction to the thesis topic, as well as the motivation for its development and its defined scope. Chapter two presents the state of the art in the areas related to the thesis topic. The fault detection and diagnosis methods found in the literature are examined in this chapter. A taxonomy of diagnosis methods is also proposed that unifies the classifications proposed in the fields of Artificial Intelligence and Process Engineering. The evaluation of the performance of diagnosis methods in the literature is examined as well. In addition, this chapter reviews and reports the state of the art in Risk Analysis and Maintenance Management as complementary techniques for taking corrective and preventive measures. Finally, the case studies regarded as benchmarks in the research field for applying the proposed system are addressed. The third part includes chapter seven, which constitutes the heart of the thesis. This chapter presents the proposed general fault diagnosis scheme or system. The system is divided into three parts: construction of the diagnosis models, validation of the models, and on-line application. It also includes a fault detection module that precedes diagnosis and an anomaly detection methodology for detecting new faults. Finally, several methodologies for continuous and batch processes are derived from this system. The fourth part of this thesis presents the validation of the proposed methodologies. Specifically, chapter eight presents the validation of the methodologies proposed for continuous processes and chapter nine presents the validation of those for batch processes. Chapter ten validates the anomaly detection methodology in real batch processes. It is first applied to a laboratory-scale heat exchanger and then scaled up to a pilot-plant Photo-Fenton process, which corroborates the potential and success of the methodology in real practice.
Finally, the fifth part of this thesis, consisting of chapter eleven, is devoted to presenting and reaffirming the final conclusions and the main contributions of the thesis. In addition, future lines of research are outlined and the work developed and presented during the research period is listed.
Effective techniques for handling incomplete data using decision trees
Decision Trees (DTs) have been recognized as one of the most successful formalisms for knowledge representation and reasoning and are currently applied to a variety of data mining or knowledge discovery applications, particularly for classification problems. There are several efficient methods to learn a DT from data. However, these methods are often limited to the assumption that data are complete.
In this thesis, some contributions to the field of machine learning and statistics that solve the problem of extracting DTs for learning and classification tasks from incomplete databases are presented. The methodology underlying the thesis blends together well-established statistical theories with the most advanced techniques for machine learning and automated reasoning with uncertainty.
The first contribution is a set of extensive simulations that study the impact of missing data on the predictive accuracy of existing DTs that can cope with missing values, when missing values are in both the training and test sets or in only one of the two. All simulations are performed under missing completely at random, missing at random, and informatively missing mechanisms, and for different missing data patterns and proportions.
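The three missingness mechanisms named above can be illustrated with a small simulation. The sketch below is not the thesis's simulation design; the function name `apply_missingness`, the sign-based dependence on `x` or `y`, and the doubled drop rate are all assumptions chosen only to make the distinction concrete.

```python
import random

def apply_missingness(rows, mechanism, rate=0.3, seed=0):
    """Return a copy of rows, a list of (x, y) pairs, with some y set to None.

    mechanism (illustrative definitions, assumed for this sketch):
      "mcar": y is dropped with probability `rate`, independent of the data.
      "mar":  the drop probability depends only on the observed x.
      "im":   the drop probability depends on the value of y itself
              (informatively missing).
    """
    rng = random.Random(seed)
    out = []
    for x, y in rows:
        if mechanism == "mcar":
            p = rate
        elif mechanism == "mar":
            p = min(1.0, 2 * rate) if x > 0 else 0.0
        elif mechanism == "im":
            p = min(1.0, 2 * rate) if y > 0 else 0.0
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        # rng.random() lies in [0, 1), so p = 0 never drops and p = 1 always drops.
        out.append((x, None if rng.random() < p else y))
    return out
```

Under "mar" in this sketch, rows with non-positive `x` never lose `y`, so the pattern of gaps can be explained from observed data alone; under "im" the gaps depend on the very values that go missing, which is what makes that mechanism harder to handle.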
The next contribution is a simple, novel, yet effective procedure for training and testing decision trees in the presence of missing data. Original and simple splitting criteria for attribute selection in tree building are put forward. The proposed technique is evaluated and validated in empirical tests over many real-world application domains. In this work, the proposed algorithm maintains (and sometimes exceeds) the outstanding accuracy of multiple imputation, especially on datasets containing mixed attributes and purely nominal attributes. The proposed algorithm also greatly improves accuracy for informatively missing (IM) data. Another major advantage of this method over multiple imputation is the substantial saving in computational resources due to its simplicity.
The next contribution is the proposal of three versions of simple probabilistic techniques that can be used for classifying incomplete vectors using decision trees based on complete data. The proposed procedure is superficially similar to that of fractional cases but more effective. The experimental results demonstrate that these approaches can achieve quality comparable to that of sophisticated algorithms such as multiple imputation and are therefore applicable to all kinds of datasets.
Finally, two ensemble procedures for handling incomplete training and test data are proposed and discussed. The algorithms combine the two best approaches either with resampling (REMIMIA) or without resampling (EMIMIA) of the training data before growing the decision trees. Empirical tests are used to evaluate and validate the success of the proposed ensemble methods relative to individual missing data techniques. EMIMIA attains the highest overall level of prediction accuracy.
Mining and Managing Large-Scale Temporal Graphs
Large-scale temporal graphs are everywhere in our daily life. From online social networks, mobile networks, and brain networks to computer systems, entities in these large complex systems communicate with each other, and their interactions evolve over time. Unlike traditional graphs, temporal graphs are dynamic: both topologies and attributes on nodes/edges may change over time. On the one hand, the dynamics have inspired new applications that rely on mining and managing temporal graphs. On the other hand, the dynamics also raise new technical challenges. First, it is difficult to discover or retrieve knowledge from complex temporal graph data. Second, because of the extra time dimension, we also face new scalability problems. To address these new challenges, we need to develop new methods that model temporal information in graphs so that we can deliver useful knowledge, new queries with temporal and structural constraints through which users can obtain the desired knowledge, and new algorithms that are cost-effective for both mining and management tasks.
In this dissertation, we discuss our recent work on mining and managing large-scale temporal graphs.
First, we investigate two mining problems: node ranking and link prediction. In this work, temporal graphs are applied to model the data generated from computer systems and online social networks. We formulate data mining tasks that extract knowledge from temporal graphs. The discovered knowledge can help domain experts identify critical alerts in system monitoring applications and recover the complete traces of information propagation in online social networks. To address computational efficiency problems, we leverage the unique properties of temporal graphs to simplify mining processes. The resulting mining algorithms scale well to large-scale temporal graphs with millions of nodes and billions of edges. Through experimental studies over real-life and synthetic data, we confirm the effectiveness and efficiency of our algorithms.
Second, we focus on temporal graph management problems. In these studies, temporal graphs are used to model datacenter networks, mobile networks, and subscription relationships between stream queries and data sources. We formulate graph queries to retrieve knowledge that supports applications in cloud service placement, information routing in mobile networks, and query assignment in stream processing systems. We investigate three types of queries: subgraph matching, temporal reachability, and graph partitioning. By utilizing the relatively stable components in these temporal graphs, we develop flexible data management techniques to enable fast query processing and handle graph dynamics. We evaluate the soundness of the proposed techniques on both real and synthetic data.
Through these studies, we have learned valuable lessons. For temporal graph mining, the temporal dimension may not necessarily increase computational complexity; instead, it may reduce computational complexity if temporal information is used wisely. For temporal graph management, temporal graphs may include relatively stable components in real applications, which can help us develop flexible data management techniques that enable fast query processing and handle dynamic changes in temporal graphs.
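Of the query types listed above, temporal reachability is the easiest to illustrate compactly. The sketch below is a generic single-pass earliest-arrival computation, not the dissertation's technique; it assumes contacts are instantaneous and that a valid path must use strictly increasing timestamps, and the name `earliest_arrival` and the toy edge list are made up for illustration.

```python
import math

def earliest_arrival(edges, source):
    """Earliest arrival time at each node temporally reachable from `source`.

    edges: iterable of (u, v, t), meaning u contacts v at time t.
    Assumption of this sketch: consecutive edges on a path must have
    strictly increasing timestamps.
    """
    arrival = {source: -math.inf}  # the source is available before any contact
    # Scanning contacts in time order means the first time a node is reached
    # is also its earliest arrival time.
    for u, v, t in sorted(edges, key=lambda e: e[2]):
        # The contact is usable only if u was reached strictly before time t.
        if u in arrival and arrival[u] < t and v not in arrival:
            arrival[v] = t
    return arrival

# Toy example: c contacts d at time 1, before information can reach c at time 2.
edges = [("a", "b", 1), ("b", "c", 2), ("c", "d", 1)]
reached = earliest_arrival(edges, "a")
```

In the toy example, `d` is connected to `a` in the static graph but is not temporally reachable, because the only contact into `d` happens before `c` itself is reached. This is one concrete way the time dimension can prune work rather than add to it.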
A Rules-based Mode Choice Model using CHAID Decision Trees and Dynamic Transit Accessibility
Transportation mode choice models typically represent user decision making using utility-based
mode choice models. However, utility models assume that users make compensatory trade-offs between
decision variables to maximize their expected utility. The decision process literature raises alternative,
non-compensatory theories that suggest people employ simpler, cognitively frugal heuristics in their
decision making. Non-compensatory models, including decision tree classifiers, present an opportunity to
test the effects of transit accessibility variables on mode choices and improve descriptions of mode choice
behaviour. Dynamic forms of transit accessibility, which measure variations in transit service over time,
may better capture heuristic perceptions of transit service quality.
This research addresses the need to understand how dynamic transit accessibility (DTA) impacts
mode choices, without compensatory decision process assumptions. First, this research develops DTA
measures for the Region of Waterloo using General Transit Feed Specification (GTFS) transit schedule
information to calculate travel impedance matrices for departures at every 5-minute interval of the day.
Zonal mode shares are regressed against alternative DTA measures to analyze the effects of different
destination types, time periods of aggregation, and statistical parameters of transit accessibility (i.e., mean
and distribution over time). Based on the aggregate mode share predictive performance, a DTA metric is
selected for analysis within a binary (transit and not transit) disaggregate mode choice model. Second,
this research uses trip diary data to train and score a Chi-squared Automatic Interaction Detection
(CHAID) decision tree classifier to represent and predict rules-based mode choice processes. Finally, the
selected DTA metric is merged with the trip diary data and applied in another decision tree for
comparison. The comparison between the two rules-based mode choice models is based on overall model
accuracy, class recall, precision, and interpretability.
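The idea of summarising transit accessibility by its mean and its variation over departure times can be sketched as follows. The abstract does not state the functional form of the DTA measure, so the cumulative-opportunity count used here (destinations reachable within a time threshold), the 30-minute threshold, the function name `dynamic_accessibility`, and the travel times are all assumptions for illustration only.

```python
from statistics import mean, pstdev

def dynamic_accessibility(travel_times, threshold_min=30):
    """Summarise accessibility of one origin zone across departure times.

    travel_times: {departure_label: {destination: travel minutes}}.
    For each departure, count the destinations reachable within
    `threshold_min`; return the mean and population standard deviation
    of those counts over all departures.
    """
    counts = [
        sum(1 for minutes in dests.values() if minutes <= threshold_min)
        for dests in travel_times.values()
    ]
    return mean(counts), pstdev(counts)

# Hypothetical travel times (minutes) for three consecutive 5-minute departures.
times = {
    "08:00": {"A": 20, "B": 35},
    "08:05": {"A": 25, "B": 28},
    "08:10": {"A": 40, "B": 50},
}
dta_mean, dta_spread = dynamic_accessibility(times)
```

A zone with the same mean but a larger spread offers less reliable service across the day, which is the kind of difference a static accessibility measure would hide.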
Results from the decision tree classifier reveal that users apply heuristics in their transportation
mode decision making, including lexicographic and aspiration-level based decision rules. User choices
depend primarily on transit pass ownership, and non-transit-pass users consider the trip’s distance
thereafter. Including DTA as an independent variable in the decision tree has a small but statistically
significant effect: users only seem to consider DTA, a generalized location-based measure, if they do not
own a transit pass and only after considering the trip-specific distance. Overall, the rules-based mode
choice models report accuracies of roughly 84%; however, low precision in the transit predictions (i.e.,
many false positives) results in an overestimation of regional transit shares.
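The link between low precision and an overestimated transit share can be checked with a small calculation. The counts below are invented for illustration and are not the thesis's results; `binary_metrics` is a hypothetical helper. The predicted share exceeds the actual share exactly when false positives outnumber false negatives, which is the same condition as precision being lower than recall.

```python
def binary_metrics(actual, predicted, positive="transit"):
    """Accuracy, precision, recall, and actual vs. predicted share of `positive`."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    n = len(pairs)
    return {
        "accuracy": (tp + tn) / n,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "actual_share": (tp + fn) / n,
        "predicted_share": (tp + fp) / n,
    }

# Made-up sample of 100 trips: 10 true transit trips, 90 other trips.
actual = ["transit"] * 10 + ["other"] * 90
predicted = ["transit"] * 8 + ["other"] * 2 + ["transit"] * 14 + ["other"] * 76
metrics = binary_metrics(actual, predicted)
```

With these made-up counts the overall accuracy is 0.84, yet the predicted transit share (0.22) is more than double the actual share (0.10), because 14 false positives outweigh 2 false negatives.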
Comparative Analysis of Student Learning: Technical, Methodological and Result Assessing of PISA-OECD and INVALSI-Italian Systems
PISA is the most extensive international survey promoted by the OECD in the field of education; every three years it measures the skills of fifteen-year-old students from more than 80 participating countries. The INVALSI tests are written tests taken every year by all Italian students at key moments of the school cycle, to evaluate their levels in some fundamental skills in Italian, Mathematics, and English. Our comparison runs up to 2018, the last year of the PISA-OECD survey, even though the latest INVALSI edition was carried out in 2022. Our analysis focuses on the common part of the reference populations, namely the 15-year-old students in the second year of upper secondary school, where both sources give a similar picture of the students
- …