
    Data management for production quality deep learning models: Challenges and solutions

    Deep learning (DL) based software systems are difficult to develop and maintain in industrial settings due to several challenges. Data management is one of the most prominent challenges complicating DL in industrial deployments. DL models are data-hungry and require high-quality data, so the volume, variety, velocity, and quality of data cannot be compromised. This study aims to explore the data management challenges encountered by practitioners developing systems with DL components, to identify potential solutions from the literature, and to validate those solutions through a multiple case study. We identified 20 data management challenges experienced by DL practitioners through a multiple interpretive case study. Further, we identified 48 articles through a systematic literature review that discuss solutions for the data management challenges. Through a second round of the multiple case study, we show that many of these solutions have limitations and are not used in practice due to a combination of four factors: high cost, lack of skill-set and infrastructure, inability to solve the problem completely, and incompatibility with certain DL use cases. Thus, data management for data-intensive DL models in production is complicated. Although DL technology has achieved very promising results, there is still a significant need for further research on data management to build high-quality datasets and streams for production-ready DL systems. Furthermore, we classify the data management challenges into four categories based on the availability of solutions.

    AdaCC: cumulative cost-sensitive boosting for imbalanced classification

    Class imbalance poses a major challenge for machine learning, as most supervised learning models exhibit bias towards the majority class and under-perform on the minority class. Cost-sensitive learning tackles this problem by treating the classes differently, typically via a user-defined fixed misclassification cost matrix provided as input to the learner. Such parameter tuning is a challenging task that requires domain knowledge; moreover, wrong adjustments can degrade overall predictive performance. In this work, we propose a novel cost-sensitive boosting approach for imbalanced data that dynamically adjusts the misclassification costs over the boosting rounds in response to the model's performance instead of using a fixed misclassification cost matrix. Our method, called AdaCC, is parameter-free, as it relies on the cumulative behavior of the boosting model to adjust the misclassification costs for the next boosting round, and comes with theoretical guarantees regarding the training error. Experiments on 27 real-world datasets from different domains with high class imbalance demonstrate the superiority of our method over 12 state-of-the-art cost-sensitive boosting approaches, exhibiting consistent improvements across measures: in the range of 0.3–28.56% for AUC, 3.4–21.4% for balanced accuracy, 4.8–45% for G-mean, and 7.4–85.5% for recall.
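The exact AdaCC update rule is not given in the abstract; the sketch below illustrates the general idea only: an AdaBoost-style loop over decision stumps in which per-class misclassification costs are recomputed each round from the ensemble's cumulative per-class error, so whichever class the ensemble currently misses most gets up-weighted. The cost formula and all names are illustrative, not the authors' algorithm.

```python
import math

def stump_predict(x, feat, thresh, sign):
    # a decision stump: threshold one feature, output +sign or -sign
    return sign if x[feat] > thresh else -sign

def train_stump(X, y, w):
    # exhaustive search for the stump with the lowest weighted error
    best = (float("inf"), 0, 0.0, 1)
    for f in range(len(X[0])):
        for t in sorted({x[f] for x in X}):
            for s in (1, -1):
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(xi, f, t, s) != yi)
                if err < best[0]:
                    best = (err, f, t, s)
    return best[1:]

def predict(ensemble, x):
    score = sum(a * stump_predict(x, f, t, s) for a, f, t, s in ensemble)
    return 1 if score >= 0 else -1

def cumulative_cost_boost(X, y, rounds=10):
    """AdaBoost-style loop whose per-class costs adapt each round to the
    ensemble's cumulative error on each class (sketch of the AdaCC idea)."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    cost = {+1: 1.0, -1: 1.0}  # +1 = minority, -1 = majority (assumption)
    for _ in range(rounds):
        f, t, s = train_stump(X, y, w)
        err = sum(wi for xi, yi, wi in zip(X, y, w)
                  if stump_predict(xi, f, t, s) != yi)
        err = min(max(err, 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, f, t, s))
        # cumulative per-class error of the ensemble so far drives the costs
        for c in (+1, -1):
            idx = [i for i in range(n) if y[i] == c]
            wrong = sum(1 for i in idx if predict(ensemble, X[i]) != y[i])
            cost[c] = 1.0 + wrong / max(len(idx), 1)  # illustrative formula
        # cost-weighted AdaBoost weight update
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, f, t, s))
             * (cost[yi] if stump_predict(xi, f, t, s) != yi else 1.0)
             for xi, yi, wi in zip(X, y, w)]
        z = sum(w)
        w = [wi / z for wi in w]
    return ensemble
```

Because the costs are derived from the boosting run itself, no misclassification cost matrix has to be tuned by hand, which is the parameter-free property the abstract emphasizes.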

    Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications

    Analysis and predictive modeling of massive datasets is an extremely significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. Real data are frequently affected by outliers, uncertain labels, and an uneven distribution of classes (imbalanced data). Such uncertainties create bias and make predictive modeling an even more difficult task. In the present work, we introduce a cost-sensitive learning method (CSL) to deal with the classification of imperfect data. Most traditional approaches to classification demonstrate poor performance on imperfect data. We propose the use of CSL with the Support Vector Machine, a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust with respect to imperfect data. Furthermore, we explore the best performance measures for tackling imperfect data, and address real problems in quality control and business analytics.
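The abstract does not detail how the costs enter the SVM; independent of any particular learner, the core cost-sensitive decision rule can be illustrated as follows (the cost values in the usage note are hypothetical):

```python
def cost_sensitive_predict(p_minority, cost_fn, cost_fp):
    """Predict 1 (minority) whenever the expected cost of a false negative
    outweighs that of a false positive. This is equivalent to lowering the
    decision threshold from 0.5 to cost_fp / (cost_fn + cost_fp)."""
    expected_cost_if_negative = p_minority * cost_fn        # missing a true case
    expected_cost_if_positive = (1 - p_minority) * cost_fp  # raising a false alarm
    return 1 if expected_cost_if_negative >= expected_cost_if_positive else 0
```

For example, with a false-negative cost nine times the false-positive cost, a predicted minority probability of 0.3 already triggers a minority prediction, whereas with equal costs it would not.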

    Handling Concept Drift in the Context of Expensive Labels

    Machine learning has been successfully applied to a wide range of prediction problems, yet its application to data streams can be complicated by concept drift. Existing approaches to handling concept drift are overwhelmingly reliant on the assumption that the true label of an instance can be obtained shortly after classification at negligible cost. The aim of this thesis is to examine, and attempt to address, some of the problems related to handling concept drift when the cost of obtaining labels is high. This thesis presents Decision Value Sampling (DVS), a novel concept drift handling approach which periodically chooses a small number of the most useful instances to label. The newly labelled instances are then used to re-train the classifier, an SVM with a linear kernel, to handle any change in concept that might occur. In this way, only the instances that are required to keep the classifier up-to-date are labelled. The evaluation of the system indicates that a classifier can be kept up-to-date with changes in concept while only requiring 15% of the data stream to be labelled. In a data stream with a high throughput this represents a significant reduction in the number of labels required. The second novel concept drift handling approach proposed in this thesis is Confidence Distribution Batch Detection (CDBD). CDBD uses a heuristic based on the distribution of an SVM's confidence in its predictions to decide when to rebuild the classifier. The evaluation shows that CDBD can reliably detect when a change in concept has taken place, and that concept drift can be handled if the classifier is rebuilt when CDBD signals a change in concept. The evaluation also shows that CDBD achieves a considerable label saving, as it only requires labelled data when a change in concept has been detected.
The two concept drift handling approaches deal with concept drift in different ways: DVS continuously adapts the classifier, whereas CDBD only adapts the classifier when a sizeable change in concept is suspected. They reflect a divide also found in the literature, between continuous rebuild approaches (like DVS) and triggered rebuild approaches (like CDBD). The final major contribution of this thesis is a comparison between continuous and triggered rebuild approaches, as this is an underexplored area. An empirical comparison between representative techniques of both types shows that triggered rebuild works slightly better on large datasets where changes in concept occur infrequently, but in general a continuous rebuild approach works best.
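The thesis's exact CDBD heuristic is not reproduced here; the sketch below captures the underlying idea under the assumption that drift is flagged when the distribution of the classifier's confidence scores in a new batch diverges from a reference batch (KL divergence over histogram bins; the threshold value is illustrative):

```python
import math

def confidence_histogram(scores, bins=10):
    """Histogram of confidence scores in [0, 1], lightly smoothed so that
    empty bins do not break the divergence computation."""
    counts = [1e-6] * bins
    for s in scores:
        counts[min(int(s * bins), bins - 1)] += 1
    total = sum(counts)
    return [c / total for c in counts]

def drift_detected(reference_scores, batch_scores, threshold=0.2):
    """Signal drift when the KL divergence between the reference and the
    current confidence distribution exceeds a (hypothetical) threshold."""
    p = confidence_histogram(reference_scores)
    q = confidence_histogram(batch_scores)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return kl > threshold
```

The appeal of this style of detector for the expensive-label setting is that it needs no labels at all to decide *when* to rebuild; labels are only requested once drift has been signalled.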

    A Comprehensive Survey on Rare Event Prediction

    Rare event prediction involves identifying and forecasting events with a low probability using machine learning and data analysis. Due to the imbalanced data distributions, where the frequency of common events vastly outweighs that of rare events, it requires specialized methods within each step of the machine learning pipeline, i.e., from data processing to algorithms to evaluation protocols. Predicting the occurrence of rare events is important for real-world applications, such as Industry 4.0, and is an active research area in statistics and machine learning. This paper comprehensively reviews current approaches for rare event prediction along four dimensions: rare event data, data processing, algorithmic approaches, and evaluation approaches. Specifically, we consider 73 datasets from different modalities (i.e., numerical, image, text, and audio), four major categories of data processing, five major algorithmic groupings, and two broader evaluation approaches. This paper aims to identify gaps in the current literature, highlight the challenges of predicting rare events, and suggest potential research directions to guide practitioners and researchers.
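A small worked example of why rare-event prediction needs specialized evaluation protocols: at a 1% event rate, a classifier that never predicts the event still scores 99% accuracy, which recall and balanced accuracy immediately expose (the metric set below is a common minimal choice, not the survey's full protocol):

```python
def rare_event_report(y_true, y_pred):
    """Accuracy versus recall-oriented metrics on a rare-event problem
    (label 1 = rare event, label 0 = common case)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    accuracy = (tp + tn) / len(y_true)
    recall = tp / (tp + fn) if tp + fn else 0.0          # rare-event hit rate
    specificity = tn / (tn + fp) if tn + fp else 0.0     # common-case hit rate
    return {"accuracy": accuracy, "recall": recall,
            "balanced_accuracy": (recall + specificity) / 2}
```

On a stream of 100 samples with one event, the always-negative predictor gets accuracy 0.99 but recall 0.0, so accuracy alone says nothing about rare-event performance.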

    Game-Theoretic and Machine-Learning Techniques for Cyber-Physical Security and Resilience in Smart Grid

    The smart grid is the next-generation electrical infrastructure utilizing Information and Communication Technologies (ICTs), whose architecture is evolving from a utility-centric structure to a distributed Cyber-Physical System (CPS) integrated with large-scale renewable energy resources. However, meeting reliability objectives in the smart grid becomes increasingly challenging owing to the high penetration of renewable resources and changing weather conditions. Moreover, cyber-physical attacks targeting the smart grid have become a major threat, because millions of electronic devices interconnected via communication networks expose unprecedented vulnerabilities, thereby increasing the potential attack surface. This dissertation is aimed at developing novel game-theoretic and machine-learning techniques for addressing the reliability and security issues residing at multiple layers of the smart grid, including power distribution system reliability forecasting, risk assessment of cyber-physical attacks targeted at the grid, and cyber attack detection in the Advanced Metering Infrastructure (AMI) and renewable resources. This dissertation first comprehensively investigates the combined effect of various weather parameters on the reliability performance of the smart grid, and proposes a multilayer perceptron (MLP)-based framework to forecast the daily number of power interruptions in the distribution system using time series of common weather data. To evaluate the risk of cyber-physical attacks faced by the smart grid, a stochastic budget allocation game is proposed to analyze the strategic interactions between a malicious attacker and the grid defender. A reinforcement learning algorithm is developed to enable the two players to reach a game equilibrium, from which the optimal budget allocation strategies of the two players, in terms of attacking/protecting the critical elements of the grid, can be obtained.
In addition, the risk of a cyber-physical attack can be derived from the probability of successful attacks on various grid elements. Furthermore, this dissertation develops a multimodal data-driven framework for cyber attack detection in power distribution systems integrated with renewable resources. This approach introduces sparse feature learning into an ensemble classifier to improve detection efficiency, and implements spatiotemporal correlation analysis to differentiate attacked renewable energy measurements from fault scenarios. Numerical results based on the IEEE 34-bus system show that the proposed framework achieves the most accurate detection of cyber attacks reported in the literature. To address electricity theft in the AMI, a Distributed Intelligent Framework for Electricity Theft Detection (DIFETD) is proposed, which is equipped with Benford's analysis for initial diagnostics on large smart meter data. A Stackelberg game between the utility and multiple electricity thieves is then formulated to model electricity theft actions. Finally, a Likelihood Ratio Test (LRT) is utilized to detect potentially fraudulent meters.
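The dissertation's DIFETD pipeline is not reproduced here, but the Benford's-analysis diagnostic step it begins with can be sketched: compare the first-digit distribution of meter readings against the one predicted by Benford's law and flag large deviations for further testing (the deviation measure and threshold usage are illustrative):

```python
import math

def first_digit(x):
    """Leading significant digit of a positive number."""
    while x >= 10:
        x /= 10
    while x < 1:
        x *= 10
    return int(x)

def benford_deviation(readings):
    """Mean absolute deviation between the observed first-digit frequencies
    of the readings and the frequencies predicted by Benford's law,
    log10(1 + 1/d) for digit d."""
    expected = [math.log10(1 + 1 / d) for d in range(1, 10)]
    counts = [0] * 9
    for r in readings:
        counts[first_digit(r) - 1] += 1
    n = len(readings)
    return sum(abs(c / n - e) for c, e in zip(counts, expected)) / 9
```

Naturally occurring consumption data tend to follow Benford's law fairly closely, while tampered readings clustered in a narrow range do not, so a large deviation is a cheap trigger for the more expensive game-theoretic and LRT stages.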

    Novel Algorithm-Level Approaches for Class-Imbalanced Machine Learning

    Machine learning classifiers are designed under the assumption of a roughly balanced number of instances per class. However, in many real-world applications this is far from true. This thesis explores adaptations of neural networks which are robust to class-imbalanced datasets, do not involve data manipulation, and are flexible enough to be used with any model architecture or framework. The thesis explores two complementary approaches to the problem of class imbalance. The first exchanges conventional choices of classification loss function, which are fundamentally measures of how far network outputs are from desired ones, for ones that instead primarily register whether outputs are right or wrong. The construction of these novel loss functions involves the concept of an approximated confusion matrix, another use of which is to generate new performance metrics, especially useful for monitoring validation behaviour on imbalanced datasets. The second approach changes the form of the output layer activation function to one with a threshold which can be learned so as to more easily classify the more difficult minority class. These two approaches can be used together or separately, with the combined technique being promising for cases of extreme class imbalance. While the methods are developed primarily for binary classification scenarios, as these are the most numerous in the applications literature, the novel loss functions introduced here are also demonstrated to be extensible to the multi-class scenario.
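The thesis's exact loss functions are not given in the abstract; the sketch below illustrates the approximated-confusion-matrix idea only: soft confusion entries are accumulated from predicted probabilities rather than hard 0/1 decisions, so a right/wrong-style measure such as the G-mean of sensitivity and specificity becomes a differentiable loss (names and the particular loss are illustrative):

```python
def soft_confusion(y_true, p_pred):
    """Approximated (differentiable) binary confusion matrix built from
    predicted probabilities instead of thresholded decisions."""
    tp = sum(p for t, p in zip(y_true, p_pred) if t == 1)
    fn = sum(1 - p for t, p in zip(y_true, p_pred) if t == 1)
    fp = sum(p for t, p in zip(y_true, p_pred) if t == 0)
    tn = sum(1 - p for t, p in zip(y_true, p_pred) if t == 0)
    return tp, fp, fn, tn

def soft_gmean_loss(y_true, p_pred, eps=1e-9):
    """1 minus the geometric mean of soft sensitivity and soft specificity:
    a loss that registers per-class rightness/wrongness rather than the
    distance of each output from its target."""
    tp, fp, fn, tn = soft_confusion(y_true, p_pred)
    sensitivity = tp / (tp + fn + eps)
    specificity = tn / (tn + fp + eps)
    return 1 - (sensitivity * specificity) ** 0.5
```

Because each class contributes through its own rate, a model that ignores the minority class is penalized maximally regardless of how few minority samples there are, which is the property a distance-based loss lacks under imbalance.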

    Solving the challenges of concept drift in data stream classification.

    The rise of network-connected devices and applications has led to a significant increase in the volume of data that are continuously generated over time, called data streams. In real-world applications, storing the entirety of a data stream for later analysis is often impractical due to its potentially infinite volume. Data stream mining techniques and frameworks have therefore been created to analyze streaming data as they arrive. However, compared to traditional data mining, challenges unique to data stream mining emerge from the high arrival rate of data streams and their dynamic nature. In this dissertation, an array of techniques and frameworks is presented to improve solutions to some of these challenges. First, this dissertation acknowledges that a "no free lunch" theorem exists for data stream mining: no silver-bullet solution can solve all of its problems. The dissertation focuses on detecting changes in data distribution, called concept drift. Concept drift can be categorized into many types, and a detection algorithm often works only on some of them. Because of this, the dissertation develops specific techniques for specific challenges instead of seeking a general solution.
Next, this dissertation considers the challenges posed by the high arrival rate of data streams. Data stream mining frameworks often need to process vast amounts of data samples in limited time, and some data mining activities, notably data sample labeling for classification, are too costly or too slow at such scale. This dissertation presents two techniques that reduce the amount of labeling needed for data stream classification. The first is a grid-based label selection process for highly imbalanced data streams, in which one class of data samples vastly outnumbers another; due to the imbalance, many majority-class samples would otherwise need to be labeled before a minority-class sample is found. The presented technique divides the data samples into groups, called grids, and actively searches for nearby minority-class samples within a grid. Experiment results show the technique reduces the total number of data samples that need to be labeled. The second is a smart preprocessing technique that reduces the number of times a new learning model needs to be trained due to concept drift; less model training means fewer labels are required, and thus lower cost. Experiment results show that in some cases the reduced performance of learning models is the result of improper preprocessing of the data, not of concept drift. By adapting preprocessing to the changes in data streams, models can retain high performance without retraining. Acknowledging the high cost of labeling, the dissertation then considers the scenario where labels are unavailable when needed. The Sliding Reservoir Approach for Delayed Labeling (SRADL) framework is presented to address the delayed labeling problem, where concept drift occurs and no labels are immediately available. SRADL uses semi-supervised learning, employing a sliding-window approach to store historical data, which is combined with newly arrived unlabeled data to train new models. Experiments show that SRADL performs well in some cases of delayed labeling. Next, the dissertation considers the dynamism within data streams, most notably concept drift. The complex nature of concept drift means that most existing detection algorithms can only detect limited types of it. To detect more types, an ensemble approach that employs various algorithms, called the Heuristic Ensemble Framework for Concept Drift Detection (HEFDD), is presented.
The occurrence of each type of concept drift is voted on using the detection results of each algorithm in the ensemble, and a type of drift is declared detected when its votes pass a majority. Experiment results show that HEFDD improves detection accuracy significantly while reducing false positives. With the ability to detect various types of concept drift provided by HEFDD, the dissertation then improves the delayed labeling framework SRADL. A new combined framework, SRADL-HEFDD, is presented, which produces synthetic labels to handle the unavailability of labels from human experts, employing different synthetic labeling techniques depending on the type of drift detected by HEFDD. Experimental results show that, compared to the default SRADL, the combined framework improves prediction performance when only a small number of labeled samples is available. Finally, as machine learning applications are increasingly used in critical domains such as medical diagnostics, the accountability, explainability, and interpretability of machine learning algorithms need to be considered. Explainable machine learning aims at a white-box approach to data analytics, enabling learning models to be explained and interpreted by human users. However, few studies have examined explaining what has changed in a dynamic data stream environment. This dissertation thus presents the Data Stream Explainability (DSE) framework. DSE visualizes changes in data distribution and model classification boundaries between chunks of streaming data; the visualizations can then be used by a data mining researcher to generate explanations of what has changed within the data stream. To show that DSE can help average users better understand data stream mining, a survey was conducted with an expert group and a non-expert group of users. Results show DSE can narrow the gap in understanding of what changed in data stream mining between the two groups.
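The HEFDD voting step described above can be sketched as follows (the detector outputs and drift-type names are illustrative, not the dissertation's actual detectors):

```python
def vote_drift_types(detections):
    """detections: one dict per detector, mapping drift-type name -> bool.
    A drift type is declared detected when strictly more than half of the
    detectors in the ensemble voted for it."""
    n = len(detections)
    all_types = {t for d in detections for t in d}
    return {t: sum(d.get(t, False) for d in detections) > n / 2
            for t in all_types}
```

Requiring a majority rather than a single positive vote is what trades a little sensitivity for the reduction in false positives the abstract reports.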

    Misperceptions of Uncertainty and Their Applications to Prevention

    This thesis studies how people misperceive risk and uncertainty, and how this cognitive bias affects individuals' preventive actions. Chapter 1 shows, by means of a lab experiment, that how rare events are presented affects how large people perceive them to be: people perceive rare events as larger than they actually are when those events are presented separately rather than all together. Chapter 2 shows theoretically that the same phenomenon, probability weighting, makes people both over-insure and under-invest in prevention. Chapter 3, with an application to cybersecurity, analyzes a field-experiment intervention aimed at increasing prevention at the organizational level; I test whether communicating information more effectively, or letting employees experience a simulated phishing attack, helps reduce susceptibility to phishing attacks. Chapter 4 deals with the issue that people's judgements of risk may differ across contexts: in a lab experiment, it shows that sexual context has an impact on ambiguity attitudes.

    Improvement of MS-based e-nose performances by incorporation of chromatographic retention time as a new data dimension

    The importance of the sense of smell in nature and in human society is evident from the great interest in odor and taste analysis in the food industry. Although food and beverage are the areas most interested, the need for this technology has also been shown in other fields, such as cosmetics. Unfortunately, human sensory panels and canine panels are expensive, prone to fatigue, subjective, unreliable, and unsuitable for quantification, while laboratory analysis, despite its precision, impartiality, and quantitative capability, is labor-intensive, requires specialized personnel, and is time-consuming. Because of these drawbacks, the concept of artificial olfaction has generated great interest in industrial settings.
The term "electronic nose" refers to an array of chemical gas sensors with broadly overlapping selectivities for measuring volatile compounds, combined with computational data-analysis instruments. The electronic nose provides comparative rather than qualitative information in an analysis, and because interpretation can be automated, the device is suitable for quality control and analysis. Despite some promising achievements, solid-state gas sensors have not met expectations: low sensitivity and selectivity, short sensor lifetime, difficult calibration, and drift problems have proven to be serious limitations. In an effort to overcome the drawbacks of solid-state sensors, new approaches using different sensors for the electronic nose have been adopted; optical sensor systems, ion mobility spectrometry, and infrared spectrometry are examples of techniques that have been tried.
Electronic noses based on mass spectrometry (MS) first appeared in 1998 [B. Dittmann, S. Nitz and G. Horner, Adv. Food Sci. 20 (1998), p. 115] and represent a major leap in sensitivity, challenging the electronic nose based on chemical sensors. This new approach to the electronic-nose concept uses virtual sensors in the form of m/z ratios. A complex and highly reproducible fingerprint is obtained in the form of a mass spectrum, which is processed by pattern-recognition algorithms for classification and quantification. Although the MS-based electronic nose outperforms the classical solid-state-sensor electronic nose in many respects, its use is currently limited to benchtop laboratory instrumentation. The lack of portability will not necessarily be a problem in the future, since miniature mass spectrometers have already been built at the prototype stage. A more critical drawback of the MS-based electronic nose is the way samples are analyzed: simultaneous fragmentation of complex mixtures of isomers can produce very similar results. A better electronic nose would combine the sensitivity and identification power of the mass detector with the separation capability of gas chromatography. The main drawback of this approach is again the cost and lack of portability of the equipment, and in addition to the above problems with mass spectrometry, gas chromatography analysis requires long measurement times. To address these issues, miniaturizations of capillary gas chromatography (GC) have been reported that make possible GC-on-a-chip, fast GC, and flash GC, which use short columns, reducing analysis times to elution times of seconds, and which have in some cases been commercialized. Miniaturization of mass spectrometry and gas chromatography has great potential to improve the performance, utility, and accessibility of the next generation of electronic noses.
This thesis is devoted to the study and evaluation of the GC-MS approach to the electronic nose as a step preceding the development of the technologies mentioned above. The main objective of the thesis is to study whether the retention time of a chromatographic separation can improve the performance of the MS-based electronic nose, showing that the addition of a third dimension brings more information that helps classification. This can be done in two ways: by comparing two-way data analysis of the mass spectra with two-way analysis of unfolded and concatenated matrices derived from the three-way data, and by comparing two-way analysis of the mass spectra with three-way analysis of the three-dimensional dataset. From the chromatographic point of view, the goal is to optimize the chromatographic method so as to reduce analysis time to a minimum while still obtaining acceptable results. An important step in multi-way multivariate data analysis is data preprocessing, so a final objective is to determine which preprocessing techniques are best for two- and three-way data analysis.
To achieve these objectives, two datasets were created. The first consists of mixtures of nine isomers of dimethylphenol and ethylphenol, chosen because their mass spectra are very similar to one another, so that the MS-based electronic nose would be challenged by the dataset; the solutions were also prepared so that, considering the retention times of the nine isomers alone, the dataset would be challenging if only retention time were used. This "artificial" dataset therefore supports our aim of showing the improvement gained by using both dimensions, MS (mass spectra) and GC (retention time). Twenty classes, representing solutions of the nine isomers, were measured in ten replicates each, using three chromatographic methods, giving a total of 600 measurements. The chromatographic methods were designed to give a fully resolved chromatogram, a co-eluted peak, and an intermediate situation with a partially resolved chromatogram. The data were recorded in a three-dimensional array with the directions (measured samples) x (m/z ratio) x (retention time). By "collapsing" the chromatographic retention-time and m/z-fragment axes, two matrices were obtained, representing the conventional mass spectrum and the total ion chromatogram (TIC), respectively. These approaches discard the information brought by the third dimension, so unfolding of the original 3D array, and concatenation of the TIC and the mean mass spectrum, were considered as ways of preserving the extra third-dimension information in a two-dimensional matrix.
The data were treated by peak alignment, mean centering, and normalization by maximum peak height and by peak area, and these preprocessing tools were also evaluated for their performance. For two-way data analysis, PCA, PLS-DA, and fuzzy ARTMAP were used. PCA and PARAFAC clustering were evaluated by the between-class to within-class variance ratio, while fuzzy ARTMAP results were given as classification success rates in percentages. When PCA and PARAFAC were used, as expected, the resolved chromatographic method (method 1) gave the best overall results, where the 2D algorithms work better, whereas in a more complicated case (the more co-eluted peaks of method 3) they lose effectiveness against the 3D methods. For PLS-DA and n-PLS, although the results are not as conclusive as those of PCA and PARAFAC, the differences being minimal, the multi-way PLS-DA model offers a higher prediction success rate on both datasets. n-PLS is also recommended over using unfolded and concatenated data, since it builds a more parsimonious model. For the fuzzy ARTMAP analysis, the voting strategy employed showed that using the mean mass spectra and the total ion chromatogram information together yields more consistent results.
The second dataset addresses the adulteration of extra virgin olive oil with hazelnut oil, which, owing to the similarity between the two oils, is one of the hardest to detect. Four extra virgin olive oils and two hazelnut oils were measured pure and in mixtures of 30%, 10%, 5%, and 2%, with the same objectives, showing that adding the extra dimension improves the results. Five replicates were made of each preparation, giving a total of 190 samples: 4 pure olive oils, 2 pure hazelnut oils, and 32 adulterations of hazelnut oil in olive oil, for a total of 38 classes. Two chromatographic methods were used: the first aimed at a complete separation of the olive oil components and employed a temperature-programmed separation, while the goal of the second was a co-eluted peak, so a constant-temperature separation was employed. The data were analyzed by PCA, PARAFAC, PLS-DA, and n-PLS. As with the "artificial" dataset, PCA and PARAFAC were assessed by clustering ability, which showed that the best results are obtained with the unfolded data, followed by the 3D data treated with PARAFAC. From the column-optimization point of view, the performance of the short column falls below that of the long-column approach, but this case shows once again that adding the third dimension improves the MS-based electronic nose. For PLS-DA and n-PLS, success rates were evaluated comparatively for both the long and the short chromatographic runs. While for the long column the best performance comes from the total ion chromatogram (TIC) data, the short column performs best with the concatenated mean-mass-spectra and TIC data; moreover, the prediction success rates for the short-column concatenated data equal those for the long-column TIC data. This case is very interesting because it shows that, for PLS, the third dimension improves the results and, furthermore, that by using the short column the analysis time is shortened considerably.
Certain achievements are expected of the electronic nose, and so far none of these approaches has come close enough to produce a positive market response. Solid-state sensors have drawbacks that are almost impossible to overcome; the MS-based electronic nose lacks portability and its performance is sometimes insufficient; and the GC-MS instrument suffers the same portability problems as the mass spectrometer and takes a long time. The development of powerful mathematical algorithms over recent years, together with advances in miniaturization of both MS and GC and in fast chromatography, offers some hope of a much better electronic nose. Through this work we can state that adding chromatographic retention time as an extra dimension provides an advantage over current electronic-nose technologies. While for fully resolved chromatograms no improvement is achieved, or the gain is minimal, especially in prediction, for a short column the extra information improves the results, in some cases making them as good as when a long column is used. This is very important, since measurements on a GC-MS can then be optimized for very short runs, a key feature for an electronic nose, which would allow the design of a higher-throughput instrument suitable for quality control on product lines.
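The collapsing, unfolding, and concatenation of the (samples) x (m/z) x (retention time) array described above can be sketched with plain nested lists (function names are illustrative; the thesis's actual preprocessing also includes alignment and normalization):

```python
def unfold(cube):
    """Unfold a (samples x mz x rt) array into a 2-D matrix, one long row
    per sample, preserving the third-dimension information."""
    return [[v for mz_row in sample for v in mz_row] for sample in cube]

def collapse(cube):
    """Collapse the cube into the two classical 2-D views: the mean mass
    spectrum (summed over retention time) and the total ion chromatogram
    (summed over m/z)."""
    spectra = [[sum(mz_row) for mz_row in sample] for sample in cube]
    tic = [[sum(sample[m][t] for m in range(len(sample)))
            for t in range(len(sample[0]))] for sample in cube]
    return spectra, tic

def concatenate(spectra, tic):
    """Concatenate mean spectrum and TIC per sample: a 2-D representation
    that retains part of the third-dimension information."""
    return [s + c for s, c in zip(spectra, tic)]
```

Either collapsed view alone discards one axis entirely; the unfolded matrix keeps every (m/z, retention time) pair, and the concatenated representation is the cheaper middle ground compared in the thesis.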