    Innovative Techniques for the Retrieval of Earth’s Surface and Atmosphere Geophysical Parameters: Spaceborne Infrared/Microwave Combined Analyses

    With the advent of the first Earth observation satellites, Landsat-1 in July 1972 and ERS-1 in July 1991, environmental remote sensing has become increasingly fundamental to the study of the phenomena that characterize planet Earth. Its goal is to analyze in detail, and to monitor over time, different physical phenomena by exploiting the interaction between the objects in an observed scene and the electromagnetic radiation detected by sensors that are placed at a distance from the scene and operate at different frequencies. The phenomena analyzed include climate change, weather, global ocean circulation, greenhouse gas profiles, earthquakes, volcanic eruptions, soil subsidence, and the effects of rapid urbanization. Remote sensing sensors are of two primary types: active and passive. An active sensor uses its own source of electromagnetic radiation: it emits radiation toward the area under investigation and then detects and measures the radiation backscattered by the objects in that area. Passive sensors, on the other hand, detect natural electromagnetic radiation (e.g., from the Sun in the visible band and from the Earth in the infrared and microwave bands) emitted or reflected by the objects in the observed scene. The scientific community has dedicated many resources to developing techniques to estimate, study and analyze Earth’s geophysical parameters. These techniques differ for active and passive sensors because they depend strictly on the type of physical quantity measured. My Ph.D. work addresses inversion techniques for estimating geophysical parameters of Earth’s surface and atmosphere, with an emphasis on methods based on machine learning (ML).
In particular, the study of cloud microphysics and the characterization of changes in Earth’s surface are the focal points of this work.

    Quantifying correlations between galaxy emission lines and stellar continua

    We analyse the correlations between continuum properties and emission line equivalent widths of star-forming and active galaxies from the Sloan Digital Sky Survey. Since upcoming large sky surveys will make broad-band observations only, including strong emission lines in theoretical modelling of spectra will be essential for estimating the physical properties of photometric galaxies. We show that emission line equivalent widths can be fairly well reconstructed from the stellar continuum using local multiple linear regression in the continuum principal component analysis (PCA) space. Line reconstruction is good for star-forming galaxies and reasonable for galaxies with active nuclei. We propose a practical method to combine stellar population synthesis models with empirical modelling of emission lines. The technique will help generate more accurate model spectra and mock catalogues of galaxies to fit observations of the new surveys. More accurate modelling of emission lines is also expected to improve template-based photometric redshift estimation methods. We also show that, by combining PCA coefficients from the pure continuum and the emission lines, hosts of weak active galactic nuclei (AGNs) can be automatically distinguished from quiescent star-forming galaxies. The classification method is based on a training set consisting of high-confidence starburst galaxies and AGNs, and separates active and star-forming galaxies similarly to the empirical curve found by Kauffmann et al. We demonstrate the use of three important machine learning algorithms in the paper: k-nearest neighbour finding, k-means clustering and support vector machines. Comment: 14 pages, 14 figures. Accepted by MNRAS on 2015 December 22. The paper's website with data and code is at http://www.vo.elte.hu/papers/2015/emissionlines
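
The local regression step described above can be sketched as follows. This is a minimal illustration on synthetic data, not the paper's pipeline: the array names (`continua`, `ew`), the number of principal components and the neighbourhood size `k` are all assumptions chosen for the example.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the mean and the leading principal axes of X (rows = spectra)."""
    mean = X.mean(axis=0)
    # SVD of the centred data; the rows of vt are the principal axes.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def local_linear_predict(coeffs_train, y_train, coeff_query, k):
    """Fit a linear model on the k nearest training points and evaluate it at the query."""
    dist = np.linalg.norm(coeffs_train - coeff_query, axis=1)
    idx = np.argsort(dist)[:k]
    # Design matrix with an intercept column.
    A = np.hstack([np.ones((k, 1)), coeffs_train[idx]])
    beta, *_ = np.linalg.lstsq(A, y_train[idx], rcond=None)
    return float(np.concatenate([[1.0], coeff_query]) @ beta)

# Synthetic stand-in for continuum spectra and one equivalent width (assumed data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 40))
continua = latent @ mixing + 0.01 * rng.normal(size=(500, 40))
ew = np.sin(latent[:, 0]) + 0.5 * latent[:, 1]  # nonlinear in the latent variables

mean, axes = pca_fit(continua, 3)
coeffs = (continua - mean) @ axes.T              # PCA coefficients of every spectrum
train, test = slice(0, 400), slice(400, 500)
pred = np.array([local_linear_predict(coeffs[train], ew[train], c, k=30)
                 for c in coeffs[test]])
rmse = float(np.sqrt(np.mean((pred - ew[test]) ** 2)))
```

Fitting a separate linear model around each query point lets the reconstruction follow a relation that is only locally linear in PCA space, at the cost of one small least-squares solve per galaxy.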

    New understanding of gas hydrate phenomena and natural inhibitors in crude oil systems through mass spectrometry and machine learning

    Gas hydrates represent one of the main flow assurance issues in the oil and gas industry, as they can cause complete blockage of pipelines and process equipment, forcing shutdowns. Previous studies have shown that some crude oils form hydrates that do not agglomerate or deposit, but remain as transportable dispersions. This is commonly believed to be due to naturally occurring components in the crude oil; however, despite decades of research, their exact structures have not yet been determined. Some studies have suggested that these components are present in the acid fractions of the oils or are related to the asphaltene content of the oils. Crude oils are among the world's most complex organic mixtures and can contain up to 100 000 different constituents, making them difficult to characterise using traditional mass spectrometers. Fourier Transform Ion Cyclotron Resonance Mass Spectrometry (FT-ICR MS) offers higher mass accuracy and resolution than traditional techniques, which allows it to characterise crude oils to a greater extent and possibly identify hydrate active components. FT-ICR MS spectra usually contain tens of thousands of peaks, so data treatment methods able to find underlying relationships in big data sets are required. Machine learning and multivariate statistics include many methods suitable for big data. A literature review identified a number of promising methods and assessed the current status of machine learning for the analysis of gas hydrates and FT-ICR MS data. The review revealed that although many studies have used machine learning to predict thermodynamic properties of gas hydrates, very little work has been done on analysing gas hydrate related samples measured by FT-ICR MS. To aid the identification of hydrate active components, a successive accumulation procedure for increasing their concentrations was developed by SINTEF.
Comparison of the mass spectra from spiked and unspiked samples revealed some peaks that increased in intensity with the spiking level. Several classification methods were used in combination with variable selection, and peaks related to hydrate formation were identified. The corresponding molecular formulas were determined, and the peaks were assumed to be related to asphaltenes, naphthenes and polyethylene glycol. To aid the characterisation of the oils, infrared spectroscopy (both Fourier Transform infrared and near infrared) was combined with FT-ICR MS in a multiblock analysis to predict the density of crude oils. Two different strategies for data fusion were attempted, and sequential fusion of the blocks achieved the highest prediction accuracy both before and after reducing the dimensions of the data sets by variable selection. Because crude oils have such complex matrices, samples are often very different, and many methods cannot handle high degrees of variation or non-linearities between the samples. Hierarchical cluster-based partial least squares regression (HC-PLSR) clusters the data and builds local models within each cluster. HC-PLSR can thus handle non-linearities between clusters, but as PLSR is a linear model the data are still required to be locally linear. HC-PLSR was therefore extended to deep learning, with convolutional and recurrent neural networks (HC-CNN and HC-RNN), and to support vector regression (HC-SVR). The deep learning-based models outperformed HC-PLSR for a data set predicting average molecular weights from hydrolysed raw materials. The analysis of the FT-ICR MS spectra revealed that the large amount of information contained in the data (due to the high resolution) can disturb the predictive models, but the use of variable selection counteracts this effect.
Several methods from machine learning and multivariate statistics were proven valuable for prediction of various parameters from FT-ICR MS using both classification and regression methods.
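
The cluster-then-local-model idea behind HC-PLSR can be illustrated with a deliberately simplified sketch. It is not the method used in the thesis: plain k-means stands in for the hierarchical clustering step and ordinary least squares stands in for PLSR, and the data, function names and cluster count are assumptions made for the example.

```python
import numpy as np

def nearest(X, centers):
    """Index of the closest centre for every row of X."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(d2, axis=1)

def kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means (stand-in for the clustering step); returns the centres."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        labels = nearest(X, centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def fit_linear(X, y):
    """Least-squares linear model with intercept (stand-in for PLSR)."""
    A = np.hstack([np.ones((len(X), 1)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta

def predict_linear(beta, X):
    return np.hstack([np.ones((len(X), 1)), X]) @ beta

def fit_local_models(X, y, k):
    """Cluster the samples, then fit one linear model per cluster."""
    centers = kmeans(X, k)
    labels = nearest(X, centers)
    global_beta = fit_linear(X, y)  # fallback if a cluster ends up empty
    betas = [fit_linear(X[labels == j], y[labels == j]) if np.any(labels == j)
             else global_beta for j in range(k)]
    return centers, betas

def predict_local(centers, betas, X):
    """Route each sample to its nearest cluster and apply that cluster's model."""
    labels = nearest(X, centers)
    out = np.empty(len(X))
    for j, beta in enumerate(betas):
        mask = labels == j
        if np.any(mask):
            out[mask] = predict_linear(beta, X[mask])
    return out

# Toy data: globally non-linear, but linear within each half of the range.
rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(400, 1))
y = np.abs(X[:, 0]) + 0.05 * rng.normal(size=400)
X_train, y_train, X_test, y_test = X[:300], y[:300], X[300:], y[300:]

centers, betas = fit_local_models(X_train, y_train, k=2)
rmse_local = float(np.sqrt(np.mean((predict_local(centers, betas, X_test) - y_test) ** 2)))
rmse_global = float(np.sqrt(np.mean(
    (predict_linear(fit_linear(X_train, y_train), X_test) - y_test) ** 2)))
```

On this toy problem a single global linear model cannot follow the V-shaped response, while the per-cluster models can, which is the situation the abstract describes as non-linearity between clusters with local linearity inside them.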

    Mineral identification using data-mining in hyperspectral infrared imagery

    The geological applications of hyperspectral infrared imagery mainly consist of mineral identification, mapping, and core logging, performed with airborne or portable instruments. Finding mineral indicators offers considerable benefits for mineralogy and mineral exploration, which usually involve portable instruments and core logging. Moreover, the development of faster and more automated systems increases the precision of identifying mineral indicators and helps avoid misclassification. The objective of this thesis was therefore to create a tool that uses hyperspectral infrared imagery and processes the data through image analysis and machine learning methods to identify small mineral grains used as mineral indicators. Such a system could be applied in different circumstances to assist geological analysis and mineral exploration. The experiments were conducted under laboratory conditions in the long-wave infrared (LWIR, 7.7 μm to 11.8 μm), with an LWIR macro lens (to improve spatial resolution), an Infragold plate, and a heating source. The process began with a method for continuum removal. The approach applies non-negative matrix factorization (NMF), extracting a rank-1 NMF to estimate the down-welling radiance, and compares it with other conventional methods. The results indicate successful suppression of the continuum from the spectra, which enables the spectra to be compared with spectral libraries. Afterwards, to obtain an automated system, supervised and unsupervised approaches were tested for the identification of pyrope, olivine and quartz grains.
The results indicated that the unsupervised approach was more suitable because it does not depend on a training stage. Once these results were obtained, two algorithms were tested to create False Color Composites (FCC) by applying a clustering approach. The results of this comparison indicate significant computational efficiency (more than 20 times faster) and promising performance for mineral identification. Finally, the reliability of the automated LWIR hyperspectral mineral identification was tested, and the difficulty of identifying irregular grain surfaces and mineral aggregates was verified. For quantitative evaluation, the results were compared with two different ground truths (GT): rigid-GT (manual labelling of regions) and observed-GT (manual labelling of pixels). Using the observed-GT increased the accuracy by up to 1.5 times compared with the rigid-GT. The samples were also examined by Micro X-ray Fluorescence (XRF) and Scanning Electron Microscopy (SEM) in order to retrieve information on the mineral aggregates and the grain surfaces (biotite, epidote, goethite, diopside, smithsonite, tourmaline, kyanite, scheelite, pyrope, olivine, and quartz). The XRF imagery results were compared with automatic mineral identification techniques using ArcGIS; they showed promising performance for automatic identification and were used for GT validation. Overall, the four methods in this thesis (1. continuum removal methods; 2. classification or clustering methods for mineral identification; 3. two algorithms for clustering of mineral spectra; 4. reliability verification) represent beneficial methodologies for identifying minerals. These methods have the advantages of being non-destructive and relatively accurate and of having low computational complexity, so they might be used to identify and assess mineral grains under laboratory conditions or in the field.
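
A rank-1 NMF of the kind mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the data are synthetic, the variable names are hypothetical, and dividing each spectrum by the rank-1 estimate is a common way of normalising a continuum that is assumed here, since the abstract does not spell out how the estimate is applied.

```python
import numpy as np

def rank1_nmf(V, n_iter=100):
    """Rank-1 non-negative factorisation V ~= outer(w, h) by alternating least squares.

    For non-negative V and a rank of one, every least-squares update stays
    non-negative, so no explicit projection step is needed.
    """
    h = np.ones(V.shape[1])
    for _ in range(n_iter):
        w = V @ h / (h @ h)      # best w for the current h
        h = V.T @ w / (w @ w)    # best h for the current w
    return w, h

# Synthetic scene (assumed data): every pixel sees the same smooth continuum,
# scaled by a pixel-dependent gain, plus a small non-negative perturbation.
rng = np.random.default_rng(2)
n_pixels, n_bands = 25, 60
continuum = 1.0 + np.linspace(0.0, 1.0, n_bands) ** 2
gains = rng.uniform(0.5, 1.5, size=n_pixels)
V = np.outer(gains, continuum) + 0.01 * rng.random((n_pixels, n_bands))

w, h = rank1_nmf(V)
approx = np.outer(w, h)                       # rank-1 estimate of the shared background
rel_err = float(np.linalg.norm(V - approx) / np.linalg.norm(V))
ratio = V / approx                            # assumption: continuum removal by division
```

The vector `h` plays the role of the shared spectral shape and `w` the per-pixel scaling; whatever the rank-1 term cannot explain is left in `ratio` as the spectral detail to compare against a library.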

    Active Wavelength Selection for Chemical Identification Using Tunable Spectroscopy

    Spectrometers are the cornerstone of analytical chemistry. Recent advances in microoptics manufacturing provide lightweight and portable alternatives to traditional spectrometers. In this dissertation, we developed a spectrometer based on Fabry-Perot interferometers (FPIs). An FPI is a tunable optical filter that can scan only one wavelength at a time. Compared to traditional counterparts such as Fourier transform infrared spectroscopy (FTIR), however, FPIs provide lower resolution and a lower signal-to-noise ratio (SNR). Wavelength selection can help alleviate these drawbacks. Eliminating uninformative wavelengths not only speeds up the sensing process but also helps improve accuracy by avoiding nonlinearity and noise. Traditional wavelength selection algorithms follow a training-validation process, and thus they are only optimal for the target analyte. In chemical identification, however, the identity of the analyte is unknown. To address this issue, this dissertation proposes active sensing algorithms that select wavelengths online while sensing. These algorithms are able to generate analyte-dependent wavelengths. We envision these algorithms deployed on a portable chemical gas platform that has low-cost sensors and limited computation resources. We develop three algorithms focusing on three different aspects of the chemical identification problem. First, we consider the problem of single chemical identification. We formulate the problem as a typical classification problem in which each chemical is considered a distinct class. We use Bayesian risk as the utility function for wavelength selection, which calculates the misclassification cost between classes (chemicals), and we select the wavelength with the maximum reduction in the risk. We evaluate this approach on both synthesized and experimental data. The results suggest that active sensing outperforms the passive method, especially in a noisy environment.
Second, we consider the problem of chemical mixture identification. Since the number of potential chemical mixtures grows exponentially as the number of components increases, it is intractable to formulate all potential mixtures as classes. To circumvent this combinatorial explosion, we developed a multi-modal non-negative least squares (MM-NNLS) method that searches for multiple near-optimal solutions as an approximation of all the solutions. We project the solutions onto spectral space, calculate the variance of the projected spectra at each wavelength, and select the next wavelength using the variance as guidance. We validate this approach on synthesized and experimental data. The results suggest that active approaches are superior to their passive counterparts, especially when the condition number of the mixture grows larger (the analytes consist of more components, or the constituent spectra are very similar to each other). Third, we consider improving the computational speed for chemical mixture identification. MM-NNLS scales poorly as the chemical mixture becomes more complex. Therefore, we develop a wavelength selection method based on Gaussian process regression (GPR). GPR aims to reconstruct the spectrum rather than solve the mixture problem; thus, its computational cost is a function of the number of wavelengths. We evaluate the approach on both synthesized and experimental data. The results again demonstrate more accurate and robust performance than passive algorithms.
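
The variance-guided selection step for mixtures can be sketched as below. The sketch covers only the selection rule: the set of near-optimal candidate solutions is taken as given (how MM-NNLS finds them is not shown), and the function name, the toy library and the candidate coefficients are assumptions made for illustration.

```python
import numpy as np

def next_wavelength(library, candidates, measured):
    """Pick the unmeasured wavelength where the candidate spectra disagree most.

    library:    (n_components, n_wavelengths) pure-component spectra
    candidates: (n_candidates, n_components) non-negative mixture coefficients
    measured:   indices of wavelengths that have already been scanned
    """
    spectra = candidates @ library       # each row is one candidate mixture spectrum
    variance = spectra.var(axis=0)       # disagreement between candidates per wavelength
    for j in measured:
        variance[j] = -np.inf            # never re-select a scanned wavelength
    return int(np.argmax(variance))

# Toy library with two components and four wavelengths (hypothetical values).
library = np.array([[1.0, 0.0, 2.0, 1.0],
                    [0.0, 1.0, 2.0, 3.0]])
# Two competing explanations of the data: pure component 0 or pure component 1.
candidates = np.array([[1.0, 0.0],
                       [0.0, 1.0]])

first = next_wavelength(library, candidates, measured=[])        # index 3 for this toy library
second = next_wavelength(library, candidates, measured=[first])
```

Wavelength 2 is never attractive in this example because both candidates predict the same value there, so measuring it could not tell them apart; the rule spends measurements where the competing explanations differ.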

    Current overview and way forward for the use of machine learning in the field of petroleum gas hydrates

    Gas hydrates represent one of the main flow assurance challenges in the oil and gas industry, as they can lead to plugging of pipelines and process equipment. In this paper we present a literature study performed to evaluate the current state of the use of machine learning methods within the field of gas hydrates, with a specific focus on the oil chemistry. A common analysis technique for crude oils is Fourier Transform Ion Cyclotron Resonance Mass Spectrometry (FT-ICR MS), which could be a good approach to achieving a better understanding of the chemical composition of hydrates; the use of machine learning in the field of FT-ICR MS was therefore also examined. Several machine learning methods were identified as promising, their use in the literature was reviewed, and a text analysis study was performed to identify the main topics within the publications. The literature search revealed only one publication that combines FT-ICR MS, machine learning and gas hydrates. Most of the work on gas hydrates is related to thermodynamics, while FT-ICR MS is mostly used for chemical analysis of oils. However, by combining FT-ICR MS and machine learning to evaluate samples related to gas hydrates, it could be possible to improve the understanding of the composition of hydrates and thereby identify the hydrate active compounds responsible for the differences between oils forming plugging hydrates and oils forming transportable hydrates.

    Raman spectroscopic characterization and analysis of agricultural and biological systems

    Technical progress over the past two decades in instrument design, laser and electronic technology, and computer-based data analysis has made Raman spectroscopy, a noninvasive, nondestructive optical molecular spectroscopic imaging technique, an attractive choice for analytical tasks. Raman spectroscopy provides chemical structural information at the molecular level with minimal sample preparation in a quick, easy-to-operate and reproducible fashion. In recent years it has been applied increasingly to the analysis and characterization of agricultural products and biological samples. This dissertation documents the innovative research in Raman spectroscopic characterization and analysis of both biomedical and agricultural systems that I carried out during my PhD training. The biomedical research focused on glaucoma, a chronic neurodegenerative disease characterized by apoptosis of retinal ganglion cells and subsequent loss of visual function. Early detection of pathological changes and progression in glaucoma and other neuroretinal diseases, which is critical for the prevention of permanent structural damage and irreversible vision loss, remains a great challenge. In my research, the Raman spectra from canine retinal tissues were subjected to multivariate discriminant analysis with a support vector machine algorithm to differentiate diseased tissues from healthy tissues. The high classification accuracy suggests that Raman spectroscopic screening can be used for in vitro detection of glaucomatous changes in retinal tissue, not only at a late stage but also at an early stage, with high specificity. To expand the scope of application of Raman analysis, it was also applied to characterize agricultural and food materials. More specifically, Raman spectroscopy was applied to analyze meat.
Existing objective methods (e.g., mechanical stress/strain analysis, near infrared spectroscopy) for predicting sensory attributes of pork generally do not correlate satisfactorily with panel evaluations. Raman spectroscopic methodology was investigated in this study to evaluate and predict the tenderness, juiciness and chewiness of fresh, uncooked pork loins from 169 pigs. The method developed in this thesis yielded good prediction of sensory attributes such as tenderness and chewiness, and it has the potential to become a rapid objective assay for tenderness and chewiness of pork products that may find practical applications in the pork industry. In addition, a Raman spectroscopic screening method in conjunction with discriminant modeling was developed for rapid evaluation of boar taint level in pork. The research presented in this dissertation shows that Raman spectroscopy has great potential to address analytical needs in new fields and to support innovative applications.

    SOM-based Peptide Prototyping for Mass Spectrometry Peak Intensity Prediction

    In today's bioinformatics, mass spectrometry (MS) is the key technique for the identification of proteins. Predicting spectrum peak intensities from precomputed molecular features would pave the way to a better understanding of spectrometry data and improved spectrum evaluation. We propose a neural network architecture of Local Linear Map (LLM) type, based on Self-Organizing Maps (SOMs), for peptide prototyping and for learning locally tuned regression functions for peak intensity prediction in MALDI-TOF mass spectra. We obtain results comparable to those obtained by nu-Support Vector Regression and show how the SOM learning architecture provides a basis for peptide feature profiling and visualisation.

    Machine learning in analytical chemistry: applying innovative data analysis methods using chromatographic techniques

    Master's dissertation in Chemical Analysis and Characterisation Techniques, Chemical Sciences. Scientific and technological advances have allowed a growing quantity of knowledge to be extracted from analysed samples by means of analytical techniques. Over the last few years, the dimensionality of the data produced by the most recent analytical techniques has become so high that its analysis is now called megavariate analysis. Recently, the use of machine learning tools in chemical data analysis has allowed the extraction of relevant information from samples at a level that was previously not possible. The objective of this work is to classify the manufacturing conditions of printed circuit boards based on data acquired by SLE-HPLC-ESI-MS (liquid chromatography coupled to mass spectrometry, with solid-liquid extraction). The dissertation is divided into two parts: the first summarises the work undertaken to ensure that the analytical method produces data of adequate quality, and the second shows the development of predictive models using the acquired data. In parallel, a data augmentation technique was developed which, to the best of our knowledge, is the first data augmentation technique for classification problems using chromatographic data. The best models show precisions above 94% for the prediction of all manufacturing conditions. Moreover, the developed data augmentation technique shows superior performance when compared to three other data augmentation techniques. In summary, the results show that, besides distinguishing classes with different chemical compositions, it is possible to obtain information about which chemical compounds differentiate the classes.
This information might be of significant importance for areas such as quality control, food chemistry, botany and the pharmaceutical industry. Funding: Fundação para a Ciência e Tecnologia, through project POCI-01-0145-FEDER-029147 - PTDC/FIS-PAR/29147/2017, financed by OE/FCT, Lisboa 2020, Compete 2020 POCI, Portugal 2020 FEDE
