1,107 research outputs found

    Species distribution models

    Species distribution models are a group of methods often used to estimate consequences of global change, to assess ecological status and for other ecological applications. The main idea behind species distribution models is that the geographical distributions of species can, to a large part, be explained by environmental factors and that species distributions therefore can be predicted in time or space. For robust and reliable applications, models need to be based on sound ecological principles, predictions need to be as accurate as possible, and model uncertainties need to be understood. Two approaches are available for modelling entire species communities: (1) each species can be modelled individually and independently of other species or (2) community information can be incorporated into the models. The first study in this thesis compares these two modelling approaches for predicting phytoplankton assemblages in lakes. The results showed that predictive accuracy was higher when species were modelled individually. The results also showed that phytoplankton can be used for model-based assessment of ecological status. This finding is important because phytoplankton is required for assessing the ecological status of European water bodies according to the European Water Framework Directive. Dispersal barriers in the landscape or limited dispersal ability of species might be a reason for species being absent from suitable habitats, and these factors might therefore affect model accuracy. The second study in this thesis examines the influence of dispersal and the spatial configuration of ecosystems on prediction accuracy of benthic invertebrate and phytoplankton distribution and assemblage composition. The results showed only a minor influence of spatial configuration and no effect of flight ability of invertebrates on model accuracy. 
However, the models used may partly account for dispersal constraints, since dispersal-related factors, such as lake surface area, are included as predictor variables. The results also showed that the composition of littoral invertebrate assemblages was easier to predict at sites located in well-connected lake systems, possibly because the relatively unstable littoral zone requires species to re-colonize disturbed habitats from source populations.

    Cluster analysis for outlier detection : A case study of applying unsupervised machine learning on diesel engine data

    With the advent of modern data-driven methods, engine manufacturers and maintainers are attempting to pivot from corrective to predictive maintenance. One way to achieve this goal is to install sensors on the engine and look for anomalies in the data patterns it produces. Companies such as Wärtsilä that provide condition monitoring services use the Fast Fourier Transform to manually look for anomalies in the data. The Edge-project is an industrial research project involving institutions such as universities and private companies, with the goal of developing technical solutions and edge analytics for autonomous devices and vessels. Several papers and theses have been written as a result of the project, using techniques such as autoencoders to perform anomaly detection on data produced by sensors on a diesel engine. This thesis explores the use of cluster analysis for anomaly detection on diesel engine data from the Edge-project. The clusters found could potentially represent different states of the running engine, with anomalies being represented e.g. by data points far away from cluster centroids, or data points not belonging to any particular cluster. The techniques of K-means, DBSCAN and spectral clustering are used for assigning clusters, with the silhouette coefficient and eigengap used as hyperparameter tuning heuristics. Distance from cluster centroids and reduced kernel density estimation are used to flag anomalies. t-SNE and Self-Organizing Maps are used as dimensionality reduction techniques to visualize the data in a 3-dimensional and a 2-dimensional space, respectively. Results show that which data points are flagged as anomalies is highly sensitive to the choice of algorithm and hyperparameters. The different methods suggest different data points as anomaly candidates. Therefore, further evaluation is needed from subject matter experts to determine which one of the models provides the most interesting results. 
Further work could include building an ensemble model that combines the approaches used here, which could flag certain areas of the data space as being at high risk of containing anomalies.
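
The distance-from-centroid idea described in this abstract can be sketched as follows. This is a minimal pure-Python illustration, not the thesis's implementation: the function names `kmeans` and `flag_anomalies`, the toy points, the fixed starting centroids, and the threshold of 5.0 are all assumptions made for the example.

```python
import math


def kmeans(points, init_centroids, iters=50):
    """Plain Lloyd's algorithm; starting centroids are passed in explicitly
    (an assumption of this sketch) so the result is deterministic."""
    centroids = [tuple(c) for c in init_centroids]
    k = len(centroids)
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its members.
        new_centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def flag_anomalies(points, centroids, threshold):
    """Return the points whose distance to the nearest centroid exceeds threshold."""
    return [
        p for p in points
        if min(math.dist(p, c) for c in centroids) > threshold
    ]


# Hypothetical toy data: two tight groups and one stray point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 20)]
centroids = kmeans(points, init_centroids=[(0, 0), (10, 10)])
anomalies = flag_anomalies(points, centroids, threshold=5.0)
```

Even in this toy case the stray point pulls its cluster's centroid toward itself, and a different threshold or starting centroid would change what gets flagged, which is consistent with the sensitivity to hyperparameters that the abstract reports.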

    Multi-tier framework for the inferential measurement and data-driven modeling

    A framework for inferential measurement and data-driven modeling has been proposed and assessed in several real-world application domains. The architecture of the framework has been structured in multiple tiers to facilitate extensibility and the integration of new components. Each of the four proposed tiers has been assessed independently to verify its suitability. The first tier, dealing with exploratory data analysis, has been assessed with the characterization of the chemical space related to the biodegradation of organic chemicals. This analysis has established relationships between physicochemical variables and biodegradation rates that have been used for model development. At the preprocessing level, a novel method for feature selection based on dissimilarity measures between Self-Organizing Maps (SOM) has been developed and assessed. The proposed method selected more features than others published in the literature but led to models with improved predictive power. Single and multiple data imputation techniques based on the SOM have also been used to recover missing data in a Waste Water Treatment Plant benchmark. A new dynamic method to adjust the centers and widths in Radial Basis Function networks has been proposed to predict water quality. The proposed method outperformed other neural networks. The proposed modeling components have also been assessed in the development of prediction and classification models for biodegradation rates in different media. The results obtained proved the suitability of this approach for developing data-driven models when the complex dynamics of the process prevent the formulation of mechanistic models. The use of rule generation algorithms and Bayesian dependency models has been preliminarily screened to provide the framework with interpretation capabilities. 
Preliminary results obtained from the classification of Modes of Toxic Action (MOA) indicate that using MOAs as proxy indicators of the human health effects of chemicals could be a promising approach. Finally, the complete framework has been applied to three different modeling scenarios. A virtual sensor system capable of inferring product quality indices from primary process variables has been developed and assessed. The system was integrated with the control system in a real chemical plant, outperforming the multi-linear correlation models usually adopted by chemical manufacturers. A model to predict carcinogenicity from molecular structure for a set of aromatic compounds has been developed and tested. Applying the SOM-dissimilarity feature selection method yielded better results than models published in the literature. Finally, the framework has been used to facilitate a new approach for environmental modeling and risk management within geographical information systems (GIS). The SOM has been successfully used to characterize exposure scenarios and to provide estimations of missing data through geographic interpolation. The combination of SOM and Gaussian Mixture models facilitated the formulation of a new probabilistic risk assessment approach.

    Decision support with data-analysis methods in a nuclear power plant

    Early fault detection is an important issue in the nuclear industry. Methods based on the self-organizing map (SOM) in dynamic systems are discussed and developed to help operators and plant experts in their decision making, and are used together with other methods. Visualization plays an important role in this research. Prototype systems are built to test the basic principles. Five different studies are presented in detail. This report summarizes test case 4 (TC4), "Decision support at a nuclear power plant", in the NoTeS and NoTeS2 projects in the TEKES MASI research program.

    Spatial and multidimensional analysis of the Dutch housing market using the Kohonen Map and GIS

    In this work we analyse general, spatially identifiable housing market data on Dutch districts (wijken) with the SOM (Kohonen Map) and a GIS. One of the authors has earlier carried out a purely visual SOM analysis of that data, where patterns formed on a larger ‘map’ (the output matrix of the SOM) were used as a basis for classifying the Dutch housing market segments on a nationwide level. In that analysis the SOM was used as a method for exploratory data analysis. Now we attempt a more rigorous method of determining the segmentation using a smaller ‘map’ size, in order to be able to export the SOM output directly to a GIS and analyse it further. Two technical issues interest us: one, the robustness of the results – do the five basic housing market segments found in the earlier analysis prevail (we call these urban, urban periphery, pseudo-rural, traditional, and low-income segments); and two, which classes fit the real situation better and which worse, when using the RMSE as a measure of goodness? We also keep an eye on policy implications and aim at comparing our classifications with the ‘actual’ ones used in official discourse.
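
A per-class RMSE comparison of the kind mentioned above could be computed along the following lines. The abstract does not say what the RMSE is taken between, so this sketch assumes it is the error between each district's feature vector and the prototype vector of its assigned segment. The function name `rmse_per_segment` and the toy numbers are hypothetical.

```python
import math
from collections import defaultdict


def rmse_per_segment(vectors, labels, prototypes):
    """Root-mean-square error between each feature vector and the prototype
    of its assigned segment, averaged over all features within a segment.
    (Assumed definition; the source does not specify one.)"""
    squared_error = defaultdict(float)
    n_values = defaultdict(int)
    for vec, label in zip(vectors, labels):
        proto = prototypes[label]
        squared_error[label] += sum((v - p) ** 2 for v, p in zip(vec, proto))
        n_values[label] += len(vec)
    return {
        label: math.sqrt(squared_error[label] / n_values[label])
        for label in squared_error
    }


# Hypothetical toy data: three districts, two features each.
vectors = [(1.0, 2.0), (3.0, 4.0), (0.0, 0.0)]
labels = ["urban", "urban", "pseudo-rural"]
prototypes = {"urban": (2.0, 3.0), "pseudo-rural": (0.0, 0.0)}

scores = rmse_per_segment(vectors, labels, prototypes)
```

Under this definition a lower score means the segment's prototype describes its districts more closely, so ranking segments by score gives one way to ask which classes fit better and which worse.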

    Unsupervised Feature Extraction Techniques for Plasma Semiconductor Etch Processes

    As feature sizes on semiconductor chips continue to shrink, plasma etching is becoming an increasingly critical process in achieving low-cost, high-volume manufacturing. Due to the highly complex physics of plasma and the chemical reactions between plasma species, control of plasma etch processes is one of the most difficult challenges facing the integrated circuit industry. This is largely due to the difficulty of monitoring plasmas. Optical Emission Spectroscopy (OES) technology can be used to produce rich plasma chemical information in real time and is increasingly being considered in semiconductor manufacturing for process monitoring and control of plasma etch processes. However, OES data is complex and inherently highly redundant, necessitating the development of advanced algorithms for effective feature extraction. In this thesis, three new unsupervised feature extraction algorithms have been proposed for OES data analysis, and the algorithm properties have been explored with the aid of both artificial and industrial benchmark data sets. The first algorithm, AWSPCA (Adaptive Weighting Sparse Principal Component Analysis), is developed for dimension reduction with respect to variations in the analysed variables. The algorithm generates sparse principal components while retaining orthogonality and grouping correlated variables together. The second algorithm, MSC (Max Separation Clustering), is developed for clustering variables with distinctive patterns and providing effective pattern representation by a small number of representative variables. The third algorithm, SLHC (Single Linkage Hierarchical Clustering), is developed to achieve a complete and detailed visualisation of the correlation between variables and across clusters in an OES data set. The developed algorithms open up opportunities for using OES data for accurate process control applications. 
For example, MSC enables the selection of relevant OES variables for better modeling and control of plasma etching processes. SLHC makes it possible to understand and interpret patterns in OES spectra and how they relate to the plasma chemistry. This in turn can help engineers to achieve an in-depth understanding of the underlying plasma processes.