
    Imputation of missing values using subspace methods (Puuttuvien arvojen korvaaminen aliavaruusmenetelmillä)

    In survey practice, as in many other data analysis tasks, missing values are commonly encountered. In this thesis, the missing value imputation task is studied using three subspace methods: principal component analysis (PCA), the Self-Organizing Map (SOM) and the Generative Topographic Mapping (GTM). The application area of interest is survey imputation, where imputation is conventionally conducted using, e.g., hot deck methods or multiple imputation by chained equations (MICE). Similarities and differences between imputation in survey practice and in recommendation systems are also discussed. The formalism behind missing value imputation is described together with the general mechanisms giving rise to missing data. A detailed review of the aforementioned subspace methods in the presence of missing data is given in order to motivate the contributed novelties and new implementations. The contributions of this thesis include (i) a novel way of treating missing data in the SOM algorithm, which is shown to improve properties of the model, (ii) a fine-tuned GTM, where the number of radial basis functions is increased during learning and the initialization is made using the SOM, and (iii) a novel regularization of the GTM for binary data. Experimental comparisons of existing and proposed methods are made using the wine data set and Likert-scale data from two wellbeing-related surveys. The variational Bayesian PCA is shown to be superior in the single imputation task. It also enables automatic relevance determination, i.e., automatic selection of the number of principal components needed. Finally, multiple imputation (MI) using the subspace methods and MICE is demonstrated. It is shown that, for survey data with less than 2 % missing data, all MI methods provide very similar population-level results.
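    As background for the subspace-imputation idea, the following is a minimal sketch of one common variant, iterative rank-k PCA imputation with NumPy. It is not the thesis's algorithm (in particular not the variational Bayesian PCA); the function name and parameters are illustrative assumptions only.

        import numpy as np

        def pca_impute(X, n_components=2, n_iter=100, tol=1e-9):
            """Fill NaN entries of X with an iterative rank-k PCA reconstruction."""
            X = np.asarray(X, dtype=float)
            missing = np.isnan(X)
            # Start from column-mean imputation, then alternate between fitting a
            # rank-k PCA model and overwriting the missing cells with its reconstruction.
            X_hat = np.where(missing, np.nanmean(X, axis=0), X)
            for _ in range(n_iter):
                mean = X_hat.mean(axis=0)
                U, s, Vt = np.linalg.svd(X_hat - mean, full_matrices=False)
                recon = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components] + mean
                delta = np.mean((X_hat[missing] - recon[missing]) ** 2)
                X_hat[missing] = recon[missing]
                if delta < tol:
                    break
            return X_hat

        # Tiny example: one missing cell in a rank-1 matrix is recovered exactly.
        X = np.array([[1.0, 2.0, 3.0],
                      [2.0, np.nan, 6.0],
                      [3.0, 6.0, 9.0]])
        print(pca_impute(X, n_components=1))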

    Data exploration process based on the self-organizing map

    With the advances in computer technology, the amount of data that is obtained from various sources and stored in electronic media is growing at exponential rates. Data mining is a research area which answers the challenge of analysing this data in order to find the useful information contained therein. The Self-Organizing Map (SOM) is one of the methods used in data mining. It quantizes the training data into a representative set of prototype vectors and maps them onto a low-dimensional grid. The SOM is a prominent tool in the initial exploratory phase of data mining. The thesis consists of an introduction and ten publications. In the publications, the validity of SOM-based data exploration methods has been investigated and various enhancements to them have been proposed. In the introduction, these methods are presented as parts of the data mining process, and they are compared with other data exploration methods with similar aims. The work makes two primary contributions. Firstly, it has been shown that the SOM provides a versatile platform on top of which various data exploration methods can be efficiently constructed. New methods and measures for visualization of data, clustering, cluster characterization, and quantization have been proposed. The SOM algorithm and the proposed methods and measures have been implemented as a set of Matlab routines in the SOM Toolbox software library. Secondly, a framework for SOM-based data exploration of table-format data - both single tables and hierarchically organized tables - has been constructed. The framework divides exploratory data analysis into several sub-tasks, most notably the analysis of samples and the analysis of variables. The analysis methods are applied autonomously and their results are provided in a report describing the most important properties of the data manifold. In such a framework, the attention of the data miner can be directed more towards the actual data exploration task rather than towards the application of the analysis methods. Because of the highly iterative nature of data exploration, the automation of routine analysis tasks can reduce the time needed by the data exploration process considerably.
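    To illustrate the quantization and mapping step that the SOM performs, the following is a minimal online-training sketch in NumPy. It is a generic textbook-style SOM, not the SOM Toolbox implementation; the grid size, decay schedules and function name are illustrative assumptions.

        import numpy as np

        def train_som(data, grid=(5, 5), n_iter=1000, lr0=0.5, sigma0=2.0, seed=0):
            """Train a small Self-Organizing Map: each grid node holds a prototype
            vector that is pulled toward the presented samples, with a neighborhood
            kernel that shrinks over time."""
            rng = np.random.default_rng(seed)
            rows, cols = grid
            dim = data.shape[1]
            protos = rng.normal(size=(rows * cols, dim))
            # Fixed 2-D grid coordinates of the map nodes.
            coords = np.array([(r, c) for r in range(rows) for c in range(cols)], float)

            for t in range(n_iter):
                x = data[rng.integers(len(data))]
                frac = t / n_iter
                lr = lr0 * (1.0 - frac)                # learning rate decays linearly
                sigma = sigma0 * np.exp(-3.0 * frac)   # neighborhood radius shrinks
                bmu = np.argmin(((protos - x) ** 2).sum(axis=1))    # best-matching unit
                grid_dist2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
                h = np.exp(-grid_dist2 / (2 * sigma ** 2))          # neighborhood kernel
                protos += lr * h[:, None] * (x - protos)
            return protos.reshape(rows, cols, dim)

        # Example: organize 2-D points drawn from three blobs onto a 5x5 map.
        rng = np.random.default_rng(1)
        data = np.vstack([rng.normal(c, 0.1, size=(100, 2)) for c in ((0, 0), (1, 0), (0, 1))])
        prototypes = train_som(data, grid=(5, 5), n_iter=2000)
        print(prototypes.shape)   # (5, 5, 2)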

    Exploration of customer churn routes using machine learning probabilistic models

    The ongoing processes of globalization and deregulation are changing the competitive framework in the majority of economic sectors. The appearance of new competitors and technologies entails a sharp increase in competition and a growing preoccupation among service-providing companies with creating stronger bonds with customers. Many of these companies are shifting resources away from the goal of capturing new customers and are instead focusing on retaining existing ones. In this context, anticipating the customer's intention to abandon, a phenomenon also known as churn, and facilitating the launch of retention-focused actions represent clear elements of competitive advantage. Data mining, as applied to market survey information, can provide assistance to churn management processes. In this thesis, we mine real market data for churn analysis, placing a strong emphasis on the applicability and interpretability of the results. Statistical Machine Learning models for simultaneous data clustering and visualization lay the foundations for the analyses, which yield an interpretable segmentation of the surveyed markets. To achieve interpretability, much attention is paid to the intuitive visualization of the experimental results. Given that the modelling techniques under consideration are nonlinear in nature, this represents a non-trivial challenge. Newly developed techniques for data visualization in nonlinear latent models are presented. They are inspired by geographical representation methods and are suited to both static and dynamic data representation.

    Advanced and novel modeling techniques for simulation, optimization and monitoring chemical engineering tasks with refinery and petrochemical unit applications

    Engineers predict, optimize, and monitor processes to improve safety and profitability. Models automate these tasks and determine precise solutions. This research studies and applies advanced and novel modeling techniques to automate and aid engineering decision-making. Advancements in computational ability have improved modeling software's ability to mimic industrial problems. Simulations are increasingly used to explore new operating regimes and design new processes. In this work, we present a methodology for creating structured mathematical models, useful tips to simplify models, and a novel repair method that improves convergence by populating quality initial conditions for the simulation's solver. A crude oil refinery application is presented, including simulation, simplification tips, and the implementation of the repair strategy. A crude oil scheduling problem is also presented which can be integrated with production unit models. Recently, stochastic global optimization (SGO) has been shown to succeed in finding global optima for complex nonlinear processes. When performing SGO on simulations, model convergence can become an issue. The computational load can be decreased by 1) simplifying the model and 2) finding a synergy between the model solver repair strategy and the optimization routine by using the formulated initial conditions as points to perturb the neighborhood being searched. Here, a simplifying technique for merging the crude oil scheduling problem and the vertically integrated online refinery production optimization is demonstrated. To optimize refinery production, a stochastic global optimization technique is employed. Process monitoring has been vastly enhanced through the data-driven modeling technique Principal Component Analysis. As opposed to first-principles models, which make assumptions about the structure of the model describing the process, data-driven techniques make no assumptions about the underlying relationships. Data-driven techniques search for a projection that displays the data in a space that is easier to analyze. Feature extraction techniques, commonly dimensionality reduction techniques, have been explored extensively to better capture nonlinear relationships. These techniques can extend data-driven process monitoring to nonlinear processes. Here, we employ a novel nonlinear process-monitoring scheme which utilizes Self-Organizing Maps. The novel techniques and implementation methodology are applied to the publicly studied Tennessee Eastman Process and to an industrial polymerization unit.
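    As a concrete illustration of the data-driven monitoring idea, the following is a minimal sketch of classical PCA-based process monitoring using Hotelling's T^2 and the squared prediction error (SPE/Q) statistics. It is not the SOM-based scheme proposed in the thesis; the synthetic data, component count and function names are illustrative assumptions.

        import numpy as np

        def fit_pca_monitor(X_train, n_components=3):
            """Fit a PCA monitoring model on normal-operation data.

            Returns the scaling statistics, loadings P, and score variances needed
            to compute Hotelling's T^2 and the squared prediction error (SPE / Q)."""
            mu, sd = X_train.mean(axis=0), X_train.std(axis=0)
            Z = (X_train - mu) / sd
            U, s, Vt = np.linalg.svd(Z, full_matrices=False)
            P = Vt[:n_components].T                        # loadings (variables x k)
            var = (s[:n_components] ** 2) / (len(Z) - 1)   # variance of each score
            return mu, sd, P, var

        def monitor(x, mu, sd, P, var):
            """Return (T2, Q) for one new sample; large values flag abnormal operation."""
            z = (x - mu) / sd
            t = P.T @ z                        # scores in the principal subspace
            T2 = np.sum(t ** 2 / var)          # Hotelling's T^2
            residual = z - P @ t               # part of the sample outside the subspace
            Q = residual @ residual            # SPE / Q statistic
            return T2, Q

        # Example with synthetic "normal" data and one artificially shifted sample.
        rng = np.random.default_rng(0)
        X_train = rng.normal(size=(500, 6))
        model = fit_pca_monitor(X_train, n_components=3)
        print(monitor(X_train[0], *model))
        print(monitor(X_train[0] + 5.0, *model))   # shifted sample -> larger statistics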

    Machine Learning and Deep Learning applications for the protection of nuclear fusion devices

    This thesis addresses the use of artificial intelligence methods for the protection of nuclear fusion devices, with reference to the Joint European Torus (JET) Tokamak and the Wendelstein 7-X (W7-X) Stellarator. JET is currently the world's largest operational Tokamak and the only one operated with the Deuterium-Tritium fuel, while W7-X is the world's largest and most advanced Stellarator. For the work on JET, research focused on the prediction of "disruptions", i.e., sudden terminations of plasma confinement. For the development and testing of machine learning classifiers, a total of 198 disrupted discharges and 219 regularly terminated discharges from JET were used. Convolutional Neural Networks (CNNs) were proposed to extract the spatiotemporal characteristics from plasma temperature, density and radiation profiles. Since the CNN is a supervised algorithm, it is necessary to explicitly assign a label to the time windows of the dataset during training. All segments belonging to regularly terminated discharges were labelled as 'stable'. For each disrupted discharge, the 'unstable' label was assigned by automatically identifying the pre-disruption phase using an algorithm developed during the PhD. The CNN performance has been evaluated using disrupted and regularly terminated discharges from a decade of JET experimental campaigns, from 2011 to 2020, showing the robustness of the algorithm. Concerning W7-X, the research involved the real-time measurement of heat fluxes on plasma-facing components. THEODOR is a code currently used at W7-X for computing heat fluxes offline. However, for heat load control, fast heat flux estimation in real time is required. Part of the PhD work was dedicated to refactoring and optimizing the THEODOR code, with the aim of speeding up calculation times and making it compatible with real-time use. In addition, a Physics Informed Neural Network (PINN) model was proposed to bring heat flux computation to GPUs for real-time implementation.
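    The following is a minimal sketch of the kind of supervised windowed classification described above: a small 1-D CNN in PyTorch that labels a multichannel time window as 'stable' or 'unstable'. The architecture, channel counts and window length are illustrative assumptions, not the network used on JET data.

        import torch
        import torch.nn as nn

        class DisruptionCNN(nn.Module):
            """Toy 1-D CNN that classifies a window of multichannel profile signals
            as stable (class 0) or unstable (class 1)."""
            def __init__(self, n_channels=3, n_classes=2):
                super().__init__()
                self.features = nn.Sequential(
                    nn.Conv1d(n_channels, 16, kernel_size=5, padding=2), nn.ReLU(),
                    nn.MaxPool1d(2),
                    nn.Conv1d(16, 32, kernel_size=5, padding=2), nn.ReLU(),
                    nn.AdaptiveAvgPool1d(1),   # collapse the time axis to one value per channel
                )
                self.classifier = nn.Linear(32, n_classes)

            def forward(self, x):              # x: (batch, channels, time)
                return self.classifier(self.features(x).squeeze(-1))

        # One training step on random data, just to show the tensor shapes involved.
        model = DisruptionCNN()
        windows = torch.randn(8, 3, 128)       # 8 windows, 3 profile channels, 128 time samples
        labels = torch.randint(0, 2, (8,))     # per-window labels: 0 = stable, 1 = unstable
        loss = nn.CrossEntropyLoss()(model(windows), labels)
        loss.backward()
        print(float(loss))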

    Gaussian Based Visualization of Gaussian and Non-Gaussian Based Clustering

    A generic method is introduced to visualize, in a "Gaussian-like way" and onto R^2, the results of Gaussian or non-Gaussian based clustering. The key point is to explicitly force a visualization based on a spherical Gaussian mixture to inherit the within-cluster overlap that is present in the initial clustering mixture. The result is a particularly user-friendly drawing of the clusters, providing any practitioner with an overview of the potentially complex clustering result. An entropic measure provides information about the quality of the drawn overlap compared to the true one in the initial space. The proposed method is illustrated on four real data sets of different types (categorical, mixed, functional and network) and is implemented in the R package ClusVis.
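    For readers who want to experiment with the overlap idea, the following is a generic Python sketch (not the ClusVis algorithm, which is available in R): it fits a Gaussian mixture and summarizes within-cluster overlap by the mean entropy of the posterior membership probabilities, with 0 indicating perfectly separated clusters. The data set and component count are illustrative assumptions.

        import numpy as np
        from sklearn.datasets import load_iris
        from sklearn.mixture import GaussianMixture

        # Fit a Gaussian mixture and quantify cluster overlap via the mean entropy of
        # the posterior membership probabilities (0 nats = no overlap at all).
        X = load_iris().data
        gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
        posterior = gmm.predict_proba(X)                     # shape (n_samples, n_clusters)
        entropy = -np.sum(posterior * np.log(np.clip(posterior, 1e-12, None)), axis=1)
        print("mean membership entropy:", entropy.mean())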

    Influential Factors within MNCs: From an Extended Agency Perspective

    Purpose: The aim of this thesis is to give a further understanding of the relationship between subsidiaries and HQ by viewing it through the broader agency theory perspective. By providing empirical material from a case study, the purpose is to further enrich and complement agency theory as applied to the context of the subsidiary-HQ relationship. Methodology: This study is a qualitative case study of the relationship between subsidiary and headquarters with elements of both an inductive and a deductive approach. Semi-structured interviews with representatives from DNB's three ventures in Warsaw, Oslo and Stockholm were conducted. A theoretical framework was developed and revised on the basis of the empirical findings. Theoretical perspectives: In studying the relationship between subsidiary and HQ from an agency perspective, this thesis follows the narrow and broad perspectives of agency theory, applied to the subsidiary-HQ relationship within an MNC. Most importantly, it takes the literature a step further as it introduces an extended agency perspective on the relationship, consisting of three additional variables that have not previously been added to this context. Empirical foundation: The empirical data consists of 25 semi-structured interviews with employees from DNB HQ in Oslo, the branch DNB Stockholm and the subsidiary DNB Poland. Conclusions: The result of this case study is a revised theoretical framework which indicates that an extended perspective of agency theory is applicable when studying the relationship between subsidiary and HQ. Three additional factors were found to affect the relationship: trust, attention and path dependency. Traditional agency theory measures goal congruence; this thesis additionally argues that goal achievement and goal commitment should also be considered when studying the subsidiary-HQ relationship.

    Food price volatility: the role of stocks and trade

    After a period of relatively low international food price volatility since the 1970s, prices spiked in 2007/2008 and 2011. These international price changes were transmitted to domestic markets, where they generated extra volatility. This volatility adversely affects the welfare of consumers and producers, while price spikes are a major threat to national food security. This study examines drivers of grain price instability in developing countries and discusses the role of stocks and trade in stabilizing prices and consumption levels. Multiple determinants of food price volatility are identified in this work using a panel of more than 70 developing countries. The econometric approach chosen accounts for volatility clusters and potential endogeneity of the explanatory variables. The estimation shows a large spill-over of international price volatility into domestic food markets, in particular for importing countries, with a short-run elasticity between 0.26 and 0.44. In relative terms, stocks and regional trade integration contribute most to price stabilization. In numbers, an increase in the stock-to-use ratio or in the share of regional trade by one percentage point diminishes variability by 2.5 percent and 0.8 percent, respectively. Export restrictions, so-called insulation policies, significantly reduce volatility for non-importers, by about four percent when export quantities are 10 percentage points lower. In contrast, markets in countries that run extensive public price stabilization programs are not found to be associated with lower price instability. In Ghana, food prices of locally produced staples exhibit strong seasonality, with an intra-annual price spread of up to 60 percent, owing to limited storage. Primary data collected from wholesale traders reveal seasonal fluctuations in stock levels and suggest that traders hold a significant share of total stocks, especially towards the end of the marketing year. In addition, traders are found to follow distinct storage strategies: some traders only store to resell in bulk or carry working stocks to supply customers, while a group of traders speculates on seasonal price increases. Finally, based on a theoretical model for defining stocking norms, the costs and benefits of storage cooperation are assessed. The empirical application to West Africa reveals great potential for cooperation, arising from the imperfect correlation of production quantities among these countries. Accordingly, regional stocks held under cooperation in an emergency reserve can be up to 60 percent smaller than without cooperation. Limited intraregional trade reduces the need for stock releases significantly. Full trade integration would diminish regional consumption variability to 3.4 percent without storage, but is not effective in dampening severe supply shortfalls. Cooperation in a stabilization reserve has only a limited impact on consumption stability, and thus storage cooperation should be restricted to an emergency reserve.
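    To make the reported magnitudes concrete, the short calculation below applies the abstract's estimated effects to a hypothetical baseline; the baseline volatility level and the assumption that the relative reductions compound multiplicatively are illustrative, not part of the study.

        # Back-of-the-envelope use of the reported effect sizes (baseline is made up).
        baseline_volatility = 0.10          # assume 10 % domestic grain price volatility
        stock_effect_per_pp = -0.025        # -2.5 % volatility per +1 pp stock-to-use ratio
        trade_effect_per_pp = -0.008        # -0.8 % volatility per +1 pp regional trade share

        # A policy raising the stock-to-use ratio by 4 pp and regional trade by 5 pp,
        # assuming the relative reductions compound multiplicatively:
        factor = (1 + stock_effect_per_pp) ** 4 * (1 + trade_effect_per_pp) ** 5
        print(f"implied volatility: {baseline_volatility * factor:.2%}")   # about 8.7 %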