    Projection-Based Clustering through Self-Organization and Swarm Intelligence

    This work covers aspects of unsupervised machine learning used for knowledge discovery in data science and introduces a data-driven approach to cluster analysis, the Databionic swarm (DBS). DBS combines clustering of data with a 3D landscape visualization; the 3D landscape also enables 3D printing of high-dimensional data structures. The clustering result, the number of clusters, or the absence of any cluster structure can be verified at a glance from the 3D landscape. DBS is the first swarm-based technique that shows emergent properties while exploiting concepts of swarm intelligence, self-organization and the Nash equilibrium from game theory, which eliminates the need for a global objective function and for parameter tuning. The accompanying R package allows DBS to be applied to data from diverse research fields, even by non-specialists in data mining.
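
As a rough, hedged illustration of the general projection-based-clustering idea described above (a low-dimensional projection followed by clustering), not of the DBS swarm algorithm itself, the following Python sketch uses scikit-learn; the dataset, the projection method, and the cluster count are illustrative assumptions.

```python
# Hedged sketch of generic projection-based clustering (NOT the DBS algorithm):
# project high-dimensional data to 2D, then cluster in the projected space.
import numpy as np
from sklearn.datasets import load_wine            # stand-in dataset (assumption)
from sklearn.preprocessing import StandardScaler
from sklearn.manifold import TSNE                 # generic projection, not DBS's swarm-based one
from sklearn.cluster import AgglomerativeClustering

X = StandardScaler().fit_transform(load_wine().data)

# 2D projection of the high-dimensional data
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

# Cluster in the projected space; the number of clusters (3) is an assumption
labels = AgglomerativeClustering(n_clusters=3).fit_predict(emb)
print(np.bincount(labels))
```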

    Latent Space Classification of Seismic Facies

    Supervised and unsupervised seismic facies classification methods are slowly gaining popularity in hydrocarbon exploration and production workflows. Unsupervised clustering is data driven, unbiased by the interpreter beyond the choice of input data, and brings out the natural clusters present in the data. There are several competing unsupervised clustering techniques, each with advantages and disadvantages. In this dissertation, I demonstrate the use of various classification techniques on real 3D seismic data from various depositional environments. Initially, I use the popular unsupervised Kohonen self-organizing map (SOM) algorithm and apply it to a deep-water Gulf of Mexico 3D dataset to identify various deep-water depositional facies including basin floor fans, mass transport complexes and feeder channels. I then extend this algorithm to characterize a heterogeneous Mississippian chert reservoir from Oklahoma and map the locations of the tight/non-porous chert and limestone versus the more prospective porous tripolitic chert and fractured chert zones. The tight chert and dense limestone can be highly fractured, giving rise to an additional seismic facies. In both case studies, a large number of potential classes are fed into the SOM algorithm. These "prototype vectors" are clustered and colors are assigned to them using a 2D gradational RGB color scale as a visual aid in interpretation.
The Kohonen SOM suffers from the absence of a proper convergence criterion and of rules for parameter selection. These shortcomings are addressed by the more recent generative topographic mapping (GTM) algorithm. GTM is a probabilistic unsupervised classification technique that "generates" a probability density function to map the data about a lower-dimensional "topographic" surface residing in the high-dimensional attribute space. GTM predicts not only which cluster best represents the data, but also how well the data are represented by all other clusters. For this reason, GTM interfaces neatly with modern risk analysis workflows. I apply the GTM technique to classify 15 sets of horizontal well parameters in one of the recent unconventional shale plays, correlating the results with normalized estimated ultimate recovery (EUR) and allowing an estimation of EUR based on the most relevant parameters.
I extend the GTM workflow to consider multi-attribute inversion volumes and perform seismic facies classification for a Barnett Shale survey. With the aid of microseismic data, the clusters from the GTM analysis are interpreted as brittle or ductile. I also apply the GTM technique to the P-impedance, lambda-rho, mu-rho and VP/VS volumes from a Veracruz Basin survey in southern Mexico acquired over a heterogeneous conglomerate reservoir.
Finally, I introduce limited supervision into both the SOM and GTM algorithms. The target vectors for both SOM and GTM are the average attribute vectors about the different facies identified from the well logs; this supervision introduces user-defined clusters. In the preliminary supervision, I use multiattribute minimum Euclidean distance measures, comparing the results with the unsupervised SOM results. For GTM, I calculate the probability of occurrence of the well facies across the survey.
Given the appropriate 3D seismic attribute volumes, SOM and GTM workflows will not only accelerate seismic facies identification but, with GTM, also quantify the identification of different petrotypes or heterogeneities present in the reservoir zone. The final product of my dissertation is a suite of algorithms, workflows, user interfaces and user documentation allowing others to build upon and extend this research.
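
As a hedged illustration of the unsupervised SOM step described above (clustering multi-attribute seismic samples into prototype vectors on a 2D map), the following Python sketch uses the third-party MiniSom library; the attribute matrix, map size and training length are assumptions, and this is not the dissertation's implementation.

```python
# Hedged sketch: Kohonen SOM clustering of multi-attribute seismic samples.
# Assumptions: `attributes` is an (n_samples, n_attributes) array of seismic
# attributes extracted along a horizon; map size and iterations are arbitrary.
import numpy as np
from minisom import MiniSom  # pip install minisom (third-party SOM library)

rng = np.random.default_rng(0)
attributes = rng.normal(size=(5000, 6))          # placeholder attribute vectors

som = MiniSom(x=16, y=16, input_len=attributes.shape[1],
              sigma=2.0, learning_rate=0.5, random_seed=0)
som.random_weights_init(attributes)
som.train_random(attributes, num_iteration=10000)

# Each sample is assigned to its best-matching unit (a "prototype vector");
# the (row, col) position on the 2D map can then be mapped to a 2D color scale.
bmus = np.array([som.winner(v) for v in attributes])
print(bmus[:5])
```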

    Projection-Based Clustering through Self-Organization and Swarm Intelligence: Combining Cluster Analysis with the Visualization of High-Dimensional Data

    Cluster Analysis; Dimensionality Reduction; Swarm Intelligence; Visualization; Unsupervised Machine Learning; Data Science; Knowledge Discovery; 3D Printing; Self-Organization; Emergence; Game Theory; Advanced Analytics; High-Dimensional Data; Multivariate Data; Analysis of Structured Data

    Manifold Learning Approaches to Compressing Latent Spaces of Unsupervised Feature Hierarchies

    Field robots encounter dynamic unstructured environments containing a vast array of unique objects. In order to make sense of the world in which they are placed, they collect large quantities of unlabelled data with a variety of sensors. Producing robust and reliable applications depends entirely on the ability of the robot to understand the unlabelled data it obtains. Deep Learning techniques have had a high level of success in learning powerful unsupervised representations for a variety of discriminative and generative models. Applying these techniques to problems encountered in field robotics remains a challenging endeavour. Modern Deep Learning methods are typically trained with a substantial labelled dataset, while datasets produced in a field robotics context contain limited labelled training data. The primary motivation for this thesis stems from the problem of applying large-scale Deep Learning models to field robotics datasets that are label poor. While the lack of labelled ground truth data drives the desire for unsupervised methods, the need for improved model scaling is driven by two factors: performance and computational requirements. When utilising unsupervised layer outputs as representations for classification, classification performance increases with layer size. Scaling up models with multiple large layers of features is problematic, as the size of each subsequent hidden layer scales with the size of the previous layer. This quadratic scaling, and the associated time required to train such networks, have prevented adoption of large Deep Learning models beyond cluster computing. The contributions in this thesis are developed from the observation that parameters or filter elements learnt in Deep Learning systems are typically highly structured and contain related elements. Firstly, the structure of unsupervised filters is utilised to construct a mapping from the high-dimensional filter space to a low-dimensional manifold. This creates a significantly smaller representation for subsequent feature learning. This mapping, and its effect on the resulting encodings, highlights the need for the ability to learn highly overcomplete sets of convolutional features. Driven by this need, the unsupervised pretraining of Deep Convolutional Networks is developed to include a number of modern training and regularisation methods. These pretrained models are then used to provide initialisations for supervised convolutional models trained on low quantities of labelled data. By utilising pretraining, a significant increase in classification performance on a number of publicly available datasets is achieved. In order to apply these techniques to outdoor 3D Laser Illuminated Detection And Ranging (LIDAR) data, we develop a set of resampling techniques to provide uniform input to Deep Learning models. The features learnt in these systems outperform the high-effort hand-engineered features developed specifically for 3D data. The representation of a given signal is then reinterpreted as a combination of modes that exist on the learnt low-dimensional filter manifold. From this, we develop an encoding technique that allows the high-dimensional layer output to be represented as a combination of low-dimensional components. This allows the growth of subsequent layers to depend only on the intrinsic dimensionality of the filter manifold and not on the number of elements contained in the previous layer.
Finally, the resulting unsupervised convolutional model, the encoding frameworks and the embedding methodology are used to produce a new unsupervised learning strategy that is able to encode images in terms of overcomplete filter spaces without producing an explosion in the size of the intermediate parameter spaces. This model produces classification results on par with state-of-the-art models, yet requires significantly fewer computational resources and is suitable for use in the constrained computation environment of a field robot.
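
As a hedged sketch of the filter-manifold idea described above (mapping learnt high-dimensional filters onto a low-dimensional manifold and re-expressing them as combinations of a few components), the following Python example uses PCA as a simple stand-in for the manifold learning step; the filter shapes and number of components are assumptions, not the thesis's actual method.

```python
# Hedged sketch: embed learnt convolutional filters on a low-dimensional
# manifold (here approximated by PCA) and reconstruct them from few modes.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
filters = rng.normal(size=(256, 7 * 7))   # 256 learnt 7x7 filters, flattened (assumption)

# Map the high-dimensional filter space to a low-dimensional manifold
n_modes = 10                              # assumed intrinsic dimensionality
pca = PCA(n_components=n_modes).fit(filters)
coords = pca.transform(filters)           # each filter as a point on the manifold

# A filter (and hence a layer's output) can be re-expressed as a combination
# of n_modes components instead of 49 raw weights.
reconstructed = pca.inverse_transform(coords)
print("mean reconstruction error:", np.mean((filters - reconstructed) ** 2))
```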

    Manifold learning techniques and statistical approaches applied to the disruption prediction in tokamaks

    Nuclear fusion stands out as the only clean energy source capable of meeting the energy needs of the entire world in the future. At present, several experimental fusion devices are operating to optimize the fusion process, confining the plasma by means of magnetic fields. Magnetic confinement of the plasma can be achieved with linear cylindrical configurations or toroidal configurations, e.g., the stellarator, the reversed field pinch, or the tokamak. Among the explored magnetic confinement techniques, the tokamak configuration is to date considered the most reliable. Unfortunately, the tokamak is vulnerable to instabilities that, in the most severe cases, can lead to loss of magnetic confinement; this phenomenon is called a disruption. Disruptions are dangerous and irreversible events for the device during which the plasma energy is suddenly released onto the first-wall components and vacuum vessel, causing runaway electrons, large mechanical forces and intense thermal loads, which may severely damage the vessel wall and the plasma-facing components. Present devices are designed to withstand disruptive events; for this reason, disruptions are today generally tolerable, and one of the aims of these devices is the investigation of disruptive boundaries in the operational space. However, future devices such as ITER, which must operate at high density and high plasma current, will tolerate only a limited number of disruptions. For these reasons, disruptions in tokamaks must be avoided and, when a disruption is unavoidable, minimizing its severity is mandatory. Therefore, finding appropriate mitigating actions to reduce the damage to the reactor components is accepted as a fundamental objective in the fusion community. The physical phenomena that lead the plasma to disrupt are non-linear and very complex. The present understanding of disruption physics does not yet provide an analytical model describing the onset of these instabilities, and the main effort has been devoted to developing data-based methods. In the present thesis the development of a reliable disruption prediction system has been investigated using several data-based approaches, starting from the strengths and drawbacks of the methods proposed in the literature. The literature reports numerous studies on disruption prediction using data-based models, such as neural networks. Even though the results are encouraging, they are not sufficient to explain the intrinsic structure of the data used to describe the complex behavior of the plasma. Recent studies have demonstrated the urgency of developing sophisticated control schemes that allow exploring the operating limits of the tokamak in order to increase reactor performance. For this reason, one goal of the present thesis is to identify and develop tools for the visualization and analysis of multidimensional data from the numerous plasma diagnostics available in the machine database. Identifying the boundaries of the disruption-free plasma parameter space would increase the knowledge of disruptions. A viable approach to understanding disruptive events consists of identifying the intrinsic structure of the data used to describe the plasma operational space; manifold learning algorithms attempt to identify these structures in order to find a low-dimensional representation of the data. Data for this thesis come from ASDEX Upgrade (AUG).
ASDEX Upgrade is a medium-size tokamak experiment located at the Max-Planck-Institut für Plasmaphysik (IPP), Garching bei München, Germany; at present it is the largest tokamak in Germany. Among the available methods, attention has been mainly devoted to data clustering techniques. Data clustering consists of grouping a set of data in such a way that data in the same group (cluster) are more similar to each other than to those in other groups. Due to its inherent predisposition for visualization, the most popular and widely used clustering technique, the Self-Organizing Map (SOM), was investigated first. The SOM extracts information from the multidimensional operational space of AUG using 7 plasma parameters coming from successfully terminated (safe) and disruption-terminated (disrupted) pulses. Data to train and test the SOM were extracted from AUG experiments performed between July 2002 and November 2009. The SOM made it possible to display the AUG operational space and to identify regions with high risk of disruption (disruptive regions) and those with low risk of disruption (safe regions). In addition to visualizing the space, the SOM can also be used to monitor the time evolution of the discharges during an experiment. Thus, the SOM has been used as a disruption predictor by introducing a suitable criterion based on the trend of the trajectories on the map throughout the different regions. When a plasma configuration with a high risk of disruption is recognized, a disruption alarm is triggered, allowing disruption avoidance or mitigation actions to be performed. Data-based models, such as the SOM, are affected by the so-called "ageing effect", i.e., the degradation of the predictor performance over time. It is due to the fact that, during the operation of the predictor, new data may come from experiments different from those used for the training. In order to reduce this effect, a retraining of the predictor has been proposed. The retraining consists of a new training procedure performed after adding to the training set the new plasma configurations coming from more recent experimental campaigns; this supplies novel information to the model and increases its prediction performance. Another drawback of the SOM, common to all the data-based models proposed in the literature, is the need for a dedicated set of experiments terminated with a disruption to implement the predictive model. Indeed, future fusion devices, like ITER, will tolerate only a limited number of disruptive events, and hence a disruption database will not be available. In order to overcome this shortcoming, a disruption prediction system for AUG built using only input signals from safe pulses has been implemented. The predictor is based on a Fault Detection and Isolation (FDI) approach. FDI is an important and active research field which allows a system to be monitored and a fault to be detected when it happens. The majority of model-based FDI procedures are based on a statistical analysis of residuals. Given an empirical model identified on a reference dataset obtained under Normal Operating Conditions (NOC), the discrepancies between the new observations and those estimated by the NOC model (the residuals) are calculated. The residuals are considered as a random process with known statistical properties; if a fault happens, a change in these properties is detected.
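
As a hedged sketch of the residual-based FDI idea described above (an NOC model trained only on safe pulses, with an alarm raised when residuals exceed a threshold), the following Python example uses a PCA reconstruction residual as a simple stand-in for the NOC model; the signals, the threshold choice and the PCA model itself are illustrative assumptions, not the thesis's implementation.

```python
# Hedged sketch: Normal-Operating-Conditions (NOC) model from safe pulses only;
# a disruption alarm is raised when the residual exceeds a threshold.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
safe = rng.normal(size=(2000, 3))          # 3 plasma parameters from safe pulses (assumption)

scaler = StandardScaler().fit(safe)
noc = PCA(n_components=2).fit(scaler.transform(safe))   # simple stand-in NOC model

def residual(samples):
    """Squared reconstruction error of samples with respect to the NOC model."""
    z = scaler.transform(samples)
    rec = noc.inverse_transform(noc.transform(z))
    return np.sum((z - rec) ** 2, axis=1)

# Threshold set from the residual distribution of the safe training data
threshold = np.percentile(residual(safe), 99)

new_pulse = rng.normal(loc=2.5, size=(50, 3))           # incoming discharge samples (toy)
alarm = residual(new_pulse) > threshold                 # True -> disruption alarm
print("first alarm at sample:", int(np.argmax(alarm)) if alarm.any() else None)
```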
In this thesis, the safe pulses are assumed to represent the normal operating conditions of the process and the disruptions are assumed to be the fault status. Thus, only safe pulses are used to train the NOC model. In order to have a graphical representation of the trajectory of the pulses, only three plasma parameters have been used to build the NOC model. By monitoring the time evolution of the residuals and introducing an alarm criterion based on a suitable threshold on the residual values, the NOC model properly identifies an incoming disruption. Data for the training and testing of the NOC model were extracted from AUG experiments executed between July 2002 and November 2009. The assessment of a specific disruptive phase for each disrupted discharge is a relevant issue in understanding disruptive events. Up to now, disruption precursors at AUG have been assumed to appear within a prefixed time window, the last 45 ms, for all disrupted discharges. The choice of such a fixed temporal window can limit the prediction performance, since it generates ambiguous information for disruptions whose disruptive phase differs from 45 ms. In this thesis, the Mahalanobis distance is applied to define a specific disruptive phase for each disruption. In particular, a different length of the disruptive phase has been selected for each disrupted pulse in the training set by labeling each sample as safe or disruptive depending on its Mahalanobis distance from the set of safe discharges. Then, with this new training set, the operational space of AUG has been mapped using Generative Topographic Mapping (GTM). The GTM is inspired by the SOM algorithm, with the aim of overcoming its limitations. The GTM has been investigated in order to identify regions with high and low risk of disruption; for comparison purposes, a second SOM has been built, and GTM and SOM have both been tested as disruption predictors. Data for the training and testing of the SOM and GTM were extracted from AUG experiments executed from May 2007 to November 2012. The last method studied and applied in this thesis is the logistic regression model (Logit). Logistic regression is a well-known statistical method for analyzing problems with dichotomous dependent variables. In this study the Logit models the probability that a generic sample belongs to the non-disruptive or the disruptive phase. The time evolution of the Logit model output (LMO) has been used as a disruption proximity index by introducing a suitable threshold. Data for the training and testing of the Logit models were extracted from AUG experiments executed from May 2007 to November 2012, with disruptive samples selected through the Mahalanobis distance criterion. Finally, in order to interpret the behavior of the data-based predictors, a manual classification of disruptions has been performed for experiments carried out from May 2007 to November 2012. The manual classification has been performed by means of a visual analysis of several plasma parameters for each disruption. Moreover, the specific chains of events have been detected and used to classify disruptions and, when possible, the same classes introduced for JET are adopted.
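
As a hedged sketch of the last two steps described above (labeling samples as disruptive via their Mahalanobis distance from the safe set, then using a logistic regression output as a disruption proximity index), the following Python example is purely illustrative; the data, the distance threshold and the alarm threshold are assumptions.

```python
# Hedged sketch: Mahalanobis-distance labeling followed by a Logit proximity index.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
safe = rng.normal(size=(2000, 3))                        # safe-pulse samples (assumption)
t = np.linspace(0.0, 1.0, 300)[:, None]                  # one toy disrupted pulse that
pulse = t * np.array([4.0, 0.0, 0.0]) + rng.normal(scale=0.3, size=(300, 3))  # drifts away

# Mahalanobis distance of each sample from the distribution of safe samples
mu, cov_inv = safe.mean(axis=0), np.linalg.inv(np.cov(safe, rowvar=False))
d = np.sqrt(np.einsum("ij,jk,ik->i", pulse - mu, cov_inv, pulse - mu))

labels = (d > 3.0).astype(int)            # 1 = disruptive phase; threshold is an assumption

# Logit model: probability that a sample belongs to the disruptive phase
logit = LogisticRegression().fit(pulse, labels)
proximity = logit.predict_proba(pulse)[:, 1]             # disruption proximity index
alarm = proximity > 0.5                                  # alarm threshold is an assumption
print("first alarm at sample:", int(np.argmax(alarm)) if alarm.any() else None)
```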

    2D Dimensionality Reduction Methods without Loss

    In this paper, several two-dimensional extensions of principal component analysis (PCA) and linear discriminant analysis (LDA) have been applied in a lossless dimensionality reduction framework for a face recognition application. In this framework, the benefits of dimensionality reduction were used to improve the performance of the predictive model, a support vector machine (SVM) classifier, while the loss of useful information was minimized using the projection penalty idea. Well-known face databases were used to train and evaluate the proposed methods. The experimental results indicated that the proposed methods generally achieved higher average classification accuracy than classification based on Euclidean distance, and also than methods which first extract features using dimensionality reduction techniques and then use an SVM classifier as the predictive model.
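
As a hedged sketch of the baseline pipeline mentioned above (dimensionality reduction followed by an SVM classifier), the following Python example uses standard scikit-learn PCA and SVC on a bundled face dataset; it does not implement the paper's 2D extensions or the projection penalty, and the dataset, component count and SVM settings are assumptions.

```python
# Hedged sketch: PCA feature extraction followed by an SVM classifier,
# i.e., the baseline pipeline the paper compares against (not the proposed method).
from sklearn.datasets import fetch_lfw_people
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

faces = fetch_lfw_people(min_faces_per_person=50)        # downloads data on first use
X_train, X_test, y_train, y_test = train_test_split(
    faces.data, faces.target, test_size=0.25, random_state=0)

clf = make_pipeline(PCA(n_components=100, whiten=True, random_state=0),
                    SVC(kernel="rbf", C=10.0, gamma="scale"))
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```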

    Exploratory Cluster Analysis from Ubiquitous Data Streams using Self-Organizing Maps

    This thesis addresses the use of Self-Organizing Maps (SOM) for exploratory cluster analysis over ubiquitous data streams, where two complementary problems arise: first, generating (local) SOM models over potentially unbounded, multi-dimensional, non-stationary data streams; second, extending these capabilities to ubiquitous environments. To address this, original contributions are made in terms of algorithms and methodologies. Two different methods are proposed for the first problem. By focusing on visual knowledge discovery, these methods fill an existing gap in the panorama of current methods for cluster analysis over data streams. Moreover, the original SOM capabilities of clustering both observations and features are transposed to data streams, making these contributions versatile compared to existing methods, which target a single clustering problem. For the second problem, additional methodologies that tackle the ubiquitous aspect of data streams are proposed, allowing distributed and collaborative learning strategies. Experimental evaluations attest to the effectiveness of the proposed methods, and real-world applications are exemplified, namely regarding electricity consumption data, air quality monitoring networks and financial data, motivating their practical use. This research is the first to clearly address the use of the SOM for ubiquitous data streams and opens several other research opportunities for the future.
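
As a hedged sketch of the stream setting described above (updating a SOM incrementally as chunks of a potentially unbounded stream arrive), the following Python example again uses the MiniSom library and trains on successive mini-batches; the stream source, chunk size and map size are assumptions, and the thesis's actual stream algorithms are not reproduced.

```python
# Hedged sketch: incrementally updating a SOM over a (simulated) data stream,
# one chunk at a time, so the model never needs the full stream in memory.
import numpy as np
from minisom import MiniSom  # pip install minisom

som = MiniSom(x=10, y=10, input_len=4, sigma=1.5, learning_rate=0.5, random_seed=0)

def stream(n_chunks=100, chunk_size=256, rng=np.random.default_rng(0)):
    """Simulated non-stationary stream: the distribution drifts over time."""
    for t in range(n_chunks):
        drift = 0.01 * t
        yield rng.normal(loc=drift, size=(chunk_size, 4))

for chunk in stream():
    som.train(chunk, num_iteration=len(chunk))   # update the map with this chunk only

# The current map can be inspected at any time for exploratory cluster analysis
print(som.distance_map().shape)                  # U-matrix of the current SOM
```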

    Machine Learning in Aerodynamic Shape Optimization

    Machine learning (ML) has been increasingly used to aid aerodynamic shape optimization (ASO), thanks to the availability of aerodynamic data and continued developments in deep learning. We review the applications of ML in ASO to date and provide a perspective on the state of the art and future directions. We first introduce conventional ASO and its current challenges. Next, we introduce ML fundamentals and detail the ML algorithms that have been successful in ASO. Then, we review ML applications to ASO addressing three aspects: compact geometric design space, fast aerodynamic analysis, and efficient optimization architecture. In addition to providing a comprehensive summary of the research, we comment on the practicality and effectiveness of the developed methods. We show how cutting-edge ML approaches can benefit ASO and address challenging demands, such as interactive design optimization. Practical large-scale design optimizations remain a challenge because of the high cost of ML training. Further research on coupling ML model construction with prior experience and knowledge, such as physics-informed ML, is recommended to solve large-scale ASO problems.
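
As a hedged illustration of the "fast aerodynamic analysis" aspect mentioned above, where an ML surrogate replaces expensive flow solves during optimization, the following Python example fits a Gaussian process surrogate to a toy drag function and queries it cheaply; the analytic drag function, design variables and kernel are assumptions, not taken from the review.

```python
# Hedged sketch: a Gaussian-process surrogate standing in for an expensive
# aerodynamic analysis (e.g., a CFD drag evaluation) during shape optimization.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def expensive_drag(x):
    """Toy stand-in for a CFD drag coefficient over 2 design variables."""
    return 0.02 + 0.5 * (x[:, 0] - 0.3) ** 2 + 0.8 * (x[:, 1] + 0.1) ** 2

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(40, 2))          # sampled shape parameters
y_train = expensive_drag(X_train)                   # "expensive" evaluations

surrogate = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(), normalize_y=True)
surrogate.fit(X_train, y_train)

# Cheap surrogate queries (with uncertainty) can now drive the optimizer
X_query = rng.uniform(-1, 1, size=(5, 2))
mean, std = surrogate.predict(X_query, return_std=True)
print(np.round(mean, 4), np.round(std, 4))
```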