8 research outputs found

    MML Probabilistic Principal Component Analysis

    Principal component analysis (PCA) is perhaps the most widely used method for data dimensionality reduction. A key question in a PCA decomposition of data is deciding how many factors to retain. This manuscript describes a new approach to automatically selecting the number of principal components, based on the Bayesian minimum message length (MML) method of inductive inference. We also derive a new estimate of the isotropic residual variance and demonstrate, via numerical experiments, that it improves on the usual maximum likelihood approach.
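
    A minimal sketch of the general idea, assuming only what the abstract states: each candidate number of components is scored under a probabilistic PCA model and the best-scoring one is kept. The paper's actual MML criterion and its residual-variance estimator are not reproduced here; the penalised log-likelihood below is an illustrative stand-in.

```python
# Sketch: pick the number of principal components by scoring each candidate k
# under a probabilistic PCA model. The penalty term is a placeholder, not the
# paper's MML message length.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
n, d, true_k = 500, 10, 3
W = rng.normal(size=(d, true_k))
X = rng.normal(size=(n, true_k)) @ W.T + 0.1 * rng.normal(size=(n, d))

def penalised_score(X, k):
    """Average probabilistic-PCA log-likelihood minus a simple complexity
    penalty (a stand-in for an MML-style message-length criterion)."""
    pca = PCA(n_components=k).fit(X)
    n_params = d * k - k * (k - 1) / 2 + 1   # loadings + isotropic noise variance
    return pca.score(X) - n_params * np.log(n) / (2 * n)

best_k = max(range(1, d), key=lambda k: penalised_score(X, k))
print("selected number of components:", best_k)
```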

    Managing uncertainty in integrated environmental modelling: the UncertWeb framework

    Web-based distributed modelling architectures are gaining increasing recognition as potentially useful tools for building holistic environmental models, combining individual components in complex workflows. However, existing web-based modelling frameworks currently offer no support for managing uncertainty. On the other hand, the rich array of modelling frameworks and simulation tools which support uncertainty propagation in complex and chained models typically lack the benefits of web-based solutions such as ready publication, discoverability and easy access. In this article we describe the developments within the UncertWeb project which are designed to provide uncertainty support in the context of the proposed ‘Model Web’. We give an overview of uncertainty in modelling, review uncertainty management in existing modelling frameworks and consider the semantic and interoperability issues raised by integrated modelling. We describe the scope and architecture required to support uncertainty management as developed in UncertWeb. This includes tools which support elicitation, aggregation/disaggregation, visualisation and uncertainty/sensitivity analysis. We conclude by highlighting areas that require further research and development in UncertWeb, such as model calibration and inference within complex environmental models.
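
    The core pattern such a framework exposes is uncertainty propagation through a chain of models. A minimal sketch of that idea only, assuming nothing about UncertWeb's actual services or encodings: the two component models below are hypothetical stand-ins, and the propagation is plain Monte Carlo sampling.

```python
# Sketch: Monte Carlo propagation of input uncertainty through a chained
# model workflow. The component models are hypothetical stand-ins, not
# UncertWeb services or APIs.
import numpy as np

rng = np.random.default_rng(42)

def rainfall_runoff(rain_mm):
    """Hypothetical upstream model: rainfall (mm) -> runoff (m^3/s)."""
    return 0.8 * rain_mm ** 1.1

def flood_depth(runoff):
    """Hypothetical downstream model: runoff (m^3/s) -> flood depth (m)."""
    return 0.05 * np.sqrt(runoff)

# The uncertain input is described by a distribution rather than a point value.
rain_samples = rng.lognormal(mean=np.log(50.0), sigma=0.2, size=10_000)

# Chain the models over every sample to carry the uncertainty end to end.
depth_samples = flood_depth(rainfall_runoff(rain_samples))

print("mean flood depth (m):", depth_samples.mean().round(3))
print("95% interval (m):", np.percentile(depth_samples, [2.5, 97.5]).round(3))
```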

    Outside The Machine Learning Blackbox: Supporting Analysts Before And After The Learning Algorithm

    Applying machine learning to real problems is non-trivial because many important steps are needed to prepare for learning and to interpret the results after learning. This dissertation investigates four problems that arise before and after applying learning algorithms. First, how can we verify that a dataset contains "good" information? I propose cross-data validation for quantifying the quality of a dataset relative to a benchmark dataset and define a data efficiency ratio that measures how efficiently the dataset in question collects information (relative to the benchmark). Using these methods I demonstrate the quality of bird observations collected by the eBird citizen science project, which has few quality controls. Second, can off-the-shelf algorithms learn a model with good task-specific performance, or must the user have expertise both in the domain and in machine learning? In many applications, standard performance metrics are inappropriate, and most analysts lack the expertise or time to customize algorithms to optimize task-specific metrics. Ensemble selection offers a potential solution: build an ensemble to optimize the desired metric. I evaluate ensemble selection's ability to optimize for domain-specific metrics on natural language processing tasks and show that ensemble selection usually improves performance but sometimes overfits. Third, how can we understand complex models? Understanding a model is often as important as its accuracy. I propose and evaluate statistics for measuring the importance of inputs used by a decision tree ensemble. The statistics agree with sensitivity analysis and, in an application to bird distribution models, are 500 times faster to compute. The statistics have been used to study hundreds of bird distribution models. Fourth, how should data be pre-processed when learning a high-performing ensemble? I examine the behavior of variable selection and bagging using a bias-variance analysis of error. The results show that the most accurate variable subset corresponds to the best bias-variance trade-off point. Often, this is not the point separating relevant from irrelevant inputs. Variable selection should be viewed as a variance reduction method and is thus often redundant for low-variance methods like bagging. The best bagged model performance is usually obtained using all available inputs.
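
    The ensemble selection step evaluated here follows a well-known greedy recipe: repeatedly add the library model whose inclusion most improves the target metric on a hillclimbing set. Below is a minimal sketch under toy assumptions (synthetic data, a three-model library, F1 as the metric), not the dissertation's actual setup.

```python
# Sketch: greedy ensemble selection -- repeatedly add the library model whose
# inclusion most improves a task-specific metric on a hillclimb set.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import f1_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_hill, y_tr, y_hill = train_test_split(X, y, test_size=0.3, random_state=0)

# Library of candidate models, each fitted once on the training split.
library = [m.fit(X_tr, y_tr) for m in
           (LogisticRegression(max_iter=1000),
            DecisionTreeClassifier(max_depth=5),
            GaussianNB())]
probs = [m.predict_proba(X_hill)[:, 1] for m in library]

ensemble, best_scores = [], []
for _ in range(10):                                    # models may be re-selected
    scores = []
    for p in probs:
        avg = np.mean(ensemble + [p], axis=0)          # averaged probabilities
        scores.append(f1_score(y_hill, (avg > 0.5).astype(int)))
    best = int(np.argmax(scores))
    ensemble.append(probs[best])
    best_scores.append(scores[best])

print("hillclimb F1 after each addition:", np.round(best_scores, 3))
```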

    Understanding of the variability of phytoplankton ecosystem function properties: a synergistic use of remote sensing and in situ data

    The majority of the earth's surface (~71%) is covered by the aquatic environment, of which 97% is the oceanic regime. Almost every part of the aquatic regime is dominated by microscopic plants called phytoplankton. Being at the bottom of the food chain, these ecological drivers influence the earth's climate system as well as the biodiversity trends of other organisms such as zooplankton, fish, sea birds and marine mammals. The aim of this research was to understand the ecology of phytoplankton and assess which environmental, physical, biological, and spatiotemporal factors influence their distribution and abundance. Using this information, a knowledge-based expert system was developed to discriminate phytoplankton functional types. The ecological knowledge was derived from the Continuous Plankton Recorder (CPR) survey, whereas information regarding the physical regime was acquired from satellite remote sensing. The data matrix was analysed using Generalised Additive Models (GAMs) and Artificial Neural Networks (ANNs). The significant relationships developed through the synergistic use of the CPR measure of phytoplankton biomass and satellite chlorophyll-a (Chl-a) allowed the production of a >50-year Chl-a dataset in the Northeast Atlantic and North Sea. It was found that the documented mid-80s regime shift corresponded to a 60% increase in Chl-a since 1948, the result of an 80% increase in Chl-a during winter alongside a smaller summer increase. GAMs indicated that the combined effects of high solar radiation, shallow mixed layer depth and increased temperatures explained more than 89% of the coccolithophore variation. The June 1998 bloom, which was associated with high light intensity, unusually high sea-surface temperature (SST) and a very shallow mixed layer, was found to be one of the most extensive (~1 million km²) blooms ever recorded. There was a pronounced SST shift in the mid-1990s with a peak in 1998, suggesting that exceptionally large blooms are caused by pronounced environmental conditions and that the variability of the physical environment strongly affects the spatial extent of these blooms. Diatom abundance in the epipelagic zone of the Northern North Atlantic was mainly driven by SST. The ANNs indicated that higher SSTs could lead to a rapid decrease in diatom abundance; increased SST can stratify the water column for longer, preventing nutrients from becoming available. Therefore, further increases may be devastating to diatoms but may benefit smaller plankton such as coccolithophores and/or dinoflagellates. Finally, the knowledge gained through the developed methodological approaches was used to identify/discriminate phytoplankton functional groups (diatoms, dinoflagellates, coccolithophores and silicoflagellates) with an accuracy of greater than 70%. The most important information for phytoplankton functional group discrimination was spatiotemporal information; of the physical variables, SST was the most important. Future research aimed at the identification of functional groups from remotely sensed data should include fundamental information on the physical environment as well as spatiotemporal information, rather than relying on bio-optical measurements alone. Further development, potential applications and future research are discussed. Sir Alister Hardy Foundation for Ocean Science
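
    As an illustration of the ANN part of this workflow, the sketch below trains a small neural network to discriminate functional groups from SST plus spatiotemporal predictors. Everything here is a synthetic placeholder: the data are not CPR or satellite observations, and the three classes only loosely stand in for the functional groups discussed above.

```python
# Sketch: classifying phytoplankton functional groups from environmental and
# spatiotemporal predictors with a small neural network. All data are
# synthetic placeholders.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
n = 3000
# Hypothetical predictors: SST (deg C), day of year, latitude, longitude.
X = np.column_stack([
    rng.uniform(5, 20, n),        # SST
    rng.uniform(1, 365, n),       # day of year
    rng.uniform(45, 65, n),       # latitude
    rng.uniform(-30, 10, n),      # longitude
])
# Synthetic class labels loosely tied to SST and season, so the classifier
# has a signal to learn; they are not real functional-group assignments.
y = (X[:, 0] > 12).astype(int) + (X[:, 1] > 180).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
model = make_pipeline(StandardScaler(),
                      MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                                    random_state=1))
model.fit(X_tr, y_tr)
print("held-out accuracy:", round(model.score(X_te, y_te), 3))
```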

    Isotope geochemistry and petrology of Dalradian metacarbonate rocks
