8 research outputs found
MML Probabilistic Principal Component Analysis
Principal component analysis (PCA) is perhaps the most widely used method for
data dimensionality reduction. A key question in any PCA decomposition of data
is deciding how many factors to retain. This manuscript describes a new
approach to automatically selecting the number of principal components, based
on the Bayesian minimum message length (MML) method of inductive inference. We
also derive a new estimate of the isotropic residual variance and demonstrate,
via numerical experiments, that it improves on the usual maximum likelihood
approach.
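The kind of automatic component selection the abstract describes can be illustrated with a simpler penalized-likelihood analogue. The sketch below scores each candidate rank k by the maximised probabilistic-PCA log-likelihood (Tipping & Bishop) minus a BIC penalty; this is an assumption-laden stand-in for intuition only, not the paper's MML criterion, and the parameter count `m` is the standard PPCA one.

```python
import numpy as np

def select_n_components(X, max_k=None):
    """Pick the number of PPCA components with a BIC-style criterion.

    Illustrative analogue of automatic component selection; the paper's
    actual criterion is minimum message length, not BIC.
    """
    n, d = X.shape
    max_k = max_k or d - 1
    # Eigenvalues of the sample covariance, largest first.
    S = np.cov(X, rowvar=False)
    lam = np.sort(np.linalg.eigvalsh(S))[::-1]
    best_k, best_score = 1, -np.inf
    for k in range(1, max_k + 1):
        sigma2 = lam[k:].mean()            # ML isotropic residual variance
        # Maximised PPCA log-likelihood (Tipping & Bishop, 1999).
        ll = -0.5 * n * (d * np.log(2 * np.pi)
                         + np.log(lam[:k]).sum()
                         + (d - k) * np.log(sigma2) + d)
        m = d * k - k * (k - 1) / 2 + 1    # free parameters of the rank-k model
        score = ll - 0.5 * m * np.log(n)   # BIC
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

On synthetic data with a strongly separated spectrum this recovers the true latent dimension; with weaker signal the penalty term dominates and fewer components are kept.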
Managing uncertainty in integrated environmental modelling: the UncertWeb framework
Web-based distributed modelling architectures are gaining increasing recognition as potentially useful tools to build holistic environmental models, combining individual components in complex workflows. However, existing web-based modelling frameworks currently offer no support for managing uncertainty. On the other hand, the rich array of modelling frameworks and simulation tools which support uncertainty propagation in complex and chained models typically lack the benefits of web-based solutions such as ready publication, discoverability and easy access. In this article we describe the developments within the UncertWeb project, which are designed to provide uncertainty support in the context of the proposed ‘Model Web’. We give an overview of uncertainty in modelling, review uncertainty management in existing modelling frameworks and consider the semantic and interoperability issues raised by integrated modelling. We describe the scope and architecture required to support uncertainty management as developed in UncertWeb. This includes tools which support elicitation, aggregation/disaggregation, visualisation and uncertainty/sensitivity analysis. We conclude by highlighting areas that require further research and development in UncertWeb, such as model calibration and inference within complex environmental models.
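The core idea of uncertainty propagation through chained models can be sketched with plain Monte Carlo sampling. The two toy components and all their coefficients below are invented for illustration; they are not part of the UncertWeb architecture or its services, which exchange uncertain quantities between distributed web components rather than in-process functions.

```python
import numpy as np

def rainfall_runoff(rain_mm):
    """Toy first component: runoff as a nonlinear function of rainfall."""
    return 0.6 * rain_mm ** 1.2

def runoff_to_river_level(runoff):
    """Toy second component consuming the first component's output."""
    return 0.01 * runoff + 1.0

rng = np.random.default_rng(42)
# Uncertain input: rainfall ~ N(50, 10) mm, represented by samples
# (clipped at zero, since negative rainfall is not physical).
rain_samples = np.clip(rng.normal(50.0, 10.0, size=20_000), 0.0, None)
# Chain the components sample-by-sample: the output samples carry the
# propagated uncertainty, from which summaries can be computed.
level_samples = runoff_to_river_level(rainfall_runoff(rain_samples))
mean, sd = level_samples.mean(), level_samples.std()
```

The same sample-based representation is what lets a chained workflow report a full output distribution instead of a single deterministic value.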
Ergonomics of the Operative Field in Paediatric Minimal Access Surgery
Outside The Machine Learning Blackbox: Supporting Analysts Before And After The Learning Algorithm
Applying machine learning to real problems is non-trivial because many important steps are needed to prepare for learning and to interpret the results after learning. This dissertation investigates four problems that arise before and after applying learning algorithms. First, how can we verify a dataset contains "good" information? I propose cross-data validation for quantifying the quality of a dataset relative to a benchmark dataset and define a data efficiency ratio that measures how efficiently the dataset in question collects information (relative to the benchmark). Using these methods I demonstrate the quality of bird observations collected by the eBird citizen science project, which has few quality controls. Second, can off-the-shelf algorithms learn a model with good task-specific performance, or must the user have expertise both in the domain and in machine learning? In many applications, standard performance metrics are inappropriate, and most analysts lack the expertise or time to customize algorithms to optimize task-specific metrics. Ensemble selection offers a potential solution: build an ensemble to optimize the desired metric. I evaluate ensemble selection's ability to optimize for domain-specific metrics on natural language processing tasks and show that ensemble selection usually improves performance but sometimes overfits. Third, how can we understand complex models? Understanding a model is often as important as its accuracy. I propose and evaluate statistics for measuring the importance of inputs used by a decision tree ensemble. The statistics agree with sensitivity analysis and, in an application to bird distribution models, are 500 times faster to compute. The statistics have been used to study hundreds of bird distribution models. Fourth, how should data be pre-processed when learning a high-performing ensemble? I examine the behavior of variable selection and bagging using a bias-variance analysis of error.
The results show that the most accurate variable subset corresponds to the best bias-variance trade-off point. Often, this is not the point separating relevant from irrelevant inputs. Variable selection should be viewed as a variance reduction method and thus is often redundant for low-variance methods like bagging. The best bagged model performance usually is obtained using all available inputs.
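The ensemble-selection idea discussed above can be sketched as greedy forward selection: repeatedly add (with replacement) the model whose inclusion most improves the target metric on a hillclimb set. This is a minimal sketch of the general technique, not the dissertation's implementation; the accuracy metric and the toy predictions in the usage test are invented for illustration.

```python
import numpy as np

def ensemble_select(preds, y, metric, rounds=10):
    """Greedy ensemble selection: at each round, add the model (with
    replacement) whose inclusion maximises `metric` on the held-out labels.
    `preds` is a list of per-model probability predictions for `y`."""
    chosen = []
    ens_sum = np.zeros_like(y, dtype=float)
    for _ in range(rounds):
        scores = [metric(y, (ens_sum + p) / (len(chosen) + 1)) for p in preds]
        best = int(np.argmax(scores))
        chosen.append(best)
        ens_sum = ens_sum + preds[best]
    return chosen, ens_sum / len(chosen)

def accuracy(y, p):
    """Example task metric; any domain-specific metric can be plugged in."""
    return float(np.mean((p > 0.5) == y))
```

Because selection is driven entirely by the plugged-in metric, the same loop optimises whatever domain-specific measure the analyst cares about; the overfitting risk the abstract mentions arises when the hillclimb set is small.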
UNDERSTANDING OF THE VARIABILITY OF PHYTOPLANKTON ECOSYSTEM FUNCTION PROPERTIES: A SYNERGISTIC USE OF REMOTE SENSING AND IN SITU DATA
The majority of the Earth's surface (~71%) is covered by the aquatic
environment, of which 97% is the oceanic regime. Almost every part of the
aquatic regime is dominated by microscopic plants called phytoplankton. Being
at the bottom of the food chain, these ecological drivers influence the
Earth's climate system as well as the biodiversity trends of other organisms
such as zooplankton, fish, seabirds and marine mammals.
The aim of this research was to understand the ecology of phytoplankton and
assess which environmental, physical, biological, and spatiotemporal factors influence
their distribution and abundance. Using this information a knowledge-based expert
system discriminated phytoplankton functional types. The ecological knowledge was
derived from the Continuous Plankton Recorder (CPR) survey, whereas information
regarding the physical regime was acquired from satellite remote sensing. The data
matrix was analysed using Generalised Additive Models (GAMs) and Artificial
Neural Networks (ANNs).
The significant relationships established by the synergistic use of the CPR
measure of phytoplankton biomass and satellite chlorophyll-a (Chl-a) allowed
the production of a >50-year Chl-a dataset in the Northeast Atlantic and North
Sea. It was found that the documented mid-1980s regime shift corresponded to a
60% increase in Chl-a since 1948, the result of an 80% increase in Chl-a
during winter alongside a smaller summer increase.
GAMs indicated that the combined effects of high solar radiation, shallow
mixed layer depth and increased temperatures explained more than 89% of the
coccolithophore variation. The June 1998 bloom, which was associated with high
light intensity, unusually high sea-surface temperature (SST) and a very shallow
mixed layer, was found to be one of the most extensive (~1 million km²) blooms
ever recorded. There was a pronounced SST shift in the mid-1990s, with a peak
in 1998,
suggesting that exceptionally large blooms are caused by pronounced environmental
conditions and the variability of the physical environment strongly affects the spatial
extent of these blooms.
Diatom abundance in the epipelagic zone of the Northern North Atlantic was
mainly driven by SST. The ANNs indicated that higher SSTs could lead to a rapid
decrease in diatom abundance: increased SST can stratify the water column for
longer, preventing nutrients from reaching the surface. Therefore, further SST
increases may be devastating to diatoms but may benefit smaller plankton such
as coccolithophores and/or dinoflagellates.
Finally, the knowledge gained through the developed methodological
approaches was used to identify/discriminate phytoplankton functional groups
(diatoms, dinoflagellates, coccolithophores and silicoflagellates) with an
accuracy of greater than 70%. The most important information for phytoplankton
functional group discrimination was spatiotemporal information, and the most
important physical-environment variable was SST. Future research aimed at the
identification of functional groups from remotely sensed data should include
fundamental information on the physical environment as well as spatiotemporal
information, rather than being based on bio-optical measurements alone.
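The flavour of a knowledge-based expert system driven by spatiotemporal and physical-environment inputs can be sketched with a few hand-written rules. Every threshold and rule below is invented purely for illustration; they are not the thesis's calibrated rules, which were derived from CPR and satellite data.

```python
def classify_group(month, sst_c, chl_mg_m3):
    """Toy rule base assigning a phytoplankton functional group from the
    month (spatiotemporal input), sea-surface temperature in deg C and
    chlorophyll-a concentration. Thresholds are illustrative assumptions."""
    if month in (4, 5) and sst_c < 12 and chl_mg_m3 > 1.0:
        return "diatom"           # cool, well-mixed spring-bloom conditions
    if month in (6, 7, 8) and sst_c > 14:
        return "coccolithophore"  # warm, stratified mid-summer water
    if month in (8, 9) and chl_mg_m3 < 1.0:
        return "dinoflagellate"   # late-summer, low-biomass stratified water
    return "mixed/other"
```

The appeal of this representation is that each rule is readable ecological knowledge, which is why the discrimination accuracy and the ranking of input importance (spatiotemporal first, then SST) can be interpreted directly.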
Further development, potential applications and future research are discussed.
Sir Alister Hardy Foundation for Ocean Science