A hierarchical latent variable model for data visualization
Visualization has proven to be a powerful and widely applicable tool for the analysis and interpretation of data. Most visualization algorithms aim to find a projection from the data space down to a two-dimensional visualization space. However, for complex data sets living in a high-dimensional space, it is unlikely that a single two-dimensional projection can reveal all of the interesting structure. We therefore introduce a hierarchical visualization algorithm which allows the complete data set to be visualized at the top level, with clusters and sub-clusters of data points visualized at deeper levels. The algorithm is based on a hierarchical mixture of latent variable models, whose parameters are estimated using the expectation-maximization algorithm. We demonstrate the principle of the approach first on a toy data set, and then apply the algorithm to the visualization of a synthetic data set in 12 dimensions obtained from a simulation of multi-phase flows in oil pipelines, and to data in 36 dimensions derived from satellite images.
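As a rough illustration of the two-level idea, the sketch below builds a top-level 2-D view of all the data and a separate 2-D view per mixture component. It substitutes per-cluster PCA for the paper's jointly trained hierarchical mixture of latent variable models; the function name and cluster count are illustrative.

```python
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def hierarchical_views(X, n_clusters=3):
    """Top-level 2-D projection plus one 2-D projection per cluster.

    Illustrative stand-in only: the paper trains a hierarchical mixture
    of latent variable models jointly by EM, not per-cluster PCA.
    """
    top_view = PCA(n_components=2).fit_transform(X)        # whole-data view
    labels = GaussianMixture(n_components=n_clusters,
                             random_state=0).fit_predict(X)
    sub_views = {}
    for k in range(n_clusters):
        Xk = X[labels == k]
        if len(Xk) > 2:  # need a few points to fit a 2-D projection
            sub_views[k] = PCA(n_components=2).fit_transform(Xk)
    return top_view, labels, sub_views
```

Deeper levels of the hierarchy would be obtained by recursing on each sub-cluster in the same way.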
Probabilistic principal component analysis
Principal component analysis (PCA) is a ubiquitous technique for data analysis and processing, but one which is not based upon a probability model. In this paper we demonstrate how the principal axes of a set of observed data vectors may be determined through maximum-likelihood estimation of the parameters in a latent variable model closely related to factor analysis. We consider the properties of the associated likelihood function, give an EM algorithm for estimating the principal subspace iteratively, and discuss the advantages conveyed by the definition of a probability density function for PCA.
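The EM algorithm for this model can be written compactly in terms of the sample covariance matrix. The sketch below follows the form of the Tipping & Bishop updates; the random initialization and fixed iteration count are assumptions, not choices from the paper.

```python
import numpy as np

def ppca_em(X, q, n_iter=200, seed=0):
    """EM for probabilistic PCA (illustrative sketch).

    Model: x = W z + mu + eps,  z ~ N(0, I_q),  eps ~ N(0, sigma^2 I_d).
    """
    n, d = X.shape
    mu = X.mean(axis=0)
    S = np.cov(X - mu, rowvar=False, bias=True)    # d x d sample covariance
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(d, q))
    sigma2 = 1.0
    for _ in range(n_iter):
        M = W.T @ W + sigma2 * np.eye(q)           # q x q posterior precision
        Minv = np.linalg.inv(M)
        SW = S @ W                                 # d x q
        # M-step: W_new = S W (sigma^2 I + M^{-1} W^T S W)^{-1}
        W_new = SW @ np.linalg.inv(sigma2 * np.eye(q) + Minv @ W.T @ SW)
        # sigma^2_new = tr(S - S W M^{-1} W_new^T) / d
        sigma2 = np.trace(S - SW @ Minv @ W_new.T) / d
        W = W_new
    return mu, W, sigma2
```

At convergence the columns of W span the principal subspace, and sigma2 estimates the average variance lost in the discarded directions.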
Ecological indicators for abandoned mines, Phase 1: Review of the literature
Mine waters have been identified as a significant issue in the majority of Environment Agency draft River Basin Management Plans. They are one of the largest drivers of chemical pollution in the draft Impact Assessment for the Water Framework Directive (WFD), with significant failures of environmental quality standards (EQS) for metals (particularly Cd, Pb, Zn, Cu, Fe) in many rivers linked to abandoned mines. Existing EQS may be overprotective of aquatic life, which may have adapted over centuries of exposure. This study forms part of a larger project to investigate the ecological impact of metals in rivers, with the aim of developing water quality targets (alternative objectives for the WFD) for aquatic ecosystems impacted by long-term mining pollution. The report reviews literature on EQS failures, metal effects on aquatic biota and effects of water chemistry, and uses this information to consider further work.
A preliminary assessment of water quality and biology data for 87 sites across Gwynedd and Ceredigion (Wales) shows that existing Environment Agency water quality and biology data could be used to establish statistical relations between chemical variables and metrics of ecological quality. Visual representation and preliminary statistical analyses show that invertebrate diversity declines with increasing zinc concentration. However, the situation is more complex because the effects of other metals are not readily apparent. Furthermore, pH and aluminium also affect streamwater invertebrates, making it difficult to tease out toxicity due to individual mine-derived metals.
The most characteristic feature of the plant communities of metal-impacted systems is a reduction in diversity compared to that found in comparable unimpacted streams. Some species thrive in the presence of heavy metals, presumably because they are able to develop metal tolerance, whilst others consistently disappear. Effects are, however, confounded by water chemistry, particularly pH. Tolerant species are spread across a number of divisions of photosynthetic organisms, though green algae, diatoms and blue-green algae are usually most abundant, often thriving in the absence of competition and/or grazing. Current UK monitoring techniques focus on community composition and, whilst these provide a sampling and analytical framework for studies of metal impacts, the metrics are not sensitive to these impacts. There is scope for developing new metrics based on community-level analyses, and for examining the morphological variations common in some taxa at elevated metal concentrations. On the whole, community-based metrics are recommended, as these are easier to relate to ecological status definitions.
With respect to invertebrates and fish, metals affect individuals, populations and communities, but sensitivity varies among species, life stages, sexes, trophic groups and with body condition. Acclimation or adaptation may cause varying sensitivity even within species. Ecosystem-scale effects, for example on ecological function, are poorly understood. Effects vary between metals: cadmium, copper, lead, chromium, zinc and nickel, in order of decreasing toxicity. Aluminium is important in acidified headwaters. Biological effects depend on speciation, toxicity, availability, mixtures, complexation and exposure conditions, for example discharge (flow). Current water quality monitoring is unlikely to detect short-term episodic increases in metal concentrations or to evaluate the bioavailability of elevated metal concentrations in sediments. These factors create uncertainty in detecting ecological impairment in metal-impacted ecosystems. Moreover, most widely used biological indicators for UK freshwaters were developed for other pressures, and none distinguishes metal impacts from other causes of impairment. Key ecological needs for better regulation and management of metals in rivers include: i) models relating metal data to ecological data that better represent influences on metal toxicity; ii) biodiagnostic indices to reflect metal effects; iii) better methods to identify metal acclimation or adaptation among sensitive taxa; iv) better investigative procedures to isolate metal effects from other pressures.
Laboratory data on the effects of water chemistry on cationic metal toxicity and bioaccumulation show that a number of chemical parameters, particularly pH, dissolved organic carbon (DOC) and major cations (Na, Mg, K, Ca), exert a major influence on the toxicity and/or bioaccumulation of cationic metals. The biotic ligand model (BLM) provides a conceptual framework for understanding these water chemistry effects as a combination of the influence of chemical speciation and metal uptake by organisms in competition with H+ and other cations. In cases where the BLM cannot describe effects, empirical bioavailability models have been used successfully. Laboratory data on the effects of metal mixtures across different water chemistries are sparse, which has implications for transferring this understanding to mining-impacted sites in the field, where mixture effects are likely.
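A minimal sketch of the competitive-binding picture behind the BLM: toxicity is tied to the fraction of biotic-ligand sites occupied by the free metal ion, with H+ and Ca2+ competing for the same sites. The binding constants below are placeholders, not fitted values.

```python
def blm_site_occupancy(free_metal, h_plus, ca, K_M=1e6, K_H=1e5, K_Ca=1e3):
    """Fraction of biotic-ligand sites bound by the free metal ion.

    All concentrations in mol/L; K values are conditional stability
    constants for binding to the biotic ligand (hypothetical numbers).
    """
    bound = K_M * free_metal
    competition = 1.0 + bound + K_H * h_plus + K_Ca * ca
    return bound / competition

# The same free Zn2+ concentration is predicted to be less toxic in
# harder, more acidic water, because Ca2+ and H+ occupy more sites.
soft = blm_site_occupancy(1e-7, h_plus=1e-7, ca=1e-4)
hard = blm_site_occupancy(1e-7, h_plus=1e-5, ca=1e-3)
```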
The available field data, although relatively sparse, indicate that water chemistry influences metal effects on aquatic ecosystems. This occurs, firstly, through complexation reactions, notably involving dissolved organic matter and metals such as Al, Cu and Pb. Secondly, because bioaccumulation and toxicity are partly governed by complexation reactions, competition effects among metals, and between metals and H+, give rise to dependences upon water chemistry. There is evidence that combinations of metals are active in the field; the main study conducted so far demonstrated the combined effects of Al and Zn, and suggested, less certainly, that Cu and H+ can also contribute. Chemical speciation is essential to interpret and predict observed effects in the field, and speciation results need to be combined with a model that relates free ion concentrations to toxic effect. Understanding the toxic effects of heavy metals derived from abandoned mines therefore requires the simultaneous consideration of the acidity-related components Al and H+.
There are a number of reasons why organisms in waters affected by abandoned mines may experience different levels of metal toxicity than in the laboratory. This could lead to discrepancies between actual field behaviour and that predicted by EQS derived from laboratory experiments, as would be applied within the WFD. The main factors to consider are adaptation/acclimation, water chemistry, and the effects of combinations of metals. Secondary factors are metals in food, metals supplied by sediments, and variability in stream flows. Two of the most prominent factors, namely adaptation/acclimation and bioavailability, could justify changes in EQS or the adoption of an alternative measure of toxic effects in the field. Given how widespread abandoned mines are in England and Wales, and the high cost of remediating them to meet proposed WFD EQS criteria, further research into this question is clearly justified.
Although ecological communities of mine-affected streamwaters might be over-protected by proposed WFD EQS, there are some conditions under which metals emanating from abandoned mines definitely exert toxic effects on biota. The main issue is therefore the reliable identification of chemical conditions that are unacceptable, and the comparison of those conditions with those predicted by WFD EQS. If significant differences can convincingly be demonstrated, the argument could be made for alternative standards for waters affected by abandoned mines. Therefore, in our view, the immediate research priority is to improve the quantification of metal effects under field circumstances. Demonstration of dose-response relationships, based on metal mixtures and their chemical speciation, together with better biological tools to detect and diagnose community-level impairment, would provide the necessary scientific information.
Simulation of carbon cycling, including dissolved organic carbon transport, in forest soil locally enriched with 14C
The DyDOC model was used to simulate the soil carbon cycle of a deciduous forest at the Oak Ridge Reservation (Tennessee, USA). The model application relied on extensive data from the Enriched Background Isotope Study (EBIS), which exploited a short-term local atmospheric enrichment of radiocarbon to establish a large-scale manipulation experiment with different inputs of 14C from both above-ground and below-ground litter. The model was first fitted to hydrological data, then observed pools and fluxes of carbon and 14C were used to fit parameters describing metabolic transformations of soil organic matter (SOM) components and the transport and sorption of dissolved organic matter (DOM). This produced a detailed quantitative description of soil C cycling in the three horizons (O, A, B) of the soil profile. According to the parameterised model, SOM turnover within the thin O-horizon rapidly produces DOM (46 gC m-2 a-1), which is predominantly hydrophobic. This DOM is nearly all adsorbed in the A- and B-horizons, and while most is mineralised relatively quickly, 11 gC m-2 a-1 undergoes a “maturing” reaction, producing mineral-associated stable SOM pools with mean residence times of 100-200 years. Only a small flux (~ 1 gC m-2 a-1) of hydrophilic DOM leaves the B-horizon. The SOM not associated with mineral matter is assumed to be derived from root litter, and turns over quite quickly (mean residence time 20-30 years). Although DyDOC was successfully fitted to C pools, annual fluxes and 14C data, it accounted less well for short-term variations in DOC concentrations.
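The stable-pool behaviour implied by these fluxes can be illustrated with a one-pool, first-order model. This is a back-of-the-envelope sketch, not DyDOC itself; the 150-year residence time is an assumed midpoint of the quoted 100-200 year range.

```python
import numpy as np

def stable_som_trajectory(maturing_flux=11.0, mrt_years=150.0,
                          years=1000, dt=0.5):
    """Forward-Euler integration of dC/dt = F_in - C / tau.

    F_in = 11 gC m-2 a-1 of "maturing" DOM (from the abstract);
    tau = assumed 150 a mean residence time.  The pool approaches a
    steady state of F_in * tau ~ 1650 gC m-2.
    """
    n = int(years / dt)
    C = np.zeros(n)                     # stable SOM pool, gC m-2
    for i in range(1, n):
        C[i] = C[i - 1] + dt * (maturing_flux - C[i - 1] / mrt_years)
    return C
```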
Active Sampling-based Binary Verification of Dynamical Systems
Nonlinear, adaptive, or otherwise complex control techniques are increasingly relied upon to ensure the safety of systems operating in uncertain environments. However, the nonlinearity of the resulting closed-loop system complicates verification that the system does in fact satisfy its safety requirements at all possible operating conditions. While analytical proof-based techniques and finite abstractions can be used to provably verify the closed-loop system's response at different operating conditions, they often produce conservative approximations due to restrictive assumptions and are difficult to construct in many applications. In contrast, popular statistical verification techniques relax the restrictions and instead rely upon simulations to construct statistical or probabilistic guarantees. This work presents a data-driven statistical verification procedure that instead constructs statistical learning models from simulated training data to separate the set of possible perturbations into "safe" and "unsafe" subsets. Binary evaluations of closed-loop requirement satisfaction at various realizations of the uncertainties are obtained through temporal logic robustness metrics, which are then used to construct predictive models of requirement satisfaction over the full set of possible uncertainties. As the accuracy of these predictive statistical models is inherently coupled to the quality of the training data, an active learning algorithm selects additional sample points in order to maximize the expected change in the data-driven model and thus, indirectly, minimize the prediction error. Various case studies demonstrate the closed-loop verification procedure and highlight improvements in prediction error over both existing analytical and statistical verification techniques.
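A minimal sketch of the loop described above, using an SVM margin as a crude stand-in for the paper's expected-model-change criterion; the simulate_safe system and all constants are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

def simulate_safe(theta):
    """Stand-in for a closed-loop simulation plus a temporal logic
    robustness check: returns 1.0 when the (hypothetical) system meets
    its requirement at operating condition theta."""
    return float(theta[0] ** 2 + 0.5 * theta[1] ** 2 < 1.0)

rng = np.random.default_rng(0)
pool = rng.uniform(-2, 2, size=(2000, 2))        # candidate perturbations
idx = list(rng.choice(len(pool), size=40, replace=False))
y = [simulate_safe(pool[i]) for i in idx]        # initial labeled set

for _ in range(40):                              # active-sampling loop
    clf = SVC(kernel="rbf").fit(pool[idx], y)
    # Query the unlabeled point closest to the decision boundary -- a
    # margin-based proxy for maximizing expected model change.
    margins = np.abs(clf.decision_function(pool))
    margins[idx] = np.inf                        # never re-query labeled points
    new = int(np.argmin(margins))
    idx.append(new)
    y.append(simulate_safe(pool[new]))

safe_predictions = clf.predict(pool)             # predicted safe/unsafe split
```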
Statistical Mechanical Development of a Sparse Bayesian Classifier
The demand for extracting rules from high-dimensional real-world data is increasing in various fields. However, the possible redundancy of such data sometimes makes it difficult to obtain good generalization ability for novel samples. To resolve this problem, we provide a scheme that reduces the effective dimensions of data by pruning redundant components for bicategorical classification based on the Bayesian framework. First, the potential of the proposed method is confirmed in ideal situations using the replica method. Unfortunately, performing the scheme exactly is computationally difficult, so we next develop a tractable approximation algorithm, which turns out to offer nearly optimal performance in ideal cases when the system size is large. Finally, the efficacy of the developed classifier is experimentally examined on a real-world problem of colon cancer classification, which shows that the developed method can be practically useful.
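As a rough practical stand-in for this kind of pruning (not the paper's replica-based Bayesian scheme), an L1-penalized logistic regression also drives the weights of redundant components to exactly zero; the data below are synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d, d_relevant = 200, 500, 10           # high-dimensional, sparse truth
w_true = np.zeros(d)
w_true[:d_relevant] = rng.normal(size=d_relevant)
X = rng.normal(size=(n, d))
y = (X @ w_true + 0.1 * rng.normal(size=n) > 0).astype(int)

# The L1 penalty zeroes out uninformative components, shrinking the
# effective dimension before classification.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
kept = np.flatnonzero(clf.coef_[0])       # surviving components
print(f"kept {kept.size} of {d} components")
```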
BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees
The rising volume of datasets has made training machine learning (ML) models a major computational cost in the enterprise. Given the iterative nature of model and parameter tuning, many analysts use a small sample of their entire data during the initial stage of analysis to make quick decisions (e.g., what features or hyperparameters to use) and use the entire dataset only in later stages (i.e., once they have converged to a specific model). This sampling, however, is performed in an ad hoc fashion. Most practitioners cannot precisely capture the effect of sampling on the quality of their model, and eventually on their decision-making process during the tuning phase. Moreover, without systematic support for sampling operators, many optimizations and reuse opportunities are lost.
In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML training. BlinkML allows users to make error-computation tradeoffs: instead of training a model on their full data (i.e., the full model), BlinkML can quickly train an approximate model with quality guarantees using a sample. The quality guarantees ensure that, with high probability, the approximate model makes the same predictions as the full model. BlinkML currently supports any ML model that relies on maximum likelihood estimation (MLE), which includes Generalized Linear Models (e.g., linear regression, logistic regression, max entropy classifiers, Poisson regression) as well as PPCA (Probabilistic Principal Component Analysis). Our experiments show that BlinkML can speed up the training of large-scale ML tasks by 6.26x-629x while guaranteeing the same predictions, with 95% probability, as the full model.
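A sketch of the core statistical idea, assuming a logistic-regression MLE: fit on a sample, approximate the estimator's covariance by the inverse observed Fisher information, and flag test points whose predicted label could plausibly differ under that uncertainty. BlinkML's actual estimator and guarantees are more elaborate than this.

```python
import numpy as np
from scipy.special import expit

def sample_mle_with_guarantee(X_s, y_s, X_test, z=1.96):
    """Fit a logistic MLE on a sample and flag confident predictions.

    Simplified sketch: treats the inverse observed Fisher information as
    the parameter covariance and marks a test prediction "confident"
    when its margin exceeds z standard deviations.
    """
    n, d = X_s.shape
    theta = np.zeros(d)
    for _ in range(25):                    # Newton's method for the logistic MLE
        p = expit(X_s @ theta)
        grad = X_s.T @ (y_s - p)
        H = -(X_s * (p * (1 - p))[:, None]).T @ X_s - 1e-8 * np.eye(d)
        theta -= np.linalg.solve(H, grad)
    cov = np.linalg.inv(-H)                # asymptotic covariance of theta_hat
    margin = X_test @ theta
    sd = np.sqrt(np.einsum("ij,jk,ik->i", X_test, cov, X_test))
    confident = np.abs(margin) > z * sd    # label unlikely to flip under full data
    return (margin > 0).astype(int), confident
```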