Information content of spatially distributed ground-based measurements for hydrologic-parameter calibration in mixed rain-snow mountain headwaters
Parameters in hydrologic models used in mixed rain-snow regions are often difficult to calibrate and prone to overfitting when calibrated on streamflow alone. To help address these challenges, we used an algorithm that assesses model performance through time (Dynamic Identifiability Analysis) to quantify the information content of spatially distributed ground-based measurements for identifying optimal parameter values in the Precipitation Runoff Modeling System (PRMS) model. Including spatially distributed ground-based measurements in identifiability analysis allowed us to unambiguously estimate more parameter values than using streamflow alone (seven parameters instead of two, out of a pool of thirty-three). Peaks in information gain were obtained when using dew-point temperature to identify precipitation phase-partitioning parameters. Multi-attribute identifiability analysis also yielded optimal parameter values that were less variable in time than those estimated using streamflow alone. Overall, identifying parameter values using ground-based measurements improved the simulation of key drivers of the surface-water budget, such as air temperature and precipitation-phase partitioning. However, parameters governing surface-to-subsurface mass fluxes such as snow accumulation and melt or evapotranspiration were poorly identified by any attribute and so emerged as key sources of predictive uncertainty for this distributed-parameter hydrologic model. This work demonstrates the value of expanded ground-based measurements for identifying parameters in distributed-parameter hydrologic models, and thus for diagnosing their conceptual uncertainty across the water budget.
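The windowed-identifiability idea behind Dynamic Identifiability Analysis can be sketched on a toy model. Everything here is an illustrative assumption, not PRMS: the two-parameter model, the parameter ranges, and the "best 10%" threshold are invented for the sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def toy_model(theta, t):
    """Stand-in 'hydrologic' model: theta[0] scales a seasonal signal,
    theta[1] a linear trend. Purely illustrative, not PRMS."""
    return theta[0] * np.sin(t) + theta[1] * t

# Synthetic observations from a known truth, plus noise.
t = np.linspace(0, 6, 120)
obs = toy_model(np.array([2.0, 0.5]), t) + 0.1 * rng.normal(size=t.size)

# Monte-Carlo sample of candidate parameter sets over assumed prior ranges.
samples = rng.uniform([0, 0], [4, 1], size=(2000, 2))

def dynia_width(samples, obs, t, window=20, top=0.1):
    """For each time window, keep the best-performing fraction of
    parameter sets (lowest windowed RMSE) and report the spread of each
    parameter among them; a narrow spread marks that parameter as
    identifiable in that window."""
    widths = []
    for s in range(0, t.size - window, window):
        sl = slice(s, s + window)
        rmse = np.array([np.sqrt(np.mean((toy_model(p, t[sl]) - obs[sl]) ** 2))
                         for p in samples])
        best = samples[np.argsort(rmse)[: int(top * len(samples))]]
        widths.append(best.std(axis=0))
    return np.array(widths)  # shape: (n_windows, n_params)
```

Adding further observed attributes (dew-point temperature, snow depth, and so on) amounts to computing such windowed errors against each attribute and comparing which windows and attributes narrow which parameters.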
A multi-resolution approximation for massive spatial datasets
Automated sensing instruments on satellites and aircraft have enabled the
collection of massive amounts of high-resolution observations of spatial fields
over large spatial regions. If these datasets can be efficiently exploited,
they can provide new insights on a wide variety of issues. However, traditional
spatial-statistical techniques such as kriging are not computationally feasible
for big datasets. We propose a multi-resolution approximation (M-RA) of
Gaussian processes observed at irregular locations in space. The M-RA process
is specified as a linear combination of basis functions at multiple levels of
spatial resolution, which can capture spatial structure from very fine to very
large scales. The basis functions are automatically chosen to approximate a
given covariance function, which can be nonstationary. All computations
involving the M-RA, including parameter inference and prediction, are highly
scalable for massive datasets. Crucially, the inference algorithms can also be
parallelized to take full advantage of large distributed-memory computing
environments. In comparisons using simulated data and a large satellite
dataset, the M-RA outperforms a related state-of-the-art method. (23 pages; to be published in the Journal of the American Statistical Association.)
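The basis-function idea can be illustrated with a single-level low-rank (predictive-process-style) approximation of a Gaussian-process covariance; the M-RA itself stacks further levels of compactly supported basis functions on a recursive domain partition. The covariance model, knot placement, and grid below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def expcov(a, b, range_=0.3):
    """Exponential covariance C(s, t) = exp(-|s - t| / range_) on 1-D locations."""
    return np.exp(-np.abs(a[:, None] - b[None, :]) / range_)

def low_rank_approx(locs, knots, cov=expcov):
    """Single-level basis-function approximation:
    C(locs, locs) ~ B K^{-1} B^T, with B = C(locs, knots), K = C(knots, knots).
    Each column of B K^{-1/2} acts as a basis function anchored at a knot;
    the M-RA adds finer-scale basis levels on subregions to capture the remainder.
    """
    B = cov(locs, knots)
    K = cov(knots, knots)
    return B @ np.linalg.solve(K, B.T)

locs = np.linspace(0, 1, 200)     # observation locations
knots = np.linspace(0, 1, 25)     # coarse-level knot locations
C = expcov(locs, locs)            # exact covariance
C_hat = low_rank_approx(locs, knots)
rel_err = np.linalg.norm(C - C_hat) / np.linalg.norm(C)
```

Because all linear algebra involves only the number of knots (here 25) rather than the number of observations, this kind of structure is what makes inference scalable, and the multi-resolution layering recovers the fine-scale structure a single level misses.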
Approaches to conceptual clustering
Methods for conceptual clustering may be explicated in two lights. Conceptual clustering methods may be viewed as extensions of techniques from numerical taxonomy, a collection of methods developed by social and natural scientists for creating classification schemes over object sets. Alternatively, conceptual clustering may be viewed as a form of learning by observation or concept formation, as opposed to methods of learning from examples or concept identification. In this paper we survey and compare a number of conceptual clustering methods along dimensions suggested by each of these views. The point we most wish to clarify is that conceptual clustering processes can be explicated as being composed of three distinct but interdependent subprocesses: the process of deriving a hierarchical classification scheme; the process of aggregating objects into individual classes; and the process of assigning conceptual descriptions to object classes. Each subprocess may be characterized along a number of dimensions related to search, thus facilitating a better understanding of the conceptual clustering process as a whole.
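The three subprocesses can be made concrete in a small sketch over attribute-value objects. The split criterion (attribute with the most distinct values) and the conjunctive class descriptions are toy assumptions for illustration, not any of the surveyed methods.

```python
def describe(cluster):
    """Subprocess 3: assign a conceptual description to a class --
    here, the conjunction of attribute values shared by every member."""
    first = cluster[0]
    return {k: first[k] for k in first
            if all(obj[k] == first[k] for obj in cluster)}

def split(objects):
    """Subprocess 2: aggregate objects into classes -- here, grouped by
    the attribute with the most distinct values (a toy criterion)."""
    attr = max(objects[0], key=lambda a: len({o[a] for o in objects}))
    groups = {}
    for o in objects:
        groups.setdefault(o[attr], []).append(o)
    return list(groups.values())

def cluster_tree(objects, depth=2):
    """Subprocess 1: derive a hierarchical classification scheme by
    recursive splitting, attaching a description to each node."""
    node = {"description": describe(objects), "children": []}
    if depth > 0 and len(objects) > 1:
        groups = split(objects)
        if len(groups) > 1:
            node["children"] = [cluster_tree(g, depth - 1) for g in groups]
    return node
```

Real conceptual clustering systems differ precisely in how each subprocess searches its space: the hierarchy may be built top-down or bottom-up, aggregation may optimize a quality measure, and descriptions may be richer than conjunctions.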
Distributed Correlation-Based Feature Selection in Spark
CFS (Correlation-Based Feature Selection) is an FS algorithm that has been
successfully applied to classification problems in many domains. We describe
Distributed CFS (DiCFS) as a completely redesigned, scalable, parallel and
distributed version of the CFS algorithm, capable of dealing with the large
volumes of data typical of big data applications. Two versions of the algorithm
were implemented and compared using the Apache Spark cluster-computing model, which is currently gaining popularity due to its much faster processing times than Hadoop's MapReduce model. We tested our algorithms on four publicly available
datasets, each consisting of a large number of instances and two also
consisting of a large number of features. The results show that our algorithms
were superior in terms of both time-efficiency and scalability. In leveraging a
computer cluster, they were able to handle larger datasets than the
non-distributed WEKA version while maintaining the quality of the results,
i.e., exactly the same features were returned by our algorithms when compared
to the original algorithm available in WEKA. (25 pages, 5 figures.)
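The core of CFS is a subset merit that rewards feature-class correlation and penalizes feature-feature redundancy. A minimal single-machine sketch follows, using Pearson correlation as a stand-in for the symmetrical uncertainty that WEKA's CFS computes on discretized data; DiCFS distributes exactly these correlation and search computations across a Spark cluster.

```python
import numpy as np

def cfs_merit(X, y, subset):
    """CFS merit of a feature subset:
    merit = k * mean|r_cf| / sqrt(k + k*(k-1) * mean|r_ff|),
    where r_cf are feature-class and r_ff feature-feature correlations."""
    k = len(subset)
    rcf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return rcf
    rff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                   for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * rcf / np.sqrt(k + k * (k - 1) * rff)

def cfs_forward_search(X, y):
    """Greedy forward search: add the feature that most improves merit,
    stopping when no addition helps."""
    remaining = list(range(X.shape[1]))
    selected, best = [], -np.inf
    while remaining:
        merit, j = max((cfs_merit(X, y, selected + [j]), j) for j in remaining)
        if merit <= best:
            break
        best = merit
        selected.append(j)
        remaining.remove(j)
    return selected
```

The redundancy penalty in the denominator is why a near-duplicate of an already-selected feature adds little merit, even though its class correlation is high.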
Efficient regularized isotonic regression with application to gene–gene interaction search
Isotonic regression is a nonparametric approach for fitting monotonic models
to data that has been widely studied from both theoretical and practical
perspectives. However, this approach encounters computational and statistical
overfitting issues in higher dimensions. To address both concerns, we present
an algorithm, which we term Isotonic Recursive Partitioning (IRP), for isotonic
regression based on recursively partitioning the covariate space through
solution of progressively smaller "best cut" subproblems. This creates a
regularized sequence of isotonic models of increasing model complexity that
converges to the global isotonic regression solution. The models along the
sequence are often more accurate than the unregularized isotonic regression
model because of the complexity control they offer. We quantify this complexity
control through estimation of degrees of freedom along the path. Success of the
regularized models in prediction and IRP's favorable computational properties
are demonstrated through a series of simulated and real data experiments. We
discuss application of IRP to the problem of searching for gene–gene
interactions and epistasis, and demonstrate it on data from genome-wide
association studies of three common diseases. (Published at http://dx.doi.org/10.1214/11-AOAS504 in the Annals of Applied Statistics, http://www.imstat.org/aoas/, by the Institute of Mathematical Statistics.)
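IRP converges to the global isotonic regression solution; in one dimension that limit can be computed directly with the classic pool-adjacent-violators algorithm (PAVA). The sketch below is that unregularized 1-D solution, not the paper's recursive-partitioning algorithm.

```python
import numpy as np

def isotonic_regression(y, w=None):
    """Pool Adjacent Violators Algorithm (PAVA): fits the nondecreasing
    sequence g minimizing sum_i w_i * (y_i - g_i)^2."""
    y = np.asarray(y, dtype=float)
    w = np.ones_like(y) if w is None else np.asarray(w, dtype=float)
    # Each block stores [weighted mean, total weight, block length].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge adjacent blocks while monotonicity is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, n2 = blocks.pop()
            m1, w1, n1 = blocks.pop()
            wt = w1 + w2
            blocks.append([(w1 * m1 + w2 * m2) / wt, wt, n1 + n2])
    # Expand blocks back to a fitted value per observation.
    return np.concatenate([np.full(n, m) for m, _, n in blocks])
```

For example, `isotonic_regression([1, 3, 2, 4])` pools the violating pair (3, 2) into their mean, yielding `[1, 2.5, 2.5, 4]`. IRP's contribution is making the multivariate analogue of this fit both computationally tractable and regularized along a sequence of increasingly complex models.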