A Tree-based Federated Learning Approach for Personalized Treatment Effect Estimation from Heterogeneous Data Sources
Federated learning is an appealing framework for analyzing sensitive data
from distributed health data networks due to its protection of data privacy.
Under this framework, data partners at local sites collaboratively build an
analytical model under the orchestration of a coordinating site, while keeping
the data decentralized. However, existing federated learning methods mainly
assume data across sites are homogeneous samples of the global population,
hence failing to properly account for the extra variability across sites in
estimation and inference. Drawing on a multi-hospital electronic health records
network, we develop an efficient and interpretable tree-based ensemble of
personalized treatment effect estimators to join results across hospital sites,
while actively modeling the heterogeneity in data sources through site
partitioning. The efficiency of our method is demonstrated by a study of causal
effects of oxygen saturation on hospital mortality and backed up by
comprehensive numerical results.
Knowledge integration in distributed data mining
Comparative Biology of Three Species of Costa Rican Haeterini
Documenting life history characteristics of populations, especially of herbivorous insects such as butterflies, is fundamental to the ecological study of tropical rainforests. However, we know relatively little about tropical forest butterflies. Here, I combine information gathered using the mark-release-recapture (MRR) approach with manipulative and observational experiments in a natural environment to explore aspects of the population biology of three closely related species of Costa Rican fruit-feeding understory butterflies (Cithaerias pireta, Dulcedo polita, and Pierella helvina), specifically: vertical stratification, attraction to and persistence in fruit-baited traps, relative abundance and distribution, movement patterns, probabilities of recapture and daily survival, and factors that affect those probabilities. Among the three focal species there were differences in capturability, recapturability, spatial distribution, and degree of vertical stratification. Males appear to fly within smaller home ranges than females, and P. helvina can traverse the entire forest reserve in a single day. These findings have implications for the genetic diversity of these populations and for the risk of local extinction in the face of changing ecological conditions.
3D Remote Sensing Applications in Forest Ecology: Composition, Structure and Function
Dear Colleagues, The composition, structure and function of forest ecosystems are the key features characterizing their ecological properties, and can thus be crucially shaped and changed by various biotic and abiotic factors on multiple spatial scales. The magnitude and extent of these changes in recent decades call for enhanced mitigation and adaptation measures. Remote sensing data and methods are the main complementary sources of up-to-date synoptic and objective information on forest ecology. Due to the inherent 3D nature of forest ecosystems, the analysis of 3D sources of remote sensing data is considered to be most appropriate for recreating the forest’s compositional, structural and functional dynamics. In this Special Issue of Forests, we published a set of state-of-the-art scientific works including experimental studies, methodological developments and model validations, all dealing with the general topic of 3D remote sensing-assisted applications in forest ecology. We showed applications in forest ecology from a broad collection of method and sensor combinations, including fusion schemes. All in all, the studies and their focuses are as broad as a forest’s ecology or the field of remote sensing and, thus, reflect the very diverse usages and directions toward which future research and practice will be directed.
Control and surveillance of partially observed stochastic epidemics in a Bayesian framework
This thesis comprises a number of inter-related parts. For most of the thesis we are
concerned with developing a new statistical technique that can enable the identification
of the optimal control by comparing competing control strategies for stochastic
epidemic models in real time. In the second part, we develop a novel approach for
modelling the spread of Peste des Petits Ruminants (PPR) virus within a given country
and the risk of introduction to other countries.
The control of highly infectious diseases of agricultural crops, animals and humans
is considered one of the key challenges in epidemiological and ecological
modelling. Previous methods for analysis of epidemics, in which different controls
are compared, do not make full use of the trajectory of the epidemic. Most methods
use the information provided by the model parameters, which may capture only partial
information on the epidemic trajectory, so that, for example, the same control strategy
may lead to different outcomes when the experiment is repeated. Moreover, with only
partial information, more simulated realisations may be needed when
comparing two different controls. We introduce a statistical technique that makes full
use of the available information in estimating the effect of competing control strategies
on real-time epidemic outbreaks. The key to this approach lies in identifying a suitable
mechanism to couple epidemics, which could be unaffected by controls. To that end,
we use the Sellke construction as a latent process to link epidemics with different
control strategies.
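The coupling idea can be illustrated with a minimal sketch for a non-spatial Markov SIR model. It is an illustration under stated assumptions, not the thesis's implementation: the function names are hypothetical, a control is assumed to act only by lowering the transmission rate `beta`, and only the final epidemic size is compared. Each susceptible carries an Exp(1) resistance threshold and each individual an infectious period; both are drawn once and reused under every control, so any difference in outcome is due to the control alone.

```python
import random


def sellke_final_size(thresholds, periods, beta, n_pop):
    """Number of initial susceptibles ever infected (Sellke construction).

    thresholds: Exp(1) resistance thresholds, one per initial susceptible.
    periods: infectious periods; periods[0] belongs to the single initial
             infective, periods[k] to the k-th individual infected later.
    """
    ordered = sorted(thresholds)
    # Total infection pressure each susceptible is exposed to so far.
    pressure = beta / n_pop * periods[0]
    infected = 0
    # The next susceptible is infected if its threshold is within the pressure.
    while infected < len(ordered) and ordered[infected] <= pressure:
        infected += 1
        pressure += beta / n_pop * periods[infected]
    return infected


def coupled_difference(n_sus, beta_base, beta_control, gamma, rng):
    """Difference in final size between two strategies on shared latent draws."""
    # Latent quantities that the (assumed) control does not affect.
    thresholds = [rng.expovariate(1.0) for _ in range(n_sus)]
    periods = [rng.expovariate(gamma) for _ in range(n_sus + 1)]
    n_pop = n_sus + 1
    base = sellke_final_size(thresholds, periods, beta_base, n_pop)
    controlled = sellke_final_size(thresholds, periods, beta_control, n_pop)
    return base - controlled


if __name__ == "__main__":
    rng = random.Random(1)
    diffs = [coupled_difference(100, 2.0, 1.2, 1.0, rng) for _ in range(500)]
    print("mean reduction in final size:", sum(diffs) / len(diffs))
```

Because both runs share the same thresholds and infectious periods, lowering `beta` can never increase the final size in this sketch, which is the kind of positive correlation between outcomes that the coupling is meant to induce.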
The method is initially applied to non-spatial processes including SIR and SIS
models, assuming that there are no observation data available, before moving on to
more complex models that explicitly represent the spatial nature of the epidemic
spread. In the latter case, the analysis is conditioned on some observed data and
inference on the model parameters is performed in a Bayesian framework using
Markov chain Monte Carlo (MCMC) techniques coupled with data augmentation
methods. The methodology is applied to various simulated data sets and to citrus
canker data from Florida. Results suggest that the approach leads to highly positively
correlated outcomes of different controls, thus reducing the variability between the
effect of different control strategies, hence providing a more efficient estimator of their
expected differences. Therefore, a reduction of the number of realisations required to compare competing strategies in terms of their expected outcomes is obtained.
The main purpose of the final part of this thesis is to develop a novel approach
to modelling the spread of Peste des Petits Ruminants (PPR) within a given country
and to understand the risk of subsequent spread to other countries. We are interested
in constructing models that can be fitted using information on the occurrence
of outbreaks as the information on the susceptible population is not available, and use
these models to estimate the speed of spatial spread of the virus. However, there was
little prior modelling on which the models developed here could be built. We start
by first establishing a spatio-temporal stochastic formulation for the spread of PPR.
This modelling is then used to estimate spatial transmission and speed of spread. To
account for the uncertainty arising from the lack of information on the susceptible population, we
apply ideas from Bayesian modelling and data augmentation by treating the transmission
network as a missing quantity. Lastly, we establish a network model to address
questions regarding the risk of spread in the large-scale network of countries and
introduce the notion of 'first-passage time' using techniques from graph theory and
operational research such as the Bellman-Ford algorithm. The methodology is first
applied to PPR data from Tunisia and on simulated data. We also use simulated
models to investigate the dynamics of spread through a network of countries.
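As a rough illustration of how the Bellman-Ford algorithm could yield a first-passage time on a network of countries, the sketch below computes the smallest cumulative edge weight from a source node to every other node. How the thesis defines edge weights and first-passage time is not stated in the abstract; here each weight is assumed to be a hypothetical expected transit time between two countries, and the function name and example graph are invented for illustration.

```python
def first_passage_times(n_nodes, edges, source):
    """Bellman-Ford shortest cumulative weight from `source` to every node.

    edges: list of (from_node, to_node, weight) tuples; weights here stand
           for hypothetical expected transit times between countries.
    Returns a list where unreachable nodes keep the value float("inf").
    """
    dist = [float("inf")] * n_nodes
    dist[source] = 0.0
    # At most n_nodes - 1 rounds of relaxation are needed.
    for _ in range(n_nodes - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break  # converged early
    # One more pass detects a negative-weight cycle reachable from source.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("graph contains a negative-weight cycle")
    return dist


if __name__ == "__main__":
    # Hypothetical four-country network; node 0 is where the outbreak starts.
    edges = [(0, 1, 4.0), (0, 2, 1.0), (2, 1, 2.0), (1, 3, 5.0)]
    print(first_passage_times(4, edges, source=0))  # [0.0, 3.0, 1.0, 8.0]
```

With non-negative transit times Dijkstra's algorithm would also work; Bellman-Ford is shown because it is the algorithm the abstract names, and it additionally tolerates negative weights, for example log-probability scores.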
Efficient similarity search in high-dimensional data spaces
Similarity search in high-dimensional data spaces is a popular paradigm for many modern database applications, such as content-based image retrieval, time series analysis in financial and marketing databases, and data mining. Objects are represented as high-dimensional points or vectors based on their important features. Object similarity is then measured by the distance between feature vectors and similarity search is implemented via range queries or k-Nearest Neighbor (k-NN) queries.
Implementing k-NN queries via a sequential scan of large tables of feature vectors is computationally expensive. Building multi-dimensional indexes on the feature vectors for k-NN search also tends to be unsatisfactory when the dimensionality is high. This is due to the poor index performance caused by the dimensionality curse.
Dimensionality reduction using the Singular Value Decomposition method is the approach adopted in this study to deal with high-dimensional data. Because data distribution tends to be heterogeneous in many real-world datasets, dimensionality reduction on the entire dataset may cause a significant loss of information. A more efficient representation is sought by clustering the data into homogeneous subsets of points and applying dimensionality reduction to each cluster separately, i.e., utilizing local rather than global dimensionality reduction.
The thesis deals with the improvement of the efficiency of query processing associated with local dimensionality reduction methods, such as the Clustering and Singular Value Decomposition (CSVD) and the Local Dimensionality Reduction (LDR) methods. Variations in the implementation of CSVD are considered and the two methods are compared from the viewpoint of the compression ratio, CPU time, and retrieval efficiency.
An exact k-NN algorithm is presented for local dimensionality reduction methods by extending an existing multi-step k-NN search algorithm, which is designed for global dimensionality reduction. Experimental results show that the new method requires less CPU time than the approximate method originally proposed for CSVD at a comparable level of accuracy.
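To make the multi-step idea concrete, here is a minimal sketch of a filter-and-refine exact k-NN search for the global case, assuming NumPy is available. It is a generic version of the technique, not the thesis's algorithm or its extension to local methods, and the function names are hypothetical. It relies on the fact that projecting onto an orthonormal SVD basis can only shrink Euclidean distances, so distances in the reduced space are lower bounds on the true distances.

```python
import numpy as np


def fit_svd(data, n_dims):
    """Return the data mean and an orthonormal basis of the top n_dims directions."""
    mean = data.mean(axis=0)
    _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
    return mean, vt[:n_dims].T  # basis has orthonormal columns


def exact_knn(data, reduced, mean, basis, query, k):
    """Exact k-NN via filtering in the reduced space, then refining."""
    q_red = (query - mean) @ basis
    # Reduced-space distances never exceed the true distances.
    low = np.linalg.norm(reduced - q_red, axis=1)
    # Step 1: k nearest points in the reduced space.
    first = np.argsort(low)[:k]
    # Step 2: their true distances give a safe search radius.
    radius = np.linalg.norm(data[first] - query, axis=1).max()
    # Step 3: any true neighbour must have a lower bound within the radius
    # (small tolerance guards against floating-point rounding).
    candidates = np.flatnonzero(low <= radius * (1 + 1e-9))
    # Step 4: refine candidates with true distances.
    true = np.linalg.norm(data[candidates] - query, axis=1)
    order = np.argsort(true)[:k]
    return candidates[order], true[order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=(1000, 32))
    mean, basis = fit_svd(data, n_dims=8)
    reduced = (data - mean) @ basis
    idx, dist = exact_knn(data, reduced, mean, basis, rng.normal(size=32), k=5)
    print(idx, dist)
```

Only the candidates that survive the filter need their full-dimensional distance computed, which is where the saving over a sequential scan would come from; how much is saved depends on how tight the lower bound is for the data at hand.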
Optimal subspace dimensionality reduction has the intent of minimizing total query cost. The problem is complicated in that each cluster can retain a different number of dimensions. A hybrid method is presented, combining the best features of the CSVD and LDR methods, to find optimal subspace dimensionalities for clusters generated by local dimensionality reduction methods. The experiments show that the proposed method works well for both real-world datasets and synthetic datasets.