Scalable learning for geostatistics and speaker recognition
With improved data acquisition methods, the amount of data being collected has increased severalfold. One of the objectives in data collection is to learn useful underlying patterns. To work with data at this scale, methods not only need to be effective on the underlying data but also have to scale to larger data collections. This thesis focuses on developing scalable and effective methods targeted at two domains in particular: geostatistics and speaker recognition.
Initially, we focus on kernel-based learning methods and develop a GPU-based parallel framework for this class of problems. An improved numerical algorithm that uses the GPU parallelization to further enhance the computational performance of kernel regression is proposed. These methods are then demonstrated on problems arising in geostatistics and speaker recognition.
In geostatistics, data is often collected at scattered locations, and factors like instrument malfunction lead to missing observations. Applications often require the ability to interpolate this scattered spatiotemporal data onto a regular grid, continuously over time. This problem can be formulated as a regression problem, and one of the most popular geostatistical interpolation techniques, kriging, is analogous to a standard kernel method: Gaussian process regression. Kriging is computationally expensive and needs major modifications and accelerations to be used practically. The GPU framework developed for kernel methods is extended to kriging, and the GPU's texture memory is further exploited for enhanced computational performance.
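The kriging-as-Gaussian-process-regression equivalence mentioned above can be sketched in a few lines. This is a minimal illustration only, not the thesis's GPU implementation; the squared-exponential kernel, length scale, and noise level are illustrative assumptions.

```python
# Minimal sketch: simple kriging as Gaussian process regression with a
# squared-exponential kernel. Hyperparameters here are assumed, not
# taken from the thesis.
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared-exponential covariance between two sets of locations.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def krige(X_obs, y_obs, X_grid, length_scale=1.0, noise=1e-4):
    # Predictive mean at grid points: K_*  (K + noise*I)^{-1} y.
    K = rbf_kernel(X_obs, X_obs, length_scale) + noise * np.eye(len(X_obs))
    K_star = rbf_kernel(X_grid, X_obs, length_scale)
    weights = np.linalg.solve(K, y_obs)
    return K_star @ weights

# Interpolate scattered 2-D observations onto a small regular grid.
rng = np.random.default_rng(0)
X_obs = rng.uniform(0, 1, size=(50, 2))
y_obs = np.sin(3 * X_obs[:, 0]) + np.cos(3 * X_obs[:, 1])
gx, gy = np.meshgrid(np.linspace(0, 1, 10), np.linspace(0, 1, 10))
X_grid = np.column_stack([gx.ravel(), gy.ravel()])
y_grid = krige(X_obs, y_obs, X_grid, length_scale=0.3)
print(y_grid.shape)  # (100,)
```

The `np.linalg.solve` call is the O(n^3) bottleneck that motivates the GPU acceleration discussed in the thesis.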
Speaker recognition deals with the task of verifying a person's identity based on samples of his/her speech ("utterances"). This thesis focuses on the text-independent framework, and three new recognition approaches were developed for this problem. We first propose a kernelized Rényi-distance-based similarity scoring for speaker recognition. While its performance is promising, it does not generalize well for limited training data and therefore does not compare well to state-of-the-art recognition systems. Such systems model each speaker as a mixture of Gaussians (GMM) and compensate for variability in the speech data (termed "nuisance") due to the message, the channel, noise, and reverberation. We then propose a novel discriminative framework using a latent variable technique, partial least squares (PLS), for improved recognition. The kernelized version of this algorithm yields a state-of-the-art speaker ID system, with results competitive with the best systems reported in NIST's 2010 Speaker Recognition Evaluation.
Reservoir characterization using intelligent seismic inversion
Integrating different types of data having different scales is the major challenge in reservoir characterization studies. Seismic data is among those data types; it is usually used by geoscientists for structural mapping of the subsurface and for interpreting the reservoir's facies distribution. It has long been an aim of geoscientists to incorporate seismic data into high-resolution reservoir description through a process called seismic inversion. In this study, an intelligent seismic inversion methodology is presented to achieve a desirable correlation between relatively low-frequency seismic signals and the much higher-frequency wireline-log data. Vertical seismic profiling (VSP) is used as an intermediate step between the well logs and the surface seismic. A generalized regression neural network (GRNN) is used to build two correlation models: (1) between surface seismic and VSP, and (2) between VSP and well logs, using both synthetic seismic data and real data taken from the Buffalo Valley Field.
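A GRNN reduces to Nadaraya-Watson kernel regression: the prediction is a Gaussian-weighted average of training targets. The sketch below shows that structure only; the smoothing parameter and the synthetic stand-in for seismic attributes are assumptions, not values from this study.

```python
# Minimal GRNN (Nadaraya-Watson kernel regression) sketch, illustrating
# the form of a correlation model such as VSP -> well log. Data and
# sigma are invented for the example.
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma=0.5):
    # Gaussian-weighted average of training targets.
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(-1)
    w = np.exp(-d2 / (2.0 * sigma**2))
    return (w @ y_train) / w.sum(axis=1)

rng = np.random.default_rng(1)
X_train = rng.normal(size=(200, 3))   # stand-in windowed seismic attributes
y_train = X_train[:, 0] ** 2 + 0.1 * rng.normal(size=200)
X_query = rng.normal(size=(20, 3))
pred = grnn_predict(X_train, y_train, X_query, sigma=0.7)
print(pred.shape)  # (20,)
```

Because a GRNN has no iterative training, only the single smoothing parameter sigma needs tuning, which is one reason it suits correlation modelling between paired traces.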
Sharpening land use maps and predicting the trends of land use change using high resolution airborne image: A geostatistical approach
High-quality land use/land cover (LULC) data with fine spatial resolution and frequent temporal coverage are indispensable for revealing detailed information about the Earth's surface, characterizing the LULC of an area, predicting plausible land use changes, and assessing the viability and impacts of development plans. While airborne imagery has high spatial resolution, it provides only limited temporal coverage. LULC data from historical remote sensing images, such as those from Landsat, offer frequent coverage over a long temporal period, but their spatial resolutions are low.
This paper presents a spatio-temporal cokriging method to sharpen LULC data and predict trends of land use change. A set of time-series coarse-resolution LULC maps and one frame of high-spatial-resolution airborne imagery of the Upper Mill Creek Watershed were used to illustrate the utility of our method. By explicitly describing the spatio-temporal dependence within and between the different datasets, modelling the Anderson classification codes using spatial, temporal, and cross-covariance structures, and transforming the integer Anderson classification codes to class probabilities, our method was able to resolve the differences between multi-source spatio-temporal LULC data, generate maps with sharpened and detailed land features, characterize spatial and temporal LULC changes, reveal the trend of LULC change, and create a quality dataset invaluable for monitoring, assessing, and modelling LULC changes.
Methods in machine learning for probabilistic modelling of environment, with applications in meteorology and geology
Earth scientists increasingly deal with ‘big data’. Where once we may have struggled to obtain a handful of relevant measurements, we now often have data being collected from multiple sources: on the ground, in the air, and from space. These observations are accumulating at a rate that far outpaces our ability to make sense of them using traditional methods with limited scalability (e.g., mental modelling, or trial-and-error improvement of process-based models). The revolution in machine learning offers a new paradigm for modelling the environment: rather than tweaking every aspect of models developed from the top down, based largely on prior knowledge, we can instead set up more abstract machine learning systems that ‘do the tweaking for us’, learning models from the bottom up that are optimal in terms of how well they agree with our (rapidly growing number of) observations of reality, while still being guided by our prior beliefs.
In this thesis, with the help of spatial, temporal, and spatio-temporal examples in meteorology and geology, I present methods for probabilistic modelling of environmental variables using machine learning, and explore the considerations involved in developing and adopting these technologies, as well as the potential benefits they stand to bring, which include improved knowledge acquisition and decision-making. In each application, the common theme is that we would like to learn predictive distributions for the variables of interest that are well-calibrated and as sharp as possible (i.e., that provide answers that are as precise as possible while remaining honest about their uncertainty). Achieving this requires the adoption of statistical approaches, but the volume and complexity of the data available mean that scalability is an important factor: we can only realise the value of available data if it can be successfully incorporated into our models. Engineering and Physical Sciences Research Council (EPSRC)
Big Data for Social Sciences: Measuring patterns of human behavior through large-scale mobile phone data
Through seven publications this dissertation shows how anonymized mobile
phone data can contribute to the social good and provide insights into human
behaviour on a large scale. The size of the datasets analysed ranges from 500
million to 300 billion phone records, covering millions of people. The key
contributions are two-fold:
1. Big Data for Social Good: Through prediction algorithms, the results
show how mobile phone data can be used to predict important
socio-economic indicators, such as income, illiteracy, and poverty in
developing countries. Such knowledge can be used to identify where
vulnerable groups in society are, to mitigate economic shocks, and to
monitor poverty rates over time. Further, the dissertation demonstrates
how mobile phone data can be used to better understand human behaviour
during large shocks in society, exemplified by an analysis of data from
the terror attack in Norway and a natural disaster on the south coast of
Bangladesh. This work leads to an increased understanding of how
information spreads and how millions of people move around. The
intention is to identify displaced people faster, more cheaply, and more
accurately than existing survey-based methods.
2. Big Data for efficient marketing: Finally, the dissertation offers
insight into how anonymised mobile phone data can be used to map out
large social networks, covering millions of people, to understand how
products spread inside these networks. Results show that by including
social patterns and machine learning techniques in a large-scale
marketing experiment in Asia, the adoption rate increased 13-fold
compared to the approach used by experienced marketers. A data-driven
and scientific approach to marketing, through more tailored campaigns,
leads to fewer irrelevant offers for customers and better cost
efficiency for the companies.
Kriging: applying geostatistical techniques to the genetic study of complex diseases
Complex diseases often display geographic distribution patterns.
Therefore, the integration of genetic and environmental factors using
geographic information systems (GIS) and statistical analyses that
consider the spatial dimension of the data greatly assists research into
their gene-environment interactions (GxE). The objectives of the present
work were to assess the application of a geostatistical interpolation
technique (kriging) in the study of complex diseases with a distinctly
heterogeneous geographic distribution, and to test its performance as an
alternative to conventional genetic imputation methods. Using multiple
sclerosis as a case study, kriging proved to be a flexible and valuable
tool for integrating information from various sources and at different
spatial resolutions into a model that allowed its heterogeneous
geographic distribution in Europe to be easily visualized and the
intertwined interactions between several known genetic and environmental
risk factors to be explored. Even though the performance of kriging did
not surpass the results obtained with current imputation techniques,
this pilot study revealed a worse performance of the latter for rare
variants in chromosomal regions with a low density of markers.
Application of machine learning and deep neural networks for spatial prediction of groundwater nitrate concentration to improve land use management practices
The prediction of groundwater nitrate concentration's response to geo-environmental and human-influenced factors is essential to better restore groundwater quality and improve land use management practices. In this paper, we regionalize groundwater nitrate concentration using different machine learning methods (random forest (RF), unimodal 2D and 3D convolutional neural networks (CNNs), and multi-stream early- and late-fusion 2D-CNNs) so that the nitrate situation in unobserved areas can be predicted. CNNs take into account not only the nitrate values of the grid cells of the observation wells but also the values around them. This has the added benefit of allowing them to learn directly about the influence of the surroundings. The predictive performance of the models was tested on a dataset from a pilot region in Germany, and the results show that, in general, all the machine learning models, after a Bayesian-optimization hyperparameter search and training, achieve good spatial predictive performance compared to previous studies based on kriging and numerical models. Based on the mean absolute error (MAE), the random forest model and the 2D-CNN late-fusion model performed best, with MAEs (STD) of 9.55 (0.367) mg/l, R2 = 0.43, and 10.32 (0.27) mg/l, R2 = 0.27, respectively. The 3D-CNN, with an MAE (STD) of 11.66 (0.21) mg/l and the largest resource consumption, is the worst-performing model. Feature importance learned from the models was used in conjunction with partial dependency analysis of the most important features to gain greater insight into the major factors explaining the spatial variability of nitrate. Large uncertainties in nitrate prediction have been shown in previous studies. Therefore, the models were extended to quantify uncertainty using prediction intervals (PIs) derived from bootstrapping. Knowledge of uncertainty helps the water manager reduce risk and plan more reliably.
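The bootstrap-based prediction intervals mentioned above can be sketched generically: refit the model on resampled data and take percentiles of the resulting predictions. The synthetic features, ensemble size, and 90% interval level below are illustrative assumptions, not the paper's configuration.

```python
# Hedged sketch of bootstrap prediction intervals around a
# random-forest regressor. Data is synthetic; the target is a toy
# stand-in for nitrate concentration.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
X = rng.uniform(0, 1, size=(300, 4))      # stand-in geo-environmental features
y = 20 * X[:, 0] + 5 * X[:, 1] + rng.normal(0, 2, size=300)

X_new = rng.uniform(0, 1, size=(10, 4))
preds = []
for b in range(25):                        # bootstrap resamples
    idx = rng.integers(0, len(X), size=len(X))
    rf = RandomForestRegressor(n_estimators=50, random_state=b)
    rf.fit(X[idx], y[idx])
    preds.append(rf.predict(X_new))
preds = np.array(preds)                    # shape (25, 10)

# 90% prediction intervals from the bootstrap distribution.
lower, upper = np.percentile(preds, [5, 95], axis=0)
print(lower.shape, upper.shape)
```

Note this captures only the model's sampling variability; a full PI would also need a term for the irreducible observation noise.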