
    Scalable learning for geostatistics and speaker recognition

    With improved data acquisition methods, the amount of data being collected has increased severalfold. One of the objectives in data collection is to learn useful underlying patterns. To work with data at this scale, methods must not only be effective on the underlying data but also scalable to larger collections. This thesis focuses on developing scalable and effective methods for two domains in particular: geostatistics and speaker recognition. We first focus on kernel-based learning methods and develop a GPU-based parallel framework for this class of problems, and we propose an improved numerical algorithm that uses the GPU parallelization to further enhance the computational performance of kernel regression. These methods are then demonstrated on problems arising in geostatistics and speaker recognition. In geostatistics, data is often collected at scattered locations, and factors such as instrument malfunction lead to missing observations. Applications often require the ability to interpolate this scattered spatiotemporal data onto a regular grid, continuously over time. This can be formulated as a regression problem, and one of the most popular geostatistical interpolation techniques, kriging, is analogous to a standard kernel method: Gaussian process regression. Kriging is computationally expensive and needs major modifications and acceleration to be practical. The GPU framework developed for kernel methods is extended to kriging, and the GPU's texture memory is exploited for further computational gains. Speaker recognition is the task of verifying a person's identity from samples of his or her speech ("utterances"). This thesis focuses on the text-independent setting, for which three new recognition frameworks were developed. First, we propose a kernelized Rényi-distance-based similarity scoring for speaker recognition. While its performance is promising, it does not generalize well with limited training data and therefore does not compare well to state-of-the-art recognition systems, which model each speaker as a Gaussian mixture model (GMM) and compensate for the variability (termed "nuisance") in the speech data due to the message, channel, noise, and reverberation. We then propose a novel discriminative framework using a latent variable technique, partial least squares (PLS), for improved recognition. Finally, the kernelized version of this algorithm is used to build a state-of-the-art speaker ID system that shows results competitive with the best systems reported in NIST's 2010 Speaker Recognition Evaluation.
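    The analogy between kriging and Gaussian process regression noted in this abstract can be made concrete. Below is a minimal NumPy sketch of Gaussian process (kernel) regression, the computation the thesis accelerates on GPUs; the Gaussian kernel, length scale, and noise level are illustrative assumptions, not the thesis's actual configuration.

```python
import numpy as np

def gaussian_kernel(A, B, length_scale=1.0):
    """Squared-exponential (Gaussian) kernel matrix between row-vector sets A and B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-d2 / (2 * length_scale**2))

def gp_regress(X_train, y_train, X_test, noise=1e-2):
    """Posterior mean and variance of GP regression (the kernel analogue of kriging)."""
    K = gaussian_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    K_star = gaussian_kernel(X_test, X_train)
    # The Cholesky solve is the O(n^3) bottleneck that a GPU framework parallelizes.
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_star @ alpha
    v = np.linalg.solve(L, K_star.T)
    var = gaussian_kernel(X_test, X_test).diagonal() - np.sum(v**2, axis=0)
    return mean, var

# Toy usage: interpolate noisy samples of a 1-D function onto a fine grid.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, (40, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)
mean, var = gp_regress(X, y, np.linspace(0, 5, 100)[:, None])
```

    The cubic cost of that solve is what makes naive kriging impractical at scale, motivating the GPU framework and improved numerics described above.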

    Reservoir characterization using intelligent seismic inversion

    Integrating different types of data with different scales is the major challenge in reservoir characterization studies. Seismic data, which geoscientists usually use for structural mapping of the subsurface and for interpreting the reservoir's facies distribution, is one such data type, and it has long been a common aim of geoscientists to incorporate seismic data into high-resolution reservoir description through a process called seismic inversion. In this study, an intelligent seismic inversion methodology is presented to achieve a desirable correlation between relatively low-frequency seismic signals and the much higher-frequency wireline-log data. The vertical seismic profile (VSP) is used as an intermediate step between the well logs and the surface seismic. A generalized regression neural network (GRNN) is used to build two correlation models, (1) between surface seismic and VSP and (2) between VSP and well logs, using both synthetic seismic data and real data taken from the Buffalo Valley Field.
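    A GRNN is essentially a normalized radial-basis (kernel) regressor, so correlation models like those described above are straightforward to prototype. A minimal sketch follows; the attribute dimensions, bandwidth sigma, and the surface-seismic/VSP variable names are hypothetical placeholders, not the study's actual inputs.

```python
import numpy as np

def grnn_predict(X_train, y_train, X_query, sigma=0.5):
    """Generalized regression neural network: a normalized radial-basis estimate,
    y_hat(x) = sum_i w_i(x) * y_i with Gaussian pattern-layer weights w_i."""
    d2 = np.sum((X_query[:, None, :] - X_train[None, :, :]) ** 2, axis=2)
    w = np.exp(-d2 / (2 * sigma**2))                             # pattern layer
    return (w @ y_train) / np.clip(w.sum(axis=1), 1e-12, None)   # summation layer

# Hypothetical use: learn a mapping from surface-seismic attributes to a VSP response.
rng = np.random.default_rng(1)
X_seis = rng.standard_normal((200, 3))                           # 3 attributes per sample
y_vsp = X_seis @ np.array([0.5, -0.2, 0.1]) + 0.05 * rng.standard_normal(200)
y_hat = grnn_predict(X_seis, y_vsp, rng.standard_normal((10, 3)))
```

    A single bandwidth sigma is the only free parameter, which is one reason GRNNs suit small calibration datasets.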

    Sharpening land use maps and predicting the trends of land use change using high resolution airborne image: A geostatistical approach

    High-quality land use/land cover (LULC) data with fine spatial resolution and frequent temporal coverage are indispensable for revealing detailed information about the Earth's surface, characterizing the LULC of an area, predicting plausible land use changes, and assessing the viability and impacts of development plans. While airborne imagery has high spatial resolution, it provides only limited temporal coverage. LULC data from historical remote sensing images, such as those from Landsat, provide frequent coverage over a long period, but their spatial resolutions are low. This paper presents a spatio-temporal cokriging method to sharpen LULC data and predict trends of land use change. A set of time-series coarse-resolution LULC maps and one frame of high-spatial-resolution airborne imagery of the Upper Mill Creek Watershed were used to illustrate the utility of our method. By explicitly describing the spatio-temporal dependence within and between the different datasets, modelling the Anderson classification codes using spatial, temporal, and cross-covariance structures, and transforming the Anderson integer classification codes to class probabilities, our method was able to resolve the differences between multi-source spatio-temporal LULC data, generate maps with sharpened and detailed land features, characterize spatial and temporal LULC changes, reveal the trend of LULC change, and create a quality dataset invaluable for monitoring, assessing, and modelling LULC changes.
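    As a rough illustration of the class-probability idea above, the sketch below one-hot encodes integer class codes (so that interpolation yields per-class probabilities rather than averages of meaningless code numbers) and applies simple-kriging weights from an assumed separable space-time covariance. This is a deliberate simplification: the paper fits spatial, temporal, and cross-covariance structures for cokriging, whereas the covariance model and parameters here are placeholders.

```python
import numpy as np

def indicator_transform(codes, classes):
    """One-hot (indicator) encoding of integer classification codes."""
    return (np.asarray(codes)[:, None] == np.asarray(classes)[None, :]).astype(float)

def st_covariance(dx, dt, range_s=500.0, range_t=5.0, sill=1.0):
    """Separable exponential space-time covariance (illustrative model only)."""
    return sill * np.exp(-dx / range_s) * np.exp(-dt / range_t)

def krige_class_probs(xy, t, codes, classes, xy0, t0):
    """Simple-kriging weights at (xy0, t0), applied to each class indicator."""
    dxx = np.linalg.norm(xy[:, None] - xy[None, :], axis=2)   # pairwise distances
    dtt = np.abs(t[:, None] - t[None, :])                     # pairwise time lags
    C = st_covariance(dxx, dtt) + 1e-8 * np.eye(len(t))       # jitter for stability
    c0 = st_covariance(np.linalg.norm(xy - xy0, axis=1), np.abs(t - t0))
    w = np.linalg.solve(C, c0)
    p = np.clip(w @ indicator_transform(codes, classes), 0, None)
    return p / max(p.sum(), 1e-12)   # renormalized class-probability vector
```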

    Methods in machine learning for probabilistic modelling of environment, with applications in meteorology and geology

    Earth scientists increasingly deal with ‘big data’. Where once we may have struggled to obtain a handful of relevant measurements, we now often have data being collected from multiple sources: on the ground, in the air, and from space. These observations are accumulating at a rate that far outpaces our ability to make sense of them using traditional methods with limited scalability (e.g., mental modelling, or trial-and-error improvement of process-based models). The revolution in machine learning offers a new paradigm for modelling the environment. Rather than focusing on tweaking every aspect of models developed from the top down, based largely on prior knowledge, we can now set up more abstract machine learning systems that ‘do the tweaking for us’, learning models from the bottom up that are optimal in how well they agree with our (rapidly growing) observations of reality while still being guided by our prior beliefs. In this thesis, with the help of spatial, temporal, and spatio-temporal examples in meteorology and geology, I present methods for probabilistic modelling of environmental variables using machine learning, and I explore the considerations involved in developing and adopting these technologies, as well as the potential benefits they stand to bring, which include improved knowledge acquisition and decision-making. In each application, the common theme is that we would like to learn predictive distributions for the variables of interest that are well calibrated and as sharp as possible (i.e., that provide answers as precise as possible while remaining honest about their uncertainty). Achieving this requires statistical approaches, but the volume and complexity of the available data mean that scalability is an important factor: we can only realise the value of the data if they can be successfully incorporated into our models. Funded by the Engineering and Physical Sciences Research Council (EPSRC).
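    The calibration-and-sharpness goal stated here can be checked with standard diagnostics. A small sketch, assuming Gaussian predictive distributions purely for illustration: probability integral transform (PIT) values should look uniform when forecasts are calibrated, and among calibrated forecasts the narrower (sharper) intervals are preferred.

```python
import numpy as np
from scipy import stats

def pit_values(y_obs, pred_mean, pred_std):
    """PIT values; for a calibrated Gaussian forecast these are uniform on [0, 1]."""
    return stats.norm.cdf(y_obs, loc=pred_mean, scale=pred_std)

def coverage_and_sharpness(y_obs, pred_mean, pred_std, level=0.9):
    """Empirical coverage of the central `level` interval (should match `level`)
    and its mean width (smaller width = sharper forecasts)."""
    z = stats.norm.ppf(0.5 + level / 2)
    lo, hi = pred_mean - z * pred_std, pred_mean + z * pred_std
    return np.mean((y_obs >= lo) & (y_obs <= hi)), np.mean(hi - lo)

# Toy check on synthetic, perfectly calibrated forecasts.
rng = np.random.default_rng(2)
mu, sd = rng.standard_normal(1000), np.full(1000, 1.0)
y = mu + sd * rng.standard_normal(1000)
print(coverage_and_sharpness(y, mu, sd))   # coverage should be close to 0.9
```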

    Big Data for Social Sciences: Measuring patterns of human behavior through large-scale mobile phone data

    Through seven publications, this dissertation shows how anonymized mobile phone data can contribute to the social good and provide insights into human behaviour on a large scale. The size of the datasets analysed ranges from 500 million to 300 billion phone records, covering millions of people. The key contributions are two-fold. 1. Big Data for social good: through prediction algorithms, the results show how mobile phone data can be used to predict important socio-economic indicators, such as income, illiteracy, and poverty in developing countries. Such knowledge can be used to identify where vulnerable groups in society are and to reduce economic shocks, and it is a critical component for monitoring poverty rates over time. The dissertation further demonstrates how mobile phone data can be used to better understand human behaviour during large shocks in society, exemplified by an analysis of data from the terror attack in Norway and a natural disaster on the south coast of Bangladesh. This work leads to an increased understanding of how information spreads and how millions of people move around, with the intention of identifying displaced people faster, more cheaply, and more accurately than existing survey-based methods. 2. Big Data for efficient marketing: finally, the dissertation offers insight into how anonymised mobile phone data can be used to map out large social networks covering millions of people and to understand how products spread inside these networks. Results show that by including social patterns and machine learning techniques in a large-scale marketing experiment in Asia, the adoption rate increased 13-fold compared with the approach used by experienced marketers. A data-driven and scientific approach to marketing, through more tailored campaigns, means fewer irrelevant offers for customers and better cost efficiency for companies.

    Kriging: applying geostatistical techniques to the genetic study of complex diseases

    Complex diseases often display geographic distribution patterns. The integration of genetic and environmental factors using geographic information systems (GIS) and statistical analyses that account for the spatial dimension of the data can therefore greatly assist research into their gene-environment interactions (GxE). The objectives of the present work were to assess the application of a geostatistical interpolation technique (kriging) to the study of complex diseases with a distinctly heterogeneous geographic distribution, and to test its performance as an alternative to conventional genetic imputation methods. Using multiple sclerosis as a case study, kriging proved to be a flexible and valuable tool for integrating information from various sources and at different spatial resolutions into a model that made it easy to visualize the disease's heterogeneous geographic distribution in Europe and to explore the intertwined interactions between several known genetic and environmental risk factors. Even though the performance of kriging did not surpass the results obtained with current imputation techniques, this pilot study revealed that the latter perform worse for rare variants in chromosomal regions with a low density of markers.
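    For readers unfamiliar with the technique, ordinary kriging reduces to a small linear system: weights are chosen to minimize estimation variance subject to summing to one, via a Lagrange multiplier. The sketch below uses a spherical variogram with placeholder parameters and is a generic illustration, not the GIS pipeline used in the study.

```python
import numpy as np

def spherical_variogram(h, nugget=0.0, sill=1.0, a=300.0):
    """Spherical variogram gamma(h) with range a (parameters are illustrative)."""
    g = nugget + (sill - nugget) * (1.5 * h / a - 0.5 * (h / a) ** 3)
    return np.where(h < a, g, sill)

def ordinary_krige(coords, values, target):
    """Ordinary-kriging estimate at one target point (e.g., of a risk-factor surface)."""
    n = len(values)
    h = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
    A = np.ones((n + 1, n + 1))          # kriging system with unbiasedness row
    A[:n, :n] = spherical_variogram(h)
    A[n, n] = 0.0                        # Lagrange-multiplier corner
    b = np.append(spherical_variogram(np.linalg.norm(coords - target, axis=1)), 1.0)
    w = np.linalg.solve(A, b)
    return w[:n] @ values                # weights w[:n] sum to one by construction
```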

    Application of machine learning and deep neural networks for spatial prediction of groundwater nitrate concentration to improve land use management practices

    Predicting how groundwater nitrate concentration responds to geo-environmental and human-influenced factors is essential for restoring groundwater quality and improving land use management practices. In this paper, we regionalize groundwater nitrate concentration using different machine learning methods (random forest (RF), unimodal 2D and 3D convolutional neural networks (CNNs), and multi-stream early- and late-fusion 2D-CNNs) so that the nitrate situation in unobserved areas can be predicted. CNNs take into account not only the nitrate values at the grid cells of the observation wells but also the values around them, with the added benefit of learning directly about the influence of the surroundings. The predictive performance of the models was tested on a dataset from a pilot region in Germany, and the results show that, in general, all the machine learning models, after a Bayesian optimization hyperparameter search and training, achieve good spatial predictive performance compared with previous studies based on kriging and numerical models. Based on the mean absolute error (MAE), the random forest model and the late-fusion 2D-CNN performed best, with an MAE (STD) of 9.55 (0.367) mg/l, R2 = 0.43, and 10.32 (0.27) mg/l, R2 = 0.27, respectively. The 3D-CNN, with an MAE (STD) of 11.66 (0.21) mg/l and the largest resource consumption, is the worst-performing model. Feature importance learned from the models was used in conjunction with partial dependence analysis of the most important features to gain greater insight into the major factors explaining the spatial variability of nitrate. Because previous studies have shown large uncertainties in nitrate prediction, the models were extended to quantify uncertainty using prediction intervals (PIs) derived from bootstrapping. Knowledge of uncertainty helps water managers reduce risk and plan more reliably.
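    The bootstrap prediction intervals mentioned at the end are simple to sketch. Below is a percentile-bootstrap variant using scikit-learn's random forest; the hyperparameters, number of resamples, and interval level are placeholders, not the paper's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def bootstrap_prediction_interval(X, y, X_new, n_boot=100, alpha=0.1, seed=0):
    """Percentile-bootstrap PI: refit on resampled data, then take empirical
    quantiles of the predictions (a simplified stand-in for the paper's scheme)."""
    rng = np.random.default_rng(seed)
    preds = np.empty((n_boot, len(X_new)))
    for b in range(n_boot):
        idx = rng.integers(0, len(y), size=len(y))   # resample with replacement
        model = RandomForestRegressor(n_estimators=200, random_state=b)
        model.fit(X[idx], y[idx])
        preds[b] = model.predict(X_new)
    lo = np.quantile(preds, alpha / 2, axis=0)
    hi = np.quantile(preds, 1 - alpha / 2, axis=0)
    return lo, hi   # e.g., a 90% interval for nitrate concentration in mg/l
```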