59 research outputs found

    Incorporating covariance estimation uncertainty in spatial sampling design for prediction with trans-Gaussian random fields

    Recently, Spock and Pilz [38] demonstrated that the spatial sampling design problem for the Bayesian linear kriging predictor can be transformed to an equivalent experimental design problem for a linear regression model with stochastic regression coefficients and uncorrelated errors. The stochastic regression coefficients derive from the polar spectral approximation of the residual process. Thus, standard optimal convex experimental design theory can be used to calculate optimal spatial sampling designs. The design functionals considered in Spock and Pilz [38] did not take into account the fact that kriging is actually a plug-in predictor which uses the estimated covariance function. The resulting optimal designs were close to space-filling configurations, because the design criterion did not consider the uncertainty of the covariance function. In this paper we also assume that the covariance function is estimated, e.g., by restricted maximum likelihood (REML). We then develop a design criterion that fully takes account of the covariance uncertainty. The resulting designs are less regular and space-filling compared to those ignoring covariance uncertainty. The new designs, however, also require some closely spaced samples in order to improve the estimate of the covariance function. We also relax the assumption of Gaussian observations and assume that the data are transformed to Gaussianity by means of the Box-Cox transformation. The resulting prediction method is known as trans-Gaussian kriging. We apply the Smith and Zhu [37] approach to this kriging method and show that the resulting optimal designs also depend on the available data. We illustrate our results with a data set of monthly rainfall measurements from Upper Austria.
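    For reference, a standard statement of the Box-Cox transformation that trans-Gaussian kriging relies on (the usual one-parameter form; the paper's exact variant and the estimation of the transformation parameter are not reproduced here):

```latex
% One-parameter Box-Cox transformation of a positive observation z
g_\lambda(z) =
\begin{cases}
  \dfrac{z^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[4pt]
  \log z,                           & \lambda = 0.
\end{cases}
% Trans-Gaussian kriging predicts on the transformed (approximately Gaussian)
% scale and back-transforms via g_\lambda^{-1}; because the inverse is nonlinear,
% the back-transformed predictor requires a bias correction.
```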

    Data- and expert-driven variable selection for predictive models in healthcare: towards increased interpretability in underdetermined machine learning problems

    Modern data acquisition techniques in healthcare generate large collections of data from multiple sources, such as novel diagnosis and treatment methodologies. Some concrete examples are electronic healthcare record systems, genomics, and medical images. This leads to situations with often unstructured, high-dimensional, heterogeneous patient cohort data where classical statistical methods may not be sufficient for optimal utilization of the data and informed decision-making. Instead, investigating such data structures with modern machine learning techniques promises to improve the understanding of patient health issues and may provide a better platform for informed decision-making by clinicians. Key requirements for this purpose include (a) sufficiently accurate predictions and (b) model interpretability. Achieving both aspects in parallel is difficult, particularly for datasets with few patients, which are common in the healthcare domain. In such cases, machine learning models encounter mathematically underdetermined systems and may overfit easily on the training data. An important approach to overcome this issue is feature selection, i.e., determining a subset of informative features from the original set of features with respect to the target variable. While potentially raising the predictive performance, feature selection fosters model interpretability by identifying a low number of relevant model parameters, helping to better understand the underlying biological processes that lead to health issues. Interpretability requires that feature selection is stable, i.e., small changes in the dataset do not lead to changes in the selected feature set. A concept to address instability is ensemble feature selection, i.e., the process of repeating the feature selection multiple times on subsets of samples of the original dataset and aggregating the results in a meta-model. This thesis presents two approaches for ensemble feature selection, which are tailored towards high-dimensional data in healthcare: the Repeated Elastic Net Technique for feature selection (RENT) and the User-Guided Bayesian Framework for feature selection (UBayFS). While RENT is purely data-driven and builds upon elastic net regularized models, UBayFS is a general framework for ensembles with the capability to include expert knowledge in the feature selection process via prior weights and side constraints. A case study modeling the overall survival of cancer patients compares these novel feature selectors and demonstrates their potential in clinical practice. Beyond the selection of single features, UBayFS also allows for selecting whole feature groups (feature blocks) that were acquired from multiple data sources, such as those mentioned above. Importance quantification of such feature blocks plays a key role in tracing information about the target variable back to the acquisition modalities. Such information on feature block importance may lead to positive effects on the use of human, technical, and financial resources if systematically integrated into the planning of patient treatment, by excluding the acquisition of non-informative features. Since a generalization of feature importance measures to block importance is not trivial, this thesis also investigates and compares approaches for feature block importance rankings. This thesis demonstrates that high-dimensional datasets from multiple data sources in the medical domain can be successfully tackled by the presented approaches for feature selection.
Experimental evaluations demonstrate favorable properties in terms of predictive performance, stability, and interpretability of results, which carries a high potential for better data-driven decision support in clinical practice.
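    A minimal sketch of the ensemble feature selection idea described above: elastic-net models are repeatedly fitted on subsamples of the data, and features that are selected consistently are kept. This illustrates the general principle behind RENT under assumed placeholder parameters; it is not the authors' implementation (RENT additionally applies criteria on the distribution of coefficients across runs, and UBayFS layers prior feature weights and side constraints on top of such ensemble counts):

```python
# Illustrative ensemble feature selection via repeated elastic-net fits.
# Not the reference RENT implementation; thresholds and hyperparameters are placeholders.
import numpy as np
from sklearn.linear_model import ElasticNet

def ensemble_feature_selection(X, y, n_runs=100, subsample=0.8,
                               alpha=0.1, l1_ratio=0.5, freq_cutoff=0.9,
                               seed=0):
    """Return indices of features whose coefficient is non-zero in at
    least `freq_cutoff` of the elastic-net models fitted on subsamples."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_runs):
        # Draw a random subsample of the patients (rows) without replacement.
        idx = rng.choice(n_samples, size=int(subsample * n_samples), replace=False)
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10_000)
        model.fit(X[idx], y[idx])
        counts += np.abs(model.coef_) > 1e-8   # count one "selection" per run
    return np.where(counts / n_runs >= freq_cutoff)[0]
```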

    Spatial Statistical Models: an overview under the Bayesian Approach

    Spatial documentation is increasing exponentially given the availability of Big IoT Data, enabled by device miniaturization and growing data storage capacity. Bayesian spatial statistics is a useful statistical tool to determine the dependence structure and hidden patterns over space through prior knowledge and the data likelihood. Nevertheless, this modeling class is not as well explored as classification and regression machine learning models, given the latter's simplicity and their often weak assumption of (data) independence. Accordingly, this systematic review aimed to unravel the main models presented in the literature over the past 20 years and to identify gaps and research opportunities. Elements such as random fields, spatial domains, prior specification, covariance functions, and numerical approximations are discussed. The work also explores the two subclasses of spatial smoothing: global and local.
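    As background for the class of models surveyed, a generic Bayesian geostatistical formulation (a textbook specification, not taken from the review; the exponential covariance is only one common choice):

```latex
% Observations with fixed effects, a spatial random effect w(s), and nugget error
y(s_i) = x(s_i)^\top \beta + w(s_i) + \varepsilon(s_i), \qquad
\varepsilon(s_i) \overset{iid}{\sim} \mathcal{N}(0, \tau^2), \qquad i = 1, \dots, n.
% Gaussian-process prior on the spatial random field
w(\cdot) \sim \mathcal{GP}\bigl(0, C(\cdot, \cdot)\bigr), \qquad
C(s, s') = \sigma^2 \exp\!\bigl(-\lVert s - s' \rVert / \phi\bigr).
% Priors on (\beta, \sigma^2, \tau^2, \phi) complete the Bayesian specification.
```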

    Leveraging Computer Vision for Applications in Biomedicine and Geoscience

    Skin cancer is one of the most common types of cancer and is usually classified as either non-melanoma or melanoma skin cancer. Melanoma skin cancer accounts for about half of all skin cancer-related deaths. The 5-year survival rate is 99% when the cancer is detected early but drops to 25% once it becomes metastatic. In other words, the key to preventing death is early detection. Foraminifera are microscopic single-celled organisms that exist in marine environments and are classified as living a benthic or planktic lifestyle. In total, roughly 50,000 species are known to have existed, of which about 9,000 are still living today. Foraminifera are important proxies for reconstructing past ocean and climate conditions and serve as bio-indicators of anthropogenic pollution. Since the 1800s, the identification and counting of foraminifera have been performed manually, a resource-intensive process. In this dissertation, we leverage recent advances in computer vision, driven by breakthroughs in deep learning methodologies and scale-space theory, to make progress towards both early detection of melanoma skin cancer and automation of the identification and counting of microscopic foraminifera. First, we investigate the use of hyperspectral images in skin cancer detection by performing a critical review of relevant, peer-reviewed research. Second, we present a novel scale-space methodology for detecting changes in hyperspectral images. Third, we develop a deep learning model for classifying microscopic foraminifera. Finally, we present a deep learning model for instance segmentation of microscopic foraminifera. The works presented in this dissertation are valuable contributions in the fields of biomedicine and geoscience, more specifically, towards the challenges of early detection of melanoma skin cancer and automation of the identification, counting, and picking of microscopic foraminifera.

    Deep Learning for Abstraction, Control and Monitoring of Complex Cyber-Physical Systems

    Cyber-Physical Systems (CPS) consist of digital devices that interact with some physical components. Their popularity and complexity are growing exponentially, giving birth to new, previously unexplored, safety-critical application domains. As CPS permeate our daily lives, it becomes imperative to reason about their reliability. Formal methods provide rigorous techniques for verification, control and synthesis of safe and reliable CPS. However, these methods do not scale with the complexity of the system, thus their applicability to real-world problems is limited. A promising strategy is to leverage deep learning techniques to tackle the scalability issue of formal methods, transforming unfeasible problems into approximately solvable ones. The approximate models are trained over observations which are solutions of the formal problem. In this thesis, we focus on the following computationally challenging tasks: the modeling and simulation of a complex stochastic model, the design of a safe and robust control policy for a system acting in a highly uncertain environment, and the runtime verification problem under full or partial observability. Our approaches, based on deep learning, are applicable to real-world complex and safety-critical systems acting under strict real-time constraints and in the presence of a significant amount of uncertainty.

    Bayesian Quadrature with Prior Information: Modeling and Policies

    Quadrature is the problem of estimating intractable integrals. Such integrals regularly arise in engineering and the natural sciences, especially when Bayesian methods are applied; examples include model evidences, normalizing constants and marginal distributions. This dissertation explores Bayesian quadrature, a probabilistic, model-based quadrature method. Specifically, we study different ways in which Bayesian quadrature can be adapted to account for different kinds of prior information one may have about the task. We demonstrate that by taking into account prior knowledge, Bayesian quadrature can outperform commonly used numerical methods that are agnostic to prior knowledge, such as Monte Carlo based integration. We focus on two types of information that are (a) frequently available when faced with an intractable integral and (b) can be (approximately) incorporated into Bayesian quadrature:
    • Natural bounds on the possible values that the integrand can take, e.g., when the integrand is a probability density function, it must be nonnegative everywhere.
    • Knowledge about how the integral estimate will be used, i.e., for settings where quadrature is a subroutine, different downstream inference tasks can result in different priorities or desiderata for the estimate.
    These types of prior information are used to inform two aspects of the Bayesian quadrature inference routine:
    • Modeling: how the belief on the integrand can be tailored to account for the additional information.
    • Policies: where the integrand will be observed given a constrained budget of observations.
    This second aspect of Bayesian quadrature, policies for deciding where to observe the integrand, can be framed as an experimental design problem, where an agent must choose locations to evaluate a function of interest so as to maximize some notion of value. We will study the broader area of sequential experimental design, applying ideas from Bayesian decision theory to develop an efficient and nonmyopic policy for general sequential experimental design problems. We consider other sequential experimental design tasks such as Bayesian optimization and active search; in the latter, we focus on facilitating human–computer partnerships with the goal of aiding human agents engaged in data foraging through the use of active search based suggestions and an interactive visual interface. Finally, this dissertation will return to Bayesian quadrature and discuss the batch setting for experimental design, where multiple observations of the function in question are made simultaneously.
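    For readers unfamiliar with the method, the standard vanilla Bayesian quadrature estimate under a zero-mean Gaussian process prior on the integrand f with kernel k (a textbook formulation, not specific to this dissertation's models or policies):

```latex
% Target integral against a known density \pi
Z = \int f(x)\,\pi(x)\,dx, \qquad f \sim \mathcal{GP}(0, k).
% After observing y_i = f(x_i) at points x_1,\dots,x_n, the posterior over Z is Gaussian:
\mathbb{E}[Z \mid \mathbf{y}] = \mathbf{z}^\top K^{-1}\mathbf{y}, \qquad
\mathbb{V}[Z \mid \mathbf{y}] = \iint k(x, x')\,\pi(x)\,\pi(x')\,dx\,dx'
  \;-\; \mathbf{z}^\top K^{-1}\mathbf{z},
% with kernel means z_i = \int k(x, x_i)\,\pi(x)\,dx and Gram matrix K_{ij} = k(x_i, x_j).
```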

    Spatial modelling of air pollution for open smart cities

    A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management, specialization in Geographic Information Systems. Half of the world's population already lives in cities, and by 2050 two-thirds of the world's population is expected to live in urban areas. This urban growth leads to various environmental, social and economic challenges in cities, hampering the Quality of Life (QoL). Although recent trends in technology equip us with various tools and techniques that can help in improving quality of life, air pollution has remained the 'biggest environmental health risk' for decades, impacting individuals' quality of life and well-being according to the World Health Organisation (WHO). Many efforts have been made to measure air quality, but the sparse arrangement of monitoring stations and the lack of data currently make it challenging to develop systems that can capture within-city air pollution variations. To solve this, flexible methods that allow air quality monitoring using easily accessible data sources at the city level are desirable. The present thesis seeks to widen the current knowledge concerning detailed air quality monitoring by developing approaches that can help in tackling existing gaps in the literature. The thesis presents five contributions which address the issues mentioned above. The first contribution is the choice of a statistical method which can help in utilising existing open data and overcoming the challenges imposed by the large volume of data for detailed air pollution monitoring. The second contribution concerns the development of an optimisation method which helps in identifying optimal locations for robust air pollution modelling in cities. The third contribution of the thesis is also an optimisation method, which helps in initiating systematic volunteered geographic information (VGI) campaigns for detailed air pollution monitoring by addressing the sparsity and scarcity challenges of air pollution data in cities. The fourth contribution is a study proposing the involvement of housing companies as a stakeholder in the participatory framework for air pollution data collection, which helps in overcoming certain gaps existing in VGI-based approaches. Finally, the fifth contribution is an open-hardware system that aids in collecting vehicular traffic data using WiFi signal strength. The developed hardware can help in overcoming the traffic data scarcity in cities which limits detailed air pollution monitoring. All the contributions are illustrated through case studies in Muenster and Stuttgart. Overall, the thesis demonstrates the applicability of the developed approaches for enabling air pollution monitoring at the city scale under the broader framework of the open smart city and for urban health research.

    A Quasi-Likelihood Approach to Zero-Inflated Spatial Count Data

    The increased accessibility of data that are geographically referenced and correlated increases the demand for techniques of spatial data analysis. The subset of such data comprised of discrete counts exhibits particular difficulties, and the challenges further increase when a large proportion (typically 50% or more) of the counts are zero-valued. Such scenarios arise in many applications in numerous fields of research, and it is often desirable to infer subtleties of the process despite a lack of substantive information about the underlying stochastic mechanism generating the data. An ecological example provides the impetus for the research in this thesis: when observations for a species are recorded over a spatial region, and many of the counts are zero-valued, are the abundant zeros due to bad luck, or are there aspects of the region making it unsuitable for the survival of the species? In the framework of generalized linear models, we first develop a zero-inflated Poisson generalized linear regression model, which explains the variability of the responses given a set of measured covariates and additionally allows for the distinction of two kinds of zeros: sampling ("bad luck" zeros) and structural (zeros that provide insight into the data-generating process). We then adapt this model to the spatial setting by incorporating dependence within the model via a general, leniently defined quasi-likelihood strategy, which provides consistent, efficient and asymptotically normal estimators, even under erroneous assumptions about the covariance structure. In addition to this robustness to misspecification of the dependence, our quasi-likelihood model overcomes the need for the complete specification of a probability model, thus rendering it very general and relevant to many settings. To complement the developed regression model, we further propose methods for the simulation of zero-inflated spatial stochastic processes. This is done by deconstructing the entire process into a mixed, marked spatial point process: we augment existing algorithms for the simulation of spatial marked point processes with a stochastic mechanism that generates zero-abundant marks (counts) at each location. We propose several such mechanisms, and consider interaction and dependence processes for random locations as well as over a lattice.
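    For concreteness, the standard (non-spatial) zero-inflated Poisson regression specification that the model above builds on (a textbook form; the thesis's spatial quasi-likelihood extension and its working covariance are not reproduced here):

```latex
% Mixture of a point mass at zero (structural zeros, probability \pi_i)
% and a Poisson count (which can also produce sampling zeros)
P(Y_i = 0) = \pi_i + (1 - \pi_i)\, e^{-\lambda_i}, \qquad
P(Y_i = y) = (1 - \pi_i)\, \frac{e^{-\lambda_i} \lambda_i^{\,y}}{y!}, \quad y = 1, 2, \dots
% Covariates enter through link functions
\log \lambda_i = \mathbf{x}_i^\top \boldsymbol{\beta}, \qquad
\operatorname{logit}(\pi_i) = \mathbf{z}_i^\top \boldsymbol{\gamma}.
% Mean and variance of the mixture (overdispersed relative to the Poisson):
\mathbb{E}[Y_i] = (1 - \pi_i)\lambda_i, \qquad
\operatorname{Var}[Y_i] = (1 - \pi_i)\lambda_i\,(1 + \pi_i \lambda_i).
```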