
    Air Quality Research Using Remote Sensing

    Air pollution is a worldwide environmental hazard with serious consequences not only for human health and the climate but also for agriculture, ecosystems, and cultural heritage, among other domains. According to the WHO, 8 million premature deaths occur every year as a result of exposure to ambient air pollution, and more than 90% of the world’s population lives in areas where air quality is poor and exceeds the recommended limits. Furthermore, air pollution and the climate influence one another through complex physicochemical interactions in the atmosphere that alter the Earth’s energy balance, with implications for both climate change and air quality. It is therefore important to measure specific atmospheric parameters and pollutant concentrations, monitor their variations, and analyze different scenarios in order to assess air pollution levels and develop early warning and forecast systems that improve air quality and safeguard public health. Such measures can also help reduce the number of air pollution casualties and mitigate climate change. This book contains contributions focusing on remote sensing techniques for evaluating air quality, including the use of in situ data, modeling approaches, and the synthesis of different instruments and techniques. The papers published in this book highlight the importance and relevance of air quality studies and the potential of remote sensing, particularly from Earth observation platforms, to shed light on this topic.

    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    LIPIcs, Volume 251, ITCS 2023, Complete Volume

    Data- and expert-driven variable selection for predictive models in healthcare: towards increased interpretability in underdetermined machine learning problems

    Modern data acquisition techniques in healthcare generate large collections of data from multiple sources, such as novel diagnosis and treatment methodologies. Some concrete examples are electronic healthcare record systems, genomics, and medical images. This leads to situations with often unstructured, high-dimensional, heterogeneous patient cohort data where classical statistical methods may not be sufficient for optimal utilization of the data and informed decision-making. Instead, investigating such data structures with modern machine learning techniques promises to improve the understanding of patient health issues and may provide a better platform for informed decision-making by clinicians. Key requirements for this purpose include (a) sufficiently accurate predictions and (b) model interpretability. Achieving both aspects in parallel is difficult, particularly for datasets with few patients, which are common in the healthcare domain. In such cases, machine learning models encounter mathematically underdetermined systems and may easily overfit the training data. An important approach to overcome this issue is feature selection, i.e., determining a subset of features that are informative with respect to the target variable. While potentially improving predictive performance, feature selection also fosters model interpretability by identifying a low number of relevant model parameters, helping to better understand the underlying biological processes that lead to health issues. Interpretability requires that feature selection is stable, i.e., small changes in the dataset do not lead to changes in the selected feature set. A concept to address instability is ensemble feature selection, i.e., the process of repeating feature selection multiple times on subsets of samples of the original dataset and aggregating the results in a meta-model. This thesis presents two approaches for ensemble feature selection tailored towards high-dimensional data in healthcare: the Repeated Elastic Net Technique for feature selection (RENT) and the User-Guided Bayesian Framework for feature selection (UBayFS). While RENT is purely data-driven and builds upon elastic net regularized models, UBayFS is a general framework for ensembles with the capability to include expert knowledge in the feature selection process via prior weights and side constraints. A case study modeling the overall survival of cancer patients compares these novel feature selectors and demonstrates their potential in clinical practice. Beyond the selection of single features, UBayFS also allows for selecting whole feature groups (feature blocks) acquired from multiple data sources, such as those mentioned above. Importance quantification of such feature blocks plays a key role in tracing information about the target variable back to the acquisition modalities. Such information on feature block importance may lead to a better use of human, technical, and financial resources if systematically integrated into the planning of patient treatment, for instance by excluding the acquisition of non-informative features. Since a generalization of feature importance measures to block importance is not trivial, this thesis also investigates and compares approaches for feature block importance rankings. This thesis demonstrates that high-dimensional datasets from multiple data sources in the medical domain can be successfully tackled by the presented approaches for feature selection.
Experimental evaluations demonstrate favorable properties in terms of predictive performance, stability, and interpretability of the results, which carries high potential for better data-driven decision support in clinical practice.
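
To make the ensemble idea concrete, the following is a minimal Python sketch of repeated elastic-net fits on random subsamples, aggregated by selection frequency. It is only an illustration of the general concept under assumed parameter values; it is not the RENT or UBayFS implementation, and the function name, subsample fraction, and threshold are placeholders.

```python
# Minimal sketch of ensemble feature selection via repeated elastic-net fits.
# Illustrative only -- not the RENT/UBayFS implementations described above.
import numpy as np
from sklearn.linear_model import ElasticNet

def ensemble_feature_selection(X, y, n_repeats=50, subsample=0.8,
                               alpha=0.1, l1_ratio=0.5, threshold=0.6,
                               random_state=0):
    """Select features whose elastic-net coefficient is non-zero in at
    least `threshold` of the repeated fits on random subsamples."""
    rng = np.random.default_rng(random_state)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_repeats):
        idx = rng.choice(n, size=int(subsample * n), replace=False)
        model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, max_iter=10_000)
        model.fit(X[idx], y[idx])
        counts += (np.abs(model.coef_) > 1e-8)
    frequency = counts / n_repeats          # per-feature selection frequency
    return np.where(frequency >= threshold)[0], frequency

# Synthetic example: only the first 5 of 100 features carry signal.
X = np.random.default_rng(1).normal(size=(80, 100))
y = X[:, :5] @ np.ones(5) + 0.1 * np.random.default_rng(2).normal(size=80)
selected, freq = ensemble_feature_selection(X, y)
print(selected)
```

The aggregation step (here a simple frequency cut-off) is where a meta-model such as UBayFS could additionally incorporate prior weights and side constraints.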

    Subgroup discovery for structured target concepts

    The main object of study in this thesis is subgroup discovery, a theoretical framework for finding subgroups in data—i.e., named sub-populations—whose behaviour with respect to a specified target concept is exceptional when compared to the rest of the dataset. This is a powerful tool that conveys crucial information to a human audience, but despite past advances it has been limited to simple target concepts. In this work we propose algorithms that bring this framework to novel application domains. We introduce the concept of representative subgroups, which we use not only to ensure the fairness of a sub-population with regard to a sensitive trait, such as race or gender, but also to go beyond known trends in the data. For entities with additional relational information that can be encoded as a graph, we introduce a novel measure of robust connectedness which improves on established alternative measures of density; we then provide a method that uses this measure to discover which named sub-populations are better connected. Our contributions within subgroup discovery culminate in the introduction of kernelised subgroup discovery: a novel framework that enables the discovery of subgroups on i.i.d. target concepts with virtually any kind of structure. Importantly, our framework additionally provides a concrete and efficient tool that works out of the box without any modification, apart from specifying the Gramian of a positive definite kernel. For use within kernelised subgroup discovery, but also in any other kind of kernel method, we additionally introduce a novel random walk graph kernel. Our kernel allows fine-tuning of the alignment between the vertices of the two compared graphs during the counting of the random walks, and we also propose meaningful structure-aware vertex labels to utilise this new capability. With these contributions we thoroughly extend the applicability of subgroup discovery and ultimately re-define it as a kernel method.
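
For readers unfamiliar with the basic setting, the toy Python sketch below scores single-condition subgroups of a small tabular dataset with the classic Weighted Relative Accuracy quality function. It illustrates only the conventional binary-target case, not the kernelised framework or the graph-based measures contributed by the thesis; all column names and data are hypothetical.

```python
# Toy illustration of classic subgroup discovery with a binary target:
# score single-condition subgroups by Weighted Relative Accuracy (WRAcc).
# Not the kernelised framework proposed in the thesis; data are hypothetical.
import pandas as pd

def wracc(mask, target):
    """WRAcc = coverage * (subgroup target rate - overall target rate)."""
    coverage = mask.mean()
    if coverage == 0:
        return 0.0
    return coverage * (target[mask].mean() - target.mean())

def best_subgroups(df, target_col, top_k=3):
    target = df[target_col].astype(bool)
    scored = []
    for col in df.columns.drop(target_col):
        for value in df[col].unique():
            mask = df[col] == value
            scored.append((wracc(mask, target), f"{col} == {value!r}"))
    return sorted(scored, reverse=True)[:top_k]

# Hypothetical data: which named sub-population behaves exceptionally
# with respect to the binary target "default"?
df = pd.DataFrame({
    "income":  ["low", "low", "high", "high", "low", "high"],
    "urban":   [True, True, False, True, False, False],
    "default": [1, 1, 0, 0, 1, 0],
})
print(best_subgroups(df, "default"))
```

Kernelised subgroup discovery generalises the target side of this computation: instead of a binary column, the exceptionality of a subgroup is measured through the Gram matrix of a positive definite kernel over structured targets.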

    Applications of Machine Learning for Data-Driven Prevention at the Population Level

    Healthcare costs are systematically rising, and current therapy-focused healthcare systems are not sustainable in the long run. While disease prevention is a viable instrument for reducing costs and suffering, it requires risk modeling to stratify populations, identify high-risk individuals and enable personalized interventions. In current clinical practice, however, systematic risk stratification is limited: on the one hand, for the vast majority of endpoints, no risk models exist. On the other hand, available models focus on predicting a single disease at a time, rendering predictor collection burdensome. At the same time, the density of individual patient data is constantly increasing. Especially complex data modalities, such as -omics measurements or images, may contain systemic information on future health trajectories relevant for multiple endpoints simultaneously. However, to date, these data are inaccessible for risk modeling as no dedicated methods exist to extract clinically relevant information. This study built on recent advances in machine learning to investigate the applicability of four distinct data modalities not yet leveraged for risk modeling in primary prevention. For each data modality, a neural network-based survival model was developed to extract predictive information, scrutinize performance gains over commonly collected covariates, and pinpoint potential clinical utility. Notably, the developed methodology was able to integrate polygenic risk scores for cardiovascular prevention, outperforming existing approaches and identifying benefiting subpopulations. Investigating NMR metabolomics, the developed methodology allowed the prediction of future disease onset for many common diseases at once, indicating potential applicability as a drop-in replacement for commonly collected covariates. Extending the methodology to phenome-wide risk modeling, electronic health records were found to be a general source of predictive information with high systemic relevance for thousands of endpoints. Assessing retinal fundus photographs, the developed methodology identified diseases where retinal information most impacted health trajectories. In summary, the results demonstrate the capability of neural survival models to integrate complex data modalities for multi-disease risk modeling in primary prevention and illustrate the tremendous potential of machine learning models to disrupt medical practice toward data-driven prevention at population scale.
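
As a rough illustration of what a neural network-based survival model can look like, the sketch below combines a small feed-forward network with a Cox-type negative partial log-likelihood in PyTorch. It is a generic, simplified example under stated assumptions (no tie handling, random toy data, arbitrary layer sizes), not the models developed in this study.

```python
# Minimal sketch of a neural Cox-type survival model; illustrative only,
# not the multi-modal risk models developed in the thesis.
import torch
import torch.nn as nn

class NeuralCox(nn.Module):
    def __init__(self, n_features, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)   # log relative hazard per individual

def cox_partial_likelihood_loss(log_hazard, time, event):
    """Negative Cox partial log-likelihood (no correction for tied times)."""
    order = torch.argsort(time, descending=True)     # build risk sets by sorting
    lh, ev = log_hazard[order], event[order]
    log_cum_hazard = torch.logcumsumexp(lh, dim=0)   # log-sum over each risk set
    return -((lh - log_cum_hazard) * ev).sum() / ev.sum().clamp(min=1)

# Hypothetical training step on random data (features, follow-up time, event flag).
x = torch.randn(64, 20)
time = torch.rand(64)
event = (torch.rand(64) < 0.7).float()
model = NeuralCox(20)
loss = cox_partial_likelihood_loss(model(x), time, event)
loss.backward()
```

In a multi-disease setting, the final linear layer would output one log-hazard per endpoint, with the partial likelihood evaluated per endpoint and summed.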

    Examining the Relationship Between Lignocellulosic Biomass Structural Constituents and Its Flow Behavior

    Lignocellulosic biomass derived from plants and herbaceous sources is a promising source of inexpensive, abundant, and potentially carbon-neutral energy. One of the leading limitations of using lignocellulosic biomass as a feedstock for bioenergy products is the flow issues encountered during biomass conveyance in biorefineries. In the biorefining process, the biomass feedstock flows through a variety of conveyance systems. The inherent variability of the feedstock materials, as evidenced by their complex microstructural composition and non-uniform morphology, coupled with the varying flow conditions in the conveyance systems, gives rise to flow issues such as bridging, ratholing, and clogging. These issues slow down the conveyance process, affect machine life, and can lead to partial or even complete shutdown of the biorefinery. Hence, we need to improve our fundamental understanding of biomass feedstock flow physics and mechanics to address the flow issues and improve biorefinery economics. This dissertation research examines the fundamental relationship between the structural constituents of diverse lignocellulosic biomass materials, i.e., cellulose, hemicellulose, and lignin, their morphology, and the impact of this structural composition and morphology on their flow behavior. First, we prepared and characterized biomass feedstocks of different chemical compositions and morphologies. Then, we conducted our fundamental investigation experimentally, through physical flow characterization tests, and computationally, through high-fidelity discrete element modeling. Finally, we statistically analyzed the relative influence of the properties of lignocellulosic biomass assemblies on flow behavior to determine the most critical properties and the optimum values of flow parameters. Our research provides an experimental and computational framework to generalize findings to a wider portfolio of biomass materials. It will help the bioenergy community design more efficient biorefining machinery and equipment, reduce the risk of failure, and improve the overall commercial viability of the bioenergy industry.

    Exemplars as a least-committed alternative to dual-representations in learning and memory

    Despite some notable counterexamples, the theoretical and empirical exchange between the fields of learning and memory is limited. In an attempt to promote further theoretical exchange, I explore how learning and memory may be conceptualized as distinct algorithms that operate on the same representations of past experiences. I review representational and process assumptions in learning and memory, using evaluative conditioning and false recognition as examples, and identify important similarities in the theoretical debates. Based on this review, I identify global matching memory models and their exemplar representation as a promising candidate for a common representational substrate that satisfies the principle of least commitment. I then present two cases in which exemplar-based global matching models, which take characteristics of the stimulus material and context into account, suggest parsimonious explanations for empirical dissociations in evaluative conditioning and in false recognition in long-term memory. These explanations suggest reinterpretations of findings that are commonly taken as evidence for dual-representation models. Finally, I report that the same approach also provides a natural unitary account of false recognition in short-term memory, a finding which challenges the assumption that short-term memory is insulated from long-term memory. Taken together, this work illustrates the broad explanatory scope and the integrative yet parsimonious potential of exemplar-based global matching models.
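
The computational core of exemplar-based global matching models is a summed-similarity (familiarity or "echo") signal. The toy Python sketch below, with purely illustrative stimulus vectors and an arbitrary specificity parameter, shows how a lure that resembles a studied exemplar can receive a high familiarity value and thus be falsely recognised; it is a generic illustration, not a model fitted in this work.

```python
# Toy sketch of the summed-similarity ("echo intensity") computation at the
# core of exemplar-based global matching models; stimuli are hypothetical.
import numpy as np

def global_match(probe, exemplars, specificity=1.0):
    """Summed exponential similarity of a probe to all stored exemplars
    (higher values indicate a stronger familiarity signal)."""
    distances = np.linalg.norm(exemplars - probe, axis=1)
    return np.exp(-specificity * distances).sum()

# Hypothetical memory set and probes: a studied item, a similar lure, a foil.
exemplars = np.array([[1.0, 0.0, 1.0],
                      [0.9, 0.1, 1.0],
                      [0.0, 1.0, 0.0]])
probes = {"studied": exemplars[0],
          "lure":    np.array([0.95, 0.05, 1.0]),
          "foil":    np.array([0.5, 0.5, 0.5])}
for name, probe in probes.items():
    print(name, round(global_match(probe, exemplars), 3))
```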

    Approximate Methods for Marginal Likelihood Estimation

    We consider the estimation of the marginal likelihood in Bayesian statistics, an essential task known to be computationally expensive when the dimension of the parameter space is large. We propose a general algorithm, with numerous extensions, that can be applied to a wide variety of problem settings and excels particularly when dealing with near log-concave posteriors. Our method hinges on a novel idea that uses MCMC samples to partition the parameter space and forms local approximations over these partition sets as a means of estimating the marginal likelihood. In this dissertation, we provide both the motivation and the groundwork for developing what we call the Hybrid estimator. Our numerical experiments show the versatility and accuracy of the proposed estimator, even as the parameter space becomes increasingly high-dimensional and complicated.
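
A toy one-dimensional example may help convey the identity that partition-based estimators of this kind rely on: the marginal likelihood equals the integral of likelihood times prior over any set A, divided by the posterior probability of A, where the latter is estimated from posterior (MCMC) samples and the former by a local approximation. The sketch below uses a conjugate normal model with a known answer; it is not the Hybrid estimator itself, and the choice of A here is purely illustrative.

```python
# Toy 1-D illustration of the identity behind partition-based marginal
# likelihood estimation: Z = (integral of likelihood*prior over a set A)
# / (posterior probability of A). Illustrative only -- not the Hybrid estimator.
import numpy as np
from scipy import stats
from scipy.integrate import quad

# Conjugate normal model: y ~ N(theta, 1), theta ~ N(0, 1); Z is known exactly.
y = 1.3
unnormalised = lambda t: stats.norm.pdf(y, loc=t, scale=1.0) * stats.norm.pdf(t, 0.0, 1.0)
exact_Z = stats.norm.pdf(y, loc=0.0, scale=np.sqrt(2.0))   # analytic marginal

# The posterior is N(y/2, 1/2); draw "MCMC" samples directly for this toy case.
rng = np.random.default_rng(0)
samples = rng.normal(y / 2, np.sqrt(0.5), size=50_000)

# Local set A chosen from the bulk of the samples (illustrative choice).
a, b = np.quantile(samples, [0.25, 0.75])
local_integral, _ = quad(unnormalised, a, b)        # local approximation step
prob_A = np.mean((samples >= a) & (samples <= b))   # posterior mass of A
estimate_Z = local_integral / prob_A

print(f"exact Z = {exact_Z:.5f}, estimated Z = {estimate_Z:.5f}")
```

In higher dimensions the local integral over each partition set is no longer available by quadrature, which is where local approximations over MCMC-derived partition sets become the key ingredient.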

    Regional water balance analysis of glacierised river basins in the north-eastern Himalaya applying the J2000 hydrological model

    The glacierised basins of the north-eastern Himalayan region are highly vulnerable to climate-change impacts. Spatio-temporal hydroclimatic and physiographic variability affects the water balance of these glacierised basins across the region. This study assesses the glaciohydrological processes and dynamics in this data-scarce region, for present as well as future climate-change scenarios, through regional water balance analysis. The J2000 hydrological model was adapted to incorporate frozen ground as well as glacier dynamics in a stepwise, nested basin calibration approach. The ERA-Interim precipitation data cannot capture the high-amplitude orographic and convective events; therefore, orographic correction factors, derived inversely from reported glacier mass balance and evapotranspiration estimates, were used to correct the ERA-Interim precipitation data for orographic as well as cyclonic precipitation in the region. A monthly temperature lapse rate was adopted for correcting the ERA-Interim temperature dataset. The Beki basin was selected as the donor basin for model development and evaluation, and its parameters were regionalised to the receptor Lohit and Noadihing basins by the proxy-basin method. Multi-objective optimization criteria, such as the Kling-Gupta efficiency (KGE) for temporal dynamics and flow distribution and bias for the overall water balance, showed high to moderate agreement between measured and simulated discharge at the corresponding basin outlets. The variability in the water balance and runoff components among the three basins was primarily related to the spatio-temporal variation in the mean annual precipitation, runoff, and evapotranspiration estimates. The impact of climate-change scenarios on the study basins indicated that water availability would be sustained until the end of the century due to higher projected precipitation, even after the depletion of glaciers in the region.
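
For reference, the Kling-Gupta efficiency used as one of the calibration criteria combines the correlation, variability ratio, and bias ratio between simulated and observed discharge. The sketch below follows the standard formulation (Gupta et al., 2009) with hypothetical discharge values; it is not code from this study.

```python
# Kling-Gupta efficiency in its standard 2009 formulation; a generic sketch,
# not taken from the study's model code. Discharge values are hypothetical.
import numpy as np

def kge(simulated, observed):
    """KGE = 1 - sqrt((r-1)^2 + (alpha-1)^2 + (beta-1)^2), where r is the
    Pearson correlation, alpha the ratio of standard deviations, and beta
    the ratio of means between simulated and observed discharge."""
    simulated = np.asarray(simulated, dtype=float)
    observed = np.asarray(observed, dtype=float)
    r = np.corrcoef(simulated, observed)[0, 1]
    alpha = simulated.std() / observed.std()
    beta = simulated.mean() / observed.mean()
    return 1.0 - np.sqrt((r - 1) ** 2 + (alpha - 1) ** 2 + (beta - 1) ** 2)

# Hypothetical monthly discharge series (m^3/s) at a basin outlet.
observed  = np.array([120, 340, 560, 480, 300, 150])
simulated = np.array([110, 360, 540, 500, 280, 170])
print(round(kge(simulated, observed), 3))
```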