259 research outputs found

    Uncertainty Quantification in Line Edge Roughness Estimation Using Conformal Prediction

    With its increasing pervasiveness across multiple industries, machine learning is expected to gain greater significance in the semiconductor manufacturing industry. To foster trust and facilitate the adoption of machine-learning models, it is necessary to employ prediction intervals, which summarize the performance and consistency of their predictions. Conformal prediction is a recent, mathematically proven technique for constructing prediction intervals for classification and regression problems. Unlike traditional approaches, conformal prediction does not require distributional assumptions such as Gaussianity. Furthermore, it can be combined with other techniques to yield a variety of interval prediction algorithms. We aim to illustrate the applications and performance of some of these methods on line edge roughness (LER) estimation. Experimental studies have shown that LER degrades the performance of semiconductor devices. While scanning electron microscopy (SEM) is the method of choice for measuring LER, it is subject to added uncertainty from sources such as instrumental noise, edge effects, and Poisson noise. This work focuses on developing prediction intervals for LER estimates derived from EDGENet, a deep convolutional neural network trained on a large dataset of simulated SEM images. EDGENet was originally developed by our research group and directly outputs predictions of true edge positions from a corrupted SEM image.
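
To make the idea of a distribution-free prediction interval concrete, here is a minimal sketch of split conformal prediction for regression. It is not the authors' method and does not involve EDGENet: the `predict` function, the synthetic data, and the name `split_conformal_interval` are all assumptions made for illustration. The sketch assumes the calibration and test points are exchangeable.

```python
import numpy as np

def split_conformal_interval(residuals_cal, y_pred_new, alpha=0.1):
    """Return (lower, upper) bounds targeting marginal coverage 1 - alpha.

    residuals_cal: y - prediction on a held-out calibration set.
    y_pred_new:    point predictions for new inputs.
    """
    scores = np.abs(np.asarray(residuals_cal, dtype=float))
    n = scores.size
    # Finite-sample corrected rank: the ceil((n + 1)(1 - alpha))-th smallest score.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    # With too few calibration points the only valid interval is unbounded.
    q = np.inf if k > n else np.sort(scores)[k - 1]
    y_pred_new = np.asarray(y_pred_new, dtype=float)
    return y_pred_new - q, y_pred_new + q

# Hypothetical stand-in for a trained point predictor.
def predict(x):
    return 2.0 * x

rng = np.random.default_rng(0)

# Synthetic data with heavy-tailed (non-Gaussian) noise.
x_cal = rng.uniform(0, 1, 500)
y_cal = 2.0 * x_cal + rng.standard_t(df=3, size=500)
x_test = rng.uniform(0, 1, 2000)
y_test = 2.0 * x_test + rng.standard_t(df=3, size=2000)

lower, upper = split_conformal_interval(y_cal - predict(x_cal), predict(x_test), alpha=0.1)
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"empirical coverage: {coverage:.3f}")
```

The interval width here is the same for every input, which is the fixed-length behaviour that adaptive variants of conformal prediction aim to improve on.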

    Probabilistic Load Forecasting with Deep Conformalized Quantile Regression

    The establishment of smart grids and the introduction of distributed generation have posed new challenges in energy analytics that can be tackled with machine learning algorithms. These algorithms can handle a combination of weather and consumption data, grid measurements, and their historical records to perform inference and make predictions. Accurate energy load forecasting is essential to ensure reliable grid operation and power provision at peak times, when power consumption is high. However, most existing load forecasting algorithms provide only point estimates, or are probabilistic forecasting methods that construct prediction intervals without a coverage guarantee. Yet information about uncertainty and prediction intervals is very useful to grid operators for evaluating the reliability of operations in the power network and for enabling a risk-based strategy for configuring the grid rather than a conservative one. Two statistical methods are popular for generating prediction intervals in regression tasks. Quantile regression is a non-parametric probabilistic forecasting technique that produces prediction intervals adaptive to local variability by estimating quantile functions directly from the data. However, the actual coverage of the prediction intervals obtained via quantile regression is not guaranteed to satisfy the designed coverage level for finite samples. Conformal prediction is a probabilistic forecasting framework applied on top of an existing model; it produces symmetric prediction intervals, most often with a fixed length, that are guaranteed to marginally satisfy the designed coverage level for finite samples. This thesis proposes a probabilistic load forecasting method for constructing marginally valid prediction intervals that are adaptive to local variability and suitable for data characterized by temporal dependencies. 
The method is applied in conjunction with recurrent neural networks, deep learning architectures for sequential data, which are mostly used to compute point forecasts rather than probabilistic forecasts. Specifically, an ensemble of pinball-loss-guided deep neural networks performing quantile regression is used together with conformal prediction to address the individual shortcomings of both techniques.
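
The combination described above can be sketched in a few lines. The following is an illustrative outline of the pinball loss and of the conformalization step of conformalized quantile regression, not the thesis's implementation: it uses no neural network or ensemble, the lower and upper quantile estimates are hypothetical stand-ins that are deliberately too narrow, and the data are assumed exchangeable, so the temporal dependencies the thesis addresses are ignored here.

```python
import numpy as np

def pinball_loss(y, q_pred, tau):
    """Mean pinball (quantile) loss for quantile level tau in (0, 1)."""
    diff = y - q_pred
    return float(np.mean(np.maximum(tau * diff, (tau - 1) * diff)))

def cqr_interval(y_cal, lo_cal, hi_cal, lo_new, hi_new, alpha=0.1):
    """Shift estimated quantiles outward (or inward) by a calibrated margin."""
    # Conformity score: how far y falls outside [lo, hi] (negative if inside).
    scores = np.maximum(lo_cal - y_cal, y_cal - hi_cal)
    n = scores.size
    k = int(np.ceil((n + 1) * (1 - alpha)))
    q = np.inf if k > n else np.sort(scores)[k - 1]
    return lo_new - q, hi_new + q

rng = np.random.default_rng(1)

def simulate(n):
    """Synthetic heteroscedastic data: noise grows with x."""
    x = rng.uniform(0, 3, n)
    sd = 0.1 + 0.5 * x
    y = np.sin(x) + rng.normal(0, sd)
    return x, y, sd

def narrow_quantiles(x, sd):
    """Hypothetical, miscalibrated quantile estimates (far too narrow)."""
    return np.sin(x) - 0.5 * sd, np.sin(x) + 0.5 * sd

x_cal, y_cal, sd_cal = simulate(1000)
x_test, y_test, sd_test = simulate(2000)
lo_cal, hi_cal = narrow_quantiles(x_cal, sd_cal)
lo_test, hi_test = narrow_quantiles(x_test, sd_test)

raw_cov = np.mean((y_test >= lo_test) & (y_test <= hi_test))
lo_cqr, hi_cqr = cqr_interval(y_cal, lo_cal, hi_cal, lo_test, hi_test, alpha=0.1)
cqr_cov = np.mean((y_test >= lo_cqr) & (y_test <= hi_cqr))
print(f"raw coverage: {raw_cov:.3f}, conformalized coverage: {cqr_cov:.3f}")
```

The conformalized intervals keep the input-dependent width of the quantile estimates while the calibrated margin restores the marginal coverage level.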

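    Effets climatiques et non climatiques sur la distribution des plantes et la migration prévue dans l'est de l'Amérique du Nord (Climatic and non-climatic effects on plant distribution and predicted migration in eastern North America)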

    Abstract: Understanding the spatial distributions and abundances of organisms is a central goal of ecology. The spatial distributions of many plant species are expected to change greatly in response to global climate change. To predict species’ future distributions, we need to answer two fundamental questions: 1. Are plants migrating in response to climate change? 2. Where are suitable habitats for plants under future climate scenarios? For the first question, as tree species provide foundations to many terrestrial ecosystems, it is important to predict whether or not tree species can track climate change. Differences between the distributions of tree saplings and adults in geographic or niche space have been used to infer climate change effects on tree range dynamics. Previous studies have reported narrower latitudinal or climatic niche ranges of juvenile trees compared to adults, concluding that tree ranges are contracting, contradicting climate-based predictions. However, more comprehensive sampling of adult trees than juvenile trees in many regional forest inventories could potentially bias ontogenetic comparisons. In Chapter 2, I first report spatial simulations showing that reduced sampling intensity can result in underestimates of range and niche limits. I then re-analyzed the U.S. Forest Inventory and Analysis data using two resampling procedures. These resampling procedures had a major influence on the estimation of range limits, most often by reducing, eliminating, or even reversing the tendency in the original analyses for saplings to have broader distributions than adult trees. These results suggest that previous conclusions that the distributions of juvenile trees were contracting were potentially artefacts of sampling in the underlying data. For the second question, species distribution models (SDMs) are widely used to predict future suitable habitats for plants, but usually with only climatic predictors. 
However, plant distributions are also influenced by many non-climatic factors, such as soil properties, dispersal limitation, and the light environment. Understanding non-climate effects on plant distributions would provide more realistic predictions of biodiversity change. In this thesis, I explored soil effects on plant spatial distributions along a latitudinal gradient (Chapter 3) and an elevational gradient (Chapter 4). In Chapter 3, we built three species distribution models (SDMs) – one with only climate predictors, one with only soil predictors, and one with both – for each of 1870 plant species in Eastern North America. These models showed that while climate variables were the most important predictors, soil properties also had a substantial influence on continental-scale plant distributions. Under future climate scenarios, models including soil predicted much smaller northward shifts (~40% reduction) in distributions than climate-only models, strongly suggesting that high-latitude soils are likely to impede ongoing plant migration. However, macroecological studies rely on soil data at a much coarser spatial resolution than that experienced by plants. Studies along elevational gradients can provide detailed soil data at the same spatial resolution as occurrence and abundance data, while still covering a wide climatic gradient. In Chapter 4, I report an intensive field survey of four spring forest herbs and soil properties along an elevational gradient in southern Québec, Canada, testing the hypothesis that soil properties contribute to defining upper elevational range limits. I found that soil properties had substantial impacts on the occurrence or abundance of all four species, and soil effects were more pronounced at higher elevations. 
For two species, Trillium erectum and Claytonia caroliniana, very infrequent occurrences at high elevation (>950 m) were strongly associated with rare microsites with high pH or nutrients, suggesting that soil properties play important roles in constraining plants' upper range limits. Overall, these findings suggest that i) inferences about plant migration processes should account for sampling bias in the underlying data; and ii) soil properties can have major impacts on plant distributions along climatic gradients, so it is necessary to incorporate soil properties into models and predictions of plant distributions and migration under environmental change.
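
The sampling-intensity argument in this abstract can be illustrated with a small simulation. The sketch below is an assumption-laden toy, not the thesis's spatial simulations or its two resampling procedures (which the abstract does not describe): both life stages are drawn from the same hypothetical latitudinal distribution, and the correction shown is a generic rarefaction that subsamples the better-sampled group down to the size of the other.

```python
import numpy as np

rng = np.random.default_rng(42)

def latitudinal_extent(latitudes):
    """Observed range extent: distance between the extreme records."""
    return float(np.max(latitudes) - np.min(latitudes))

def rarefied_extent(latitudes, n_target, n_reps=200):
    """Mean extent after repeatedly subsampling to n_target records."""
    extents = [
        latitudinal_extent(rng.choice(latitudes, size=n_target, replace=False))
        for _ in range(n_reps)
    ]
    return float(np.mean(extents))

# Hypothetical records: adults and saplings share ONE true distribution,
# but adults are sampled ten times more intensively.
adults = rng.normal(45.0, 3.0, size=2000)
saplings = rng.normal(45.0, 3.0, size=200)

# Naive comparison: raw extents, confounded by sample size.
naive_gap = latitudinal_extent(adults) - latitudinal_extent(saplings)

# Rarefied comparison: adults subsampled to the sapling sample size.
rarefied_gap = rarefied_extent(adults, saplings.size) - latitudinal_extent(saplings)

print(f"naive adult-minus-sapling extent gap:    {naive_gap:.2f} degrees")
print(f"rarefied adult-minus-sapling extent gap: {rarefied_gap:.2f} degrees")
```

Because a larger sample is more likely to include extreme records, the naive gap tends to favour the better-sampled group even when the underlying distributions are identical; equalizing sample sizes removes that purely statistical advantage.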

    Application of remote sensing to selected problems within the state of California

    There are no author-identified significant results in this report

    Development, validation and application of in-silico methods to predict the macromolecular targets of small organic compounds

    Computational methods to predict the macromolecular targets of small organic drugs and drug-like compounds play a key role in early drug discovery and drug repurposing efforts. These methods are developed by building predictive models that aim to learn the relationships between compounds and their targets in order to predict the bioactivity of the compounds. In this thesis, we analyzed the strategies used to validate target prediction approaches and how current strategies leave crucial questions about performance unanswered. Namely, how does an approach perform on a compound of interest, with its structural specificities, as opposed to the average query compound in the test data? We constructed and present new guidelines on validation strategies to address these shortcomings. We then present the development and validation of two ligand-based target prediction approaches: a similarity-based approach and a binary relevance random forest (machine-learning-based) approach, both of which have a wide coverage of the target space. Importantly, we applied a new validation protocol to benchmark the performance of these approaches. The approaches were tested under three scenarios: a standard testing scenario with external data, a standard time-split scenario, and a close-to-real-world test scenario. We disaggregated the performance based on the distance of the testing data to the reference knowledge base, giving a more nuanced view of the performance of the approaches. We showed that, surprisingly, the similarity-based approach generally performed better than the machine-learning-based approach under all testing scenarios, while also having a target coverage that was twice as large. After validating two target prediction approaches, we present our work on a large-scale application of computational target prediction to curate optimized compound libraries. 
While screening large collections of compounds against biological targets is key to identifying new bioactivities, it is resource intensive and challenging. Small to medium-sized libraries that have been optimized to have a higher chance of producing a true hit on an arbitrary target of interest are therefore valuable. We curated libraries of readily purchasable compounds by: i. utilizing property filters to ensure that the compounds have key physicochemical properties and are not overly reactive, ii. applying a similarity-based target prediction method, with a wide target scope, to predict the bioactivities of compounds, and iii. employing a genetic algorithm to select compounds for the library to maximize the biological diversity in the predicted bioactivities. These enriched small to medium-sized compound libraries provide valuable tool compounds to support early drug development and target identification efforts, and have been made available to the community. The distinctive contributions of this thesis include the development and benchmarking of two ligand-based target prediction approaches under novel validation scenarios, and the application of target prediction to enrich screening libraries with biologically diverse bioactive compounds. We hope that the insights presented in this thesis will help push data-driven drug discovery forward.
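
A similarity-based target prediction approach of the general kind this abstract mentions can be sketched as a nearest-neighbour lookup. The example below is a generic illustration and not the thesis's method: fingerprints are represented as plain Python sets of on-bit indices, Tanimoto similarity is assumed as the similarity measure, and the knowledge base, target names, and function names are all hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def rank_targets(query_fp, knowledge_base):
    """Rank targets by the query's similarity to each target's closest known ligand."""
    best = {}
    for ligand_fp, target in knowledge_base:
        score = tanimoto(query_fp, ligand_fp)
        if score > best.get(target, -1.0):
            best[target] = score
    # Highest similarity first; ties broken alphabetically by target name.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))

# Hypothetical knowledge base of (ligand fingerprint, annotated target) pairs.
knowledge_base = [
    ({1, 2, 3, 4}, "kinase_A"),
    ({1, 2, 5}, "kinase_A"),
    ({7, 8, 9}, "gpcr_B"),
    ({2, 3, 10, 11}, "protease_C"),
]

query = {1, 2, 3, 4, 10}
ranked = rank_targets(query, knowledge_base)

# One possible notion of "distance to the knowledge base":
# one minus the similarity to the nearest known ligand.
distance_to_kb = 1.0 - ranked[0][1]

for target, score in ranked:
    print(f"{target}: {score:.2f}")
```

Reporting a quantity such as `distance_to_kb` alongside each prediction is one way to group test compounds by how far they sit from the reference data when performance is broken down.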

    Feature Driven Learning Techniques for 3D Shape Segmentation

    Segmentation is a fundamental problem in 3D shape analysis and machine learning. The ability to partition a 3D shape into meaningful or functional parts is a vital ingredient of many downstream applications like shape matching, classification and retrieval. Early segmentation methods were based on approaches like fitting primitive shapes to parts or extracting segmentations from feature points. However, such methods had limited success on shapes with more complex geometry. Observing this, research began using geometric features to aid the segmentation, as certain features (e.g. Shape Diameter Function (SDF)) are less sensitive to complex geometry. This trend was also incorporated in the shift to set-wide segmentations, called co-segmentation, which provides a consistent segmentation throughout a shape dataset, meaning similar parts have the same segment identifier. The idea of co-segmentation is that a set of same-class shapes (e.g. chairs) contains more information about the class than a single shape would, which could lead to an overall improvement to the segmentation of the individual shapes. Over the past decade many different approaches to co-segmentation have been explored, covering supervised, unsupervised and even user-driven active learning. In each of these areas, geometric features have been widely used to aid proposed segmentation algorithms, with each method typically using different combinations of features. The aim of this thesis is to explore these different areas of 3D shape segmentation, perform an analysis of the effectiveness of geometric features in these areas and tackle core issues that currently exist in the literature. Initially, we explore the area of unsupervised segmentation, specifically looking at co-segmentation, and perform an analysis of several different geometric features. 
Our analysis is intended to compare the different features in a single unsupervised pipeline to evaluate their usefulness and determine their strengths and weaknesses. Our analysis also includes several features that have not yet been explored in unsupervised segmentation but have been shown effective in other areas. Later, with the ever increasing popularity of deep learning, we explore the area of supervised segmentation and investigate the current state of Neural Network (NN) driven techniques. We specifically observe limitations in the current state-of-the-art and propose a novel Convolutional Neural Network (CNN) based method which operates on multi-scale geometric features to gain more information about the shapes being segmented. We also perform an evaluation of several different supervised segmentation methods using the same input features, but with varying complexity of model design. This is intended to see if the more complex models provide a significant performance increase. Lastly, we explore the user-driven area of active learning, to tackle the many inconsistencies in current ground truth segmentations, which are vital for most segmentation methods. Active learning has been used to great effect for ground truth generation in the past, so we present a novel active learning framework using deep learning and geometric features to assist the user in co-segmentation of a dataset. Our method emphasises segmentation accuracy while minimising user effort, providing an interactive visualisation for co-segmentation analysis and the application of automated optimisation tools. In this thesis we explore the effectiveness of different geometric features across varying segmentation tasks, providing an in-depth analysis and comparison of state-of-the-art methods.

    Statistical Data Modeling and Machine Learning with Applications

    The modeling and processing of empirical data is one of the main subjects and goals of statistics. Nowadays, with the development of computer science, the extraction of useful and often hidden information and patterns from data sets of different volumes and complex data sets in warehouses has been added to these goals. New and powerful statistical techniques with machine learning (ML) and data mining paradigms have been developed. To one degree or another, all of these techniques and algorithms originate from a rigorous mathematical basis, including probability theory and mathematical statistics, operational research, mathematical analysis, numerical methods, etc. Popular ML methods, such as artificial neural networks (ANN), support vector machines (SVM), decision trees, random forest (RF), among others, have generated models that can be considered as straightforward applications of optimization theory and statistical estimation. The wide arsenal of classical statistical approaches combined with powerful ML techniques allows many challenging and practical problems to be solved. This Special Issue belongs to the section “Mathematics and Computer Science”. Its aim is to establish a brief collection of carefully selected papers presenting new and original methods, data analyses, case studies, comparative studies, and other research on the topic of statistical data modeling and ML as well as their applications. Particular attention is given, but is not limited, to theories and applications in diverse areas such as computer science, medicine, engineering, banking, education, sociology, economics, among others. The resulting palette of methods, algorithms, and applications for statistical modeling and ML presented in this Special Issue is expected to contribute to the further development of research in this area. 
We also believe that the new knowledge acquired here as well as the applied results are attractive and useful for young scientists, doctoral students, and researchers from various scientific specialties