
    Time-varying nonstationary multivariate risk analysis using a dynamic Bayesian copula

    A time-varying risk analysis is proposed for an adaptive design framework in nonstationary conditions arising from climate change. A Bayesian, dynamic conditional copula is developed for modeling the time-varying dependence structure between mixed continuous and discrete multiattributes of multidimensional hydrometeorological phenomena. Joint Bayesian inference is carried out to fit the marginals and copula in an illustrative example using an adaptive Gibbs Markov chain Monte Carlo (MCMC) sampler. Posterior mean estimates and credible intervals are provided for the model parameters, and the Deviance Information Criterion (DIC) is used to select the model that best captures different forms of nonstationarity over time. This study also introduces a fully Bayesian, time-varying joint return period for multivariate time-dependent risk analysis in nonstationary environments. We thank the associate editor and three anonymous reviewers whose suggestions helped improve the paper. We acknowledge the CMIP5 coupled climate modelling groups for producing and making their model outputs available, and the U.S. Department of Energy's Program for Climate Model Diagnosis and Intercomparison (PCMDI), which provides coordinating support and led development of software infrastructure in partnership with the Global Organization for Earth System Science Portals. The CMIP5 model outputs used in the present study are available from http://cmip-pcmdi.llnl.gov/cmip5/data_portal.html. We also thank the Iran Meteorological Organization (IRIMO) for providing rainfall data recorded at the Tehran synoptic station. Funding support was provided by the Natural Sciences and Engineering Research Council (NSERC) of Canada.
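    As an illustrative sketch of the kind of quantity such an analysis produces (not the paper's actual model), the snippet below computes a time-varying joint "AND" return period from a Gaussian copula whose dependence parameter drifts over time; the Gaussian copula, the linear trend in the dependence parameter, and the marginal quantiles are assumptions made purely for illustration:

```python
# Illustrative sketch (not the paper's exact model): a Gaussian copula with a
# time-varying dependence parameter rho(t), used to compute a time-varying
# joint "AND" return period  T(t) = mu / (1 - u - v + C_t(u, v)).
import numpy as np
from scipy.stats import norm, multivariate_normal

def gaussian_copula_cdf(u, v, rho):
    """C(u, v; rho) = Phi_2(Phi^-1(u), Phi^-1(v); rho)."""
    z = np.array([norm.ppf(u), norm.ppf(v)])
    cov = np.array([[1.0, rho], [rho, 1.0]])
    return multivariate_normal(mean=[0.0, 0.0], cov=cov).cdf(z)

def joint_return_period_and(u, v, rho, mu=1.0):
    """Return period of exceeding both thresholds jointly, given marginal
    non-exceedance probabilities u and v and mean interarrival time mu."""
    p_exceed_both = 1.0 - u - v + gaussian_copula_cdf(u, v, rho)
    return mu / p_exceed_both

# Hypothetical nonstationary dependence: rho drifts upward over 50 years.
years = np.arange(50)
rho_t = 0.3 + 0.008 * years            # assumed linear trend, for illustration only
u, v = 0.98, 0.98                      # e.g. 50-year marginal quantiles of two attributes
T_t = [joint_return_period_and(u, v, r) for r in rho_t]
print(f"joint return period: {T_t[0]:.0f} yr (t=0)  ->  {T_t[-1]:.0f} yr (t=49)")
```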

    Proceedings of the 35th International Workshop on Statistical Modelling : July 20- 24, 2020 Bilbao, Basque Country, Spain

    466 p. The International Workshop on Statistical Modelling (IWSM) is a reference workshop for promoting statistical modelling and applications of statistics to researchers, academics, and industrialists in a broad sense. Unfortunately, the global COVID-19 pandemic did not allow the 35th edition of the IWSM to be held in Bilbao in July 2020. Despite the situation, and following the spirit of the Workshop and the Statistical Modelling Society, we are delighted to bring you this proceedings book of extended abstracts.

    Statistical and deep learning methods for geoscience problems

    Machine learning is the new frontier for technology development in geosciences and has developed extremely fast in the past decade. With the increased compute power provided by distributed computing and Graphics Processing Units (GPUs), and their exploitation by machine learning (ML) frameworks such as Keras, Pytorch, and Tensorflow, ML algorithms can now solve complex scientific problems. Although powerful, ML algorithms need to be applied to suitable, well-conditioned problems for optimal results. For this reason, applying ML requires a deep understanding not only of the problem but also of the algorithm's capabilities. In this dissertation, I show that simple statistical techniques can often outperform ML-based models if applied correctly. I also show the success of deep learning in addressing two difficult problems. In the first application I use deep learning to auto-detect leaks in a carbon capture project using pressure field data acquired from the DOE Cranfield site in Mississippi. I use the history of pressure, rates, and cumulative injection volumes to detect leaks as pressure anomalies. I use a different deep learning workflow to forecast high-energy electrons in Earth's outer radiation belt using in situ measurements of different space weather parameters such as solar wind density and pressure. I focus on predicting electron fluxes of 2 MeV and higher energy and introduce an ensemble of deep learning models to further improve the results compared to using a single deep learning architecture. I also show an example where a carefully constructed statistical approach, guided by the human interpreter, outperforms deep learning algorithms implemented by others. Here, the goal is to correlate multiple well logs across a survey area in order to map not only the thickness but also the behavior of stacked gamma-ray parasequence sets. Using tools including maximum likelihood estimation (MLE) and dynamic time warping (DTW) provides a means of generating quantitative maps of upward fining and upward coarsening across the oil field. The ultimate goal is to link such extensive well control with the spectral attribute signature of 3D seismic data volumes to provide detailed maps not only of the depositional history but also insight into the lateral and vertical variation of mineralogy important to the effective completion of shale resource plays.
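    The well-log correlation workflow above leans on dynamic time warping. As a minimal, illustrative sketch of that building block (the gamma-ray curves below are synthetic stand-ins, not the dissertation's data), the classic DTW recursion and backtracking can be written as:

```python
# Minimal dynamic time warping (DTW) sketch for aligning two well-log curves,
# e.g. gamma-ray traces from neighbouring wells. Purely illustrative; the
# dissertation's actual workflow (MLE + DTW on field data) is more involved.
import numpy as np

def dtw_align(a, b):
    """Return the DTW cost matrix and optimal warping path between 1-D series a and b."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    # Backtrack the optimal path from (n, m) to (1, 1).
    path, i, j = [], n, m
    while (i, j) != (1, 1):
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    path.append((0, 0))
    return cost[1:, 1:], path[::-1]

# Hypothetical gamma-ray logs: the second well is a stretched, noisy copy of the first.
depth = np.linspace(0, 1, 120)
log_a = np.sin(6 * depth) + 0.05 * np.random.randn(120)
log_b = np.sin(6 * depth**1.2) + 0.05 * np.random.randn(120)
cost, path = dtw_align(log_a, log_b)
print(f"DTW distance: {cost[-1, -1]:.2f}, path length: {len(path)}")
```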

    Flexible estimation of temporal point processes and graphs

    Handling complex data types with spatial structures, temporal dependencies, or discrete values is generally a challenge in statistics and machine learning. In recent years, there has been an increasing need for methodological and theoretical work to analyse non-standard data types, for instance data collected on protein structures, gene interactions, social networks, or physical sensors. In this thesis, I will propose a methodology and provide theoretical guarantees for analysing two general types of discrete data emerging from interactive phenomena, namely temporal point processes and graphs. On the one hand, temporal point processes are stochastic processes used to model event data, i.e., data that comes as discrete points in time or space where some phenomenon occurs. Some of the most successful applications of these discrete processes include online messages, financial transactions, earthquake strikes, and neuronal spikes. The popularity of these processes notably comes from their ability to model unobserved interactions and dependencies between temporally and spatially distant events. However, statistical methods for point processes generally rely on estimating a latent, unobserved, stochastic intensity process. In this context, designing flexible models and consistent estimation methods is often a challenging task. On the other hand, graphs are structures made of nodes (or agents) and edges (or links), where an edge represents an interaction or relationship between two nodes. Graphs are ubiquitous for modelling real-world social, transport, and mobility networks, where edges can correspond to virtual exchanges, physical connections between places, or migrations across geographical areas. Besides, graphs are used to represent correlations and lead-lag relationships between time series, and local dependence between random objects. Graphs are typical examples of non-Euclidean data, where adequate distance measures, similarity functions, and generative models need to be formalised. In the deep learning community, graphs have become particularly popular within the field of geometric deep learning. Structure and dependence can both be modelled by temporal point processes and graphs, although predominantly the former act on the temporal domain while the latter conceptualise spatial interactions. Nonetheless, some statistical models combine graphs and point processes in order to account for both spatial and temporal dependencies. For instance, temporal point processes have been used to model the birth times of edges and nodes in temporal graphs. Moreover, some multivariate point process models have a latent graph parameter governing the pairwise causal relationships between the components of the process. In this thesis, I will notably study such a model, called the Hawkes model, as well as graphs evolving in time. This thesis aims at designing inference methods that provide flexibility in the contexts of temporal point processes and graphs. This manuscript is presented in an integrated format, with four main chapters and two appendices. Chapters 2 and 3 are dedicated to the study of Bayesian nonparametric inference methods in the generalised Hawkes point process model. While Chapter 2 provides theoretical guarantees for existing methods, Chapter 3 also proposes, analyses, and evaluates a novel variational Bayes methodology. The other main chapters introduce and study model-free inference approaches for two estimation problems on graphs, namely spectral methods for the signed graph clustering problem in Chapter 4, and a deep learning algorithm for the network change-point detection task on temporal graphs in Chapter 5. Additionally, Chapter 1 provides an introduction and background preliminaries on point processes and graphs. Chapter 6 concludes this thesis with a summary, a critical discussion of the works in this manuscript, and proposals for future research. Finally, the appendices contain two supplementary papers. The first one, in Appendix A, initiated after the COVID-19 outbreak in March 2020, is an application of a discrete-time Hawkes model to COVID-related death counts during the first wave of the pandemic. The second work, in Appendix B, was conducted during an internship at Amazon Research in 2021 and proposes an explainability method for anomaly detection models acting on multivariate time series.
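    For readers unfamiliar with the Hawkes model studied in Chapters 2 and 3, the sketch below simulates a univariate Hawkes process with an exponential excitation kernel via Ogata's thinning algorithm; the exponential kernel and the parameter values are illustrative choices, not the generalised model analysed in the thesis:

```python
# Sketch of a univariate Hawkes process with exponential excitation kernel,
# simulated by Ogata's thinning algorithm. Parameters are arbitrary and only
# illustrate the self-exciting conditional intensity
#   lambda(t) = mu + sum_{t_i < t} alpha * exp(-beta * (t - t_i)).
import numpy as np

def hawkes_intensity(t, events, mu, alpha, beta):
    """Conditional intensity at time t given past event times."""
    past = events[events < t]
    return mu + alpha * np.exp(-beta * (t - past)).sum()

def simulate_hawkes(mu, alpha, beta, horizon, seed=None):
    """Ogata thinning: propose from a constant upper bound, accept with prob lambda/bound."""
    rng = np.random.default_rng(seed)
    events, t = [], 0.0
    while t < horizon:
        # intensity just after t plus one extra jump is always a valid upper bound
        bound = hawkes_intensity(t, np.array(events), mu, alpha, beta) + alpha
        t += rng.exponential(1.0 / bound)
        if t >= horizon:
            break
        if rng.uniform() <= hawkes_intensity(t, np.array(events), mu, alpha, beta) / bound:
            events.append(t)
    return np.array(events)

events = simulate_hawkes(mu=0.5, alpha=0.8, beta=1.5, horizon=100.0, seed=0)
print(f"{len(events)} events; empirical rate ~ {len(events) / 100.0:.2f} "
      f"(stationary theory: mu / (1 - alpha/beta) = {0.5 / (1 - 0.8 / 1.5):.2f})")
```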

    Quantitative Risk Analysis using Real-time Data and Change-point Analysis for Data-informed Risk Prediction

    Incidents in highly hazardous process industries (HHPI) are a major concern for various stakeholders due to their impact on human lives, the environment, and potentially huge financial losses. Because process activities, locations, and products are unique, risk analysis techniques applied in the HHPI have evolved over the years. Unfortunately, limitations of the various quantitative risk analysis (QRA) methods currently employed mean that alternative or improved methods are required. This research develops one such method, called the Big Data QRA Method. This method relies entirely on big data techniques and real-time process data to identify the point at which process risk is imminent and to provide the extent of contribution of other components interacting up to the time index of the risk. Unlike the existing QRA methods, which are static and based on unvalidated assumptions and data from single case studies, the big data method is dynamic and can be applied to most process systems. This alternative method is my original contribution to science and the practice of risk analysis. The detailed procedure, provided in Chapter 9 of this thesis, applies multiple change-point analysis and other big data techniques such as (a) time series analysis, (b) data exploration and compression techniques, (c) decision tree modelling, and (d) linear regression modelling. Since the distributional properties of process data can change over time, the big data approach was found to be more appropriate. Considering the unique conditions, activities, and process systems used within the HHPI, the dust fire and explosion incidents at the Imperial Sugar Factory and at New England Wood Pellet LLC, both of which occurred in the USA, were found to be suitable case histories to use as a guide for the evaluation of data in this research. Data analysis was performed using open-source software packages in R Studio. Based on the investigation, the multiple change-point analysis packages strucchange and changepoint were found to be successful at detecting early signs of deteriorating conditions of components in process equipment as well as the main process risk. One such component is a bearing, which was suspected as the source of ignition that led to the dust fire and explosion at the Imperial Sugar Factory. As a result, this research applies the big data QRA procedure to bearing vibration data to predict early deterioration of bearings and the period when a bearing's performance begins its final phase of deterioration to failure. Model-based identification of these periods provides an indication of whether the condition of a mechanical part in process equipment at a particular moment represents an unacceptable risk. The procedure starts with the selection of process operation data based on the findings of an incident investigation report on the case history of a known process incident. As the defining components of risk, both the frequency and the consequences associated with the risk were obtained from the incident investigation reports. Acceptance criteria for the risk can be applied to the periods between the risks detected by the two change-point packages. The method was validated with two case study datasets to demonstrate its applicability as a procedure for QRA, and then tested with two other case study datasets as examples of its application as a QRA method. The insight obtained from the validation and the applied examples led to the conclusion that big data techniques can be applied to real-time process data for risk assessment in the HHPI.
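    The thesis performs its change-point analysis in R with the strucchange and changepoint packages; as a language-agnostic illustration of the underlying idea only (not the thesis's procedure), the sketch below locates a single mean shift in a synthetic bearing-vibration RMS series:

```python
# Minimal single change-point sketch: locate the most likely mean shift in a
# synthetic bearing-vibration RMS series by minimising the pooled within-segment
# variance. The thesis itself uses the R packages strucchange and changepoint;
# this is only an illustrative analogue of the same idea.
import numpy as np

def best_mean_shift(x):
    """Return the split index that minimises the summed within-segment squared error."""
    n = len(x)
    best_k, best_cost = None, np.inf
    for k in range(2, n - 1):                      # require at least 2 points per segment
        left, right = x[:k], x[k:]
        cost = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k

# Synthetic vibration RMS: stable operation, then a step up when the bearing degrades.
rng = np.random.default_rng(42)
healthy = rng.normal(loc=1.0, scale=0.1, size=300)
degraded = rng.normal(loc=1.6, scale=0.25, size=120)
rms = np.concatenate([healthy, degraded])
print("estimated change point at sample", best_mean_shift(rms))   # expected near 300
```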

    Mining a large shopping database to predict where, when, and what consumers will buy next

    Retailers with electronic point-of-sale systems continuously amass detailed data about the items each consumer buys (i.e. what item, how often, its package size, how many were bought, whether the item was on special, etc.). Where the retailer can also associate purchases with a particular individual, for example when an account or loyalty card is issued, the buying behaviour of the consumer can be tracked over time, providing the retailer with valuable information about a customer's changing preferences. This project is based on mining a large database, containing the purchase histories of some 300 000 customers of a retailer, for insights into the behaviour of those customers. Specifically, the aim is to build three predictive models, each forming a chapter of the dissertation: forecasting the number of customers that visit a store each day, detecting changes in consumers' inter-purchase times, and predicting which customers will become repeat customers after being given a special offer. Having too many goods and not enough customers implies a loss for a business; having too few goods implies a lost opportunity to turn a profit. The ideal situation is to stock the appropriate number of goods for the number of customers arriving, so as to minimize loss and maximize profit. To address this problem, in the first chapter we forecast the number of customers that will visit a store each day to buy any product (i.e. daily store visits). In the process we also carry out a comparison of time-series forecasting methods, with the main aim of comparing machine learning methods to classical statistical methods. The models are fitted to univariate time-series data and the best model for this particular dataset is selected using three accuracy measures. The results showed that there was not much difference between the methods, but some classical methods performed slightly better than the machine learning algorithms, which is consistent with the outcomes obtained by Makridakis et al. (2018) in similar comparisons. It is also vital for retailers to know when there has been a change in their consumers' purchase behaviour. This change can be in the time between purchases, in brand selection, or in market share. It is critical for such changes to be detected as early as possible, as speedy detection can help managers act before incurring losses. In the second chapter, we use change-point models to detect changes in consumers' inter-purchase times. Change-point models offer a flexible, general-purpose solution to the problem of detecting changes in customers' historic behaviour. The multiple change-point model assumes that there is a sequence of underlying parameters, and that this sequence is partitioned into contiguous blocks such that the parameter values are equal within blocks and different between blocks, where the beginning of a block is considered to be a change point. This change-point model is fitted to consumers' inter-purchase times (i.e. we model the time between purchases) to see whether there were any significant changes in the consumers' buying behaviour over a one-year purchase period. The results showed that, depending on the length of the sequences, only a handful to a minority of customers experience changes in their purchasing behaviour, with longer sequences having more changes than shorter ones. The results seemed different from those obtained by Clark and Durbach (2014), but analysing a portion of sequences of the same lengths as those analysed in Clark and Durbach (2014) led to similar results. Increasing sales growth is also vital for retailers, and there are various possible ways in which this can be achieved. One strategy is what is referred to as up-selling (whereby a customer is persuaded to make an additional purchase of the same product or to purchase a more expensive version of it) and cross-selling (whereby a retailer sells a different product or service to an existing customer). These involve campaigning to customers to sell certain products, sometimes with incentives included in the campaign, with the aim of exposing customers to these products in the hope that they will become repeat customers afterwards. In Chapter 3 we build a model to predict which customers are likely to become repeat customers after being given a special offer. This model is fitted to customers' times between purchases, so the input is time-series data and sequential in nature. We therefore build models that provide a good way of dealing with sequential inputs (i.e. convolutional neural networks and recurrent neural networks), and compare them to models that do not take the sequence of the data into account (i.e. feedforward neural networks and decision trees). The results showed that inter-purchase times are only useful when they are about the same product, as models did no better than random if inter-purchase times were from a different product in the same department. Secondly, it is useful to take the order of the sequence into account, as models that do this do better than those that do not, with the latter doing no better than a null model. Lastly, while none of the models performed well, deep learning models performed better than standard classification models and produced some substantial lift.
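    As a sketch of the shape of the Chapter 1 comparison (the daily-visits series, the particular methods, and the accuracy measure below are illustrative assumptions, not the dissertation's data or exact model set), one can pit a classical exponential-smoothing forecast against a lag-feature random forest:

```python
# Shape of a classical-vs-ML forecasting comparison on synthetic data:
# Holt-Winters exponential smoothing versus a random forest on lagged values,
# both forecasting daily store visits and scored by mean absolute error (MAE).
# The series, features, and split are hypothetical, not the thesis's dataset.
import numpy as np
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
days = np.arange(730)
visits = 500 + 80 * np.sin(2 * np.pi * days / 7) + 0.1 * days + rng.normal(0, 25, days.size)
train, test = visits[:700], visits[700:]

# Classical method: additive weekly seasonality with a trend component.
hw = ExponentialSmoothing(train, trend="add", seasonal="add", seasonal_periods=7).fit()
hw_pred = hw.forecast(len(test))

# ML baseline: predict today's visits from the previous 7 days, forecast recursively.
lags = 7
X = np.stack([visits[i - lags:i] for i in range(lags, 700)])
y = visits[lags:700]
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
history, rf_pred = list(train[-lags:]), []
for _ in range(len(test)):
    pred = rf.predict(np.array(history[-lags:]).reshape(1, -1))[0]
    rf_pred.append(pred)
    history.append(pred)

mae = lambda p: np.mean(np.abs(np.asarray(p) - test))
print(f"MAE  Holt-Winters: {mae(hw_pred):.1f}   Random forest: {mae(rf_pred):.1f}")
```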

    Musculoskeletal Load Exposure Estimation by Non-supervised Annotation of Events on Motion Data

    There is a significant number of work stressors that promote the incidence of musculoskeletal disorders in industrial environments. Since, unfortunately, many workplace conditions are subject to these biomechanical hazards, musculoskeletal disorders have become an extremely common health problem. To properly adjust intervention strategies, an ergonomic assessment through surveillance measurements is required. However, most measurements still depend on subjective assessment tools such as self-reporting and expert observation. The ideal approach in this scenario would be to use direct measurements, in which sensors retrieve more precise and accurate information about how workers interact with their work environment. One of the major constraints of this approach is that systematic retrieval of data from a labor environment would require a tiresome process of analysis and manual annotation, diverting resources and requiring data analysts. Hence, this work proposes an unsupervised methodology able to automatically annotate relevant events from direct acquisitions, with the final intent of promoting this type of analysis. The event detection methodology targets three different event types: 1) work period transitions; 2) work cycle transitions; and 3) sub-sequence matching by query. To achieve this, the multivariate time series are represented as a self-similarity matrix built from the extracted features, and this matrix is analysed for each type of event to be searched. The results were successful in the segmentation of Active and Non-active working periods and in the detection of points of transition between repetitive human motions, i.e. work cycles. A search-by-example method is also presented, which allows the user to detect specific motions of interest. Although this method could still be further optimized in future work, the approach is very promising as it proposes a strategy of similarity analysis that has not yet been deeply explored in the context of ergonomic acquisitions. These advances are also significant given that the summarization of ergonomic data is still an expanding research area.
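    A minimal sketch of the self-similarity idea described above: windowed features of a synthetic motion signal are compared pairwise to form a self-similarity matrix, and work-period transitions are scored with a checkerboard (novelty) kernel slid along the diagonal. The signal, the features, and the checkerboard kernel are illustrative assumptions rather than the acquisition setup or exact method used in the thesis:

```python
# Sketch of the self-similarity idea: build a self-similarity matrix (SSM) over
# windowed features of a motion signal, then score transition points with a
# checkerboard (novelty) kernel slid along the diagonal. Signal and parameters
# are synthetic and illustrative, not the thesis's acquisition setup.
import numpy as np

def window_features(signal, win):
    """Per-window features: RMS amplitude and mean absolute sample-to-sample change."""
    n = len(signal) // win
    w = signal[: n * win].reshape(n, win)
    rms = np.sqrt((w ** 2).mean(axis=1))
    roughness = np.abs(np.diff(w, axis=1)).mean(axis=1)
    return np.column_stack([rms, roughness])

def self_similarity(features):
    """Cosine self-similarity matrix of the feature rows."""
    norm = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.clip(norm, 1e-12, None)
    return unit @ unit.T

def novelty_curve(ssm, half=8):
    """Checkerboard-kernel novelty: high where the SSM changes block structure."""
    kernel = np.kron(np.array([[1, -1], [-1, 1]]), np.ones((half, half)))
    nov = np.zeros(len(ssm))
    for i in range(half, len(ssm) - half):
        nov[i] = (kernel * ssm[i - half:i + half, i - half:i + half]).sum()
    return nov

# Synthetic "motion" signal: an active, repetitive period between two rest periods.
rng = np.random.default_rng(7)
rest = rng.normal(0, 0.05, 2000)
work = np.sin(np.linspace(0, 60 * np.pi, 4000)) + rng.normal(0, 0.05, 4000)
signal = np.concatenate([rest, work, rest])

ssm = self_similarity(window_features(signal, win=50))
nov = novelty_curve(ssm)
print("strongest transitions near windows:", np.argsort(nov)[-2:])  # expected near 40 and 120
```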