374 research outputs found

    Integrating snp data and imputation methods into the DNA methylation analysis framework

    DNA methylation is a widely studied epigenetic modification that can influence the expression and regulation of functional genes, especially those related to aging, cancer, and other diseases. The common goal of methylation studies is to find differences in methylation levels between samples collected under different conditions. Differences can be detected at the site level, but regulated methylation targets are most commonly clustered into short regions; thus, identifying differentially methylated regions (DMRs) between groups is of prime interest. Despite advanced technology that enables genome-wide measurement of methylation, readings can be misinterpreted due to the existence of single nucleotide polymorphisms (SNPs) in the target sequence. For this reason, one of the main pre-processing steps in DMR detection methods is filtering out potentially SNP-affected probes. This work proposes leveraging the current trend of collecting both SNP and methylation data on the same individuals, making it possible to integrate SNP data into the DNA methylation analysis framework. Probes that would otherwise be filtered out can then be restored when no SNP is actually present. Furthermore, when a SNP is present or other missing-data issues arise, imputation methods are proposed for the methylation data. First, regularized linear regression (ridge, LASSO, and elastic net) imputation models are proposed, along with a variable screening technique to restrict the number of variables in the models. Functional principal component regression imputation is also proposed as an alternative approach. The proposed imputation methods are compared to existing methods and evaluated on imputation accuracy and DMR detection ability using both real and simulated data. One of the proposed methods (elastic net with variable screening) achieves strong imputation accuracy without sacrificing computational efficiency across a variety of settings, while greatly increasing the number of true positive DMR detections.
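    As one illustration of the elastic-net-with-screening idea described above (a hedged sketch, not the thesis's exact procedure), the following uses scikit-learn with hypothetical stand-in data: neighbouring CpG sites predict a target site whose value must be imputed, with a correlation screen restricting the model to the most relevant predictors.

        import numpy as np
        from sklearn.linear_model import ElasticNet

        # Hypothetical stand-in data: rows are samples, columns are CpG sites
        # (methylation beta values); the first column is the site to impute.
        rng = np.random.default_rng(0)
        methyl = rng.uniform(0.0, 1.0, size=(100, 500))
        y = methyl[:, 0]              # target site (observed here; masked in practice)
        X = methyl[:, 1:]             # neighbouring sites as candidate predictors

        # Variable screening: keep only the k predictors most correlated with
        # the target, restricting the number of variables in the model.
        k = 50
        cors = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
        keep = np.argsort(cors)[-k:]

        # Elastic-net imputation model fitted on the screened predictors.
        model = ElasticNet(alpha=0.01, l1_ratio=0.5)
        model.fit(X[:, keep], y)
        y_imputed = model.predict(X[:, keep])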

    Implementation of an hybrid machine learning methodology for pharmacological modeling

    Master's thesis, Bioinformática e Biologia Computacional (Bioinformática), Universidade de Lisboa, Faculdade de Ciências, 2017. Nowadays, especially in the biomedical field, data sets often contain thousands of multi-source variables but only a few instances. Machine learning approaches therefore face two problems: the integration of heterogeneous data and feature selection. This work proposes an efficient solution to this question and provides a functional implementation of a hybrid methodology. The inspiration originated from the AstraZeneca-Sanger Drug Combination Prediction DREAM Challenge of 2016 and the winning solution by Yuanfang Guan. Regarding the motivation of the competition, combinatory cancer treatments are believed to be more effective than standard single-agent therapies, since they have the potential to overcome each other's weaknesses (narrow spectrum of action and development of resistance). However, the combined effect of drugs is not obvious, possibly producing an additive, synergistic, or antagonistic treatment result. Thus, the goal of the competition was to predict in vitro compound synergy without access to experimental combinatory-therapy data. Within the competition, multi-source files were supplied, encompassing pharmacological knowledge from experiments and equation fitting, information on the chemical properties and structure of the drugs, and the molecular profiles of the cell lines, including RNA expression, copy-number variants, DNA sequence, and methylation. The winning work included a very successful approach to heterogeneous data integration, extending the model with prior knowledge from The Cancer Cell Line Encyclopedia and introducing a key simulation step that imitates the effect of a combinatory therapy on cancer. Despite unclear descriptions and poor documentation of the winning solution, a reproduction of Guan's approach, as faithful as possible, was accomplished. The functional implementation was written in R and Python, and its performance was verified using the prediction matrix submitted in the challenge as a reference. To improve the methodology, a feature-selection workflow was established and run using the Lasso algorithm. Moreover, the performance of two alternative modeling methods was evaluated: Support Vector Machines and Multivariate Adaptive Regression Splines (MARS). Several versions of the merging equation were considered, allowing the determination of apparently optimal coefficients. As a result, an understanding of the best challenge solution was developed and the functional implementation was successfully constructed. The proposed improvements showed that the SVM algorithm surpasses the others in solving this problem, established the best-performing merging equation, and yielded a list of the 75 most informative molecular variables. Among those genes, potential cancer biomarker candidates may be found.
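    As a minimal sketch of the improved pipeline named above, Lasso-based feature selection followed by an SVM model, the following uses random stand-ins for the molecular features and synergy scores; the shapes and parameters are assumptions, not the thesis's settings.

        import numpy as np
        from sklearn.linear_model import LassoCV
        from sklearn.svm import SVR

        rng = np.random.default_rng(1)
        X = rng.normal(size=(200, 1000))                            # stand-in molecular features
        y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=200)  # stand-in synergy scores

        # Step 1: Lasso feature selection -- keep variables with nonzero coefficients.
        lasso = LassoCV(cv=5).fit(X, y)
        selected = np.flatnonzero(lasso.coef_)

        # Step 2: fit the SVM regressor on the selected features only.
        svm = SVR(kernel="rbf", C=1.0).fit(X[:, selected], y)
        print(len(selected), svm.score(X[:, selected], y))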

    Automated data inspection in jet engines

    Rolls-Royce accumulates a large amount of sensor data throughout the testing and deployment of its engines. This rich source of data offers exciting opportunities to automate the monitoring and testing of the engines. In this thesis we develop statistical models to draw meaningful insights from engine test data. We have built a classification model to identify different types of engine running in Pass-Off tests. The labels can be used for post-analysis and to highlight problematic engine tests. The model has been applied to two different types of engine, for which it gives close to perfect classification accuracy. We have also created an unsupervised approach for cases where there are no predefined classes of engine running. These models have been incorporated into Rolls-Royce systems. Early warnings of potential issues can enable relatively cheap maintenance to be performed and reduce the risk of irreparable engine damage. We have therefore developed an outlier detection model to identify abnormal temperature behaviour. The capabilities of the model are shown theoretically and tested on experimental and real data. Lastly, during a test, engineers make decisions to ensure the engine complies with certain standards. To support the engineers, we have developed a predictive model to identify segments of the engine test that should be retested. The model is tested against the engineers' current decision making and gives good predictive performance. The model highlights the possibility of automating the decision-making process within a test.
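    The abstract does not name the specific outlier model; purely as an illustration of flagging abnormal temperature behaviour, a rolling z-score over a temperature channel is one simple possibility.

        import numpy as np

        def rolling_zscore_outliers(temps, window=50, threshold=4.0):
            """Flag samples whose deviation from the mean of the preceding
            window exceeds `threshold` rolling standard deviations."""
            flags = np.zeros(len(temps), dtype=bool)
            for i in range(window, len(temps)):
                seg = temps[i - window:i]
                mu, sd = seg.mean(), seg.std()
                if sd > 0 and abs(temps[i] - mu) > threshold * sd:
                    flags[i] = True
            return flags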

    Automatic Extraction of Ordinary Differential Equations from Data: Sparse Regression Tools for System Identification

    Studying nonlinear systems across engineering, physics, economics, biology, and chemistry often hinges upon successfully discovering their underlying dynamics. However, despite the abundance of data in today's world, a complete comprehension of these governing equations often remains elusive, posing a significant challenge. Traditional system identification methods for building mathematical models to describe these dynamics can be time-consuming, error-prone, and limited by data availability. This thesis presents three comprehensive strategies to address these challenges and automate model discovery. The procedures outlined here employ classic statistical and machine learning methods, such as signal filtering, sparse regression, bootstrap sampling, Bayesian inference, and unsupervised learning algorithms, to capture complex and nonlinear relationships in data. Building on these foundational techniques, the proposed processes offer a reliable and efficient approach to identifying models of ordinary differential equations from data, differing from and complementing existing frameworks. The results presented here provide rigorous benchmarking against state-of-the-art algorithms, demonstrating the proposed methods' effectiveness in model discovery and highlighting the potential for discovering governing equations across applications such as weather forecasting, chemical reaction and electrical circuit modelling, and predator-prey dynamics. These methods can aid in solving critical decision-making problems, including optimising resource allocation, predicting system failures, and facilitating adaptive control in various domains. Ultimately, the strategies developed in this thesis are designed to integrate seamlessly into current workflows, thereby promoting data-driven decision-making and enhancing understanding of complex system dynamics.
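    For concreteness, the following is a compact sketch of sequentially thresholded least squares, the standard sparse-regression step behind SINDy-style model discovery; the thesis's own procedures differ, and the candidate-term library theta and derivative estimates dxdt are assumed to be precomputed.

        import numpy as np

        def stlsq(theta, dxdt, threshold=0.1, iters=10):
            """Sequentially thresholded least squares: zero out small
            coefficients, then refit on the surviving candidate terms."""
            xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
            for _ in range(iters):
                small = np.abs(xi) < threshold
                xi[small] = 0.0
                for k in range(dxdt.shape[1]):
                    big = ~small[:, k]
                    if big.any():
                        xi[big, k] = np.linalg.lstsq(theta[:, big], dxdt[:, k], rcond=None)[0]
            return xi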

    Monthly Paleostreamflow Reconstruction from Annual Tree-Ring Chronologies

    Paleoclimate reconstructions are increasingly used to characterize annual climate variability prior to the instrumental record, to improve estimates of climate extremes, and to provide a baseline for climate-change projections. To date, paleoclimate records have seen limited engineering use to estimate hydrologic risks because water systems models and managers usually require streamflow input at the monthly scale. This study explores the hypothesis that monthly streamflows can be adequately modeled by statistically decomposing annual flow reconstructions. To test this hypothesis, a multiple linear regression model for monthly streamflow reconstruction is presented that expands the set of predictors to include annual streamflow reconstructions, reconstructions of global circulation, and potential differences among regional tree-ring chronologies related to tree species and geographic location. This approach is used to reconstruct 600 years of monthly streamflows at two sites on the Bear and Logan rivers in northern Utah. Nash-Sutcliffe Efficiencies remain above zero (0.26–0.60) for all months except April, and Pearson's correlation coefficients (R) are 0.94 and 0.88 for the Bear and Logan rivers, respectively, confirming that the model can adequately reproduce monthly flows during the reference period (10/1942 to 9/2015). Incorporating a flexible transition between the previous and concurrent annual reconstructed flows was the most important factor for model skill. Expanding the model to include global climate indices and regional tree-ring chronologies produced smaller, but still significant, improvements in model fit. The model presented here is the only approach currently available to reconstruct monthly streamflows directly from tree-ring chronologies and climate reconstructions, rather than by resampling the observed record. With reasonable estimates of monthly flow that extend back many centuries, water managers can challenge systems models with a larger range of natural variability in drought and pluvial events and better evaluate extreme events with recurrence intervals longer than the observed record. Establishing this natural baseline is critical when estimating future hydrologic risks under conditions of a non-stationary climate.
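    For reference, the Nash-Sutcliffe Efficiency cited above compares the model's squared error to the variance of the observed flows; values above zero mean the reconstruction outperforms simply using the mean observed flow:

        \mathrm{NSE} = 1 - \frac{\sum_t \left(Q_t^{\mathrm{obs}} - Q_t^{\mathrm{sim}}\right)^2}{\sum_t \left(Q_t^{\mathrm{obs}} - \bar{Q}^{\mathrm{obs}}\right)^2}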

    Characterizing silicate materials via Raman spectroscopy and machine learning: Implications for novel approaches to studying melt dynamics

    Silicate melt characteristics exert a dramatic influence over igneous processes that operate, or have operated, on differentiated bodies such as the Earth and Mars. Current understanding of these melt properties, such as composition, comes primarily from investigations of their volcanic byproducts. It is therefore imperative to develop modalities capable of constraining melt information in environments where laboratory methods are unavailable. Recent investigations have turned to Raman spectroscopy and amorphous volcanics as a suitable pairing for exploring these ideas: silicate glasses are a proxy for igneous melts, and Raman spectroscopy is a robust analytical technique capable of operating in situ. Existing calibrations for retrieving geochemical information from such samples using their Raman data are extremely underdeveloped, with only a handful of approaches available. Here, two supervised machine learning algorithms, Partial Least Squares (PLS) and the Least Absolute Shrinkage and Selection Operator (LASSO), are employed with Raman spectroscopy to quantify geochemical information in volcanic glasses and tephra, while also characterizing the underlying atomic mechanics that drive Raman signal variability. This approach establishes a foundation for future explorations into new modeling technologies for geoscience experiments. Chapter I's PLS geochemical model predicted the concentrations of oxide constituents in synthetic silicate glasses (SiO2, Na2O, K2O, CaO, TiO2, Al2O3, FeOT, MgO) with greater accuracy and applicability than currently available offerings; the study presents the largest and most diverse sample suite yet used to produce such models. Chapter II highlights the limitations of PLS- and LASSO-based strategies for constraining iron (Fe) redox information in glasses but uncovers their ability to accurately predict glass structural parameters such as polymerization (NBO/T). Chapter III yielded accurate predictions of tephra concentrations in various mixed-sediment samples using PLS and LASSO calibrations. Spectral parameterizations showed that tephra signatures are distinctive enough to be readily distinguished from more crystalline profiles using Raman spectroscopy and machine learning. PLS and LASSO are shown to be suitable, yet immature, avenues for unraveling the geochemical underpinnings of the Raman collections made in this work, and they help set the stage for future applications to Raman data from planetary missions such as the Perseverance Rover.
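    A minimal scikit-learn sketch of the PLS calibration setup described above, with random stand-ins for the Raman spectra and the eight oxide targets; the component count and array shapes are assumptions, not values from the study.

        import numpy as np
        from sklearn.cross_decomposition import PLSRegression

        rng = np.random.default_rng(2)
        spectra = rng.normal(size=(120, 1024))  # stand-in spectra (samples x wavenumber bins)
        oxides = rng.uniform(size=(120, 8))     # stand-in concentrations of the eight oxides

        # Fit the multivariate PLS calibration and predict oxide concentrations.
        pls = PLSRegression(n_components=10).fit(spectra, oxides)
        predicted = pls.predict(spectra)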

    Semiparametric and Nonparametric Methods in Econometrics

    The main objective of this workshop was to bring together mathematical statisticians and econometricians who work in the field of nonparametric and semiparametric statistical methods. Nonparametric and semiparametric methods are active fields of research in econometric theory and are becoming increasingly important in applied econometrics, because the flexibility of non- and semiparametric modelling provides important new ways to investigate problems in substantive economics. Moreover, the development of non- and semiparametric methods suited to the needs of economics presents a variety of mathematical challenges. Topics addressed in the workshop included nonparametric methods in finance, identification and estimation of nonseparable models, nonparametric estimation under the constraints of economic theory, statistical inverse problems, long-memory time series, and nonparametric cointegration.

    Real-time Data Analytics for Condition Monitoring of Complex Industrial Systems

    Modern industrial systems are now fitted with several sensors for condition monitoring. This is advantageous because these sensors provide vast amounts of data that have the potential to aid in tasks such as fault detection, diagnosis, and prognostics. However, the information valuable for performing these tasks is often clouded in noise and must be mined from high-dimensional data structures. Therefore, this dissertation presents a data analytics framework for performing these condition monitoring tasks using high-dimensional data. Demonstrations of this framework are detailed for challenges related to power generation systems in automobiles, power plants, and aircraft engines. These implementations leverage data collected from state-of-the-art, industry-class test rigs. Results indicate the ability of this framework to develop effective methodologies for condition monitoring of complex systems.

    Penalized estimation in high-dimensional data analysis


    Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain

    The present paper explores the technical efficiency of four hotels from the Teixeira Duarte Group, a renowned Portuguese hotel chain. An efficiency ranking of these four hotel units, located in Portugal, is established using Stochastic Frontier Analysis. This methodology makes it possible to discriminate between measurement error and systematic inefficiency in the estimation process, enabling investigation of the main causes of inefficiency. Several suggestions concerning efficiency improvement are offered for each hotel studied.
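    For context, the composed-error specification at the heart of Stochastic Frontier Analysis separates symmetric measurement noise from one-sided inefficiency (shown here in a standard production-frontier form; the paper's exact specification may differ):

        \ln y_i = x_i' \beta + v_i - u_i, \qquad v_i \sim N(0, \sigma_v^2), \quad u_i \ge 0 \ \text{(e.g., half-normal)}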