77 research outputs found

    UNSUPERVISED LEARNING IN PHYLOGENOMIC ANALYSIS OVER THE SPACE OF PHYLOGENETIC TREES

    A phylogenetic tree represents the evolutionary history of species or other entities. Phylogenomics is a new field at the intersection of phylogenetics and genomics, and it is well known that statistical learning methods are needed to handle and analyze the large amounts of data that new technologies can generate relatively cheaply. Building on existing Markov models, we introduce a new method, CURatio, to identify outliers in a given gene data set. This method, intrinsically unsupervised, can find outliers among thousands of genes or more. This ability to analyze large numbers of genes (even with missing information) makes it unique among many parametric methods. At the same time, the exploration of statistical analysis in the high-dimensional space of phylogenetic trees has never stopped, and many tree metrics have been proposed for statistical methodology. The tropical metric is one of them. We implement an MCMC sampling method to estimate the principal components in a tree space with the tropical metric, achieving dimension reduction and visualizing the result in a 2-D tropical triangle
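As a rough illustration of the tropical metric mentioned in this abstract, the sketch below computes the standard tropical distance between two vectors, d(u, v) = max_i(u_i - v_i) - min_i(u_i - v_i). It assumes each tree has already been encoded as a vector of pairwise leaf distances; the function name `tropical_distance` and that encoding are illustrative assumptions, not taken from the thesis.

```python
def tropical_distance(u, v):
    """Tropical distance between two equal-length vectors.

    Assumption: u and v are vectors of pairwise leaf distances for two
    trees on the same leaf set (an illustrative encoding).
    """
    if len(u) != len(v) or len(u) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(u, v)]
    # The distance is the spread of the coordinate-wise differences.
    return max(diffs) - min(diffs)
```

Because only the spread of the differences matters, adding the same constant to every coordinate of one vector leaves the distance unchanged, which is why this metric is usually defined on vectors modulo constant shifts.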

    Modeling Mosquito Activity Built on Mosquito Population Dynamics: A Simulation Study

    Background: West Nile virus (WNv) continues to be one of the most destructive mosquito-borne diseases in the world, and Saskatchewan has experienced the highest incidence rates for WNv in North America. Its primary transmitters are mosquitoes, with Culex tarsalis serving as the main vector in Saskatchewan. For this reason, mosquito population dynamics are an important determinant of WNv risk. Weather factors, in turn, exert a pronounced impact on mosquito populations. It is important to understand the environmental factors that play a crucial role in oscillations of the mosquito population. It is also important to construct a model or method that can monitor and accurately estimate the overall dynamics of the mosquito population. Methods: In this study, a Probability Generating model is developed to simulate the mosquito observation counts, making use of a pre-existing System Dynamics Model to simulate a mosquito population. An MCMC method was further used to draw samples from a posterior distribution for Bayesian inference and to analyse how the frequency of observation of mosquito trap counts can improve the performance of our model or method. Purpose of study: This study mainly focuses on investigating the feasibility of estimating the regression coefficients of the logistic regression model for the parameters (β) by using the proposed computational method. We also compare the performance of this method under different sampling frequencies. 
Results: The results of the Probability Generating model depict the seasonal distribution of the simulated observation data (y_i) over our study region (city of Saskatoon), which suggests that the environmental variables have a significant effect in driving variations in mosquito populations under the simulation experiments; the results of the three different sampling frequencies suggest that the current frequency (weekly) of measuring counts of trapped mosquitoes is insufficient for reliable estimation of the parameters (β) for the durations examined. Conclusion: In this study, we formulated a probabilistic model from a combination of a reasonably complex dynamic model and a probabilistic generating model. Additionally, we have investigated how the frequency of collecting real-world data relates to the accuracy of the model and revealed the importance of sampling the mosquito population every day for reliably estimating parameter values, rather than following the standard of sampling the mosquito population every week

    Bayesian Recurrent Neural Network Models for Forecasting and Quantifying Uncertainty in Spatial-Temporal Data

    Recurrent neural networks (RNNs) are nonlinear dynamical models commonly used in the machine learning and dynamical systems literature to represent complex dynamical or sequential relationships between variables. More recently, as deep learning models have become more common, RNNs have been used to forecast increasingly complicated systems. Dynamical spatio-temporal processes represent a class of complex systems that can potentially benefit from these types of models. Although the RNN literature is expansive and highly developed, uncertainty quantification is often ignored. Even when considered, the uncertainty is generally quantified without the use of a rigorous framework, such as a fully Bayesian setting. Here we attempt to quantify uncertainty in a more formal framework while maintaining the forecast accuracy that makes these models appealing, by presenting a Bayesian RNN model for nonlinear spatio-temporal forecasting. Additionally, we make simple modifications to the basic RNN to help accommodate the unique nature of nonlinear spatio-temporal data. The proposed model is applied to a Lorenz simulation and two real-world nonlinear spatio-temporal forecasting applications

    Spatio-temporal modelling and mapping of malaria in Angola.

    Master of Science in Statistics. University of KwaZulu-Natal, Pietermaritzburg 2017. About half of the world's population is at risk of contracting malaria. The growing number of malaria cases and deaths due to this disease in Africa has become a major challenge to the public health care sector. Malaria is reported to be the primary cause of mortality in Angola, hence major focus needs to be put on intervention and prevention methods that reduce the disease risk and mortality to a level at which it is no longer a public risk problem. The common risk factors for malaria can be linked to environmental, socio-economic, demographic and climatic factors. The mortality rate due to malaria in Angola is analyzed using spatial disease mapping models. Such models are widely used to study disease incidence, the spatial distribution of diseases, and prediction of disease outcomes, and also to inform intervention strategies in various regions across the world. The methodology used is based on the Bayesian hierarchical modelling (BHM) framework. Four models, namely the Poisson-gamma model, the Poisson log-normal model, the conditional autoregressive (CAR) model and the convolution model, were used to study the relative risk of malaria mortality at provincial level in Angola using the National malaria control programme data from the period of 2003 to 2010. The Deviance Information Criterion was applied to compare the models and select the best-fitting one. A total of 109,320 deaths due to malaria were observed during the period of 2003-2010 in Angola. The lowest crude death rate was estimated as 124.14 per 100,000 in Lunda Sul province and the highest was 1583.63 per 100,000 in Luanda province. The results revealed that, of the four fitted models, the convolution model, fitted to the data with both spatially structured and unstructured random effects, performed better than the other three models. 
The structured and unstructured random effects were used to capture variation of risk specific to a province and across provinces respectively. The risk maps revealed variation of risk among provinces with very high relative risks in the South-East parts of Angola. A full Bayesian approach was also applied to perform spatial and spatio-temporal modeling of malaria prevalence in Angola among children under 5 years using the 2006-2007 and 2011 Angola malaria indicator survey (AMIS) data. The Bayesian logistic model was applied in the spatio-temporal analysis to investigate the relationship between malaria prevalence and some reported socio-economic and demographic factors for data collected over the years 2006-2007 and 2011. The space-time effect of the association between malaria and these factors has practical implications for informing strategies for malaria control. In addition to varying over time, the risk factors were also found to vary spatially. The study found that there was a significant difference in the effects of socio-economic and demographic variations on malaria between these two time periods. Wealth has a negative relationship with malaria prevalence, while age was found to have a positive linear relationship with malaria prevalence, which indicates that this covariate plays an important role in contracting malaria. Children living in urban areas and those who had bed nets were less likely to contract malaria than those who lived in rural areas and those who did not have bed nets. The temporal analysis shows that the prevalence of malaria was lower in 2011 than in the 2006-2007 period
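For readers unfamiliar with the model-selection criterion named in this abstract, the Deviance Information Criterion is commonly written as below. This is the textbook definition, given here as background; the abstract does not state which variant the thesis computed. The symbols $y$ (data), $\theta$ (model parameters) and $\bar{\theta}$ (posterior mean of $\theta$) are notation introduced for this note.

```latex
\begin{aligned}
D(\theta) &= -2 \log p(y \mid \theta), \\
\bar{D} &= \mathbb{E}_{\theta \mid y}\!\left[ D(\theta) \right], \\
p_D &= \bar{D} - D(\bar{\theta}), \\
\mathrm{DIC} &= \bar{D} + p_D .
\end{aligned}
```

Among candidate models fitted to the same data, the one with the smallest DIC is preferred; $p_D$ acts as a penalty for effective model complexity.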

    Development of a Dynamic Linear Model Procedure for Quantifying Long-term Trends in Atmospheric Time Series

    With satellite remote sensing instruments, global data records of various atmospheric species, spanning considerable periods of time, have been produced. These data provide insight into atmospheric processes and the evolution of our atmosphere, and statistical analysis of them is essential. One thing in particular that we often wish to know is the long-term trend in a species concentration on the order of decades. This is important because it allows us to monitor changes in our atmosphere: changes that can be traced back to human activity, giving us feedback on how we are affecting the atmosphere, or changes from natural phenomena, such as volcanic eruptions. In this thesis, a statistical procedure is developed for modelling atmospheric remote sensing data records, with particular emphasis placed on the ability to extract accurate, informative estimates of the long-term trend. Procedures operating on the same principles have been used in the past for time series analysis in general, for example on economic time series, as well as on atmospheric remote sensing data records and other atmospheric data. In this thesis, we show the theory behind the procedure in detail and describe how to implement and use it in practice. This is done with the intent of making the rather complicated procedure more accessible, so that it can be more widely adopted by scientists working with atmospheric remote sensing data if desired, and compared to current methods for obtaining long-term trends. As an example application, we apply the procedure to a stratospheric ozone data record that extends from 1984 to present (2019). Ozone is a species of considerable interest, both because we know without a doubt that the changing chlorine situation in the atmosphere due to human activity has a significant effect on it and because of its importance in absorbing ultraviolet radiation, which can seriously harm life on the Earth. 
The results we give paint a detailed picture of the long-term trends in stratospheric ozone concentration in the 65°S to 65°N latitude region

    A combined model of statistical downscaling and latent process multivariable spatial modeling of precipitation extremes

    Future projections of extreme precipitation can help engineers and scientists with infrastructure design projects and risk assessment studies. Extreme events are usually represented as return levels, which are equivalent to upper percentiles of an extreme value distribution such as the Generalized Pareto distribution, which is used for exceedances above a certain threshold. My dissertation focuses on uncertainty quantification related to estimation of future return levels for precipitation at the local (weather station) to regional level. Variance reduction is achieved through spatial modeling and optimally combining suites of climate model outputs. The main contribution is a unified statistical model that combines the variance reduction methods with a latent model statistical downscaling technique. The dissertation is presented in three chapters: (I) Single-Location Bayesian Estimation of Generalized Pareto Distribution (GPD); (II) Multiple-Location Bayesian Estimation of GPD with a Spatial Latent Process; and (III) Spatial Combining of Multiple Climate Model Outputs and Downscaling for Projections of Future Extreme Precipitation
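To make the link between return levels and upper percentiles of the Generalized Pareto distribution concrete, here is a minimal sketch using the standard textbook formula for the N-year return level under a threshold-exceedance model. The function name, parameter names and example values are illustrative assumptions; the abstract does not describe the dissertation's own estimation code, which is Bayesian and would produce a posterior distribution over these quantities rather than a single number.

```python
import math


def gpd_return_level(u, sigma, xi, zeta_u, n_per_year, years):
    """N-year return level for exceedances over threshold u.

    u          : threshold
    sigma, xi  : GPD scale (> 0) and shape parameters
    zeta_u     : probability that an observation exceeds u
    n_per_year : number of observations per year
    years      : return period N in years
    All names are hypothetical, chosen for this sketch.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    m = years * n_per_year * zeta_u  # expected exceedances in N years
    if m <= 0:
        raise ValueError("years * n_per_year * zeta_u must be positive")
    if abs(xi) < 1e-12:
        # Limit of the formula as xi -> 0 (exponential tail).
        return u + sigma * math.log(m)
    return u + (sigma / xi) * (m ** xi - 1.0)
```

For example, `gpd_return_level(30.0, 8.0, 0.1, 0.05, 365, 100)` would give a 100-year return level for daily data with a 5% exceedance rate over a threshold of 30; these numbers are made up for illustration.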

    The genetic heritage of China: A genomic study of PR China based on nine representative ethnic populations

    During the course of the last decade, genetic data have increasingly complemented linguistic, archaeological and palaeontological evidence in efforts to reconstruct human history. As technology has developed, studies have utilised genomic techniques in tracing the origins and migratory patterns of modern humans. East Asia is a particular hotspot of human migration, especially Mainland China, where a large number of human fossils have been unearthed and more than 20% of the world's population now resides. There are 56 officially recognised ethnic groups (minzu) within the population of PR China, which totals 1,300 million. The majority Han population is distributed throughout the country and forms 90% of the total, whereas the other 55 minority populations mostly live in peripheral and boundary regions. To date, information on these minorities has been fragmentary and, from both evolutionary and historical perspectives, data on their genetic profiles would be of considerable value in identifying their founding populations and genetic inter-relationships. There are also strongly conflicting opinions on the origins of the Han and the degree to which they can be regarded as genetically homogeneous. The current study measured the genetic diversity and ancestry of nine ethnic populations resident in PR China. In addition to the Han, these study populations comprised the Miao, Yao, Kucong and Tibetan communities from Yunnan province in the southwest of the country, and four Muslim populations, the Hui, Bo'an, Dongxiang and Sala, from northern and central China. Both biparental and uniparental genetic influences on the populations were examined by the analysis of autosomal, mitochondrial and Y-chromosome markers. In general, it was found that the study populations displayed diverse paternal ancestries but more restricted maternal ancestries. 
From the Y-chromosome data in particular, major events such as the Neolithic population expansion and more recent historical events, such as migration along the Silk Road, could be inferred. Through the use of autosomal markers, aspects of the internal structure of the study populations were uncovered, such as endogamy and/or consanguinity. These conclusions were made possible, in part, by experimental likelihood-based stochastic coalescent modelling. Intriguingly, it was revealed that the Kucong of Yunnan, an ethnic group not previously surveyed for genetic diversity and not accorded official minority status within PR China, could possibly be representative of indigenous populations dating from the first migrations into East Asia. While other, more recent events could be inferred from summary statistics and phylogenetic and coalescent-based genetic analyses of the study populations, the changing definition of the ethnic study populations themselves proved to be the most important factor. It is therefore recommended that future studies primarily utilize a community-by-community approach, and not rely on the official minzu category as an accurate indicator of genetic ancestry