
    Augment-and-Conquer Negative Binomial Processes

    By developing data augmentation methods unique to the negative binomial (NB) distribution, we unite seemingly disjoint count and mixture models under the NB process framework. We develop fundamental properties of the models and derive efficient Gibbs sampling inference. We show that the gamma-NB process can be reduced to the hierarchical Dirichlet process with normalization, highlighting its unique theoretical, structural and computational advantages. A variety of NB processes with distinct sharing mechanisms are constructed and applied to topic modeling, with connections to existing algorithms, showing the importance of inferring both the NB dispersion and probability parameters. Comment: Neural Information Processing Systems, NIPS 2012
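
    The paper's full augmentation machinery is not reproduced in the abstract, but the elementary fact it builds on is that an NB count arises by marginalizing a gamma-distributed rate out of a Poisson. A minimal Python sketch of that gamma-Poisson augmentation, with illustrative parameter values (not the paper's sampler):

```python
import numpy as np

rng = np.random.default_rng(0)

def nb_via_gamma_poisson(r, p, size):
    """Draw NB(r, p) counts by augmenting with a latent gamma rate:
    lambda ~ Gamma(shape=r, scale=p/(1-p)), x | lambda ~ Poisson(lambda)."""
    lam = rng.gamma(shape=r, scale=p / (1.0 - p), size=size)
    return rng.poisson(lam)

r, p = 2.5, 0.3                       # NB dispersion and probability parameters
x_aug = nb_via_gamma_poisson(r, p, 100_000)
x_dir = rng.negative_binomial(r, 1.0 - p, 100_000)  # NumPy's p is 1 - ours

# Both should match E[x] = r p / (1 - p); the variance r p / (1 - p)^2
# exceeds the mean, which is the overdispersion NB models capture.
print(x_aug.mean(), x_dir.mean(), r * p / (1 - p))
```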

    A New MI-Based Visualization Aided Validation Index for Mining Big Longitudinal Web Trial Data

    Web-delivered clinical trials generate big, complex data. To help untangle the heterogeneity of treatment effects, unsupervised learning methods have been widely applied. However, identifying valid patterns is a priority but challenging issue for these methods. This paper, built upon our previous research on multiple imputation (MI)-based fuzzy clustering and validation, proposes a new MI-based Visualization-aided validation index (MIVOOS) to determine the optimal number of clusters for big incomplete longitudinal Web-trial data with inflated zeros. Different from a recently developed fuzzy clustering validation index, MIVOOS uses overlap and separation measures more suitable for Web-trial data and, unlike the widely used Xie and Beni (XB) index, does not depend on the choice of fuzzifier. Through optimizing the view angles of 3-D projections using Sammon mapping, the optimal 2-D projection-guided MIVOOS is obtained to better visualize and verify the patterns in conjunction with trajectory patterns. Compared with XB and VOS, our newly proposed MIVOOS shows its robustness in validating big Web-trial data under different missing-data mechanisms, using real and simulated Web-trial data.
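
    The abstract does not give the MIVOOS formula, so as a point of reference here is a minimal sketch of the baseline Xie and Beni (XB) index it is compared against, for a fuzzy partition U (n x c memberships) and cluster centers V (c x d); note the explicit dependence on the fuzzifier m, which MIVOOS reportedly avoids:

```python
import numpy as np

def xie_beni(X, U, V, m=2.0):
    """XB = sum_{i,k} u_ik^m ||x_i - v_k||^2 / (n * min_{j!=k} ||v_j - v_k||^2);
    lower values indicate compact, well-separated clusters."""
    n = X.shape[0]
    d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)   # (n, c) distances
    compactness = ((U ** m) * d2).sum()
    cd2 = ((V[:, None, :] - V[None, :, :]) ** 2).sum(axis=2)  # (c, c) separations
    np.fill_diagonal(cd2, np.inf)
    return compactness / (n * cd2.min())

# Toy usage: two well-separated blobs, memberships from inverse distances.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(3.0, 0.3, (50, 2))])
V = np.array([[0.0, 0.0], [3.0, 3.0]])
d2 = ((X[:, None, :] - V[None, :, :]) ** 2).sum(axis=2) + 1e-12
U = 1.0 / d2
U /= U.sum(axis=1, keepdims=True)
print(xie_beni(X, U, V))
```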

    Streaming visualisation of quantitative mass spectrometry data based on a novel raw signal decomposition method

    As data rates rise, there is a danger that informatics for high-throughput LC-MS becomes more opaque and inaccessible to practitioners. It is therefore critical that efficient visualisation tools are available to facilitate quality control, verification, validation, interpretation, and sharing of raw MS data and the results of MS analyses. Currently, MS data is stored as contiguous spectra. Recall of individual spectra is quick, but panoramas, zooming, and panning across whole datasets necessitate processing/memory overheads impractical for interactive use. Moreover, visualisation is challenging if significant quantification data is missing due to data-dependent acquisition of MS/MS spectra. To tackle these issues, we leverage our seaMass technique for novel signal decomposition. LC-MS data is modelled as a 2D surface through selection of a sparse set of weighted B-spline basis functions from an over-complete dictionary. By ordering and spatially partitioning the weights with an R-tree data model, efficient streaming visualisations are achieved. In this paper, we describe the core MS1 visualisation engine and the overlay of MS/MS annotations. This enables the mass spectrometrist to quickly inspect whole runs for ionisation/chromatographic issues, MS/MS precursors for coverage problems, or putative biomarkers for interferences, for example. The open-source software is available from http://seamass.net/viz/
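
    seaMass's storage format is not described beyond the abstract, but the R-tree idea can be illustrated: if each weighted B-spline coefficient has a compact support rectangle in (retention time, m/z), a spatial index lets the viewer stream only the weights overlapping the current viewport. A hedged sketch using the Python rtree package (all identifiers and numbers are illustrative, not seaMass itself):

```python
from rtree import index

idx = index.Index()
# One (id, bounding box, weight) triple per B-spline coefficient -- toy values,
# with boxes given as (rt_min, mz_min, rt_max, mz_max).
basis = [(0, (10.0, 400.0, 12.0, 401.5), 3.2),
         (1, (11.0, 700.2, 13.5, 701.0), 0.8),
         (2, (50.0, 499.9, 52.0, 501.1), 5.1)]
weights = {}
for i, bbox, w in basis:
    idx.insert(i, bbox)
    weights[i] = w

# Pan/zoom produces a viewport; only overlapping coefficients are fetched.
viewport = (9.0, 350.0, 20.0, 800.0)
visible = [(i, weights[i]) for i in idx.intersection(viewport)]
print(visible)   # coefficients 0 and 1; coefficient 2 is outside the view
```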

    Count data time series models and their applications

    “Due to the rapid development of advanced sensors, count data sets have become ubiquitous in many fields. Modeling and forecasting such time series have generated great interest. Modeling can shed light on the behavior of the count series and show how it is related to other factors, such as the environmental conditions under which the data are generated. In this research, three approaches to modeling such count data are proposed. First, a periodic autoregressive conditional Poisson (PACP) model is proposed as a natural generalization of the autoregressive conditional Poisson (ACP) model. By allowing for cyclical variations in the parameters of the model, it provides a way to explain the periodicity inherent in many count data series. For example, in epidemiology the prevalence of a disease may depend on the season. Second, the autoregressive conditional Poisson hidden Markov model (ACP-HMM) is developed to deal with count data time series whose mean, conditional on the past, is a function of previous observations, with this relationship possibly determined by an unobserved process that switches its state or regime as time progresses. This model is, in a sense, the combination of the discrete version of the autoregressive conditional heteroscedastic (ARCH) formulation and the Poisson hidden Markov model. Both of the above models address the frequently present serial correlation and the clustering of high or low counts observed in time series of count data, while at the same time allowing the underlying data-generating mechanism to change cyclically or according to a hidden Markov process. Applications to empirical data sets show that these models provide a better fit than the standard ACP models. Third, a modification of a zero-inflated Poisson model is used to analyze activity counts of the fruit fly. The model captures the dynamic structure of activity patterns and the fly's propensity to sleep. The obtained results, when fed to a convolutional neural network, provide the possibility of building a predictive model to identify fruit flies with short and long lifespans”--Abstract, page iv
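
    For readers unfamiliar with the ACP (integer-valued GARCH) recursion the thesis generalizes, the following sketch simulates it: counts are conditionally Poisson with an intensity that feeds back on past counts and intensities; the PACP variant would let the parameters vary with the season. Parameter values here are illustrative, not from the thesis:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_acp(T, omega=0.5, alpha=0.3, beta=0.4):
    """x_t | past ~ Poisson(lambda_t), with
    lambda_t = omega + alpha * x_{t-1} + beta * lambda_{t-1}."""
    x = np.zeros(T, dtype=int)
    lam = np.empty(T)
    lam[0] = omega / (1.0 - alpha - beta)   # stationary mean as starting value
    x[0] = rng.poisson(lam[0])
    for t in range(1, T):
        lam[t] = omega + alpha * x[t - 1] + beta * lam[t - 1]
        x[t] = rng.poisson(lam[t])
    return x, lam

x, lam = simulate_acp(500)
print(x.mean(), x.var())   # variance > mean: the feedback induces overdispersion
```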

    Statistical Modeling of Influenza-Like-Illness in Montana using Spatial and Temporal Methods

    Studying air pollution and public health has been a historically important question in science. It has long been hypothesized that severe air pollution leads to negative implications for basic human health. Primarily, areas that are prone to severe human-made pollution are the focus of such studies. Research relating to less populated areas is scarce, and this scarcity raises the question of how pollution dynamics (human-made and natural) influence human health in more rural areas. The aim of this study is to address this gap; in particular, we explore possible links between air pollution and influenza-like illness in Montana. We begin with a discussion of our starting hypotheses, the data we have accumulated to test them, and some exploratory analysis of these data. The body of this research is based on modeling the natural factors that influence influenza dynamics in general and how these factors apply in the state of Montana. Here, we explore different modeling approaches and how to apply them to the given data. To conclude, a summary is provided along with the implications this research has for the state of Montana.
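
    The abstract does not specify the final models, so the following is only a hedged sketch of a common starting point for seasonal count data of this kind: a Poisson GLM of weekly influenza-like-illness counts on annual harmonics plus a pollution covariate. All variable names and data are illustrative:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
weeks = np.arange(156)                      # three years of weekly observations
pm25 = 8.0 + 5.0 * rng.random(156)          # synthetic pollution covariate
X = np.column_stack([np.sin(2 * np.pi * weeks / 52),   # annual harmonics
                     np.cos(2 * np.pi * weeks / 52),
                     pm25])
eta = 1.5 + 0.8 * X[:, 0] + 0.4 * X[:, 1] + 0.05 * X[:, 2]
y = rng.poisson(np.exp(eta))                # synthetic weekly ILI counts

fit = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
print(fit.params)   # should roughly recover the coefficients above
```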

    The Effects of Inaccurate and Missing Highway-Rail Grade Crossing Inventory Data on Crash and Severity Model Estimation and Prediction

    Highway-Rail Grade Crossings (HRGCs) present a significant safety risk to motorists, pedestrians, and train passengers, as they are points where roads and railways cross at grade. Every year, HRGCs in the US experience a high number of crashes leading to injuries and fatalities. Estimation of crash and severity models for HRGCs provides insights into safety and mitigation of the risk posed by such incidents. The accuracy of these models plays a vital role in predicting future crashes at these crossings, enabling necessary safety measures to be taken proactively. In the United States, most of these models rely on the Federal Railroad Administration's (FRA) HRGC inventory database, which serves as the primary source of information for these models. However, errors or incomplete information in this database can significantly impact the accuracy of the estimated crash model parameters and subsequent crash predictions. This study examined the potential differences in the expected number and severity of crashes obtained from the FRA's 2020 Accident Prediction and Severity (APS) model when using two different input datasets for 560 HRGCs in Nebraska. The first dataset was the unaltered, original FRA HRGC inventory dataset, while the second was a field-validated inventory dataset for those 560 HRGCs. The results showed statistically significant differences in the expected number of crashes and severity predictions using the two input datasets. Furthermore, to understand how data inaccuracy impacts model estimation for crash frequency and severity prediction, two new zero-inflated negative binomial models for crash prediction and two ordered probit models for crash severity were estimated based on the two datasets. The analysis revealed significant differences in the estimated coefficient values of the base and comparison models, and different crash-risk rankings were obtained based on the two datasets. The results emphasize the importance of obtaining accurate and complete inventory data when developing HRGC crash and severity models to improve their precision and enhance their ability to predict and prevent crashes. Advisor: Aemal J. Khattak
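
    As a hedged illustration of the two model families the study estimates, the sketch below fits a zero-inflated negative binomial for crash counts and an ordered probit for severity with statsmodels; the covariates and responses are synthetic stand-ins, not the study's FRA inventory variables:

```python
import numpy as np
import pandas as pd
from statsmodels.discrete.count_model import ZeroInflatedNegativeBinomialP
from statsmodels.miscmodels.ordinal_model import OrderedModel

rng = np.random.default_rng(3)
n = 560                                   # matches the study's Nebraska sample
X = pd.DataFrame({"const": 1.0,
                  "log_aadt": np.log(rng.uniform(100, 20_000, n)),
                  "trains_per_day": rng.integers(1, 40, n).astype(float)})

# Synthetic responses purely so the sketch runs end to end.
crashes = rng.poisson(0.2, n) * rng.binomial(1, 0.3, n)        # excess zeros
severity = pd.Series(pd.Categorical(rng.choice(["PDO", "injury", "fatal"], n),
                                    categories=["PDO", "injury", "fatal"],
                                    ordered=True))

zinb = ZeroInflatedNegativeBinomialP(crashes, X, exog_infl=X[["const"]]).fit(
    method="bfgs", maxiter=500, disp=False)
# OrderedModel estimates thresholds itself, so no constant in the exog.
oprobit = OrderedModel(severity, X[["log_aadt", "trains_per_day"]],
                       distr="probit").fit(method="bfgs", disp=False)
print(zinb.params, oprobit.params, sep="\n")
```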