1,158 research outputs found

    Weighted k-Nearest-Neighbor Techniques and Ordinal Classification

    In the field of statistical discrimination, k-nearest neighbor classification is a well-known, easy, and successful method. In this paper we present an extended version of this technique in which the distances of the nearest neighbors are taken into account. In this sense there is a close connection to LOESS, a local regression technique. In addition, we show how nearest neighbor methods can be used for classification when the classes have an ordinal structure. Empirical studies show the advantages of the new techniques.
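
The idea of letting neighbor distances influence the vote can be sketched as follows. This is a minimal illustration and not the method of the paper: the inverse-distance weighting, the function name `weighted_knn_predict`, and the `eps` parameter are all assumptions introduced here.

```python
import math
from collections import defaultdict


def weighted_knn_predict(train_points, train_labels, query, k=3, eps=1e-9):
    """Classify `query` by a distance-weighted vote among its k nearest neighbors.

    Assumption: inverse-distance weights, chosen for illustration only.
    """
    # Pair each training label with its Euclidean distance to the query.
    scored = [(math.dist(p, query), label)
              for p, label in zip(train_points, train_labels)]
    # Keep the k closest training points.
    neighbors = sorted(scored, key=lambda pair: pair[0])[:k]
    # Closer neighbors contribute larger votes; eps avoids division by zero.
    votes = defaultdict(float)
    for dist, label in neighbors:
        votes[label] += 1.0 / (dist + eps)
    return max(votes, key=votes.get)
```

For example, with training points (0, 0) labeled "a" and (2, 0), (2, 1) labeled "b", a query at (0.1, 0) with k=3 is assigned "a" by this sketch, whereas an unweighted majority vote over the same three neighbors would pick "b".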

    Critical analysis of information: an epidemiologic perspective

    Presentation. A slideshow presentation discussing methods for critically evaluating medical literature and observational studies.

    The Study of a Generalized Fitness Education Program's Effect on Personality Traits

    This longitudinal study measured the influence of a generalized fitness education program on the percentile scores of personality traits in college-aged individuals. The personality traits are those defined in the Five Factor Model of Personality, a tool used in exercise psychology that covers a broad domain of personality traits often referred to as the “Big Five”. The findings reject the hypothesis that a fitness education program would shift an individual’s personality by a statistically significant amount.

    A Bayesian Approach to Learning Hidden Markov Model Topology with Applications to Biological Sequence Analysis

    Hidden Markov Models (HMMs) are a widely and successfully used tool in statistical modeling and statistical pattern recognition. One fundamental problem in the application of HMMs is finding the underlying architecture or topology, particularly when there is no strong evidence from the application domain (e.g., when doing black box modeling). Topology matters both for good parameter estimates and for performance: a model with “too many” states, and hence too many parameters, requires too much training data, while a model with “not enough” states prevents the HMM from capturing subtle statistical patterns. We have developed a novel algorithm that, given sequence data originating from an ergodic process, infers an HMM, its topology and its parameters. We introduce a Bayesian approach

    Spatial probit models for multivariate ordinal data: computational efficiency and parameter identifiability

    2013 Summer. Includes bibliographical references. The Colorado Natural Heritage Program (CNHP) at Colorado State University evaluates Colorado's rare and at-risk species and habitats and promotes conservation of biological resources. One of the goals of the program is to determine the condition of wetlands across the state of Colorado. The data collected are measurements, or metrics, representing landscape condition, biotic condition, hydrologic condition, and physiochemical condition in river basins statewide. The metrics differ in variable type, including binary, ordinal, count, and continuous response data. It is common practice to uniformly discretize the metrics into ordinal values and combine them using a weighted average to obtain a univariate measure of wetland condition. The weights assigned to each metric are based on best professional judgement. The motivation of this work was to improve on the user-defined weights by developing a statistical model to estimate the weights using observed data. The challenges of creating a model that fulfills this requirement are many. First, the observed data are multivariate and consist of different variable types, which we wish to preserve. Second, the multivariate response data are not independent across river basins because wetlands in close proximity are correlated. Third, we want the model to provide a univariate measure of wetland condition that can be compared across the state. Lastly, it is of interest to the ecologists to predict the univariate measure of wetland condition at unobserved locations, which requires covariate information to be incorporated into the model. We propose a multivariate multilevel latent variable model to address these challenges. Latent continuous response variables are used to model the different types of response variables. An additional latent variable, or common factor, is used as a univariate measure of wetland condition.
The mean of the common factor contains observable covariate data in order to predict at unobserved locations. The variance of the common factor is defined by a spatial covariance function to account for the dependence between wetlands. The majority of the metrics reported by the CNHP are ordinal. Therefore, our primary focus is modeling multivariate ordinal response data where binary data is a special case. Probit linear models and probit linear mixed models are examples of models for ordinal response data. Probit models are attractive in that they can be defined in terms of latent variables. Computational efficiency is a major issue when fitting multivariate latent variable models in a Bayesian framework using Markov chain Monte Carlo (MCMC). There is also a high computation cost for running MCMC when fitting geostatistical spatial models. Data augmentation and parameter expansion are both modeling techniques that can lead to optimal iterative sampling algorithms for MCMC. Data augmentation allows for simpler and more feasible simulation from a posterior distribution. Parameter expansion is a method for accelerating convergence of iterative sample algorithms and can enhance data augmentation algorithms. We propose data augmentation and parameter-expanded data augmentation algorithms for fitting MCMC to spatial probit models for binary and ordinal response data. Parameter identifiability is another challenge when fitting multivariate latent variable models due to the multivariate response data, number of parameters, unobserved latent variables, and spatial random effects. We investigate parameter identifiability for the common factor model for multivariate ordinal response data. We extend the common factor model to include covariates and spatial correlation so we can predict wetland condition at unobserved locations. The partial sill and range parameter of a spatial covariance function are difficult to estimate because they are near-nonidentifiable. 
We propose a new parameterization for the covariance function of the spatial probit model that leads to better mixing and faster convergence of the MCMC. Whereas our spatial probit model for ordinal response data follows the common factor model approach, there are other forms of the spatial probit model. We give a comprehensive comparison of two types of spatial probit models, which we refer to as the first-stage and second-stage spatial probit models. We discuss the implications of fitting each model and compare them in terms of their impact on parameter estimation and prediction at unobserved locations. We propose a new approximation for predicting ordinal response data that is both accurate and efficient. We apply the multivariate multilevel latent variable model to data collected in the North Platte and Rio Grande River Basins to evaluate wetland condition. We obtain statistically derived weights for each of the response metrics with confidence limits. Lastly, we predict the univariate measure of wetland condition at unobserved locations.

    SLIQ: Simple Linear Inequalities for Efficient Contig Scaffolding

    Scaffolding is an important subproblem in "de novo" genome assembly in which mate pair data are used to construct a linear sequence of contigs separated by gaps. Here we present SLIQ, a set of simple linear inequalities derived from the geometry of contigs on the line that can be used to predict the relative positions and orientations of contigs from individual mate pair reads and thus produce a contig digraph. The SLIQ inequalities can also filter out unreliable mate pairs and can be used as a preprocessing step for any scaffolding algorithm. We tested the SLIQ inequalities on five real data sets ranging in complexity from simple bacterial genomes to complex mammalian genomes and compared the results to the majority voting procedure used by many other scaffolding algorithms. SLIQ predicted the relative positions and orientations of the contigs with high accuracy in all cases and gave more accurate position predictions than majority voting for complex genomes, in particular the human genome. Finally, we present a simple scaffolding algorithm that produces linear scaffolds given a contig digraph. We show that our algorithm is very efficient compared to other scaffolding algorithms while maintaining high accuracy in predicting both contig positions and orientations for real data sets. Comment: 16 pages, 6 figures, 7 tables

    Doctor of Philosophy

    Dissertation. Caffeinated and fructose-rich beverages are widely consumed among women of reproductive age, but their association with reproductive hormones is not well understood, due in part to inadequate exposure assessment. Our objectives were to 1) assess the relationship between caffeine and fructose intake and reproductive hormones in healthy premenopausal women, evaluating potential effect modification by race; and 2) determine the validity of the Food Frequency Questionnaire (FFQ) for measuring monthly caffeinated beverage intake compared to multiple 24-hour dietary recalls (24HDR). The BioCycle Study (2005-2007) prospective cohort (n=259) included women, ages 18-44, who were followed for 2 menstrual cycles, providing fasting blood specimens at up to 8 visits per cycle, 4 24HDRs per cycle, and an FFQ at the end of each cycle. Caffeine intake ≥200 mg/day was inversely associated with free estradiol (E2) concentrations among white women (β=-0.15 [95% confidence interval (CI): -0.26, -0.05]) and positively associated among Asian women (β=0.61 [95% CI: 0.31, 0.92]) after taking into account potential confounders. Women who consumed more added sugar than an average American woman (≥ 73.2 grams/day) or above the 66th percentile in fructose intake (≥ 41.5 grams/day) had elevated free E2 concentrations compared to women who consumed less. Women who consumed ≥1 cup/day of sweetened soda had elevated free E2 (β=0.15 [95% CI: 0.06, 0.24]). Neither artificially sweetened soda intake nor fruit juice intake ≥1 cup/day was significantly associated with reproductive hormones. Caffeine intake reported in the FFQ was greater than that reported in the 24HDRs (mean=114.1 versus 92.6 mg/day; P=0.006) despite high correlation (r=0.80, P<0.001) and moderate agreement (kappa=0.56, 95% CI: 0.42-0.70). In summary, moderate caffeine consumption was associated with reduced E2 among white women and elevated E2 among Asian women.
Added sugars, total fructose, and sweetened soda were associated with elevated E2 among all races. Further research on the association between caffeine, caffeinated beverage components and reproductive hormones, and whether these relationships differ by race, is warranted. Additionally, although caffeine exposures are highly correlated, absolute intakes differ significantly between measurement tools, highlighting the importance of considering potential misclassification of caffeine exposure when conducting women's health epidemiologic studies.