
    DM-PhyClus: A Bayesian phylogenetic algorithm for infectious disease transmission cluster inference

    Background. Conventional phylogenetic clustering approaches rely on arbitrary cutpoints applied a posteriori to phylogenetic estimates. Although Bayesian and bootstrap-based clustering tend to lead to similar estimates in practice, they often produce conflicting measures of confidence in clusters. The current study proposes a new Bayesian phylogenetic clustering algorithm, which we refer to as DM-PhyClus, that identifies sets of sequences resulting from quick transmission chains, thus yielding easily interpretable clusters without using any ad hoc distance or confidence requirement. Results. Simulations reveal that DM-PhyClus can outperform conventional clustering methods, as well as the Gap procedure, a purely distance-based algorithm, in terms of mean cluster recovery. We apply DM-PhyClus to a sample of real HIV-1 sequences, producing a set of clusters whose inference is in line with the conclusions of a previous thorough analysis. Conclusions. By eliminating the need for cutpoints and producing sensible inference for cluster configurations, DM-PhyClus can facilitate transmission cluster detection. Future efforts to reduce the incidence of infectious diseases like HIV-1 will need reliable estimates of transmission clusters, so algorithms like DM-PhyClus could serve to better inform public health strategies.
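The "arbitrary cutpoint" baseline that DM-PhyClus avoids can be made concrete with a minimal sketch: conventional approaches link sequences into one transmission cluster whenever their pairwise genetic distance falls below an a-posteriori threshold. The distance values and the 0.05 threshold below are illustrative assumptions, not from the paper.

```python
def cutpoint_clusters(dist, threshold):
    """Single-linkage clustering from a pairwise distance dict {(i, j): d}."""
    nodes = {n for pair in dist for n in pair}
    parent = {n: n for n in nodes}

    def find(x):  # union-find with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (i, j), d in dist.items():
        if d < threshold:          # the ad hoc cutpoint applied a posteriori
            parent[find(i)] = find(j)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(map(sorted, clusters.values()))

pairwise = {("A", "B"): 0.01, ("B", "C"): 0.02, ("C", "D"): 0.15}
print(cutpoint_clusters(pairwise, threshold=0.05))  # [['A', 'B', 'C'], ['D']]
```

Shifting the threshold reshuffles the clusters, which is exactly the sensitivity that motivates a cutpoint-free Bayesian formulation.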

    Clustering of Trading Activity in the DAX Index Options Market

    Trades in DAX index options with identical maturities cluster around particular classes of strike prices. For example, options with strikes ending in 50 are less traded than options with strikes ending in 00. Clustering is higher when options with close strike prices are good substitutes. The degree of substitution between options with neighboring strikes depends on the strike price grid and the options' characteristics. Using regression analysis, we analyze the relation between clustering, grid size, and the options' characteristics. To our knowledge this paper is the first to explore how the grid size of strike prices affects options' trading volume.
    Keywords: Clustering, Incidental Truncation, Index Options, Volume

    Effect of breastfeeding on gastrointestinal infection in infants: A targeted maximum likelihood approach for clustered longitudinal data

    The PROmotion of Breastfeeding Intervention Trial (PROBIT) cluster-randomized a program encouraging breastfeeding to new mothers in hospital centers. The original studies indicated that this intervention successfully increased the duration of breastfeeding and lowered rates of gastrointestinal tract infections in newborns. Additional scientific and popular interest lies in determining the causal effect of longer breastfeeding on gastrointestinal infection. In this study, we estimate the expected infection count under various lengths of breastfeeding in order to estimate the effect of breastfeeding duration on infection. Due to the presence of baseline and time-dependent confounding, specialized "causal" estimation methods are required. We demonstrate the double-robust method of Targeted Maximum Likelihood Estimation (TMLE) in the context of this application, review some related methods, and describe the adjustments required to account for clustering. We compare TMLE (implemented both parametrically and using a data-adaptive algorithm) to other causal methods for this example. In addition, we conduct a simulation study to determine (1) the effectiveness of controlling for clustering indicators when cluster-specific confounders are unmeasured and (2) the importance of using data-adaptive TMLE.
    Comment: Published at http://dx.doi.org/10.1214/14-AOAS727 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).

    Sound Event Detection with Sequentially Labelled Data Based on Connectionist Temporal Classification and Unsupervised Clustering

    Sound event detection (SED) methods typically rely on either strongly labelled or weakly labelled data. As an alternative, sequentially labelled data (SLD) has been proposed: in SLD, the events and their order within an audio clip are known, but not their occurrence times. This paper proposes a connectionist temporal classification (CTC) based SED system that uses SLD instead of strongly labelled data, with a novel unsupervised clustering stage. Experiments on 41 classes of sound events show that the proposed two-stage method trained on SLD achieves performance comparable to the previous state-of-the-art SED system trained on strongly labelled data, and far exceeds another state-of-the-art SED system trained on weakly labelled data, indicating the effectiveness of the proposed two-stage method trained on SLD without any onset/offset times of sound events.
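The CTC decoding rule underlying the abstract's setup can be sketched in a few lines: a frame-wise label path is collapsed by merging consecutive repeats and removing blanks, which is what lets a CTC-trained model learn from event order alone, with no onset/offset times. The event labels below are illustrative.

```python
BLANK = "-"  # the CTC blank symbol separating repeated events

def ctc_collapse(path):
    """Collapse a per-frame label path into an ordered event sequence."""
    out = []
    prev = None
    for label in path:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return out

# 9 frames -> ordered event sequence; all timing information is discarded
print(ctc_collapse(["-", "dog", "dog", "-", "dog", "siren", "siren", "-", "-"]))
# ['dog', 'dog', 'siren']
```

Note how the blank between the second and third "dog" frames is what allows two distinct dog-bark events to survive the collapse.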

    Non-Compositional Term Dependence for Information Retrieval

    Modelling term dependence in IR aims to identify co-occurring terms that are too heavily dependent on each other to be treated as a bag of words, and to adapt the indexing and ranking accordingly. Dependent terms are predominantly identified using lexical frequency statistics, assuming that (a) if terms co-occur often enough in some corpus, they are semantically dependent; and (b) the more often they co-occur, the more semantically dependent they are. This assumption is not always correct: the frequency of co-occurring terms can be decoupled from the strength of their semantic dependence. For example, "red tape" might be less frequent overall than "tape measure" in some corpus, but this does not mean that "red"+"tape" are less dependent than "tape"+"measure". This is especially the case for non-compositional phrases, i.e. phrases whose meaning cannot be composed from the individual meanings of their terms (such as "red tape" meaning bureaucracy). Motivated by this lack of distinction between the frequency and strength of term dependence in IR, we present a principled approach for handling term dependence in queries, using both lexical frequency and semantic evidence. We focus on non-compositional phrases, extending a recent unsupervised model for their detection [21] to IR. Our approach, integrated into ranking using Markov Random Fields [31], yields effectiveness gains over competitive TREC baselines, showing that there is still room for improvement in the very well-studied area of term dependence in IR.
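The frequency-vs-dependence distinction drawn above can be illustrated with a toy calculation: raw co-occurrence counts rank "tape measure" over "red tape", while pointwise mutual information (PMI), one simple association score, ranks them the other way. The corpus counts here are invented for illustration and are not from the paper.

```python
import math

def pmi(pair_count, count_a, count_b, total):
    """Pointwise mutual information of a term pair in a corpus of `total` tokens."""
    p_ab = pair_count / total
    p_a, p_b = count_a / total, count_b / total
    return math.log2(p_ab / (p_a * p_b))

TOTAL = 1_000_000
#                 co-occurrences  freq(term1)  freq(term2)
red_tape = pmi(80, 2_000, 1_000, TOTAL)       # rarer pair, tightly bound terms
tape_measure = pmi(120, 1_000, 9_000, TOTAL)  # more frequent pair, looser binding

# "red tape" scores higher despite fewer co-occurrences
print(f"red tape     PMI = {red_tape:.2f}")
print(f"tape measure PMI = {tape_measure:.2f}")
```

This is only the frequency-side half of the story; the paper's point is that semantic evidence beyond such corpus statistics is needed to detect non-compositionality.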

    Employment status mobility from a life-cycle perspective

    In this paper we apply optimal matching techniques to individual work-histories in the British Household Panel Survey (BHPS), with a two-fold objective: first, to explore the usefulness of this sequence-oriented approach for analyzing work-histories; second, to analyze the impact of involuntary job separations on life courses. The study covers the whole range of employment statuses, including unemployment and inactivity periods, from the first job held to the year 1993. Our main findings are the following: (i) mobility in employment status increased over the twentieth century; (ii) it has become more similar between men and women; (iii) birth cohorts in the second half of the century have been especially affected by involuntary job separations; (iv) in general, involuntary job separations produce employment status sequences that differ substantially from the typical sequence in each cohort.
    Keywords: cohort analysis, employment, employment status mobility, involuntary job separations, optimal matching analysis, work-life history analysis
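The optimal matching idea applied above can be sketched as an edit-distance computation: the dissimilarity between two employment-status sequences is the minimal total cost of insertions, deletions, and substitutions turning one into the other. The unit costs and the status codes (E/U = employed/unemployed per year) are illustrative simplifications, not the paper's calibration.

```python
def om_distance(seq_a, seq_b, indel=1.0, sub=2.0):
    """Optimal matching distance via dynamic programming (Levenshtein-style)."""
    m, n = len(seq_a), len(seq_b)
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = i * indel
    for j in range(1, n + 1):
        dist[0][j] = j * indel
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if seq_a[i - 1] == seq_b[j - 1] else sub
            dist[i][j] = min(dist[i - 1][j] + indel,      # deletion
                             dist[i][j - 1] + indel,      # insertion
                             dist[i - 1][j - 1] + cost)   # substitution / match
    return dist[m][n]

# A stable career vs. one interrupted by an involuntary separation
print(om_distance("EEEEEE", "EEUUEE"))  # 4.0
```

Pairwise distances like this one are what downstream clustering uses to group life courses into typical trajectories.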

    Recovering complete and draft population genomes from metagenome datasets.

    Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite the computational cost of application to deeply sequenced complex metagenomes (e.g., soil), exploiting covarying patterns of contig coverage across multiple datasets significantly improves the binning process. We also discuss and compare current genome validation methods and examine how they tackle the problem of chimeric genome bins, i.e., bins containing sequences from multiple species. Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on genome-wide evolution.
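The coverage-covariation idea summarized above can be sketched as follows: contigs whose read-depth profiles rise and fall together across multiple samples likely come from the same genome and are placed in the same bin. The greedy single-linkage grouping, the depth profiles, and the 0.95 correlation threshold are all illustrative assumptions, far simpler than real binning tools.

```python
import math

def pearson(a, b):
    """Pearson correlation between two coverage profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sd_a = math.sqrt(sum((x - ma) ** 2 for x in a))
    sd_b = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (sd_a * sd_b)

def bin_by_coverage(profiles, threshold=0.95):
    """Greedy single-linkage binning on coverage correlation."""
    bins = []
    for contig, prof in profiles.items():
        for b in bins:
            if any(pearson(prof, profiles[c]) > threshold for c in b):
                b.append(contig)
                break
        else:
            bins.append([contig])
    return bins

coverage = {                      # read depth across 4 metagenome samples
    "contig1": [10, 50, 5, 80],
    "contig2": [12, 48, 6, 77],   # covaries with contig1 -> same bin
    "contig3": [90, 8, 60, 4],    # different abundance pattern -> new bin
}
print(bin_by_coverage(coverage))  # [['contig1', 'contig2'], ['contig3']]
```

Real binners additionally combine coverage with composition signals such as tetranucleotide frequencies, which is what helps flag the chimeric bins discussed above.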

    Developing Efficient Strategies For Global Sensitivity Analysis Of Complex Environmental Systems Models

    Complex Environmental Systems Models (CESMs) have been developed and applied as vital tools to tackle the ecological, water, food, and energy crises that humanity faces, and have been used widely to support decision-making about management of the quality and quantity of Earth’s resources. CESMs are often controlled by many interacting and uncertain parameters, and typically integrate data from multiple sources at different spatio-temporal scales, which makes them highly complex. Global Sensitivity Analysis (GSA) techniques have proven promising for deepening our understanding of model complexity and interactions between various parameters, and for providing helpful recommendations for further model development and data acquisition. Aside from the complexity issue, the computationally expensive nature of CESMs precludes effective application of existing GSA techniques in quantifying the global influence of each parameter on the variability of CESM outputs, because a comprehensive sensitivity analysis often requires a very large number of model runs. There is therefore a need to break down this barrier by developing more efficient strategies for sensitivity analysis. The research undertaken in this dissertation focuses on alleviating the computational burden of GSA for computationally expensive CESMs through efficiency-increasing strategies for robust sensitivity analysis. This is accomplished by: (1) proposing an efficient sequential sampling strategy for robust sampling-based analysis of CESMs; (2) developing an automated parameter grouping strategy for high-dimensional CESMs; (3) introducing a new robustness measure for convergence assessment of GSA methods; and (4) investigating time-saving strategies for handling simulation failures/crashes during the sensitivity analysis of computationally expensive CESMs.
This dissertation provides a set of innovative numerical techniques that can be used in conjunction with any GSA algorithm and be integrated into model building and systems analysis procedures in any field where models are used. A range of analytical test functions and environmental models with varying complexity and dimensionality are utilized across this research to test the performance of the proposed methods. These methods, which are embedded in the VARS–TOOL software package, can also provide information useful for diagnostic testing, parameter identifiability analysis, model simplification, model calibration, and experimental design. They can be further applied to a range of decision-making problems, such as characterizing the main causes of risk in the context of probabilistic risk assessment and exploring the sensitivity of CESMs to a wide range of plausible future changes (e.g., hydrometeorological conditions) in the context of scenario analysis.
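The sampling-based GSA setting described above can be illustrated with a minimal one-at-a-time sketch in the spirit of Morris elementary effects: each parameter is perturbed individually from random base points, and the mean absolute effect ranks parameter influence. The toy model, perturbation size, and trajectory count are illustrative assumptions, not the dissertation's methods.

```python
import math
import random

def model(x):
    """Toy model: x[0] dominates, x[1] is moderate, x[2] is inert."""
    return 10 * x[0] + math.sin(math.pi * x[1]) + 0.0 * x[2]

def elementary_effects(func, dim, n_traj=50, delta=0.1, seed=1):
    """Mean absolute one-at-a-time effect of each parameter on the output."""
    rng = random.Random(seed)
    totals = [0.0] * dim
    for _ in range(n_traj):
        base = [rng.uniform(0.0, 1.0 - delta) for _ in range(dim)]
        f0 = func(base)
        for i in range(dim):
            pert = list(base)
            pert[i] += delta                       # perturb one parameter
            totals[i] += abs(func(pert) - f0) / delta
    return [t / n_traj for t in totals]

mu = elementary_effects(model, dim=3)
print([round(m, 2) for m in mu])  # x[0] ranks highest, x[2] is exactly zero
```

Each of dim + 1 model runs per trajectory yields one effect per parameter, which is precisely the cost structure (runs scaling with dimensionality) that the dissertation's sequential sampling and parameter grouping strategies aim to reduce.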