581 research outputs found

    Maximally Consistent Sampling and the Jaccard Index of Probability Distributions

    We introduce simple, efficient algorithms for computing a MinHash of a probability distribution, suitable for both sparse and dense data, with running times equivalent to the state of the art in both cases. The collision probability of these algorithms is a new measure of the similarity of positive vectors, which we investigate in detail. We describe the sense in which this collision probability is optimal for any Locality Sensitive Hash based on sampling. We argue that this similarity measure is more useful for probability distributions than the similarity pursued by other algorithms for weighted MinHash, and that it is the natural generalization of the Jaccard index.
    Comment: To appear in ICDMW 201
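    The abstract does not spell out the sampling scheme, so the sketch below is only an illustration of the general idea: an "exponential race" over hash-seeded variates is one well-known way to draw a consistent sample from a positive vector, and the probability Jaccard index J_P(p, q) = sum_i (sum_j max(p_j/p_i, q_j/q_i))^{-1} is the similarity such samplers target. The function names and the estimation loop are assumptions for illustration, not the paper's algorithm.

```python
import hashlib
import math

def _uniform(seed: str, index) -> float:
    """Deterministic pseudo-uniform in (0, 1) derived from a shared seed and an index."""
    h = hashlib.sha256(f"{seed}:{index}".encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / (2**64 + 2)

def minhash_sample(dist: dict, seed: str = "s0"):
    """Sample one index from a positive vector via an 'exponential race':
    index i gets arrival time Exp(rate = weight_i) with randomness shared across inputs."""
    best_i, best_t = None, math.inf
    for i, w in dist.items():
        if w <= 0:
            continue
        t = -math.log(_uniform(seed, i)) / w  # smaller time for heavier weight
        if t < best_t:
            best_i, best_t = i, t
    return best_i

def probability_jaccard(p: dict, q: dict) -> float:
    """J_P(p, q) = sum_i 1 / sum_j max(p_j/p_i, q_j/q_i), over indices positive in both."""
    keys = set(p) | set(q)
    total = 0.0
    for i in keys:
        if p.get(i, 0.0) <= 0 or q.get(i, 0.0) <= 0:
            continue
        denom = sum(max(p.get(j, 0.0) / p[i], q.get(j, 0.0) / q[i]) for j in keys)
        total += 1.0 / denom
    return total

# The empirical collision rate of many independent samples approximates J_P.
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.4, "b": 0.4, "d": 0.2}
hits = sum(minhash_sample(p, f"s{k}") == minhash_sample(q, f"s{k}") for k in range(2000))
print(hits / 2000, probability_jaccard(p, q))
```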

    Mutation supply and the repeatability of selection for antibiotic resistance

    Whether evolution can be predicted is a key question in evolutionary biology. Here we set out to better understand the repeatability of evolution. We explored experimentally the effect of mutation supply and the strength of selective pressure on the repeatability of selection from standing genetic variation. Mutant libraries of different sizes of the antibiotic resistance gene TEM-1 β-lactamase in Escherichia coli were subjected to different antibiotic concentrations. We determined whether populations went extinct or survived, and sequenced the TEM gene of the surviving populations. The distribution of mutations per allele in our mutant libraries, generated by error-prone PCR, followed a Poisson distribution. Extinction patterns could be explained by a simple stochastic model that assumed the sampling of beneficial mutations was key for survival. In most surviving populations, alleles containing at least one known large-effect beneficial mutation were present. These genotype data also support a model which invokes only sampling effects to describe the occurrence of alleles containing large-effect driver mutations. Hence, evolution is largely predictable given cursory knowledge of mutational fitness effects, the mutation rate and the population size. There were no clear trends in the repeatability of selected mutants when we considered all mutations present. However, when only known large-effect mutations were considered, the outcome of selection was less repeatable for large libraries, in contrast to expectations. Furthermore, we show experimentally that alleles carrying multiple mutations selected from large libraries confer higher resistance levels than alleles with only a known large-effect mutation, suggesting that the scarcity of high-resistance alleles carrying multiple mutations may contribute to the decrease in repeatability at large library sizes.
    Comment: 31 pages, 9 figures
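    As a rough illustration of the kind of sampling-only model the abstract describes (this is not the authors' fitted model; all parameter values below are made up), the following sketch estimates survival probability as the chance that a library of a given size contains at least one allele carrying a large-effect beneficial mutation, with Poisson-distributed mutation counts per allele.

```python
import numpy as np

def survival_probability(library_size: int,
                         mean_mutations: float = 2.0,
                         p_beneficial: float = 0.01,
                         replicates: int = 2000,
                         seed: int = 0) -> float:
    """Toy sampling model: each allele carries Poisson(mean_mutations) mutations,
    each mutation is a large-effect beneficial one with probability p_beneficial,
    and a population survives selection iff its library contains at least one such allele."""
    rng = np.random.default_rng(seed)
    survived = 0
    for _ in range(replicates):
        n_mut = rng.poisson(mean_mutations, size=library_size)  # mutations per allele
        has_beneficial = rng.binomial(n_mut, p_beneficial) > 0  # allele carries >= 1 beneficial mutation
        survived += has_beneficial.any()
    return survived / replicates

# Survival rises with library size (mutation supply), mirroring the extinction patterns.
for size in (10, 100, 1000, 10000):
    print(size, survival_probability(size))
```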

    ConNEcT: A Novel Network Approach for Investigating the Co-occurrence of Binary Psychopathological Symptoms Over Time

    Network analysis is an increasingly popular approach to study mental disorders in all their complexity. Multiple methods have been developed to extract networks from cross-sectional data, with these data being either continuous or binary. However, when it comes to time series data, most efforts have focused on continuous data. We therefore propose ConNEcT, a network approach for binary symptom data across time. ConNEcT allows one to visualize and study the prevalence of different symptoms as well as their co-occurrence, measured by means of a contingency measure, in a single network picture. ConNEcT can be complemented with a significance test that accounts for the serial dependence in the data. To illustrate the usefulness of ConNEcT, we re-analyze data from a study in which patients diagnosed with major depressive disorder weekly reported the absence or presence of eight depression symptoms. We first extract ConNEcTs for all patients who provided data for at least 104 weeks, revealing strong inter-individual differences in which symptom pairs co-occur significantly. Second, to gain insight into these differences, we apply Hierarchical Classes Analysis to the co-occurrence patterns of all patients, showing that they can be grouped into meaningful clusters. Core depression symptoms (i.e., depressed mood and/or diminished interest), cognitive problems, and loss of energy seem to co-occur universally, but preoccupation with death, psychomotor problems, or eating problems co-occur with other symptoms only for specific patient subgroups.
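    To make the kind of computation involved concrete, the sketch below builds a co-occurrence matrix from binary symptom series and flags links that survive a permutation test based on circular time shifts, which roughly respects serial dependence. The phi coefficient and the shift-based test are stand-ins chosen here for illustration; the actual ConNEcT method and its contingency measures may differ.

```python
import numpy as np

def phi_coefficient(x: np.ndarray, y: np.ndarray) -> float:
    """Phi coefficient for two binary series: a standard 2x2 contingency measure
    of co-occurrence (equivalent to Pearson correlation on 0/1 data)."""
    x, y = x.astype(float), y.astype(float)
    sx, sy = x.std(), y.std()
    if sx == 0 or sy == 0:
        return 0.0
    return float(((x - x.mean()) * (y - y.mean())).mean() / (sx * sy))

def cooccurrence_network(symptoms: np.ndarray, n_perm: int = 999, alpha: float = 0.05, seed: int = 0):
    """Return a symptom-by-symptom co-occurrence matrix and a significance mask.
    The null distribution uses circular time shifts instead of shuffling single
    observations, so within-series serial dependence is left intact."""
    rng = np.random.default_rng(seed)
    T, S = symptoms.shape
    links = np.zeros((S, S))
    signif = np.zeros((S, S), dtype=bool)
    for i in range(S):
        for j in range(i + 1, S):
            obs = phi_coefficient(symptoms[:, i], symptoms[:, j])
            null = [phi_coefficient(symptoms[:, i], np.roll(symptoms[:, j], rng.integers(1, T)))
                    for _ in range(n_perm)]
            p = (1 + np.sum(np.abs(null) >= abs(obs))) / (n_perm + 1)
            links[i, j] = links[j, i] = obs
            signif[i, j] = signif[j, i] = p < alpha
    return links, signif

# Example: 104 weekly reports of 8 binary symptoms (simulated data, for illustration only).
rng = np.random.default_rng(1)
data = (rng.random((104, 8)) < 0.4).astype(int)
links, signif = cooccurrence_network(data)
print(np.round(links, 2))
```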

    Configuration model for correlation matrices preserving the node strength

    Correlation matrices are a major type of multivariate data. To examine properties of a given correlation matrix, a common practice is to compare the same quantity between the original correlation matrix and reference correlation matrices, such as those derived from random matrix theory, that partially preserve properties of the original matrix. We propose a model to generate such reference correlation and covariance matrices for a given matrix. Correlation matrices are often analysed as networks, which are heterogeneous across nodes in terms of each node's total connectivity to the other nodes. Given this background, the present algorithm generates random networks that preserve the expected total connectivity of each node to the other nodes, akin to configuration models for conventional networks. Our algorithm is derived from the maximum entropy principle. We apply the proposed algorithm to the measurement of clustering coefficients and to community detection, both of which require a null model to assess the statistical significance of the obtained results.
    Comment: 8 figures, 4 tables
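    To make the comparison workflow concrete, here is a minimal sketch. It is not the authors' maximum-entropy sampler: the hypothetical helper naive_reference is a deliberately crude stand-in that permutes off-diagonal correlations, preserving their overall distribution but not each node's strength (and not positive semidefiniteness), which is exactly the gap a strength-preserving configuration-model null fills.

```python
import numpy as np

def node_strength(C: np.ndarray) -> np.ndarray:
    """Node strength in a correlation-matrix network: sum of each node's
    correlations with all other nodes (diagonal excluded)."""
    return C.sum(axis=1) - np.diag(C)

def weighted_clustering(C: np.ndarray) -> float:
    """A simple weighted clustering statistic: mean of C_ij * C_jk * C_ki over ordered
    node triples (one of several conventions; used here only as the quantity compared)."""
    A = C - np.diag(np.diag(C))
    n = A.shape[0]
    return float(np.trace(A @ A @ A) / (n * (n - 1) * (n - 2)))

def naive_reference(C: np.ndarray, rng) -> np.ndarray:
    """Naive null model: permute the off-diagonal entries while keeping symmetry.
    This does NOT preserve node strengths (or guarantee a valid correlation matrix);
    it only illustrates the comparison loop a proper null model would slot into."""
    n = C.shape[0]
    iu = np.triu_indices(n, k=1)
    R = np.eye(n)
    R[iu] = rng.permutation(C[iu])
    return R + np.triu(R, k=1).T

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 6))
C = np.corrcoef(X, rowvar=False)
obs = weighted_clustering(C)
null = np.array([weighted_clustering(naive_reference(C, rng)) for _ in range(200)])
print("node strengths:", np.round(node_strength(C), 2))
print("observed clustering:", round(obs, 4), "null mean:", round(null.mean(), 4))
```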

    Jaccard Metric Losses: Optimizing the Jaccard Index with Soft Labels

    IoU losses are surrogates that directly optimize the Jaccard index. In semantic segmentation, using IoU losses as part of the loss function has been shown to perform better with respect to the Jaccard index than optimizing pixel-wise losses such as the cross-entropy loss alone. The most notable IoU losses are the soft Jaccard loss and the Lovász-Softmax loss. However, these losses are incompatible with soft labels, which are ubiquitous in machine learning. In this paper, we propose Jaccard metric losses (JMLs), which are identical to the soft Jaccard loss in a standard setting with hard labels but are compatible with soft labels. With JMLs, we study two of the most popular use cases of soft labels: label smoothing and knowledge distillation. With a variety of architectures, our experiments show significant improvements over the cross-entropy loss on three semantic segmentation datasets (Cityscapes, PASCAL VOC and DeepGlobe Land), and our simple approach outperforms state-of-the-art knowledge distillation methods by a large margin. Code is available at: https://github.com/zifuwanggg/JDTLosses
    Comment: Submitted to ICML 2023. Code is available at https://github.com/zifuwanggg/JDTLosses
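    For context, here is a minimal sketch of the standard soft Jaccard (IoU) loss that the paper starts from; it is not the proposed JML (see the linked repository for that). The demo at the end shows the incompatibility the abstract mentions: with a soft target, the loss stays nonzero even when the prediction equals the target.

```python
import numpy as np

def soft_jaccard_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Soft Jaccard (IoU) loss for one class:
        J = |p * t| / (|p| + |t| - |p * t|),  loss = 1 - J,
    where p are predicted probabilities and t are (hard or soft) labels."""
    pred, target = pred.reshape(-1), target.reshape(-1)
    intersection = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - intersection
    return float(1.0 - (intersection + eps) / (union + eps))

# With hard labels, the loss is zero exactly at a perfect prediction.
print(soft_jaccard_loss(np.array([1.0, 0.0, 1.0]), np.array([1.0, 0.0, 1.0])))  # 0.0
# With soft labels, even a prediction identical to the target yields a nonzero loss,
# which is the incompatibility the proposed Jaccard metric losses address.
t = np.array([0.9, 0.1, 0.9])
print(soft_jaccard_loss(t, t))  # > 0
```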