Maximally Consistent Sampling and the Jaccard Index of Probability Distributions
We introduce simple, efficient algorithms for computing a MinHash of a
probability distribution, suitable for both sparse and dense data, with
equivalent running times to the state of the art for both cases. The collision
probability of these algorithms is a new measure of the similarity of positive
vectors which we investigate in detail. We describe the sense in which this
collision probability is optimal for any Locality Sensitive Hash based on
sampling. We argue that this similarity measure is more useful for probability
distributions than the similarity pursued by other algorithms for weighted
MinHash, and is the natural generalization of the Jaccard index.
Comment: To appear in ICDMW 201
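A minimal sketch of this setup, assuming the standard exponential-race construction for sampling an index from a positive vector (the paper's actual algorithms and running-time analysis are not reproduced here, and the function names below are illustrative). The collision probability of the sampler is the probability Jaccard index that `prob_jaccard` computes:

```python
import random

def prob_jaccard(x, y):
    """Probability Jaccard index of two positive vectors:
    J_P(x, y) = sum over i of 1 / sum over j of max(x_j/x_i, y_j/y_i),
    taken over indices i where both x_i > 0 and y_i > 0.
    It is scale-invariant in each argument and equals 1 when the
    vectors are proportional."""
    total = 0.0
    for i in range(len(x)):
        if x[i] > 0 and y[i] > 0:
            total += 1.0 / sum(max(x[j] / x[i], y[j] / y[i])
                               for j in range(len(x)))
    return total

def minhash(w, exp_draws):
    """One MinHash of a positive vector w: the index minimizing
    e_i / w_i, where exp_draws are Exp(1) variates shared between
    the vectors being compared (one draw per coordinate per hash)."""
    return min((i for i in range(len(w)) if w[i] > 0),
               key=lambda i: exp_draws[i] / w[i])
```

Two vectors hashed with the same exponential draws collide with probability equal to their probability Jaccard index, so averaging collisions over many independent draws estimates the similarity.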
Mutation supply and the repeatability of selection for antibiotic resistance
Whether evolution can be predicted is a key question in evolutionary biology.
Here we set out to better understand the repeatability of evolution. We
explored experimentally the effect of mutation supply and the strength of
selective pressure on the repeatability of selection from standing genetic
variation. Different sizes of mutant libraries of an antibiotic resistance
gene, TEM-1 β-lactamase in Escherichia coli, were subjected to different
antibiotic concentrations. We determined whether populations went extinct or
survived, and sequenced the TEM gene of the surviving populations. The
distribution of mutations per allele in our mutant libraries, generated by
error-prone PCR, followed a Poisson distribution. Extinction patterns could be
explained by a simple stochastic model that assumed the sampling of beneficial
mutations was key for survival. In most surviving populations, alleles
containing at least one known large-effect beneficial mutation were present.
These genotype data also support a model which only invokes sampling effects to
describe the occurrence of alleles containing large-effect driver mutations.
Hence, evolution is largely predictable given cursory knowledge of mutational
fitness effects, the mutation rate and population size. There were no clear
trends in the repeatability of selected mutants when we considered all
mutations present. However, when only known large-effect mutations were
considered, the outcome of selection was less repeatable for large libraries, in
contrast to expectations. Furthermore, we show experimentally that alleles
carrying multiple mutations selected from large libraries confer higher
resistance levels relative to alleles with only a known large-effect mutation,
suggesting that the scarcity of high-resistance alleles carrying multiple
mutations may contribute to the decrease in repeatability at large library
sizes.
Comment: 31 pages, 9 figures
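The sampling-only reasoning in this abstract can be sketched with a toy calculation; the parameterization below (a library of independent alleles, mutations per allele Poisson-distributed, each mutation independently beneficial with some fixed probability) is hypothetical and simpler than the paper's actual model:

```python
import math

def p_survival(library_size, mean_mutations, frac_beneficial):
    """Survival probability of a mutant library under a simple
    sampling-only model (hypothetical parameterization):
    mutations per allele ~ Poisson(mean_mutations), each mutation
    independently beneficial with probability frac_beneficial, so
    beneficial mutations per allele ~ Poisson(mean_mutations *
    frac_beneficial) by Poisson thinning. The library survives if
    at least one allele carries at least one beneficial mutation."""
    lam_beneficial = mean_mutations * frac_beneficial
    p_allele = 1.0 - math.exp(-lam_beneficial)
    return 1.0 - (1.0 - p_allele) ** library_size
```

Under this toy model, survival is governed by the product of library size, mutation rate, and the supply of beneficial mutations, which is the qualitative point the abstract makes about predictability from cursory knowledge of those quantities.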
ConNEcT: A Novel Network Approach for Investigating the Co-occurrence of Binary Psychopathological Symptoms Over Time
Network analysis is an increasingly popular approach to studying mental disorders in all their complexity. Multiple methods have been developed to extract networks from cross-sectional data, with these data being either continuous or binary. However, when it comes to time series data, most efforts have focused on continuous data. We therefore propose ConNEcT, a network approach for binary symptom data across time. ConNEcT allows one to visualize and study the prevalence of different symptoms as well as their co-occurrence, measured by means of a contingency measure, in a single network picture. ConNEcT can be complemented with a significance test that accounts for the serial dependence in the data. To illustrate the usefulness of ConNEcT, we re-analyze data from a study in which patients diagnosed with major depressive disorder weekly reported the absence or presence of eight depression symptoms. We first extract ConNEcTs for all patients who provided data during at least 104 weeks, revealing strong inter-individual differences in which symptom pairs co-occur significantly. Second, to gain insight into these differences, we apply Hierarchical Classes Analysis to the co-occurrence patterns of all patients, showing that they can be grouped into meaningful clusters. Core depression symptoms (i.e., depressed mood and/or diminished interest), cognitive problems and loss of energy seem to co-occur universally, but preoccupation with death, psychomotor problems or eating problems only co-occur with other symptoms for specific patient subgroups.
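The construction described above can be sketched as follows. The abstract does not fix a particular contingency measure, so the phi coefficient is used here purely as an example, and the function names and data layout are illustrative:

```python
import math

def prevalence(series):
    """Fraction of time points at which a binary symptom is present."""
    return sum(series) / len(series)

def phi_coefficient(a, b):
    """Phi coefficient of two binary series: a standard contingency
    measure for 2x2 tables (the measure used in ConNEcT may differ)."""
    n = len(a)
    n11 = sum(1 for x, y in zip(a, b) if x and y)
    n10 = sum(1 for x, y in zip(a, b) if x and not y)
    n01 = sum(1 for x, y in zip(a, b) if not x and y)
    n00 = n - n11 - n10 - n01
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else 0.0

def connect_network(data):
    """Build a co-occurrence network from {symptom: binary series}:
    node weights are symptom prevalences, edge weights are pairwise
    contingency values."""
    symptoms = list(data)
    nodes = {s: prevalence(data[s]) for s in symptoms}
    edges = {(s, t): phi_coefficient(data[s], data[t])
             for i, s in enumerate(symptoms) for t in symptoms[i + 1:]}
    return nodes, edges
```

The significance test mentioned in the abstract, which must account for serial dependence in the series, is a separate step not sketched here.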
Configuration model for correlation matrices preserving the node strength
Correlation matrices are a major type of multivariate data. To examine
properties of a given correlation matrix, a common practice is to compare the
same quantity between the original correlation matrix and reference correlation
matrices, such as those derived from random matrix theory, that partially
preserve properties of the original matrix. We propose a model to generate such
reference correlation and covariance matrices for the given matrix. Correlation
matrices are often analysed as networks, which are heterogeneous across nodes
in terms of the total connectivity to other nodes for each node. Given this
background, the present algorithm generates random networks that preserve the
expectation of total connectivity of each node to other nodes, akin to
configuration models for conventional networks. Our algorithm is derived from
the maximum entropy principle. We apply the proposed algorithm to the
measurement of clustering coefficients and to community detection, both of which
require a null model to assess the statistical significance of the obtained
results.
Comment: 8 figures, 4 tables
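The quantity this abstract aims to preserve, and why a naive null model falls short, can be illustrated briefly. The maximum-entropy configuration model itself is not implemented here; the shuffled reference below is a deliberately naive baseline, and all names are illustrative:

```python
import numpy as np

def node_strength(corr):
    """Node strength in a correlation-matrix network: for each node,
    the sum of its correlations with all other nodes (diagonal excluded)."""
    c = np.asarray(corr, dtype=float)
    return c.sum(axis=1) - np.diag(c)

def shuffled_null(corr, rng):
    """Naive reference matrix: permute the off-diagonal entries while
    keeping symmetry and the unit diagonal. This preserves the overall
    distribution of correlations but generally NOT each node's strength,
    which is what motivates a configuration-model-style null that
    preserves expected strength per node."""
    c = np.asarray(corr, dtype=float).copy()
    n = c.shape[0]
    iu = np.triu_indices(n, k=1)
    vals = c[iu]
    rng.shuffle(vals)
    c[iu] = vals
    c[(iu[1], iu[0])] = vals
    return c
```

Comparing `node_strength(corr)` with `node_strength(shuffled_null(corr, rng))` shows that only the total strength, not its distribution across nodes, survives naive shuffling.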
Jaccard Metric Losses: Optimizing the Jaccard Index with Soft Labels
IoU losses are surrogates that directly optimize the Jaccard index. In
semantic segmentation, leveraging IoU losses as part of the loss function is
shown to perform better with respect to the Jaccard index measure than
optimizing pixel-wise losses such as the cross-entropy loss alone. The most
notable IoU losses are the soft Jaccard loss and the Lovász-Softmax loss.
However, these losses are incompatible with soft labels which are ubiquitous in
machine learning. In this paper, we propose Jaccard metric losses (JMLs), which
are identical to the soft Jaccard loss in a standard setting with hard labels,
but are compatible with soft labels. With JMLs, we study two of the most
popular use cases of soft labels: label smoothing and knowledge distillation.
With a variety of architectures, our experiments show significant improvements
over the cross-entropy loss on three semantic segmentation datasets
(Cityscapes, PASCAL VOC and DeepGlobe Land), and our simple approach
outperforms state-of-the-art knowledge distillation methods by a large margin.
Code is available at: https://github.com/zifuwanggg/JDTLosses
Comment: Submitted to ICML 2023.
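The hard-label/soft-label distinction drawn above can be made concrete with a small sketch. The L1-based variant below is one soft-label-compatible Jaccard loss of the kind the paper proposes (the exact losses are defined in the paper and repository; function names here are illustrative):

```python
import numpy as np

def soft_jaccard_loss(pred, target, eps=1e-8):
    """Standard soft Jaccard (IoU) loss:
    1 - <p, y> / (|p| + |y| - <p, y>).
    With soft targets it is not minimized at pred == target."""
    inter = np.sum(pred * target)
    union = np.sum(pred) + np.sum(target) - inter
    return 1.0 - inter / (union + eps)

def jaccard_metric_loss(pred, target, eps=1e-8):
    """An L1-based Jaccard loss:
    1 - (||p + y||_1 - ||p - y||_1) / (||p + y||_1 + ||p - y||_1).
    For hard (0/1) targets it coincides with the soft Jaccard loss,
    but it reaches its minimum exactly at pred == target even when
    the target is soft, which is the compatibility property needed
    for label smoothing and knowledge distillation."""
    s = np.sum(np.abs(pred + target))
    d = np.sum(np.abs(pred - target))
    return 1.0 - (s - d) / (s + d + eps)
```

The coincidence for hard labels follows from |p - y| = p + y - 2py when y is 0/1 and p is in [0, 1], so both losses reduce to the same intersection-over-union expression.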