Bolt: Accelerated Data Mining with Fast Vector Compression
Vectors of data are at the heart of machine learning and data mining.
Recently, vector quantization methods have shown great promise in reducing both
the time and space costs of operating on vectors. We introduce a vector
quantization algorithm that can compress vectors over 12x faster than existing
techniques while also accelerating approximate vector operations such as
distance and dot product computations by up to 10x. Because it can encode over
2GB of vectors per second, it makes vector quantization cheap enough to employ
in many more circumstances. For example, using our technique to compute
approximate dot products in a nested loop can multiply matrices faster than a
state-of-the-art BLAS implementation, even when our algorithm must first
compress the matrices.
In addition to showing the above speedups, we demonstrate that our approach
can accelerate nearest neighbor search and maximum inner product search by over
100x compared to floating point operations and up to 10x compared to other
vector quantization methods. Our approximate Euclidean distance and dot product
computations are not only faster than those of related algorithms with slower
encodings, but also faster than Hamming distance computations, which have
direct hardware support on the tested platforms. We also assess the errors of
our algorithm's approximate distances and dot products, and find that it is
competitive with existing, slower vector quantization algorithms.
Comment: Research track paper at KDD 201
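To make the idea of approximate dot products on quantized vectors concrete, here is a minimal sketch of generic product quantization in NumPy. It is not the Bolt algorithm itself: the function names (`fit_codebooks`, `encode`, `approx_dot`), the plain k-means training loop, and all parameter values are assumptions chosen for illustration. Each vector is split into subvectors, each subvector is replaced by the index of its nearest centroid, and a query's dot product is then approximated by summing entries from a small lookup table.

```python
import numpy as np

def fit_codebooks(X, n_subspaces, n_centroids, n_iters=10, seed=0):
    """Learn one small k-means codebook per subspace (illustrative, not tuned)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    sub = d // n_subspaces  # assumes d is divisible by n_subspaces
    codebooks = np.empty((n_subspaces, n_centroids, sub))
    for m in range(n_subspaces):
        Xm = X[:, m * sub:(m + 1) * sub]
        C = Xm[rng.choice(n, n_centroids, replace=False)].copy()
        for _ in range(n_iters):
            d2 = ((Xm[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
            assign = d2.argmin(axis=1)
            for k in range(n_centroids):
                pts = Xm[assign == k]
                if len(pts):  # leave empty clusters where they are
                    C[k] = pts.mean(axis=0)
        codebooks[m] = C
    return codebooks

def encode(X, codebooks):
    """Replace each subvector by the index of its nearest centroid."""
    M, K, sub = codebooks.shape  # K must be <= 256 to fit in uint8
    codes = np.empty((X.shape[0], M), dtype=np.uint8)
    for m in range(M):
        Xm = X[:, m * sub:(m + 1) * sub]
        d2 = ((Xm[:, None, :] - codebooks[m][None, :, :]) ** 2).sum(axis=-1)
        codes[:, m] = d2.argmin(axis=1)
    return codes

def approx_dot(q, codes, codebooks):
    """Approximate q . x for every encoded x using per-subspace lookup tables."""
    M, K, sub = codebooks.shape
    # lut[m, k] = dot product of the m-th query subvector with centroid k
    lut = np.einsum('mks,ms->mk', codebooks, q.reshape(M, sub))
    # gather one table entry per (vector, subspace) and sum over subspaces
    return lut[np.arange(M), codes].sum(axis=1)
```

After encoding, each vector occupies one byte per subspace, and a query costs one table construction plus a handful of lookups per stored vector. The result equals the exact dot product between the query and the reconstructed (decoded) vectors, so the approximation error comes entirely from the quantization step.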
Learning the Probability of Activation in the Presence of Latent Spreaders
When an infection spreads in a community, an individual's probability of
becoming infected depends on both her susceptibility and exposure to the
contagion through contact with others. While one often has knowledge regarding
an individual's susceptibility, in many cases, whether or not an individual's
contacts are contagious is unknown. We study the problem of predicting if an
individual will adopt a contagion in the presence of multiple modes of
infection (exposure/susceptibility) and latent neighbor influence. We present a
generative probabilistic model and a variational inference method to learn the
parameters of our model. Through a series of experiments on synthetic data, we
measure the ability of the proposed model to identify latent spreaders, and
predict the risk of infection. Applied to a real dataset of 20,000 hospital
patients, we demonstrate the utility of our model in predicting the onset of a
healthcare associated infection using patient room-sharing and nurse-sharing
networks. Our model outperforms existing benchmarks and provides actionable
insights for the design and implementation of targeted interventions to curb
the spread of infection.
Comment: To appear in AAAI-1
Anatomical Priors in Convolutional Networks for Unsupervised Biomedical Segmentation
We consider the problem of segmenting a biomedical image into anatomical
regions of interest. We specifically address the frequent scenario where we
have no paired training data that contains images and their manual
segmentations. Instead, we employ unpaired segmentation images to build an
anatomical prior. Critically, these segmentations can be derived from imaging
data from a different dataset and imaging modality than the current task. We
introduce a generative probabilistic model that employs the learned prior
through a convolutional neural network to compute segmentations in an
unsupervised setting. We conducted an empirical analysis of the proposed
approach in the context of structural brain MRI segmentation, using a
multi-study dataset of more than 14,000 scans. Our results show that an
anatomical prior can enable fast unsupervised segmentation which is typically
not possible using standard convolutional networks. The integration of
anatomical priors can facilitate CNN-based anatomical segmentation in a range
of novel clinical problems, where few or no annotations are available and thus
standard networks are not trainable. The code is freely available at
http://github.com/adalca/neuron.
Comment: Presented at CVPR 2018. IEEE CVPR proceedings pp. 9290-929
SizeGAN: Improving Size Representation in Clothing Catalogs
Online clothing catalogs lack diversity in body shape and garment size.
Brands commonly display their garments on models of one or two sizes, rarely
including plus-size models. In this work, we propose a new method, SizeGAN, for
generating images of garments on different-sized models. To change the garment
and model size while maintaining a photorealistic image, we incorporate image
alignment ideas from the medical imaging literature into the StyleGAN2-ADA
architecture. Our method learns deformation fields at multiple resolutions and
uses a spatial transformer to modify the garment and model size. We evaluate
our approach along three dimensions: realism, garment faithfulness, and size.
To our knowledge, SizeGAN is the first method to focus on this size
under-representation problem for modeling clothing. We provide an analysis
comparing SizeGAN to other plausible approaches and additionally provide the
first clothing dataset with size labels. In a user study comparing SizeGAN and
two recent virtual try-on methods, we show that our method ranks first in each
dimension, and was vastly preferred for realism and garment faithfulness. In
comparison to most previous work, which has focused on generating
photorealistic images of garments, our work shows that it is possible to
generate images that are both photorealistic and cover diverse garment sizes.
Weighted Time Warping for Temporal Segmentation of Multi-Parameter Physiological Signals
We present a novel approach to segmenting a quasiperiodic multi-parameter physiological signal in the presence of noise and transient corruption. We use Weighted Time Warping (WTW) to combine the partially correlated signals. We then use the relationship between the channels and the repetitive morphology of the time series to partition it into quasiperiodic units by matching it against a constantly evolving template. The method can accurately segment a multi-parameter signal, even when all the individual channels are so corrupted that they cannot be individually segmented. Experiments carried out on MIMIC, a multi-parameter physiological dataset recorded on ICU patients, demonstrate the effectiveness of the method. Our method performs as well as a widely used QRS detector on clean raw data, and outperforms it on corrupted data. Under additive noise at SNR 0 dB, the average errors were 5.81 ms for our method and 303.48 ms for the QRS detector. Under transient corruption, they were 2.89 ms and 387.32 ms, respectively.
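For readers unfamiliar with time warping, the sketch below shows classic single-channel dynamic time warping (DTW), the standard technique on which warping-based alignment methods build. It is not the Weighted Time Warping method described in the abstract, which combines multiple channels; the function name `dtw_distance` and the absolute-difference cost are assumptions made for illustration.

```python
def dtw_distance(a, b):
    """Classic DTW cost between two 1-D sequences (absolute-difference cost)."""
    n, m = len(a), len(b)
    inf = float("inf")
    # D[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # extend the cheapest of: step in a only, step in b only, step in both
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Because the alignment may repeat samples, a beat that is stretched in time can still match a template at zero cost, which is what makes warping attractive for quasiperiodic signals whose period drifts.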
A Framework for Understanding Unintended Consequences of Machine Learning
As machine learning increasingly affects people and society, it is important
that we strive for a comprehensive and unified understanding of potential
sources of unwanted consequences. For instance, downstream harms to particular
groups are often blamed on "biased data," but this concept encompasses too many
issues to be useful in developing solutions. In this paper, we provide a
framework that partitions sources of downstream harm in machine learning into
six distinct categories spanning the data generation and machine learning
pipeline. We describe how these issues arise, how they are relevant to
particular applications, and how they motivate different solutions. In doing
so, we aim to facilitate the development of solutions that stem from an
understanding of application-specific populations and data generation
processes, rather than relying on general statements about what may or may not
be "fair."
Comment: 6 pages, 2 figures; updated with corrected figure