A Flexible and Adaptive Framework for Abstention Under Class Imbalance
In practical applications of machine learning, it is often desirable to
identify and abstain on examples where the model's predictions are likely to be
incorrect. Much of the prior work on this topic has focused on
out-of-distribution detection or performance metrics such as top-k accuracy.
Comparatively little attention has been given to metrics such as
area-under-the-curve or Cohen's Kappa, which are extremely relevant for
imbalanced datasets. Abstention strategies
aimed at top-k accuracy can produce poor results on these metrics when applied
to imbalanced datasets, even when all examples are in-distribution. We propose
a framework to address this gap. Our framework leverages the insight that
calibrated probability estimates can be used as a proxy for the true class
labels, thereby allowing us to estimate the change in an arbitrary metric if an
example were abstained on. Using this framework, we derive computationally
efficient metric-specific abstention algorithms for optimizing the sensitivity
at a target specificity level, the area under the ROC, and the weighted Cohen's
Kappa. Because our method relies only on calibrated probability estimates, we
further show that by leveraging recent work on domain adaptation under label
shift, we can generalize to test-set distributions that may have a different
class imbalance compared to the training-set distribution. In experiments
involving medical imaging, natural language processing, computer
vision, and genomics, we demonstrate the effectiveness of our approach. Source
code available at https://github.com/blindauth/abstention. Colab notebooks
reproducing results available at
https://github.com/blindauth/abstention_experiments
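The key insight, treating calibrated probabilities as soft stand-ins for the unknown true labels, can be illustrated with a minimal sketch. The leave-one-out loop, the `soft_accuracy` metric, and all numbers below are illustrative assumptions, not the paper's more efficient metric-specific algorithms:

```python
import numpy as np

def estimated_metric_deltas(probs, soft_metric):
    """Estimate, for each example, the change in a metric if that example
    were abstained on, using calibrated probabilities as soft labels."""
    preds = (probs >= 0.5).astype(int)
    base = soft_metric(probs, preds)     # soft-label estimate on the full set
    deltas = np.empty(len(probs))
    for i in range(len(probs)):
        mask = np.ones(len(probs), dtype=bool)
        mask[i] = False                  # pretend we abstain on example i
        deltas[i] = soft_metric(probs[mask], preds[mask]) - base
    return deltas

def soft_accuracy(soft_labels, preds):
    # Expected accuracy when soft_labels are calibrated P(y = 1)
    return np.mean(preds * soft_labels + (1 - preds) * (1 - soft_labels))

probs = np.array([0.9, 0.55, 0.2, 0.48])    # calibrated P(y = 1)
deltas = estimated_metric_deltas(probs, soft_accuracy)
abstain_idx = int(np.argmax(deltas))        # abstaining here helps the metric most
```

Plugging a soft estimate of sensitivity-at-specificity, AUROC, or weighted Cohen's Kappa in place of `soft_accuracy` gives the metric-specific variants the abstract describes.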
Making Neural Networks Interpretable with Attribution: Application to Implicit Signals Prediction
Explaining recommendations enables users to understand whether recommended
items are relevant to their needs and has been shown to increase their trust in
the system. More generally, while designing explainable machine learning models
is key to checking the sanity and robustness of a decision process and improving
its efficiency, it remains a challenge for complex architectures,
especially deep neural networks, which are often deemed "black-box". In this
paper, we propose a novel formulation of interpretable deep neural networks for
the attribution task. Unlike popular post-hoc methods, our approach is
interpretable by design. Using masked weights, hidden features can be deeply
attributed, split into several input-restricted sub-networks, and trained as a
boosted mixture of experts. Experimental results on synthetic data and
real-world recommendation tasks demonstrate that our method enables building
models whose predictive performance is close to that of their non-interpretable
counterparts, while providing informative attribution interpretations. Comment: 14th ACM Conference on Recommender Systems (RecSys '20)
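The masked-weights idea, splitting the network into input-restricted sub-networks whose summed outputs serve as attributions, can be sketched as follows. The feature grouping, layer sizes, and random weights are hypothetical stand-ins (in the actual method the experts are trained as a boosted mixture, not drawn at random):

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_experts, hidden = 6, 3, 4

# Each expert sees only its own feature group, enforced by a binary
# mask on the first-layer weights (hypothetical grouping).
feature_groups = [np.array([0, 1]), np.array([2, 3]), np.array([4, 5])]
masks = np.zeros((n_experts, n_features))
for e, group in enumerate(feature_groups):
    masks[e, group] = 1.0

W1 = rng.normal(size=(n_experts, hidden, n_features)) * masks[:, None, :]
W2 = rng.normal(size=(n_experts, hidden))

def predict(x):
    # Each expert's scalar output is an attribution for its feature
    # group; the model prediction is the sum of expert outputs.
    h = np.tanh(np.einsum('ehf,f->eh', W1, x))
    expert_outputs = np.einsum('eh,eh->e', W2, h)
    return expert_outputs.sum(), expert_outputs

x = rng.normal(size=n_features)
y_pred, attributions = predict(x)
```

By construction, perturbing features outside an expert's group cannot change that expert's attribution, which is what makes the decomposition interpretable by design rather than post hoc.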
Complaint-driven Training Data Debugging for Query 2.0
As the need for machine learning (ML) increases rapidly across all industry
sectors, there is a significant interest among commercial database providers to
support "Query 2.0", which integrates model inference into SQL queries.
Debugging Query 2.0 is very challenging, since an unexpected query result may be
caused by bugs in the training data (e.g., wrong labels, corrupted features).
In response, we propose Rain, a complaint-driven training data debugging
system. Rain allows users to specify complaints over the query's intermediate
or final output, and aims to return a minimum set of training examples so that
if they were removed, the complaints would be resolved. To the best of our
knowledge, we are the first to study this problem. A naive solution requires
retraining an exponential number of ML models. We propose two novel heuristic
approaches based on influence functions, both of which require only a linear
number of retraining steps. We provide an in-depth analytical and empirical
analysis of the two approaches and conduct extensive experiments to evaluate
their effectiveness on four real-world datasets. Results show that Rain
achieves the highest recall@k among all the baselines while still returning
results interactively. Comment: Proceedings of the 2020 ACM SIGMOD International Conference on
Management of Data
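The influence-function heuristic can be sketched for a plain logistic-regression model. The setting, function names, and synthetic data below are illustrative assumptions; Rain's actual algorithms additionally reason over the provenance of SQL query results:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def rank_training_points(X, y, theta, x_c, y_c, damp=1e-3):
    """Rank training points by how much their removal is estimated to
    reduce the loss on a 'complaint' point (x_c, y_c), where y_c is the
    label the user believes is correct."""
    n, d = X.shape
    p = sigmoid(X @ theta)
    grads = (p - y)[:, None] * X                          # per-example gradients
    H = (X.T * (p * (1 - p))) @ X / n + damp * np.eye(d)  # damped Hessian
    g_c = (sigmoid(x_c @ theta) - y_c) * x_c              # complaint gradient
    v = np.linalg.solve(H, g_c)                           # H^{-1} g_c
    delta_loss = (grads @ v) / n   # approx change in complaint loss if i removed
    return np.argsort(delta_loss)  # most helpful removals first

# Hypothetical usage: complain that example 0 should have the opposite label
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (X[:, 0] > 0).astype(float)
theta = np.zeros(3)
for _ in range(200):                                      # rough model fit
    theta -= 0.1 * X.T @ (sigmoid(X @ theta) - y) / len(y)
ranking = rank_training_points(X, y, theta, X[0], 1.0 - y[0])
candidates = ranking[:3]   # training points to inspect or remove first
```

Because the single linear solve is shared across all training points, scoring every candidate costs roughly one pass over the data rather than one retraining per candidate.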
Deciphering regulatory DNA sequences and noncoding genetic variants using neural network models of massively parallel reporter assays.
The relationship between noncoding DNA sequence and gene expression is not well understood. Massively parallel reporter assays (MPRAs), which quantify the regulatory activity of large libraries of DNA sequences in parallel, are a powerful approach to characterize this relationship. We present MPRA-DragoNN, a convolutional neural network (CNN)-based framework to predict and interpret the regulatory activity of DNA sequences as measured by MPRAs. While our method is generally applicable to a variety of MPRA designs, here we trained our model on the Sharpr-MPRA dataset, which measures the activity of ∼500,000 constructs tiling 15,720 regulatory regions in human K562 and HepG2 cell lines. MPRA-DragoNN predictions were moderately correlated (Spearman ρ = 0.28) with measured activity and were within the range of replicate concordance of the assay. State-of-the-art model interpretation methods revealed high-resolution predictive regulatory sequence features that overlapped transcription factor (TF) binding motifs. We used the model to investigate the cell type and chromatin state preferences of predictive TF motifs. We explored the ability of our model to predict the allelic effects of regulatory variants in an independent MPRA experiment and fine-map putative functional SNPs in loci associated with lipid traits. Our results suggest that interpretable deep learning models trained on MPRA data have the potential to reveal meaningful patterns in regulatory DNA sequences and prioritize regulatory genetic variants, especially as larger, higher-quality datasets are produced.
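The first stage of any such CNN, one-hot encoding the sequence and scanning learned filters that act as motif detectors, can be sketched minimally. The 'TATA' filter and toy sequence are illustrative assumptions, not the trained MPRA-DragoNN model:

```python
import numpy as np

BASES = 'ACGT'

def one_hot(seq):
    """One-hot encode a DNA sequence into a (length, 4) array."""
    out = np.zeros((len(seq), 4))
    out[np.arange(len(seq)), [BASES.index(b) for b in seq]] = 1.0
    return out

def conv_scan(encoded, filt):
    """Slide a filter along the sequence, the core operation of the
    first convolutional layer; high activations mark motif matches."""
    w = filt.shape[0]
    return np.array([np.sum(encoded[i:i + w] * filt)
                     for i in range(encoded.shape[0] - w + 1)])

# A hypothetical filter that responds to the motif 'TATA'
filt = one_hot('TATA')
activations = conv_scan(one_hot('GGTATACC'), filt)
best_pos = int(np.argmax(activations))  # position of the strongest match
```

In the trained model the filters are learned from MPRA activity measurements, and interpretation methods recover which learned filters align with known TF binding motifs.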
Bias factorized, base-resolution deep learning models of chromatin accessibility reveal cis-regulatory sequence syntax, transcription factor footprints and regulatory variants.
<ul>
<li>(MAJOR) Bug in the chrombpnet modisco_motifs command: the number of seqlets was capped at 50,000, so user requests to raise it (e.g., to 1 million) had no effect.</li>
<li>Peaks at chromosome edges are now filtered for the pred_bw command and the bias pipeline, so bias evaluation is done on these filtered peaks.</li>
<li>Preprocessing now defaults to Unix sort, with an option to switch to bedtools sort.</li>
<li>Added an option to filter chromosomes during preprocessing.</li>
</ul>
<p><strong>Full Changelog</strong>: https://github.com/kundajelab/chrombpnet/compare/v0.1.3...v0.1.4</p>