Data Augmentation for Sample Efficient and Robust Document Ranking
Contextual ranking models have delivered impressive performance improvements
over classical models in the document ranking task. However, these highly
over-parameterized models tend to be data-hungry and require large amounts of
data even for fine-tuning. In this paper, we propose data-augmentation methods
for effective and robust ranking performance. A key benefit of data
augmentation is sample efficiency: the ability to learn effectively when only
a small amount of training data is available. We propose supervised and
unsupervised data augmentation schemes by creating training data using parts of
the relevant documents in the query-document pairs. We then adapt a family of
contrastive losses for the document ranking task that can exploit the augmented
data to learn an effective ranking model. Our extensive experiments on subsets
of the MS MARCO and TREC-DL test sets show that data augmentation, along with
the ranking-adapted contrastive losses, results in performance improvements
under most dataset sizes. Apart from sample efficiency, we conclusively show
that data augmentation results in robust models when transferred to
out-of-domain benchmarks. Our performance improvements on in-domain and, more
prominently, on out-of-domain benchmarks show that augmentation regularizes the
ranking model and improves its robustness and generalization capability.
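The ranking-adapted contrastive loss described above can be illustrated with a minimal sketch: an InfoNCE-style objective in which an augmented passage from a relevant document serves as each query's positive and the rest of the batch as negatives. The function name, shapes, and temperature below are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def ranking_contrastive_loss(query_emb, doc_emb, temperature=0.05):
    """query_emb, doc_emb: (batch, dim) L2-normalized embeddings, where
    doc_emb[i] is an augmented relevant passage for query i."""
    # Score every query against every document in the batch.
    logits = query_emb @ doc_emb.T / temperature   # (batch, batch)
    # The matching (diagonal) document is each query's positive; the
    # rest of the batch serves as in-batch negatives.
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)

# Toy usage with random, normalized embeddings.
q = F.normalize(torch.randn(8, 128), dim=-1)
d = F.normalize(torch.randn(8, 128), dim=-1)
loss = ranking_contrastive_loss(q, d)
```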
Domain-Specific Fast Pre-training Technique using Document-Level Metadata and Taxonomy
As the demand for sophisticated Natural Language Processing (NLP) models
continues to grow, so does the need for efficient pre-training techniques.
Current NLP models undergo resource-intensive pre-training. In response, we
introduce a Fast Pre-training Technique using Document-Level Metadata
and Taxonomy, a novel approach designed to significantly reduce computational
demands. The technique leverages document metadata and domain-specific
taxonomy as supervision signals. It involves continual pre-training of an
open-domain
transformer encoder using sentence-level embeddings, followed by fine-tuning
using token-level embeddings. We evaluate the approach on six tasks across
nine datasets spanning three distinct domains, where it achieves compute
reductions of approximately 1,000x, 4,500x, and 500x compared to
competitive approaches in Customer Support, Scientific, and Legal domains,
respectively. Importantly, these efficiency gains do not compromise performance
relative to competitive baselines. Furthermore, the reduced pre-training data
mitigates catastrophic forgetting, ensuring consistent performance in
open-domain scenarios. The technique offers a promising solution for
resource-efficient pre-training, with potential applications spanning various
domains.
Comment: 38 pages, 7 figures
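As a rough illustration of using document-level taxonomy as a supervision signal, the sketch below applies a supervised contrastive loss that pulls embeddings of same-taxonomy documents together. This is an assumption-laden reconstruction of the general idea, not the paper's method; all names, shapes, and the temperature are hypothetical.

```python
import torch
import torch.nn.functional as F

def taxonomy_supcon_loss(doc_emb, labels, temperature=0.1):
    """doc_emb: (n, dim) L2-normalized document embeddings;
    labels: (n,) integer taxonomy classes taken from document metadata."""
    n = doc_emb.size(0)
    sim = doc_emb @ doc_emb.T / temperature              # (n, n) similarities
    eye = torch.eye(n, dtype=torch.bool, device=doc_emb.device)
    sim = sim.masked_fill(eye, -1e9)                     # exclude self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Documents sharing a taxonomy label are mutual positives.
    pos = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~eye
    # Average log-probability over each anchor's same-taxonomy positives.
    pos_counts = pos.sum(1).clamp(min=1)
    return -(log_prob * pos).sum(1).div(pos_counts).mean()

# Toy usage: 6 documents, 2 taxonomy classes.
emb = F.normalize(torch.randn(6, 64), dim=-1)
labels = torch.tensor([0, 0, 1, 1, 0, 1])
loss = taxonomy_supcon_loss(emb, labels)
```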
Learning Multi-modal Representations by Watching Hundreds of Surgical Video Lectures
Recent advancements in surgical computer vision applications have been driven
by fully-supervised methods, primarily using only visual data. These methods
rely on manually annotated surgical videos to predict a fixed set of object
categories, limiting their generalizability to unseen surgical procedures and
downstream tasks. In this work, we put forward the idea that the surgical video
lectures available through open surgical e-learning platforms can provide
effective supervisory signals for multi-modal representation learning without
relying on manual annotations. We address the surgery-specific linguistic
challenges present in surgical video lectures by employing multiple
complementary automatic speech recognition systems to generate text
transcriptions. We then present a novel method, SurgVLP - Surgical Vision
Language Pre-training, for multi-modal representation learning. SurgVLP
constructs a new contrastive learning objective to align video clip embeddings
with the corresponding multiple text embeddings by bringing them together
within a joint latent space. To effectively show the representation capability
of the learned joint latent space, we introduce several vision-and-language
tasks for surgery, such as text-based video retrieval, temporal activity
grounding, and video captioning, as benchmarks for evaluation. We further
demonstrate that without using any labeled ground truth, our approach can be
employed for traditional vision-only surgical downstream tasks, such as
surgical tool, phase, and triplet recognition. The code will be made available
at https://github.com/CAMMA-public/SurgVL
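The multi-text contrastive objective described above can be sketched as follows, assuming each clip comes with k ASR transcriptions whose embeddings are treated as positives in a shared latent space. This is an illustrative reconstruction, not the SurgVLP release; names, shapes, and the temperature are assumptions.

```python
import torch
import torch.nn.functional as F

def multi_text_clip_loss(video_emb, text_emb, temperature=0.07):
    """video_emb: (n, dim) clip embeddings; text_emb: (n, k, dim), k
    transcription embeddings per clip. All assumed L2-normalized."""
    n, k, dim = text_emb.shape
    # Score each clip against every transcription in the batch.
    logits = video_emb @ text_emb.reshape(n * k, dim).T / temperature
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    # Each clip's own k transcriptions are its positives.
    pos = torch.zeros_like(logits, dtype=torch.bool)
    for i in range(n):
        pos[i, i * k:(i + 1) * k] = True
    return -(log_prob * pos).sum(1).div(k).mean()

# Toy usage: 4 clips, 3 ASR transcriptions each.
v = F.normalize(torch.randn(4, 256), dim=-1)
t = F.normalize(torch.randn(4, 3, 256), dim=-1)
loss = multi_text_clip_loss(v, t)
```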
DeSIQ: Towards an Unbiased, Challenging Benchmark for Social Intelligence Understanding
Social intelligence is essential for understanding and reasoning about human
expressions, intents and interactions. One representative benchmark for its
study is Social Intelligence Queries (Social-IQ), a dataset of multiple-choice
questions on videos of complex social interactions. We define a comprehensive
methodology to study the soundness of Social-IQ, as the soundness of such
benchmark datasets is crucial to the investigation of the underlying research
problem. Our analysis reveals that Social-IQ contains substantial biases,
which a moderately strong language model can exploit to learn spurious
correlations and achieve perfect performance without being given the context or
even the question. We introduce DeSIQ, a new challenging dataset, constructed
by applying simple perturbations to Social-IQ. Our empirical analysis shows
DeSIQ significantly reduces the biases in the original Social-IQ dataset.
Furthermore, we examine and shed light on the effect of model size, model
style, learning settings, commonsense knowledge, and multi-modality on the new
benchmark performance. Our new dataset, observations and findings open up
important research questions for the study of social intelligence.
Comment: 12 pages, 5 figures, EMNLP 2023 Long Paper
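The kind of answer-only bias probe alluded to above can be sketched as follows: score each candidate answer with no question or video context, so that high accuracy signals spurious correlations in the benchmark. The dataset fields and scoring callable here are hypothetical assumptions, not the paper's code.

```python
def answer_only_accuracy(examples, score_answer):
    """examples: iterable of dicts with 'choices' (list of answer strings)
    and 'label' (index of the correct choice). score_answer: callable
    assigning a plausibility score to a lone answer string."""
    correct, total = 0, 0
    for ex in examples:
        # Score each choice in isolation: no question, no video context.
        scores = [score_answer(choice) for choice in ex["choices"]]
        pred = max(range(len(scores)), key=scores.__getitem__)
        correct += int(pred == ex["label"])
        total += 1
    return correct / total if total else 0.0

# Toy probe: a trivial length-based scorer on two mock questions.
mock = [
    {"choices": ["yes", "a much longer answer"], "label": 1},
    {"choices": ["short", "no"], "label": 0},
]
print(answer_only_accuracy(mock, score_answer=len))
```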