How Does Beam Search improve Span-Level Confidence Estimation in Generative Sequence Labeling?
Sequence labeling is a core task in text understanding for IE/IR systems.
Text generation models have increasingly become the go-to solution for such
tasks (e.g., entity extraction and dialog slot filling). While most research
has focused on labeling accuracy, a key aspect -- of vital practical
importance -- has slipped through the cracks: understanding model confidence.
More specifically, we lack a principled understanding of how to reliably gauge
the confidence of a model in its predictions for each labeled span. This paper
aims to provide some empirical insights on estimating model confidence for
generative sequence labeling. Most notably, we find that simply using the
decoder's output probabilities is not the best way to obtain well-calibrated
confidence estimates. As verified over six public datasets spanning different
tasks, we show that our proposed approach -- which leverages statistics from
the top-k predictions of a beam search -- significantly reduces the calibration
errors of the predictions of a generative sequence labeling model.
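As a rough illustration of the idea (not the paper's exact estimator), a span's confidence can be read off the top-k beams by aggregating the probability mass of beams that agree on that span. All names and the toy data below are hypothetical:

```python
def span_confidence(beams, span):
    """beams: list of (predicted_spans, probability) pairs from top-k beam search,
    where predicted_spans is a set of (text, label) tuples.
    Returns the probability mass of beams containing `span`, normalized by the
    total mass of the top-k beams."""
    total = sum(p for _, p in beams)
    agree = sum(p for spans, p in beams if span in spans)
    return agree / total if total > 0 else 0.0

# Toy example: three beams disagree on the label of "New York".
beams = [
    ({("New York", "LOC")}, 0.5),
    ({("New York", "ORG")}, 0.3),
    ({("New York", "LOC")}, 0.1),
]
conf = span_confidence(beams, ("New York", "LOC"))  # (0.5 + 0.1) / 0.9
```

Compared with reading confidence from a single decoded sequence, this aggregation lets disagreement among beams lower the estimate, which is the intuition behind using beam-search statistics for calibration.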
TextMI: Textualize Multimodal Information for Integrating Non-verbal Cues in Pre-trained Language Models
Pre-trained large language models have recently achieved ground-breaking
performance in a wide variety of language understanding tasks. However, the
same model cannot be applied to multimodal behavior understanding tasks (e.g.,
video sentiment/humor detection) unless non-verbal features (e.g., acoustic and
visual) can be integrated with language. Jointly modeling multiple modalities
significantly increases the model complexity, and makes the training process
data-hungry. While an enormous amount of text data is available via the web,
collecting large-scale multimodal behavioral video datasets is extremely
expensive, both in terms of time and money. In this paper, we investigate
whether large language models alone can successfully incorporate non-verbal
information when it is presented in textual form. We present a way to
convert the acoustic and visual information into corresponding textual
descriptions and concatenate them with the spoken text. We feed this augmented
input to a pre-trained BERT model and fine-tune it on three downstream
multimodal tasks: sentiment, humor, and sarcasm detection. Our approach,
TextMI, significantly reduces model complexity, adds interpretability to the
model's decision, and can be applied for a diverse set of tasks while achieving
superior (multimodal sarcasm detection) or near SOTA (multimodal sentiment
analysis and multimodal humor detection) performance. We propose TextMI as a
general, competitive baseline for multimodal behavioral analysis tasks,
particularly in a low-resource setting.
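A minimal sketch of the textualization idea, assuming a hypothetical template (the exact phrasing and separator conventions used by TextMI may differ):

```python
def textualize(spoken_text, acoustic_desc, visual_desc):
    """Describe non-verbal cues in plain text and concatenate them with the
    transcript, so a text-only pre-trained model (e.g. BERT) can consume them."""
    return (f"{spoken_text} [SEP] "
            f"The speaker's voice sounds {acoustic_desc}. "
            f"The speaker appears {visual_desc}.")

example = textualize("I love this movie", "high-pitched and fast", "smiling")
```

The augmented string is then tokenized and fine-tuned like any ordinary text input, which is what keeps the model complexity of this approach low relative to joint multimodal architectures.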
Multi-Vector Retrieval as Sparse Alignment
Multi-vector retrieval models improve over single-vector dual encoders on
many information retrieval tasks. In this paper, we cast the multi-vector
retrieval problem as sparse alignment between query and document tokens. We
propose AligneR, a novel multi-vector retrieval model that learns sparsified
pairwise alignments between query and document tokens (e.g. `dog' vs. `puppy')
and per-token unary saliences reflecting their relative importance for
retrieval. We show that controlling the sparsity of pairwise token alignments
often brings significant performance gains. While most factoid questions
focusing on a specific part of a document require a smaller number of
alignments, others requiring a broader understanding of a document favor a
larger number of alignments. Unary saliences, on the other hand, decide whether
a token ever needs to be aligned with others for retrieval (e.g. `kind' from
`kind of currency is used in new zealand'). With sparsified unary saliences,
we are able to prune a large number of query and document token vectors and
improve the efficiency of multi-vector retrieval. We learn the sparse unary
saliences with entropy-regularized linear programming, which outperforms other
methods to achieve sparsity. In a zero-shot setting, AligneR scores 51.1 points
nDCG@10, achieving a new retriever-only state-of-the-art on 13 tasks in the
BEIR benchmark. In addition, adapting pairwise alignments with a few examples
(<= 8) further improves the performance up to 15.7 points nDCG@10 for argument
retrieval tasks. The unary saliences of AligneR help us keep only 20% of
the document token representations with minimal performance loss. We further
show that our model often produces interpretable alignments and significantly
improves its performance when initialized from larger language models.
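The scoring idea can be sketched as follows. This is a simplified top-k surrogate for sparse pairwise alignment, not the entropy-regularized linear program the paper actually learns, and all names are illustrative:

```python
import numpy as np

def aligner_score(sim, q_sal, d_sal, k=2):
    """sim: (n_query, n_doc) token-similarity matrix.
    q_sal, d_sal: per-token unary saliences in [0, 1]; a salience near 0
    effectively prunes that token from alignment.
    Keeps only the k strongest salience-weighted alignments per query token."""
    weighted = sim * np.outer(q_sal, d_sal)   # salience-weighted similarities
    topk = np.sort(weighted, axis=1)[:, -k:]  # sparse pairwise alignment
    return float(topk.sum())

# Toy example: two query tokens, each aligned to exactly one document token.
score = aligner_score(np.eye(2), np.ones(2), np.ones(2), k=1)
```

Varying `k` mirrors the paper's observation that factoid-style queries favor few alignments while broader queries favor more, and zeroing saliences mirrors pruning token vectors for efficiency.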
Gecko: Versatile Text Embeddings Distilled from Large Language Models
We present Gecko, a compact and versatile text embedding model. Gecko
achieves strong retrieval performance by leveraging a key idea: distilling
knowledge from large language models (LLMs) into a retriever. Our two-step
distillation process begins with generating diverse, synthetic paired data
using an LLM. Next, we further refine the data quality by retrieving a set of
candidate passages for each query, and relabeling the positive and hard
negative passages using the same LLM. The effectiveness of our approach is
demonstrated by the compactness of Gecko. On the Massive Text Embedding
Benchmark (MTEB), Gecko with 256 embedding dimensions outperforms all existing
entries with 768-dimensional embeddings. Gecko with 768 embedding dimensions
achieves an average score of 66.31, competing with models 7x larger and
embeddings of 5x higher dimensionality.
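The two-step distillation loop might be sketched as below, with `generate_query` and `score` standing in for LLM calls (hypothetical names) and retrieval reduced to brute-force scoring over a passage pool:

```python
def distill_pairs(passages, generate_query, score, k=3):
    """Gecko-style sketch: generate a synthetic query per seed passage,
    retrieve the top-k candidate passages, and let the scorer (an LLM in the
    paper) relabel the positive; remaining candidates become hard negatives.
    Note the relabeled positive may differ from the seed passage."""
    pairs = []
    for seed in passages:
        query = generate_query(seed)
        candidates = sorted(passages, key=lambda p: score(query, p),
                            reverse=True)[:k]
        positive, negatives = candidates[0], candidates[1:]
        pairs.append((query, positive, negatives))
    return pairs

# Toy run with word overlap standing in for the LLM relabeling step.
overlap = lambda q, p: len(set(q.split()) & set(p.split()))
pairs = distill_pairs(["a b", "c d"], generate_query=lambda p: p,
                      score=overlap, k=2)
```

The resulting (query, positive, hard negatives) triples are what a compact retriever would then be trained on, which is how the LLM's knowledge ends up distilled into the embedding model.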
Convergence of the EM Algorithm for Gaussian Mixtures with Unbalanced Mixing Coefficients
The speed of convergence of the Expectation Maximization (EM) algorithm for Gaussian mixture model fitting is known to depend on the amount of overlap among the mixture components. In this paper, we study the impact of mixing coefficients on the convergence of EM. We show that when the mixture components exhibit some overlap, the convergence of EM becomes slower as the dynamic range among the mixing coefficients increases. We propose a deterministic anti-annealing algorithm that significantly improves the speed of convergence of EM for such mixtures with unbalanced mixing coefficients. The proposed algorithm is compared against other standard optimization techniques like BFGS, Conjugate Gradient, and the traditional EM algorithm. Finally, we propose a similar deterministic anti-annealing based algorithm for the Dirichlet process mixture model and demonstrate its advantages over the conventional variational Bayesian approach.
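A tempered E-step conveys the flavor of annealed EM: responsibilities are computed from component densities raised to a temperature beta, with beta = 1 recovering standard EM. The schedule for beta (which distinguishes anti-annealing from classical annealing) and the paper's exact formulation are not reproduced here; this one-dimensional sketch is illustrative only:

```python
import math

def tempered_responsibilities(x, means, variances, weights, beta=1.0):
    """One E-step for a 1-D Gaussian mixture at temperature beta.
    beta = 1 gives the standard EM posterior; an annealing schedule would
    vary beta across iterations."""
    dens = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for m, v, w in zip(means, variances, weights)]
    tempered = [d ** beta for d in dens]    # sharpen/flatten the posterior
    z = sum(tempered)
    return [t / z for t in tempered]

# A point at the first component's mean, with unbalanced mixing coefficients.
r = tempered_responsibilities(0.0, [0.0, 5.0], [1.0, 1.0], [0.9, 0.1], beta=2.0)
```

Temperatures above 1 sharpen the responsibilities and below 1 flatten them; scheduling this temperature is the lever such algorithms use to escape the slow-convergence regime caused by overlapping, unbalanced components.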
Unsupervised alignment of natural language with video
Thesis (Ph. D.)--University of Rochester. Department of Computer Science, 2016.
Today we encounter large amounts of video data, often accompanied by text descriptions (e.g., cooking videos and recipes, videos of wetlab experiments and protocols, movies and scripts). Extracting meaningful information from these multimodal sequences requires aligning the video frames with the corresponding sentences in the text. Previous methods for connecting language and videos relied on manual annotations,
which are often tedious and expensive to collect. In this thesis, we focus on automatically aligning sentences with the corresponding video frames without any direct human supervision.
We first propose two hierarchical generative alignment models, which jointly align each sentence with the corresponding video frames, and each noun in a sentence with the corresponding object in the video frames. Next, we propose several latent-variable discriminative alignment models, which incorporate rich features involving verbs and video actions, and outperform the generative models. Our alignment algorithms are primarily applied to align biological wetlab videos with text instructions. Furthermore, we extend our alignment models for automatically aligning movie scenes with associated scripts and learning word-level translations between language pairs for which bilingual training data is unavailable.
Thesis: By exploiting the temporal ordering constraints between video and associated text, it is possible to automatically align the sentences in the text with the corresponding video frames without any direct human supervision.
SWIFT: Scalable weighted iterative sampling for flow cytometry clustering
Flow cytometry (FC) is a powerful technology for rapid multivariate analysis and functional discrimination of cells. Current FC platforms generate large, high-dimensional datasets which pose a significant challenge for traditional manual bivariate analysis. Automated multivariate clustering, though highly desirable, is also stymied by the critical requirement of identifying rare populations that form rather small clusters, in addition to the computational challenges posed by the large size and dimensionality of the datasets. In this paper, we address these twin challenges by developing a two-stage scalable multivariate parametric clustering algorithm. In the first stage, we model the data as a mixture of Gaussians and use an iterative weighted sampling technique to estimate the mixture components successively in order of decreasing size. In the second stage, we apply a graph-based hierarchical merging technique to combine Gaussian components with significant overlaps into the final number of desired clusters. The resulting algorithm offers a reduction in complexity over conventional mixture modeling while simultaneously allowing for better detection of small populations. We demonstrate the effectiveness of our method both on simulated data and actual flow cytometry datasets. Index Terms — Flow cytometry, clustering, Gaussian mixture model, sampling, expectation-maximization.
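The first stage's weighted resampling can be caricatured as follows, assuming sampling weights proportional to the residual not yet explained by fitted components (an assumption for illustration, not the paper's exact weighting):

```python
import random

def weighted_resample(points, explained_density, n, seed=0):
    """Resample points with weights proportional to how poorly they are
    explained by already-fitted components, so the next (smaller) Gaussian
    is fit to what remains. explained_density maps a point to [0, 1]."""
    rng = random.Random(seed)
    weights = [max(1e-12, 1.0 - explained_density(p)) for p in points]
    return rng.choices(points, weights=weights, k=n)

# Toy run: the point at 0.0 is already well explained, so it is down-weighted.
sample = weighted_resample([0.0, 1.0, 2.0],
                           lambda p: 0.99 if p == 0.0 else 0.0, 5)
```

Fitting components largest-first on such reweighted samples is what lets small, rare populations surface in later iterations instead of being absorbed by dominant clusters.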
Unsupervised Alignment of Natural Language Instructions with Video Segments
We propose an unsupervised learning algorithm for automatically inferring the mappings between English nouns and corresponding video objects. Given a sequence of natural language instructions and an unaligned video recording, we simultaneously align each instruction to its corresponding video segment, and also align nouns in each instruction to their corresponding objects in the video. While existing grounded language acquisition algorithms rely on pre-aligned supervised data (each sentence paired with a corresponding image frame or video segment), our algorithm aims to automatically infer the alignment from the temporal structure of the video and parallel text instructions. We propose two generative models that are closely related to the HMM and IBM Model 1 word alignment models used in statistical machine translation. We evaluate our algorithm on videos of biological experiments performed in wetlabs, and demonstrate its capability of aligning video segments to text instructions and matching video objects to nouns in the absence of any direct supervision.
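For intuition, an IBM-Model-1-style hard alignment given learned translation probabilities t(noun, object) could look like this toy sketch (the names, data, and greedy decision rule are all illustrative; the paper learns t itself without supervision):

```python
def ibm1_align(nouns, video_objects, t):
    """Align each noun to the video object with the highest translation
    probability t[(noun, object)], defaulting to a tiny floor for unseen
    pairs. Greedy per-noun decisions, as in IBM Model 1 decoding."""
    return {n: max(video_objects, key=lambda o: t.get((n, o), 1e-9))
            for n in nouns}

# Hypothetical wetlab vocabulary and learned probabilities.
t = {("beaker", "obj_beaker"): 0.9, ("beaker", "obj_pipette"): 0.1,
     ("pipette", "obj_pipette"): 0.8}
alignment = ibm1_align(["beaker", "pipette"], ["obj_beaker", "obj_pipette"], t)
```

In the full model these translation probabilities would be re-estimated with EM from the sentence-segment alignments, which is what removes the need for pre-aligned supervision.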