Insights into Analogy Completion from the Biomedical Domain
Analogy completion has been a popular task in recent years for evaluating the
semantic properties of word embeddings, but the standard methodology makes a
number of assumptions about analogies that do not always hold, either in recent
benchmark datasets or when expanding into other domains. Through an analysis of
analogies in the biomedical domain, we identify three assumptions: that of a
Single Answer for any given analogy, that the pairs involved describe the Same
Relationship, and that each pair is Informative with respect to the other. We
propose modifying the standard methodology to relax these assumptions by
allowing for multiple correct answers, reporting MAP and MRR in addition to
accuracy, and using multiple example pairs. We further present BMASS, a novel
dataset for evaluating linguistic regularities in biomedical embeddings, and
demonstrate that the relationships described in the dataset pose significant
semantic challenges to current word embedding methods.
Comment: Accepted to BioNLP 2017. (10 pages
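The evaluation change described in this abstract (multiple correct answers, with MAP and MRR reported alongside accuracy) can be illustrated with a minimal sketch. The function names and the toy data are assumptions made for illustration, and average precision is normalized here by the number of gold answers, which may differ from the exact definition used in the paper.

```python
from typing import Dict, List, Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], gold: Set[str]) -> float:
    """1 / rank of the first correct answer, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            return 1.0 / rank
    return 0.0


def average_precision(ranked: Sequence[str], gold: Set[str]) -> float:
    """Mean of precision@rank over the ranks where a correct answer appears.

    Assumption: normalized by the total number of gold answers.
    """
    if not gold:
        return 0.0
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in gold:
            hits += 1
            total += hits / rank
    return total / len(gold)


def evaluate(queries: List[Tuple[Sequence[str], Set[str]]]) -> Dict[str, float]:
    """Accuracy (top-1 is any gold answer), MRR, and MAP over all analogies."""
    n = len(queries)
    accuracy = sum(1.0 for ranked, gold in queries if ranked and ranked[0] in gold) / n
    mrr = sum(reciprocal_rank(ranked, gold) for ranked, gold in queries) / n
    mean_ap = sum(average_precision(ranked, gold) for ranked, gold in queries) / n
    return {"accuracy": accuracy, "mrr": mrr, "map": mean_ap}


# Toy example: each query is (ranked predictions, set of acceptable answers).
toy_queries = [
    (["a", "b", "c"], {"b", "c"}),
    (["x", "y"], {"x"}),
]
scores = evaluate(toy_queries)
```

Accuracy only rewards a correct top-1 prediction, whereas MRR and MAP give partial credit when correct answers appear lower in the ranking, which matters once an analogy can have several valid completions.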
Selective Demonstrations for Cross-domain Text-to-SQL
Large language models (LLMs) with in-context learning have demonstrated
impressive generalization capabilities in the cross-domain text-to-SQL task,
without the use of in-domain annotations. However, incorporating in-domain
demonstration examples has been found to greatly enhance LLMs' performance. In
this paper, we delve into the key factors within in-domain examples that
contribute to the improvement and explore whether we can harness these benefits
without relying on in-domain annotations. Based on our findings, we propose a
demonstration selection framework ODIS which utilizes both out-of-domain
examples and synthetically generated in-domain examples to construct
demonstrations. By retrieving demonstrations from hybrid sources, ODIS
leverages the advantages of both, showcasing its effectiveness compared to
baseline methods that rely on a single data source. Furthermore, ODIS
outperforms state-of-the-art approaches on two cross-domain text-to-SQL
datasets, with improvements of 1.1 and 11.8 points in execution accuracy,
respectively.
Comment: EMNLP 202
How to Prompt LLMs for Text-to-SQL: A Study in Zero-shot, Single-domain, and Cross-domain Settings
Large language models (LLMs) with in-context learning have demonstrated
remarkable capability in the text-to-SQL task. Previous research has prompted
LLMs with various demonstration-retrieval strategies and intermediate reasoning
steps to enhance the performance of LLMs. However, those works often employ
varied strategies when constructing the prompt text for text-to-SQL inputs,
such as databases and demonstration examples. This leads to a lack of
comparability in both the prompt constructions and their primary contributions.
Furthermore, selecting an effective prompt construction has emerged as a
persistent problem for future research. To address this limitation, we
comprehensively investigate the impact of prompt constructions across various
settings and provide insights for future work.
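As a rough illustration of what "prompt construction" means in this setting, the sketch below assembles a text-to-SQL prompt from a database schema, optional demonstration pairs, and a question. The layout (schema as CREATE TABLE text, questions as SQL comments, a trailing `SELECT`) and the `build_prompt` helper are assumptions for illustration only, not the constructions evaluated in the paper.

```python
from typing import List, Tuple


def build_prompt(schema_sql: str,
                 demonstrations: List[Tuple[str, str]],
                 question: str) -> str:
    """Concatenate schema, demonstrations, and the target question.

    An empty demonstration list yields a zero-shot prompt.
    """
    parts = [schema_sql.strip()]
    for demo_question, demo_sql in demonstrations:
        parts.append(f"-- Question: {demo_question}\n{demo_sql.strip()}")
    # End with an open SELECT so the model continues with the query body.
    parts.append(f"-- Question: {question}\nSELECT")
    return "\n\n".join(parts)


# Hypothetical schema and demonstration, used only to exercise the helper.
schema = "CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"
demos = [("How many singers are there?", "SELECT COUNT(*) FROM singer;")]

zero_shot_prompt = build_prompt(schema, [], "List the names of all singers.")
few_shot_prompt = build_prompt(schema, demos, "List the names of all singers.")
```

Keeping the construction in one function makes it easier to vary a single factor, such as the schema format or the number of demonstrations, while holding the rest of the prompt fixed.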
Writing habits and telltale neighbors: analyzing clinical concept usage patterns with sublanguage embeddings
Natural language processing techniques are being applied to increasingly
diverse types of electronic health records, and can benefit from in-depth
understanding of the distinguishing characteristics of medical document types.
We present a method for characterizing the usage patterns of clinical concepts
among different document types, in order to capture semantic differences beyond
the lexical level. By training concept embeddings on clinical documents of
different types and measuring the differences in their nearest neighborhood
structures, we are able to measure divergences in concept usage while
correcting for noise in embedding learning. Experiments on the MIMIC-III corpus
demonstrate that our approach captures clinically-relevant differences in
concept usage and provides an intuitive way to explore semantic characteristics
of clinical document collections.
Comment: LOUHI 2019 (co-located with EMNLP
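The idea of comparing nearest-neighborhood structures across embeddings trained on different document types can be sketched as follows. This is a simplified illustration under stated assumptions: the overlap measure (shared fraction of the top-k cosine neighbors), the toy vectors, and the document-type names are hypothetical, and the noise correction described in the abstract is not reproduced.

```python
import math
from typing import Dict, List, Set

Embedding = Dict[str, List[float]]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(emb: Embedding, concept: str, k: int) -> Set[str]:
    """The k concepts most similar to `concept` within one embedding space."""
    others = [c for c in emb if c != concept]
    others.sort(key=lambda c: cosine(emb[concept], emb[c]), reverse=True)
    return set(others[:k])


def neighborhood_overlap(emb_a: Embedding, emb_b: Embedding,
                         concept: str, k: int) -> float:
    """Fraction of top-k neighbors shared by the two spaces (1.0 = identical)."""
    shared = nearest_neighbors(emb_a, concept, k) & nearest_neighbors(emb_b, concept, k)
    return len(shared) / k


# Hypothetical concept vectors from two document types (toy values).
discharge_notes: Embedding = {
    "fever": [1.0, 0.0], "chills": [0.9, 0.1],
    "cough": [0.7, 0.3], "fracture": [0.0, 1.0],
}
radiology_reports: Embedding = {
    "fever": [1.0, 0.0], "chills": [0.1, 0.9],
    "cough": [0.8, 0.2], "fracture": [0.6, 0.4],
}

overlap = neighborhood_overlap(discharge_notes, radiology_reports, "fever", k=2)
```

Because only neighbor sets within each space are compared, the two embedding spaces never need to be aligned with each other; a low overlap suggests the concept is used differently in the two document types.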
Characterizing the impact of geometric properties of word embeddings on task performance
Analysis of word embedding properties to inform their use in downstream NLP
tasks has largely been studied by assessing nearest neighbors. However,
geometric properties of the continuous feature space contribute directly to the
use of embedding features in downstream models, and are largely unexplored. We
consider four properties of word embedding geometry, namely: position relative
to the origin, distribution of features in the vector space, global pairwise
distances, and local pairwise distances. We define a sequence of
transformations to generate new embeddings that expose subsets of these
properties to downstream models and evaluate change in task performance to
understand the contribution of each property to NLP models. We transform
publicly available pretrained embeddings from three popular toolkits (word2vec,
GloVe, and FastText) and evaluate on a variety of intrinsic tasks, which model
linguistic information in the vector space, and extrinsic tasks, which use
vectors as input to machine learning models. We find that intrinsic evaluations
are highly sensitive to absolute position, while extrinsic tasks rely primarily
on local similarity. Our findings suggest that future embedding models and
post-processing techniques should focus primarily on similarity to nearby
points in vector space.
Comment: Appearing in the Third Workshop on Evaluating Vector Space
Representations for NLP (RepEval 2019). 7 pages + reference
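To make the distinction between "position relative to the origin" and "pairwise distances" concrete, the sketch below applies one simple transformation, a translation, to a toy embedding. This is not the sequence of transformations defined in the paper; the `translate` helper and the vectors are assumptions for illustration. Translation leaves Euclidean distances between points unchanged while changing their cosine similarity, which depends on the origin.

```python
import math
from typing import List

Vector = List[float]


def translate(vectors: List[Vector], offset: Vector) -> List[Vector]:
    """Shift every vector by the same offset (moves the cloud away from the origin)."""
    return [[x + o for x, o in zip(v, offset)] for v in vectors]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity, which depends on direction from the origin."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


# Two orthogonal toy vectors.
original = [[1.0, 0.0], [0.0, 1.0]]
shifted = translate(original, [5.0, 5.0])

dist_before = math.dist(original[0], original[1])
dist_after = math.dist(shifted[0], shifted[1])
cos_before = cosine(original[0], original[1])
cos_after = cosine(shifted[0], shifted[1])
```

Comparing the values before and after the shift shows which measurements a given transformation preserves, which is the kind of controlled change the abstract describes using to isolate the contribution of each geometric property.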
Automated Coding of Under-Studied Medical Concept Domains: Linking Physical Activity Reports to the International Classification of Functioning, Disability, and Health
Linking clinical narratives to standardized vocabularies and coding systems
is a key component of unlocking the information in medical text for analysis.
However, many domains of medical concepts lack well-developed terminologies
that can support effective coding of medical text. We present a framework for
developing natural language processing (NLP) technologies for automated coding
of under-studied types of medical information, and demonstrate its
applicability via a case study on physical mobility function. Mobility is a
component of many health measures, from post-acute care and surgical outcomes
to chronic frailty and disability, and is coded in the International
Classification of Functioning, Disability, and Health (ICF). However, mobility
and other types of functional activity remain under-studied in medical
informatics, and neither the ICF nor commonly-used medical terminologies
capture functional status terminology in practice. We investigated two
data-driven paradigms, classification and candidate selection, to link
narrative observations of mobility to standardized ICF codes, using a dataset
of clinical narratives from physical therapy encounters. Recent advances in
language modeling and word embedding were used as features for established
machine learning models and a novel deep learning approach, achieving a macro
F-1 score of 84% on linking mobility activity reports to ICF codes. Both
classification and candidate selection approaches present distinct strengths
for automated coding in under-studied domains, and we highlight that the
combination of (i) a small annotated data set; (ii) expert definitions of codes
of interest; and (iii) a representative text corpus is sufficient to produce
high-performing automated coding systems. This study has implications for the
ongoing growth of NLP tools for a variety of specialized applications in
clinical care and research.
Comment: Updated final version, published in Frontiers in Digital Health,
https://doi.org/10.3389/fdgth.2021.620828. 34 pages (23 text + 11
references); 9 figures, 2 table
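The macro F-1 score reported in this abstract averages per-code F-1 scores so that rare codes count as much as frequent ones. A minimal sketch of that computation follows; the label names are placeholders rather than real ICF codes, and averaging over the union of gold and predicted labels is an assumption (some evaluations average over gold labels only).

```python
from typing import List


def macro_f1(gold: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-label F-1 scores."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F-1 = 2TP / (2TP + FP + FN); defined as 0.0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Placeholder labels standing in for codes assigned to mobility reports.
gold_codes = ["code_A", "code_A", "code_B", "code_B", "code_C"]
pred_codes = ["code_A", "code_B", "code_B", "code_B", "code_C"]
score = macro_f1(gold_codes, pred_codes)
```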
End-to-End real time tracking of children's reading with pointer network
In this work, we explore how a real-time reading tracker can be built
efficiently for children's voices. While previously proposed reading trackers
focused on ASR-based cascaded approaches, we propose a fully end-to-end model,
which makes it less prone to lags in voice tracking. We employ a pointer network
that directly learns to predict positions in the ground truth text conditioned
on the streaming speech. To train this pointer network, we generate ground
truth training signals by using forced alignment between the read speech and
the text being read on the training set. Exploring different forced alignment
models, we find that a neural attention-based model is at least comparable in
alignment accuracy to the Montreal Forced Aligner but, surprisingly, provides a
better training signal for the pointer network. Our results are reported on one
adult speech dataset (TIMIT) and two children's speech datasets (CMU Kids and
Reading Races). Our best model can accurately track adult speech with 87.8%
accuracy and the much harder and disfluent children's speech with 77.1%
accuracy on the CMU Kids data and 65.3% accuracy on the Reading Races dataset.
Comment: 5 pages, 3 figure