Generate to Understand for Representation
In recent years, a significant number of high-quality pretrained models have
emerged, greatly impacting Natural Language Understanding (NLU), Natural
Language Generation (NLG), and Text Representation tasks. Traditionally, these
models are pretrained on custom domain corpora and finetuned for specific
tasks, resulting in high costs related to GPU usage and labor. Unfortunately,
recent trends in language modeling have shifted towards enhancing performance
through scaling, further exacerbating the associated costs.
Introducing GUR: a pretraining framework that combines language modeling and
contrastive learning objectives in a single training step. We select similar
text pairs based on their Longest Common Substring (LCS) from raw unlabeled
documents and train the model using masked language modeling and unsupervised
contrastive learning. The resulting model, GUR, achieves impressive results
without any labeled training data, outperforming all other pretrained baselines
as a retriever on the recall benchmark in a zero-shot setting. Additionally,
GUR maintains its language modeling ability, as demonstrated in our ablation
experiment. Our code is available at https://github.com/laohur/GUR.
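The pair-selection step described in this abstract can be illustrated with a minimal Python sketch. It is not taken from the GUR repository: the function names, the `min_ratio` threshold, and the rule "keep a pair when the common substring covers enough of the shorter text" are all assumptions made for illustration only.

```python
def longest_common_substring(a: str, b: str) -> str:
    """Return the longest contiguous substring shared by a and b.

    Standard dynamic programming over suffix-match lengths, O(len(a) * len(b)).
    """
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the match that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def select_pairs(sentences, min_ratio=0.5):
    """Hypothetical selection rule (an assumption, not the paper's):
    keep a pair when the LCS covers at least min_ratio of the shorter text."""
    pairs = []
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            s, t = sentences[i], sentences[j]
            shorter = min(len(s), len(t))
            if shorter == 0:
                continue
            if len(longest_common_substring(s, t)) / shorter >= min_ratio:
                pairs.append((s, t))
    return pairs


docs = [
    "the cat sat on the mat",
    "a cat sat on the rug",
    "quantum field theory",
]
positive_pairs = select_pairs(docs)
```

Pairs selected this way would then serve as positives for the contrastive objective. The quadratic all-pairs loop is only practical for small inputs; a real pipeline would need some form of candidate filtering.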
Mira: A Framework for Static Performance Analysis
The performance model of an application can provide understanding about its
runtime behavior on particular hardware. Such information can be analyzed by
developers for performance tuning. However, model building and analysis are
frequently ignored during software development until performance problems
arise because they require significant expertise and can involve many
time-consuming application runs. In this paper, we propose a fast, accurate,
flexible and user-friendly tool, Mira, for generating performance models by
applying static program analysis, targeting scientific applications running on
supercomputers. We parse both the source code and binary to estimate
performance attributes with better accuracy than considering just source or
just binary code. Because our analysis is static, the target program does not
need to be executed on the target architecture, which enables users to perform
analysis on available machines instead of conducting expensive experiments on
potentially expensive resources. Moreover, statically generated models enable
performance prediction on non-existent or unavailable architectures. In
addition to flexibility, because model generation time is significantly reduced
compared to dynamic analysis approaches, our method is suitable for rapid
application performance analysis and improvement. We present several scientific
application validation results to demonstrate the current capabilities of our
approach on small benchmarks and a mini application.
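The general idea of estimating a performance attribute without running the program can be illustrated with a small Python sketch. This is only an analogy and not Mira itself: Mira analyzes the source and binary of compiled scientific codes, whereas the sketch below walks a Python syntax tree with the standard `ast` module. The function names are hypothetical, only `for ... in range(<int literal>)` loops are handled, and both sides of any branch are counted, so the result is an upper bound.

```python
import ast


def _trip_count(loop: ast.For) -> int:
    """Trip count of `for _ in range(<int literal>)`; other loops are unsupported."""
    it = loop.iter
    if (isinstance(it, ast.Call) and isinstance(it.func, ast.Name)
            and it.func.id == "range" and len(it.args) == 1 and not it.keywords
            and isinstance(it.args[0], ast.Constant)
            and isinstance(it.args[0].value, int)):
        return max(it.args[0].value, 0)
    raise ValueError("only range(<int literal>) loops are handled in this sketch")


def count_binary_ops(node: ast.AST, multiplier: int = 1) -> int:
    """Estimate how many arithmetic operations would execute, without running the code.

    Each BinOp (e.g. `i * j`) and AugAssign (e.g. `total += ...`) counts once,
    scaled by the product of the trip counts of the enclosing loops.
    """
    total = multiplier if isinstance(node, (ast.BinOp, ast.AugAssign)) else 0
    if isinstance(node, ast.For):
        # The iterable expression is evaluated once; the body runs trip-count times.
        total += count_binary_ops(node.iter, multiplier)
        inner = multiplier * _trip_count(node)
        for stmt in node.body:
            total += count_binary_ops(stmt, inner)
        for stmt in node.orelse:
            total += count_binary_ops(stmt, multiplier)
        return total
    for child in ast.iter_child_nodes(node):
        total += count_binary_ops(child, multiplier)
    return total


SOURCE = """
total = 0
for i in range(100):
    for j in range(50):
        total += i * j
"""
estimate = count_binary_ops(ast.parse(SOURCE))
```

Here `estimate` is derived purely from the program text: two operations per inner iteration, multiplied by the 100 x 50 iteration space. A model of this kind could be parameterized by per-operation costs for a target architecture, which is the sort of prediction the abstract says static models make possible on unavailable hardware.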
Towards Automatic Generation of Shareable Synthetic Clinical Notes Using Neural Language Models
Large-scale clinical data is invaluable to driving many computational
scientific advances today. However, understandable concerns regarding patient
privacy hinder the open dissemination of such data and give rise to suboptimal
siloed research. De-identification methods attempt to address these concerns
but were shown to be susceptible to adversarial attacks. In this work, we focus
on the vast amounts of unstructured natural language data stored in clinical
notes and propose to automatically generate synthetic clinical notes that are
more amenable to sharing using generative models trained on real de-identified
records. To evaluate the merit of such notes, we measure both their privacy
preservation properties and their utility in training clinical NLP models.
Experiments using neural language models yield notes whose utility is close to
that of the real ones in some clinical NLP tasks, yet leave ample room for
future improvements.
Comment: Clinical NLP Workshop 201