579 research outputs found
Biomedical Language Models are Robust to Sub-optimal Tokenization
As opposed to general English, many concepts in biomedical terminology have
been designed in recent history by biomedical professionals with the goal of
being precise and concise. This is often achieved by concatenating meaningful
biomedical morphemes to create new semantic units. Nevertheless, most modern
biomedical language models (LMs) are pre-trained using standard domain-specific
tokenizers derived from large-scale biomedical corpus statistics without
explicitly leveraging the agglutinating nature of biomedical language. In this
work, we first find that standard open-domain and biomedical tokenizers are
largely unable to segment biomedical terms into meaningful components.
Therefore, we hypothesize that using a tokenizer which segments biomedical
terminology more accurately would enable biomedical LMs to improve their
performance on downstream biomedical NLP tasks, especially ones which involve
biomedical terms directly such as named entity recognition (NER) and entity
linking. Surprisingly, we find that pre-training a biomedical LM using a more
accurate biomedical tokenizer does not improve the entity representation
quality of a language model as measured by several intrinsic and extrinsic
measures such as masked language modeling prediction (MLM) accuracy as well as
NER and entity linking performance. These quantitative findings, along with a
case study which explores entity representation quality more directly, suggest
that the biomedical pre-training process is quite robust to instances of
sub-optimal tokenization.
Comment: BioNLP @ ACL 202
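As a rough illustration of the segmentation issue the abstract describes, the sketch below tokenizes a biomedical term with two off-the-shelf subword tokenizers and compares the output to a morpheme-level split. The checkpoints and the example term are illustrative assumptions, not the paper's experimental setup.

```python
# Sketch (assumptions: checkpoints and example term are illustrative only).
from transformers import AutoTokenizer

term = "nasopharyngitis"                  # naso + pharyng + itis (nose, throat, inflammation)
morphemes = ["naso", "pharyng", "itis"]   # a meaningful, morpheme-level segmentation

for name in ["bert-base-uncased", "dmis-lab/biobert-base-cased-v1.1"]:
    tok = AutoTokenizer.from_pretrained(name)
    pieces = tok.tokenize(term)
    # Standard WordPiece vocabularies typically split the term into arbitrary
    # fragments rather than along the morpheme boundaries listed above.
    print(f"{name}: {pieces} (morphemes: {morphemes})")
```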
Is Robustness the Cost of Accuracy? -- A Comprehensive Study on the Robustness of 18 Deep Image Classification Models
Prediction accuracy has long been the sole standard for comparing the
performance of different image classification models, including in the
ImageNet competition. However, recent studies have highlighted the lack of
robustness in well-trained deep neural networks to adversarial examples.
Visually imperceptible perturbations to natural images can easily be crafted
to mislead image classifiers. To demystify the
trade-offs between robustness and accuracy, in this paper we thoroughly
benchmark 18 ImageNet models using multiple robustness metrics, including the
distortion, success rate and transferability of adversarial examples between
306 pairs of models. Our extensive experimental results reveal several new
insights: (1) linear scaling law - the empirical $\ell_2$ and $\ell_\infty$
distortion metrics scale linearly with the logarithm of classification error;
(2) model architecture is a more critical factor to robustness than model size,
and the disclosed accuracy-robustness Pareto frontier can be used as an
evaluation criterion for ImageNet model designers; (3) for a similar network
architecture, increasing network depth slightly improves robustness in
$\ell_\infty$ distortion; (4) there exist models (in the VGG family) that exhibit
high adversarial transferability, while most adversarial examples crafted from
one model can only be transferred within the same family. Experiment code is
publicly available at \url{https://github.com/huanzhang12/Adversarial_Survey}.
Comment: Accepted by the European Conference on Computer Vision (ECCV) 201
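For readers unfamiliar with the distortion metrics used in such robustness benchmarks, the sketch below crafts a single-step (FGSM-style) adversarial example and reports its $\ell_2$ and $\ell_\infty$ distortion. The model, input, label, and perturbation budget are placeholders rather than the paper's benchmark configuration.

```python
# Sketch (assumptions: model, input, label, and epsilon are placeholders).
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.resnet18(weights=None).eval()        # any ImageNet classifier would do
x = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for a natural image
y = torch.tensor([7])                                # stand-in label

loss = F.cross_entropy(model(x), y)
loss.backward()

eps = 2.0 / 255                                      # perturbation budget
x_adv = (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()

delta = x_adv - x.detach()
print(f"L2 distortion:   {delta.norm(p=2).item():.4f}")
print(f"Linf distortion: {delta.abs().max().item():.4f}")
```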
MagicBrush: A Manually Annotated Dataset for Instruction-Guided Image Editing
Text-guided image editing is widely needed in daily life, ranging from
personal use to professional applications such as Photoshop. However, existing
methods are either zero-shot or trained on an automatically synthesized
dataset, which contains a high volume of noise. Thus, they still require lots
of manual tuning to produce desirable outcomes in practice. To address this
issue, we introduce MagicBrush (https://osu-nlp-group.github.io/MagicBrush/),
the first large-scale, manually annotated dataset for instruction-guided real
image editing that covers diverse scenarios: single-turn, multi-turn,
mask-provided, and mask-free editing. MagicBrush comprises over 10K manually
annotated triplets (source image, instruction, target image), which supports
training large-scale text-guided image editing models. We fine-tune
InstructPix2Pix on MagicBrush and show that the new model can produce much
better images according to human evaluation. We further conduct extensive
experiments to evaluate current image editing baselines from multiple
dimensions including quantitative, qualitative, and human evaluations. The
results reveal the challenging nature of our dataset and the gap between
current baselines and real-world editing needs.
Comment: NeurIPS 2023; Website: https://osu-nlp-group.github.io/MagicBrush
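The core unit of the dataset is the (source image, instruction, target image) triplet, optionally accompanied by an edit mask and a turn index for multi-turn sessions. The sketch below is a hypothetical schema for such an example; the field names and the sample instruction are assumptions, not the dataset's released format.

```python
# Sketch (assumptions: field names and the example instruction are illustrative).
from dataclasses import dataclass
from typing import Optional
from PIL import Image

@dataclass
class EditingExample:
    source_image: Image.Image           # image to be edited
    instruction: str                    # e.g. "let the cat wear a scarf" (made-up)
    target_image: Image.Image           # manually created ground-truth edit
    mask: Optional[Image.Image] = None  # edit region; None in the mask-free setting
    turn: int = 1                       # position within a multi-turn session

def to_training_pair(example: EditingExample):
    """Turn one triplet into (input image, text condition, target image) for
    fine-tuning an instruction-guided editor such as InstructPix2Pix."""
    return example.source_image, example.instruction, example.target_image
```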
GPT-4V(ision) is a Generalist Web Agent, if Grounded
The recent development of large multimodal models (LMMs), especially
GPT-4V(ision) and Gemini, has been quickly expanding the capability boundaries
of multimodal models beyond traditional tasks like image captioning and visual
question answering. In this work, we explore the potential of LMMs like GPT-4V
as a generalist web agent that can follow natural language instructions to
complete tasks on any given website. We propose SEEACT, a generalist web agent
that harnesses the power of LMMs for integrated visual understanding and acting
on the web. We evaluate on the recent MIND2WEB benchmark. In addition to
standard offline evaluation on cached websites, we enable a new online
evaluation setting by developing a tool that allows running web agents on live
websites. We show that GPT-4V presents great potential for web agents -- it
can successfully complete 51.1% of the tasks on live websites if we manually
ground its textual plans into actions on the websites. This substantially
outperforms text-only LLMs like GPT-4 or smaller models (FLAN-T5 and BLIP-2)
specifically fine-tuned for web agents. However, grounding remains a
major challenge. Existing LMM grounding strategies like set-of-mark prompting
turn out to be ineffective for web agents, and the best grounding strategy
we develop in this paper leverages both the HTML structure and visuals. Yet,
there is still a substantial gap with oracle grounding, leaving ample room for
further improvement. All code, data, and evaluation tools are available at
https://github.com/OSU-NLP-Group/SeeAct
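The grounding problem described above amounts to mapping a textual plan onto a concrete element of the page, using both the HTML and the screenshot. The sketch below outlines that step; the `query_lmm` helper and the candidate format are hypothetical stand-ins, not SeeAct's actual interface.

```python
# Sketch (assumptions: query_lmm and the Candidate format are hypothetical).
from dataclasses import dataclass

@dataclass
class Candidate:
    element_id: str    # identifier the agent can act on
    html_snippet: str  # e.g. '<button id="search-btn">Search</button>'

def query_lmm(prompt: str, image: bytes) -> str:
    """Placeholder for a GPT-4V-style multimodal API call returning an index."""
    raise NotImplementedError("wire this to your multimodal model endpoint")

def ground_plan(plan: str, screenshot_png: bytes, candidates: list[Candidate]) -> Candidate:
    """Ask the multimodal model which candidate element realizes the plan,
    combining the HTML structure with the page screenshot."""
    listing = "\n".join(f"[{i}] {c.html_snippet}" for i, c in enumerate(candidates))
    prompt = (
        f"Plan: {plan}\n"
        f"Candidate elements:\n{listing}\n"
        "Answer with the index of the element to act on."
    )
    answer = query_lmm(prompt, image=screenshot_png)
    return candidates[int(answer.strip())]
```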
Federated Learning for Semantic Parsing: Task Formulation, Evaluation Setup, New Algorithms
This paper studies a new task of federated learning (FL) for semantic
parsing, where multiple clients collaboratively train one global model without
sharing their semantic parsing data. By leveraging data from multiple clients,
the FL paradigm can be especially beneficial for clients that have little
training data to develop a data-hungry neural semantic parser on their own. We
propose an evaluation setup to study this task, where we re-purpose widely-used
single-domain text-to-SQL datasets as clients to form a realistic heterogeneous
FL setting and collaboratively train a global model. As standard FL algorithms
suffer from the high client heterogeneity in our realistic setup, we further
propose a novel LOss Reduction Adjusted Re-weighting (Lorar) mechanism to
mitigate the performance degradation, which adjusts each client's contribution
to the global model update based on its training loss reduction during each
round. Our intuition is that the larger the loss reduction, the further away
the current global model is from the client's local optimum, and the larger
the weight the client should get. By applying Lorar to three widely adopted FL
algorithms (FedAvg, FedOPT and FedProx), we observe that their performance can
be improved substantially on average (4%-20% absolute gain under MacroAvg) and
that clients with smaller datasets enjoy larger performance gains. In addition,
the global model converges faster for almost all the clients.
Comment: ACL 2023 long paper
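The re-weighting intuition lends itself to a compact aggregation rule: scale each client's update by its per-round training loss reduction before averaging. The sketch below is one way to express that idea on top of FedAvg; the normalization used here is an illustrative assumption, not the paper's exact formula.

```python
# Sketch (assumption: the weighting/normalization is illustrative, not Lorar's exact rule).
import torch

def aggregate(global_params, client_updates, loss_reductions, eps=1e-8):
    """
    global_params:   dict[str, Tensor], current global model parameters
    client_updates:  list of dict[str, Tensor], each client's (local - global) delta
    loss_reductions: list of float, loss_before - loss_after for each client this round
    """
    # Clients whose loss dropped more are assumed to sit further from the current
    # global model's optimum, so their updates receive larger aggregation weights.
    reductions = torch.tensor([max(r, 0.0) for r in loss_reductions])
    weights = reductions / (reductions.sum() + eps)

    new_params = {}
    for name, value in global_params.items():
        delta = sum(w * update[name] for w, update in zip(weights, client_updates))
        new_params[name] = value + delta
    return new_params
```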