Satellite Workshop On Language, Artificial Intelligence and Computer Science for Natural Language Processing Applications (LAICS-NLP): Discovery of Meaning from Text
This paper proposes a novel method to disambiguate important words from a collection of documents. The
hypothesis that underlies this approach is that there is a
minimal set of senses that are significant in characterizing a context. We extend Yarowsky’s one sense
per discourse [13] further to a collection of related
documents rather than a single document. We perform
distributed clustering on a set of features representing
each of the top ten categories of documents in the
Reuters-21578 dataset. Groups of terms that have a
similar term distributional pattern across documents were
identified. WordNet-based similarity measurement was
then computed for terms within each cluster. An
aggregation of the associations in WordNet that was
employed to ascertain term similarity within clusters has
provided a means of identifying clusters’ root senses.
The Automatic Detection of Dataset Names in Scientific Articles
We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem. Noticing that available annotated datasets were not adequate for our goals, we annotated 6000 sentences extracted from four major AI conferences, with roughly half of them containing one or more named datasets. A distinguishing feature of this set is the many sentences using enumerations, conjunctions and ellipses, resulting in long BI+ tag sequences. On all measures, the SciBERT NER tagger performed best and most robustly. Our baseline rule-based tagger performed remarkably well and better than several state-of-the-art methods. The gold standard dataset, with links and offsets from each sentence to the (open-access) articles, together with the annotation guidelines and all code used in the experiments, is available on GitHub.
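To illustrate what a long BI+ tag sequence looks like, the following minimal sketch decodes B/I/O tags into mentions. The example sentence, its tags, the choice to tag a whole enumeration as one contiguous mention, and the function name `extract_mentions` are all assumptions made for illustration; none of them come from the paper's corpus or code.

```python
from typing import List, Tuple


def extract_mentions(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Decode B/I/O tags into (start, end_exclusive, text) mentions."""
    mentions = []
    start = None  # index where the currently open mention began, if any
    for i, tag in enumerate(tags):
        if tag == "B" or (tag == "I" and start is None):
            # A new mention begins; close the previous one first.
            if start is not None:
                mentions.append((start, i, " ".join(tokens[start:i])))
            start = i
        elif tag == "O" and start is not None:
            # An outside tag ends the open mention.
            mentions.append((start, i, " ".join(tokens[start:i])))
            start = None
        # An "I" tag with an open mention simply extends it.
    if start is not None:
        mentions.append((start, len(tokens), " ".join(tokens[start:])))
    return mentions


# Hypothetical sentence: one B tag followed by six I tags.
tokens = ["We", "test", "on", "SQuAD", "1.1", ",", "SQuAD", "2.0", "and", "NewsQA", "."]
tags = ["O", "O", "O", "B", "I", "I", "I", "I", "I", "I", "O"]
print(extract_mentions(tokens, tags))
```

Under this assumed annotation, the enumeration yields a single seven-token mention, which is the kind of long sequence a tagger has to keep consistent from the first token to the last.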
Pushing the Limits of ChatGPT on NLP Tasks
Despite the success of ChatGPT, its performance on most NLP tasks is still
well below the supervised baselines. In this work, we looked into the causes,
and discovered that its subpar performance was caused by the following factors:
(1) the token limit in the prompt does not allow for the full utilization of the
supervised datasets; (2) a mismatch between the generative nature of ChatGPT and
NLP tasks; (3) intrinsic pitfalls of LLMs, e.g., hallucination, excessive
focus on certain keywords, etc.
In this work, we propose a collection of general modules to address these
issues, in an attempt to push the limits of ChatGPT on NLP tasks. Our proposed
modules include (1) a one-input-multiple-prompts strategy that employs multiple
prompts for one input to accommodate more demonstrations; (2) using fine-tuned
models for better demonstration retrieval; (3) transforming tasks to formats
that are more tailored to the generation nature; (4) employing reasoning
strategies that are tailored to addressing the task-specific complexity; (5)
the self-verification strategy to address the hallucination issue of LLMs; (6)
the paraphrase strategy to improve the robustness of model predictions.
We conduct experiments on 21 datasets of 10 representative NLP tasks,
including question answering, commonsense reasoning, natural language
inference, sentiment analysis, named entity recognition, entity-relation
extraction, event extraction, dependency parsing, semantic role labeling, and
part-of-speech tagging. Using the proposed collection of techniques, we are able
to significantly boost the performance of ChatGPT on the selected NLP tasks,
achieving performance comparable to or better than supervised baselines, or
even existing SOTA performance.
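The one-input-multiple-prompts idea can be sketched as follows: the demonstrations are split across several prompts for the same input, and the per-prompt predictions are then combined. This is a minimal sketch under stated assumptions. The prompt template, the function names `build_prompts` and `vote`, the `predict` callable standing in for a model call, and the use of majority voting to combine answers are all hypothetical; the abstract does not specify them.

```python
from collections import Counter
from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (input text, label)


def build_prompts(demos: List[Demo], query: str, demos_per_prompt: int) -> List[str]:
    """Spread the demonstrations over several prompts that share one query."""
    prompts = []
    for i in range(0, len(demos), demos_per_prompt):
        chunk = demos[i:i + demos_per_prompt]
        parts = [f"Input: {x}\nLabel: {y}" for x, y in chunk]
        parts.append(f"Input: {query}\nLabel:")
        prompts.append("\n\n".join(parts))
    return prompts


def vote(prompts: List[str], predict: Callable[[str], str]) -> str:
    """Query the model once per prompt and return the most common answer.

    Majority voting is an assumption made for this sketch.
    """
    answers = [predict(p) for p in prompts]
    return Counter(answers).most_common(1)[0][0]


demos = [("good", "pos"), ("bad", "neg"), ("fine", "pos"), ("awful", "neg"), ("nice", "pos")]
prompts = build_prompts(demos, "okay", demos_per_prompt=2)


# Stand-in for a real model call (hypothetical): answers "pos" if any
# demonstration in the prompt is labelled "pos".
def fake_predict(prompt: str) -> str:
    return "pos" if "Label: pos" in prompt else "neg"


print(len(prompts), vote(prompts, fake_predict))
```

Splitting this way lets every demonstration reach the model without any single prompt exceeding its token limit, at the cost of one model call per prompt.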
Splitting Arabic Texts into Elementary Discourse Units
In this article, we propose the first work that investigates the feasibility of Arabic discourse segmentation into elementary discourse units within the segmented discourse representation theory framework. We first describe our annotation scheme that defines a set of principles to guide the segmentation process. Two corpora have been annotated according to this scheme: elementary school textbooks and newspaper documents extracted from the syntactically annotated Arabic Treebank. Then, we propose a multiclass supervised learning approach that predicts nested units. Our approach uses a combination of punctuation, morphological, lexical, and shallow syntactic features. We investigate how each feature contributes to the learning process. We show that an extensive morphological analysis is crucial to achieve good results in both corpora. In addition, we show that adding chunks does not boost the performance of our system.
Machine Reading Comprehension using Case-based Reasoning
We present an accurate and interpretable method for answer extraction in
machine reading comprehension that is reminiscent of case-based reasoning (CBR)
from classical AI. Our method (CBR-MRC) builds on the hypothesis that
contextualized answers to similar questions share semantic similarities with
each other. Given a target question, CBR-MRC retrieves a set of similar
questions from a memory of observed cases and predicts an answer by selecting
the span in the target context that is most similar to the contextualized
representations of answers in the retrieved cases. The semi-parametric nature
of our approach allows CBR-MRC to attribute a prediction to the specific set of
cases used during inference, making it a desirable choice for building reliable
and debuggable QA systems. We show that CBR-MRC achieves high test accuracy
comparable with large reader models, outperforming baselines by 11.5 and 8.4 EM
on NaturalQuestions and NewsQA, respectively. Further, we demonstrate the
ability of CBR-MRC to identify not just the correct answer tokens but also
the span with the most relevant supporting evidence. Lastly, we observe that
contexts for certain question types show higher lexical diversity than others
and find CBR-MRC to be robust to these variations while performance using
fully-parametric methods drops.
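The retrieve-then-select procedure described in this abstract can be sketched with toy vectors. This is a minimal sketch under stated assumptions, not the CBR-MRC implementation: the precomputed vectors stand in for contextualized representations, cosine similarity and mean-similarity scoring are assumed choices, and the names `retrieve` and `select_span` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Case = Tuple[Vector, Vector, str]  # (question vector, answer vector, case id)


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: Vector, cases: List[Case], k: int) -> List[Case]:
    """Return the k stored cases whose questions are most similar to the query."""
    return sorted(cases, key=lambda c: cosine(query, c[0]), reverse=True)[:k]


def select_span(candidates: List[Tuple[str, Vector]], retrieved: List[Case]) -> str:
    """Pick the candidate span closest, on average, to the retrieved answers."""
    def score(candidate: Tuple[str, Vector]) -> float:
        return sum(cosine(candidate[1], case[1]) for case in retrieved) / len(retrieved)
    return max(candidates, key=score)[0]


# Toy memory of observed cases (all vectors are made up for illustration).
cases = [
    ([1.0, 0.0], [1.0, 0.0], "a"),
    ([0.9, 0.1], [0.8, 0.2], "b"),
    ([0.0, 1.0], [0.0, 1.0], "c"),
]
retrieved = retrieve([1.0, 0.05], cases, k=2)
candidates = [("Paris", [1.0, 0.1]), ("1889", [0.0, 1.0])]
answer = select_span(candidates, retrieved)
print(answer, [case_id for _, _, case_id in retrieved])
```

Because the retrieved case ids are returned alongside the prediction, each answer can be traced back to the specific cases that produced it, which is the attribution property the abstract highlights.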
Content Recognition and Context Modeling for Document Analysis and Retrieval
The nature and scope of available documents are changing significantly in many areas of document analysis and retrieval as complex, heterogeneous collections become accessible to virtually everyone via the web. The increasing level of diversity presents a great challenge for document image content categorization, indexing, and retrieval. Meanwhile, the processing of documents with unconstrained layouts and complex formatting often requires effective leveraging of broad contextual knowledge.
In this dissertation, we first present a novel approach for document image content categorization, using a lexicon of shape features. Each lexical word corresponds to a scale and rotation invariant local shape feature that is generic enough to be detected repeatably and is segmentation free. A concise, structurally indexed shape lexicon is learned by clustering and partitioning feature types through graph cuts. Our idea finds successful application in several challenging tasks, including content recognition of diverse web images and language identification on documents composed of mixed machine printed text and handwriting.
Second, we address two fundamental problems in signature-based document image retrieval. Facing continually increasing volumes of documents, detecting and recognizing unique, evidentiary visual entities (e.g., signatures and logos) provides a practical and reliable supplement to the OCR recognition of printed text. We propose a novel multi-scale framework to detect and segment signatures jointly from document images, based on the structural saliency under a signature production model. We formulate the problem of signature retrieval in the unconstrained setting of geometry-invariant deformable shape matching and demonstrate state-of-the-art performance in signature matching and verification.
Third, we present a model-based approach for extracting relevant named entities from unstructured documents. In a wide range of applications that require structured information from diverse, unstructured document images, processing OCR text does not give satisfactory results due to the absence of linguistic context. Our approach enables learning of inference rules collectively based on contextual information from both page layout and text features.
Finally, we demonstrate the importance of mining general web user behavior data for improving document ranking and other aspects of the web search experience. The context of web user activities reveals their preferences and intents, and we emphasize the analysis of individual user sessions for creating aggregate models. We introduce a novel algorithm for estimating web page and web site importance, and discuss its theoretical foundation based on an intentional surfer model. We demonstrate that our approach significantly improves large-scale document retrieval performance.
Revisiting Out-of-distribution Robustness in NLP: Benchmark, Analysis, and LLMs Evaluations
This paper reexamines the research on out-of-distribution (OOD) robustness in
the field of NLP. We find that the distribution shift settings in previous
studies commonly lack adequate challenges, hindering the accurate evaluation of
OOD robustness. To address these issues, we propose a benchmark construction
protocol that ensures clear differentiation and challenging distribution
shifts. Then we introduce BOSS, a Benchmark suite for Out-of-distribution
robustneSS evaluation covering 5 tasks and 20 datasets. Based on BOSS, we
conduct a series of experiments on pre-trained language models for analysis and
evaluation of OOD robustness. First, for vanilla fine-tuning, we examine the
relationship between in-distribution (ID) and OOD performance. We identify
three typical types that unveil the inner learning mechanism, which could
potentially facilitate the forecasting of OOD robustness, correlating with the
advancements on ID datasets. Then, we evaluate 5 classic methods on BOSS and
find that, despite exhibiting some effectiveness in specific cases, they do not
offer significant improvement compared to vanilla fine-tuning. Further, we
evaluate 5 LLMs with various adaptation paradigms and find that when sufficient
ID data is available, fine-tuning domain-specific models outperforms LLMs on ID
examples significantly. However, in the case of OOD instances, prioritizing
LLMs with in-context learning yields better results. We identify that both
fine-tuned small models and LLMs face challenges in effectively addressing
downstream tasks. The code is public at
https://github.com/lifan-yuan/OOD_NLP.
Comment: Accepted to NeurIPS 2023 Dataset and Benchmark Track.