18 research outputs found
Ontology-Aware Token Embeddings for Prepositional Phrase Attachment
Type-level word embeddings use the same set of parameters to represent all
instances of a word regardless of its context, ignoring the inherent lexical
ambiguity in language. Instead, we embed semantic concepts (or synsets) as
defined in WordNet and represent a word token in a particular context by
estimating a distribution over relevant semantic concepts. We use the new,
context-sensitive embeddings in a model for predicting prepositional phrase(PP)
attachments and jointly learn the concept embeddings and model parameters. We
show that using context-sensitive embeddings improves the accuracy of the PP
attachment model by 5.4% absolute points, which amounts to a 34.4% relative
reduction in errors.Comment: ACL 201
Recommended from our members
CODACT: Towards Identifying Orthographic Variants in Dialectal Arabic
Dialectal Arabic (DA) is the spoken vernacular for over 300M people worldwide. DA is emerging as the form of Arabic written in online communication: chats, emails, blogs, etc. However, most existing NLP tools for Arabic are designed for processing Modern Standard Arabic, a variety that is more formal and scripted. Apart from the genre variation that is a hindrance for any language processing, even in English, DA has no orthographic standard, compared to MSA that has a standard orthography and script. Accordingly, a word may be written in many possible inconsistent spellings rendering the processing of DA very challenging. To solve this problem, such inconsistencies have to be normalized. This work is the ļ¬rst step towards addressing this problem, as we attempt to identify spelling variants in a given textual document. We present an unsupervised clustering approach that addresses the problem of identifying orthographic variants in DA. We employ different similarity measures that exploit string similarity and contextual semantic similarity. To our knowledge this is the ļ¬rst attempt at solving the problem for DA. Our approaches are tested on data in two dialects of Arabic - Egyptian and Levantine. Our system achieves the highest Entropy of 0.19 for Egyptian (corresponding to 68% cluster precision) and Levantine (corresponding to 64% cluster precision) respectively. This constitutes a signiļ¬cant reduction in entropy (from 0.47 for Egyptian and 0.51 for Levantine) and improvement in cluster precision (from 29% for both) from the baseline
Genre Independent Subgroup Detection in Online Discussion Threads: A Pilot Study of Implicit Attitude using Latent Textual Semantics
We describe an unsupervised approach to the problem of automatically detecting subgroups of people holding similar opinions in a discussion thread. An intuitive way of identifying this is to detect the attitudes of discussants towards each other or named entities or topics mentioned in the discussion. Sentiment tags play an important role in this detection, but we also note another dimension to the detection of peopleās attitudes in a discussion: if two persons share the same opinion, they tend to use similar language content. We consider the latter to be an implicit attitude. In this paper, we investigate the impact of implicit and explicit attitude in two genres of social media discussion data, more formal wikipedia discussions and a debate discussion forum that is much more informal. experimental results strongly suggest that implicit attitude is an important complement for explicit attitudes (expressed via sentiment) and it can improve the sub-group detection performance independent of genre
TRAM: Bridging Trust Regions and Sharpness Aware Minimization
Sharpness-aware minimization (SAM) reports improving domain generalization by
reducing the loss surface curvature in the parameter space. However,
generalization during fine-tuning is often more dependent on the
transferability of representations in the function space. Trust-region methods
(TR) target this goal by regularizing representation curvature to reduce
catastrophic forgetting of pre-trained task-agnostic information while adopting
task-specific skills. We consider unifying these strategies for low curvature
in both parameter space and function space to improve out-of-domain (OOD)
generalization. We propose Trust Region Aware Minimization (TRAM), a SAM
algorithm fine-tuning for low parameter sharpness and smooth, informative
representations preserving pre-trained structure. TRAM uses a trust region
bound to inform the SAM adversarial neighborhood, introducing an awareness of
function curvature within optimization for flatter minima. We empirically
validate TRAM in vision (cross-dataset adaptation) and text (OOD language
modeling, zero-shot cross-lingual transfer) tasks where robust domain transfer
and representation generality are critical. TRAM outperforms SAM- and TR-based
optimization across all tasks, notably surpassing competing methods for hard
transfer between anticorrelated domains. TRAM establishes a novel standard in
fine-tuning for domain-generalizable models with minimal additional computation
over previous sharpness-aware methods.Comment: Camera Ready for ICLR 2024 (Accepted as Spotlight). 21 pages, 14
tables, 2 figure
Subgroup Detection in Ideological Discussions
The rapid and continuous growth of social networking sites has led to the emergence of many communities of communicating groups. Many of these groups discuss ideological and political topics. It is not uncommon that the participants in such discussions split into two or more subgroups. The members of each subgroup share the same opinion toward the discussion topic and are more likely to agree with members of the same subgroup and disagree with members from opposing subgroups. In this paper, we propose an unsupervised approach for automatically detecting discussant subgroups in online communities. We analyze the text exchanged between the participants of a discussion to identify the attitude they carry toward each other and towards the various aspects of the discussion topic. We use attitude predictions to construct an attitude vector for each discussant. We use clustering techniques to cluster these vectors and, hence, determine the subgroup membership of each participant. We compare our methods to text clustering and other baselines, and show that our method achieves promising results
Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2
Since the release of T\"ULU [Wang et al., 2023b], open resources for
instruction tuning have developed quickly, from better base models to new
finetuning techniques. We test and incorporate a number of these advances into
T\"ULU, resulting in T\"ULU 2, a suite of improved T\"ULU models for advancing
the understanding and best practices of adapting pretrained language models to
downstream tasks and user preferences. Concretely, we release: (1)
T\"ULU-V2-mix, an improved collection of high-quality instruction datasets; (2)
T\"ULU 2, LLAMA-2 models finetuned on the V2 mixture; (3) T\"ULU 2+DPO, T\"ULU
2 models trained with direct preference optimization (DPO), including the
largest DPO-trained model to date (T\"ULU 2+DPO 70B); (4) CODE T\"ULU 2, CODE
LLAMA models finetuned on our V2 mix that outperform CODE LLAMA and its
instruction-tuned variant, CODE LLAMA-Instruct. Our evaluation from multiple
perspectives shows that the T\"ULU 2 suite achieves state-of-the-art
performance among open models and matches or exceeds the performance of
GPT-3.5-turbo-0301 on several benchmarks. We release all the checkpoints, data,
training and evaluation code to facilitate future open efforts on adapting
large language models.Comment: technical report; fixed zephyr number
BMKEG/sciDT: Release for ISWC2017
This release provides code that supports the submission to ISWC 2017 entitled: "Resources for Distant Supervision of Structured Biomedical Data Based on Text Surrounding Figure References".
This system is intended to be used in conjunction with the following libraries:
(https://github.com/BMKEG/sciDT-pipeline) - System for running SciDT discourse tagger.
(https://github.com/BMKEG/cosid_linkage) - System to use discourse tagging to extract figure based on heuristic rule