Elimination of Spurious Ambiguity in Transition-Based Dependency Parsing
We present a novel technique to remove spurious ambiguity from transition
systems for dependency parsing. Our technique chooses a canonical sequence of
transition operations (computation) for a given dependency tree. Our technique
can be applied to a large class of bottom-up transition systems, including,
for instance, those of Nivre (2004) and Attardi (2006).
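To make the construction concrete, here is a minimal sketch, under my own assumptions, of how a static oracle for an arc-standard system in the style of Nivre (2004) yields one canonical computation per projective tree: arc transitions are preferred over SHIFT whenever they are legal, so every configuration admits exactly one choice. The function below is illustrative, not the paper's algorithm.

# Illustrative static oracle: emit one canonical transition sequence for a
# projective dependency tree under an arc-standard system. Preferring
# LEFT-ARC/RIGHT-ARC over SHIFT whenever legal removes the freedom that
# creates spurious ambiguity.
def canonical_sequence(heads):
    """heads maps token index (1..n) to its head index (0 = artificial root)."""
    remaining = {t: 0 for t in range(len(heads) + 1)}  # unattached dependents
    for h in heads.values():
        remaining[h] += 1
    stack, buffer, seq = [0], sorted(heads), []
    while buffer or len(stack) > 1:
        if len(stack) >= 2:
            s1, s0 = stack[-2], stack[-1]
            if s1 != 0 and heads[s1] == s0 and remaining[s1] == 0:
                seq.append(("LEFT-ARC", s0, s1))   # arc s0 -> s1, pop s1
                stack.pop(-2)
                remaining[s0] -= 1
                continue
            if heads[s0] == s1 and remaining[s0] == 0:
                seq.append(("RIGHT-ARC", s1, s0))  # arc s1 -> s0, pop s0
                stack.pop()
                remaining[s1] -= 1
                continue
        assert buffer, "input tree must be projective"
        seq.append(("SHIFT", buffer[0]))
        stack.append(buffer.pop(0))
    return seq

# "Economic news had little effect": news and effect attach to had, etc.
print(canonical_sequence({1: 2, 2: 3, 3: 0, 4: 5, 5: 3}))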
Conversation Trees: A Grammar Model for Topic Structure in Forums
Online forum discussions proceed differently from face-to-face conversations, and any single thread on an online forum contains posts on different subtopics. This work aims to characterize the content of a forum thread as a conversation tree of topics. We present models that jointly perform two tasks: segment a thread into sub-parts, and assign a topic to each part. Our core idea is a definition of topic structure using probabilistic grammars. By leveraging the flexibility of two grammar formalisms, Context-Free Grammars and Linear Context-Free Rewriting Systems, our models create desirable structures for forum threads: our topic segmentation is hierarchical, links non-adjacent segments on the same topic, and jointly labels the topic during segmentation. We show that our models outperform a number of tree generation baselines.
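As a toy illustration of the core idea (my own sketch, not the paper's grammar), a probabilistic context-free grammar can define what a well-formed topic structure looks like, so that segmenting a thread reduces to parsing its sequence of posts:

# Toy PCFG: a thread is a sequence of topics, each covering a contiguous run
# of posts; Viterbi parsing of the post sequence yields a hierarchical topic
# segmentation. LCFRS, which the paper also uses, additionally lets one topic
# cover non-adjacent segments, which a plain CFG cannot express.
from nltk import PCFG
from nltk.parse import ViterbiParser

grammar = PCFG.fromstring("""
Thread -> Topic Thread [0.4] | Topic [0.6]
Topic -> Post Topic [0.5] | Post [0.5]
Post -> 'post' [1.0]
""")

for tree in ViterbiParser(grammar).parse(["post"] * 4):
    tree.pretty_print()  # each Topic subtree is one segment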
Obfuscation for Privacy-preserving Syntactic Parsing
The goal of homomorphic encryption is to encrypt data such that another party
can operate on it without being explicitly exposed to the content of the
original data. We introduce an idea for a privacy-preserving transformation on
natural language data, inspired by homomorphic encryption. Our primary tool is
obfuscation, relying on the properties of natural language. Specifically,
a given English text is obfuscated using a neural model that aims to preserve
the syntactic relationships of the original sentence so that the obfuscated
sentence can be parsed instead of the original one. The model works at the word
level, and learns to obfuscate each word separately by changing it into a new
word that has a similar syntactic role. The text obfuscated by our model leads
to better performance on three syntactic parsers (two dependency parsers and
one constituency parser) in comparison to an upper-bound random substitution
baseline. More specifically, the results demonstrate that as more terms are
obfuscated (by their part of speech), the substitution upper bound
significantly degrades, while the neural model maintains relatively high
parser performance. All of this is done without much sacrifice of privacy
compared to the random substitution upper bound. We also further analyze the
results, and discover that the substituted words have similar syntactic
properties, but different semantic content, compared to the original words.
Comment: Accepted to IWPT 2020
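For reference, the random-substitution upper bound that the abstract compares against can be sketched in a few lines; this is my illustration of that baseline, not the paper's neural model, which learns the substitutions rather than sampling them:

# Sketch of the POS-preserving random-substitution baseline. Requires NLTK
# with the 'punkt' and 'averaged_perceptron_tagger' resources downloaded.
import random
from collections import defaultdict
import nltk

def build_pos_vocabulary(sentences):
    """Index a background corpus of sentences by Penn Treebank POS tag."""
    vocab = defaultdict(set)
    for sent in sentences:
        for word, tag in nltk.pos_tag(nltk.word_tokenize(sent)):
            vocab[tag].add(word.lower())
    return {tag: sorted(words) for tag, words in vocab.items()}

def obfuscate(sentence, vocab, tags_to_hide=("NN", "NNS", "NNP", "VB", "VBD")):
    """Replace each targeted word with a random word of the same POS."""
    out = []
    for word, tag in nltk.pos_tag(nltk.word_tokenize(sentence)):
        if tag in tags_to_hide and vocab.get(tag):
            out.append(random.choice(vocab[tag]))  # same syntactic slot
        else:
            out.append(word)
    return " ".join(out)

A parser run on the obfuscated sentence should still recover roughly the original dependency structure; the paper's contribution is a neural model that preserves that structure far better than this random choice does.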
Shared and distinct transcriptional programs underlie the hybrid nature of iNKT cells
Invariant natural killer T (iNKT) cells are innate-like T lymphocytes that act as critical regulators of the immune response. To better characterize this population, we profiled iNKT cell gene expression during ontogeny and in peripheral subsets as part of the Immunological Genome Project (ImmGen). High-resolution comparative transcriptional analyses defined developmental and subset-specific iNKT cell gene expression programs. In addition, iNKT cells were found to share an extensive transcriptional program with natural killer (NK) cells, similar in magnitude to that shared with major histocompatibility complex (MHC)-restricted T cells. Strikingly, the NK-iNKT program also operated constitutively in γδT cells and in adaptive T cells following activation. Together, our findings highlight a core effector program regulated distinctly in innate and adaptive lymphocytes.
DEPfold: RNA secondary structure prediction as dependency parsing
RNA secondary structure prediction is critical for understanding RNA function but remains challenging due to complex structural elements like pseudoknots and limited training data. We introduce DEPfold, a novel deep learning approach that re-frames RNA secondary structure prediction as a dependency parsing problem. DEPfold presents three key innovations: (1) a biologically motivated transformation of RNA structures into labeled dependency trees, (2) a biaffine attention mechanism for joint prediction of base pairings and their types, and (3) an optimal tree decoding algorithm that enforces valid RNA structural constraints. Unlike traditional energy-based methods, DEPfold learns directly from annotated data and leverages pretrained language models to predict RNA structure. We evaluate DEPfold on both within-family and cross-family RNA datasets, demonstrating significant performance improvements over existing methods. DEPfold shows strong performance in cross-family generalization when trained on data augmented by traditional energy-based models, outperforming existing methods on the bpRNA-new dataset. This demonstrates DEPfold’s ability to effectively learn structural information beyond what traditional methods capture. Our approach bridges natural language processing (NLP) with RNA biology, providing a computationally efficient and adaptable tool for advancing RNA structure prediction and analysis. Our code is available at https://github.com/Vicky-0256/DEPfold.git
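To give a flavor of innovation (1), one simple way to turn a secondary structure into dependency arcs is to let the opening base of each pair head its closing partner. The scheme below is my own illustration and not necessarily DEPfold's exact transformation, which is richer and fully labeled:

# Illustrative only: map a dot-bracket string to head -> dependent arcs by
# pairing each ')' with the most recent unmatched '('.
def dotbracket_to_arcs(structure):
    stack, arcs = [], []
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            opening = stack.pop()          # matching opening base
            arcs.append((opening, pos, "PAIR"))
    assert not stack, "unbalanced structure"
    return arcs

print(dotbracket_to_arcs("((...))"))  # [(2, 6, 'PAIR'), (1, 7, 'PAIR')]

Note that a single stack cannot match crossing pairs, so pseudoknots need a more expressive encoding; handling them is one motivation for the labeled dependency formulation.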
Can large language models follow concept annotation guidelines? A case study on scientific and financial domains
Although large language models (LLMs) exhibit remarkable capacity to leverage in-context demonstrations, it is still unclear to what extent they can learn new facts or concept definitions via prompts. To address this question, we examine the capacity of instruction-tuned LLMs to follow in-context concept annotation guidelines for zero-shot sentence labeling tasks. We design guidelines that present different types of factual and counterfactual concept definitions, which are used as prompts for zero-shot sentence classification tasks. Our results show that although concept definitions consistently improve task performance, only the larger models (with 70B parameters or more) have a limited ability to work under counterfactual contexts. Importantly, only proprietary models such as GPT-3.5 can recognize nonsensical guidelines, which we hypothesize is due to more sophisticated alignment methods. Finally, we find that FALCON-180B-CHAT is outperformed by LLAMA-2-70B-CHAT in most cases, which indicates that increasing model scale does not guarantee better adherence to guidelines. Altogether, our simple evaluation method reveals significant gaps in concept understanding between the most capable open-source language models and the leading proprietary APIs.
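The setup can be pictured with a small prompt-construction sketch; the guideline texts and label set below are my own hypothetical examples, not the paper's wording. The same zero-shot prompt is issued once with a factual definition and once with a counterfactual one, and adherence is measured by how the model's predictions track the definition in force:

# Hypothetical concept definitions and labels, for illustration only.
FACTUAL = "A METHOD sentence describes how an experiment was carried out."
COUNTERFACTUAL = "A METHOD sentence reports the numeric outcome of an experiment."

def build_prompt(definition, labels, sentence):
    """Assemble a zero-shot sentence-labeling prompt from a concept definition."""
    return (
        f"Annotation guideline: {definition}\n"
        f"Following the guideline above, label the sentence with one of: "
        f"{', '.join(labels)}.\n"
        f"Sentence: {sentence}\n"
        f"Label:"
    )

labels = ["METHOD", "RESULT", "BACKGROUND"]
for definition in (FACTUAL, COUNTERFACTUAL):
    print(build_prompt(definition, labels,
                       "We incubated the samples at 37 C for 12 hours."))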
