Does the Geometry of Word Embeddings Help Document Classification? A Case Study on Persistent Homology Based Representations
We investigate the pertinence of methods from algebraic topology for text
data analysis. These methods enable the development of
mathematically principled, isometry-invariant mappings from a set of vectors to
a document embedding, which is stable with respect to the geometry of the
document in the selected metric space. In this work, we evaluate the utility of
these topology-based document representations in traditional NLP tasks,
specifically document clustering and sentiment classification. We find that the
embeddings do not benefit text analysis. In fact, performance is worse than
simple baseline techniques, indicating that the geometry of the document does
not provide enough variability for classification on the basis of topic or
sentiment in the chosen datasets.
Comment: 5 pages, 3 figures. Rep4NLP workshop at ACL 201
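As a rough illustration of the kind of representation studied here (not the authors' exact pipeline), the sketch below treats a document as a point cloud of word vectors and uses 0-dimensional persistent homology, whose component lifetimes coincide with the minimum-spanning-tree edge weights of the pairwise distances, as a fixed-length document feature. The function name and the choice of keeping the top-k lifetimes are illustrative assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def h0_persistence_features(word_vectors: np.ndarray, k: int = 16) -> np.ndarray:
    """Fixed-length topological feature vector for one document.

    word_vectors: (n_words, dim) array of word embeddings (assumed input).
    Returns the k largest 0-dimensional persistence values (connected-component
    lifetimes of the Vietoris-Rips filtration), zero-padded to length k.
    Assumes no duplicate word vectors, so all pairwise distances are positive.
    """
    # Pairwise Euclidean distances between the word vectors.
    dists = squareform(pdist(word_vectors))
    # For H0, component merge times are exactly the MST edge weights.
    mst = minimum_spanning_tree(dists)
    lifetimes = np.sort(mst.data)[::-1]          # longest-lived features first
    feats = np.zeros(k)
    feats[:min(k, len(lifetimes))] = lifetimes[:k]
    return feats

# Toy "document" of 30 random 50-dimensional word vectors.
doc = np.random.default_rng(0).normal(size=(30, 50))
print(h0_persistence_features(doc))
```

Such vectors could then be fed to an ordinary classifier, which is the style of comparison against simple baselines described above.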
Understanding How to Inform Blind and Low-Vision Users about Data Privacy through Privacy Question Answering Assistants
Understanding and managing data privacy in the digital world can be
challenging for sighted users, let alone blind and low-vision (BLV) users.
There is limited research on how BLV users, who have special accessibility
needs, navigate data privacy, and how potential privacy tools could assist
them. We conducted an in-depth qualitative study with 21 US BLV participants to
understand their data privacy risk perception and mitigation, as well as their
information behaviors related to data privacy. We also explored BLV users'
attitudes towards potential privacy question answering (Q&A) assistants that
enable them to better navigate data privacy information. We found that BLV
users face heightened security and privacy risks, but their risk mitigation is
often insufficient. They do not necessarily seek data privacy information but
clearly recognize the benefits of a potential privacy Q&A assistant. They also
expect privacy Q&A assistants to possess cross-platform compatibility, support
multi-modality, and demonstrate robust functionality. Our study sheds light on
BLV users' expectations when it comes to usability, accessibility, trust and
equity issues regarding digital data privacy.
Comment: This research paper is accepted by USENIX Security '2
Lumos: Learning Agents with Unified Data, Modular Design, and Open-Source LLMs
We introduce Lumos, a novel framework for training language agents that
employs a unified data format and a modular architecture based on open-source
large language models (LLMs). Lumos consists of three distinct modules:
planning, grounding, and execution. The planning module breaks down a task into
a series of high-level, tool-agnostic subgoals, which are then made specific by
the grounding module through a set of low-level actions. These actions are
subsequently executed by the execution module, utilizing a range of
off-the-shelf tools and APIs. In order to train these modules effectively,
high-quality annotations of subgoals and actions were collected and are made
available for fine-tuning open-source LLMs for various tasks such as complex
question answering, web tasks, and math problems. Leveraging this unified data
and modular design, Lumos not only achieves comparable or superior performance
to current, state-of-the-art agents, but also exhibits several key advantages:
(1) Lumos surpasses GPT-4/3.5-based agents in complex question answering and
web tasks, while equalling the performance of significantly larger LLM agents
on math tasks; (2) Lumos outperforms open-source agents created through
conventional training methods and those using chain-of-thoughts training; and
(3) Lumos is capable of effectively generalizing to unseen interactive tasks,
outperforming larger LLM-based agents and even exceeding performance of
specialized agents.
Comment: Project website: https://allenai.github.io/lumos
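The three-module design can be read as a simple pipeline. The skeleton below is an illustrative sketch of that structure, not the released Lumos code: `call_llm`, the prompt strings, the `tool: argument` action format, and the `TOOLS` registry are all hypothetical stand-ins.

```python
from typing import Callable, Dict, List

def call_llm(prompt: str) -> str:
    """Stand-in for a fine-tuned open-source LLM; plug in a real model here."""
    raise NotImplementedError

# Registry of off-the-shelf tools and APIs, keyed by tool name (assumed).
TOOLS: Dict[str, Callable[[str], str]] = {}

def plan(task: str) -> List[str]:
    """Planning module: break the task into high-level, tool-agnostic subgoals."""
    out = call_llm(f"Task: {task}\nList the subgoals, one per line:")
    return [line.strip() for line in out.splitlines() if line.strip()]

def ground(subgoal: str) -> List[dict]:
    """Grounding module: turn a subgoal into concrete low-level actions."""
    out = call_llm(f"Subgoal: {subgoal}\nEmit actions as 'tool_name: argument' lines:")
    actions = []
    for line in out.splitlines():
        if ":" in line:
            tool, arg = line.split(":", 1)
            actions.append({"tool": tool.strip(), "arg": arg.strip()})
    return actions

def execute(actions: List[dict]) -> List[str]:
    """Execution module: run each action with a registered tool or API."""
    return [TOOLS[a["tool"]](a["arg"]) for a in actions if a["tool"] in TOOLS]

def run_agent(task: str) -> List[str]:
    """Chain the three modules: plan -> ground each subgoal -> execute actions."""
    results = []
    for subgoal in plan(task):
        results.extend(execute(ground(subgoal)))
    return results
```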
Measuring Causal Effects of Data Statistics on Language Model's 'Factual' Predictions
Large amounts of training data are one of the major reasons for the high
performance of state-of-the-art NLP models. But what exactly in the training
data causes a model to make a certain prediction? We seek to answer this
question by providing a language for describing how training data influences
predictions, through a causal framework. Importantly, our framework bypasses
the need to retrain expensive models and allows us to estimate causal effects
based on observational data alone. Addressing the problem of extracting factual
knowledge from pretrained language models (PLMs), we focus on simple data
statistics such as co-occurrence counts and show that these statistics do
influence the predictions of PLMs, suggesting that such models rely on shallow
heuristics. Our causal framework and our results demonstrate the importance of
studying datasets and the benefits of causality for understanding NLP models.
Comment: We received a criticism regarding the validity of the causal
formulation in this paper. We will address it in an upcoming version.
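To make the observational-data idea concrete, the sketch below applies a generic backdoor adjustment, stratifying on a possible confounder and averaging per-stratum differences in accuracy, to simulated records. It illustrates the general style of analysis rather than the paper's actual framework or estimator; all column names and the toy data-generating process are assumptions.

```python
import numpy as np
import pandas as pd

# Toy observational records: one row per fact probed in a pretrained LM.
rng = np.random.default_rng(0)
n = 5000
subj_freq = rng.poisson(50, n)                      # possible confounder
cooc = rng.poisson(1 + 0.1 * subj_freq)             # "treatment": co-occurrence count
p_correct = 1 / (1 + np.exp(-(0.05 * cooc + 0.01 * subj_freq - 2)))
correct = rng.binomial(1, p_correct)                # "outcome": LM predicted the fact

df = pd.DataFrame({
    "cooc_high": (cooc > np.median(cooc)).astype(int),
    "freq_bin": pd.qcut(subj_freq, 5, labels=False, duplicates="drop"),
    "correct": correct,
})

def adjusted_effect(data: pd.DataFrame) -> float:
    """Backdoor adjustment: stratify on the confounder bin, then take the
    size-weighted average of per-stratum accuracy differences between
    high- and low-co-occurrence facts."""
    effects, weights = [], []
    for _, g in data.groupby("freq_bin"):
        treated = g[g.cooc_high == 1]["correct"].mean()
        control = g[g.cooc_high == 0]["correct"].mean()
        if np.isnan(treated) or np.isnan(control):
            continue
        effects.append(treated - control)
        weights.append(len(g))
    return float(np.average(effects, weights=weights))

print(f"Estimated effect of high co-occurrence on accuracy: {adjusted_effect(df):.3f}")
```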