Probabilistic Models for Scalable Knowledge Graph Construction
In the past decade, systems that extract information from millions of Internet documents have become commonplace. Knowledge graphs -- structured knowledge bases that describe entities, their attributes and the relationships between them -- are a powerful tool for understanding and organizing this vast amount of information. However, a significant obstacle to knowledge graph construction is the unreliability of the extracted information, due to noise and ambiguity in the underlying data or errors made by the extraction system. A further challenge is the complexity of reasoning about the dependencies between these noisy extractions. My dissertation addresses these challenges by exploiting the interdependencies between facts to improve the quality of the knowledge graph in a scalable framework. I introduce a new approach called knowledge graph identification (KGI), which resolves the entities, attributes and relationships in the knowledge graph by incorporating uncertain extractions from multiple sources, entity co-references, and ontological constraints. I define a probability distribution over possible knowledge graphs and infer the most probable knowledge graph using a combination of probabilistic and logical reasoning. Such probabilistic models are frequently dismissed due to scalability concerns, but my implementation of KGI maintains tractable performance on large problems through the use of hinge-loss Markov random fields, which have a convex inference objective. This allows inference over large knowledge graphs with 4M facts and 20M ground constraints to complete in 2 hours. To further scale the solution, I develop a distributed approach to the KGI problem which runs in parallel across multiple machines, reducing inference time by 90%. Finally, I extend my model to the streaming setting, where a knowledge graph is continuously updated by incorporating newly extracted facts. I devise a general approach for approximately updating inference in convex probabilistic models, and quantify the approximation error by defining and bounding inference regret for online models. Together, my work retains the attractive features of probabilistic models while providing the scalability necessary for large-scale knowledge graph construction. These models have been applied to a number of real-world knowledge graph projects, including the NELL project at Carnegie Mellon and the Google Knowledge Graph.
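For reference, a hinge-loss Markov random field defines a distribution over continuous [0, 1] truth values of candidate facts. A standard statement of its form (notation mine, summarizing the HL-MRF literature rather than quoting the dissertation) is

\[
P(\mathbf{y} \mid \mathbf{x}) \;=\; \frac{1}{Z(\mathbf{x})} \exp\Big(-\sum_{r} \lambda_r \, \max\{\ell_r(\mathbf{x}, \mathbf{y}),\, 0\}^{p_r}\Big), \qquad p_r \in \{1, 2\},
\]

where each \(\ell_r(\mathbf{x}, \mathbf{y})\) is a linear function obtained by grounding a weighted rule (for example, an ontological constraint or a co-reference cue) and the \(\lambda_r \ge 0\) are rule weights. Because every potential is a convex hinge, finding the most probable knowledge graph (MAP inference) is a convex optimization problem, which is what makes inference at the reported scale tractable.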
Estimating Numbers without Regression
Despite recent successes in language models, their ability to represent
numbers is insufficient. Humans conceptualize numbers based on their
magnitudes, effectively projecting them on a number line; whereas subword
tokenization fails to explicitly capture magnitude by splitting numbers into
arbitrary chunks. To alleviate this shortcoming, alternative approaches have
been proposed that modify numbers at various stages of the language modeling
pipeline. These methods change (1) the notation in which numbers are written
(e.g., scientific vs. decimal), (2) the vocabulary used to represent numbers,
or (3) the entire architecture of the underlying language model so that it
directly regresses to a desired number.
Previous work suggests that architectural change helps achieve
state-of-the-art results on number estimation, but we find an insightful
ablation: changing the model's vocabulary instead (e.g., introducing a new
token for numbers in the range 10-100) is a far better trade-off. In the context of masked number
prediction, a carefully designed tokenization scheme is both the simplest to
implement and sufficient, i.e., with performance similar to the state-of-the-art
approach that requires making significant architectural changes. Finally, we
report similar trends on the downstream task of numerical fact estimation (for
Fermi Problems) and discuss reasons behind our findings.
Comment: Workshop on Insights from Negative Results in NLP at EACL 2023
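As an illustration of the vocabulary-level change described above, here is a minimal Python sketch of a magnitude-bin number tokenizer; the token names and the order-of-magnitude binning are my own assumptions for illustration, not the paper's exact scheme.

import math

def number_to_token(x: float) -> str:
    """Map a number to a coarse magnitude-bin token, e.g. 350 -> '[NUM_1e2]'.

    This collapses all numbers of the same order of magnitude onto a single
    vocabulary item, so the model sees magnitude explicitly instead of
    arbitrary subword chunks.
    """
    if x == 0:
        return "[NUM_0]"
    exponent = math.floor(math.log10(abs(x)))
    sign = "-" if x < 0 else ""
    return f"[NUM_{sign}1e{exponent}]"

def tokenize_with_number_bins(tokens):
    """Replace numeric tokens in a whitespace-tokenized sentence with bin tokens."""
    out = []
    for tok in tokens:
        try:
            out.append(number_to_token(float(tok)))
        except ValueError:
            out.append(tok)
    return out

# Example: "the bridge is 2400 meters long"
print(tokenize_with_number_bins("the bridge is 2400 meters long".split()))
# -> ['the', 'bridge', 'is', '[NUM_1e3]', 'meters', 'long']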
Learn Your Tokens: Word-Pooled Tokenization for Language Modeling
Language models typically tokenize text into subwords, using a deterministic,
hand-engineered heuristic of combining characters into longer surface-level
strings such as 'ing' or whole words. Recent literature has repeatedly shown
the limitations of such a tokenization strategy, particularly for documents not
written in English and for representing numbers. On the other extreme,
byte/character-level language models are much less restricted but suffer from
increased sequence description lengths and a subsequent quadratic expansion in
self-attention computation. Recent attempts to compress and limit these context
lengths with fixed-size convolutions are helpful but completely ignore the word
boundary. This paper considers an alternative 'learn your tokens' scheme which
utilizes the word boundary to pool bytes/characters into word representations,
which are fed to the primary language model, before again decoding individual
characters/bytes per word in parallel. We find that our moderately expressive
and moderately fast end-to-end tokenizer outperforms both subword and
byte/character models by over 300% on the intrinsic language modeling metric
of next-word prediction across datasets. It particularly shines on rare words,
outperforming them by a factor of 30! We extensively study the language
modeling setup for all three categories of tokenizers and theoretically
analyze how our end-to-end models can also offer a strong trade-off between
efficiency and robustness.
Comment: Accepted to EMNLP 2023 Findings
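A rough sketch of the word-pooling idea, under my own simplifying assumptions (mean-pooling of character embeddings within whitespace-delimited words; the paper's actual pooling module may differ):

import torch
import torch.nn as nn

class WordPooledEncoder(nn.Module):
    """Pool character embeddings into one vector per word (illustrative sketch)."""

    def __init__(self, n_chars: int = 256, d_model: int = 512):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, d_model)

    def forward(self, text: str) -> torch.Tensor:
        word_vecs = []
        for word in text.split():
            # Embed each character, then pool within the word boundary.
            char_ids = torch.tensor([ord(c) % 256 for c in word])
            word_vecs.append(self.char_emb(char_ids).mean(dim=0))
        # One vector per word; this much shorter sequence feeds the primary LM,
        # which later decodes characters/bytes per word in parallel.
        return torch.stack(word_vecs)

enc = WordPooledEncoder()
print(enc("language models learn tokens").shape)  # torch.Size([4, 512])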
Faithful Persona-based Conversational Dataset Generation with Large Language Models
High-quality conversational datasets are essential for developing AI models
that can communicate with users. One way to foster deeper interactions between
a chatbot and its user is through personas, aspects of the user's character
that provide insights into their personality, motivations, and behaviors.
Training Natural Language Processing (NLP) models on a diverse and
comprehensive persona-based dataset can lead to conversational models that
create a deeper connection with the user and maintain their engagement. In
this paper, we leverage the power of Large Language Models (LLMs) to create a
large, high-quality conversational dataset from a seed dataset. We propose a
Generator-Critic architecture framework to expand the initial dataset, while
improving the quality of its conversations. The Generator is an LLM prompted to
output conversations. The Critic consists of a mixture of expert LLMs that
control the quality of the generated conversations. These experts select the
best generated conversations, which we then use to improve the Generator. We
release Synthetic-Persona-Chat, consisting of 20k conversations seeded from
Persona-Chat. We evaluate the quality of Synthetic-Persona-Chat and our
generation framework on different dimensions through extensive experiments, and
observe that the losing rate of Synthetic-Persona-Chat against Persona-Chat
during the Turing test decreases from 17.2% to 8.8% over three iterations.
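A minimal sketch of the Generator-Critic loop described above; the function names, scoring policy, and top-k selection are assumptions for illustration, not the paper's exact implementation.

def generator_critic_iteration(generate, critics, seed_conversations, k=100):
    """One expansion iteration of a Generator-Critic loop.

    `generate(examples)` is assumed to prompt an LLM with few-shot examples and
    return new persona-grounded conversations; each critic in `critics` is a
    callable returning a quality score for a conversation.
    """
    candidates = generate(seed_conversations)
    # Mixture of expert critics: average their quality scores per conversation.
    scored = [
        (sum(critic(conv) for critic in critics) / len(critics), conv)
        for conv in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best = [conv for _, conv in scored[:k]]
    # The selected conversations are added to the pool used to prompt the
    # Generator next round, so dataset quality improves over iterations.
    return seed_conversations + best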
Making Large Language Models Better Data Creators
Although large language models (LLMs) have advanced the state-of-the-art in
NLP significantly, deploying them for downstream applications is still
challenging due to cost, responsiveness, control, or concerns around privacy
and security. As such, trainable models are still the preferred option in some
cases. However, these models still require human-labeled data for optimal
performance, which is expensive and time-consuming to obtain. To address this
issue, several techniques have been proposed that reduce human effort by
labeling or generating data with LLMs. Although these methods are effective
for certain applications, in practice they encounter difficulties in
real-world scenarios.
Labeling data requires careful data selection, while generating data
necessitates task-specific prompt engineering. In this paper, we propose a
unified data creation pipeline that requires only a single formatting example,
and which is applicable to a broad range of tasks, including traditionally
problematic ones with semantically devoid label spaces. In our experiments we
demonstrate that instruction-following LLMs are highly cost-effective data
creators, and that models trained with these data exhibit performance better
than those trained with human-labeled data (by up to 17.5%) on
out-of-distribution evaluation, while maintaining comparable performance on
in-distribution tasks. These results have important implications for the
robustness of NLP systems deployed in the real world.
Comment: Accepted to EMNLP 2023 main conference. 12 pages, 5 figures, 6
tables. Code is available at https://github.com/microsoft/llm-data-creatio
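A minimal sketch of the single-formatting-example idea described above; the prompt wording and helper names are my assumptions rather than the released pipeline (see the repository linked in the comment for the actual code).

import json

def build_creation_prompt(format_example: dict, n: int = 5) -> str:
    """Ask an instruction-following LLM to emit new labeled examples that
    mirror the JSON structure of a single formatting example, with no
    task-specific prompt engineering beyond showing that structure.
    """
    return (
        "You are creating training data. Produce "
        f"{n} new examples, one JSON object per line, using exactly the same "
        "fields and label space as this example:\n"
        f"{json.dumps(format_example)}\n"
    )

def parse_created_examples(llm_output: str) -> list:
    """Parse the LLM response back into records, skipping malformed lines."""
    records = []
    for line in llm_output.splitlines():
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return records

# Usage: prompt = build_creation_prompt({"question": "...", "answer": "yes"})
# Send `prompt` to an instruction-following LLM, then parse its reply with
# parse_created_examples() to obtain new training records.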