3,660 research outputs found
Learning Language from a Large (Unannotated) Corpus
A novel approach to the fully automated, unsupervised extraction of
dependency grammars and associated syntax-to-semantic-relationship mappings
from large text corpora is described. The suggested approach builds on the
authors' prior work with the Link Grammar, RelEx and OpenCog systems, as well
as on a number of prior papers and approaches from the statistical language
learning literature. If successful, this approach would enable the mining of
all the information needed to power a natural language comprehension and
generation system, directly from a large, unannotated corpus.
Comment: 29 pages, 5 figures, research proposal
A Clustering Framework for Unsupervised and Semi-supervised New Intent Discovery
New intent discovery is of great value to natural language processing,
allowing for a better understanding of user needs and providing friendly
services. However, most existing methods struggle to capture the complicated
semantics of discrete text representations when little or no prior knowledge
from labeled data is available. To tackle this problem, we propose a novel
clustering framework, USNID, for unsupervised and semi-supervised new intent
discovery, built around three key techniques. First, it fully utilizes
unsupervised or semi-supervised data to mine shallow semantic similarity
relations and provide well-initialized representations for clustering. Second,
it designs a centroid-guided clustering mechanism to address the issue of
cluster allocation inconsistency and provide high-quality self-supervised
targets for representation learning. Third, it captures high-level semantics in
unsupervised or semi-supervised data to discover fine-grained intent-wise
clusters by optimizing both cluster-level and instance-level objectives. We
also propose an effective method for estimating the cluster number in
open-world scenarios without knowing the number of new intents beforehand.
USNID performs exceptionally well on several benchmark intent datasets,
achieving new state-of-the-art results in unsupervised and semi-supervised new
intent discovery and demonstrating robust performance with different cluster
numbers.
Comment: Accepted by IEEE TKDE
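The open-world cluster-number estimation mentioned in the USNID abstract can be illustrated with a simplified sketch: over-partition the data with a deliberately large K, then count only the clusters that retain a non-trivial share of the points. This is a pure-Python toy (1-D data, deterministic initialization, and a made-up 20% size threshold), not the paper's actual procedure:

```python
# A simplified, pure-Python sketch of open-world cluster-number estimation:
# over-partition with a deliberately large K, then drop clusters holding
# less than a threshold share of the data and count the survivors. The 1-D
# data, deterministic init, and 20% threshold are made-up for illustration.

def kmeans_1d(points, k, iters=20):
    # Deterministic init: evenly spaced points from the sorted data.
    centroids = sorted(points)[::max(1, len(points) // k)][:k]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return clusters

# Three well-separated groups, but we deliberately over-estimate K = 6.
data = [1.0, 1.1, 0.9, 1.05, 5.0, 5.1, 4.9, 5.05, 9.0, 9.1, 8.9, 9.05]
clusters = kmeans_1d(data, k=6)
min_size = 0.2 * len(data)  # keep clusters holding at least 20% of the data
estimated_k = sum(len(c) >= min_size for c in clusters)
print(estimated_k)  # 3
```

Tiny spurious clusters are filtered out, so the surviving count recovers the true number of groups without it being specified in advance.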
Generalized Category Discovery with Decoupled Prototypical Network
Generalized Category Discovery (GCD) aims to recognize both known and novel
categories from a set of unlabeled data, based on another dataset labeled with
only known categories. Without considering differences between known and novel
categories, current methods learn about them in a coupled manner, which can
hurt the model's generalization and discriminative ability. Furthermore, the
coupled training approach prevents these models from explicitly transferring
category-specific knowledge from labeled data to unlabeled data, which can lose
high-level semantic information and impair model performance. To mitigate the above
limitations, we present a novel model called Decoupled Prototypical Network
(DPN). By formulating a bipartite matching problem for category prototypes, DPN
can not only decouple known and novel categories to achieve different training
targets effectively, but also align known categories in labeled and unlabeled
data to transfer category-specific knowledge explicitly and capture high-level
semantics. Furthermore, DPN can learn more discriminative features for both
known and novel categories through our proposed Semantic-aware Prototypical
Learning (SPL). Besides capturing meaningful semantic information, SPL can also
alleviate the noise of hard pseudo labels through semantic-weighted soft
assignment. Extensive experiments show that DPN outperforms state-of-the-art
models by a large margin on all evaluation metrics across multiple benchmark
datasets. Code and data are available at https://github.com/Lackel/DPN.
Comment: Accepted by AAAI 2023
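The bipartite matching step behind DPN can be illustrated with a small sketch: known-category prototypes from labeled data are aligned one-to-one with prototypes from unlabeled data, and prototypes left unmatched become candidates for novel categories. A brute-force search over permutations stands in for a proper assignment solver, and the 2-D prototype values are made-up:

```python
# Hedged sketch of prototype alignment via bipartite matching: each known
# category's labeled prototype is paired with the closest unlabeled
# prototype under a minimum-total-cost assignment. Brute force replaces
# the Hungarian algorithm; all coordinates are illustrative only.
import itertools
import math

labeled_protos = {"greet": (0.0, 0.0), "book_flight": (5.0, 5.0)}
unlabeled_protos = [(4.8, 5.1), (0.2, -0.1), (9.0, 0.5)]  # third may be novel

names = list(labeled_protos)
best_cost, best_match = float("inf"), None
# Try every one-to-one assignment of unlabeled prototypes to known categories.
for combo in itertools.permutations(range(len(unlabeled_protos)), len(names)):
    cost = sum(math.dist(labeled_protos[n], unlabeled_protos[i])
               for n, i in zip(names, combo))
    if cost < best_cost:
        best_cost, best_match = cost, dict(zip(names, combo))

novel = set(range(len(unlabeled_protos))) - set(best_match.values())
print(best_match)  # {'greet': 1, 'book_flight': 0}
print(novel)       # {2}: unmatched prototypes are treated as novel categories
```

The alignment lets category-specific knowledge flow from labeled to unlabeled data, while everything outside the matching is handled as potentially novel.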
From Frequency to Meaning: Vector Space Models of Semantics
Computers understand very little of the meaning of human language. This
profoundly limits our ability to give instructions to computers, the ability of
computers to explain their actions to us, and the ability of computers to
analyse and process text. Vector space models (VSMs) of semantics are beginning
to address these limits. This paper surveys the use of VSMs for semantic
processing of text. We organize the literature on VSMs according to the
structure of the matrix in a VSM. There are currently three broad classes of
VSMs, based on term-document, word-context, and pair-pattern matrices, yielding
three classes of applications. We survey a broad range of applications in these
three categories and we take a detailed look at a specific open source project
in each category. Our goal in this survey is to show the breadth of
applications of VSMs for semantics, to provide a new perspective on VSMs for
those who are already familiar with the area, and to provide pointers into the
literature for those who are less familiar with the field.
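The term-document class of VSMs can be illustrated in a few lines: each document becomes a vector of term counts, and document similarity is the cosine between vectors. The toy corpus and whitespace tokenization are illustrative only:

```python
# Minimal term-document VSM sketch: documents become count vectors over a
# shared vocabulary, and similarity is the cosine of the angle between
# them. Corpus and tokenization are made-up toy examples.
import math
from collections import Counter

docs = {
    "d1": "cats chase mice",
    "d2": "dogs chase cats",
    "d3": "stocks rise as markets rally",
}

# Build term-document count vectors over the shared vocabulary.
vocab = sorted({w for text in docs.values() for w in text.split()})
vectors = {}
for name, text in docs.items():
    counts = Counter(text.split())
    vectors[name] = [counts[w] for w in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

sim_12 = cosine(vectors["d1"], vectors["d2"])  # share "cats" and "chase"
sim_13 = cosine(vectors["d1"], vectors["d3"])  # no shared terms
print(sim_12 > sim_13)  # True: topically related documents are closer
```

The same matrix view generalizes to the other two classes in the survey: word-context matrices compare rows (words) instead of columns, and pair-pattern matrices relate word pairs to the patterns that connect them.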
Understanding Chat Messages for Sticker Recommendation in Messaging Apps
Stickers are popularly used in messaging apps such as Hike to visually
express a nuanced range of thoughts and utterances to convey exaggerated
emotions. However, discovering the right sticker from a large and
ever-expanding pool of stickers while chatting can be cumbersome. In this paper, we
describe a system for recommending stickers in real time as the user is typing
based on the context of the conversation. We decompose the sticker
recommendation (SR) problem into two steps. First, we predict the message that
the user is likely to send in the chat. Second, we substitute the predicted
message with an appropriate sticker. The majority of Hike's messages are
text transliterated from users' native language into the Roman
script. This leads to numerous orthographic variations of the same message and
makes accurate message prediction challenging. To address this issue, we learn
dense representations of chat messages with a character-level convolutional
network trained in an unsupervised manner. We use them to cluster messages that
have the same meaning, and in subsequent steps we predict the message cluster
instead of the message. Our approach does not depend on human-labelled data
(except for validation), leading to a fully automatic update and tuning
pipeline for the underlying models. We also propose a novel hybrid message
prediction model, which can run with low latency on low-end phones that have
severe computational limitations. The described system has been deployed for
several months and is being used by millions of users along with hundreds of
thousands of expressive stickers.
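The role of character-level representations in absorbing orthographic variation can be shown with a minimal stand-in: character-trigram overlap (rather than the learned CNN embeddings described in the abstract) already places spelling variants of the same message close together. The messages are made-up:

```python
# Hedged sketch of grouping orthographic variants of transliterated chat
# messages. Character trigrams with boundary markers serve as a simple
# stand-in for the learned character-level representations; similar
# spellings share many trigrams, unrelated messages share almost none.
import math
from collections import Counter

def char_ngrams(text, n=3):
    padded = f"#{text}#"  # boundary markers so prefixes/suffixes count too
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "thank you" in several Roman-script spellings vs. an unrelated message.
variants = ["thank you", "thanku", "thnk u"]
unrelated = "good morning"
base = char_ngrams(variants[0])
for msg in variants[1:]:
    print(msg, cosine(base, char_ngrams(msg)))  # well above the unrelated score
print(unrelated, cosine(base, char_ngrams(unrelated)))  # zero overlap
```

Clustering on such character-level similarity groups variant spellings under one message cluster, which is what makes cluster-level prediction tractable.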
Minimally supervised induction of morphology through bitexts
A knowledge of morphology can be useful for many natural language processing systems. Thus, much effort has been expended in developing accurate computational tools for morphology that lemmatize, segment and generate new forms. The most powerful and accurate of these have been manually encoded, such endeavors being without exception expensive and time-consuming. There have consequently been many attempts to reduce this cost through unsupervised or minimally supervised algorithms and learning methods for the acquisition of morphology. These efforts have yet to produce a tool that approaches the performance of manually encoded systems.
Here, I present a strategy for dealing with morphological clustering and segmentation in a minimally supervised manner, but one that is more linguistically informed than previous unsupervised approaches. That is, this study attempts to induce clusters of words from an unannotated text that are inflectional variants of each other. Then a set of inflectional suffixes by part-of-speech is induced from these clusters. This level of detail is made possible by a method known as alignment and transfer (AT), among other names, an approach that uses aligned bitexts to transfer linguistic resources developed for one language (the source) to another (the target). This approach has a further advantage in that it allows a reduction in the amount of training data without a significant degradation in performance, making it useful in applications targeted at data collected from endangered languages. In the current study, however, I use English as the source and German as the target for ease of evaluation and for certain typological properties of German. The two main tasks, clustering and segmentation, are approached sequentially, with the clustering informing the segmentation to allow for greater accuracy in morphological analysis.
While the performance of these methods does not exceed the current roster of unsupervised or minimally supervised approaches to morphology acquisition, this work attempts to integrate more learning methods than previous studies. Furthermore, it attempts to learn inflectional morphology as opposed to derivational morphology, which is a crucial distinction in linguistics.
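The clustering-then-segmentation pipeline can be sketched in miniature: given an induced cluster of inflectional variants, the longest common prefix serves as the stem and the remainders are read off as candidate suffixes. This toy ignores stem alternations, and the German verb forms are illustrative:

```python
# Minimal sketch of suffix induction from an induced cluster of
# inflectional variants: the shared prefix is taken as the stem, and
# what remains on each word is a candidate inflectional suffix.
import os

def induce_suffixes(cluster):
    stem = os.path.commonprefix(cluster)  # works on plain strings
    return stem, sorted({w[len(stem):] for w in cluster})

# An illustrative cluster of German verb forms sharing one lemma.
stem, suffixes = induce_suffixes(["spielen", "spielt", "spielte", "spielst"])
print(stem, suffixes)  # spiel ['en', 'st', 't', 'te']
```

Aggregating such suffix sets across clusters of the same part of speech yields the inflectional paradigms the study aims to induce.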
Discovering New Intents via Constrained Deep Adaptive Clustering with Cluster Refinement
Identifying new user intents is an essential task in the dialogue system.
However, it is hard to obtain satisfactory clustering results since the definition
of intents is strongly guided by prior knowledge. Existing methods incorporate
prior knowledge by intensive feature engineering, which not only leads to
overfitting but also makes it sensitive to the number of clusters. In this
paper, we propose constrained deep adaptive clustering with cluster refinement
(CDAC+), an end-to-end clustering method that can naturally incorporate
pairwise constraints as prior knowledge to guide the clustering process.
Moreover, we refine the clusters by forcing the model to learn from
high-confidence assignments. After eliminating low-confidence assignments, our
approach is surprisingly insensitive to the number of clusters. Experimental
results on three benchmark datasets show that our method yields
significant improvements over strong baselines.
Comment: Accepted by AAAI 2020
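How pairwise constraints encode prior knowledge can be illustrated with a toy: must-link pairs are merged with union-find, and cannot-link pairs are checked against the resulting groups. CDAC+ feeds such constraints into a deep model as soft training signals; this sketch shows only the constraint logic, with made-up utterances:

```python
# Toy illustration of pairwise constraints as prior knowledge: must-link
# pairs are merged via union-find, and cannot-link pairs are verified not
# to end up in the same group. Utterances and constraints are made-up.

utterances = ["book a flight", "reserve a plane ticket", "play some music",
              "put on a song"]
must_link = [(0, 1), (2, 3)]    # pairs known to share an intent
cannot_link = [(0, 2)]          # pairs known to differ in intent

parent = list(range(len(utterances)))

def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

for a, b in must_link:
    parent[find(a)] = find(b)

clusters = {}
for i in range(len(utterances)):
    clusters.setdefault(find(i), []).append(i)

consistent = all(find(a) != find(b) for a, b in cannot_link)
print(sorted(clusters.values()))  # [[0, 1], [2, 3]]
print(consistent)                 # True: no cannot-link pair was merged
```

In the full method these constraints shape pairwise similarity targets during training rather than being applied as hard rules, which is what makes the approach end-to-end.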