Morpho-syntactic Lexicon Generation Using Graph-based Semi-supervised Learning
Morpho-syntactic lexicons provide information about the morphological and
syntactic roles of words in a language. Such lexicons are not available for all
languages and even when available, their coverage can be limited. We present a
graph-based semi-supervised learning method that uses the morphological,
syntactic and semantic relations between words to automatically construct wide
coverage lexicons from small seed sets. Our method is language-independent, and
we show that we can expand a 1000 word seed lexicon to more than 100 times its
size with high quality for 11 languages. In addition, the automatically created
lexicons provide features that improve performance in two downstream tasks:
morphological tagging and dependency parsing.
Comment: Transactions of the Association for Computational Linguistics (TACL) 201
Machine Learning with World Knowledge: The Position and Survey
Machine learning has become pervasive in multiple domains, impacting a wide
variety of applications, such as knowledge discovery and data mining, natural
language processing, information retrieval, computer vision, social and health
informatics, ubiquitous computing, etc. Two essential problems of machine
learning are how to generate features and how to acquire labels for machines to
learn. In particular, labeling large amounts of data for each domain-specific
problem can be very time-consuming and costly; this has become a key obstacle
to making learning protocols practical in applications. In this paper, we will
discuss how to use the existing general-purpose world knowledge to enhance
machine learning processes, by enriching the features or reducing the labeling
work. We start from the comparison of world knowledge with domain-specific
knowledge, and then introduce three key problems in using world knowledge in
learning processes, i.e., explicit and implicit feature representation,
inference for knowledge linking and disambiguation, and learning with direct or
indirect supervision. Finally, we discuss future directions for this research
topic.
GOGGLES: Automatic Image Labeling with Affinity Coding
Generating large labeled training data is becoming the biggest bottleneck in
building and deploying supervised machine learning models. Recently, the data
programming paradigm has been proposed to reduce the human cost in labeling
training data. However, data programming relies on designing labeling functions
which still requires significant domain expertise. Also, it is prohibitively
difficult to write labeling functions for image datasets as it is hard to
express domain knowledge using raw features for images (pixels).
We propose affinity coding, a new domain-agnostic paradigm for automated
training data labeling. The core premise of affinity coding is that the
affinity scores of instance pairs belonging to the same class on average should
be higher than those of pairs belonging to different classes, according to some
affinity functions. We build the GOGGLES system that implements affinity coding
for labeling image datasets by designing a novel set of reusable affinity
functions for images, and propose a novel hierarchical generative model for
class inference using a small development set.
We compare GOGGLES with existing data programming systems on 5 image labeling
tasks from diverse domains. GOGGLES achieves labeling accuracies ranging from a
minimum of 71% to a maximum of 98% without requiring any extensive human
annotation. In terms of end-to-end performance, GOGGLES outperforms the
state-of-the-art data programming system Snuba by 21% and a state-of-the-art
few-shot learning technique by 5%, and is only 7% away from the fully
supervised upper bound.
Comment: Published at the 2020 ACM SIGMOD International Conference on Management of Data
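The core premise, that same-class instance pairs should on average score higher under an affinity function than cross-class pairs, can be illustrated with a minimal sketch. The data and the distance-based affinity function below are hypothetical assumptions, not GOGGLES's learned affinity functions or its hierarchical generative model:

```python
# Toy sketch of the affinity-coding premise: an unlabeled instance
# inherits the label of its highest-affinity development example.

def affinity(x, y):
    """A simple affinity function: negative squared Euclidean distance."""
    return -sum((a - b) ** 2 for a, b in zip(x, y))

def label_by_affinity(unlabeled, dev_set):
    """dev_set: list of (feature_vector, class_label) pairs."""
    labels = []
    for x in unlabeled:
        best = max(dev_set, key=lambda pair: affinity(x, pair[0]))
        labels.append(best[1])
    return labels

# Two well-separated toy classes and a tiny development set.
dev = [([0.0, 0.0], "cat"), ([5.0, 5.0], "dog")]
unlabeled = [[0.2, -0.1], [4.8, 5.3], [0.1, 0.4]]
print(label_by_affinity(unlabeled, dev))  # -> ['cat', 'dog', 'cat']
```

The real system replaces the hand-picked distance with a set of reusable learned affinity functions and infers classes probabilistically rather than by nearest neighbor.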
A Survey on Data Collection for Machine Learning: a Big Data -- AI Integration Perspective
Data collection is a major bottleneck in machine learning and an active
research topic in multiple communities. There are largely two reasons data
collection has recently become a critical issue. First, as machine learning is
becoming more widely-used, we are seeing new applications that do not
necessarily have enough labeled data. Second, unlike traditional machine
learning, deep learning techniques automatically generate features, which saves
feature engineering costs, but in return may require larger amounts of labeled
data. Interestingly, recent research in data collection comes not only from the
machine learning, natural language, and computer vision communities, but also
from the data management community due to the importance of handling large
amounts of data. In this survey, we perform a comprehensive study of data
collection from a data management point of view. Data collection largely
consists of data acquisition, data labeling, and improvement of existing data
or models. We provide a research landscape of these operations, provide
guidelines on which technique to use when, and identify interesting research
challenges. The integration of machine learning and data management for data
collection is part of a larger trend of Big data and Artificial Intelligence
(AI) integration and opens many opportunities for new research.
Comment: 20 pages
Class label autoencoder for zero-shot learning
Existing zero-shot learning (ZSL) methods usually learn a projection function
between a feature space and a semantic embedding space (text or attribute
space), applied to the seen classes during training and the unseen classes
during testing. However, the projection
function cannot be used between the feature space and multi-semantic embedding
spaces, which have the diversity characteristic for describing the different
semantic information of the same class. To deal with this issue, we present a
novel method to ZSL based on learning class label autoencoder (CLA). CLA can
not only build a uniform framework for adapting to multi-semantic embedding
spaces, but also construct the encoder-decoder mechanism for constraining the
bidirectional projection between the feature space and the class label space.
Moreover, CLA can jointly consider the relationship of feature classes and the
relevance of the semantic classes for improving zero-shot classification. The
CLA solution can provide both unseen class labels and relations between the
different class representations (feature or semantic information) that encode
the intrinsic structure of the classes. Extensive experiments demonstrate that
CLA outperforms state-of-the-art methods on four benchmark datasets: AwA, CUB,
Dogs, and ImNet-2.
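The projection idea the abstract starts from can be sketched in a few lines. This is the generic ZSL recipe, not the CLA autoencoder; the attribute vectors and the identity "projection" below are toy assumptions:

```python
# Generic zero-shot prediction sketch: a learned projection maps a
# feature vector into the semantic space, and an unseen class is
# predicted by the nearest class embedding.

def predict_unseen(feature, class_embeddings, project=lambda x: x):
    """class_embeddings: {class_name: semantic vector} for unseen classes."""
    z = project(feature)
    def dist(name):
        v = class_embeddings[name]
        return sum((a - b) ** 2 for a, b in zip(z, v))
    return min(class_embeddings, key=dist)

# Toy attribute vectors for two unseen classes.
unseen = {"zebra": [1.0, 1.0, 0.0], "whale": [0.0, 0.0, 1.0]}
print(predict_unseen([0.9, 0.8, 0.1], unseen))  # prints "zebra"
```

CLA's contribution is to replace the single one-way projection with an encoder-decoder that constrains the projection in both directions and adapts to multiple semantic embedding spaces at once.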
Learning with Inadequate and Incorrect Supervision
In practice, we often face the dilemma that the labeled data at hand are
inadequate for training a reliable classifier and, more seriously, that some of
these data may be mislabeled due to various human factors.
Therefore, this paper proposes a novel semi-supervised learning paradigm that
can handle both label insufficiency and label inaccuracy. To address label
insufficiency, we use a graph to bridge the data points so that the label
information can be propagated from the scarce labeled examples to unlabeled
examples along the graph edges. To address label inaccuracy, Graph Trend
Filtering (GTF) and Smooth Eigenbase Pursuit (SEP) are adopted to filter out
the initial noisy labels. GTF penalizes the l_0 norm of label difference
between connected examples in the graph and exhibits better local adaptivity
than the traditional l_2 norm-based Laplacian smoother. SEP reconstructs the
correct labels by emphasizing the leading eigenvectors of Laplacian matrix
associated with small eigenvalues, as these eigenvectors reflect real label
smoothness and carry rich class separation cues. We term our algorithm
'Semi-supervised learning under Inadequate and Incorrect Supervision' (SIIS).
Thorough experimental results on image classification, text categorization, and
speech recognition demonstrate that our SIIS is effective in label error
correction, leading to performance superior to state-of-the-art methods in the
presence of label noise and label scarcity.
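The graph-based propagation that SIIS builds on (before its GTF and SEP filtering steps) can be sketched as plain iterative label propagation on a toy chain graph. The graph, seeds, and iteration count below are illustrative assumptions, not the paper's algorithm:

```python
# Minimal label-propagation sketch: scarce seed labels spread to
# unlabeled nodes along graph edges, with labeled nodes clamped.

def propagate(adjacency, seed_labels, n_iters=50):
    """adjacency: {node: [neighbors]}; seed_labels: {node: class_id}."""
    classes = sorted(set(seed_labels.values()))
    scores = {n: {c: 0.0 for c in classes} for n in adjacency}
    for n, c in seed_labels.items():
        scores[n][c] = 1.0
    for _ in range(n_iters):
        new_scores = {}
        for n, nbrs in adjacency.items():
            if n in seed_labels:          # clamp the labeled nodes
                new_scores[n] = scores[n]
                continue
            new_scores[n] = {
                c: sum(scores[m][c] for m in nbrs) / len(nbrs)
                for c in classes
            }
        scores = new_scores
    return {n: max(scores[n], key=scores[n].get) for n in adjacency}

# Chain graph 0 - 1 - 2 - 3 - 4, with seed labels at the endpoints.
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
print(propagate(adj, {0: "A", 4: "B"}))
```

SIIS departs from this baseline by filtering noisy seed labels first (GTF's l_0 penalty and SEP's eigenbase reconstruction), since clamping mislabeled seeds would propagate the error.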
A Survey of Deep Learning Methods for Relation Extraction
Relation Extraction is an important sub-task of Information Extraction which
has the potential of employing deep learning (DL) models with the creation of
large datasets using distant supervision. In this review, we compare the
contributions and pitfalls of the various DL models that have been used for the
task, to help guide the path ahead.
Unsupervised Transfer Learning for Spoken Language Understanding in Intelligent Agents
User interaction with voice-powered agents generates large amounts of
unlabeled utterances. In this paper, we explore techniques to efficiently
transfer the knowledge from these unlabeled utterances to improve model
performance on Spoken Language Understanding (SLU) tasks. We use Embeddings
from Language Model (ELMo) to take advantage of unlabeled data by learning
contextualized word representations. Additionally, we propose ELMo-Light
(ELMoL), a faster and simpler unsupervised pre-training method for SLU. Our
findings suggest that unsupervised pre-training on a large corpus of unlabeled
utterances leads to significantly better SLU performance than training from
scratch, and that it can even outperform conventional supervised transfer.
Additionally, we show that the gains from unsupervised transfer techniques can
be further improved by supervised transfer. The improvements are more
pronounced in low-resource settings: when using only 1000 labeled in-domain
samples, our techniques match the performance of training from scratch on
10-15x more labeled in-domain data.
Comment: To appear at AAAI 201
Classifying Documents within Multiple Hierarchical Datasets using Multi-Task Learning
Multi-task learning (MTL) is a supervised learning paradigm in which the
prediction models for several related tasks are learned jointly to achieve
better generalization performance. When there are only a few training examples
per task, MTL considerably outperforms traditional single-task learning
(STL) in terms of prediction accuracy. In this work, we develop an MTL-based
approach for classifying documents that are archived within dual concept
hierarchies, namely, DMOZ and Wikipedia. We solve the multi-class
classification problem by defining one-versus-rest binary classification tasks
for each of the different classes across the two hierarchical datasets. Instead
of learning a linear discriminant for each of the different tasks
independently, we use an MTL approach with relationships between the different
tasks across the datasets established using the non-parametric, lazy, nearest
neighbor approach. We also develop and evaluate a transfer learning (TL)
approach and compare the MTL (and TL) methods against the standard single task
learning and semi-supervised learning approaches. Our empirical results
demonstrate the strength of the developed methods, which show an improvement
especially when there are fewer training examples per classification task.
Comment: IEEE International Conference on Tools with Artificial Intelligence
(ICTAI), 201
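The one-versus-rest task construction described above can be sketched directly; the documents and class names below are hypothetical, and the joint MTL training itself is omitted:

```python
# Cast a multi-class problem as one binary task per class: examples of
# that class become positives (1), all others negatives (0).

def one_vs_rest_tasks(examples):
    """examples: list of (features, class_label) -> {class: binary task}."""
    classes = sorted({y for _, y in examples})
    tasks = {}
    for c in classes:
        tasks[c] = [(x, 1 if y == c else 0) for x, y in examples]
    return tasks

# Toy document features with two classes.
data = [([1.0, 0.0], "Arts"), ([0.0, 1.0], "Science"), ([0.5, 0.5], "Arts")]
tasks = one_vs_rest_tasks(data)
print(sorted(tasks))                    # -> ['Arts', 'Science']
print([y for _, y in tasks["Arts"]])    # -> [1, 0, 1]
```

In the paper, the resulting binary tasks across the two hierarchies are then coupled through nearest-neighbor task relationships rather than trained independently.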
COBRA: Contrastive Bi-Modal Representation Algorithm
There are a wide range of applications that involve multi-modal data, such as
cross-modal retrieval, visual question-answering, and image captioning. Such
applications are primarily dependent on aligned distributions of the different
constituent modalities. Existing approaches generate latent embeddings for each
modality in a joint fashion by representing them in a common manifold. However,
these joint embedding spaces fail to sufficiently reduce the modality gap,
which affects the performance in downstream tasks. We hypothesize that these
embeddings retain the intra-class relationships but are unable to preserve the
inter-class dynamics. In this paper, we present a novel framework COBRA that
aims to train two modalities (image and text) in a joint fashion inspired by
the Contrastive Predictive Coding (CPC) and Noise Contrastive Estimation (NCE)
paradigms, preserving both inter- and intra-class relationships. We
empirically show that this framework reduces the modality gap significantly and
generates a robust and task agnostic joint-embedding space. We outperform
existing work on four diverse downstream tasks spanning across seven benchmark
cross-modal datasets.
Comment: 13 pages, 6 figures, and 10 tables
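The NCE-style objective that COBRA draws on can be sketched as a standard InfoNCE loss over paired image/text embeddings. This is the generic CPC/NCE formulation with toy vectors, an assumption for illustration, not COBRA's actual loss or embedding networks:

```python
# InfoNCE sketch for bi-modal pairs: each image embedding should score
# highest against its own text embedding among all texts in the batch.
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def info_nce(image_embs, text_embs, temperature=0.1):
    """Mean negative log-softmax of each true (image, text) pair."""
    loss = 0.0
    for i, img in enumerate(image_embs):
        logits = [dot(img, txt) / temperature for txt in text_embs]
        log_denom = math.log(sum(math.exp(l) for l in logits))
        loss += -(logits[i] - log_denom)  # -log softmax of the true pair
    return loss / len(image_embs)

# Aligned toy pairs: matching embeddings are close, mismatched are not,
# so the contrastive loss is small.
imgs = [[1.0, 0.0], [0.0, 1.0]]
txts = [[0.9, 0.1], [0.1, 0.9]]
print(info_nce(imgs, txts))
```

Minimizing such a loss pulls matched image/text embeddings together and pushes mismatched ones apart, which is how these objectives shrink the modality gap the abstract describes.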