13,626 research outputs found
CleanNet: Transfer Learning for Scalable Image Classifier Training with Label Noise
In this paper, we study the problem of learning image classification models
with label noise. Existing approaches depending on human supervision are
generally not scalable as manually identifying correct or incorrect labels is
time-consuming, whereas approaches not relying on human supervision are
scalable but less effective. To reduce the amount of human supervision for
label noise cleaning, we introduce CleanNet, a joint neural embedding network,
which only requires a fraction of the classes being manually verified to
provide the knowledge of label noise that can be transferred to other classes.
We further integrate CleanNet and conventional convolutional neural network
classifier into one framework for image classification learning. We demonstrate
the effectiveness of the proposed algorithm on both of the label noise
detection task and the image classification on noisy data task on several
large-scale datasets. Experimental results show that CleanNet can reduce label
noise detection error rate on held-out classes where no human supervision
available by 41.5% compared to current weakly supervised methods. It also
achieves 47% of the performance gain of verifying all images with only 3.2%
images verified on an image classification task. Source code and dataset will
be available at kuanghuei.github.io/CleanNetProject.Comment: Accepted to CVPR 201
Personalized Dialogue Generation with Diversified Traits
Endowing a dialogue system with particular personality traits is essential to
deliver more human-like conversations. However, due to the challenge of
embodying personality via language expression and the lack of large-scale
persona-labeled dialogue data, this research problem is still far from
well-studied. In this paper, we investigate the problem of incorporating
explicit personality traits in dialogue generation to deliver personalized
dialogues.
To this end, firstly, we construct PersonalDialog, a large-scale multi-turn
dialogue dataset containing various traits from a large number of speakers. The
dataset consists of 20.83M sessions and 56.25M utterances from 8.47M speakers.
Each utterance is associated with a speaker who is marked with traits like Age,
Gender, Location, Interest Tags, etc. Several anonymization schemes are
designed to protect the privacy of each speaker. This large-scale dataset will
facilitate not only the study of personalized dialogue generation, but also
other researches on sociolinguistics or social science.
Secondly, to study how personality traits can be captured and addressed in
dialogue generation, we propose persona-aware dialogue generation models within
the sequence to sequence learning framework. Explicit personality traits
(structured by key-value pairs) are embedded using a trait fusion module.
During the decoding process, two techniques, namely persona-aware attention and
persona-aware bias, are devised to capture and address trait-related
information. Experiments demonstrate that our model is able to address proper
traits in different contexts. Case studies also show interesting results for
this challenging research problem.Comment: Please contact [zhengyinhe1 at 163 dot com] for the PersonalDialog
datase
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are a very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.Comment: Accepted for publication on ACM Computing Survey
- …