A Curriculum Learning Approach for Multi-domain Text Classification Using Keyword weight Ranking
Text classification is a very classic NLP task, but it has two prominent
shortcomings: On the one hand, text classification is deeply domain-dependent.
That is, a classifier trained on the corpus of one domain may not perform so
well in another domain. On the other hand, text classification models require a
lot of annotated data for training. However, for some domains, there may not
exist enough annotated data. Therefore, it is valuable to investigate how to
efficiently utilize text data from different domains to improve the performance
of models in various domains. Some multi-domain text classification models are
trained by adversarial training to extract shared features among all domains
and the specific features of each domain. We note that the distinctness of the
domain-specific features varies across domains, so in this paper we propose a
curriculum learning strategy based on keyword weight ranking to improve the
performance of multi-domain text classification models. The experimental
results on the Amazon review and FDU-MTL datasets show that our curriculum
learning strategy effectively improves the performance of multi-domain text
classification models based on adversarial learning and outperforms
state-of-the-art methods. Comment: Submitted to ICASSP2023 (currently under review)
Improving Document Representation Using Retrofitting
Data-driven learning of document vectors that capture the linkage between documents is of immense importance in natural language processing (NLP). These document vectors can, in turn, be used for tasks like information retrieval, document classification, and clustering. Documents are inherently linked, through hyperlinks in the case of web pages and citations in the case of academic papers. Methods like PV-DM or PV-DBOW try to capture the semantic representation of a document using only the text information, ignoring the network information altogether. Conversely, methods developed for network representation learning, like node2vec or DeepWalk, capture the linkage information between documents but ignore the text information altogether. In this thesis, we propose a method based on retrofitting, a technique originally developed for refining word embeddings with a semantic lexicon, that incorporates both the text and the network information when learning the document representation. We also analyze the optimum weight for adding network information, that is, the weight that gives the best embedding. Our experimental results show that our method improves the classification score by 4%. We also introduce a new dataset containing both network and content information.
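The abstract does not spell out the thesis's objective, so the following is only a minimal sketch of a retrofitting-style update in the spirit of the original word-embedding technique, applied to documents: each document vector is repeatedly pulled toward the average of its linked neighbours while staying anchored to its text-only vector. The function name `retrofit`, the `network_weight` parameter, and the uniform per-neighbour weighting are assumptions made for illustration, not the author's implementation.

```python
def retrofit(text_vectors, links, network_weight=1.0, iterations=10):
    """Retrofitting-style smoothing of document vectors over a link graph.

    text_vectors: dict mapping document id -> list of floats (text-only embedding)
    links:        dict mapping document id -> list of linked document ids
                  (assumption: the caller lists both directions for undirected links)
    network_weight: hypothetical knob for how strongly linked neighbours pull
                    a document; 0.0 leaves the text-only vectors unchanged
    """
    # Work on copies so the text-only vectors stay available as anchors.
    vectors = {doc: list(vec) for doc, vec in text_vectors.items()}
    for _ in range(iterations):
        for doc, original in text_vectors.items():
            neighbours = [n for n in links.get(doc, []) if n in vectors]
            if not neighbours:
                continue  # isolated documents keep their text-only vector
            # Spread the network weight uniformly over the neighbours.
            beta = network_weight / len(neighbours)
            updated = []
            for k in range(len(original)):
                neighbour_sum = sum(vectors[n][k] for n in neighbours)
                # Weighted average of the anchor (weight 1) and the neighbours.
                updated.append(
                    (original[k] + beta * neighbour_sum) / (1.0 + network_weight)
                )
            vectors[doc] = updated
    return vectors
```

Sweeping `network_weight` over a small grid and scoring a downstream classifier on the resulting vectors would be one way to look for the kind of optimum weight the abstract mentions.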
Double-Weighting for Covariate Shift Adaptation
Supervised learning is often affected by a covariate shift in which the
marginal distributions of instances (covariates $x$) of training and testing
samples, $p_{\mathrm{tr}}(x)$ and $p_{\mathrm{te}}(x)$, are different
but the label conditionals coincide. Existing approaches address such covariate
shift by either using the ratio $p_{\mathrm{te}}(x)/p_{\mathrm{tr}}(x)$
to weight training samples
(reweighted methods) or using the ratio $p_{\mathrm{tr}}(x)/p_{\mathrm{te}}(x)$
to weight testing samples
(robust methods). However, the performance of such approaches can be poor under
support mismatch or when the above ratios take large values. We propose a
minimax risk classification (MRC) approach for covariate shift adaptation that
avoids such limitations by weighting both training and testing samples. In
addition, we develop effective techniques that obtain both sets of weights and
generalize the conventional kernel mean matching method. We provide novel
generalization bounds for our method that show a significant increase in the
effective sample size compared with reweighted methods. The proposed method
also achieves enhanced classification performance in both synthetic and
empirical experiments.
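As a point of reference, the conventional reweighted baseline described above (not the proposed double-weighting MRC method) can be sketched as follows. The sketch assumes both marginal densities are known in closed form, which is rarely true in practice, and it uses Kish's formula as one common notion of effective sample size; the paper's own definition may differ. All function names are hypothetical.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian, used here as a stand-in for a known marginal."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weights(xs, p_te, p_tr):
    """Reweighted-method weights: the testing density over the training density."""
    return [p_te(x) / p_tr(x) for x in xs]


def weighted_risk(losses, weights):
    """Importance-weighted empirical risk over the training samples."""
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


def kish_effective_sample_size(weights):
    """Kish's effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)
```

When the two densities barely overlap, a few training samples receive very large weights and the effective sample size collapses toward one, which is the failure mode the abstract attributes to large ratios and support mismatch.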
Can Automatic Abstracting Improve on Current Extracting Techniques in Aiding Users to Judge the Relevance of Pages in Search Engine Results?
Current search engines use sentence extraction techniques to produce snippet result summaries, which users may find less than ideal for determining the relevance of pages. Unlike extracting, abstracting programs analyse the context of documents and rewrite them into informative summaries. Our project aims to produce abstracting summaries that are coherent and easy to read, thereby reducing the time users spend judging the relevance of pages. However, automatic abstracting techniques are domain-restricted. To address this problem, we propose to employ text classification techniques. We propose a new approach that first classifies whole web documents into sixteen top-level ODP categories using machine learning and a Bayesian classifier. We then manually create sixteen templates for each category. The summarisation techniques we use include natural language processing techniques that weight words and analyse lexical chains to identify salient phrases, which are then placed into relevant template slots to produce summaries.