Search CORE

292,017 research outputs found

A Curriculum Learning Approach for Multi-domain Text Classification Using Keyword weight Ranking

Authors: Li Yangning, Li Yinghui, Wu Wei, Xie Rui, Yuan Zilin, Zheng Hai-Tao
Publication date: 26/10/2022

Text classification is a classic NLP task, but it has two prominent shortcomings. First, it is deeply domain-dependent: a classifier trained on the corpus of one domain may not perform as well in another domain. Second, text classification models require a large amount of annotated data for training, and some domains do not have enough of it. It is therefore valuable to investigate how to use text data from different domains efficiently to improve model performance across domains. Some multi-domain text classification models are trained adversarially to extract both the features shared among all domains and the specific features of each domain. We note that the distinctness of the domain-specific features varies, so in this paper we propose a curriculum learning strategy based on keyword weight ranking to improve the performance of multi-domain text classification models. Experimental results on the Amazon review and FDU-MTL datasets show that our curriculum learning strategy effectively improves the performance of multi-domain text classification models based on adversarial learning and outperforms state-of-the-art methods. Comment: Submitted to ICASSP2023 (currently under review)

arXiv.org e-Print Archive

Improving Document Representation Using Retrofitting

Author: Mansoor Zeeshan
Publication venue: 'University of Windsor Leddy Library'
Publication date: 01/01/2019

Data-driven learning of document vectors that capture the linkage between documents is of immense importance in natural language processing (NLP). These document vectors can, in turn, be used for tasks such as information retrieval, document classification, and clustering. Documents are inherently linked to one another, through hyperlinks in the case of web pages and citations in the case of academic papers. Methods like PV-DM or PV-DBOW try to capture the semantic representation of a document using only the text information, and ignore the network information altogether while learning the representation. Similarly, methods developed for network representation learning, like node2vec or DeepWalk, capture the linkage information between documents but ignore the text information altogether. In this thesis, we propose a method based on Retrofit, a technique for learning word embeddings using a semantic lexicon, that incorporates both the text and the network information while learning the document representation. We also analyze the optimum weight for adding network information that gives the best embedding. Our experimental results show that our method improves the classification score by 4%. We also introduce a new dataset containing both network and content information.
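The retrofitting idea in this abstract can be illustrated with a minimal sketch. It assumes the standard retrofitting update, in which each vector is repeatedly replaced by a weighted average of its original text-only embedding and the current vectors of its linked neighbours. The function name `retrofit`, the weights `alpha` (text) and `beta` (network), and the toy data are hypothetical choices for illustration, not the thesis's actual implementation.

```python
def retrofit(text_vecs, links, alpha=1.0, beta=1.0, iterations=10):
    """Pull each document vector toward the vectors of its linked neighbours.

    text_vecs: dict mapping doc id -> list of floats (text-only embedding)
    links:     dict mapping doc id -> list of neighbouring doc ids
    alpha:     weight on the original text embedding (assumed name)
    beta:      weight on each network neighbour (assumed name)
    """
    # Work on copies so the text-only embeddings are left untouched.
    new_vecs = {doc: list(vec) for doc, vec in text_vecs.items()}
    for _ in range(iterations):
        for doc, vec in text_vecs.items():
            neighbours = [n for n in links.get(doc, []) if n in new_vecs]
            if not neighbours:
                continue  # documents without links keep their text embedding
            denom = alpha + beta * len(neighbours)
            new_vecs[doc] = [
                (alpha * vec[k] + beta * sum(new_vecs[n][k] for n in neighbours)) / denom
                for k in range(len(vec))
            ]
    return new_vecs


# Toy example: "a" and "b" cite each other, "c" has no links.
text_vecs = {"a": [0.0, 0.0], "b": [2.0, 0.0], "c": [5.0, 5.0]}
links = {"a": ["b"], "b": ["a"]}
vecs = retrofit(text_vecs, links, iterations=50)
```

In this sketch, raising `beta` relative to `alpha` moves linked documents closer together, which is one way to read the abstract's "weight for adding network information".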

Scholarship at UWindsor

Double-Weighting for Covariate Shift Adaptation

Authors: Liu Anqi, Mazuelas Santiago, Segovia-Martín José I.
Publication date: 27/05/2023

Supervised learning is often affected by a covariate shift in which the marginal distributions of instances (covariates $x$) of training and testing samples, $\mathrm{p}_{\text{tr}}(x)$ and $\mathrm{p}_{\text{te}}(x)$, are different but the label conditionals coincide. Existing approaches address such covariate shift by either using the ratio $\mathrm{p}_{\text{te}}(x)/\mathrm{p}_{\text{tr}}(x)$ to weight training samples (reweighted methods) or using the ratio $\mathrm{p}_{\text{tr}}(x)/\mathrm{p}_{\text{te}}(x)$ to weight testing samples (robust methods). However, the performance of such approaches can be poor under support mismatch or when the above ratios take large values. We propose a minimax risk classification (MRC) approach for covariate shift adaptation that avoids such limitations by weighting both training and testing samples. In addition, we develop effective techniques that obtain both sets of weights and generalize the conventional kernel mean matching method. We provide novel generalization bounds for our method that show a significant increase in the effective sample size compared with reweighted methods. The proposed method also achieves enhanced classification performance in both synthetic and empirical experiments.
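The reweighted baseline mentioned in this abstract can be sketched as follows. The sketch assumes both densities are known one-dimensional Gaussians, which is an illustrative simplification (in practice the ratio has to be estimated), and it uses Kish's formula for the effective sample size, which may differ from the definition used in the paper. All function names are hypothetical. It shows the conventional training-sample weighting only, not the proposed double-weighting MRC method.

```python
import math


def gaussian_pdf(x, mean, std):
    """Density of a one-dimensional normal distribution."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def importance_weights(train_xs, p_tr, p_te):
    """Weight each training instance by the ratio p_te(x) / p_tr(x)."""
    return [p_te(x) / p_tr(x) for x in train_xs]


def effective_sample_size(weights):
    """Kish's effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


def weighted_risk(weights, losses):
    """Self-normalised weighted average of per-sample training losses."""
    return sum(w * l for w, l in zip(weights, losses)) / sum(weights)


# Assumed toy setting: training covariates ~ N(0, 1), testing covariates ~ N(1, 1).
p_tr = lambda x: gaussian_pdf(x, 0.0, 1.0)
p_te = lambda x: gaussian_pdf(x, 1.0, 1.0)
train_xs = [-1.0, 0.0, 0.5, 2.0]
weights = importance_weights(train_xs, p_tr, p_te)
ess = effective_sample_size(weights)
```

With these toy densities the ratio grows exponentially in `x`, so one sample dominates the weights and the effective sample size falls well below the four available samples. This is the kind of degradation the abstract attributes to large ratio values.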

arXiv.org e-Print Archive

Can Automatic Abstracting Improve on Current Extracting Techniques in Aiding Users to Judge the Relevance of Pages in Search Engine Results?

Author: Liang SF
Publication date: 01/01/2004

Current search engines use sentence extraction techniques to produce snippet result summaries, which users may find less than ideal for determining the relevance of pages. Unlike extracting, abstracting programs analyse the context of documents and rewrite them into informative summaries. Our project aims to produce abstracting summaries which are coherent and easy to read, thereby reducing the time users spend judging the relevance of pages. However, automatic abstracting techniques are restricted to particular domains. To solve this problem, we propose to employ text classification techniques. We propose a new approach that first classifies whole web documents into sixteen top-level ODP categories using machine learning and a Bayesian classifier. We then manually create sixteen templates for each category. The summarisation techniques we use include natural language processing techniques that weight words and analyse lexical chains to identify salient phrases, which are then placed into the relevant template slots to produce summaries.
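The classification step described in this abstract could look roughly like the sketch below. The abstract only says "a Bayesian classifier", so the choice of a multinomial naive Bayes model with add-one smoothing is an assumption, as are the function names, the two example category labels, and the toy training documents.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """Count documents and words per category.

    labelled_docs: list of (category, list of tokens) pairs.
    """
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for category, tokens in labelled_docs:
        doc_counts[category] += 1
        word_counts[category].update(tokens)
        vocab.update(tokens)
    return doc_counts, word_counts, vocab


def classify(tokens, model):
    """Return the category with the highest log posterior for the tokens."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)  # log prior
        total_words = sum(word_counts[category].values())
        for tok in tokens:
            if tok not in vocab:
                continue  # ignore words never seen during training
            # Add-one (Laplace) smoothing avoids zero probabilities.
            score += math.log(
                (word_counts[category][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Toy training set with two illustrative category labels.
model = train_naive_bayes([
    ("Sports", ["football", "match", "goal"]),
    ("Computers", ["software", "code", "linux"]),
])
predicted = classify(["goal", "match"], model)
```

Under the approach the abstract describes, the predicted category would then select which manually written template the summariser fills in.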

Southampton (e-Prints Soton)