Outlier-Aware Training for Improving Group Accuracy Disparities
Methods addressing spurious correlations such as Just Train Twice (JTT,
arXiv:2107.09044v2) involve reweighting a subset of the training set to
maximize the worst-group accuracy. However, the reweighted set of examples may
potentially contain unlearnable examples that hamper the model's learning. We
propose mitigating this by detecting outliers to the training set and removing
them before reweighting. Our experiments show that our method achieves
competitive or better accuracy compared with JTT and can detect and remove
annotation errors in the subset being reweighted by JTT.
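The filtering step described above might be sketched as follows. This is an illustrative rendition, not the paper's exact procedure: the function name `build_upweight_set`, the loss-quantile outlier rule, and all parameter names are assumptions made for exposition.

```python
# Illustrative sketch of outlier filtering before JTT-style reweighting.
# The loss-quantile rule below is an assumption, not the paper's method.

def build_upweight_set(losses, errors, outlier_quantile=0.95):
    """Select misclassified training examples for upweighting, dropping
    the highest-loss ones as suspected outliers or annotation errors.

    losses: per-example training losses; errors: indices the initial
    model misclassified (JTT's error set).
    """
    error_losses = sorted(losses[i] for i in errors)
    if not error_losses:
        return set()
    # keep only error-set examples at or below the loss quantile cutoff
    cutoff = error_losses[int(outlier_quantile * (len(error_losses) - 1))]
    return {i for i in errors if losses[i] <= cutoff}
```

A JTT-style second training run would then upweight only the returned indices rather than the full error set.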
XFEVER: Exploring Fact Verification across Languages
This paper introduces the Cross-lingual Fact Extraction and VERification
(XFEVER) dataset, designed for benchmarking fact verification models across
different languages. We constructed it by translating the claim and evidence
texts of the Fact Extraction and VERification (FEVER) dataset into six
languages. The training and development sets were translated using machine
translation, whereas the test set includes texts translated by professional
translators and machine-translated texts. Using the XFEVER dataset, two
cross-lingual fact verification scenarios, zero-shot learning and
translate-train learning, are defined, and baseline models for each scenario
are also proposed in this paper. Experimental results show that the
multilingual language model can be used to build fact verification models in
different languages efficiently. However, the performance varies by language
and is somewhat inferior to the English case. We also found that we can
effectively mitigate model miscalibration by considering the prediction
similarity between the English and target languages. The XFEVER dataset, code,
and model checkpoints are available at
https://github.com/nii-yamagishilab/xfever.
Comment: Accepted for an oral presentation at the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023).
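The two scenarios defined above differ only in how training data is assembled; a toy sketch of that distinction follows. The function `training_data` and the `translate` callback are illustrative names, not the paper's API.

```python
# Toy sketch of how the two XFEVER training scenarios assemble data.
# Names here are assumptions for illustration, not the paper's code.

def training_data(scenario, english_pairs, translate):
    """english_pairs: list of (claim, evidence, label) tuples in English.
    translate: callable mapping an English string to the target language."""
    if scenario == "zero-shot":
        # train on English only; rely on the multilingual encoder
        # to transfer to other languages at test time
        return list(english_pairs)
    if scenario == "translate-train":
        # machine-translate claim and evidence into the target language
        return [(translate(c), translate(e), y) for c, e, y in english_pairs]
    raise ValueError(f"unknown scenario: {scenario}")
```

In both cases the same multilingual language model would be fine-tuned on the returned pairs; only the language of the training text changes.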
A Practical Text Summarizer by Paragraph Extraction for Thai
In this paper, we propose a practical approach for extracting the most relevant paragraphs from the original document to form a summary for Thai text. The idea of our approach is to exploit both the local and global properties of paragraphs. The local property can be considered as clusters of significant words within each paragraph, while the global property can be thought of as relations of all paragraphs in a document. These two properties are combined for ranking and extracting summaries. Experimental results on real-world data sets are encouraging.
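A toy rendition of combining a local and a global paragraph score might look like the following. The scoring formulas and the equal weighting are assumptions for illustration; the paper's actual measures differ.

```python
# Toy paragraph ranking combining a local and a global property.
# Both scoring formulas are illustrative assumptions.
import math
from collections import Counter

def paragraph_scores(paragraphs):
    """Local score: average corpus frequency of a paragraph's words
    (a proxy for clusters of significant words). Global score: mean
    cosine similarity to the other paragraphs (a proxy for relations
    among paragraphs). Equal 0.5/0.5 weighting is an assumption."""
    docs = [Counter(p.lower().split()) for p in paragraphs]
    corpus_tf = Counter()
    for d in docs:
        corpus_tf.update(d)

    def cosine(a, b):
        num = sum(a[w] * b[w] for w in set(a) & set(b))
        den = (math.sqrt(sum(v * v for v in a.values()))
               * math.sqrt(sum(v * v for v in b.values())))
        return num / den if den else 0.0

    scores, n = [], len(docs)
    for i, d in enumerate(docs):
        local = sum(corpus_tf[w] for w in d) / max(len(d), 1)
        glob = sum(cosine(d, docs[j]) for j in range(n) if j != i) / max(n - 1, 1)
        scores.append(0.5 * local + 0.5 * glob)
    return scores
```

A summary would then be formed by extracting the top-ranked paragraphs in their original order.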
A Parallel Learning Algorithm for Text Classification
Text classification is the process of classifying documents into predefined categories based on their content. Existing supervised learning algorithms to automatically classify text need sufficient labeled documents to learn accurately. Applying the Expectation-Maximization (EM) algorithm to this problem is an alternative approach that utilizes a large pool of unlabeled documents to augment the available labeled documents. Unfortunately, the time needed to learn from these large collections of unlabeled documents is too high. This paper introduces a novel parallel learning algorithm for the text classification task. The parallel algorithm is based on the combination of the EM algorithm and the naive Bayes classifier. Our goal is to improve the computational time of the learning and classification process. We studied the performance of our parallel algorithm on a large Linux PC cluster called PIRUN Cluster. We report both timing and accuracy results. These results indicate that the proposed parallel algorithm is capable of handling large document collections.
Refining A Divisive Partitioning Algorithm for Unsupervised Clustering
The Principal Direction Divisive Partitioning (PDDP) algorithm is a fast and scalable clustering algorithm [3]. The basic idea is to recursively split the data set into sub-clusters based on principal direction vectors. However, the PDDP algorithm can yield poor results, especially when cluster structures are not well-separated from one another.
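A one-level split in the PDDP spirit can be sketched as follows. This is a toy illustration, not the implementation cited as [3]; the function name and the sign-based partition rule are assumptions.

```python
# Toy one-level PDDP-style split: partition rows of X by the sign of
# their projection onto the leading principal direction. Illustrative
# sketch, not the cited implementation.
import numpy as np

def pddp_split(X):
    """Return (left_indices, right_indices) for one divisive split."""
    centered = X - X.mean(axis=0)
    # leading right singular vector of the centered data matrix
    # is the principal direction
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[0]
    return np.where(proj <= 0)[0], np.where(proj > 0)[0]
```

Recursing on each side until a depth or scatter criterion is met yields the full divisive hierarchy; the refinement discussed here targets the cases where this sign-based split performs poorly on poorly separated clusters.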