Few-Shot and Zero-Shot Learning for Historical Text Normalization
Historical text normalization often relies on small training datasets. Recent
work has shown that multi-task learning can lead to significant improvements by
exploiting synergies with related datasets, but there has been no systematic
study of different multi-task learning architectures. This paper evaluates
63 multi-task learning configurations for sequence-to-sequence-based historical
text normalization across ten datasets from eight languages, using
autoencoding, grapheme-to-phoneme mapping, and lemmatization as auxiliary
tasks. We observe consistent, significant improvements across languages when
training data for the target task is limited, but minimal or no improvements
when training data is abundant. We also show that zero-shot learning
outperforms the simple, but relatively strong, identity baseline.
Comment: Accepted at DeepLo-201
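To make the multi-task setup concrete, here is a minimal sketch of a
sequence-to-sequence model with a shared encoder and task-specific decoders,
the general architecture family the abstract evaluates. All class, task, and
parameter names are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MultiTaskSeq2Seq(nn.Module):
    """Shared encoder with one decoder head per task (illustrative sketch).

    Tasks might be 'normalize' (the target) plus auxiliaries such as
    'autoencode', 'g2p', and 'lemmatize', as in the abstract.
    """
    def __init__(self, vocab_size, hidden, tasks):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        # One decoder + output projection per task; the encoder parameters
        # are shared, which is where the multi-task synergy comes from.
        self.decoders = nn.ModuleDict(
            {t: nn.GRU(hidden, hidden, batch_first=True) for t in tasks}
        )
        self.heads = nn.ModuleDict(
            {t: nn.Linear(hidden, vocab_size) for t in tasks}
        )

    def forward(self, src, tgt, task):
        _, h = self.encoder(self.embed(src))            # shared encoding
        out, _ = self.decoders[task](self.embed(tgt), h)
        return self.heads[task](out)                    # per-task logits

model = MultiTaskSeq2Seq(vocab_size=100, hidden=64,
                         tasks=["normalize", "autoencode", "g2p", "lemmatize"])
src = torch.randint(0, 100, (8, 12))  # toy character-ID batches
tgt = torch.randint(0, 100, (8, 12))
logits = model(src, tgt, task="normalize")  # shape (8, 12, 100)
```

Sharing the encoder across the target task and the auxiliary tasks is what
lets scarce normalization data benefit from the related supervision.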
Open Set Chinese Character Recognition using Multi-typed Attributes
Recognition of off-line Chinese characters is still a challenging problem,
especially in historical documents: not only is the number of classes
extremely large in comparison to contemporary image retrieval methods, but
new, unseen classes can also be expected under open learning conditions (even
for CNNs).
Chinese character recognition with zero or a few training samples is a
difficult problem that has not yet been studied. In this paper, we propose a
new Chinese character recognition method based on multi-typed attributes,
derived from the pronunciation, structure, and radicals of Chinese
characters, applied to
character recognition in historical books. This intermediate attribute code has
a strong advantage over the common `one-hot' class representation because it
allows for understanding complex and unseen patterns symbolically using
attributes. First, each character is represented by four groups of attribute
types to cover a wide range of character possibilities: its Pinyin label, its
layout structure, its number of strokes, and its input codes from three
different input methods (Cangjie, Zhengma, and Wubi) as well as a four-corner
encoding method. A convolutional
neural network (CNN) is trained to learn these attributes. Subsequently,
characters can be easily recognized by these attributes using a distance metric
and a complete lexicon that is encoded in attribute space. We evaluate the
proposed method on two open data sets (printed Chinese characters for
zero-shot learning and historical characters for few-shot learning) and one
closed data set (handwritten Chinese characters). Experimental results show a good
general classification of seen classes but also a very promising generalization
ability to unseen characters.
Comment: 29 pages, submitted to Pattern Recognition
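As a rough illustration of the recognition step described above, the sketch
below matches a predicted attribute vector against a lexicon encoded in
attribute space using a simple distance metric. The attribute values and the
encoding itself are invented for illustration; the paper's actual attribute
codes differ.

```python
import numpy as np

# Hypothetical attribute codes: each lexicon character gets a fixed vector
# built from its Pinyin, layout structure, stroke count, and input-method
# codes (illustrative encoding, not the paper's).
lexicon_chars = ["你", "好", "书"]
lexicon_codes = np.array([
    [3, 1, 7, 2, 5],
    [3, 2, 6, 1, 4],
    [1, 0, 4, 3, 2],
], dtype=float)

def recognize(predicted_attributes: np.ndarray) -> str:
    """Return the lexicon character whose attribute code is nearest
    (Euclidean distance) to the CNN's predicted attribute vector."""
    dists = np.linalg.norm(lexicon_codes - predicted_attributes, axis=1)
    return lexicon_chars[int(np.argmin(dists))]

# A trained CNN would output something like this for an input image:
pred = np.array([3.1, 0.9, 6.8, 2.2, 4.9])
print(recognize(pred))  # -> "你"
```

Because the lexicon can contain characters never seen during training,
nearest-neighbour lookup in attribute space is what enables the zero-shot
behaviour the abstract reports.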
A Comprehensive Overview of Large Language Models
Large Language Models (LLMs) have shown excellent generalization capabilities
that have led to the development of numerous models. These models introduce
new architectures or tweak existing ones, refine training strategies,
increase context length, use higher-quality training data, and increase
training time to outperform baselines. Analyzing new
developments is crucial for identifying changes that enhance training stability
and improve generalization in LLMs. This survey paper comprehensively analyses
LLM architectures and their categorization, training strategies, training
datasets, and performance evaluations, and it discusses future research directions.
Moreover, the paper also discusses the basic building blocks and concepts
behind LLMs, followed by a complete overview of LLMs, including their important
features and functions. Finally, the paper summarizes significant findings from
LLM research and consolidates essential architectural and training strategies
for developing advanced LLMs. Given the continuous advancements in LLMs, we
intend to regularly update this paper by incorporating new sections and
featuring the latest models.
Multilingual Event Extraction from Historical Newspaper Adverts
NLP methods can aid historians in analyzing textual materials in greater volumes than manually feasible. Developing such methods poses substantial challenges, though. First, acquiring large, annotated historical datasets is difficult, as only domain experts can reliably label them. Second, most available off-the-shelf NLP models are trained on modern language texts, rendering them significantly less effective when applied to historical corpora. This is particularly problematic for less well-studied tasks, and for languages other than English. This paper addresses these challenges while focusing on the under-explored task of event extraction from a novel domain of historical texts. We introduce a new multilingual dataset in English, French, and Dutch composed of newspaper ads from the early modern colonial period reporting on enslaved people who liberated themselves from enslavement. We find that: 1) even with scarce annotated data, it is possible to achieve surprisingly good results by formulating the problem as an extractive QA task and leveraging existing datasets and models for modern languages; and 2) cross-lingual low-resource learning for historical languages is highly challenging, and machine translation of the historical datasets to the considered target languages is, in practice, often the best-performing solution.
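A hedged sketch of the extractive-QA formulation the authors found effective:
each event attribute is phrased as a natural-language question over the
advert text and answered with an off-the-shelf QA model. The model name and
the example advert are illustrative assumptions, not the paper's setup.

```python
from transformers import pipeline

# Extractive-QA framing of event extraction (sketch). The model name is an
# assumption chosen for illustration, not the one used in the paper.
qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

advert = ("Run away from the subscriber, a negro man named Jack, "
          "about 30 years of age, on the 4th of July last.")

# Each event attribute becomes a question; the answer span is the extraction.
for question in ["Who ran away?", "When did the person run away?"]:
    answer = qa(question=question, context=advert)
    print(question, "->", answer["answer"])
```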
Research on CPI Prediction Based on Natural Language Processing
In the past, the seed keywords for CPI prediction were often selected based
on empirical summaries of prior research and literature studies, an approach
prone to omitting relevant variables and selecting invalid ones. In this
paper, we design a keyword expansion technique for CPI prediction based on
the cutting-edge NLP model PANGU, and we improve CPI prediction using the
corresponding web search indices. Compared with NLP models that rely on
unsupervised pre-training followed by supervised downstream fine-tuning, such
as BERT and NEZHA, the PANGU model can be used to obtain more reliable CPI
keywords through its excellent zero-shot learning capability, without being
limited by a downstream fine-tuning data set. Finally, this paper empirically
tests the predictive ability of the keywords obtained by this expansion
method against historical CPI data.
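The sketch below illustrates only the downstream prediction step, under
stated assumptions: a list of expanded keywords (hard-coded here in place of
PANGU's zero-shot output), synthetic web-search-index series, and a plain
linear regression onto historical CPI. All data and names are placeholders.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Suppose a zero-shot LLM (PANGU in the paper) has expanded a seed list into
# candidate CPI keywords; here they are hard-coded placeholders.
keywords = ["pork price", "rent", "gasoline"]

# Toy monthly web-search-index series, one column per keyword
# (in practice these would be fetched from a search-index service).
rng = np.random.default_rng(0)
search_index = rng.normal(100, 10, size=(24, len(keywords)))

# Toy historical CPI series to fit against (synthetic).
cpi = (2.0 + search_index @ np.array([0.01, 0.02, 0.005])
       + rng.normal(0, 0.1, size=24))

# Fit on all but the last month, then predict the held-out month.
model = LinearRegression().fit(search_index[:-1], cpi[:-1])
print("held-out prediction:", model.predict(search_index[-1:])[0])
print("actual:", cpi[-1])
```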
Towards Realistic Unsupervised Fine-tuning with CLIP
The emergence of vision-language models (VLMs), such as CLIP, has spurred a
significant research effort towards their application for downstream supervised
learning tasks. Although some previous studies have explored the unsupervised
fine-tuning of CLIP, they often rely on prior knowledge in the form of class
names associated with ground truth labels. In this paper, we delve into a
realistic unsupervised fine-tuning scenario by assuming that the unlabeled data
might contain out-of-distribution samples from unknown classes. Furthermore, we
emphasize the importance of simultaneously enhancing out-of-distribution
detection capabilities alongside the recognition of instances associated with
predefined class labels.
To tackle this problem, we present a simple, efficient, and effective
fine-tuning approach called Universal Entropy Optimization (UEO). UEO leverages
sample-level confidence to approximately minimize the conditional entropy of
confident instances and maximize the marginal entropy of less confident
instances. Apart from optimizing the textual prompts, UEO also incorporates
optimization of channel-wise affine transformations within the visual branch of
CLIP. Through extensive experiments conducted across 15 domains and 4 different
types of prior knowledge, we demonstrate that UEO surpasses baseline methods in
terms of both generalization and out-of-distribution detection.
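A minimal sketch in the spirit of the objective the abstract describes, not
the paper's exact formulation: sample-level confidence weights a
conditional-entropy term to be minimized and a marginal-entropy term to be
maximized. The weighting scheme here is an assumption.

```python
import torch
import torch.nn.functional as F

def ueo_style_loss(logits: torch.Tensor) -> torch.Tensor:
    """Entropy objective sketch: confident samples are pushed toward low
    conditional entropy, unconfident ones toward a uniform marginal."""
    probs = F.softmax(logits, dim=-1)                                # (N, C)
    per_sample_ent = -(probs * probs.clamp_min(1e-8).log()).sum(-1)  # (N,)

    # Sample-level confidence weights (an assumed scheme): low-entropy
    # samples dominate the conditional term, high-entropy ones the marginal.
    conf = torch.softmax(-per_sample_ent, dim=0)    # sums to 1 over batch
    unconf = torch.softmax(per_sample_ent, dim=0)

    cond_entropy = (conf * per_sample_ent).sum()            # minimize this
    marginal = (unconf.unsqueeze(1) * probs).sum(0)         # weighted mean
    marg_entropy = -(marginal * marginal.clamp_min(1e-8).log()).sum()

    return cond_entropy - marg_entropy  # maximizing marginal entropy

logits = torch.randn(16, 10, requires_grad=True)
loss = ueo_style_loss(logits)
loss.backward()
```

In the paper, only the textual prompts and the channel-wise affine parameters
of CLIP's visual branch would receive these gradients; here a raw logit
tensor stands in for the model.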