Adaptive text mining: Inferring structure from sequences
Text mining is about inferring structure from sequences representing natural language text, and may be defined as the process of analyzing text to extract information that is useful for particular purposes. Although hand-crafted heuristics are a common practical approach for extracting information from text, a general, and generalizable, approach requires adaptive techniques. This paper studies the way in which the adaptive techniques used in text compression can be applied to text mining. It develops several examples: extraction of hierarchical phrase structures from text, identification of keyphrases in documents, locating proper names and quantities of interest in a piece of text, text categorization, word segmentation, acronym extraction, and structure recognition. We conclude that compression forms a sound unifying principle that allows many text mining problems to be tackled adaptively.
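One of the examples above, text categorization, can be made concrete with the normalized compression distance (NCD), a standard compression-based similarity measure. The sketch below is illustrative rather than the paper's own method; it uses `zlib` as the compressor and nearest-neighbour assignment over labelled reference texts.

```python
import zlib

def c(s: str) -> int:
    """Length in bytes of the zlib-compressed string (level 9)."""
    return len(zlib.compress(s.encode("utf-8"), 9))

def ncd(x: str, y: str) -> float:
    """Normalized compression distance: small when x and y share structure,
    because compressing the concatenation exploits the shared redundancy."""
    cx, cy, cxy = c(x), c(y), c(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

def categorize(doc: str, labelled: dict) -> str:
    """Assign doc the label of the closest reference text under NCD."""
    return min(labelled, key=lambda label: ncd(doc, labelled[label]))
```

The same distance needs no feature engineering or language-specific tuning, which is what makes compression attractive as a unifying, adaptive principle.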
Translating Phrases in Neural Machine Translation
Phrases play an important role in natural language understanding and machine
translation (Sag et al., 2002; Villavicencio et al., 2005). However, it is
difficult to integrate them into current neural machine translation (NMT) which
reads and generates sentences word by word. In this work, we propose a method
to translate phrases in NMT by integrating a phrase memory storing target
phrases from a phrase-based statistical machine translation (SMT) system into
the encoder-decoder architecture of NMT. At each decoding step, the phrase
memory is first re-written by the SMT model, which dynamically generates
relevant target phrases with contextual information provided by the NMT model.
Then the proposed model reads the phrase memory to make probability estimations
for all phrases in the phrase memory. If phrase generation is carried on, the
NMT decoder selects an appropriate phrase from the memory to perform phrase
translation and updates its decoding state by consuming the words in the
selected phrase. Otherwise, the NMT decoder generates a word from the
vocabulary as the general NMT decoder does. Experiment results on the Chinese
to English translation show that the proposed model achieves significant
improvements over the baseline on various test sets.
Comment: Accepted by EMNLP 201
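The decoding step described above, where the decoder either emits one word or consumes a whole phrase from the memory, can be sketched as follows. This is a simplified illustration, not the paper's model: the scores stand in for the NMT softmax and the SMT-derived phrase probabilities.

```python
def decode_step(word_scores, phrase_scores):
    """One decoding step: pick the highest-scoring word or phrase.

    word_scores:   {word: score}, a stand-in for the NMT vocabulary softmax
    phrase_scores: {phrase_tuple: score} over the SMT phrase memory
    Returns the list of tokens emitted at this step.
    """
    best_word = max(word_scores, key=word_scores.get)
    best_phrase = max(phrase_scores, key=phrase_scores.get)
    if phrase_scores[best_phrase] > word_scores[best_word]:
        # Phrase generation: consume every word of the selected phrase,
        # so the decoding state advances past the whole phrase at once.
        return list(best_phrase)
    # Otherwise generate a single word, as the general NMT decoder does.
    return [best_word]
```

In the paper the phrase memory is also dynamically re-written by the SMT model at each step using the NMT decoder's context, which this sketch omits.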
A Corpus-based Approach to the Chinese Word Segmentation
For a society based upon laws and reason, it has become too easy for us to believe
that we live in a world without them. And given that our linguistic wisdom was
originally motivated by the search for rules, it seems strange that we now consider
these rules to be the exceptions and take the exceptions as the norm.
The current task of contemporary computational linguistics is to describe these
exceptions. In particular, it suffices, for most language processing needs, to
describe the argument and predicate within an elementary sentence under the
framework of local grammar. A corpus-based approach to the Chinese Word
Segmentation problem is therefore proposed, as the first step towards a local
grammar for the Chinese language.
The two main issues with existing lexicon-based approaches are (a) the classification
of unknown character sequences, i.e. sequences that are not listed in
the lexicon, and (b) the disambiguation of situations where two candidate words
overlap.
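Issue (b) is easy to see in a toy forward-maximum-matching segmenter, a standard lexicon-based baseline (not the thesis's algorithm), which greedily takes the longest lexicon word at each position:

```python
def fmm_segment(text, lexicon, max_len=4):
    """Forward maximum matching: at each position, take the longest lexicon
    word starting there, falling back to a single character otherwise."""
    out, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            word = text[i:i + length]
            if length == 1 or word in lexicon:
                out.append(word)
                i += length
                break
    return out
```

On the classic example 北京大学生, the greedy match 北京大学 ("Peking University") blocks the overlapping candidate 大学生 ("university student"), yielding 北京大学/生 instead of the intended 北京/大学生. Resolving exactly this kind of overlap is what the multi-level processing framework below is designed for.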
For (a), we propose an automatic method of enriching the lexicon by comparing
candidate sequences to occurrences of the same strings in a manually segmented
reference corpus, and using methods of machine learning to select the optimal
segmentation for them. These methods are developed in the course of the thesis
specifically for this task. The possibility of applying these machine learning
methods in the NP-extraction and alignment domains will also be discussed.
(b) is approached by designing a general processing framework for Chinese text,
which will be called multi-level processing. Under this framework, sentences are
recursively split into fragments according to language-specific but
domain-independent heuristics. The resulting fragments then define the ultimate
boundaries between candidate words and therefore resolve any segmentation
ambiguity caused by overlapping sequences. A new shallow semantic annotation is
also proposed under the framework of multi-level processing.
A word segmentation algorithm based on these principles has been implemented
and tested; results of the evaluation are given and compared to the performance of
previous approaches as reported in the literature.
The first chapter of this thesis discusses the goals of segmentation and
introduces some background concepts. The second chapter analyses the current
state-of-the-art approach to Chinese language segmentation. Chapter 3 proposes a
new corpus-based approach to the identification of unknown words. In Chapter 4,
a new shallow semantic annotation is proposed under the framework of multi-level
processing.
How to Fine-Tune BERT for Text Classification?
Language model pre-training has proven to be useful in learning universal
language representations. As a state-of-the-art pre-trained language model, BERT
(Bidirectional Encoder Representations from Transformers) has achieved
impressive results in many language understanding tasks. In this paper, we
conduct exhaustive experiments to investigate different fine-tuning methods of
BERT on the text classification task and provide a general solution for BERT
fine-tuning. Finally, the proposed solution obtains new state-of-the-art results
on eight widely-studied text classification datasets.
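One ingredient of the general fine-tuning solution this paper studies is a layer-wise decreasing learning rate: lower Transformer layers, which encode more general features, get smaller updates than the top layers. The sketch below computes such per-layer rates; the function name and default values are illustrative, not taken from the paper.

```python
def layerwise_lrs(base_lr, num_layers, decay):
    """Per-layer learning rates for fine-tuning a stack of Transformer
    layers: the top layer gets base_lr, and each layer below it gets
    the rate of the layer above multiplied by decay (0 < decay <= 1).

    Layer 0 is the bottom (embedding-side) layer; num_layers - 1 is the top.
    """
    return [base_lr * decay ** (num_layers - 1 - k) for k in range(num_layers)]
```

These rates would typically be attached to the optimizer as one parameter group per layer, so that general-purpose lower layers drift less from their pre-trained weights.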