Natural language processing
Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems (text summarization, information extraction, information retrieval, etc.), including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the WWW and digital libraries; and (iv) evaluation of NLP systems.
Mostly-Unsupervised Statistical Segmentation of Japanese Kanji Sequences
Given the lack of word delimiters in written Japanese, word segmentation is
generally considered a crucial first step in processing Japanese texts. Typical
Japanese segmentation algorithms rely either on a lexicon and syntactic
analysis or on pre-segmented data; but these are labor-intensive, and the
lexico-syntactic techniques are vulnerable to the unknown word problem. In
contrast, we introduce a novel, more robust statistical method utilizing
unsegmented training data. Despite its simplicity, the algorithm yields
performance on long kanji sequences comparable to and sometimes surpassing that
of state-of-the-art morphological analyzers over a variety of error metrics.
The algorithm also outperforms another mostly-unsupervised statistical
algorithm previously proposed for Chinese.
Additionally, we present a two-level annotation scheme for Japanese to
incorporate multiple segmentation granularities, and introduce two novel
evaluation metrics, both based on the notion of a compatible bracket, that can
account for multiple granularities simultaneously.
Comment: 22 pages. To appear in Natural Language Engineering.
Japanese/English Cross-Language Information Retrieval: Exploration of Query Translation and Transliteration
Cross-language information retrieval (CLIR), where queries and documents are
in different languages, has of late become one of the major topics within the
information retrieval community. This paper proposes a Japanese/English CLIR
system that combines query translation and retrieval modules. We
currently target the retrieval of technical documents, and therefore the
performance of our system is highly dependent on the quality of the translation
of technical terms. However, technical term translation remains
problematic because technical terms are often compound words, so new
terms are continually created by combining existing base words. In addition,
Japanese often writes loanwords using its special phonograms.
Consequently, existing dictionaries struggle to achieve sufficient
coverage. To counter the first problem, we produce a Japanese/English
dictionary for base words, and translate compound words on a word-by-word
basis. We also use a probabilistic method to resolve translation ambiguity. For
the second problem, we use a transliteration method, which maps words
not listed in the base word dictionary to their phonetic equivalents in the
target language. We evaluate our system using a test collection for CLIR, and
show that both the compound word translation and transliteration methods
improve system performance.
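The word-by-word compound translation with probabilistic disambiguation described in this abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the romanized entries in `BASE_DICT`, the values in `BIGRAM_PROB`, and the choice of a target-language bigram product as the scoring function. The paper's actual probabilistic model may differ.

```python
from itertools import product

# Hypothetical base-word dictionary: each (romanized) source base word
# maps to its candidate translations in the target language.
BASE_DICT = {
    "joho": ["information", "intelligence"],
    "kensaku": ["retrieval", "search"],
}

# Hypothetical target-language bigram probabilities used to pick the most
# plausible combination of candidates. The values are made up.
BIGRAM_PROB = {
    ("information", "retrieval"): 0.6,
    ("information", "search"): 0.2,
    ("intelligence", "retrieval"): 0.05,
    ("intelligence", "search"): 0.15,
}


def translate_compound(base_words, floor=1e-6):
    """Translate a compound word by word and keep the best-scoring combination.

    Words missing from BASE_DICT are passed through unchanged; in the system
    described above, such words would go to the transliteration step instead.
    """
    candidates = [BASE_DICT.get(word, [word]) for word in base_words]
    best, best_score = None, -1.0
    for combo in product(*candidates):
        score = 1.0
        # Multiply the probabilities of adjacent target-word pairs;
        # unseen pairs receive a small floor value.
        for pair in zip(combo, combo[1:]):
            score *= BIGRAM_PROB.get(pair, floor)
        if score > best_score:
            best, best_score = combo, score
    return list(best), best_score


translation, score = translate_compound(["joho", "kensaku"])
```

Enumerating every combination is exponential in the number of base words, which is tolerable for short compounds; a longer sequence would call for dynamic programming over the same bigram scores.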
Applying Machine Translation to Two-Stage Cross-Language Information Retrieval
Cross-language information retrieval (CLIR), where queries and documents are
in different languages, needs a translation of queries and/or documents, so as
to standardize both of them into a common representation. For this purpose, the
use of machine translation is an effective approach. However, computational
cost is prohibitive in translating large-scale document collections. To resolve
this problem, we propose a two-stage CLIR method. First, we translate a given
query into the document language, and retrieve a limited number of foreign
documents. Second, we machine translate only those documents into the user
language, and re-rank them based on the translation result. We also show the
effectiveness of our method by way of experiments using Japanese queries and
English technical documents.
Comment: 13 pages, 1 Postscript figure.
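The two-stage method in this abstract can be sketched as follows, under stated assumptions. The word-by-word `make_translator` is only a toy stand-in for a machine translation system, `overlap` is a toy relevance score, and the romanized vocabulary and documents are hypothetical. The point of the sketch is the control flow: only the `k` documents retrieved in the first stage are ever translated.

```python
def make_translator(table):
    """Build a toy word-by-word translator (a stand-in for a real MT system)."""
    def translate(text):
        return " ".join(table.get(word, word) for word in text.split())
    return translate


def overlap(query, text):
    """Toy relevance score: the number of distinct terms shared by both texts."""
    return len(set(query.split()) & set(text.split()))


def two_stage_clir(query, docs, translate_query, translate_doc, score, k=2):
    """Return the top-k documents, re-ranked in the user's language."""
    # Stage 1: translate the query into the document language and
    # retrieve a limited number (k) of foreign-language documents.
    query_in_doc_lang = translate_query(query)
    first_pass = sorted(docs, key=lambda d: score(query_in_doc_lang, d),
                        reverse=True)[:k]
    # Stage 2: translate only those k documents into the user's language
    # and re-rank them against the original query.
    translated = [(doc, translate_doc(doc)) for doc in first_pass]
    reranked = sorted(translated, key=lambda pair: score(query, pair[1]),
                      reverse=True)
    return [doc for doc, _ in reranked]


# Hypothetical toy vocabulary and collection (romanized Japanese, English).
JA_TO_EN = {"kikai": "machine", "honyaku": "translation"}
EN_TO_JA = {"machine": "kikai", "translation": "honyaku", "learning": "gakushu"}
DOCS = [
    "cooking recipes",
    "machine learning for retrieval",
    "machine translation of technical documents",
]

ranked = two_stage_clir("kikai honyaku", DOCS,
                        make_translator(JA_TO_EN), make_translator(EN_TO_JA),
                        overlap, k=2)
```

With this design the expensive document translation step runs `k` times per query regardless of collection size, which is the cost argument the abstract makes.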
Cross-Lingual Adaptation using Structural Correspondence Learning
Cross-lingual adaptation, a special case of domain adaptation, refers to the
transfer of classification knowledge between two languages. In this article we
describe an extension of Structural Correspondence Learning (SCL), a recently
proposed algorithm for domain adaptation, for cross-lingual adaptation. The
proposed method uses unlabeled documents from both languages, along with a word
translation oracle, to induce cross-lingual feature correspondences. From these
correspondences a cross-lingual representation is created that enables the
transfer of classification knowledge from the source to the target language.
The main advantages of this approach over other approaches are its resource
efficiency and task specificity.
We conduct experiments in the area of cross-language topic and sentiment
classification involving English as source language and German, French, and
Japanese as target languages. The results show a significant improvement of the
proposed method over a machine translation baseline, reducing the relative
error due to cross-lingual adaptation by an average of 30% (topic
classification) and 59% (sentiment classification). We further report on
empirical analyses that reveal insights into the use of unlabeled data, the
sensitivity with respect to important hyperparameters, and the nature of the
induced cross-lingual correspondences.
A Factoid Question Answering System for Vietnamese
In this paper, we describe the development of an end-to-end factoid question
answering system for the Vietnamese language. This system combines both
statistical models and ontology-based methods in a chain of processing modules
to provide high-quality mappings from natural language text to entities. We
present the challenges in the development of such an intelligent user interface
for an isolating language like Vietnamese and show that techniques developed
for inflectional languages cannot be applied "as is". Our question answering
system can answer a wide range of general knowledge questions with promising
accuracy on a test set.
Comment: In the proceedings of the HQA'18 workshop, The Web Conference
Companion, Lyon, France.