Chinese text chunking using lexicalized HMMs
This paper presents a lexicalized HMM-based approach to Chinese text chunking. To tackle the problem of unknown words, we formalize Chinese text chunking as a tagging task on a sequence of known words. To do this, we employ uniformly lexicalized HMMs and develop a lattice-based tagger to assign each known word a proper hybrid tag, which involves four types of information: word boundary, POS, chunk boundary and chunk type. In comparison with most previous approaches, our approach is able to integrate different features such as part-of-speech information, chunk-internal cues and contextual information for text chunking under the framework of HMMs. As a result, the performance of the system can be improved without losing its efficiency in training and tagging. Our preliminary experiments on the PolyU Shallow Treebank show that the lexicalization technique can substantially improve the performance of an HMM-based chunking system. © 2005 IEEE.
TCtract-A Collocation Extraction Approach for Noun Phrases Using Shallow Parsing Rules and Statistic Models
PACLIC 20 / Wuhan, China / 1-3 November, 200
MATREX: DCU machine translation system for IWSLT 2006
In this paper, we give a description of the machine translation system developed at DCU that was used for our first participation in the evaluation campaign of the International Workshop on Spoken Language Translation (2006). This system combines two types of approaches. First, we use an EBMT approach to collect aligned chunks based on two steps: deterministic chunking of both sides and chunk alignment. We use several chunking and alignment strategies. We also extract SMT-style aligned phrases, and the two types of resources are combined.
We participated in the Open Data Track for the following translation directions: Arabic-English and Italian-English, for which we translated both the single-best ASR hypotheses and the text input. We report the results of the system for the provided evaluation sets.
Syntactic Nuclei in Dependency Parsing -- A Multilingual Exploration
Standard models for syntactic dependency parsing take words to be the elementary units that enter into dependency relations. In this paper, we investigate whether there are any benefits from enriching these models with the more abstract notion of nucleus proposed by Tesnière. We do this by showing how the concept of nucleus can be defined in the framework of Universal Dependencies and how we can use composition functions to make a transition-based dependency parser aware of this concept. Experiments on 12 languages show that nucleus composition gives small but significant improvements in parsing accuracy. Further analysis reveals that the improvement mainly concerns a small number of dependency relations, including nominal modifiers, relations of coordination, main predicates, and direct objects.
Comment: Accepted at EACL-202
A hybrid extraction model for Chinese noun/verb synonym bi-gram
2011-2012 > Academic research: refereed > Refereed conference paper > Version of Record
An Arabic CCG approach for determining constituent types from Arabic Treebank
Converting a treebank into a CCGbank opens the respective language to the sophisticated tools developed for Combinatory Categorial Grammar (CCG) and enriches cross-linguistic development. The conversion is primarily a three-step process: determining constituents’ types, binarization, and category conversion. Usually, this process involves a preprocessing step on the treebank of choice to correct brackets and normalize tags for any changes that were introduced during the manual annotation, as well as to extract the morpho-syntactic information that is necessary for determining constituents’ types. In this article, we describe the required preprocessing step on the Arabic Treebank, as well as how to determine Arabic constituents’ types. We conducted an experiment on parts 1 and 2 of the Penn Arabic Treebank (PATB) aimed at converting the PATB into an Arabic CCGbank. When applied to ATB1v2.0 & ATB2v2.0, our algorithm achieved 99% identification of head nodes and 100% coverage of the Treebank data.