A Pattern Matching method for finding Noun and Proper Noun Translations from Noisy Parallel Corpora
We present a pattern matching method for compiling a bilingual lexicon of
nouns and proper nouns from unaligned, noisy parallel texts of
Asian/Indo-European language pairs. Tagging information of one language is
used. Word frequency and position information for high and low frequency words
are represented in two different vector forms for pattern matching. New anchor
point finding and noise elimination techniques are introduced. We obtained a
precision of 73.1%. We also show how the results can be used in the compilation
of domain-specific noun phrases.
Comment: 8 pages, uuencoded compressed PostScript file. To appear in the
Proceedings of the 33rd AC
GO-WORDS: An Entropic Approach to Semantic Decomposition of Gene Ontology Terms
The Gene Ontology (GO) has a large and growing number of terms that constitute its vocabulary. An entropy-based approach is presented to automate the characterization of the compositional semantics of GO terms. The motivation is to extend the machine-readability of GO and to offer insights for the continued maintenance and growth of GO. A prototype implementation illustrates the benefits of the approach.
A Modular and Flexible Architecture for an Integrated Corpus Query System
The paper describes the architecture of an integrated and extensible corpus
query system developed at the University of Stuttgart and gives examples of
some of the modules realized within this architecture. The modules form the
core of a corpus workbench. Within the proposed architecture, information
required for the evaluation of queries may be derived from different knowledge
sources (the corpus text, databases, on-line thesauri) and by different means:
either through direct lookup in a database or by calling external tools which
may infer the necessary information at the time of query evaluation. The
information available and the method of information access can be stated
declaratively and individually for each corpus, leading to a flexible,
extensible and modular corpus workbench.
Comment: 10 pages, uuencoded gzip'ped PostScript; presented at COMPLEX'9