Comparing a statistical and a rule-based tagger for German
In this paper we present the results of comparing a statistical tagger for
German based on decision trees and a rule-based Brill-Tagger for German. We
used the same training corpus (and therefore the same tag-set) to train both
taggers. We then applied the taggers to the same test corpus and compared their
respective behavior and in particular their error rates. Both taggers perform
similarly with an error rate of around 5%. From the detailed error analysis it
can be seen that the rule-based tagger has more problems with unknown words
than the statistical tagger, whereas the opposite holds for tokens that are
ambiguous in many ways. If the unknown words are fed into the taggers with the
help of an external lexicon (such as the Gertwol system), the error rate of the
rule-based tagger drops to 4.7%, and the respective rate of the statistical
tagger drops to around 3.7%. Combining the taggers by using the output of one
tagger to help the other did not lead to any further improvement.
Comment: 8 pages
Experiences with the GTU grammar development environment
In this paper we describe our experiences with a tool for the development and
testing of natural language grammars called GTU (German:
Grammatik-Testumgebung; grammar test environment). GTU supports four grammar
formalisms under a window-oriented user interface. Additionally, it contains a
set of German test sentences covering various syntactic phenomena as well as
three types of German lexicons that can be attached to a grammar via an
integrated lexicon interface. What follows is a description of the experiences
we gained when we used GTU as a tutoring tool for students and as an
experimental tool for CL researchers. From these we will derive the features
necessary for a future grammar workbench.
Comment: 7 pages, uses aclap.st
Combining semantic and syntactic generalization in example-based machine translation
In this paper, we report our experiments in combining two EBMT systems that rely on generalized templates, Marclator and CMU-EBMT, on an English–German translation task. Our goal was to see whether a statistically significant improvement could be achieved over the individual performances of these two systems. We observed that this was not the case. However, our combined system consistently outperformed a lexical EBMT baseline system.
Bootstrapping parallel treebanks
This paper argues for the development of parallel treebanks. It summarizes the work done in this area and reports on experiments for building a Swedish-German treebank. It also describes our approach to reusing resources from one language while annotating another language.
Linguistische und semantische Annotation eines Zeitungskorpus (Linguistic and semantic annotation of a newspaper corpus)
This article describes the procedure for automatically and incrementally enriching a raw text corpus with linguistic and semantic information. It shows how recognizing proper names helps to improve part-of-speech tagging and partial syntactic analyses. An evaluation on approximately 1000 sentences reveals the strengths and weaknesses of the various recognizers.
MT-based sentence alignment for OCR-generated parallel texts
The performance of current sentence alignment tools varies according to the texts to be aligned. We have found existing tools unsuitable for hard-to-align parallel texts and describe an alternative alignment algorithm. The basic idea is to use machine translations of a text and BLEU as a similarity score to find reliable alignments, which are used as anchor points. The gaps between these anchor points are then filled using BLEU-based and length-based heuristics. We show that this approach outperforms state-of-the-art algorithms in our alignment task, and that this improvement in alignment quality translates into better SMT performance. Furthermore, we show that even length-based alignment algorithms profit from having a machine translation as a point of comparison.
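The anchor-point idea in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `ngram_overlap` is a simplified stand-in for BLEU (mean clipped n-gram precision, no brevity penalty), the greedy monotone search in `find_anchors` and the threshold of 0.6 are assumptions made for illustration, and the gap-filling heuristics are left out.

```python
from collections import Counter


def ngram_overlap(candidate, reference, max_n=2):
    """Simplified BLEU-like score: mean clipped n-gram precision.

    Stand-in for real BLEU (no brevity penalty, no smoothing).
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        total = sum(cand_ngrams.values())
        if total == 0:
            precisions.append(0.0)
            continue
        matched = sum(min(count, ref_ngrams[gram]) for gram, count in cand_ngrams.items())
        precisions.append(matched / total)
    return sum(precisions) / len(precisions)


def find_anchors(translated_source, target, threshold=0.6):
    """Greedily pick monotone (source_index, target_index) anchor pairs.

    translated_source: machine translations of the source sentences.
    target: sentences of the target-language text.
    Only pairs scoring above the (assumed) threshold become anchors.
    """
    anchors = []
    next_target = 0
    for i, mt_sentence in enumerate(translated_source):
        best_j, best_score = None, threshold
        for j in range(next_target, len(target)):
            score = ngram_overlap(mt_sentence, target[j])
            if score > best_score:
                best_j, best_score = j, score
        if best_j is not None:
            anchors.append((i, best_j))
            next_target = best_j + 1  # keep the alignment monotone
    return anchors


if __name__ == "__main__":
    mt = ["the house is small", "completely garbled ocr text", "we walk to the lake"]
    tgt = ["the house is small", "it stands by the road", "we walk to the lake today"]
    print(find_anchors(mt, tgt))
```

In this toy input the garbled middle sentence yields no anchor, so it would fall into a gap between the two anchors, which is where the abstract's BLEU-based and length-based heuristics would take over.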
DiTo-Datenbank : Datendokumentation zu Funktionsverbgefügen und Relativsätzen (DiTo database: data documentation on support verb constructions and relative clauses)
This paper describes the DiTo data on support verb constructions (Funktionsverbgefüge) and relative clauses. DiTo is a test tool developed at DFKI for error diagnosis of the syntax component of natural language systems. The tool aims to represent, as far as possible, all essential phenomena of German syntax by means of test data, and can thus systematically support error diagnosis in test runs of natural language systems. So far, the data catalogue covers the areas of verb government, sentence coordination, support verb constructions, and relative clauses. We are collaborating with other groups that are working on further syntax topics according to the guidelines of our approach. So that selected areas of syntax can be tested separately, the data are organized in a relational database. In the partial documentation for each of the two syntax areas treated here, the phenomena are first described in outline. Then the systematics underlying the data collection is explained. Finally, it is shown how the data are organized in the relational database.
In-Plane Focusing of Terahertz Surface Waves on a Gradient Index Metamaterial Film
We designed and implemented a gradient index metasurface for the in-plane
focusing of confined terahertz surface waves. We measured the spatial
propagation of the surface waves by two-dimensional mapping of the complex
electric field using a terahertz near-field spectroscope. The surface waves
were focused to a diameter of 500 µm after a focal length of approximately 2
mm. In the focus, we measured a field amplitude enhancement of a factor of 3.
Comment: 6 pages, 4 figures
Binomials in Swedish corpora – ‘Ordpar 1965’ revisited
This paper describes a corpus study on Swedish binomials, a special type of multi-word expression. Binomials are of the type "X conjunction Y", where X and Y are words, typically of the same part of speech. Bendz (1965) investigated the various use cases and functions of such binomials and included a list of more than 1000 candidates in his appendix. We were curious to what extent these binomials can still be found in modern corpora. We therefore checked this list against the Swedish Europarl and OpenSubtitles corpora. We found that many of the binomials are still in use today, even in these diverse text genres. The relative frequency of binomials in Europarl is much higher than in OpenSubtitles.
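Checking a candidate list against a corpus, as described in this abstract, could look roughly like the sketch below. The function names, the conjunction set ("och", "eller"), and the per-million normalization are assumptions for illustration; the abstract does not say how the authors tokenized the corpora or computed relative frequency.

```python
import re
from collections import Counter


def count_binomials(text, candidates, conjunctions=("och", "eller")):
    """Count each candidate pair (X, Y) occurring as 'X conjunction Y'.

    Tokenization is a plain word regex; punctuation is ignored, so a
    match may span a sentence boundary in this simplified version.
    """
    tokens = re.findall(r"\w+", text.lower())
    wanted = set(candidates)
    counts = Counter()
    for x, conj, y in zip(tokens, tokens[1:], tokens[2:]):
        if conj in conjunctions and (x, y) in wanted:
            counts[(x, y)] += 1
    return counts


def per_million(count, corpus_size):
    """Relative frequency as occurrences per million tokens."""
    return count * 1_000_000 / corpus_size


if __name__ == "__main__":
    sample = "Det gäller liv och död. Förr eller senare, liv och död igen."
    pairs = [("liv", "död"), ("förr", "senare"), ("kropp", "själ")]
    print(count_binomials(sample, pairs))
```

Normalizing by corpus size is what makes counts from corpora of different sizes, such as Europarl and OpenSubtitles, comparable.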