1,841 research outputs found

Comparing a statistical and a rule-based tagger for German

Author: Schneider Gerold
Volk Martin
Publication date: 01/01/1998

In this paper we present the results of comparing a statistical tagger for German based on decision trees and a rule-based Brill-Tagger for German. We used the same training corpus (and therefore the same tag-set) to train both taggers. We then applied the taggers to the same test corpus and compared their respective behavior and in particular their error rates. Both taggers perform similarly with an error rate of around 5%. The detailed error analysis shows that the rule-based tagger has more problems with unknown words than the statistical tagger, whereas the opposite holds for tokens that are many-ways ambiguous. If the unknown words are fed into the taggers with the help of an external lexicon (such as the Gertwol system), the error rate of the rule-based tagger drops to 4.7%, and the respective rate of the statistical tagger drops to around 3.7%. Combining the taggers by using the output of one tagger to help the other did not lead to any further improvement.

Comment: 8 pages

arXiv.org e-Print Archive

CiteSeerX

ZORA

Experiences with the GTU grammar development environment

Author: Richarz Dirk
Volk Martin
Publication date: 12/07/1997

In this paper we describe our experiences with a tool for the development and testing of natural language grammars called GTU (German: Grammatik-Testumgebung; grammar test environment). GTU supports four grammar formalisms under a window-oriented user interface. Additionally, it contains a set of German test sentences covering various syntactic phenomena as well as three types of German lexicons that can be attached to a grammar via an integrated lexicon interface. What follows is a description of the experiences we gained when we used GTU as a tutoring tool for students and as an experimental tool for CL researchers. From these we will derive the features necessary for a future grammar workbench.

Comment: 7 pages, uses aclap.st

arXiv.org e-Print Archive

ZORA

The Automatic Translation of Film Subtitles. A Machine Translation Success Story?

Author: Volk Martin
Publication venue: German Society for Computational Linguistics and Language Technology (GSCL)
Publication date: 01/07/2009

Journal for Language Technology and Computational Linguistics (JLCL)

Combining semantic and syntactic generalization in example-based machine translation

Author: Ebling Sarah
Kumar Naskar Sudip
Volk Martin
Way Andy
Publication venue: European Association for Machine Translation
Publication date: 30/05/2011

In this paper, we report our experiments in combining two EBMT systems that rely on generalized templates, Marclator and CMU-EBMT, on an English–German translation task. Our goal was to see whether a statistically significant improvement could be achieved over the individual performances of these two systems. We observed that this was not the case. However, our system consistently outperformed a lexical EBMT baseline system.

CiteSeerX

Irish Universities

DCU Online Research Access Service

Bootstrapping parallel treebanks

Author: Samuelsson Y
Volk Martin
Publication date: 01/01/2004

This paper argues for the development of parallel treebanks. It summarizes the work done in this area and reports on experiments for building a Swedish-German treebank. It also describes our approach for reusing resources from one language while annotating another language.

CiteSeerX

ZORA

Linguistische und semantische Annotation eines Zeitungskorpus

Author: Clematide S
Volk Martin
Publication date: 30/03/2001

This article describes the procedure for automatically and incrementally enriching a raw text corpus with linguistic and semantic information. It shows how the recognition of proper names helps to improve part-of-speech tagging and partial syntactic analysis. An evaluation over approximately 1000 sentences reveals the strengths and weaknesses of the various recognizers.

ZORA

MT-based sentence alignment for OCR-generated parallel texts

Author: Sennrich R
Volk Martin
Publication date: 04/11/2010

The performance of current sentence alignment tools varies according to the texts to be aligned. We have found existing tools unsuitable for hard-to-align parallel texts and describe an alternative alignment algorithm. The basic idea is to use machine translations of a text and BLEU as a similarity score to find reliable alignments, which are used as anchor points. The gaps between these anchor points are then filled using BLEU-based and length-based heuristics. We show that this approach outperforms state-of-the-art algorithms in our alignment task, and that this improvement in alignment quality translates into better SMT performance. Furthermore, we show that even length-based alignment algorithms profit from having a machine translation as a point of comparison.

ZORA

DiTo-Datenbank : Datendokumentation zu Funktionsverbgefügen und Relativsätzen

Author: Krenn Brigitte
Volk Martin
Publication venue: Sonstige Einrichtungen. DFKI Deutsches Forschungszentrum für Künstliche Intelligenz
Publication date: 01/01/1993

This paper describes the DiTo data on support verb constructions (Funktionsverbgefüge) and relative clauses. DiTo is a test tool developed at DFKI for error diagnosis of the syntax component of natural language systems. The tool, which aims to represent as many of the essential phenomena of German syntax as possible by means of test data, systematically supports error diagnosis during test runs of natural language systems. So far the data catalogue covers verb government, sentence coordination, support verb constructions, and relative clauses. We collaborate with other groups that are working out further syntactic topics according to the guidelines of our approach. So that selected areas of syntax can be tested separately, the data are organized in a relational database. In the partial documentation for each of the two syntactic areas treated here, the phenomena are first outlined briefly. The systematics underlying the data collection are then explained. Finally, it is shown how the data are organized in the relational database.

Scientific publications of the Saarland University

Universaar

In-Plane Focusing of Terahertz Surface Waves on a Gradient Index Metamaterial Film

Author: Beigang René
Neu Jens
Rahm Marco
Reinhard Benjamin
Volk Martin F.
Publication venue: 'The Optical Society'
Publication date: 01/01/2013

We designed and implemented a gradient index metasurface for the in-plane focusing of confined terahertz surface waves. We measured the spatial propagation of the surface waves by two-dimensional mapping of the complex electric field using a terahertz near-field spectroscope. The surface waves were focused to a diameter of 500 µm after a focal length of approx. 2 mm. In the focus, we measured a field amplitude enhancement of a factor of 3.

Comment: 6 pages, 4 figures

arXiv.org e-Print Archive

Fraunhofer-ePrints

Binomials in Swedish corpora – ‘Ordpar 1965’ revisited

Author: Graën Johannes
Volk Martin
Publication venue: Department of Swedish, Multilingualism and Language Technology, University of Gothenburg
Publication date: 18/11/2022

This paper describes a corpus study on Swedish binomials, a special type of multi-word expression. Binomials are of the type "X conjunction Y" where X and Y are words, typically of the same part-of-speech. Bendz (1965) investigated the various use cases and functions of such binomials and included a list of more than 1000 candidates in his appendix. We were curious to what extent these binomials can still be found in modern corpora. We therefore checked this list against the Swedish Europarl and OpenSubtitles corpora. We found that many of the binomials are still in use today even in these diverse text genres. The relative frequency of binomials in Europarl is much higher than in OpenSubtitles.

ZORA