
    Comparing a statistical and a rule-based tagger for German

    In this paper we present the results of comparing a statistical tagger for German based on decision trees with a rule-based Brill tagger for German. We used the same training corpus (and therefore the same tag-set) to train both taggers. We then applied the taggers to the same test corpus and compared their respective behavior, in particular their error rates. Both taggers perform similarly, with an error rate of around 5%. The detailed error analysis shows that the rule-based tagger has more problems with unknown words than the statistical tagger, while the reverse holds for tokens that are ambiguous in many ways. If the unknown words are supplied to the taggers via an external lexicon (such as the Gertwol system), the error rate of the rule-based tagger drops to 4.7%, and that of the statistical tagger drops to around 3.7%. Combining the taggers by using the output of one tagger to help the other did not lead to any further improvement.
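The comparison above rests on a simple per-token error rate over a shared gold-tagged test corpus. A minimal sketch of that evaluation, with illustrative toy data standing in for the paper's actual taggers and corpus:

```python
# Hedged sketch: compare two taggers' error rates on the same gold-tagged
# test corpus, as in the paper's setup. The tag sequences below are
# illustrative stand-ins, not the paper's actual data.

def error_rate(predicted, gold):
    """Fraction of tokens whose predicted tag differs from the gold tag."""
    assert len(predicted) == len(gold)
    errors = sum(1 for p, g in zip(predicted, gold) if p != g)
    return errors / len(gold)

gold        = ["ART", "NN", "VVFIN", "ART", "NN"]
statistical = ["ART", "NN", "VVFIN", "ART", "NE"]   # one error
rule_based  = ["ART", "NE", "VVFIN", "ART", "NE"]   # two errors

print(error_rate(statistical, gold))  # 0.2
print(error_rate(rule_based, gold))   # 0.4
```

Because both taggers are scored against the identical test corpus, the per-token rates are directly comparable, which is what makes the 5% vs. 5% and 3.7% vs. 4.7% figures meaningful.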

    Experiences with the GTU grammar development environment

    In this paper we describe our experiences with a tool for the development and testing of natural language grammars called GTU (German: Grammatik-Testumgebung; grammar test environment). GTU supports four grammar formalisms under a window-oriented user interface. Additionally, it contains a set of German test sentences covering various syntactic phenomena, as well as three types of German lexicons that can be attached to a grammar via an integrated lexicon interface. What follows is a description of the experiences we gained when we used GTU as a tutoring tool for students and as an experimental tool for CL researchers. From these we derive the features necessary for a future grammar workbench.

    Combining semantic and syntactic generalization in example-based machine translation

    In this paper, we report our experiments in combining two EBMT systems that rely on generalized templates, Marclator and CMU-EBMT, on an English–German translation task. Our goal was to see whether a statistically significant improvement could be achieved over the individual performances of these two systems. We observed that this was not the case. However, our combined system consistently outperformed a lexical EBMT baseline system.

    Bootstrapping parallel treebanks

    This paper argues for the development of parallel treebanks. It summarizes the work done in this area, reports on experiments for building a Swedish-German treebank, and describes our approach for reusing resources from one language while annotating another language.

    Linguistische und semantische Annotation eines Zeitungskorpus

    This article describes the procedure for the automatic, incremental enrichment of a raw text corpus with linguistic and semantic information. We show how named-entity recognition helps to improve part-of-speech tagging and partial syntactic analyses. An evaluation over about 1000 sentences highlights the strengths and weaknesses of the various recognizers.

    MT-based sentence alignment for OCR-generated parallel texts

    The performance of current sentence alignment tools varies according to the texts to be aligned. We have found existing tools unsuitable for hard-to-align parallel texts and describe an alternative alignment algorithm. The basic idea is to use machine translations of a text and BLEU as a similarity score to find reliable alignments, which are used as anchor points. The gaps between these anchor points are then filled using BLEU-based and length-based heuristics. We show that this approach outperforms state-of-the-art algorithms in our alignment task, and that this improvement in alignment quality translates into better SMT performance. Furthermore, we show that even length-based alignment algorithms profit from having a machine translation as a point of comparison.
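The anchor-point idea can be sketched as follows: machine-translate the source sentences, score every candidate pair with a similarity metric, and keep high-confidence monotone matches as anchors. The paper uses BLEU; plain token overlap stands in for it here, and the function names and threshold are illustrative choices, not the paper's implementation:

```python
# Hedged sketch of anchor-point finding for sentence alignment.
# Token overlap is a crude stand-in for the BLEU score the paper uses.

def overlap(a, b):
    """Token-overlap (Jaccard) similarity in [0, 1]."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

def find_anchors(mt_sents, target_sents, threshold=0.5):
    """Return monotone (i, j) index pairs whose similarity exceeds
    the threshold; these serve as reliable anchor points."""
    anchors, last_j = [], -1
    for i, mt in enumerate(mt_sents):
        best_j, best_s = None, threshold
        # Monotonicity: only look past the previous anchor.
        for j in range(last_j + 1, len(target_sents)):
            s = overlap(mt, target_sents[j])
            if s > best_s:
                best_j, best_s = j, s
        if best_j is not None:
            anchors.append((i, best_j))
            last_j = best_j
    return anchors
```

The gaps between consecutive anchors would then be filled with the BLEU-based and length-based heuristics mentioned above, which this sketch omits.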

    DiTo-Datenbank : Datendokumentation zu Funktionsverbgefügen und Relativsätzen

    This report describes the DiTo data on support-verb constructions (Funktionsverbgefüge) and relative clauses. DiTo is a test tool developed at DFKI for diagnosing errors in the syntax components of natural-language systems. The tool, which aims to represent as many essential phenomena of German syntax as possible through test data, systematically supports error diagnosis in test runs of natural-language systems. So far, the data catalogue covers verb government, sentence coordination, support-verb constructions, and relative clauses. We collaborate with other groups that are developing further syntactic topics according to the guidelines of our approach. So that selected areas of syntax can be tested separately, the data are organized in a relational database. The documentation for the two areas of syntax treated here first sketches the phenomena, then explains the systematics underlying the data collection, and finally shows how the data are organized in the relational database.

    In-Plane Focusing of Terahertz Surface Waves on a Gradient Index Metamaterial Film

    We designed and implemented a gradient index metasurface for the in-plane focusing of confined terahertz surface waves. We measured the spatial propagation of the surface waves by two-dimensional mapping of the complex electric field using a terahertz near-field spectroscope. The surface waves were focused to a diameter of 500 µm at a focal length of approx. 2 mm. In the focus, we measured a field amplitude enhancement by a factor of 3.

    Binomials in Swedish corpora – ‘Ordpar 1965’ revisited

    This paper describes a corpus study on Swedish binomials, a special type of multi-word expression. Binomials are of the type "X conjunction Y", where X and Y are words, typically of the same part of speech. Bendz (1965) investigated the various use cases and functions of such binomials and included a list of more than 1000 candidates in his appendix. We were curious to what extent these binomials can still be found in modern corpora. We therefore checked this list against the Swedish Europarl and OpenSubtitles corpora. We found that many of the binomials are still in use today, even in these diverse text genres. The relative frequency of binomials in Europarl is much higher than in OpenSubtitles.
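The check described above amounts to counting each candidate string in a tokenized corpus and normalizing by corpus size, so that frequencies in Europarl and OpenSubtitles become comparable. A minimal sketch with toy data (the candidate list and corpus lines are illustrative, not Bendz's list or the actual corpora):

```python
# Hedged sketch: per-million frequency of binomial candidates in a corpus,
# enabling the Europarl-vs-OpenSubtitles comparison mentioned in the paper.

def binomial_frequencies(candidates, corpus_lines):
    """Map each multi-word candidate to its per-million-token frequency."""
    text = " ".join(line.lower() for line in corpus_lines)
    n_tokens = len(text.split())
    return {cand: text.count(cand.lower()) * 1_000_000 / max(n_tokens, 1)
            for cand in candidates}
```

Relative (per-million) rather than raw frequency is the key design choice here, since the two corpora differ greatly in size.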

    Improving Specificity in Review Response Generation with Data-Driven Data Filtering

    Responding to online customer reviews has become an essential part of successfully managing and growing a business, both in e-commerce and in the hospitality and tourism sectors. Recently, neural text generation methods intended to assist authors in composing responses have been shown to deliver highly fluent and natural-looking texts. However, they also tend to learn a strong, undesirable bias towards generating overly generic, one-size-fits-all outputs to a wide range of inputs. While this often results in 'safe', high-probability responses, there are many practical settings in which greater specificity is preferable. In this work we examine the task of generating more specific responses for online reviews in the hospitality domain by identifying generic responses in the training data, filtering them out, and fine-tuning the generation model. We experiment with a range of data-driven filtering methods and show through automatic and human evaluation that, despite a 60% reduction in the amount of training data, filtering helps to derive models that are capable of generating more specific, useful responses.
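One simple data-driven proxy for "generic" is a response that is near-duplicated across many different reviews. A minimal filtering sketch under that assumption (the threshold, normalization, and function names are illustrative choices, not the paper's actual filtering methods):

```python
# Hedged sketch of data-driven filtering: drop training pairs whose
# response text recurs across too large a share of the training set,
# a simple proxy for "overly generic, one-size-fits-all" responses.

from collections import Counter

def filter_generic(pairs, max_share=0.01):
    """Keep (review, response) pairs whose normalized response text
    accounts for at most max_share of all responses."""
    norm = lambda s: " ".join(s.lower().split())
    counts = Counter(norm(resp) for _, resp in pairs)
    total = len(pairs)
    return [(rev, resp) for rev, resp in pairs
            if counts[norm(resp)] / total <= max_share]
```

A generation model fine-tuned on the filtered pairs would then see fewer high-frequency boilerplate responses, which is the mechanism the paper's specificity gains rely on.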