Search CORE

990 research outputs found

A Support Tool for Tagset Mapping

Author: Teufel Simone
Publication venue
Publication date: 01/01/1995
Field of study

Many different tagsets are used in existing corpora; these tagsets vary according to the objectives of specific projects (which may be as far apart as robust parsing vs. spelling correction). In many situations, however, one would like to have uniform access to the linguistic information encoded in corpus annotations without having to know the classification schemes in detail. This paper describes a tool which maps unstructured morphosyntactic tags to a constraint-based, typed, configurable specification language, a ``standard tagset''. The mapping relies on a manually written set of mapping rules, which is automatically checked for consistency. In certain cases, unsharp mappings are unavoidable, and noise, i.e. groups of word forms {\sl not} conforming to the specification, will appear in the output of the mapping. The system automatically detects such noise and informs the user about it. The tool has been tested with rules for the UPenn tagset \cite{up} and the SUSANNE tagset \cite{garside}, in the framework of the EAGLES\footnote{LRE project EAGLES, cf. \cite{eagles}.} validation phase for standardised tagsets for European languages.Comment: EACL-Sigdat 95, contains 4 ps figures (minor graphic changes

arXiv.org e-Print Archive

CiteSeerX

Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers

Author: Daelemans Walter
Zavrel Jakub
Publication venue
Publication date: 01/01/2000
Field of study

This paper describes a new method, Combi-bootstrap, to exploit existing taggers and lexical resources for the annotation of corpora with new tagsets. Combi-bootstrap uses existing resources as features for a second level machine learning module, that is trained to make the mapping to the new tagset on a very small sample of annotated corpus material. Experiments show that Combi-bootstrap: i) can integrate a wide variety of existing resources, and ii) achieves much higher accuracy (up to 44.7 % error reduction) than both the best single tagger and an ensemble tagger constructed out of the same small training sample.Comment: 4 page

arXiv.org e-Print Archive

CiteSeerX

Institutional Repository Universiteit Antwerpen

Tilburg University Repository

Mapping and Displaying Structural Transformations between XML and PDF

Author: Brailsford David F.
Hardy Matthew R. B.
Publication venue: ACM Press
Publication date: 01/01/2002
Field of study

Documents are often marked up in XML-based tagsets to delineate major structural components such as headings, paragraphs, figure captions and so on, without much regard to their eventual displayed appearance. And yet these same abstract documents, after many transformations and 'typesetting' processes, often emerge in the popular format of Adobe PDF, either for dissemination or archiving. Until recently PDF has been a totally display-based document representation, relying on the underlying PostScript semantics of PDF. Early versions of PDF had no mechanism for retaining any form of abstract document structure but recent releases have now introduced an internal structure tree to create the so called 'Tagged PDF'. This paper describes the development of a plugin for Adobe Acrobat which creates a two-window display. In one window is shown an XML document original and in the other its Tagged PDF counterpart is seen, with an internal structure tree that, in some sense, matches the one seen in XML. If a component is highlighted in either window then the corresponding structured item, with any attendant text, is also highlighted in the other window. Important applications of correctly Tagged PDF include making PDF documents reflow intelligently on small screen devices and enabling them to be read out in correct reading order, via speech synthesiser software, for the visually impaired. By tracing structure transformation from source document to destination one can implement the repair of damaged PDF structure or the adaptation of an existing structure tree to an incrementally updated document

Ensemble Morphosyntactic Analyser for Classical Arabic

Author: Alosaimy AMS
Atwell E
Publication venue
Publication date
Field of study

In Modern Standard Arabic text (MSA), there are at least seven available morphological analysers (MA). Several Part-of-Speech (POS) taggers use these MAs to improve accuracy. However, the choice between these analysers is challenging, and there is none designed for Classical Arabic. Several morphological analysers have been studied and combined to be evaluated on a common ground. The goal of our language resource is to build a freely accessible multi-component toolkit (named SAWAREF1) for part-of-speech tagging and morphological analysers that can provide a comparative evaluation, standardise the outputs of each component, combine different solutions, and analyse and vote for the best candidates. We illustrate the use of SAWAREF in tagging adjectives and shows how accuracy of tagging adjectives is still very low. This paper describes the research method and design, and discusses the key issues and obstacles

White Rose Research Online

LeTs Preprocess: The multilingual LT3 linguistic preprocessing toolkit

Author: Bart Desmet
Els Lefever
Geert Coorman
Lieve Macken
Marjan Van De Kauter
Véronique Hoste
Publication venue
Publication date: 01/01/2013
Field of study

This paper presents the LeTs Preprocess Toolkit, a suite of robust high-performance preprocessing modules including Part-of-Speech Taggers, Lemmatizers and Named Entity Recognizers. The currently supported languages are Dutch, English, French and German. We give a detailed description of the architecture of the LeTs Preprocess pipeline and describe the data and methods used to train each component. Ten-fold cross-validation results are also presented. To assess the performance of each module on different domains, we collected real-world textual data from companies covering various domains (a.o. automotive, dredging and human resources) for all four supported languages. For this multi-domain corpus, a manually verified gold standard was created for each of the three preprocessing steps. We present the performance of our preprocessing components on this corpus and compare it to the performance of other existing tools. 1

CiteSeerX

Ghent University Academic Bibliography