A Support Tool for Tagset Mapping
Many different tagsets are used in existing corpora; these tagsets vary
according to the objectives of specific projects (which may be as far apart as
robust parsing vs. spelling correction). In many situations, however, one would
like to have uniform access to the linguistic information encoded in corpus
annotations without having to know the classification schemes in detail. This
paper describes a tool which maps unstructured morphosyntactic tags to a
constraint-based, typed, configurable specification language, a "standard
tagset". The mapping relies on a manually written set of mapping rules, which
is automatically checked for consistency. In certain cases, unsharp mappings
are unavoidable, and noise, i.e. groups of word forms not conforming to
the specification, will appear in the output of the mapping. The system
automatically detects such noise and informs the user about it. The tool has
been tested with rules for the UPenn tagset \cite{up} and the SUSANNE tagset
\cite{garside}, in the framework of the EAGLES\footnote{LRE project EAGLES, cf.
\cite{eagles}.} validation phase for standardised tagsets for European
languages. Comment: EACL-Sigdat 95, contains 4 ps figures (minor graphic changes
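The idea of rule-based tagset mapping with an automatic consistency check can be sketched as follows. This is a minimal illustration, not the tool described in the abstract: the specification `SPEC`, the rule table `RULES`, and the functions `check_rules` and `map_tags` are all hypothetical, and "noise" is simplified here to mean tokens whose source tag has no mapping rule.

```python
# Hypothetical target specification: attribute -> set of allowed values.
SPEC = {
    "pos": {"noun", "verb", "adj"},
    "number": {"sg", "pl"},
    "tense": {"past", "pres"},
}

# Hypothetical hand-written mapping rules: source tag -> attribute/value pairs.
RULES = {
    "NN": {"pos": "noun", "number": "sg"},
    "NNS": {"pos": "noun", "number": "pl"},
    "VBD": {"pos": "verb", "tense": "past"},
}


def check_rules(rules, spec):
    """Return (tag, attribute, value) triples that the specification does not allow."""
    errors = []
    for tag, features in rules.items():
        for attr, value in features.items():
            if attr not in spec or value not in spec[attr]:
                errors.append((tag, attr, value))
    return errors


def map_tags(tagged_tokens, rules):
    """Map (word, tag) pairs to feature dicts; collect unmapped tokens as noise."""
    mapped, noise = [], []
    for word, tag in tagged_tokens:
        if tag in rules:
            mapped.append((word, rules[tag]))
        else:
            noise.append((word, tag))
    return mapped, noise
```

Running `check_rules` before `map_tags` mirrors the order suggested by the abstract: the rule set is validated first, and only then applied to corpus tokens, with non-conforming material reported separately.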
Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers
This paper describes a new method, Combi-bootstrap, to exploit existing
taggers and lexical resources for the annotation of corpora with new tagsets.
Combi-bootstrap uses existing resources as features for a second level machine
learning module, that is trained to make the mapping to the new tagset on a
very small sample of annotated corpus material. Experiments show that
Combi-bootstrap: i) can integrate a wide variety of existing resources, and ii)
achieves much higher accuracy (up to 44.7 % error reduction) than both the best
single tagger and an ensemble tagger constructed out of the same small training
sample. Comment: 4 page
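The second-level idea, using the outputs of existing taggers as features for a learner that predicts the new tagset, can be sketched in a few lines. This is an assumption-laden toy, not the Combi-bootstrap system itself: the learner is a plain lookup table with a most-frequent-tag fallback, and the names `train_combiner` and `predict` as well as the example tags are hypothetical.

```python
from collections import Counter, defaultdict


def train_combiner(samples):
    """Train a lookup-table combiner.

    samples: list of (first_level_tags, gold_tag) pairs, where
    first_level_tags is a tuple of the tags proposed by existing taggers
    and gold_tag is the manually assigned tag in the new tagset.
    """
    table = defaultdict(Counter)
    for features, gold in samples:
        table[features][gold] += 1
    # Fallback for unseen feature tuples: most frequent new-tagset tag overall.
    default = Counter(gold for _, gold in samples).most_common(1)[0][0]
    model = {feats: counts.most_common(1)[0][0] for feats, counts in table.items()}
    return model, default


def predict(trained, features):
    """Predict a new-tagset tag for a tuple of first-level tagger outputs."""
    model, default = trained
    return model.get(features, default)


# Tiny hypothetical training sample: two existing taggers, one new tagset.
samples = [
    (("NN", "noun"), "N"),
    (("NN", "noun"), "N"),
    (("VB", "verb"), "V"),
    (("NN", "verb"), "N"),
]
trained = train_combiner(samples)
```

A real second-level learner would generalise over partially matching feature tuples rather than falling back to a single default tag; the table is used here only to keep the sketch self-contained.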
Mapping and Displaying Structural Transformations between XML and PDF
Documents are often marked up in XML-based tagsets to delineate major structural components such as headings, paragraphs, figure captions and so on, without much regard to their eventual displayed appearance. And yet these same abstract documents, after many transformations and 'typesetting' processes, often emerge in the popular format of Adobe PDF, either for dissemination or archiving.
Until recently, PDF has been a totally display-based document representation, relying on the underlying PostScript semantics of PDF. Early versions of PDF had no mechanism for retaining any form of abstract document structure, but recent releases have introduced an internal structure tree to create the so-called 'Tagged PDF'.
This paper describes the development of a plugin for Adobe Acrobat which creates a two-window display. In one window is shown an XML document original and in the other its Tagged PDF counterpart is seen, with an internal structure tree that, in some sense, matches the one seen in XML. If a component is highlighted in either window then the corresponding structured item, with any attendant text, is also highlighted in the other window.
Important applications of correctly Tagged PDF include making PDF documents reflow intelligently on small-screen devices and enabling them to be read out in correct reading order, via speech synthesiser software, for the visually impaired. By tracing structure transformation from source document to destination one can implement the repair of damaged PDF structure or the adaptation of an existing structure tree to an incrementally updated document.
Ensemble Morphosyntactic Analyser for Classical Arabic
For Modern Standard Arabic (MSA) text, there are at least seven available morphological analysers (MAs). Several Part-of-Speech (POS) taggers use these MAs to improve accuracy. However, choosing between these analysers is challenging, and none of them is designed for Classical Arabic. Several morphological analysers have been studied and combined so that they can be evaluated on common ground. The goal of our language resource is to build a freely accessible multi-component toolkit (named SAWAREF) for part-of-speech tagging and morphological analysis that can provide a comparative evaluation, standardise the outputs of each component, combine different solutions, and analyse and vote for the best candidates. We illustrate the use of SAWAREF in tagging adjectives and show that the accuracy of tagging adjectives is still very low. This paper describes the research method and design, and discusses the key issues and obstacles.
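The "standardise, then vote" step can be illustrated with a simple majority vote over the outputs of several analysers. This is only a sketch under stated assumptions: the abstract does not specify the voting scheme, and the functions `standardise` and `vote` and the example tag names are hypothetical.

```python
from collections import Counter


def standardise(tag, mapping):
    """Translate an analyser-specific tag to a shared label (unknown tags pass through)."""
    return mapping.get(tag, tag)


def vote(candidates):
    """Return the most frequent candidate and the share of analysers that proposed it."""
    counts = Counter(candidates)
    best, freq = counts.most_common(1)[0]
    return best, freq / len(candidates)


# Hypothetical outputs of three analysers for one word, each in its own tagset.
raw = ["adj", "ADJECTIVE", "N"]
mapping = {"adj": "ADJ", "ADJECTIVE": "ADJ", "N": "NOUN"}
winner, agreement = vote([standardise(t, mapping) for t in raw])
```

The agreement share gives a rough indication of how contested a decision is, which is one way low-accuracy categories such as adjectives could be flagged for closer inspection.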
LeTs Preprocess: The multilingual LT3 linguistic preprocessing toolkit
This paper presents the LeTs Preprocess Toolkit, a suite of robust high-performance preprocessing modules including Part-of-Speech Taggers, Lemmatizers and Named Entity Recognizers. The currently supported languages are Dutch, English, French and German. We give a detailed description of the architecture of the LeTs Preprocess pipeline and describe the data and methods used to train each component. Ten-fold cross-validation results are also presented. To assess the performance of each module on different domains, we collected real-world textual data from companies covering various domains (among others automotive, dredging and human resources) for all four supported languages. For this multi-domain corpus, a manually verified gold standard was created for each of the three preprocessing steps. We present the performance of our preprocessing components on this corpus and compare it to the performance of other existing tools.