Open Challenges in Treebanking: Some Thoughts Based on the Copenhagen Dependency Treebanks
Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora (AEPC 2010).
Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk.
NEALT Proceedings Series, Vol. 10 (2010), 1-13.
© 2010 The editors and contributors.
Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt.
Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/15893.
The DTAG treebank tool. Annotating and querying treebanks and
DTAG is a versatile annotation tool that supports manual and semi-automatic annotation of a wide range of linguistic phenomena, including the annotation of syntax, discourse, coreference, morphology, and word alignments. It includes commands for editing general labeled graphs and graph alignments, comparing annotations, managing annotation tasks, and interfacing with a revision control system. Its visualization component can display graphs and alignments for entire texts in a compact format, with a highly flexible and configurable formatting scheme. It also provides a powerful search-replace mechanism with queries based on full first-order logic, which can be used to search for linguistic constructions and automatically apply graph transformations to collections of annotated graphs.
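The search-replace idea described in the abstract — evaluate a logical predicate over a labeled graph, then transform the matching nodes — can be illustrated with a minimal sketch. This is NOT DTAG's actual query language (which is not shown here); the graph encoding, the predicate, and the relabeling are all illustrative.

```python
# Minimal sketch of logic-based graph search and transformation.
# Illustrative only; this is not DTAG's query syntax.

# A labeled dependency graph: node id -> (word, list of (label, head) edges).
graph = {
    1: ("the",   [("det",  2)]),
    2: ("dog",   [("subj", 3)]),
    3: ("barks", []),
}

def matches(node_id, graph):
    """First-order-style query: does this node have a 'subj' edge to a
    head that itself has no outgoing edges (i.e. the root)?"""
    word, edges = graph[node_id]
    return any(label == "subj" and not graph[head][1]
               for label, head in edges)

# Search: collect all nodes satisfying the predicate.
hits = [n for n in graph if matches(n, graph)]

# Replace: relabel 'subj' as 'nsubj' on every matching node.
for n in hits:
    word, edges = graph[n]
    graph[n] = (word, [("nsubj" if lab == "subj" else lab, h)
                       for lab, h in edges])

print(hits)       # [2]
print(graph[2])   # ('dog', [('nsubj', 3)])
```

The same pattern (predicate search followed by an automatic rewrite) scales to collections of graphs by looping the search-replace over each annotated graph in turn.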
A white paper
In this white paper, we review the theoretical evidence about the computational efficiency of dependency parsing and machine translation without the widely used, but linguistically questionable, assumptions about projectivity and edge-factoring. On the basis of the heuristic local optimality parser proposed by Buch-Kromann (2006), we propose a common architecture for monolingual parsing, parallel parsing, and translation that does not make these assumptions. Finally, we describe the elementary repair operations in the model, and argue that the model is potentially interesting as a model of human translation.
Hierarchy-based Partition Models: Using Classification Hierarchies to
We propose a novel machine learning technique that can be used to estimate probability distributions for categorical random variables that are equipped with a natural set of classification hierarchies, such as words equipped with word class hierarchies, wordnet hierarchies, and suffix and affix hierarchies. We evaluate the estimator on bigram language modelling with a hierarchy based on word suffixes, using English, Danish, and Finnish data from the Europarl corpus with training sets of up to 1–1.5 million words. The results show that the proposed estimator outperforms modified Kneser-Ney smoothing in terms of perplexity on unseen data. This suggests that important information is hidden in the classification hierarchies that we routinely use in computational linguistics, but that we are unable to utilize this information fully because our current statistical techniques are either based on simple counting models or designed for sample spaces with a distance metric, rather than sample spaces with a non-metric topology given by a classification hierarchy.
Keywords: machine learning; categorical variables; classification hierarchies; language modelling; statistical estimation.
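The general idea of exploiting a classification hierarchy — back off from the full history word to progressively coarser classes such as suffixes — can be sketched as follows. This is a generic interpolated hierarchy backoff on toy data, not the paper's hierarchy-based partition estimator; the corpus, the two-level suffix hierarchy, and the fixed interpolation weight `lam` are all illustrative assumptions.

```python
from collections import Counter

# Toy corpus; the "hierarchy" for a word is: full word -> last 2 chars -> ''.
corpus = "the cat walks the dog walks the cat sleeps".split()

def classes(word):
    """Classification hierarchy for a history word, most to least specific."""
    return [word, word[-2:], ""]

bigrams = Counter(zip(corpus, corpus[1:]))

# Counts of (history class, next word) at each hierarchy level.
level_counts = [Counter() for _ in range(3)]
level_totals = [Counter() for _ in range(3)]
for (h, w), c in bigrams.items():
    for lvl, cls in enumerate(classes(h)):
        level_counts[lvl][(cls, w)] += c
        level_totals[lvl][cls] += c

def prob(w, h, lam=0.4):
    """Interpolate relative frequencies along the hierarchy: each level
    takes a share lam of the remaining mass, and the coarsest level
    (the empty class, i.e. all histories) absorbs the rest."""
    p, weight = 0.0, 1.0
    levels = list(enumerate(classes(h)))
    for lvl, cls in levels[:-1]:
        total = level_totals[lvl][cls]
        if total:
            p += weight * lam * level_counts[lvl][(cls, w)] / total
            weight *= (1 - lam)
    lvl, cls = levels[-1]
    p += weight * level_counts[lvl][(cls, w)] / level_totals[lvl][cls]
    return p
```

Because each level contributes a fixed share of the remaining mass, the estimate stays a proper distribution over next words while letting unseen bigrams inherit probability from coarser suffix classes.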
Smoothing survival densities in practice
Many nonparametric smoothing procedures consider independent identically distributed stochastic variables. There are also many important nonparametric smoothing applications where the data are more complicated. Survival data or filtered data, defined as following Aalen's multiplicative hazard model, and aggregated versions of this model, are considered. Aalen's model, based on counting process theory, allows multiple left truncations and multiple right censorings to be present in the data. This type of filtering is omnipresent in biostatistical and demographical applications, where people can join a study, leave the study, and perhaps join the study again. The estimation methodology is based on a recent class of local linear density estimators. A new stable bandwidth selector is developed for these estimators. A data application to aggregated national mortality data is provided, where immigrations to and from the country correspond to left truncation and right censoring, respectively. The aggregated mortality data study illustrates that the new practical density estimators provide an important extra element in the visual toolbox for understanding survival data.
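The occurrence/exposure structure behind such filtered-data estimators can be sketched on toy data: with entry (left-truncation) time L, exit time T, and a death indicator, an individual is at risk on (L, T], and deaths are smoothed relative to the at-risk count. This is a plain Ramlau-Hansen-type kernel hazard ratio for illustration, not the local linear density estimator developed in the paper; the data and bandwidth are made up.

```python
# Toy filtered survival data: (entry L, exit T, died) per individual.
# Left truncation: observation starts at L; right censoring: died == 0.
data = [(0.0, 5.0, 1), (1.0, 4.0, 0), (2.0, 6.0, 1), (0.0, 3.0, 1)]

def epanechnikov(u):
    return 0.75 * (1 - u * u) if abs(u) < 1 else 0.0

def at_risk(t):
    """Number at risk just before time t: entered (L < t) and not yet out."""
    return sum(1 for (L, T, d) in data if L < t <= T)

def hazard(t, b=2.0):
    """Kernel-smoothed hazard: each death contributes an occurrence/exposure
    increment d / Y(T), smoothed with bandwidth b around its exit time T."""
    return sum(epanechnikov((t - T) / b) * d / at_risk(T)
               for (L, T, d) in data if at_risk(T) > 0) / b
```

The division by `at_risk(T)` is where the multiplicative (Aalen-type) structure enters: people who have not yet joined, or have already left, simply do not appear in the exposure at that time.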
Double one-sided cross-validation of local linear hazards
This paper brings together the theory and practice of local linear kernel hazard estimation. Bandwidth selection is fully analysed, including Do-validation, which is shown to have good practical and theoretical properties. Insight is provided into the choice of the weighting function in the local linear minimization, and it is pointed out that classical weighting sometimes lacks stability. A new semiparametric hazard estimator that transforms the survival data before smoothing is introduced and shown to have good practical properties.
Bandwidth selection in marker dependent kernel hazard estimation
Practical estimation procedures are developed for the local linear estimation of an unrestricted failure rate when more information is available than just time. This extra information could be a covariate, and this covariate could be a time series. Time-dependent covariates are sometimes called markers, and failure rates are sometimes called hazards, intensities or mortalities. It is shown through simulations and a practical example that the fully local linear estimation procedure exhibits an excellent practical performance. Two different bandwidth selection procedures are developed. One is an adaptation of classical cross-validation, and the other is indirect cross-validation. The simulation study concludes that classical cross-validation works well on continuous data while indirect cross-validation performs only marginally better. However, cross-validation breaks down in the practical data application to old-age mortality. Indirect cross-validation is thus shown to be superior when selecting a fully feasible estimation method for marker-dependent hazard estimation.
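The classical cross-validation that these selectors adapt can be sketched in the simpler density-estimation setting: minimize an estimate of the integrated squared error, built from the squared density estimate and leave-one-out density values. This is the generic least-squares criterion, not the paper's hazard-specific or indirect variants; the toy data, grid, and candidate bandwidths are illustrative.

```python
import math

# Toy sample with two clusters; bandwidth chosen from a small candidate set.
data = [1.0, 1.1, 1.2, 2.8, 3.0, 3.1]

def gauss(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, sample, b):
    """Gaussian kernel density estimate at x with bandwidth b."""
    return sum(gauss((x - xi) / b) for xi in sample) / (len(sample) * b)

def cv_score(b, data, grid):
    """Classical least-squares cross-validation criterion: the integral of
    f_hat^2 (approximated on a grid) minus twice the average leave-one-out
    density at the data points. Smaller is better."""
    step = grid[1] - grid[0]
    integral = sum(kde(x, data, b) ** 2 for x in grid) * step
    loo = sum(kde(data[i], data[:i] + data[i + 1:], b)
              for i in range(len(data)))
    return integral - 2.0 * loo / len(data)

grid = [-2.0 + 0.05 * i for i in range(180)]   # covers the data range
candidates = [0.2, 0.4, 0.8, 1.6]
best = min(candidates, key=lambda b: cv_score(b, data, grid))
```

Indirect variants score a different, smoother pilot problem with the same machinery and then rescale the winning bandwidth, which is what gives them their extra stability.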
Further theoretical and practical insight to the do-validated bandwidth selector
Recent contributions to kernel smoothing show that the performance of cross-validated bandwidth selectors improves significantly from indirectness, and that the recent do-validated method seems to provide the most practical alternative among these methods. In this paper we show step by step how classical cross-validation improves in theory, as well as in practice, from indirectness, and that do-validated estimators improve in theory, but not in practice, from further indirectness. This paper therefore provides strong support for the practical and theoretical properties of do-validated bandwidth selection. Do-validation is currently being introduced to survival analysis in a number of contexts, and this paper provides evidence that this might be the immediate step forward.
Discourse structure and language technology
This publication is, with permission of the rights owner, freely accessible due to an Alliance licence and a national licence (funded by the DFG, German Research Foundation), respectively.
An increasing number of researchers and practitioners in Natural Language Engineering face the prospect of having to work with entire texts, rather than individual sentences. While it is clear that text must have useful structure, its nature may be less clear, making it more difficult to exploit in applications. This survey of work on discourse structure thus provides a primer on the bases on which discourse is structured, along with some of their formal properties. It then lays out the current state of the art with respect to algorithms for recognizing these different structures, and how these algorithms are currently being used in Language Technology applications. After identifying resources that should prove useful in improving algorithm performance across a range of languages, we conclude by speculating on future discourse-structure-enabled technology.
Peer Reviewed