195 research outputs found
Bootstrapping a Tagged Corpus through Combination of Existing Heterogeneous Taggers
This paper describes a new method, Combi-bootstrap, to exploit existing
taggers and lexical resources for the annotation of corpora with new tagsets.
Combi-bootstrap uses existing resources as features for a second level machine
learning module, that is trained to make the mapping to the new tagset on a
very small sample of annotated corpus material. Experiments show that
Combi-bootstrap: i) can integrate a wide variety of existing resources, and ii)
achieves much higher accuracy (up to 44.7 % error reduction) than both the best
single tagger and an ensemble tagger constructed out of the same small training
sample.Comment: 4 page
Towards Affordable Disclosure of Spoken Word Archives
This paper presents and discusses ongoing work aiming at affordable disclosure of real-world spoken word archives in general, and in particular of a collection of recorded interviews with Dutch survivors of World War II concentration camp Buchenwald. Given such collections, the least we want to be able to provide is search at different levels and a flexible way of presenting results. Strategies for automatic annotation based on speech recognition – supporting e.g., within-document search– are outlined and discussed with respect to the Buchenwald interview collection. In addition, usability aspects of the spoken word search are discussed on the basis of our experiences with the online Buchenwald web portal. It is concluded that, although user feedback is generally fairly positive, automatic annotation performance is still far from satisfactory, and requires additional research
Constrained speaker linking
In this paper we study speaker linking (a.k.a.\ partitioning) given
constraints of the distribution of speaker identities over speech recordings.
Specifically, we show that the intractable partitioning problem becomes
tractable when the constraints pre-partition the data in smaller cliques with
non-overlapping speakers. The surprisingly common case where speakers in
telephone conversations are known, but the assignment of channels to identities
is unspecified, is treated in a Bayesian way. We show that for the Dutch CGN
database, where this channel assignment task is at hand, a lightweight speaker
recognition system can quite effectively solve the channel assignment problem,
with 93% of the cliques solved. We further show that the posterior distribution
over channel assignment configurations is well calibrated.Comment: Submitted to Interspeech 2014, some typos fixe
SPOD:Syntactic Profiler of Dutch
SPOD is a tool for Dutch syntax in which a given corpus is analysed according to a large number of predefined syntactic characteristics. SPOD is an extension of the PaQu (”Parse and Query”) tool (Odijk et al. 2017). SPOD is available for a number of standard Dutch corpora and treebanks.In addition, you can upload your own texts which will then be syntactically analysed. SPOD will run a potentially large number of syntactic queries in order to show a variety of corpus properties, such as the number of main and subordinate clauses, types of main and subordinate clauses, and their frequencies, average length of clauses (per clause type: e.g. relative clauses, indirect questions, finite complement clauses, infinitival clauses, finite adverbial clauses, etc.). Other syntactic constructions include comparatives, correlatives, various types of verb clusters, separable verb prefixes, depth of embedding etc.SPOD allows linguists to obtain a quick overview of the syntactic properties of texts, for instance with the goal to find interesting differences between text types, or between authors with different backgrounds or different age. In the paper, we describe the SPOD tool in some more detail, and we provide a case study, illustrating the type of investigations which are enabled andfacilitated by SPOD. Most of the syntactic properties are implemented in SPOD by means of relatively complicated XPath 2.0 queries, and as such SPOD also provides examples of relevant syntactic queries, which may otherwise be relatively hard to define for non-technical linguists
- …