20 research outputs found
A spoken document retrieval application in the oral history domain
The application of automatic speech recognition in the broadcast news domain is well studied. Recognition performance is generally high and, accordingly, spoken document retrieval can be applied successfully in this domain, as demonstrated by a number of commercial systems. In other domains, similar recognition performance is hard to obtain, or even far out of reach, for example due to a lack of suitable training material. This is a serious impediment to the successful application of spoken document retrieval techniques to data other than news. This paper outlines our first steps towards a retrieval system that can automatically be adapted to new domains. We discuss our experience with a recently implemented spoken document retrieval application attached to a web portal that aims at the disclosure of a multimedia data collection in the oral history domain. The paper illustrates that simply deploying an off-the-shelf broadcast news system in this task domain will produce error rates that are too high to be useful for retrieval tasks. By applying adaptation techniques at the acoustic level and the language model level, system performance can be improved considerably, but additional research on unsupervised adaptation and search interfaces is required to create an adequate search environment based on speech transcripts.
Robust audio indexing for Dutch spoken-word collections
Abstract: Whereas the growth of storage capacity is in accordance with widely acknowledged predictions, the ability to index and access the archives created is lagging behind. This is especially the case in the oral history domain, and much of the rich content in these collections runs the risk of remaining inaccessible for lack of robust search technologies. This paper addresses the history and development of robust audio indexing technology for searching Dutch spoken-word collections and compares Dutch audio indexing in the well-studied broadcast news domain with an oral history case study. It is concluded that despite significant advances in Dutch audio indexing technology and demonstrated applicability in several domains, further research is indispensable for successful automatic disclosure of spoken-word collections.
Detecting grammatical errors in machine translation output using dependency parsing and treebank querying
Despite the recent advances in the field of machine translation (MT), MT systems cannot guarantee that the sentences they produce will be fluent and coherent in both syntax and semantics. Detecting and highlighting errors in machine-translated sentences can help post-editors to focus on the erroneous fragments that need to be corrected. This paper presents two methods for detecting grammatical errors in Dutch machine-translated text, using dependency parsing and treebank querying. We test our approach on the output of a statistical and a rule-based MT system for English-Dutch and evaluate the performance at the sentence level and the word level. The results show that our method can be used to detect grammatical errors with high accuracy at the sentence level in both types of MT output.
Enriching a Descriptive Grammar with Treebank Queries
Abstract: The Syntax of Dutch (SoD) is a detailed descriptive grammar of Dutch that provides data for many issues raised in linguistic theory. We present the results of a pilot project that investigated the possibility of enriching the online version of the text with links to queries that provide relevant results from syntactically annotated corpora.
Acoustic Correlates of Prosodic Boundaries in French: A Review of Corpus Data / Correlatos acústicos de fronteiras prosódicas em francês: uma revisão de dados de corpora
Abstract: In this article we investigate the acoustic correlates of prosodic boundaries in French speech. We compare the prosodic structure annotation performed by experts in two multi-genre corpora (Rhapsodie and LOCAS-F). A uniform analysis procedure is applied to both corpora. The results show that the main acoustic correlates of prosodic boundaries are silent pauses and pre-boundary syllable lengthening. Pitch movements contribute to the perception of boundaries but are essentially correlates of boundary function, rather than boundary strength. Two levels of the four-level annotation of boundary strength in the Rhapsodie corpus (periods and packages) correspond to the two levels of strength in the LOCAS-F corpus.
Keywords: prosody; speech segmentation; prosodic boundaries; corpus linguistics; French.
Resumo: Neste artigo investigamos os correlatos acústicos de fronteiras prosódicas da fala em língua francesa. Comparamos a anotação da estrutura prosódica efetuada por anotadores experts em dois corpora multigêneros (Rhapsodie e LOCAS-F). Um procedimento de análise uniforme é aplicado a ambos os corpora. Os resultados indicam que os principais correlatos acústicos de fronteiras prosódicas são pausa silenciosa e alongamento da sílaba pré-fronteira. Movimentos de pitch contribuem para a percepção de fronteiras mas são essencialmente correlatos de funções de fronteira, e não de força de fronteira. Dois dos níveis de anotação dos quatro níveis de anotação de força de fronteira do corpus Rhapsodie (períodos e pacotes) correspondem aos dois níveis de intensidade do corpus LOCAS-F.
Palavras-chave: prosódia; segmentação da fala; fronteiras prosódicas; linguística de corpus; francês.
Enriching a Scientific Grammar with Links to Linguistic Resources: The Taalportaal
Scientific research within the humanities is different from what it was a few decades ago. For instance, new sources of information, such as digital grammars, lexical databases and large corpora of real-language data, offer new opportunities for linguistics. The Taalportaal grammatical database, with its links to other linguistic resources via the CLARIN infrastructure, is a prime example of a new type of tool for linguistic research.
Difference between written and spoken Czech: The case of verbal nouns denoting an action
Abstract
The present paper extends understanding of differences in expressing actions by verbal nouns in corpora of written vs. spoken Czech, namely in the Czech part of the Prague Czech-English Dependency Treebank and in the Prague Dependency Treebank of Spoken Czech.
We show that while the written corpus includes more complex noun phrases with more explicit expression of adnominal participants, noun phrases in the spoken corpus contain more deletions and more exophoric references. We also carried out a quantitative analysis focusing on relative frequencies of combinations of participants modifying verbal nouns; although the written corpus shows higher relative frequencies, the order of the relative frequencies of particular combinations is the same in both types of communication.