20 research outputs found

    A spoken document retrieval application in the oral history domain

    The application of automatic speech recognition in the broadcast news domain is well studied. Recognition performance is generally high and, accordingly, spoken document retrieval can successfully be applied in this domain, as demonstrated by a number of commercial systems. In other domains, a similar recognition performance is hard to obtain, or even far out of reach, for example due to a lack of suitable training material. This is a serious impediment to the successful application of spoken document retrieval techniques to data other than news. This paper outlines our first steps towards a retrieval system that can automatically be adapted to new domains. We discuss our experience with a recently implemented spoken document retrieval application attached to a web portal that aims at the disclosure of a multimedia data collection in the oral history domain. The paper illustrates that simply deploying an off-the-shelf broadcast news system in this task domain will produce error rates that are too high to be useful for retrieval tasks. By applying adaptation techniques at the acoustic level and the language model level, system performance can be improved considerably, but additional research on unsupervised adaptation and search interfaces is required to create an adequate search environment based on speech transcripts.

    Unravelling the voice of Willem Frederik Hermans: an oral history indexing case study

    Robust audio indexing for Dutch spoken-word collections

    Whereas the growth of storage capacity is in accordance with widely acknowledged predictions, the possibilities to index and access the archives created are lagging behind. This is especially the case in the oral history domain, and much of the rich content in these collections runs the risk of remaining inaccessible for lack of robust search technologies. This paper addresses the history and development of robust audio indexing technology for searching Dutch spoken-word collections and compares Dutch audio indexing in the well-studied broadcast news domain with an oral history case study. It is concluded that despite significant advances in Dutch audio indexing technology and demonstrated applicability in several domains, further research is indispensable for successful automatic disclosure of spoken-word collections.

    Detecting grammatical errors in machine translation output using dependency parsing and treebank querying

    Despite the recent advances in the field of machine translation (MT), MT systems cannot guarantee that the sentences they produce will be fluent and coherent in both syntax and semantics. Detecting and highlighting errors in machine-translated sentences can help post-editors to focus on the erroneous fragments that need to be corrected. This paper presents two methods for detecting grammatical errors in Dutch machine-translated text, using dependency parsing and treebank querying. We test our approach on the output of a statistical and a rule-based MT system for English-Dutch and evaluate the performance at sentence level and word level. The results show that our method can be used to detect grammatical errors with high accuracy at sentence level in both types of MT output.

    Enriching a Descriptive Grammar with Treebank Queries

    The Syntax of Dutch (SoD) is a detailed descriptive grammar of Dutch that provides data for many issues raised in linguistic theory. We present the results of a pilot project that investigated the possibility of enriching the online version of the text with links to queries that provide relevant results from syntactically annotated corpora.

    Acoustic Correlates of Prosodic Boundaries in French: A Review of Corpus Data / Correlatos acústicos de fronteiras prosódicas em francês: uma revisão de dados de corpora

    In this article we investigate the acoustic correlates of prosodic boundaries in French speech. We compare the prosodic structure annotation performed by experts in two multi-genre corpora (Rhapsodie and LOCAS-F). A uniform analysis procedure is applied to both corpora. The results show that the main acoustic correlates of prosodic boundaries are silent pauses and pre-boundary syllable lengthening. Pitch movements contribute to the perception of boundaries but are essentially correlates of boundary function, rather than boundary strength. Two levels of the four-level annotation of boundary strength in the Rhapsodie corpus (periods and packages) correspond to the two levels of strength in the LOCAS-F corpus. Keywords: prosody; speech segmentation; prosodic boundaries; corpus linguistics; French.

    Enriching a Scientific Grammar with Links to Linguistic Resources: The Taalportaal

    Scientific research within the humanities is different from what it was a few decades ago. For instance, new sources of information, such as digital grammars, lexical databases and large corpora of real-language data, offer new opportunities for linguistics. The Taalportaal grammatical database, with its links to other linguistic resources via the CLARIN infrastructure, is a prime example of a new type of tool for linguistic research.

    Difference between written and spoken Czech: The case of verbal nouns denoting an action

    The present paper extends understanding of differences in expressing actions by verbal nouns in corpora of written vs. spoken Czech, namely in the Czech part of the Prague Czech-English Dependency Treebank and in the Prague Dependency Treebank of Spoken Czech. We show that while the written corpus includes more complex noun phrases with more explicit expression of adnominal participants, noun phrases in the spoken corpus contain more deletions and more exophoric references. We also carried out a quantitative analysis focusing on relative frequencies of combinations of participants modifying verbal nouns; although the written corpus shows higher relative frequencies, the order of the relative frequencies of particular combinations is the same in both types of communication.