
    Improving the translation environment for professional translators

    When using computer-aided translation systems in a typical professional translation workflow, there are several stages at which there is room for improvement. The SCATE (Smart Computer-Aided Translation Environment) project investigated several of these aspects, both from a human-computer interaction point of view and from a purely technological one. This paper describes the SCATE research with respect to improved fuzzy matching, parallel treebanks, the integration of translation memories with machine translation, quality estimation, terminology extraction from comparable texts, the use of speech recognition in the translation process, and human-computer interaction and interface design for the professional translation environment. For each of these topics, we describe the experiments we performed and the conclusions drawn, providing an overview of the highlights of the entire SCATE project.
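    The fuzzy matching mentioned above can be illustrated with a minimal sketch: scoring a new segment against translation-memory entries by normalised edit distance. The TM entries and the 70% threshold below are invented for illustration and do not reflect SCATE's actual matching metrics.

    ```python
    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance."""
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            curr = [i]
            for j, cb in enumerate(b, 1):
                curr.append(min(prev[j] + 1,                 # deletion
                                curr[j - 1] + 1,             # insertion
                                prev[j - 1] + (ca != cb)))   # substitution
            prev = curr
        return prev[-1]

    def fuzzy_match(segment: str, memory: dict, threshold: float = 0.7):
        """Return the best TM entry whose similarity meets the threshold."""
        best = None
        for source, target in memory.items():
            dist = levenshtein(segment.lower(), source.lower())
            score = 1 - dist / max(len(segment), len(source))
            if score >= threshold and (best is None or score > best[2]):
                best = (source, target, score)
        return best

    # Illustrative two-entry translation memory (English -> Dutch).
    tm = {"Close the door.": "Sluit de deur.",
          "Open the window.": "Open het raam."}
    print(fuzzy_match("Close the doors.", tm))
    ```

    Real CAT systems refine this with token-level weighting and linguistic features, which is precisely the direction the SCATE fuzzy-matching work explores.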

    SPOD: Syntactic Profiler of Dutch

    SPOD is a tool for Dutch syntax in which a given corpus is analysed according to a large number of predefined syntactic characteristics. SPOD is an extension of the PaQu ("Parse and Query") tool (Odijk et al. 2017). SPOD is available for a number of standard Dutch corpora and treebanks. In addition, you can upload your own texts, which will then be syntactically analysed. SPOD runs a potentially large number of syntactic queries in order to show a variety of corpus properties, such as the number of main and subordinate clauses, the types of main and subordinate clauses and their frequencies, and the average length of clauses per clause type (e.g. relative clauses, indirect questions, finite complement clauses, infinitival clauses, finite adverbial clauses, etc.). Other syntactic constructions include comparatives, correlatives, various types of verb clusters, separable verb prefixes, depth of embedding, etc. SPOD allows linguists to obtain a quick overview of the syntactic properties of texts, for instance with the goal of finding interesting differences between text types, or between authors with different backgrounds or of different ages. In the paper, we describe the SPOD tool in some more detail and provide a case study illustrating the type of investigations enabled and facilitated by SPOD. Most of the syntactic properties are implemented in SPOD by means of relatively complicated XPath 2.0 queries, and as such SPOD also provides examples of relevant syntactic queries, which may otherwise be relatively hard to define for non-technical linguists.
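    The idea of running a battery of XPath queries over a treebank can be sketched in a few lines. The toy XML below uses simplified, Alpino-like category labels ("smain", "rel", "cp") that are assumptions for illustration; SPOD's real queries are XPath 2.0 expressions over the full Alpino treebank schema, and Python's standard library only supports a small XPath subset.

    ```python
    import xml.etree.ElementTree as ET

    # A toy two-sentence treebank with simplified node categories.
    treebank = ET.fromstring("""
    <treebank>
      <sentence>
        <node cat="smain">
          <node cat="np">
            <node cat="rel"/>
          </node>
        </node>
      </sentence>
      <sentence>
        <node cat="smain">
          <node cat="cp"/>
        </node>
      </sentence>
    </treebank>
    """)

    # One query per syntactic property, in the spirit of SPOD's query battery.
    queries = {
        "relative clauses":   ".//node[@cat='rel']",
        "complement clauses": ".//node[@cat='cp']",
        "main clauses":       ".//node[@cat='smain']",
    }
    counts = {name: len(treebank.findall(q)) for name, q in queries.items()}
    print(counts)
    ```

    Aggregating such counts per text is essentially how a syntactic profile is built, after which clause types or clause lengths can be compared across text types or author groups.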

    Enriching a Scientific Grammar with Links to Linguistic Resources: The Taalportaal

    Scientific research within the humanities is different from what it was a few decades ago. For instance, new sources of information, such as digital grammars, lexical databases and large corpora of real-language data, offer new opportunities for linguistics. The Taalportaal grammatical database, with its links to other linguistic resources via the CLARIN infrastructure, is a prime example of a new type of tool for linguistic research.

    Corpus linguistics as digital scholarship: Big data, rich data and uncharted data

    This introductory chapter begins by considering how the fields of corpus linguistics, digital linguistics and digital humanities overlap, intertwine and feed off each other when it comes to making use of the increasing variety of resources available for linguistic research today. We then move on to discuss the benefits and challenges of three partly overlapping approaches to the use of digital data sources: (1) increasing data size to create “big data”, (2) supplying multi-faceted co(n)textual information and analyses to produce “rich data”, and (3) adapting existing data sets to new uses by drawing on hitherto “uncharted data”. All of them also call for new digital tools and methodologies that, in Tim Hitchcock’s words, “allow us to think small; at the same time as we are generating tools to imagine big.” We conclude the chapter by briefly describing how the contributions in this volume make use of their various data sources to answer new research questions about language use and to revisit old questions in new ways.

    Extending automatic transcripts in a unified data representation towards a prosodic-based metadata annotation and evaluation

    This paper describes a framework that extends automatic speech transcripts in order to accommodate relevant information coming from manual transcripts, the speech signal itself, and other resources, such as lexica. The proposed framework automatically collects, relates, computes, and stores all relevant information together in a self-contained data source, making it possible to easily provide a wide range of interconnected information suitable for speech analysis and for training and evaluating a number of automatic speech processing tasks. The main goal of this framework is to integrate different linguistic and paralinguistic layers of knowledge for a more complete view of their representation and interactions in several domains and languages. The processing chain is composed of two main stages: the first consists of integrating the relevant manual annotations into the speech recognition data, and the second consists of further enriching the previous output in order to accommodate prosodic information. The described framework has been used for the identification and analysis of structural metadata in automatic speech transcripts. Initially put to use for automatic detection of punctuation marks and for capitalization recovery from speech data, it has also recently been used for studying the characterization of disfluencies in speech. It has already been applied to several domains of Portuguese corpora, and also to English and Spanish Broadcast News corpora.
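    The "self-contained data source" idea, merging parallel annotation layers into one record per token, can be sketched as follows. The field names ("punct_after", "pitch_slope") and the sample values are invented for illustration; the paper's actual representation and feature set are richer.

    ```python
    def merge_layers(asr_words, punctuation, prosody):
        """Combine parallel annotation layers keyed by token index:
        ASR words with timings, manual punctuation, and prosodic features."""
        unified = []
        for i, (word, start, end) in enumerate(asr_words):
            unified.append({
                "word": word,
                "start": start,
                "end": end,
                "punct_after": punctuation.get(i, ""),  # from manual transcript
                "pitch_slope": prosody.get(i),          # from the speech signal
            })
        return unified

    # Two-token toy example: word-level ASR output plus two extra layers.
    asr = [("hello", 0.00, 0.41), ("world", 0.45, 0.90)]
    punct = {1: "."}             # sentence-final period after token 1
    pros = {0: 2.1, 1: -3.4}     # illustrative pitch-slope values
    print(merge_layers(asr, punct, pros))
    ```

    With all layers in one structure, downstream tasks such as punctuation detection or disfluency analysis can draw on lexical, temporal, and prosodic cues at once, which is the point of the unified representation.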