
    Beyond English text: Multilingual and multimedia information retrieval.


    In no uncertain terms : a dataset for monolingual and multilingual automatic term extraction from comparable corpora

    Automatic term extraction is a productive field of research within natural language processing, but it still faces significant obstacles regarding datasets and evaluation, both of which require manual term annotation. This is an arduous task, made even more difficult by the lack of a clear distinction between terms and general language, which results in low inter-annotator agreement. There is a strong need for well-documented, manually validated datasets, especially in the rising field of multilingual term extraction from comparable corpora, which presents its own set of challenges. In this paper, a new approach is presented for both monolingual and multilingual term annotation in comparable corpora. The detailed guidelines with different term labels, the domain- and language-independent methodology, and the large volumes annotated in three different languages and four different domains make this a rich resource. The resulting datasets are suited not just for evaluation purposes but can also serve as a general source of information about terms and even as training data for supervised methods. Moreover, the gold standard for multilingual term extraction from comparable corpora contains information about term variants and translation equivalents, which allows an in-depth, nuanced evaluation.

    Disrupting Digital Monolingualism: A report on multilingualism in digital theory and practice

    This report covers the Disrupting Digital Monolingualism (DDM) virtual workshop held in June 2020. The DDM workshop sought to draw together a wide range of stakeholders active in confronting the current language bias in most of the digital platforms, tools, algorithms, methods, and datasets which we use in our study or practice, and to reverse the powerful impact this bias has on geocultural knowledge dynamics in the wider world. The workshop aimed to describe the state of the art across different academic disciplines and professional fields, and to foster collaboration across diverse perspectives around four points of focus: linguistic and geocultural diversity in digital knowledge infrastructures; working with multilingual methods and data; transcultural and translingual approaches to digital study; and artificial intelligence, machine learning and NLP in language worlds. (Event website: https://languageacts.org/digital-mediations/event/disrupting-digital-monolingualism/) This report forms part of a series of reports produced by the Digital Mediations strand of the Language Acts & Worldmaking project, in this case in collaboration with the translingual strand of the Cross-Language Dynamics project (based at the Institute of Modern Languages Research), both funded by the UK Arts and Humanities Research Council’s Open World Research Initiative. Digital Mediations explores interactions and tensions between digital culture, multilingualism and language fields including the Modern Languages.