The Role of Corpus Pattern Analysis in Machine Translation Evaluation
This paper takes a preliminary look at the relation between verb pattern
matches in the Pattern Dictionary of English Verbs (PDEV) and translation quality
through a qualitative analysis of human-ranked sentences from 5 different
machine translation systems. The purpose of the analysis is not only to determine
whether verbs in the automatic translations and their immediate contexts match
any pre-existing semanto-syntactic pattern in PDEV, but also to establish links
between hypothesis sentences and the verbs in the reference translation. It
attempts to answer the question of whether or not the semantic and syntactic
information captured by Corpus Pattern Analysis (CPA) can indicate whether a
sentence is a “good” translation. Two human annotators manually identified the
occurrence of patterns in 50 translations and indicated whether these patterns
match any identified pattern in the corresponding reference translation. Results
indicate that CPA can be used to distinguish between well-formed and ill-formed
sentences.
Flexibility of multiword expressions and Corpus Pattern Analysis
This chapter is set in the context of Corpus Pattern Analysis (CPA), a technique developed by Patrick Hanks to map meaning onto word patterns found in corpora. The main output of CPA is the Pattern Dictionary of English Verbs (PDEV), currently describing patterns for over 1,600 verbs, many of which are acknowledged to be multiword expressions (MWEs) such as phrasal verbs or idioms. PDEV entries are manually produced by lexicographers, based on the analysis of a substantial sample of concordance lines from the corpus, so the construction of the resource is very time-consuming. The motivation for the work presented in this chapter is to speed up the discovery of these word patterns, using methods which can be transferred to other languages. The chapter explores the benefits of a detailed contrastive analysis of MWEs found in English and French corpora with a view to English-French translation. The comparative analysis is conducted through a case study of the pair (bite, mordre), to illustrate both CPA and the application of statistical measures for the automatic extraction of MWEs. The approach takes its point of departure from the use of statistics developed initially by Church & Hanks (1989). Here we look at statistical measures which have not yet been tested for their ability to discover new collocates, but which are useful for characterizing verbal MWEs already found. In particular, we propose measures to characterize the mean span, rigidity, diversity, and idiomaticity of a given MWE.
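The association-ratio statistic of Church & Hanks (1989), which the chapter above takes as its point of departure, is essentially windowed pointwise mutual information (PMI). As a rough illustration only (not the chapter's own implementation, and with a hypothetical toy corpus), a minimal PMI sketch in Python over a pre-tokenised text:

```python
from collections import Counter
from math import log2

def pmi(tokens, word, collocate, window=5):
    """Windowed pointwise mutual information, in the spirit of the
    association ratio of Church & Hanks (1989):
    PMI(w, c) = log2( P(w, c) / (P(w) * P(c)) )."""
    n = len(tokens)
    unigrams = Counter(tokens)
    # Count occurrences of `word` that have `collocate` within the window.
    pair_count = 0
    for i, tok in enumerate(tokens):
        if tok == word and collocate in tokens[max(0, i - window):i + window + 1]:
            pair_count += 1
    if pair_count == 0:
        return float("-inf")  # never co-occur in this sample
    p_w = unigrams[word] / n
    p_c = unigrams[collocate] / n
    p_wc = pair_count / n
    return log2(p_wc / (p_w * p_c))

# Toy corpus (hypothetical): "dog" and "bite" co-occur more often than chance.
corpus = ("the dog may bite the postman unless the dog is on a leash "
          "a snake can bite too and a bite can hurt").split()
print(round(pmi(corpus, "dog", "bite"), 2))
```

A score well above zero signals a candidate collocation; the window size and corpus size both affect the estimate, which is why measures such as mean span and rigidity are useful complements once an MWE has been found.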
The Financial Document Structure Extraction Shared Task (FinTOC-2021)
This paper presents the FinTOC-2021 Shared Task on structure extraction from financial documents, the participants' results and their findings. This shared task was organized as part of The 2nd Joint Workshop on Financial Narrative Processing (FNP 2021), held at the University of Lancaster. This shared task aimed to stimulate research in systems for extracting the table of contents (TOC) from investment documents (such as financial prospectuses) by detecting the document titles and organizing them hierarchically into a TOC. For the third edition of this shared task, two subtasks were presented to the participants: one with English documents and the other with French documents, but with a different and revised dataset compared to the previous FinTOC edition.
The Financial Document Structure Extraction Shared Task (FinTOC-2020)
This paper presents the FinTOC-2020 Shared Task on structure extraction from financial documents, the participants' results and their findings. This shared task was organized as part of The 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation (FNP-FNS 2020), held at The 28th International Conference on Computational Linguistics (COLING'2020). This shared task aimed to stimulate research in systems for extracting the table of contents (TOC) from investment documents (such as financial prospectuses) by detecting the document titles and organizing them hierarchically into a TOC. For the second edition of this shared task, two subtasks were presented to the participants: one with English documents and the other with French documents.
SemEval-2015 Task 15: A CPA Dictionary Entry Building Task
This paper describes the first SemEval task to explore the use of Natural Language Processing systems for building dictionary entries, in the framework of Corpus Pattern Analysis. CPA is a corpus-driven technique which provides tools and resources to identify and represent unambiguously the main semantic patterns in which words are used. Task 15 draws on the Pattern Dictionary of English Verbs (www.pdev.org.uk) for the targeted lexical entries, and on the British National Corpus for the input text. Dictionary entry building is split into three subtasks which all start from the same concordance sample: 1) CPA parsing, where arguments and their syntactic and semantic categories have to be identified; 2) CPA clustering, in which sentences with similar patterns have to be clustered; and 3) CPA automatic lexicography, where the structure of patterns has to be constructed automatically. Subtask 1 attracted 3 teams, though none could beat the baseline (a rule-based system). Subtask 2 attracted 2 teams, one of which beat the baseline (a majority-class classifier). Subtask 3 did not attract any participants. The task has produced a major semantic multidataset resource which includes data for 121 verbs and about 17,000 annotated sentences, and which is freely accessible.
The GuanXi network: a new multilingual LLOD for Language Learning applications
Linguistic resources are essential for Language Learning applications. However, available resources are usually created in isolation; thus, they are scattered and need to be linked before they can be used for a specific task such as learning a foreign language. To address these problems, we present a new resource that links linguistic resources of multiple languages using the framework of Linguistic Linked Open Data (LLOD).
The Financial Document Structure Extraction Shared Task (FinTOC 2022)
This paper describes the FinTOC-2022 Shared Task on structure extraction from financial documents, the participants' results and their findings. This shared task was organized as part of The 4th Financial Narrative Processing Workshop (FNP 2022), held jointly at The 13th Edition of the Language Resources and Evaluation Conference (LREC 2022), Marseille, France (El-Haj et al., 2022). This shared task aimed to stimulate research in systems for extracting the table of contents (TOC) from investment documents (such as financial prospectuses) by detecting the document titles and organizing them hierarchically into a TOC. For the fourth edition of this shared task, three subtasks were presented to the participants: one with English documents, one with French documents and the other with Spanish documents. This year, we proposed a different and revised dataset for English and French compared to the previous editions of FinTOC, and a new dataset for Spanish documents was added. The task attracted 6 submissions for each language from 4 teams, and the most successful methods make use of textual, structural and visual features extracted from the documents and propose classification models for detecting titles and TOCs for all of the subtasks.
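To make the FinTOC task output concrete: once a system has detected titles and assigned them nesting depths, the final TOC is a tree. A minimal, hypothetical sketch (not any participant's system, with invented example titles) that nests a flat list of (depth, title) detections:

```python
def build_toc(entries):
    """Nest a flat list of (depth, title) pairs into a TOC tree.
    Depth 1 is a top-level section; a deeper entry attaches to the
    nearest shallower entry above it."""
    root = {"title": None, "children": []}
    stack = [(0, root)]  # (depth, node) path from root to the current branch
    for depth, title in entries:
        node = {"title": title, "children": []}
        # Unwind to the closest ancestor strictly shallower than `depth`.
        while stack[-1][0] >= depth:
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root["children"]

# Hypothetical title detections from a prospectus:
toc = build_toc([(1, "Part I"), (2, "Fees"), (2, "Risk Factors"), (1, "Part II")])
print([section["title"] for section in toc])
```

The hard part of the shared task is, of course, the detection and depth assignment themselves, for which the successful systems combine textual, structural and visual features; the nesting step above is only the final assembly.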
Proceedings of the Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2015)
This volume documents the proceedings of the 2nd Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2015), held on 1-2 July 2015 as part of the EUROPHRAS 2015 conference: "Computerised and Corpus-based Approaches to Phraseology: Monolingual and Multilingual Perspectives" (Málaga, 29 June – 1 July 2015). The workshop was sponsored by European COST Action PARSing and Multi-word Expressions (PARSEME) under the auspices of the European Society of Phraseology (EUROPHRAS), the Special Interest Group on the Lexicon of the Association for Computational Linguistics (SIGLEX), and SIGLEX's Multiword Expressions Section (SIGLEX-MWE). The workshop was co-chaired by Gloria Corpas Pastor (Universidad de Málaga), Ruslan Mitkov (University of Wolverhampton), Johanna Monti (Università degli Studi di Sassari), and Violeta Seretan (Université de Genève). It received the support of the Advisory Board, composed of Dmitrij O. Dobrovol'skij (Russian Academy of Sciences, Moscow), Kathrin Steyer (Institut für Deutsche Sprache, Mannheim), Agata Savary (Université François Rabelais Tours), Michael Rosner (University of Malta), and Carlos Ramisch (Aix-Marseille Université).
The topic of the workshop was the integration of multi-word units in machine translation and translation technology tools. In spite of the recent progress achieved in machine translation and translation technology, the identification, interpretation and translation of multi-word units still represent open challenges, both from a theoretical and from a practical point of view. The idiosyncratic morpho-syntactic, semantic and translational properties of multi-word units pose many obstacles even to human translators, mainly because of intrinsic ambiguities, structural and lexical asymmetries between languages, and, finally, cultural differences. After a successful first edition held in Nice on 3 September 2013 as part of the Machine Translation Summit XIV, the present edition provided a forum for researchers working in the fields of Linguistics, Computational Linguistics, Translation Studies and Computational Phraseology to discuss recent advances in the area of multi-word unit processing and to coordinate research efforts across disciplines.
The workshop was attended by 53 representatives of academic and industrial organisations. The programme included 11 oral and 4 poster presentations, and featured an invited talk by Kathrin Steyer, President of EUROPHRAS. We received 23 submissions, hence the MUMTTT 2015 acceptance rate was 65.2%. The accepted papers are indicative of the current efforts of researchers and developers who are actively engaged in improving the state of the art of multi-word unit translation.