
15 research outputs found

The Role of Corpus Pattern Analysis in Machine Translation Evaluation

Author: Bechara Hanna
El-Maarouf Ismail
Hanks Patrick
Mitkov Ruslan
Moze Sara
Orasan Constantin
Publication venue: Tradulex
Publication date: 16/03/2015

This paper takes a preliminary look at the relation between verb pattern matches in the Pattern Dictionary of English Verbs (PDEV) and translation quality through a qualitative analysis of human-ranked sentences from 5 different machine translation systems. The purpose of the analysis is not only to determine whether verbs in the automatic translations and their immediate contexts match any pre-existing semanto-syntactic pattern in PDEV, but also to establish links between hypothesis sentences and the verbs in the reference translation. It attempts to answer the question of whether or not the semantic and syntactic information captured by Corpus Pattern Analysis (CPA) can indicate whether a sentence is a “good” translation. Two human annotators manually identified the occurrence of patterns in 50 translations and indicated whether these patterns match any identified pattern in the corresponding reference translation. Results indicate that CPA can be used to distinguish between well-formed and ill-formed sentences.

Wolverhampton Intellectual Repository and E-theses

Flexibility of multiword expressions and Corpus Pattern Analysis

Author: El Maarouf Ismail
Hanks Patrick
Oakes Michael

This chapter is set in the context of Corpus Pattern Analysis (CPA), a technique developed by Patrick Hanks to map meaning onto word patterns found in corpora. The main output of CPA is the Pattern Dictionary of English Verbs (PDEV), currently describing patterns for over 1,600 verbs, many of which are acknowledged to be multiword expressions (MWEs) such as phrasal verbs or idioms. PDEV entries are manually produced by lexicographers, based on the analysis of a substantial sample of concordance lines from the corpus, so the construction of the resource is very time-consuming. The motivation for the work presented in this chapter is to speed up the discovery of these word patterns, using methods which can be transferred to other languages. This chapter explores the benefits of a detailed contrastive analysis of MWEs found in English and French corpora with a view on English-French translation. The comparative analysis is conducted through a case study of the pair (bite, mordre), to illustrate both CPA and the application of statistical measures for the automatic extraction of MWEs. The approach taken in this chapter takes its point of departure from the use of statistics developed initially by Church & Hanks (1989). Here we look at statistical measures which have not yet been tested for their ability to discover new collocates, but are useful for characterizing verbal MWEs already found. In particular we propose measures to characterize the mean span, rigidity, diversity, and idiomaticity of a given MWE.

The Francis Crick Institute

The Financial Document Structure Extraction Shared Task (FinTOC 2021)

Author: Azzi Abderrahim Ait
Bellato Sandra
El-Haj Mahmoud
Gan Mei
Kang Juyeon
Maarouf Ismail El
Publication date: 26/10/2021

This paper presents the FinTOC-2021 Shared Task on structure extraction from financial documents, its participants' results and their findings. This shared task was organized as part of The 2nd Joint Workshop on Financial Narrative Processing (FNP 2021), held at the University of Lancaster. This shared task aimed to stimulate research in systems for extracting table-of-contents (TOC) from investment documents (such as financial prospectuses) by detecting the document titles and organizing them hierarchically into a TOC. For the third edition of this shared task, two subtasks were presented to the participants: one with English documents and the other one with French documents but with a different and revised dataset compared to FinTOC’2 edition.

Lancaster E-Prints

The Financial Document Structure Extraction Shared task (FinToc 2020)

Author: Bentabet Najah-Imane
El Maarouf Ismail
El-Haj Mahmoud
Juge Rémi
Mouilleron Virginie
Valsamou-Stanislawski Dialekti
Publication venue: COLING
Publication date: 01/12/2020

This paper presents the FinTOC-2020 Shared Task on structure extraction from financial documents, its participants' results and their findings. This shared task was organized as part of The 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation (FNP-FNS 2020), held at The 28th International Conference on Computational Linguistics (COLING'2020). This shared task aimed to stimulate research in systems for extracting table-of-contents (TOC) from investment documents (such as financial prospectuses) by detecting the document titles and organizing them hierarchically into a TOC. For the second edition of this shared task, two subtasks were presented to the participants: one with English documents and the other one with French documents.

Lancaster E-Prints

SemEval-2015 Task 15: A CPA Dictionary-Entry-Building Task

Author: Adam Kilgarriff
Ismail El Maarouf
Jane Bradbury
Octavian Popescu
Silvie Cinkova
Vít Baisa
Publication date: 01/01/2015

This paper describes the first SemEval task to explore the use of Natural Language Processing systems for building dictionary entries, in the framework of Corpus Pattern Analysis. CPA is a corpus-driven technique which provides tools and resources to identify and represent unambiguously the main semantic patterns in which words are used. Task 15 draws on the Pattern Dictionary of English Verbs (www.pdev.org.uk), for the targeted lexical entries, and on the British National Corpus for the input text. Dictionary entry building is split into three subtasks which all start from the same concordance sample: 1) CPA parsing, where arguments and their syntactic and semantic categories have to be identified, 2) CPA clustering, in which sentences with similar patterns have to be clustered and 3) CPA automatic lexicography, where the structure of patterns has to be constructed automatically. Subtask 1 attracted 3 teams, though none could beat the baseline (rule-based system). Subtask 2 attracted 2 teams, one of which beat the baseline (majority-class classifier). Subtask 3 did not attract any participant. The task has produced a major semantic multidataset resource which includes data for 121 verbs and about 17,000 annotated sentences, and which is freely accessible.

Crossref

Open Access Repository

The GuanXi network: a new multilingual LLOD for Language Learning applications

Author: Alferov Eugene
Cooper Doug
Fang Zhijia
Maarouf Ismail El
Mousselly-Sergieh Hatem
Wang Haofen
Publication venue: INCOMA Ltd. Shoumen, BULGARIA
Publication date: 01/09/2015

Linguistic resources are essential for Language Learning applications. However, available resources are usually created in isolation, so they are scattered and need to be linked before they can be used for a specific task such as learning a foreign language. To address these problems, we present a new resource that links linguistic resources of multiple languages using the framework of Linguistic Linked Open Data (LLOD).

TUbiblio

Review of the State of the Art in Financial Narrative Processing

Author: AbuRa'ed Ahmed
Bentabet Najah-Imane
El Maarouf Ismail
El-Haj Mahmoud
Giannakopoulos George
Labidurie Estelle
Litvak Marina
Mariko Dominique
Rayson Paul
Zmandar Nadhem
Publication venue: Tirant lo Blanch Brasil
Publication date: 13/12/2021

Lancaster E-Prints

The Financial Document Structure Extraction Shared Task (FinTOC 2022)

Author: Ait Azzi Abderrahim
Bellato Sandra
Carbajo Coronado Blanca
El Maarouf Ismail
El-Haj Mahmoud
Gan Mei
Gisbert Clemente Ana
Kang Juyeon
Moreno Sandoval Antonio
Publication date: 15/06/2022

This paper describes the FinTOC-2022 Shared Task on the structure extraction from financial documents, its participants' results and their findings. This shared task was organized as part of The 4th Financial Narrative Processing Workshop (FNP 2022), held jointly at The 13th Edition of the Language Resources and Evaluation Conference (LREC 2022), Marseille, France (El-Haj et al., 2022). This shared task aimed to stimulate research in systems for extracting table-of-contents (TOC) from investment documents (such as financial prospectuses) by detecting the document titles and organizing them hierarchically into a TOC. For the fourth edition of this shared task, three subtasks were presented to the participants: one with English documents, one with French documents and the other one with Spanish documents. This year, we proposed a different and revised dataset for English and French compared to the previous editions of FinTOC, and a new dataset for Spanish documents was added. The task attracted 6 submissions for each language from 4 teams, and the most successful methods make use of textual, structural and visual features extracted from the documents and propose classification models for detecting titles and TOCs for all of the subtasks.

Lancaster E-Prints

Biblos-e Archivo

Proceedings of the Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2015)

Author: Amalia Todirascu
Angela Costa
Angelika Fotopoulou
Carla Parra Escartín
Corpas Pastor Gloria
Gábor Csernyi
Héctor Martínez Alonso
Ismail El Maarouf
Jeevanthi Liyanapathira
Johanna Monti
Kathrin Steyer
Laurent Besacier
Maximiliano Durán
Michael Oakes
Mirabela Navlea
Mitkov Ruslan
Natalia Klyueva
Olivier Kraif
Violeta Seretan
Voula Giouli
Zied Elloumi
Publication venue: Geneva
Publication date: 01/01/2015

This volume documents the proceedings of the 2nd Workshop on Multi-word Units in Machine Translation and Translation Technology (MUMTTT 2015), held on 1-2 July 2015 as part of the EUROPHRAS 2015 conference: "Computerised and Corpus-based Approaches to Phraseology: Monolingual and Multilingual Perspectives" (Málaga, 29 June – 1 July 2015). The workshop was sponsored by European COST Action PARSing and Multi-word Expressions (PARSEME) under the auspices of the European Society of Phraseology (EUROPHRAS), the Special Interest Group on the Lexicon of the Association for Computational Linguistics (SIGLEX), and SIGLEX's Multiword Expressions Section (SIGLEX-MWE). The workshop was co-chaired by Gloria Corpas Pastor (Universidad de Málaga), Ruslan Mitkov (University of Wolverhampton), Johanna Monti (Università degli Studi di Sassari), and Violeta Seretan (Université de Genève). It received the support of the Advisory Board, composed of Dmitrij O. Dobrovol'skij (Russian Academy of Sciences, Moscow), Kathrin Steyer (Institut für Deutsche Sprache, Mannheim), Agata Savary (Université François Rabelais Tours), Michael Rosner (University of Malta), and Carlos Ramisch (Aix-Marseille Université). The topic of the workshop was the integration of multi-word units in machine translation and translation technology tools. In spite of the recent progress achieved in machine translation and translation technology, the identification, interpretation and translation of multi-word units still represent open challenges, both from a theoretical and from a practical point of view. The idiosyncratic morpho-syntactic, semantic and translational properties of multi-word units pose many obstacles even to human translators, mainly because of intrinsic ambiguities, structural and lexical asymmetries between languages, and, finally, cultural differences.
After a successful first edition held in Nice on 3 September 2013 as part of the Machine Translation Summit XIV, the present edition provided a forum for researchers working in the fields of Linguistics, Computational Linguistics, Translation Studies and Computational Phraseology to discuss recent advances in the area of multi-word unit processing and to coordinate research efforts across disciplines. The workshop was attended by 53 representatives of academic and industrial organisations. The programme included 11 oral and 4 poster presentations, and featured an invited talk by Kathrin Steyer, President of EUROPHRAS. We received 23 submissions, hence the MUMTTT 2015 acceptance rate was 65.2%. The papers accepted are indicative of the current efforts of researchers and developers who are actively engaged in improving the state of the art of multi-word unit translation.

ARCHIVIO ISTITUZIONALE DELLA RICERCA-UNIVERSITA' DEGLI STUDI DI NAPOLI "L'ORIENTALE"