22 research outputs found

    Apertium: a free/open-source platform for rule-based machine translation

    Get PDF
    Apertium is a free/open-source platform for rule-based machine translation. It is widely used to build machine translation systems for a variety of language pairs, especially in cases (mainly related-language pairs) where shallow transfer suffices to produce good-quality translations, although it has also proven useful in assimilation scenarios involving more distant pairs. This article summarises the Apertium platform: the translation engine, the encoding of linguistic data, and the tools developed around the platform. The present limitations of the platform and the challenges posed for the coming years are also discussed. Finally, evaluation results for some of the most active language pairs are presented. An appendix describes Apertium as a free/open-source project. We acknowledge the support of the Spanish Ministry of Science and Innovation through project TIN2009-14009-C02-01. Apertium has been mainly funded by the Ministries of Industry, Tourism and Commerce, of Education and Science, and of Science and Technology of Spain, the Government of Catalonia, the Ministry of Foreign Affairs of Romania, the Universitat d’Alacant, the Universidade de Vigo, Ofis ar Brezhoneg and Google Summer of Code (2009, 2010 and 2011 editions). Many companies have also invested in it: Prompsit Language Engineering, ABC Enciklopedioj, Eleka Ingeniaritza Linguistikoa, imaxin|software, etc.
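    As a rough illustration of how the platform is typically invoked, the sketch below runs text through an installed Apertium pipeline from Python. It assumes the apertium command-line tool and the es-ca (Spanish-Catalan) language pair are installed locally; it is not taken from the article itself.

```python
# Minimal sketch: invoking an installed Apertium pipeline from Python.
# Assumes the `apertium` command and the es-ca (Spanish-Catalan) pair,
# one of the platform's shallow-transfer pairs, are installed.
import subprocess

def translate(text: str, pair: str = "es-ca") -> str:
    """Run text through the Apertium pipeline for the given language pair."""
    result = subprocess.run(
        ["apertium", pair],   # analyser -> tagger -> transfer -> generator
        input=text,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

if __name__ == "__main__":
    print(translate("La traducción automática basada en reglas es útil."))
```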

    Review of coreference resolution in English and Persian

    Full text link
    Coreference resolution (CR) is one of the most challenging areas of natural language processing. The task seeks to identify all textual references to the same real-world entity. Research in this field is divided into coreference resolution and anaphora resolution. Because of its role in textual comprehension and its utility in other tasks such as information extraction, document summarization, and machine translation, the field has attracted considerable interest, and the quality of coreference resolution has a significant effect on these downstream systems. This article reviews the existing corpora and evaluation metrics in the field. Then, an overview of coreference algorithms, from rule-based methods to the latest deep learning techniques, is provided. Finally, coreference resolution and pronoun resolution systems in Persian are investigated. Comment: 44 pages, 11 figures, 5 tables
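    As an illustration of the kind of evaluation metric such a review surveys, the sketch below computes B-cubed (B³) precision, recall and F1 of predicted coreference clusters against gold clusters. The function and toy data are illustrative only and are not taken from the article.

```python
# Illustrative sketch: the B-cubed (B^3) coreference metric.
# Clusters are given as lists of sets of mention identifiers.
from typing import Hashable, List, Set

def b_cubed(gold: List[Set[Hashable]], predicted: List[Set[Hashable]]):
    """Return (precision, recall, F1) of predicted coreference clusters."""
    gold_of = {m: cluster for cluster in gold for m in cluster}
    pred_of = {m: cluster for cluster in predicted for m in cluster}
    mentions = gold_of.keys() & pred_of.keys()

    precision = sum(len(gold_of[m] & pred_of[m]) / len(pred_of[m]) for m in mentions) / len(mentions)
    recall = sum(len(gold_of[m] & pred_of[m]) / len(gold_of[m]) for m in mentions) / len(mentions)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: gold groups {a, b, c} and {d}; the system splits the first group.
print(b_cubed([{"a", "b", "c"}, {"d"}], [{"a", "b"}, {"c"}, {"d"}]))
```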

    An Urdu semantic tagger - lexicons, corpora, methods and tools

    Get PDF
    Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as Natural Language Processing (NLP), corpus linguistics, and data science. An important aspect of such automatic information extraction and analysis is the semantic annotation of language data using a semantic annotation tool (a.k.a. semantic tagger). Different semantic annotation tools have been designed to carry out various levels of semantic annotation, for instance sentiment analysis, word sense disambiguation, content analysis, and semantic role labelling. These tools identify or tag only part of the core semantic information in language data, and they tend to be applicable only to English and other European languages. A semantic annotation tool that can annotate the semantic senses of all lexical units (words) in Urdu text, based on the USAS (UCREL Semantic Analysis System) semantic taxonomy, is still needed in order to provide comprehensive semantic analysis of Urdu. This research reports on the development of an Urdu semantic tagging tool and discusses the challenges faced in this Ph.D. work. Since standard NLP pipeline tools are not widely available for Urdu, a suite of tools has been newly developed alongside the Urdu semantic tagger: a sentence tokenizer, a word tokenizer and a part-of-speech tagger. The word tokenizer reports an F1 of 94.01% and an accuracy of 97.21%, the sentence tokenizer an F1 of 92.59% and an accuracy of 93.15%, and the POS tagger an accuracy of 95.14%. The Urdu semantic tagger incorporates semantic resources (a lexicon and corpora) as well as semantic field disambiguation methods. In terms of novelty, the NLP pre-processing tools are developed using rule-based, statistical, or hybrid techniques. Furthermore, all semantic lexicons have been developed using a novel combination of automatic or semi-automatic approaches: mapping, crowdsourcing, statistical machine translation, GIZA++, word embeddings, and named entity recognition. A large multi-target annotated corpus is also constructed using a semi-automatic approach to test the accuracy of the Urdu semantic tagger; this corpus is also used to train and test supervised multi-target machine learning classifiers. The results show that the Random k-labEL Disjoint Pruned Sets and Classifier Chain multi-target classifiers outperform all other classifiers on the proposed corpus, with a Hamming loss of 0.06 and an accuracy of 0.94. Lexical coverage of 88.59%, 99.63%, 96.71% and 89.63% is obtained on several test corpora. The developed Urdu semantic tagger shows an encouraging precision of 79.47% on the proposed test corpus.
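    The sketch below illustrates the multi-target classification setup described above: a Classifier Chain trained and scored with Hamming loss. It uses scikit-learn and synthetic data as stand-ins, since the thesis's Urdu corpus and exact configuration are not reproduced here.

```python
# Illustrative sketch only: a Classifier Chain for multi-target (multi-label)
# tagging, evaluated with Hamming loss. Synthetic data stands in for the
# Urdu semantic-tag corpus used in the thesis.
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import hamming_loss
from sklearn.model_selection import train_test_split
from sklearn.multioutput import ClassifierChain

X, Y = make_multilabel_classification(n_samples=2000, n_features=40,
                                       n_classes=8, n_labels=2, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, random_state=0)

# Each label's classifier also sees the previous labels' predictions,
# letting the chain model correlations between semantic tags.
chain = ClassifierChain(LogisticRegression(max_iter=1000), random_state=0)
chain.fit(X_train, Y_train)

print("Hamming loss:", hamming_loss(Y_test, chain.predict(X_test)))
```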

    Proceedings of the Conference on Natural Language Processing 2010

    Get PDF
    This book contains state-of-the-art contributions to the 10th Conference on Natural Language Processing, KONVENS 2010 (Konferenz zur Verarbeitung natürlicher Sprache), with a focus on semantic processing. KONVENS in general aims at offering a broad perspective on current research and developments within the interdisciplinary field of natural language processing. The central theme draws specific attention to linguistic aspects of meaning, covering deep as well as shallow approaches to semantic processing. The contributions address both knowledge-based and data-driven methods for modelling and acquiring semantic information, and discuss the role of semantic information in applications of language technology. The articles demonstrate the importance of semantic processing and present novel and creative approaches to natural language processing in general. Some contributions focus on developing and improving NLP systems for tasks like Named Entity Recognition or Word Sense Disambiguation, on semantic knowledge acquisition and exploitation with respect to collaboratively built resources, or on harvesting semantic information in virtual games. Others are set within the context of real-world applications, such as Authoring Aids, Text Summarisation and Information Retrieval. The collection highlights the importance of semantic processing for different areas and applications in Natural Language Processing, and provides the reader with an overview of current research in this field.

    Semantic Indexing via Knowledge Organization Systems: Applying the CIDOC-CRM to Archaeological Grey Literature

    Get PDF
    The volume of archaeological reports being produced since the introduction of PPG16 has increased significantly, as a result of the growth in archaeological investigations conducted by academic and commercial archaeology. It is highly desirable to be able to search effectively within and across such reports in order to find information that promotes quality research. Dissemination of information via semantic technologies offers the opportunity to improve archaeological practice, not only by enabling access to information but also by changing how information is structured and the way research is conducted. This thesis presents a method for automatic semantic indexing of archaeological grey-literature reports using rule-based Information Extraction techniques in combination with domain-specific ontological and terminological resources. This semantic annotation of contextual abstractions from archaeological grey literature is driven by Natural Language Processing (NLP) techniques which identify “rich”, meaningful pieces of text, thus overcoming barriers to document indexing and retrieval imposed by the use of natural language. The semantic annotation system (OPTIMA) performs the NLP tasks of Named Entity Recognition, Relation Extraction, Negation Detection and Word Sense Disambiguation using hand-crafted rules and terminological resources, associating contextual abstractions with classes of the ISO standard (ISO 21127:2006) CIDOC Conceptual Reference Model (CRM) for cultural heritage and its archaeological extension, CRM-EH, together with concepts from English Heritage thesauri and glossaries. The results demonstrate that the techniques can deliver semantic annotations of archaeological grey-literature documents with respect to the domain conceptual models. Such semantic annotations have proven capable of supporting semantic query, document study and cross-searching via web-based applications. The research outcomes have provided semantic annotations for the Semantic Technologies for Archaeological Resources (STAR) project, which explored the potential of semantic technologies in the integration of archaeological digital resources. The thesis represents the first discussion of the employment of CIDOC CRM and CRM-EH in semantic annotation of grey-literature documents using rule-based Information Extraction techniques driven by a supplementary exploitation of domain-specific ontological and terminological resources. It is anticipated that the methods can be generalised in the future to the broader field of Digital Humanities.
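    The toy sketch below illustrates the general idea of gazetteer-driven, rule-based semantic annotation against CIDOC CRM classes. The terms and class assignments are invented for illustration and do not reproduce the OPTIMA system, which relies on hand-crafted rules, English Heritage thesauri and the CRM-EH extension.

```python
# Toy sketch of gazetteer-driven semantic annotation in the spirit of OPTIMA.
# The gazetteer entries and CRM class assignments below are illustrative only.
import re

GAZETTEER = {
    "E19 Physical Object": ["ditch", "pottery", "coin", "hearth"],
    "E53 Place": ["trench", "pit", "enclosure"],
    "E52 Time-Span": ["roman", "medieval", "iron age"],
}

def annotate(text: str):
    """Return (start, end, surface form, CRM class) tuples for gazetteer matches."""
    annotations = []
    for crm_class, terms in GAZETTEER.items():
        pattern = r"\b(" + "|".join(map(re.escape, terms)) + r")\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            annotations.append((match.start(), match.end(), match.group(0), crm_class))
    return sorted(annotations)

report = "A Roman coin and medieval pottery were recovered from the ditch in trench 3."
for start, end, surface, crm_class in annotate(report):
    print(f"{surface!r:12} -> {crm_class}")
```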

    Construction de corpus généraux et spécialisés à partir du Web

    Get PDF
    At the beginning of the first chapter the interdisciplinary setting between linguistics, corpus linguistics, and computational linguistics is introduced. Then, the notion of corpus is put into focus. Existing corpus and text definitions are discussed. Several milestones of corpus design are presented, from pre-digital corpora at the end of the 1950s to web corpora in the 2000s and 2010s. The continuities and changes between the linguistic tradition and web-native corpora are set out. In the second chapter, methodological insights on automated text scrutiny in computer science, computational linguistics and natural language processing are presented. The state of the art on text quality assessment and web text filtering exemplifies current interdisciplinary research trends on web texts. Readability studies and automated text classification serve as exemplars of methods for finding salient features that capture text characteristics. Text visualization exemplifies corpus processing in the digital humanities framework. As a conclusion, guiding principles for research practice are listed, and reasons are given for finding a balance between quantitative analysis and corpus linguistics in an environment shaped by technological innovation and artificial intelligence techniques. Third, current research on web corpora is summarized. I distinguish two main approaches to web document retrieval: restricted retrieval and web crawling. The notion of web corpus preprocessing is introduced and its salient steps are discussed. The impact of the preprocessing phase on research results is assessed. I explain why the importance of preprocessing should not be underestimated and why it is important for linguists to acquire new skills in order to handle the whole data gathering and preprocessing phase. I present my work on web corpus construction in the fourth chapter. My analyses concern two main aspects: first, the question of corpus sources (or prequalification), and secondly, the problem of including valid, desirable documents in a corpus (or document qualification). Last, I present work on corpus visualization, consisting of extracting certain corpus characteristics in order to give an indication of corpus contents and quality.

    The first chapter opens with a description of the interdisciplinary context. The concept of corpus is then presented in light of the state of the art. The need for evidence that is linguistic in nature yet spans several disciplines is illustrated by a number of research scenarios. Several key stages of corpus construction are retraced, from the pre-digital corpora of the late 1950s to the web corpora of the 2000s and 2010s. The continuities and changes between the linguistic tradition and corpora drawn from the web are set out. The second chapter gathers methodological considerations. The state of the art on estimating text quality is described. The methods used in readability studies and in automatic text classification are then summarized, and common denominators are isolated. Finally, text visualization demonstrates the value of corpus analysis for the digital humanities. The reasons for striking a balance between quantitative analysis and corpus linguistics are addressed. The third chapter summarizes the thesis's contribution to research on corpora drawn from the web. The question of data collection is examined with particular attention, especially the case of source URLs. The notion of preprocessing of web corpora is introduced and its major steps are outlined. The impact of preprocessing on the results is evaluated. The simplicity and reproducibility of corpus construction are emphasized. The fourth part describes the thesis's contribution to corpus construction proper, through the question of sources and the problem of invalid or undesirable documents. An approach using a lightweight scout to prepare the web crawl is presented. Work on selecting documents just before their inclusion in a corpus is then summarized: insights from readability studies as well as machine learning techniques can be used during corpus construction. A set of textual features tested on annotated samples assesses the effectiveness of the procedure. Finally, work on corpus visualization is addressed: extracting characteristics at the scale of a whole corpus in order to give an indication of its composition and quality.
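    As a minimal illustration of document qualification of the kind discussed above, the sketch below filters web documents on simple surface features. The features and thresholds are invented for the example and are not the ones proposed in the thesis.

```python
# Illustrative sketch of document qualification for a web corpus: reject pages
# whose surface features suggest boilerplate or non-text. Thresholds are
# invented for the example.
import re

def qualifies(text: str,
              min_tokens: int = 100,
              min_mean_sentence_len: float = 5.0,
              max_nonalpha_ratio: float = 0.35) -> bool:
    tokens = text.split()
    if len(tokens) < min_tokens:
        return False                      # too short to be a useful document
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    mean_len = len(tokens) / max(len(sentences), 1)
    if mean_len < min_mean_sentence_len:
        return False                      # fragment-like text (menus, link lists)
    nonalpha = sum(1 for ch in text if not (ch.isalpha() or ch.isspace()))
    return nonalpha / len(text) <= max_nonalpha_ratio

print(qualifies("Buy now! " * 60))        # repetitive, fragment-like text -> False
```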

    Tagungsband der 12. Tagung Phonetik und Phonologie im deutschsprachigen Raum

    Get PDF