691 research outputs found

    Arabic Rule-Based Named Entity Recognition Systems Progress and Challenges

    Rule-based approaches use human-made rules to extract Named Entities (NEs); alongside Machine Learning, they are among the best-known ways to extract NEs. Named Entity Recognition (NER) is the task of identifying personal names, locations, organizations and many other entities. In Arabic, Big Data challenges are driving the rapid development of NER for extracting useful information from texts. This paper sheds some light on research progress in rule-based systems through a diagnostic comparison of linguistic resources, entity types, domains, and performance. We also highlight the challenges of processing Arabic NEs with rule-based systems. Good NER performance is expected to benefit other modern fields such as semantic web searching, question answering, machine translation, information retrieval, and abstracting systems.

    D6.1: Technologies and Tools for Lexical Acquisition

    This report describes the technologies and tools to be used for Lexical Acquisition in PANACEA. It includes descriptions of existing technologies and tools which can be built on and improved within PANACEA, as well as of new technologies and tools to be developed and integrated in the PANACEA platform. The report also specifies the Lexical Resources to be produced. Four main areas of lexical acquisition are included: Subcategorization frames (SCFs), Selectional Preferences (SPs), Lexical-semantic Classes (LCs), for both nouns and verbs, and Multi-Word Expressions (MWEs).

    Workshop Proceedings of the 12th edition of the KONVENS conference

    The 2014 issue of KONVENS is even more a forum for exchange: its main topic is the interaction between Computational Linguistics and Information Science, and the synergies such interaction, cooperation and integrated views can produce. This topic at the crossroads of different research traditions which deal with natural language as a container of knowledge, and with methods to extract and manage knowledge that is linguistically represented is close to the heart of many researchers at the Institut für Informationswissenschaft und Sprachtechnologie of Universität Hildesheim: it has long been one of the institute’s research topics, and it has received even more attention over the last few years

    A Computational Lexicon and Representational Model for Arabic Multiword Expressions

    The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representations. This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, in the use of AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representative system for AMWEs which consists of multilayer encoding of extensive linguistic descriptions. This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic. 
The implications of this research are related to the vital role of the AMWE lexicon, as a new lexical resource, in the improvement of various ANLP tasks and the potential opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena

    Design of a Controlled Language for Critical Infrastructures Protection

    We describe a project for the construction of a controlled language for critical infrastructures protection (CIP). This project originates from the need to coordinate and categorize the communications on CIP at the European level. These communications can be physically represented by official documents, reports on incidents, informal communications and plain e-mail. We explore the application of traditional library science tools for the construction of controlled languages in order to achieve our goal. Our starting point is an analogous work done during the sixties in the field of nuclear science known as the Euratom Thesaurus. JRC.G.6-Security technology assessmen

    The communicative theory of Terminology (CTT) applied to the development of a corpus-based specialised dictionary of the ceramics industry

    This thesis is the result of a project aimed at creating an active, bilingual (Spanish-English; English-Spanish), specialised dictionary of the ceramics and tile industry, with the Communicative Theory of Terminology as its main theoretical pillar. Given the theoretical position adopted, the research presented here started from a study of a corpus (compiled ad hoc) in which terms were analysed in vivo and characterised according to the "habitat" in which they occur in the specialised text. The approach taken to the study of ceramic industrial terminology therefore justifies the label "specialised lexicography" for a work such as this, which has sought to go beyond terminographic practice and give rise to a study that prioritises context, the natural associations of terms (collocations) and the communicative nature of terminology. Accordingly, in addition to a detailed theoretical framework consistent with the ultimate aim of the research, this thesis progressively presents the methodology used to compile the dictionary in progress, which relies heavily on software for corpus exploitation (WordSmith Tools 4.0), for creating the terminological database (TermStar XV) and for generating the final entries (GENDIC). The thesis thus progressively presents the results obtained at each stage of the working method, together with 4,000 final entries (in this case English to Spanish) corresponding to the letters A, B, N, O, U and V of the dictionary. This PhD dissertation is the result of an ongoing process aimed at the creation of a bilingual corpus-based specialised active dictionary of the ceramic industry, with the Communicative Theory of Terminology (CTT) as its mainstay. 
According to the grounding principles of the CTT, this research has departed from a corpus-based approach in which terms have been analysed in vivo and characterised from the natural habitat in which they are given in specialised communication/discourse. In this light, it has been put forward how the study of terms – made possible thanks to the activity of compiling and describing them, called terminography – may be complemented by the wider projection of specialised lexicography for the compilation and elaboration of LSP, user-oriented and user-friendly quality products in the form of dictionaries. This specialised lexicographical dimension of the work has necessarily implied the need to renew the concept of speciality language dictionaries applied to the ceramic industry and has given way to the creation of a (prospective) active dictionary in this field with a marked emphasis on context. Accordingly, the importance of pragmatic aspects in a work of this sort has made it necessary to undertake an in-depth revision and analysis of the socio-economic context for the research in order to be able to establish and solve the specific terminological needs that the ceramic industrial discourse community may find. On the basis of this theoretical framework, the method of study followed for the development of the prospective dictionary has comprised 8 broad stages: the stage of work preparation and corpus compilation, the elaboration of the field diagram, the stage of documentary corpus management, term extraction, data processing, revision and normalisation and finally, the edition stage. Two main types of results have been presented: those obtained through work in progress in the different stages of the method and final ones strictly speaking, that is, 4,000 English-Spanish entries in their final format (as they will appear in the prospective dictionary) belonging to the letters A, B, N, O, U and V of a complete dictionary which will include a total of 26,000 entries.

    Knowledge Expansion of a Statistical Machine Translation System using Morphological Resources

    The translation capability of a Phrase-Based Statistical Machine Translation (PBSMT) system depends mostly on parallel data, and phrases that are not present in the training data are not correctly translated. This paper describes a method that efficiently expands the existing knowledge of a PBSMT system without adding more parallel data but using external morphological resources. A set of new phrase associations is added to the translation and reordering models; each of them corresponds to a morphological variation of the source phrase, the target phrase, or both phrases of an existing association. New associations are generated using a string similarity score based on morphosyntactic information. We tested our approach on En-Fr and Fr-En translations, and results showed improvements in performance in terms of automatic scores (BLEU and Meteor) and a reduction in out-of-vocabulary (OOV) words. We believe that our knowledge expansion framework is generic and could be used to add different types of information to the model. JRC.G.2-Global security and crisis managemen

    Un environnement générique et ouvert pour le traitement des expressions polylexicales

    The treatment of multiword expressions (MWEs), like take off, bus stop and big deal, is a challenge for NLP applications. This kind of linguistic construction is not only arbitrary but also much more frequent than one would initially guess. This thesis investigates the behaviour of MWEs across different languages, domains and construction types, proposing and evaluating an integrated methodological framework for their acquisition. There have been many theoretical proposals to define, characterise and classify MWEs. We adopt a generic definition stating that MWEs are word combinations which must be treated as a unit at some level of linguistic processing. They present a variable degree of institutionalisation, arbitrariness, heterogeneity and limited syntactic and semantic variability. There has been much research on automatic MWE acquisition in recent decades, and the state of the art covers a large number of techniques and languages. Other tasks involving MWEs, namely disambiguation, interpretation, representation and applications, have received less emphasis in the field. The first main contribution of this thesis is the proposal of an original methodological framework for automatic MWE acquisition from monolingual corpora. This framework is generic, language independent, integrated and contains a freely available implementation, the mwetoolkit. It is composed of independent modules which may themselves use multiple techniques to solve a specific sub-task in MWE acquisition. The evaluation of MWE acquisition is modelled using four independent axes. We underline that the evaluation results depend on parameters of the acquisition context, e.g., nature and size of corpora, language and type of MWE, analysis depth, and existing resources. The second main contribution of this thesis is the application-oriented evaluation of our methodology proposal in two applications: computer-assisted lexicography and statistical machine translation. 
For the former, we evaluate the usefulness of automatic MWE acquisition with the mwetoolkit for creating three lexicons: Greek nominal expressions, Portuguese complex predicates and Portuguese sentiment expressions. For the latter, we test several integration strategies in order to improve the treatment given to English phrasal verbs when translated by a standard statistical MT system into Portuguese. Both applications can benefit from automatic MWE acquisition, as the expressions acquired automatically from corpora can both speed up and improve the quality of the results. The promising results of previous and ongoing experiments encourage further investigation into the optimal way to integrate MWE treatment into other applications. Thus, we conclude the thesis with an overview of the past, ongoing and future work.