216 research outputs found

    D6.2 Integrated Final Version of the Components for Lexical Acquisition

    The PANACEA project has addressed one of the most critical bottlenecks that threaten the development of technologies to support multilingualism in Europe, and to process the huge quantity of multilingual data produced annually. Any attempt at automated language processing, particularly Machine Translation (MT), depends on the availability of language-specific resources. Such Language Resources (LR) contain information about the language's lexicon, i.e. the words of the language and the characteristics of their use. In Natural Language Processing (NLP), LRs contribute information about the syntactic and semantic behaviour of words - i.e. their grammar and their meaning - which inform downstream applications such as MT. To date, many LRs have been generated by hand, requiring significant manual labour from linguistic experts. However, proceeding manually, it is impossible to supply LRs for every possible pair of European languages, textual domain, and genre, which are needed by MT developers. Moreover, an LR for a given language can never be considered complete or final because of the characteristics of natural language, which continually undergoes changes, especially spurred on by the emergence of new knowledge domains and new technologies. PANACEA has addressed this challenge by building a factory of LRs that progressively automates the stages involved in the acquisition, production, updating and maintenance of LRs required by MT systems. The existence of such a factory will significantly cut down the cost, time and human effort required to build LRs. WP6 has addressed the lexical acquisition component of the LR factory, that is, the techniques for automated extraction of key lexical information from texts, and the automatic collation of lexical information into LRs in a standardized format.
The goal of WP6 has been to take existing techniques capable of acquiring syntactic and semantic information from corpus data, improve upon them, adapt and apply them to multiple languages, and turn them into powerful and flexible techniques capable of supporting massive applications. One focus for improving the scalability and portability of lexical acquisition techniques has been to extend existing techniques with more powerful, less "supervised" methods. In NLP, the amount of supervision refers to the amount of manual annotation which must be applied to a text corpus before machine learning or other techniques are applied to the data to compile a lexicon. More manual annotation means more accurate training data, and thus a more accurate LR. However, given that it is impractical from a cost and time perspective to manually annotate the vast amounts of data required for multilingual MT across domains, it is important to develop techniques which can learn from corpora with less supervision. Less supervised methods are capable of supporting both large-scale acquisition and efficient domain adaptation, even in domains where data is scarce. Another focus of lexical acquisition in PANACEA has been the need of LR users to tune the accuracy level of LRs. Some applications may require increased precision, or accuracy, where the application requires a high degree of confidence in the lexical information used. At other times a greater level of coverage may be required, with information about more words at the expense of some degree of accuracy. Lexical acquisition in PANACEA has investigated confidence thresholds for lexical acquisition to ensure that the ultimate users of LRs can generate lexical data from the PANACEA factory at the desired level of accuracy.

    D7.1. Criteria for evaluation of resources, technology and integration.

    This deliverable defines how evaluation is carried out at each integration cycle in the PANACEA project. As PANACEA aims at producing large-scale resources, evaluation becomes a critical and challenging issue. It is critical because it is important to assess the quality of the results that should be delivered to users. It is challenging because we are exploring rather new areas, and doing so through a technical platform: some new methodologies will have to be explored and some old ones adapted.

    D4.1. Technologies and tools for corpus creation, normalization and annotation

    The objectives of the Corpus Acquisition and Annotation (CAA) subsystem are the acquisition and processing of the monolingual and bilingual language resources (LRs) required in the PANACEA context. The CAA subsystem therefore includes: i) a Corpus Acquisition Component (CAC) for extracting monolingual and bilingual data from the web, ii) a Cleanup and Normalization Component (CNC) for these data and iii) a Text Processing Component (TPC), which consists of NLP tools including modules for sentence splitting, POS tagging, lemmatization, parsing and named entity recognition.

    D7.4 Third evaluation report. Evaluation of PANACEA v3 and produced resources

    D7.4 reports on the evaluation of the different components integrated in the PANACEA third cycle of development as well as the final validation of the platform itself. All validation and evaluation experiments follow the evaluation criteria already described in D7.1. The main goal of the WP7 tasks was to test the (technical) functionalities and capabilities of the middleware that allows the integration of the various resource-creation components into an interoperable distributed environment (WP3) and to evaluate the quality of the components developed in WP5 and WP6. The content of this deliverable is thus complementary to D8.2 and D8.3, which tackle advantages and usability in industrial scenarios. It has to be noted that the PANACEA third cycle of development addressed many components that are still under research. The main goal for this evaluation cycle is thus to assess the methods experimented with and their potential for becoming actual production tools to be exploited outside research labs. For most of the technologies, an attempt was made to re-interpret standard evaluation measures, usually expressed in terms of accuracy, precision and recall, as measures of the reduction in costs (time and human resources) relative to current practices based on the manual production of resources. To do so, the different tools had to be tuned and adapted to maximize precision, and for some tools an attempt was made to offer confidence measures that could separate out the resources that still needed manual revision. Furthermore, the extension to other languages in addition to English, also a PANACEA objective, has been evaluated. The main facts about the evaluation results are now summarized.

    Language Resource Infrastructure(s)

    We have added an "(s)" to the title of this thesis because there is not one single "Language Resource Infrastructure" but many Language Resource Infrastructures. Language resource infrastructures are all partially alike, since they have many aspects in common, but each one is peculiar in its own way, since it has its own distinguishing characteristics. The community of linguists is very wide-ranging: scholars in the human and social sciences are linguists, as are those who work in (or consult for) more technical fields such as Machine Translation, Information Extraction, Question-Answering, and the search engines and technologies available on the Web. Each sub-community wants Language Resource Infrastructures to address its own requirements: resource availability, free download of software and resources that normally carry a fee, feedback and comments on language resources, evaluation of language resources and so on. We can say that user requirements often drive the architectural design and modeling of the infrastructures, while information technology is used to find solutions to those requirements.
To confirm this aspect, we can cite two European projects, METANET and PANACEA: the former aims at building a network of repositories of language tools and data accessible to a wider linguistic community, while the latter is a platform designed for lexical acquisition and for managing multilingualism and Machine Translation issues, intended for use by industry, in particular small and medium enterprises focused on such topics. Both projects have the language resource community as internal users, that is to say, as providers of language services, but they target different consumers of language resources and services. METANET is a project made by computational linguists for (computational) linguists, while PANACEA provides services for the Machine Translation industrial community. As a consequence, different user requirements have led to different language resource infrastructures. In this thesis we describe both the common and the specific aspects of Language Resource Infrastructures and point out our contribution to the modeling of the high-level architecture of the infrastructure in both projects. In particular, we report our contribution to the definition of the architectural modules for Access and Identity Management (authentication and authorization) and, more generally, for user management and users' access to language resources.

    Third version (v4) of the integrated platform and documentation

    The deliverable describes the third and final version of the PANACEA platform.

    D6.1: Technologies and Tools for Lexical Acquisition

    This report describes the technologies and tools to be used for Lexical Acquisition in PANACEA. It includes descriptions of existing technologies and tools which can be built on and improved within PANACEA, as well as of new technologies and tools to be developed and integrated in the PANACEA platform. The report also specifies the Lexical Resources to be produced. Four main areas of lexical acquisition are included: Subcategorization Frames (SCFs), Selectional Preferences (SPs), Lexical-semantic Classes (LCs), for both nouns and verbs, and Multi-Word Expressions (MWEs).

    D3.1. Architecture and design of the platform

    This document aims to establish the requirements, the technological basis and the design of the PANACEA platform. These are the main goals of the document:
    - Survey the different technological approaches that can be used in PANACEA.
    - Specify some guidelines for the metadata.
    - Establish the requirements for the platform.
    - Make a Common Interface proposal for the tools.
    - Propose a format for the data to be exchanged by the tools (Travelling Object).
    - Choose the technologies that will be used to develop the platform.
    - Propose a workplan.

    Interoperability Framework: The FLaReNet action plan proposal

    Standards are fundamental to exchanging, preserving, maintaining and integrating data and language resources, and are an essential basis of any language resource infrastructure. This paper promotes an Interoperability Framework as a dynamic environment of standards and guidelines, also intended to support the provision of language-(web)service interoperability. In the past two decades, the need to define common practices and formats for linguistic resources has been increasingly recognized and sought. Today open, collaborative, shared data is at the core of a sound language strategy, and standardisation is actively on the move. This paper first describes the current landscape of standards and presents the major barriers to their adoption; then, it describes those scenarios that critically involve the use of standards and provide a strong motivation for their adoption; lastly, it proposes a series of actions and steps needed to operationalise standards and achieve full interoperability for Language Resources and Technologies.

    A MWE Acquisition and Lexicon Builder Web Service

    This paper describes the development of a web-service tool for the automatic extraction of multi-word expression (MWE) lexicons, which has been integrated in a distributed platform for the automatic creation of linguistic resources. The main purpose of the work described is thus to provide a (computationally "light") tool that produces a full lexical resource: multi-word terms/items with relevant and useful attached information that can be used for more complex processing tasks and applications (e.g. parsing, MT, IE, query expansion, etc.). The output of our tool is an MWE lexicon formatted and encoded in XML according to the Lexical Markup Framework. The tool is already functional and available as a service. Evaluation experiments show that the tool's precision is about 80%.