71 research outputs found

    Language Resource Infrastructure(s)

    We have added an "(s)" to the title of this thesis because there is no single "Language Resource Infrastructure" but many Language Resource Infrastructures. Language resource infrastructures are all partially alike, since they share many common aspects, but each one is peculiar in its own way, with its own distinguishing characteristics. The community of linguists is very wide-ranging: scholars in the human and social sciences are linguists, as are those who work in (or consult for) more technical areas such as Machine Translation, Information Extraction, Question-Answering, and Web search engines. Each sub-community wants Language Resource Infrastructures to address its own requirements: resource availability, free download of resources normally available for a fee, feedback and comments on language resources, evaluation of language resources, and so on. User requirements often drive the design and modeling of the infrastructures more than information technology does; IT experts are asked to provide solutions for those requirements.
To confirm this aspect, we can cite two European projects, METANET and PANACEA: the former aims at building a network of repositories of language resources and technologies widely available to a growing linguistic community, while the latter is a platform designed for lexical acquisition and for managing multilingualism and Machine Translation issues, aimed at small and medium enterprises focused on such topics. Both projects have the language resource community as internal users, that is, as providers of language services, but they target different consumers of language resources and services. METANET is a project made by computational linguists for (computational) linguists, while PANACEA provides services for the Machine Translation industrial community. As a consequence, different requirements have led to different language resource infrastructures. In this thesis we describe both common and specific aspects of Language Resource Infrastructures and point out our contribution to the modeling of the high-level architecture of the infrastructure in both projects. In particular, we report our contribution in the area of Access and Identity Management, specifically user management and users' access to language resources

    The LRE Map disclosed

    This paper describes a serialization of the LRE Map database according to the RDF model. Due to the peculiar nature of the LRE Map, many ontologies are necessary to model the map in RDF, including newly created and reused ontologies. The importance of having the LRE Map in RDF and its connections to other open resources is also addressed

    The LREC 2010 Map of Language Resources and Tools

    A lexicon for biology and bioinformatics: the BOOTStrep experience

    This paper describes the design, implementation and population of a lexical resource for biology and bioinformatics (the BioLexicon) developed within an ongoing European project. The aim of this project is text-based knowledge harvesting for support to information extraction and text mining in the biomedical domain. The BioLexicon is a large-scale lexical-terminological resource encoding different information types in one single integrated resource. In the design of the resource we follow the ISO/DIS 24613 "Lexical Mark-up Framework" standard, which ensures reusability of the information encoded and easy exchange of both data and architecture. The design of the resource also takes into account the needs of our text mining partners who automatically extract syntactic and semantic information from texts and feed it to the lexicon. The present contribution first describes in detail the model of the BioLexicon along its three main layers: morphology, syntax and semantics; then, it briefly describes the database implementation of the model and the population strategy followed within the project, together with an example. The BioLexicon database in fact comes equipped with automatic uploading procedures based on a common exchange XML format, which guarantees that the lexicon can be properly populated with data coming from different sources

    Using LMF to Shape a Lexicon for the Biomedical Domain

    This paper describes the design, implementation and population of the BioLexicon in the framework of BootStrep, an FP6 project. The BioLexicon (BL) is a lexical resource designed for text mining in the bio-domain. It has been conceived to meet both domain requirements and upcoming ISO standards for lexical representation. The data model and data categories are compliant with the ISO Lexical Markup Framework and the Data Category Registry. The BioLexicon integrates features of lexicons and terminologies: term entries (and variants) derived from existing resources are enriched with linguistic features, including sub-categorization and predicate-argument information, extracted from texts. Thus, it is an extendable resource. Furthermore, the lexical entries will be aligned to concepts in the BioOntology, the ontological resource of the project. The BL implementation is an extensible relational database with automatic population procedures. Population relies on a dedicated input data structure that allows terms and their linguistic properties to be uploaded and "pulled and pushed" into the database. The BioLexicon shows that the state of the art is mature enough to aim at setting up a standard in this domain. Being conformant to lexical standards, the BioLexicon is interoperable and portable to other areas

    TimeML: An ontological mapping onto UIMA Type Systems

    We present TeR, a UIMA Type System (Ferrucci and Lally, 2004) for event recognition, for temporal annotation in an Italian corpus. We map each TimeML category (Pustejovsky et al., 2006) to one or more semantic types as they have been defined in the SIMPLE-CLIPS ontology (Ruimy et al., 2003). This mapping presents some advantages, such as the orthogonal inheritance that an event can acquire when derived from the ontology and a clearer definition of semantic roles when carried out by events. The mapping is implemented by means of a finite state automaton which uses semantic information collected from the SIMPLE-CLIPS ontology to analyze natural language texts

    “Tea for two”: the Archive of the Italian Latinity of the Middle Ages Meets the CLARIN Infrastructure

    This paper aims at showing how integrating the Archive of the Italian Latinity of the Middle Ages (ALIM) into the ILC4CLARIN repository can provide mutual benefits. Making ALIM available to a large community of scholars and researchers, on the one side, represents the first step to reduce the lack of resources for Medieval Latin in CLARIN and, on the other side, constitutes an unprecedented contribution to not only linguistic investigations, but also to the studies of the culture and science at the basis of the Western European society. The paper describes the adopted approach aiming to keep intact the structure of the archive and its metadata, which are both accurately mirrored into the ILC4CLARIN repository in order to maintain existing access practices of the users. This structure can be found in exactly the same state within the CLARIN VLO. Finally, the paper illustrates the advantages of experimenting with some ALIM data, once introduced within the CLARIN Language Resource Switchboard service: first results are shown from the analysis of some texts with the UDPipe tool suite and the distant reading tool Voyant

    UFRA: a UIMA-based Approach to Federated Language Resource Architecture

    In this paper we address the issue of developing an interoperable infrastructure for language resources and technologies. In our approach, called UFRA, we extend the Federated Database Architecture System by adding typical functionalities coming from UIMA. In this way, we capitalize on the advantages of a federated architecture, such as autonomy, heterogeneity and distribution of components, monitored by a central authority responsible for checking both the integration of components and user rights on performing different tasks. We use the UIMA approach to manage and define one common front-end, enabling users and clients to query, retrieve and use language resources and technologies. The purpose of this paper is to show how UIMA leads from a Federated Database Architecture to a Federated Resource Architecture, adding to a registry of available components both static resources such as lexicons and corpora and dynamic ones such as tools and general purpose language technologies. At the end of the paper, we present a case study that adopts this framework to integrate the SIMPLE lexicon and TimeML annotation guidelines to tag natural language texts