71 research outputs found
Language Resource Infrastructure(s)
We have added an "(s)" to the title of this thesis because there is no single Language Resource Infrastructure, but many. These infrastructures share many common aspects, yet each has its own distinguishing characteristics.
The community of linguists is very diverse: it includes scholars in the social sciences and humanities as well as those who work in, or consult for, more technical areas such as Machine Translation, Information Extraction, Question-Answering and Web search engines. Each sub-community expects a Language Resource Infrastructure to address its own requirements: availability of resources, free download of resources (including software) that are normally available for a fee, feedback and comments on language resources, evaluation of language resources, and so on. User requirements therefore often drive the architectural design and modeling of the infrastructures, while information technology is used to provide solutions to those requirements. Two European projects, METANET and PANACEA, illustrate this point: the former aims at building a network of repositories of language resources and technologies accessible to a growing community of linguists, while the latter is a platform designed for lexical acquisition and for managing multilingualism and Machine Translation issues, intended for small and medium enterprises working in these areas.
Both projects have the language resource community as internal users, that is, as providers of language services, but they target different communities of consumers of language resources and services.
METANET is a project made by computational linguists for (computational) linguists, while PANACEA provides services for the Machine Translation industrial community. As a consequence, different user requirements have led to different language resource infrastructures.
In this thesis we describe both the common and the specific aspects of Language Resource Infrastructures and point out our contribution to the modeling of the high-level architecture of the infrastructure in both projects. In particular, we report our contribution in the area of Access and Identity Management: the architectural modules for authentication and authorization and, more generally, for user management and user access to language resources.
The LRE Map disclosed
This paper describes a serialization of the LRE Map database according to the RDF model. Due to the peculiar nature of the LRE Map, many ontologies, both newly created and reused, are necessary to model the map in RDF. The importance of having the LRE Map in RDF and its connections to other open resources is also addressed.
A lexicon for biology and bioinformatics: the BOOTStrep experience
This paper describes the design, implementation and population of a lexical resource for biology and bioinformatics (the BioLexicon) developed within an ongoing European project. The aim of this project is text-based knowledge harvesting for support to information extraction and text mining in the biomedical domain. The BioLexicon is a large-scale lexical-terminological resource encoding different information types in one single integrated resource. In the design of the resource we follow the ISO/DIS 24613 "Lexical Mark-up Framework" standard, which ensures reusability of the information encoded and easy exchange of both data and architecture. The design of the resource also takes into account the needs of our text mining partners who automatically extract syntactic and semantic information from texts and feed it to the lexicon. The present contribution first describes in detail the model of the BioLexicon along its three main layers: morphology, syntax and semantics; then, it briefly describes the database implementation of the model and the population strategy followed within the project, together with an example. The BioLexicon database in fact comes equipped with automatic uploading procedures based on a common exchange XML format, which guarantees that the lexicon can be properly populated with data coming from different sources.
Using LMF to Shape a Lexicon for the Biomedical Domain
This paper describes the design, implementation and population of the BioLexicon in the framework of BootStrep, an FP6 project. The BioLexicon (BL) is a lexical resource designed for text mining in the bio-domain. It has been conceived to meet both domain requirements and upcoming ISO standards for lexical representation. The data model and data categories are compliant with the ISO Lexical Markup Framework and the Data Category Registry. The BioLexicon integrates features of lexicons and terminologies: term entries (and variants) derived from existing resources are enriched with linguistic features, including sub-categorization and predicate-argument information, extracted from texts. Thus, it is an extendable resource. Furthermore, the lexical entries will be aligned to concepts in the BioOntology, the ontological resource of the project. The BL implementation is an extensible relational database with automatic population procedures. Population relies on a dedicated input data structure that allows terms and their linguistic properties to be uploaded and "pulled-and-pushed" into the database. The BioLexicon experience shows that the state of the art is mature enough to aim at setting up a standard in this domain. Being conformant to lexical standards, the BioLexicon is interoperable and portable to other areas.
TimeML: An ontological mapping onto UIMA Type Systems
We present TeR, a UIMA Type System (Ferrucci and Lally, 2004) for event recognition and temporal annotation in an Italian corpus. We map each TimeML category (Pustejovsky et al., 2006) to one or more semantic types as they have been defined in the SIMPLE-CLIPS ontology (Ruimy et al., 2003). This mapping presents some advantages, such as the orthogonal inheritance that an event can acquire when derived from the ontology and a clearer definition of semantic roles when carried out by events. The mapping is implemented by means of a finite state automaton which uses semantic information collected from the SIMPLE-CLIPS ontology to analyze natural language texts.
“Tea for two”: the Archive of the Italian Latinity of the Middle Ages Meets the CLARIN Infrastructure
This paper aims at showing how integrating the Archive of the Italian Latinity of the Middle Ages (ALIM) into the ILC4CLARIN repository can provide mutual benefits. Making ALIM available to a large community of scholars and researchers, on the one side, represents the first step to reduce the lack of resources for Medieval Latin in CLARIN and, on the other side, constitutes an unprecedented contribution to not only linguistic investigations, but also to the studies of the culture and science at the basis of the Western European society. The paper describes the adopted approach aiming to keep intact the structure of the archive and its metadata, which are both accurately mirrored into the ILC4CLARIN repository in order to maintain existing access practices of the users. This structure can be found in exactly the same state within the CLARIN VLO. Finally, the paper illustrates the advantages of experimenting with some ALIM data, once introduced within the CLARIN Language Resource Switchboard service: first results are shown from the analysis of some texts with the UDPipe tool suite and the distant reading tool Voyant.
Bio-Lexicon DataBase: Architecture, Concepts and Loading Software
No abstract available
UFRA: a UIMA-based Approach to Federated Language Resource Architecture
In this paper we address the issue of developing an interoperable infrastructure for language resources and technologies. In our approach, called UFRA, we extend the Federated Database Architecture System by adding typical functionalities coming from UIMA. In this way, we capitalize on the advantages of a federated architecture, such as autonomy, heterogeneity and distribution of components, monitored by a central authority responsible for checking both the integration of components and user rights on performing different tasks. We use the UIMA approach to manage and define one common front-end, enabling users and clients to query, retrieve and use language resources and technologies. The purpose of this paper is to show how UIMA leads from a Federated Database Architecture to a Federated Resource Architecture, adding to a registry of available components both static resources such as lexicons and corpora and dynamic ones such as tools and general purpose language technologies. At the end of the paper, we present a case study that adopts this framework to integrate the SIMPLE lexicon and TimeML annotation guidelines to tag natural language texts.