19 research outputs found

    Qualifying Ontology-Based Visual Query Formulation

    Abstract. This paper elaborates on ontology-based end-user visual query formulation, particularly for users who cannot, or do not wish to, use formal textual query languages to retrieve data because they lack the technical knowledge and skills. It then provides a set of quality attributes and features, primarily elicited through a series of industrial end-user workshops and user studies carried out in the course of an industrial EU project, to guide the design and development of future visual query systems.

    Experiencing OptiqueVQS: A Multi-paradigm and Ontology-based Visual Query System for End Users

    This is the author's post-print; the published version is available at http://link.springer.com/article/10.1007%2Fs10209-015-0404-5.
    Data access in an enterprise setting is a determining factor for value-creation processes such as sense-making, decision-making, and intelligence analysis. In an enterprise setting in particular, intuitive data access tools that directly engage domain experts with data could substantially increase competitiveness and profitability. In this respect, the use of ontologies as a natural communication medium between end users and computers has emerged as a prominent approach. To this end, this article introduces a novel ontology-based visual query system for end users, named OptiqueVQS. OptiqueVQS is built on a powerful and scalable data access platform and has a user-centric design supported by a widget-based, flexible, and extensible architecture that allows multiple coordinated representation and interaction paradigms to be employed. The results of a usability experiment performed with non-expert users suggest that OptiqueVQS provides a decent level of expressivity and high usability, and is hence quite promising.
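    The article describes OptiqueVQS at the design level only. As a rough, hypothetical sketch of what a widget-based visual query system might compile a user's selections into, the snippet below turns a selected concept and its property constraints into a SPARQL query; the vocabulary (ex:Well, ex:locatedIn, ex:GeologicalUnit) is invented for illustration and is not OptiqueVQS's.

        # Hypothetical compilation of visual-query selections into SPARQL.
        def compile_visual_query(concept, properties):
            """concept: selected class name; properties: (property, target class) pairs."""
            lines = [f"?x a ex:{concept} ."]
            for i, (prop, target) in enumerate(properties):
                var = f"?v{i}"
                lines.append(f"?x ex:{prop} {var} . {var} a ex:{target} .")
            body = "\n  ".join(lines)
            return ("PREFIX ex: <http://example.org/ontology#>\n"
                    f"SELECT ?x WHERE {{\n  {body}\n}}")

        print(compile_visual_query("Well", [("locatedIn", "GeologicalUnit")]))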

    Generic semantics-based task-oriented dialogue system framework for human-machine interaction in industrial scenarios

    In Industry 5.0, workers and their well-being are central to the production process. In this context, task-oriented dialogue systems let operators delegate the simplest tasks to industrial systems while they work on more complex ones. Moreover, being able to interact naturally with these systems reduces the cognitive load of using them and fosters user acceptance. However, most existing solutions do not allow natural communication, and current techniques for building such systems require large amounts of training data, which are scarce in these scenarios. As a result, task-oriented dialogue systems in industrial settings are highly specific, which limits their ability to be modified or reused in other scenarios, both tasks that entail considerable time and cost. Given these challenges, this thesis combines Semantic Web technologies with Natural Language Processing techniques to develop KIDE4I, a semantic task-oriented dialogue system for industrial environments that enables natural communication between humans and industrial systems. KIDE4I's modules are designed to be generic, allowing simple adaptation to new use cases. The modular TODO ontology is the core of KIDE4I; it models the domain and the dialogue process and stores the generated traces. KIDE4I has been implemented and adapted for use in four industrial use cases, demonstrating that the adaptation process is not complex and benefits from the use of resources.
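    The abstract says the TODO ontology models the dialogue process and stores its traces. As a minimal sketch, under invented terms rather than TODO's actual vocabulary, recording one dialogue turn as RDF with rdflib might look like this (assuming rdflib is installed).

        # Hypothetical trace of one dialogue turn as RDF triples.
        from rdflib import Graph, Namespace, Literal, RDF

        TODO = Namespace("http://example.org/todo#")  # invented namespace, not TODO's
        g = Graph()
        turn = TODO["turn-001"]
        g.add((turn, RDF.type, TODO.DialogueTurn))
        g.add((turn, TODO.utterance, Literal("start the conveyor belt")))
        g.add((turn, TODO.resolvedTask, TODO.StartConveyor))
        print(g.serialize(format="turtle"))  # rdflib >= 6 returns a str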

    Modelo de acesso a fontes em linguagem natural no governo electrónico

    Doctoral thesis in Informatics Engineering.
    For e-government to truly exist, it is necessary and crucial to provide public information and documentation and to make its access simple for citizens. A portion, not necessarily small, of these documents is unstructured and in natural language, and consequently beyond what current search systems can generally cope with and handle effectively. The thesis is that access to these contents can be improved using systems that process natural language and create structured information, particularly if supported by semantics. To put this thesis to the test, the work was developed in three major phases: (1) design of a conceptual model integrating the creation of structured information and its provision to various actors, in line with the vision of e-government 2.0; (2) definition and development of a prototype instantiating the key modules of this conceptual model, including ontology-based information extraction supported by examples of relevant information, knowledge management, and natural-language access; (3) assessment of the usability and acceptability of querying information as made possible by the prototype, and in consequence by the conceptual model, by users in a realistic scenario, including comparison with existing forms of access. In addition to this evaluation, at another level more related to technology assessment than to the model, the performance of the subsystem responsible for information extraction was evaluated. The evaluation results show that the proposed model was perceived as more effective and more useful than the alternatives. Together with the prototype's performance in extracting information from documents, which is comparable to the state of the art, the results demonstrate the feasibility and advantages, with current technology, of using natural language processing and the integration of semantic information to improve access to unstructured natural-language contents. The conceptual model and the prototype demonstrator are intended to contribute to the future existence of more sophisticated search systems that are also better suited to e-government. Transparency in governance, active citizenship, and greater agility in interactions with public administration, among others, require that citizens and businesses have quick and easy access to official information, even if it was originally created in natural language.
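    The abstract mentions ontology-based information extraction supported by examples of relevant information. The sketch below is a heavily simplified, hypothetical illustration of that idea (matching known ontology labels in official text and emitting structured records); it is not the prototype's actual method, and the class and label names are invented.

        # Toy example-driven, ontology-backed extraction over unstructured text.
        ONTOLOGY_LABELS = {  # hypothetical class -> known surface forms
            "Municipality": ["Lisboa", "Aveiro"],
            "Service": ["citizen card", "tax return"],
        }

        def extract(text):
            found = []
            for cls, labels in ONTOLOGY_LABELS.items():
                for label in labels:
                    if label.lower() in text.lower():
                        found.append((label, cls))
            return found

        print(extract("Renew your citizen card at any desk in Aveiro."))
        # [('Aveiro', 'Municipality'), ('citizen card', 'Service')]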

    D'un langage de haut niveau à des requêtes graphes permettant d'interroger le web sémantique

    Graph models are good candidates for knowledge representation on the Web, where everything is a graph: from the graph of machines connected to the Internet, through RDF triples and ontologies, to Tim Berners-Lee's "Giant Global Graph". In this context, the ontological query-answering problem is the following: given a knowledge base composed of a terminological component and an assertional component, and a query, does the knowledge base imply the query, i.e., is there an answer to the query in the knowledge base? In recent years, description logics have been proposed in which the expressivity of the ontology is restricted so that query answering becomes tractable; the most prominent members are the DL-Lite and EL families. OWL 2 restricts OWL-DL in this direction, building on these families. We work in the framework of graph formalisms for knowledge representation (RDF, RDFS, and OWL) and querying (SPARQL). Even though graph-based query languages have long been presented by their promoters as natural and intuitive, end users do not think of their queries in terms of graphs. They want simple languages that are as close as possible to natural language, or even limited to keywords. We propose a generic way of translating a query expressed in a high-level language into a query expressed in the SPARQL graph query language, by means of query patterns. The beginning of this work coincided with the W3C's initiative to prepare a possible new version of RDF and with the standardization process of SPARQL 1.1, which handles entailment in queries.
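    As a hedged illustration of the query-pattern idea (not the thesis's actual patterns), the sketch below maps one hypothetical natural-language form onto a parameterized SPARQL template; the ex: vocabulary is invented.

        # One toy query pattern: "who wrote X" -> parameterized SPARQL.
        import re

        PATTERN = re.compile(r"who wrote (?P<work>.+)", re.IGNORECASE)
        TEMPLATE = (
            "PREFIX ex: <http://example.org/ontology#>\n"
            "SELECT ?author WHERE {{\n"
            "  ?work ex:title \"{work}\" ;\n"
            "        ex:author ?author .\n"
            "}}"
        )

        def to_sparql(question):
            m = PATTERN.match(question.strip().rstrip("?"))
            return TEMPLATE.format(work=m.group("work")) if m else None

        print(to_sparql("Who wrote Les Misérables?"))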

    Context-awareness for adaptive information retrieval systems

    Philosophiae Doctor (PhD) thesis.
    This research investigates the optimization of information retrieval systems (IRS) for individual information needs, ranking results in order of relevance. It addresses the development of algorithms that optimize the ranking of documents retrieved from an IRS. The thesis presents two aspects of context-awareness in IR. First, the design of the context of information: the context of a query determines the relevance of the retrieved information, so executing the same query in different contexts often leads to different result rankings. Second, the relevant aspects of context should be incorporated in a way that supports the knowledge domain representing users' interests. The thesis uses evolutionary algorithms to improve the effectiveness of IRS. A context-based information retrieval system is developed whose retrieval effectiveness is evaluated using precision and recall metrics. The results demonstrate how attributes of user interaction behaviour can be used to improve IR effectiveness.
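    Since the system is evaluated with precision and recall, a short worked example may help; the document sets and numbers below are invented, not the thesis's data.

        # Precision and recall for one query, given retrieved and relevant sets.
        def precision_recall(retrieved, relevant):
            hits = len(retrieved & relevant)
            precision = hits / len(retrieved) if retrieved else 0.0
            recall = hits / len(relevant) if relevant else 0.0
            return precision, recall

        # 3 of the 4 retrieved documents are relevant; 3 of 6 relevant ones were found.
        p, r = precision_recall({"d1", "d2", "d3", "d4"},
                                {"d1", "d2", "d3", "d5", "d6", "d7"})
        print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.75 recall=0.50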

    Knowledge Extraction for Hybrid Question Answering

    Since Tim Berners-Lee's proposal of hypertext to his employer CERN on March 12, 1989, the World Wide Web has grown to more than one billion Web pages and is still growing. With the later Semantic Web vision, Berners-Lee et al. suggested an extension of the existing (Document) Web to allow better reuse, sharing, and understanding of data. Both the Document Web and the Web of Data (the current implementation of the Semantic Web) grow continuously. This is a mixed blessing, as the two forms of the Web grow concurrently and most commonly contain different pieces of information. Modern information systems must thus bridge a Semantic Gap to allow holistic and unified access to information, independent of how the data is represented. One way to bridge the gap between the two forms of the Web is the extraction of structured data, i.e., RDF, from the growing amount of unstructured and semi-structured information (e.g., tables and XML) on the Document Web; unstructured data here stands for any type of textual information, such as news, blogs, or tweets. While extracting structured data from unstructured data allows the development of powerful information systems, it requires high-quality and scalable knowledge extraction frameworks to lead to useful results. The dire need for such approaches has led to a multitude of annotation frameworks and tools. However, most of these approaches are not evaluated on the same datasets or using the same measures. The resulting Evaluation Gap needs to be tackled by a concise evaluation framework that fosters fine-grained and uniform evaluations of annotation tools and frameworks over any knowledge base. Moreover, with the constant growth of data and the ongoing decentralization of knowledge, intuitive ways for non-experts to access the generated data are required. Humans have adapted their search behaviour to current Web data through access paradigms such as keyword search, and most Web users consequently expect only Web documents in return. However, humans think, and most commonly express their information needs, in natural language rather than in keyword phrases, and answering complex information needs often requires combining knowledge from various, differently structured data sources. We thus observe an Information Gap between natural-language questions and current keyword-based search paradigms, which in addition do not make use of the available structured and unstructured data sources. Question Answering (QA) systems provide an easy and efficient way to bridge this gap by allowing data to be queried via natural language, thus reducing (1) a possible loss of precision and (2) a potential loss of time when reformulating the search intention into a machine-readable form. Furthermore, QA systems answer natural-language queries with concise results instead of links to verbose Web documents, and they allow and encourage access to, and the combination of, knowledge from heterogeneous knowledge bases (KBs) within one answer. Consequently, three main research gaps are considered and addressed in this work.
    First, addressing the Semantic Gap between the unstructured Document Web and the Web of Data requires scalable and accurate approaches for the extraction of structured data in RDF. This thesis presents CETUS, an approach for recognizing entity types to populate RDF KBs. Furthermore, our knowledge-base-agnostic disambiguation framework AGDISTIS can efficiently detect the correct URIs for a given set of named entities. Additionally, we introduce REX, a Web-scale framework for RDF extraction from semi-structured (i.e., templated) websites, which makes use of the semantics of the reference knowledge base to check the extracted data.
    Second, the ongoing research on closing the Semantic Gap has already yielded a large number of annotation tools and frameworks. However, these approaches are still hard to compare, since the published evaluation results are calculated on diverse datasets and with different measures. The issue of comparability is not intrinsic to the annotation task: it is well established that scientists spend between 60% and 80% of their time preparing data for experiments, and data preparation is such a tedious problem in the annotation domain mostly because of the different formats of the gold standards and the different data representations across reference datasets. We tackle the resulting Evaluation Gap in two ways. First, we introduce a collection of three novel datasets, dubbed N3, to leverage the possibility of optimizing NER and NED algorithms via Linked Data and to ensure maximal interoperability, thereby overcoming the need for corpus-specific parsers. Second, we present GERBIL, an evaluation framework for semantic entity annotation. The rationale behind the framework is to provide developers, end users, and researchers with easy-to-use interfaces that allow for the agile, fine-grained, and uniform evaluation of annotation tools and frameworks on multiple datasets.
    Third, the decentralized architecture of the Web has led to pieces of information being distributed across data sources with varying structure, and the increasing demand for natural-language interfaces, as seen in current mobile applications, requires systems to deeply understand the underlying user information need. A natural-language interface for asking questions therefore requires a hybrid approach to data usage, i.e., simultaneously searching full texts and semantic knowledge bases. To close the Information Gap, this thesis presents HAWK, a novel entity search approach developed for hybrid QA that combines structured RDF and unstructured full-text data sources.
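    The abstract does not detail HAWK's pipeline. As a hedged illustration of the hybrid idea only, the sketch below merges candidate answers from a structured KB lookup and a full-text search and ranks the union; kb_lookup and text_search are stand-in callables, not HAWK's API.

        # Toy hybrid QA: merge candidates from a KB endpoint and a full-text index.
        def hybrid_answers(question, kb_lookup, text_search, k=5):
            """Both callables are stand-ins returning (answer, score) pairs."""
            scores = {}
            for answer, score in kb_lookup(question) + text_search(question):
                scores[answer] = max(scores.get(answer, 0.0), score)
            return sorted(scores, key=scores.get, reverse=True)[:k]

        kb = lambda q: [("Berlin", 0.9)]                      # fake structured results
        ft = lambda q: [("Berlin", 0.7), ("Bonn", 0.4)]       # fake full-text results
        print(hybrid_answers("capital of Germany?", kb, ft))  # ['Berlin', 'Bonn']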

    Bootstrapping named entity resources for adaptive question answering systems

    Question Answering (QA) systems add to traditional search engines the ability to find precise answers to user questions. Their objective is to ease information access by reducing the time and effort the user needs to find a concrete piece of information in a list of relevant documents. This thesis comprises two lines of work related to QA systems.
    The first part introduces an architecture for QA systems for Spanish based on the combination and adaptation of different techniques from Information Retrieval (IR) and Information Extraction (IE). The architecture is composed of three main modules covering question analysis, relevant passage retrieval, and answer extraction and selection. The processing of Named Entities (NEs) has received special attention because they are frequently the theme of questions or good candidate answers. The proposed architecture has been implemented as part of the MIRACLE QA system, which was independently evaluated over several editions of the CLEF@QA shared task of the Cross-Language Evaluation Forum (CLEF); the participations and results from 2004 to 2007 are described in depth. The MIRACLE QA system obtained moderate performance, with first-answer accuracy between 20% and 30%. Notable among these are the results of the 2005 main task and the 2006 real-time QA pilot task, RealTimeQA, which included response time as an additional evaluation factor beyond answer correctness. These results back the proposed architecture as a viable option for QA over textual collections and corroborate similar findings for English and other languages. On the other hand, the analysis of results across CLEF editions and the comparison with other QA systems point to new problems and challenges. In our experience, QA systems are harder to adapt to other domains and languages than IR systems. This problem is inherited from the use of complex language analysis tools such as morphological analyzers, parsers, and semantic analyzers, including tools for NE Recognition and Classification (NERC) and Relation Detection and Characterization (RDC).
    Given this difficulty of adapting QA systems to different domains and collections, the second part of this thesis investigates a different approach based on knowledge acquisition through lightly supervised learning methods. The goal is to acquire semantic resources useful for NERC and RDC from unannotated text collections, while avoiding dependencies on language analysis tools so that the techniques remain portable across domains and languages. First, we studied previous work on semi-supervised NERC and RDC from a few examples (bootstrapping), proposing a common architecture and comparing the different functions used to evaluate and select intermediate results, both instances and patterns. The main proposal is a new algorithm that simultaneously and iteratively acquires instances and patterns associated with a relation. It can also acquire several relations at once, using a mutual-exclusion hypothesis among relations to achieve better results. A distinctive characteristic is its index-based exploration strategy of the text collection, which makes it possible to acquire knowledge from large collections. Candidate selection and evaluation are based on incrementally building a graph of instances and patterns, which also justifies our selection method; the discovery procedure is analogous to the exploration frontier of a web crawler and finds the instances most similar to the seeds given the available evidence. This algorithm has been implemented in the SPINDEL system. For its evaluation, we started with the concrete case of acquiring resources for the most common NE classes: Person, Location, and Organization. The objective is to acquire names associated with each class, as well as contextual patterns that detect mentions belonging to that class. We present results for two languages, Spanish and English, and, for Spanish, for two different domains: news and texts from a collaborative encyclopedia, Wikipedia. In both cases the use of language analysis tools was limited, in line with the goal of moving towards language independence. Starting from fewer than 40 seeds per class, the bootstrapping process acquires name lists on the order of 30,000 instances of variable quality, together with lists of indicative patterns associated with each entity class. An indirect evaluation confirms the utility of both resources for classifying Named Entities using a simple, dictionary-only approach: the best configuration obtains an F-measure of 67.17 for Spanish and 55.99 for English, and the acquired patterns help to improve coverage in both cases. The module requires less development effort than supervised approaches once the need for annotation is taken into account, although its performance is not yet on a par with them. In short, this research is a first step towards the development of semantic applications, such as QA systems, that require less effort to adapt to a new domain or language.
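    As a rough, assumption-laden illustration of the bootstrapping idea (far simpler than SPINDEL's index-driven, graph-based candidate selection), the sketch below alternates between harvesting contextual patterns from known instances and harvesting new instances with those patterns; the corpus, seeds, and acceptance criterion are toy stand-ins.

        # Naive instance/pattern bootstrapping over a list of sentences.
        def bootstrap(corpus, seeds, rounds=3):
            instances, patterns = set(seeds), set()
            for _ in range(rounds):
                # Learn contextual patterns from sentences mentioning known instances.
                for sent in corpus:
                    for inst in list(instances):
                        if inst in sent:
                            patterns.add(sent.replace(inst, "<X>"))
                # Apply patterns to harvest new instances from the corpus.
                for sent in corpus:
                    for pat in patterns:
                        prefix, _, suffix = pat.partition("<X>")
                        if sent.startswith(prefix) and sent.endswith(suffix):
                            inst = sent[len(prefix):len(sent) - len(suffix)]
                            if inst:
                                instances.add(inst)
            return instances, patterns

        corpus = ["Madrid is a city", "Oslo is a city", "Oslo hosted the meeting"]
        insts, _ = bootstrap(corpus, {"Madrid"})
        print(insts)  # {'Madrid', 'Oslo'}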