354 research outputs found

    Enabling Complex Semantic Queries to Bioinformatics Databases through Intuitive Search Over Data

    Data integration promises to be one of the main catalysts in enabling new insights to be drawn from the wealth of biological data already publicly available. However, the heterogeneity of the existing data sources still poses significant challenges for achieving interoperability among biological databases. Furthermore, merely solving the technical challenges of data integration, for example through the use of common data representation formats, leaves open the larger problem: the steep learning curve required for understanding the data models of each public source, as well as the technical language through which the sources can be queried and joined. As a consequence, most of the available biological data remain practically unexplored today. In this thesis, we address these problems jointly, by first introducing an ontology-based data integration solution in order to mitigate the data source heterogeneity problem. We illustrate through the concrete example of Bgee, a gene expression data source, how relational databases can be exposed as virtual Resource Description Framework (RDF) graphs, through relational-to-RDF mappings. This has the important advantage that the original data source can remain unmodified, while still becoming interoperable with external RDF sources. We complement our methods with applied case studies designed to guide domain experts in formulating expressive federated queries targeting the integrated data across the domains of evolutionary relationships and gene expression. More precisely, we introduce two comparative analyses, first within the same domain (using orthology data from multiple, interoperable data sources) and second across domains, in order to study the relation between expression change and evolution rate following a duplication event.
Finally, in order to bridge the semantic gap between users and data, we design and implement Bio-SODA, a question answering system over domain knowledge graphs that does not require training data for translating user questions to SPARQL. Bio-SODA uses a novel ranking approach that combines syntactic and semantic similarity, while also incorporating node centrality metrics, to rank candidate matches for a given user question. Our results in testing Bio-SODA across several real-world databases that span multiple domains (both within and outside bioinformatics) show that it can answer complex, multi-fact queries, beyond the current state of the art in the more well-studied open-domain question answering.

    Experiencing OptiqueVQS: A Multi-paradigm and Ontology-based Visual Query System for End Users

    This is the author's post-print version; the published version is available at http://link.springer.com/article/10.1007%2Fs10209-015-0404-5. Data access in an enterprise setting is a determining factor for value creation processes, such as sense-making, decision-making, and intelligence analysis. In particular, intuitive data access tools that directly engage domain experts with data could substantially increase competitiveness and profitability. In this respect, the use of ontologies as a natural communication medium between end users and computers has emerged as a prominent approach. To this end, this article introduces a novel ontology-based visual query system for end users, named OptiqueVQS. OptiqueVQS is built on a powerful and scalable data access platform and has a user-centric design supported by a flexible and extensible widget-based architecture, which allows multiple coordinated representation and interaction paradigms to be employed. The results of a usability experiment performed with non-expert users suggest that OptiqueVQS provides a decent level of expressivity and high usability, and hence is quite promising.

    Support for taxonomic data in systematics

    The Systematics community works to increase our understanding of biological diversity through identifying and classifying organisms and using phylogenies to understand the relationships between those organisms. It has made great progress in the building of phylogenies and in the development of algorithms. However, it has insufficient provision for preserving research outcomes and making them widely accessible and queryable, and this is where database technologies can help. This thesis makes a contribution in the area of database usability by addressing the query needs present in the community, as supported by the analysis of query logs. It clearly formulates the user requirements in the area of phylogeny and classification queries. It then reports on the use of warehousing techniques in the integration of data from many sources to satisfy those requirements. It shows how to perform query expansion with synonyms and vernacular names, and how to implement hierarchical query expansion effectively. A detailed analysis of the improvements offered by those query expansion techniques is presented. This is supported by the exposition of the database techniques underlying this development, and of the user and programming interfaces (web services) which make this novel development available to both end users and programs.

    PowerAqua: Open Question Answering on the Semantic Web

    With the rapid growth of semantic information on the Web, the processes of searching and querying these very large amounts of heterogeneous content have become increasingly challenging. This research tackles the problem of supporting users in querying and exploring information across multiple and heterogeneous Semantic Web (SW) sources. A review of the literature on ontology-based Question Answering reveals the limitations of existing technology. Our approach is based on providing a natural language Question Answering interface for the SW, PowerAqua. The realization of PowerAqua represents a considerable advance with respect to other systems, which restrict their scope to an ontology-specific or homogeneous fraction of the publicly available SW content. To our knowledge, PowerAqua is the only system that is able to take advantage of the semantic data available on the Web to interpret and answer user queries posed in natural language. In particular, PowerAqua is uniquely able to answer queries by combining and aggregating information, which can be distributed across heterogeneous semantic resources. Here, we provide a complete overview of our work on PowerAqua, including: the research challenges it addresses; its architecture; the techniques we have realised to map queries to semantic data, to integrate partial answers drawn from different semantic resources and to rank alternative answers; and the evaluation studies we have performed to assess the performance of PowerAqua. We believe our experiences can be extrapolated to a variety of end-user applications that wish to open up to large-scale and heterogeneous structured datasets, to be able to exploit effectively what possibly is the greatest wealth of data in the history of Artificial Intelligence.

    Using natural language processing for question answering in closed and open domains

    With the growth in the amount of social, environmental, and biomedical information available digitally, there is a growing need for Question Answering (QA) systems that can empower users to master this new wealth of information. Despite recent progress in QA, the quality of interpretation and extraction of the desired answer is not adequate. We believe that achieving higher accuracy in QA systems remains an open research problem; that is, it is better to have no answer than a wrong one. However, there are diverse queries which state-of-the-art QA systems cannot interpret and answer properly. The problem of interpreting a question in a way that preserves its syntactic-semantic structure is considered one of the most important challenges in this area. In this work we focus on the problems of semantic-based QA systems and analyse the effectiveness of NLP techniques, query mapping, and answer inferencing in both closed (first scenario) and open (second scenario) domains. For this purpose, the architecture of a Semantic-based closed and open domain Question Answering System (hereafter "ScoQAS") over ontology resources is presented with two different prototypes: an ontology-based closed domain and an open domain over Linked Open Data (LOD) resources. ScoQAS is based on NLP techniques that combine semantic-based structure-feature patterns for question classification with the creation of a question syntactic-semantic information structure (QSiS). The QSiS provides a means of building constraints to formulate the related terms in their syntactic-semantic aspects and of generating a question graph (QGraph), which facilitates inference to obtain a precise answer in the closed domain. In addition, our approach provides a convenient method to map the formulated comprehensive information into a SPARQL query template to navigate the LOD resources in the open domain. The main contributions of this dissertation are as follows: 1.
Developing the ScoQAS architecture, integrated with common and specific components compatible with closed and open domain ontologies. 2. Analysing the user's question and building a question syntactic-semantic information structure (QSiS), which is constituted by several processes of the methodology: question classification, Expected Answer Type (EAT) determination, and constraint generation. 3. Presenting an empirical semantic-based structure-feature pattern for question classification and generalizing heuristic constraints to formulate the relations between the features in the recognized pattern in syntactic and semantic terms. 4. Developing a syntactic-semantic QGraph for representing the core components of the question. 5. Presenting an empirical graph-based answer inference in the closed domain. In a nutshell, a semantic-based QA system is presented which provides some experimental results over the closed and open domains. The efficiency of ScoQAS is evaluated using measures such as precision, recall, and F-measure on LOD challenges in the open domain. In the closed domain scenario we focus on quantitative evaluation; accuracy is analysed on an enterprise ontology. Due to the lack of predefined benchmarks in the first scenario, we define measures that demonstrate the actual complexity of the problem and the actual efficiency of the solutions. The results of the analysis corroborate the performance and effectiveness of our approach in achieving reasonable accuracy.

    Reasoning with Contexts in Description Logics

    Harmelen, F.A.H. van [Promotor]; Schlobach, K.S. [Copromotor]