59 research outputs found

    From Text to Knowledge

    The global information space provided by the World Wide Web has dramatically changed the way knowledge is shared around the world. To make this unbelievably huge information space accessible, search engines index the uploaded content and provide efficient algorithmic machinery for ranking the importance of documents with respect to an input query. All major search engines such as Google, Yahoo or Bing are keyword-based, which is indisputably a very powerful approach for information needs centered around documents. However, this unstructured, document-oriented paradigm of the World Wide Web has serious drawbacks when searching for specific knowledge about real-world entities. When asked for advanced facts about entities, today's search engines are not very good at providing accurate answers. Hand-built knowledge bases such as Wikipedia or its structured counterpart DBpedia are excellent sources of common facts. However, these knowledge bases are far from complete, and most knowledge still lies buried in unstructured documents. Statistical machine learning methods have great potential to help bridge the gap between text and knowledge by (semi-)automatically transforming the unstructured representation of today's World Wide Web into a more structured one. This thesis is devoted to reducing this gap with Probabilistic Graphical Models. Probabilistic Graphical Models play a crucial role in modern pattern recognition, as they merge two important fields of applied mathematics: Graph Theory and Probability Theory. The first part of the thesis presents a novel system called Text2SemRel that is able to (semi-)automatically construct knowledge bases from textual document collections. The resulting knowledge base consists of facts centered around entities and their relations.
An essential part of the system is a novel algorithm for extracting relations between entity mentions, based on Conditional Random Fields, which are undirected Probabilistic Graphical Models. In the second part of the thesis, we use the power of directed Probabilistic Graphical Models to solve important knowledge discovery tasks in large, semantically annotated document collections. In particular, we present extensions of the Latent Dirichlet Allocation framework that learn, in an unsupervised way, the statistical semantic dependencies between unstructured representations such as documents and their semantic annotations. Semantic annotations of documents may refer to concepts originating from a thesaurus or ontology, but also to user-generated informal tags in social tagging systems. These forms of annotation represent a first step towards a more structured form of the World Wide Web. In the last part of the thesis, we demonstrate the large-scale applicability of the proposed fact extraction system Text2SemRel. In particular, we extract semantic relations between genes and diseases from a large biomedical text repository. The resulting knowledge base contains far more potential disease genes than are currently stored in curated databases. Thus, the proposed system is able to unlock knowledge currently buried in the literature. The literature-derived human gene-disease network is then analyzed with respect to existing state-of-the-art curated databases. We analyze the derived knowledge base quantitatively by comparing it with several curated databases with regard to, among other things, database size and the properties of known disease genes. Our experimental analysis shows that the facts extracted from the literature are of high quality
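    The kind of fact extraction the abstract describes can be illustrated with a toy sketch. The pattern matcher below is a deliberately simplified stand-in for the CRF-based relation extractor: the function name `extract_relations`, the gazetteers, and the trigger pattern are all illustrative assumptions, not part of Text2SemRel.

```python
import re

# Toy gazetteers of entity mentions; a real system would use a trained
# named-entity recognizer instead of fixed lists.
GENES = {"BRCA1", "TP53"}
DISEASES = {"breast cancer", "lung cancer"}

def extract_relations(sentence):
    """Return (gene, relation, disease) triples found in one sentence.

    Simplified stand-in for a CRF-based relation extractor: it only
    fires when a known gene and disease co-occur around a trigger phrase.
    """
    triples = []
    for gene in GENES:
        for disease in DISEASES:
            # Trigger pattern: "<gene> ... associated with ... <disease>"
            pattern = re.escape(gene) + r".*\bassociated with\b.*" + re.escape(disease)
            if re.search(pattern, sentence, re.IGNORECASE):
                triples.append((gene, "associated_with", disease))
    return triples

facts = extract_relations("Mutations in BRCA1 are strongly associated with breast cancer.")
print(facts)  # [('BRCA1', 'associated_with', 'breast cancer')]
```

A statistical extractor replaces the hand-written trigger with learned feature weights over token sequences, but the output shape, entity-relation-entity facts, is the same.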

    From Text to Knowledge with Graphs: modelling, querying and exploiting textual content

    This paper highlights the challenges, current trends, and open issues related to the representation, querying, and analytics of content extracted from texts. The Internet contains vast text-based information on various subjects, including commercial documents, medical records, scientific experiments, engineering tests, and events that impact urban and natural environments. Extracting knowledge from this text involves understanding the nuances of natural language and accurately representing the content without losing information, so that knowledge can be accessed, inferred, or discovered. Achieving this requires combining results from various fields, such as linguistics, natural language processing, knowledge representation, data storage, querying, and analytics. The vision put forward in this paper is that, once annotated, graphs are a well-suited representation of textual content, provided the right querying and analytics techniques are applied. This paper discusses this hypothesis from the perspectives of linguistics, natural language processing, graph models and databases, and artificial intelligence, as provided by the panellists of the DOING session at the MADICS Symposium 2022
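    The idea of graphs as a representation of annotated text can be sketched minimally: nodes for extracted entities and events, labelled edges for relations, plus a one-hop query. The class name `TextGraph`, the relation names, and the example annotations are invented for illustration; they do not come from the paper.

```python
from collections import defaultdict

# A minimal property-graph sketch: edges carry a relation type.
class TextGraph:
    def __init__(self):
        self.edges = defaultdict(list)  # source -> [(relation, target)]

    def add_edge(self, source, relation, target):
        self.edges[source].append((relation, target))

    def neighbours(self, node, relation=None):
        """One-hop query: targets reachable from `node`, optionally
        restricted to a single relation type."""
        return [t for r, t in self.edges[node] if relation is None or r == relation]

g = TextGraph()
# Annotations one might extract from a sentence such as
# "Flooding affected the old town of Orléans."
g.add_edge("flood_event_1", "affects", "Orléans")
g.add_edge("flood_event_1", "type", "NaturalDisaster")
g.add_edge("Orléans", "locatedIn", "France")

print(g.neighbours("flood_event_1", "affects"))  # ['Orléans']
```

Real systems store such graphs in RDF triple stores or property-graph databases and query them with SPARQL or Cypher; the adjacency-list version above only shows the shape of the representation.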

    Capturing flight system test engineering expertise: Lessons learned

    Within a few years, JPL will be challenged by the most active mission set in its history. Concurrently, flight systems are becoming increasingly complex. At present, the knowledge needed to conduct integration and test of spacecraft and large instruments is held by a few key people, each with many years of experience. JPL is in danger of losing a significant amount of this critical expertise through retirement, during a period when demand for it is rapidly increasing. The most critical issue at hand is to collect and retain this expertise and to develop tools that ensure the ability to successfully perform the integration and test of future spacecraft and large instruments. The proposed solution was to capture and codify a subset of existing knowledge, and to utilize this captured expertise in knowledge-based systems. First-year results and activities planned for the second year of this ongoing effort are described. Topics discussed include lessons learned in knowledge acquisition and elicitation techniques, life-cycle paradigms, and rapid prototyping of a knowledge-based advisor (Spacecraft Test Assistant) and a hypermedia browser (Test Engineering Browser). The prototype Spacecraft Test Assistant supports a subset of integration and test activities for flight systems. The Test Engineering Browser is a hypermedia tool that allows users to easily peruse spacecraft test topics. A knowledge acquisition tool called ConceptFinder, developed to search through large volumes of data for related concepts, is also described; it is being modified to semi-automate the process of creating hypertext links
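    Searching large volumes of data for related concepts, the task attributed to ConceptFinder, can be illustrated with a simple co-occurrence ranking. The abstract does not describe ConceptFinder's actual algorithm; the function `related_concepts` and the sample documents below are assumptions made purely for illustration.

```python
from collections import Counter

# Each document is reduced to the set of concepts it mentions.
documents = [
    {"thermal vacuum test", "spacecraft", "instrument"},
    {"thermal vacuum test", "spacecraft", "telemetry"},
    {"telemetry", "ground station"},
]

def related_concepts(concept, docs):
    """Rank concepts by how often they co-occur with `concept`."""
    counts = Counter()
    for doc in docs:
        if concept in doc:
            counts.update(doc - {concept})
    return [c for c, _ in counts.most_common()]

print(related_concepts("spacecraft", documents))
# most frequent co-occurring concept first: 'thermal vacuum test'
```

Ranked co-occurrences like these could then suggest candidate hypertext links between related test topics, which is the semi-automation goal the abstract mentions.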

    Ontology Lexicalisation: The lemon Perspective

    Ontologies (Guarino1998) capture knowledge but fail to capture the structure and use of the terms that express and refer to this knowledge in natural language. The structure and use of terms are the concern of terminology as well as lexicology. In recent years, the relevance of terminology to knowledge representation has been recognized again (for example, with the advent of SKOS), but less consideration has been given to lexical and linguistic issues in knowledge representation (Buitelaar2010)

    Ontology Lexicalization: The lemon perspective

    Buitelaar P, Cimiano P, McCrae J, Montiel-Ponsoda E, Declerck T. Ontology Lexicalization: The lemon perspective. In: Proceedings of the Workshops - 9th International Conference on Terminology and Artificial Intelligence (TIA 2011). 2011: 33-36

    Impact of standards in European open data catalogues: a multilingual perspective of DCAT

    Within the European Union, member states are setting up official data catalogues as entry points for accessing PSI (Public Sector Information). In this context, it is important to describe the metadata of these data portals, i.e., of data catalogues, and to allow for interoperability among them. To tackle these issues, the Government Linked Data Working Group developed DCAT (Data Catalog Vocabulary), an RDF vocabulary for describing the metadata of data catalogues. This topic report analyzes the current use of the DCAT vocabulary in several European data catalogues and proposes recommendations for dealing with inconsistent use of metadata across countries. Enriching such metadata vocabularies with multilingual descriptions, as well as accounting for cultural divergences, is seen as a necessary step to guarantee interoperability and ensure wider adoption
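    A multilingual DCAT dataset record can be sketched as a few RDF triples serialized as Turtle. The `dcat:` and `dct:` terms below follow the W3C DCAT vocabulary, but the dataset URI, titles, and publisher are invented example values, and the helper `dcat_dataset` is purely illustrative (no RDF library is used, only string formatting).

```python
# Standard namespace prefixes for DCAT and Dublin Core terms.
PREFIXES = (
    "@prefix dcat: <http://www.w3.org/ns/dcat#> .\n"
    "@prefix dct:  <http://purl.org/dc/terms/> .\n"
)

def dcat_dataset(uri, titles, publisher):
    """Render one dcat:Dataset with one language-tagged dct:title per language."""
    title_lines = " ;\n".join(
        f'    dct:title "{text}"@{lang}' for lang, text in titles.items()
    )
    return (
        f"<{uri}> a dcat:Dataset ;\n"
        f"{title_lines} ;\n"
        f"    dct:publisher <{publisher}> .\n"
    )

record = PREFIXES + dcat_dataset(
    "http://example.org/dataset/air-quality",
    {"en": "Air quality measurements", "fr": "Mesures de la qualité de l'air"},
    "http://example.org/org/environment-agency",
)
print(record)
```

Language-tagged literals (`@en`, `@fr`) are exactly the mechanism the report's multilingual recommendation relies on: one dataset description can carry titles in every member-state language.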

    Ontologies et Recherche d'Information : une application au diagnostic automobile

    This article describes the founding principles and overall operation of TextViz, a semantic information retrieval (IR) tool used in the field of automotive diagnosis. Incident databases (cataloguing a set of known faults) have always been prized by car manufacturers and repair shops: they make it possible to capitalize on knowledge so that it can be accessed later. However, with increasingly complex vehicle architectures, the possible causes of a fault have quickly multiplied, which makes the IR process crucial. Building on a limited knowledge model of automotive diagnosis, our software aims to facilitate the semantic storage and retrieval of information among a large number of known fault cases

    Convertir des dérivations TAG en dépendances

    Syntactic dependency structures are important and well suited as a starting point for various applications. In the context of the TAG parser FRMG, we present the details of a process for converting shared derivation forests into shared dependency forests. Some details are also provided about a disambiguation algorithm over these dependency forests