3 research outputs found

    CoDoSA: A Lightweight, XML-Based Framework for Integrating Unstructured Textual Information

    One of the most fundamental dimensions of information quality is access. For many organizations, a large part of their information assets is locked away in Unstructured Textual Information (UTI) in the form of email, letters, contracts, call notes, and spreadsheets. In addition to internal UTI, there is also a wealth of publicly available UTI on websites, in newspapers, in courthouse records, and in other sources that can add value when combined with internally managed information. This paper describes a system called Compressed Document Set Architecture (CoDoSA), designed to facilitate the integration of UTI into a structured database environment where it can be more readily accessed and manipulated. The CoDoSA Framework comprises an XML-based metadata standard and an associated Application Program Interface (API). The paper further describes how CoDoSA can facilitate the storage and management of information during the ETL (Extract, Transform, and Load) process used to integrate UTI. It also explains how CoDoSA promotes higher information quality by providing several features that simplify the governance of metadata standards and the enforcement of data quality constraints across different UTI applications and development teams. In addition, CoDoSA provides a mechanism for inserting semantic tags into captured UTI; these tags can be used in later steps to drive semantic-mediated queries and processes.

    Relation extraction for information extraction from free text


    Semantic information retrieval and automatic extraction of a domain ontology (Recherche d'information sémantique et extraction automatique d'ontologie du domaine)

    It can be difficult, even for a small organization, to find information among hundreds or even thousands of electronic documents. Companies that want to improve information retrieval on their intranet often adopt the methods used by Internet search engines. These techniques rely on statistical methods and cannot process the semantics contained in either the user's query or the documents. Some approaches have been developed to extract these semantics and thus better answer user queries. However, most of these techniques were designed to be applied to the entire World Wide Web rather than to a particular field of knowledge, such as corporate data. It could be worthwhile to use a domain-specific ontology to link a query to related documents and thus answer such queries better. This thesis presents our approach, which uses the Text-To-Onto software to automatically create an ontology describing a particular domain. This ontology is then used by the Sesei software, a semantic filter for conventional search engines. This method improves the relevance of the documents returned to the user.