9 research outputs found

    Ontologies and Information Extraction

    Full text link
    This report argues that, even in the simplest cases, IE is an ontology-driven process. It is not a mere text filtering method based on simple pattern matching and keywords, because the extracted pieces of text are interpreted with respect to a predefined partial domain model. The report shows that, depending on the nature and depth of the interpretation required to extract the information, more or less knowledge must be involved. It is mainly illustrated in biology, a domain in which there are critical needs for content-based exploration of the scientific literature and which has become a major application domain for IE.

    Pattern extraction from the world wide web

    Full text link
    The World Wide Web is a source of huge amounts of unlabeled information spread across different sources in varied formats. This presents us with both opportunities and challenges in leveraging such large amounts of unstructured data to build knowledge bases and to extract relevant information. As part of this thesis, a semi-supervised logistic regression model called “Dual Iterative Pattern Relation Extraction” (DIPRE), proposed by Sergey Brin, is selected for further investigation. DIPRE presents a technique that exploits the duality between sets of patterns and relations to grow the target relation starting from a small sample. This project, built in Java using the Google AJAX Search API, covers the design, implementation, and testing of the DIPRE approach for extracting various relationships from the web.
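The pattern-relation duality at the heart of DIPRE can be sketched minimally as follows. This is an illustrative Python version (the thesis implementation was in Java against the Google AJAX Search API), and the corpus, relation, and regular expressions here are hypothetical stand-ins:

```python
# Minimal sketch of DIPRE-style bootstrapping over a toy in-memory corpus,
# with (author, book) as the target relation. All data is hypothetical.
import re

def find_patterns(corpus, seeds):
    """Collect the 'middle' strings that connect a known pair in the corpus."""
    patterns = set()
    for author, title in seeds:
        for text in corpus:
            m = re.search(re.escape(author) + r"(.{1,30}?)" + re.escape(title), text)
            if m:
                patterns.add(m.group(1))
    return patterns

def apply_patterns(corpus, patterns):
    """Use the collected middles to extract new candidate pairs."""
    pairs = set()
    for middle in patterns:
        # Capitalized word runs on either side of the middle stand in for entities.
        rx = re.compile(r"([A-Z][\w. ]+?)" + re.escape(middle) + r"([A-Z][\w ]+)")
        for text in corpus:
            for m in rx.finditer(text):
                pairs.add((m.group(1).strip(), m.group(2).strip()))
    return pairs

corpus = [
    "Isaac Asimov wrote The Robots of Dawn.",
    "Arthur Clarke wrote Rendezvous With Rama.",
]
seeds = {("Isaac Asimov", "The Robots of Dawn")}
patterns = find_patterns(corpus, seeds)              # the shared middle " wrote "
relation = seeds | apply_patterns(corpus, patterns)  # grows to include the Clarke pair
```

In the full method the two steps alternate: each round of new pairs yields new patterns, which in turn yield more pairs, starting from only a handful of seeds.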

    Acquisition of Domain-specific Patterns for Single Document Summarization and Information Extraction

    Get PDF
    AIPR 2015. Single-document summarization aims to reduce the size of a text document while preserving the most important information. Much work has been done on open-domain summarization. This paper presents an automatic way to mine domain-specific patterns from text documents. With a small amount of effort required for manual selection, these patterns can be used for domain-specific, scenario-based document summarization and information extraction. Our evaluation shows that scenario-based document summarization can both filter irrelevant documents and create summaries for relevant documents within the specified domain. Peer reviewed.
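Pattern-based, scenario-specific summarization of the kind described above can be sketched as follows. The seismic-event patterns and sentences are hypothetical, and a real system would mine the patterns from domain corpora rather than hard-code them:

```python
import re

# Hypothetical domain patterns for a seismic-event scenario.
patterns = [r"\bmagnitude \d+(\.\d+)? earthquake\b", r"\bepicenter\b"]

def summarize(sentences, patterns, k=2):
    """Keep the k sentences matching the most domain patterns, dropping the rest.
    A document with no matching sentence is filtered out (empty summary)."""
    scored = [(sum(bool(re.search(p, s, re.I)) for p in patterns), s) for s in sentences]
    ranked = sorted((x for x in scored if x[0] > 0), key=lambda x: -x[0])
    return [s for _, s in ranked[:k]]

doc = [
    "A magnitude 6.1 earthquake struck the coast at dawn.",
    "Local markets reopened the next day.",
    "Officials said the epicenter was 20 km offshore.",
]
summary = summarize(doc, patterns)  # keeps the two pattern-bearing sentences
```

The same match-and-rank step serves double duty: documents whose summary comes back empty are treated as irrelevant to the scenario.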

    Acquiring information extraction patterns from unannotated corpora

    Get PDF
    Information Extraction (IE) can be defined as the task of automatically extracting pre-specified kinds of information from a text document. The extracted information is encoded in the required format and can then be used, for example, for text summarization or as an accurate index to retrieve new documents.

    The main issue when building IE systems is how to obtain the knowledge needed to identify relevant information in a document. Today, IE systems are commonly based on extraction rules, or IE patterns, that represent the kind of information to be extracted. Most approaches to IE pattern acquisition require expert human intervention in many steps of the acquisition process. This dissertation presents Essence, a novel method for acquiring IE patterns that significantly reduces the need for human intervention. The method is based on ELA, a learning algorithm specifically designed to acquire IE patterns from unannotated corpora.

    The distinctive features of Essence and ELA are that 1) they permit the automatic acquisition of IE patterns from unrestricted, untagged text representative of the domain, due to 2) their ability to identify regularities around semantically relevant concept-words for the IE task by 3) using non-domain-specific lexical knowledge tools such as WordNet, and 4) restricting human intervention to defining the task and validating and typifying the set of IE patterns obtained. Since Essence does not require a corpus annotated with the type of information to be extracted, and makes use of a general-purpose ontology and widely applied syntactic tools, it reduces the expert effort required to build an IE system and therefore also reduces the effort of porting the method to a new domain.

    To validate Essence, we conducted a set of experiments to test the performance of the method, using Essence to generate IE patterns for a MUC-like task. However, the evaluation procedure for MUC competitions does not provide a sound evaluation of IE systems, especially of learning systems. For this reason, we conducted an exhaustive set of further experiments to test the abilities of Essence. The results of these experiments indicate that the proposed method is able to learn effective IE patterns.
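The generalization idea in feature 3) above, lifting pattern words to broader semantic classes through a lexical resource, can be sketched minimally. The hypernym map below is a toy, hypothetical stand-in for WordNet, and the pattern shapes are illustrative only:

```python
# Toy stand-in for a WordNet hypernym lookup; all entries hypothetical.
hypernyms = {
    "hired": "employ", "recruited": "employ",
    "IBM": "organization", "Acme": "organization",
}

def generalize(pattern):
    """Lift each word in a surface pattern to its hypernym class when one is known."""
    return tuple(hypernyms.get(word, word) for word in pattern)

# Two different surface realizations collapse into one IE pattern:
p1 = generalize(("IBM", "hired", "PERSON"))
p2 = generalize(("Acme", "recruited", "PERSON"))
# both become ("organization", "employ", "PERSON")
```

Collapsing distinct surface forms into one generalized pattern is what lets regularities around concept-words emerge from untagged text without domain-specific annotation.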

    Unsupervised Discovery of Scenario-Level Patterns for Information Extraction

    No full text
    Information Extraction (IE) systems are commonly based on pattern matching. Adapting an IE system to a new scenario entails the construction of a new pattern base, a time-consuming and expensive process. We have implemented a system for finding patterns automatically in unannotated text. Starting with a small initial set of seed patterns proposed by the user, the system applies an incremental discovery procedure to identify new patterns. We present experiments and evaluations showing that the resulting patterns exhibit high precision and recall.

    Unsupervised relation extraction for e-learning applications

    Get PDF
    In this modern era, many educational institutes and business organisations are adopting the e-Learning approach, as it provides an effective method for educating and testing their students and staff. Continuous development in information technology and increasing use of the internet have resulted in a huge global market and rapid growth for e-Learning. Multiple Choice Tests (MCTs) are a popular form of assessment and are quite frequently used by many e-Learning applications, as they are well adapted to assessing factual, conceptual and procedural information. In this thesis, we present an alternative to the lengthy and time-consuming activity of developing MCTs by proposing a Natural Language Processing (NLP) based approach that relies on semantic relations extracted using Information Extraction to automatically generate MCTs. Information Extraction (IE) is an NLP field used to recognise the most important entities in a text, and the relations between them, regardless of their surface realisations. In IE, text is processed at a semantic level that allows a partial representation of the meaning of a sentence to be produced. IE has two major subtasks: Named Entity Recognition (NER) and Relation Extraction (RE). In this work, we present two unsupervised RE approaches (surface-based and dependency-based). The aim of both approaches is to identify the most important semantic relations in a document without assigning explicit labels to them, in order to ensure broad coverage unrestricted to predefined types of relations. In the surface-based approach, we examined different surface pattern types, each implementing different assumptions about the linguistic expression of semantic relations between named entities, while in the dependency-based approach we explored how dependency relations based on dependency trees can help extract relations between named entities.
Our findings indicate that the presented approaches are capable of achieving high precision rates. Our experiments make use of traditional, manually compiled corpora along with similar corpora automatically collected from the Web. We found that an automatically collected web corpus is still unable to ensure the same level of topic relevance as that attained in manually compiled traditional corpora. Comparison between the surface-based and the dependency-based approaches revealed that the dependency-based approach performs better. Our research enabled us to automatically generate questions about the important concepts present in a domain by relying on unsupervised relation extraction, as the extracted semantic relations allow us to identify key information in a sentence. The extracted patterns (semantic relations) are then automatically transformed into questions. In the surface-based approach, questions are generated automatically from sentences matched by the extracted surface-based semantic patterns, relying on a certain set of rules. Conversely, in the dependency-based approach, questions are generated automatically by traversing the dependency tree of each extracted sentence matched by the dependency-based semantic patterns. The MCQ systems produced from these surface-based and dependency-based semantic patterns were extrinsically evaluated by two domain experts in terms of the readability of questions and distractors, the usefulness of the semantic relations, relevance, the acceptability of questions and distractors, and overall MCQ usability. The evaluation results revealed that the MCQ system based on dependency-based semantic relations performed better than the surface-based one. A major outcome of this work is an integrated system for MCQ generation that has been evaluated by potential end users. EThOS - Electronic Theses Online Service, United Kingdom.
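The dependency-based route from relation to question can be illustrated with a toy parse. In practice the parse would come from a real dependency parser, and the tuple format and question template here are hypothetical simplifications:

```python
# Each token: (index, word, head_index, dependency_label); head -1 marks the root.
parse = [
    (0, "Curie",      1, "nsubj"),
    (1, "discovered", -1, "root"),
    (2, "polonium",   1, "obj"),
]

def extract_relation(parse):
    """Walk the nsubj and obj arcs attached to the root verb to form a relation triple."""
    root_i, root = next((i, w) for i, w, h, d in parse if d == "root")
    subj = next((w for i, w, h, d in parse if d == "nsubj" and h == root_i), None)
    obj = next((w for i, w, h, d in parse if d == "obj" and h == root_i), None)
    return (subj, root, obj)

triple = extract_relation(parse)          # ("Curie", "discovered", "polonium")
subj, verb, obj = triple
question = f"Who {verb} {obj}?"           # a minimal question-generation template
```

The extracted triple supplies both the question stem and the key (the subject), to which a full MCQ system would add distractors.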

    Moving towards the semantic web: enabling new technologies through the semantic annotation of social contents.

    Get PDF
    Social Web technologies have caused an exponential growth of the documents available through the Web, making enormous amounts of textual electronic resources available. Users may be overwhelmed by such an amount of content and, therefore, the automatic analysis and exploitation of all this information is of interest to the data mining community. Data mining algorithms exploit features of the entities in order to characterise, group or classify them according to their resemblance. Data by itself does not carry any meaning; it needs to be interpreted to convey information. Classical data analysis methods did not aim to “understand” the content: data were treated as meaningless numbers, and statistics were calculated on them to build models that were interpreted manually by human domain experts. Nowadays, motivated by the Semantic Web, many researchers have proposed semantic-grounded data classification and clustering methods that are able to exploit textual data at a conceptual level. However, they usually rely on pre-annotated inputs to be able to semantically interpret textual data such as the content of Web pages. The usability of all these methods is related to the linkage between data and its meaning.
This work focuses on the development of a general methodology able to detect the most relevant features of a particular textual resource, finding out their semantics (associating them with concepts modelled in ontologies) and detecting its main topics. The proposed methods are unsupervised (avoiding the manual annotation bottleneck), domain-independent (applicable to any area of knowledge) and flexible (able to deal with heterogeneous resources: raw text documents, semi-structured user-generated documents such as Wikipedia articles, or short and noisy tweets). The methods have been evaluated in different fields (Tourism, Oncology). This work is a first step towards the automatic semantic annotation of documents, needed to pave the way towards the Semantic Web vision.

    Framework for selecting performance indicators for the search and selection of partners for virtual organizations

    Get PDF
    Doctoral thesis - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Engenharia Elétrica.
    In the current dynamic world in which organizations operate, there is no time to postpone ideas or businesses, much less to lose them, for lack of support for quickly forming groups of organizations prepared to work collaboratively. Issues related to this subject have been addressed by the research field called "Collaborative Networked Organizations". Among the topics studied in this discipline are those related to "Virtual Organizations" (VOs). One of the critical aspects of VO creation is the selection of its partners, that is, how to select the organizations most suited to take part in a VO. In this context, one of the big issues concerns the criteria used to perform the partner selection, more specifically, the performance indicators to be applied as criteria to select VO partners. Considering the complexity of this task, this work presents a framework developed to aid the user in identifying and selecting the appropriate performance indicators to compare and suggest organizations that are able to fulfill the collaboration opportunity's requirements. The framework comprises a methodology that uses semantic information retrieval to select the indicators, supported by an ontology developed especially for that purpose. The main advantages of such an approach are the support given to the user in selecting indicators by automating parts of the process, and a clearer understanding of the process as a whole: which elements are involved, the inputs, outputs and required resources, the interdependencies among activities, and the correct sequence in which to activate each of them.
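Semantic, ontology-supported selection of indicators can be sketched in a toy form. The ontology entries, indicator names, and overlap-based scoring below are all hypothetical stand-ins for the thesis's actual methodology:

```python
# Hedged sketch: match collaboration-opportunity requirements to performance
# indicators through a toy ontology. All entries are hypothetical.
ontology = {  # indicator -> concepts it measures
    "on-time delivery rate": {"delivery", "time", "reliability"},
    "defect rate":           {"quality", "defects"},
    "unit cost":             {"cost", "price"},
}

def select_indicators(requirements, ontology, threshold=1):
    """Rank indicators by overlap between required concepts and indicator concepts."""
    scored = [(len(requirements & concepts), name) for name, concepts in ontology.items()]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]

picks = select_indicators({"quality", "delivery", "time"}, ontology)
# "on-time delivery rate" (overlap 2) ranks above "defect rate" (overlap 1);
# "unit cost" has no overlap and is dropped.
```

A real semantic retrieval step would also follow ontology relations (e.g. subsumption between concepts) rather than relying on exact concept overlap.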