9 research outputs found

    Ontologies and Information Extraction

    Full text link
    This report argues that, even in the simplest cases, IE is an ontology-driven process. It is not a mere text filtering method based on simple pattern matching and keywords, because the extracted pieces of text are interpreted with respect to a predefined partial domain model. The report shows that, depending on the nature and depth of the interpretation required to extract the information, more or less knowledge must be involved. It is mainly illustrated in biology, a domain in which there are critical needs for content-based exploration of the scientific literature and which has become a major application domain for IE.

    Pattern extraction from the world wide web

    Full text link
    The World Wide Web is a source of huge amounts of unlabeled information spread across different sources in varied formats. This presents us with both opportunities and challenges in leveraging such large amounts of unstructured data to build knowledge bases and to extract relevant information. As part of this thesis, a semi-supervised logistic regression model called “Dual Iterative Pattern Relation Extraction” (DIPRE), proposed by Sergey Brin, is selected for further investigation. DIPRE presents a technique that exploits the duality between sets of patterns and relations to grow the target relation starting from a small sample. This project, built in Java using the Google AJAX Search API, covers the design, implementation, and testing of the DIPRE approach for extracting various relationships from the web.
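The pattern-relation duality at the heart of DIPRE can be sketched minimally as follows. This is an illustrative Python version (the thesis implementation was in Java against the Google AJAX Search API), and the corpus, relation, and regular expressions here are hypothetical stand-ins:

```python
# Minimal sketch of DIPRE-style bootstrapping over a toy in-memory corpus,
# with (author, book) as the target relation. All data is hypothetical.
import re

def find_patterns(corpus, seeds):
    """Collect the 'middle' strings that connect a known pair in the corpus."""
    patterns = set()
    for author, title in seeds:
        for text in corpus:
            m = re.search(re.escape(author) + r"(.{1,30}?)" + re.escape(title), text)
            if m:
                patterns.add(m.group(1))
    return patterns

def apply_patterns(corpus, patterns):
    """Use the collected middles to extract new candidate pairs."""
    pairs = set()
    for middle in patterns:
        # Capitalized word runs on either side of the middle stand in for entities.
        rx = re.compile(r"([A-Z][\w. ]+?)" + re.escape(middle) + r"([A-Z][\w ]+)")
        for text in corpus:
            for m in rx.finditer(text):
                pairs.add((m.group(1).strip(), m.group(2).strip()))
    return pairs

corpus = [
    "Isaac Asimov wrote The Robots of Dawn.",
    "Arthur Clarke wrote Rendezvous With Rama.",
]
seeds = {("Isaac Asimov", "The Robots of Dawn")}
patterns = find_patterns(corpus, seeds)              # the shared middle " wrote "
relation = seeds | apply_patterns(corpus, patterns)  # grows to include the Clarke pair
```

In the full method the two steps alternate: each round of new pairs yields new patterns, which in turn yield more pairs, starting from only a handful of seeds.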

    Acquisition of Domain-specific Patterns for Single Document Summarization and Information Extraction

    Get PDF
    AIPR 2015. Single-document summarization aims to reduce the size of a text document while preserving the most important information. Much work has been done on open-domain summarization. This paper presents an automatic way to mine domain-specific patterns from text documents. With a small amount of effort required for manual selection, these patterns can be used for domain-specific, scenario-based document summarization and information extraction. Our evaluation shows that scenario-based document summarization can both filter irrelevant documents and create summaries for relevant documents within the specified domain. Peer reviewed.
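Pattern-based, scenario-specific summarization of the kind described above can be sketched as follows. The seismic-event patterns and sentences are hypothetical, and a real system would mine the patterns from domain corpora rather than hard-code them:

```python
import re

# Hypothetical domain patterns for a seismic-event scenario.
patterns = [r"\bmagnitude \d+(\.\d+)? earthquake\b", r"\bepicenter\b"]

def summarize(sentences, patterns, k=2):
    """Keep the k sentences matching the most domain patterns, dropping the rest.
    A document with no matching sentence is filtered out (empty summary)."""
    scored = [(sum(bool(re.search(p, s, re.I)) for p in patterns), s) for s in sentences]
    ranked = sorted((x for x in scored if x[0] > 0), key=lambda x: -x[0])
    return [s for _, s in ranked[:k]]

doc = [
    "A magnitude 6.1 earthquake struck the coast at dawn.",
    "Local markets reopened the next day.",
    "Officials said the epicenter was 20 km offshore.",
]
summary = summarize(doc, patterns)  # keeps the two pattern-bearing sentences
```

The same match-and-rank step serves double duty: documents whose summary comes back empty are treated as irrelevant to the scenario.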

    Acquiring information extraction patterns from unannotated corpora

    Get PDF
    Information Extraction (IE) can be defined as the task of automatically extracting pre-specified kinds of information from a text document. The extracted information is encoded in the required format and can then be used, for example, for text summarization or as an accurate index to retrieve new documents.

    The main issue when building IE systems is how to obtain the knowledge needed to identify relevant information in a document. Today, IE systems are commonly based on extraction rules, or IE patterns, that represent the kind of information to be extracted. Most approaches to IE pattern acquisition require expert human intervention in many steps of the acquisition process. This dissertation presents Essence, a novel method for acquiring IE patterns that significantly reduces the need for human intervention. The method is based on ELA, a learning algorithm specifically designed to acquire IE patterns from unannotated corpora.

    The distinctive features of Essence and ELA are that 1) they permit the automatic acquisition of IE patterns from unrestricted, untagged text representative of the domain, due to 2) their ability to identify regularities around semantically relevant concept-words for the IE task by 3) using non-domain-specific lexical knowledge tools such as WordNet, and 4) restricting human intervention to defining the task and validating and typifying the set of IE patterns obtained. Since Essence does not require a corpus annotated with the type of information to be extracted, and makes use of a general-purpose ontology and widely applied syntactic tools, it reduces the expert effort required to build an IE system and therefore also reduces the effort of porting the method to a new domain.

    To validate Essence, we conducted a set of experiments to test the performance of the method, using Essence to generate IE patterns for a MUC-like task. However, the evaluation procedure for MUC competitions does not provide a sound evaluation of IE systems, especially of learning systems. For this reason, we conducted an exhaustive set of further experiments to test the abilities of Essence. The results of these experiments indicate that the proposed method is able to learn effective IE patterns.
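The generalization idea in feature 3) above, lifting pattern words to broader semantic classes through a lexical resource, can be sketched minimally. The hypernym map below is a toy, hypothetical stand-in for WordNet, and the pattern shapes are illustrative only:

```python
# Toy stand-in for a WordNet hypernym lookup; all entries hypothetical.
hypernyms = {
    "hired": "employ", "recruited": "employ",
    "IBM": "organization", "Acme": "organization",
}

def generalize(pattern):
    """Lift each word in a surface pattern to its hypernym class when one is known."""
    return tuple(hypernyms.get(word, word) for word in pattern)

# Two different surface realizations collapse into one IE pattern:
p1 = generalize(("IBM", "hired", "PERSON"))
p2 = generalize(("Acme", "recruited", "PERSON"))
# both become ("organization", "employ", "PERSON")
```

Collapsing distinct surface forms into one generalized pattern is what lets regularities around concept-words emerge from untagged text without domain-specific annotation.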

    Unsupervised Discovery of Scenario-Level Patterns for Information Extraction

    No full text
    Information Extraction (IE) systems are commonly based on pattern matching. Adapting an IE system to a new scenario entails the construction of a new pattern base, a time-consuming and expensive process. We have implemented a system for finding patterns automatically in unannotated text. Starting with a small initial set of seed patterns proposed by the user, the system applies an incremental discovery procedure to identify new patterns. We present experiments and evaluations showing that the resulting patterns exhibit high precision and recall.

    Unsupervised relation extraction for e-learning applications

    Get PDF
    In this modern era, many educational institutes and business organisations are adopting the e-Learning approach, as it provides an effective method for educating and testing their students and staff. Continuous development in information technology and increasing use of the internet have resulted in a huge global market and rapid growth for e-Learning. Multiple Choice Tests (MCTs) are a popular form of assessment and are quite frequently used by many e-Learning applications, as they are well adapted to assessing factual, conceptual and procedural information. In this thesis, we present an alternative to the lengthy and time-consuming activity of developing MCTs by proposing a Natural Language Processing (NLP) based approach that relies on semantic relations extracted using Information Extraction to automatically generate MCTs. Information Extraction (IE) is an NLP field used to recognise the most important entities in a text, and the relations between them, regardless of their surface realisations. In IE, text is processed at a semantic level that allows a partial representation of the meaning of a sentence to be produced. IE has two major subtasks: Named Entity Recognition (NER) and Relation Extraction (RE). In this work, we present two unsupervised RE approaches (surface-based and dependency-based). The aim of both approaches is to identify the most important semantic relations in a document without assigning explicit labels to them, in order to ensure broad coverage unrestricted to predefined types of relations. In the surface-based approach, we examined different surface pattern types, each implementing different assumptions about the linguistic expression of semantic relations between named entities, while in the dependency-based approach we explored how dependency relations based on dependency trees can help extract relations between named entities.
Our findings indicate that the presented approaches are capable of achieving high precision rates. Our experiments make use of traditional, manually compiled corpora along with similar corpora automatically collected from the Web. We found that an automatically collected web corpus is still unable to ensure the same level of topic relevance as that attained in manually compiled traditional corpora. Comparison between the surface-based and the dependency-based approaches revealed that the dependency-based approach performs better. Our research enabled us to automatically generate questions about the important concepts present in a domain by relying on unsupervised relation extraction, as the extracted semantic relations allow us to identify key information in a sentence. The extracted patterns (semantic relations) are then automatically transformed into questions. In the surface-based approach, questions are generated automatically from sentences matched by the extracted surface-based semantic patterns, relying on a certain set of rules. Conversely, in the dependency-based approach, questions are generated automatically by traversing the dependency tree of each extracted sentence matched by the dependency-based semantic patterns. The MCQ systems produced from these surface-based and dependency-based semantic patterns were extrinsically evaluated by two domain experts in terms of the readability of questions and distractors, the usefulness of the semantic relations, relevance, the acceptability of questions and distractors, and overall MCQ usability. The evaluation results revealed that the MCQ system based on dependency-based semantic relations performed better than the surface-based one. A major outcome of this work is an integrated system for MCQ generation that has been evaluated by potential end users. EThOS - Electronic Theses Online Service, United Kingdom.
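The dependency-based route from relation to question can be illustrated with a toy parse. In practice the parse would come from a real dependency parser, and the tuple format and question template here are hypothetical simplifications:

```python
# Each token: (index, word, head_index, dependency_label); head -1 marks the root.
parse = [
    (0, "Curie",      1, "nsubj"),
    (1, "discovered", -1, "root"),
    (2, "polonium",   1, "obj"),
]

def extract_relation(parse):
    """Walk the nsubj and obj arcs attached to the root verb to form a relation triple."""
    root_i, root = next((i, w) for i, w, h, d in parse if d == "root")
    subj = next((w for i, w, h, d in parse if d == "nsubj" and h == root_i), None)
    obj = next((w for i, w, h, d in parse if d == "obj" and h == root_i), None)
    return (subj, root, obj)

triple = extract_relation(parse)          # ("Curie", "discovered", "polonium")
subj, verb, obj = triple
question = f"Who {verb} {obj}?"           # a minimal question-generation template
```

The extracted triple supplies both the question stem and the key (the subject), to which a full MCQ system would add distractors.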

    Moving towards the semantic web: enabling new technologies through the semantic annotation of social contents.

    Get PDF
    Social Web technologies have caused an exponential growth of the documents available through the Web, making enormous amounts of textual electronic resources available. Users may be overwhelmed by such an amount of content and, therefore, the automatic analysis and exploitation of all this information is of interest to the data mining community. Data mining algorithms exploit features of the entities in order to characterise, group or classify them according to their resemblance. Data by itself does not carry any meaning; it needs to be interpreted to convey information. Classical data analysis methods did not aim to “understand” the content: data were treated as meaningless numbers, and statistics were calculated on them to build models that were interpreted manually by human domain experts. Nowadays, motivated by the Semantic Web, many researchers have proposed semantic-grounded data classification and clustering methods that are able to exploit textual data at a conceptual level. However, they usually rely on pre-annotated inputs to be able to semantically interpret textual data such as the content of Web pages. The usability of all these methods is related to the linkage between data and its meaning.
This work focuses on the development of a general methodology able to detect the most relevant features of a particular textual resource, finding out their semantics (associating them with concepts modelled in ontologies) and detecting its main topics. The proposed methods are unsupervised (avoiding the manual annotation bottleneck), domain-independent (applicable to any area of knowledge) and flexible (able to deal with heterogeneous resources: raw text documents, semi-structured user-generated documents such as Wikipedia articles, or short and noisy tweets). The methods have been evaluated in different fields (Tourism, Oncology). This work is a first step towards the automatic semantic annotation of documents, needed to pave the way towards the Semantic Web vision.

    Framework for selecting performance indicators for the search and selection of partners for virtual organizations

    Get PDF
    Doctoral thesis - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Engenharia Elétrica.
    In the current dynamic world in which organizations operate, there is no time to postpone ideas or businesses, much less to lose them, for lack of support for quickly forming groups of organizations prepared to work collaboratively. Issues related to this subject have been addressed by the research field called "Collaborative Networked Organizations". Among the topics studied in this discipline are those related to "Virtual Organizations" (VOs). One of the critical aspects of VO creation is the selection of its partners, that is, how to select the organizations most suited to take part in a VO. In this context, one of the big issues concerns the criteria used to perform the partner selection, more specifically, the performance indicators to be applied as criteria to select VO partners. Considering the complexity of this task, this work presents a framework developed to aid the user in identifying and selecting the appropriate performance indicators to compare and suggest organizations that are able to fulfill the collaboration opportunity's requirements. The framework comprises a methodology that uses semantic information retrieval to select the indicators, supported by an ontology developed especially for that purpose. The main advantages of such an approach are the support given to the user in selecting indicators by automating parts of the process, and a clearer understanding of the process as a whole: which elements are involved, the inputs, outputs and required resources, the interdependencies among activities, and the correct sequence in which to activate each of them.
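Semantic, ontology-supported selection of indicators can be sketched in a toy form. The ontology entries, indicator names, and overlap-based scoring below are all hypothetical stand-ins for the thesis's actual methodology:

```python
# Hedged sketch: match collaboration-opportunity requirements to performance
# indicators through a toy ontology. All entries are hypothetical.
ontology = {  # indicator -> concepts it measures
    "on-time delivery rate": {"delivery", "time", "reliability"},
    "defect rate":           {"quality", "defects"},
    "unit cost":             {"cost", "price"},
}

def select_indicators(requirements, ontology, threshold=1):
    """Rank indicators by overlap between required concepts and indicator concepts."""
    scored = [(len(requirements & concepts), name) for name, concepts in ontology.items()]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]

picks = select_indicators({"quality", "delivery", "time"}, ontology)
# "on-time delivery rate" (overlap 2) ranks above "defect rate" (overlap 1);
# "unit cost" has no overlap and is dropped.
```

A real semantic retrieval step would also follow ontology relations (e.g. subsumption between concepts) rather than relying on exact concept overlap.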