Implementation of Information Extraction in the Book Domain Using Supervised Learning of Extraction Patterns and Rules with HTML Text Processing
ABSTRACT: As data on the internet grows rapidly, the internet is now used as a data source for many purposes. Automatic Cataloging (ACat) is an IE system that automates the book-cataloging process, taking offline HTML pages from the internet as input. Using rules built from a learning corpus with natural language tools, book information can be extracted from an HTML page. The precision and recall obtained when tagging with the learned rules depend on the maximum and minimum slot-filler length and on the removal of uncoupled tags. The method used is Supervised Learning of Extraction Patterns and Rules, in which a learning corpus must be built for the target domain, in this case the book domain. The IE system works as a tagger that marks the relevant information to be extracted.
Keywords: tagger, information extraction, supervised learning, rule, natural language, POS tagger
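The abstract reports that the learned rules' precision and recall depend on the slot-filler length bounds. As a reminder of how those two metrics are computed over extracted slot fillers (the example data below is illustrative, not from the paper):

```python
def precision_recall(extracted, gold):
    """Exact-match precision and recall for a set of extracted
    slot fillers against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)  # correct extractions
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Two of three extracted fillers are correct; two of three gold
# fillers are found, so both metrics come out to 2/3.
p, r = precision_recall(
    {"title: Learning Perl", "author: R. Schwartz", "price: $39"},
    {"title: Learning Perl", "author: R. Schwartz", "isbn: 0596101058"},
)
```

Tightening the slot-filler length bounds typically trades recall (fewer candidates survive) for precision (surviving candidates are cleaner).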
Automatic grammar rule extraction and ranking for definitions
Learning texts contain much implicit knowledge that is best presented to the learner in a structured manner - a
typical example being definitions of terms in the text, which are ideally collected separately as a glossary for
easy access. The problem is that manual extraction of such information is tedious and time-consuming. In this
paper we describe two experiments carried out to enable the automated extraction of definitions from non-technical
learning texts using evolutionary algorithms. A genetic programming approach is used to learn grammatical rules
helpful in discriminating between definitions and non-definitions, after which, a genetic algorithm is used to learn the
relative importance of these features, thus enabling the ranking of candidate sentences in order of confidence. The
results achieved are promising, and we show that it is possible for a Genetic Program to automatically learn similar
rules derived by a human linguistic expert and for a Genetic Algorithm to then give a weighted score to those rules so
as to rank extracted definitions in order of confidence in an effective manner.
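The second stage the abstract describes - a genetic algorithm learning relative feature weights so that candidate sentences can be ranked by confidence - can be sketched as a weighted sum of binary rule features. The feature names and weights below are hypothetical, not taken from the paper; only the GA-learned weights would differ in the real system:

```python
def rank_candidates(candidates, weights):
    """Rank candidate sentences by the weighted sum of the grammatical
    rule features they fire; higher score = more definition-like."""
    def score(features):
        return sum(weights.get(f, 0.0) for f in features)
    return sorted(candidates, key=lambda c: score(c["features"]), reverse=True)

# Illustrative weights, standing in for values a GA would learn.
weights = {"has_copula": 0.6, "term_at_start": 0.9, "has_hyponym_cue": 0.4}
candidates = [
    {"sentence": "A glossary is a list of terms.",
     "features": ["has_copula", "term_at_start"]},
    {"sentence": "Glossaries are useful.",
     "features": ["has_copula"]},
]
ranked = rank_candidates(candidates, weights)
# ranked[0] is the sentence firing the higher-weighted features
```

The GA's fitness function would reward weight vectors that push true definitions toward the top of this ranking on the training corpus.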
Using dependency parsing and machine learning for factoid question answering on spoken documents
This paper presents our experiments in question answering over speech corpora, focusing on improving the answer extraction step of the QA process. We present two approaches to answer extraction in question answering for speech corpora that apply machine learning to improve the coverage and precision of the extraction. The first is a reranker that uses only lexical information; the second uses dependency parsing to compute a robust similarity between syntactic structures. Our experimental results show that the proposed learning models improve on our previous results, which used only hand-made ranking rules with little syntactic information. Moreover, these results also show that a dependency parser can be useful for speech transcripts even when trained on written text from a news collection. We evaluate the system on manual transcripts of speech from the EPPS English corpus and a set of questions transcribed from spontaneous oral questions. This data belongs to the CLEF 2009 track on QA on speech transcripts (QAst).
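The first approach above reranks candidate answers using only lexical information. A minimal sketch of that idea - reordering candidates by content-word overlap with the question - is shown below; this is a crude stand-in for the paper's actual lexical features, and the data is illustrative:

```python
import re

def lexical_rerank(question, candidates):
    """Reorder candidate answer sentences by word overlap with the
    question (a toy proxy for richer lexical reranking features)."""
    q = set(re.findall(r"\w+", question.lower()))
    return sorted(
        candidates,
        key=lambda s: len(q & set(re.findall(r"\w+", s.lower()))),
        reverse=True,
    )

ranked = lexical_rerank(
    "What is the capital of France?",
    ["It rained all week.", "Paris is the capital of France."],
)
# ranked[0] == "Paris is the capital of France."
```

The paper's second approach would replace this bag-of-words overlap with a similarity score computed over dependency-parse structures, which is more robust to word-order variation in speech transcripts.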
On improving FOIL Algorithm
FOIL is an Inductive Logic Programming algorithm
that discovers first-order rules to explain the patterns involved
in a domain of knowledge. Domains such as Information Retrieval
or Information Extraction are challenging for FOIL because of the
huge amount of information it needs to manage to devise the rules.
Current solutions to problems in these domains are restricted to
devising ad hoc, domain-dependent inductive algorithms that use
a less expressive formalism to encode rules.
We work on optimising the FOIL learning process to deal with
such complex domain problems while retaining expressiveness.
Our hypothesis is that changing the information-gain scoring
function, used by FOIL to decide how rules are learnt, can reduce
the number of steps the algorithm performs. We have analysed 15
scoring functions, normalised them into a common notation and
evaluated them in a common test in which they are computed. The learning process
is evaluated according to its efficiency, and the quality of
the rules according to their precision, recall, complexity and
specificity. The results reinforce our hypothesis, demonstrating
that replacing the information gain can optimise both the FOIL
algorithm's execution and the learnt rules.
Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-
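The baseline scoring function the abstract proposes to replace is FOIL's information gain. A textbook formulation is sketched below (the counting of variable bindings is simplified; p/n denote positive/negative bindings before and after adding a candidate literal to the clause):

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL information gain for specialising a clause with a new
    literal: (p0, n0) are positive/negative bindings covered before
    the literal, (p1, n1) after. t, the number of positive bindings
    that survive, is approximated here by p1 (a common simplification)."""
    if p1 == 0:
        return 0.0
    t = p1
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# A literal that keeps 6 of 8 positives while cutting negatives
# from 10 down to 2 scores positively:
gain = foil_gain(8, 10, 6, 2)  # ≈ 4.53
```

FOIL greedily picks the literal maximising this gain at each step; replacing the function changes which literal wins, and hence how many search steps the algorithm performs, which is exactly the trade-off the abstract investigates.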
The 2022 n2c2/UW Shared Task on Extracting Social Determinants of Health
Objective: The n2c2/UW SDOH Challenge explores the extraction of social
determinant of health (SDOH) information from clinical notes. The objectives
include the advancement of natural language processing (NLP) information
extraction techniques for SDOH and clinical information more broadly. This
paper presents the shared task, data, participating teams, performance results,
and considerations for future work.
Materials and Methods: The task used the Social History Annotated Corpus
(SHAC), which consists of clinical text with detailed event-based annotations
for SDOH events such as alcohol, drug, tobacco, employment, and living
situation. Each SDOH event is characterized through attributes related to
status, extent, and temporality. The task includes three subtasks related to
information extraction (Subtask A), generalizability (Subtask B), and learning
transfer (Subtask C). In addressing this task, participants utilized a range of
techniques, including rules, knowledge bases, n-grams, word embeddings, and
pretrained language models (LM).
Results: A total of 15 teams participated, and the top teams utilized
pretrained deep learning LMs. The top team across all subtasks used a
sequence-to-sequence approach, achieving 0.901 F1 for Subtask A, 0.774 F1 for
Subtask B, and 0.889 F1 for Subtask C.
Conclusions: As in many NLP tasks and domains, pretrained LMs yielded the
best performance, including for generalizability and learning transfer. An error
analysis indicates extraction performance varies by SDOH, with lower
performance achieved for conditions, like substance use and homelessness, that
increase health risks (risk factors) and higher performance achieved for
conditions, like substance abstinence and living with family, that reduce
health risks (protective factors).
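The F1 scores used to rank systems above are the harmonic mean of precision and recall; for reference:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, the metric used to
    rank systems in the shared task."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. a system with precision 0.92 and recall 0.88:
score = f1(0.92, 0.88)  # ≈ 0.900
```

Because the harmonic mean punishes imbalance, a system cannot reach a top-team F1 such as 0.901 by maximising precision or recall alone.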
Rules and fuzzy rules in text: concept, extraction and usage
Several concepts and techniques have been imported from disciplines such as
Machine Learning and Artificial Intelligence into the field of textual data. In this paper,
we focus on the concept of the rule and the management of uncertainty in text applications.
The different structures considered for the construction of rules, the extraction of the
knowledge base, and the applications and usage of these rules are detailed. We include a
review of the most relevant works on the different types of rules, based on their representation
and their application to the most common tasks of Information Retrieval,
such as categorization, indexing and classification.
Visualising Arabic sentiments and association rules in financial text
Text mining involves various techniques, such as text categorization, summarisation, information retrieval, document clustering, topic detection, and concept extraction. In addition, because of the difficulties involved in text mining, visualisation techniques can play a paramount role in the analysis and pre-processing of textual data. This paper presents two novel frameworks for the classification and extraction of association rules and the visualisation of financial Arabic text, in order to reveal both the general structure and the sentiment within an accumulated corpus. Mining unstructured data with natural language processing (NLP) and machine learning techniques can be arduous, especially where the Arabic language is concerned, because of the limited research in this area. The results show that our frameworks can readily classify Arabic tweets, and that they can handle many antecedent text association rules for both the positive and the negative class.
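The association rules mentioned above are conventionally scored by support and confidence. A minimal sketch, assuming tweets reduced to sets of terms (the data and rule below are illustrative, not from the paper):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional frequency of the consequent given the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Toy corpus of term sets standing in for preprocessed tweets.
tweets = [
    {"profit", "rise", "positive"},
    {"profit", "positive"},
    {"loss", "negative"},
    {"profit", "rise"},
]
# Rule {profit, rise} -> {positive}: antecedent appears in 2 of 4
# tweets, and the full itemset in 1, so confidence = 0.25 / 0.5 = 0.5.
c = confidence({"profit", "rise"}, {"positive"}, tweets)
```

A rule with high confidence for the positive class (e.g. terms that reliably co-occur with positive sentiment) is exactly the kind of antecedent-consequent pattern the frameworks visualise.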