Implementation of Information Extraction in the Book Domain Using Supervised Learning of Extraction Patterns and Rules with HTML Text Processing
ABSTRACT: As data on the internet grows rapidly, the internet is now used as a data source for many purposes. Automatic Cataloging (ACat) is an IE system that automates the book-cataloging process, taking offline HTML pages from the internet as input. Using rules built from a learning corpus with natural language tools, book information can be extracted from an HTML page. The precision and recall obtained when tagging with the learned rules depend on the maximum and minimum slot-filler length and on the removal of uncoupled tags. The method used is Supervised Learning of Extraction Patterns and Rules, in which a learning corpus must be built for the target domain, in this case the book domain. The IE system works as a tagger that marks the relevant information to be extracted.
Keywords: tagger, information extraction, supervised learning, rule, natural language, POS tagger
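The abstract reports that the learned rules' precision and recall depend on the slot-filler length bounds. As a reminder of how those two metrics are computed over extracted slot fillers (the example data below is illustrative, not from the paper):

```python
def precision_recall(extracted, gold):
    """Exact-match precision and recall for a set of extracted
    slot fillers against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    tp = len(extracted & gold)  # correct extractions
    precision = tp / len(extracted) if extracted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall

# Two of three extracted fillers are correct; two of three gold
# fillers are found, so both metrics come out to 2/3.
p, r = precision_recall(
    {"title: Learning Perl", "author: R. Schwartz", "price: $39"},
    {"title: Learning Perl", "author: R. Schwartz", "isbn: 0596101058"},
)
```

Tightening the slot-filler length bounds typically trades recall (fewer candidates survive) for precision (surviving candidates are cleaner).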
Automatic grammar rule extraction and ranking for definitions
Learning texts contain much implicit knowledge that is best presented to the learner in a structured manner - a
typical example being definitions of terms in the text, which are ideally collected separately as a glossary for
easy access. The problem is that manual extraction of such information is tedious and time-consuming. In this
paper we describe two experiments carried out to enable the automated extraction of definitions from non-technical
learning texts using evolutionary algorithms. A genetic programming approach is used to learn grammatical rules
helpful in discriminating between definitions and non-definitions, after which, a genetic algorithm is used to learn the
relative importance of these features, thus enabling the ranking of candidate sentences in order of confidence. The
results achieved are promising, and we show that it is possible for a Genetic Program to automatically learn similar
rules derived by a human linguistic expert and for a Genetic Algorithm to then give a weighted score to those rules so
as to rank extracted definitions in order of confidence in an effective manner.
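The second stage the abstract describes - a genetic algorithm learning relative feature weights so that candidate sentences can be ranked by confidence - can be sketched as a weighted sum of binary rule features. The feature names and weights below are hypothetical, not taken from the paper; only the GA-learned weights would differ in the real system:

```python
def rank_candidates(candidates, weights):
    """Rank candidate sentences by the weighted sum of the grammatical
    rule features they fire; higher score = more definition-like."""
    def score(features):
        return sum(weights.get(f, 0.0) for f in features)
    return sorted(candidates, key=lambda c: score(c["features"]), reverse=True)

# Illustrative weights, standing in for values a GA would learn.
weights = {"has_copula": 0.6, "term_at_start": 0.9, "has_hyponym_cue": 0.4}
candidates = [
    {"sentence": "A glossary is a list of terms.",
     "features": ["has_copula", "term_at_start"]},
    {"sentence": "Glossaries are useful.",
     "features": ["has_copula"]},
]
ranked = rank_candidates(candidates, weights)
# ranked[0] is the sentence firing the higher-weighted features
```

The GA's fitness function would reward weight vectors that push true definitions toward the top of this ranking on the training corpus.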
Using dependency parsing and machine learning for factoid question answering on spoken documents
This paper presents our experiments in question answering over speech corpora, focusing on improving the answer extraction step of the QA process. We present two approaches to answer extraction in question answering for speech corpora that apply machine learning to improve the coverage and precision of the extraction. The first is a reranker that uses only lexical information; the second uses dependency parsing to compute a robust similarity between syntactic structures. Our experimental results show that the proposed learning models improve on our previous results, which used only hand-made ranking rules with little syntactic information. Moreover, these results also show that a dependency parser can be useful for speech transcripts even when trained on written text from a news collection. We evaluate the system on manual transcripts of speech from the EPPS English corpus and a set of questions transcribed from spontaneous oral questions. This data belongs to the CLEF 2009 track on QA on speech transcripts (QAst).
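The first approach above reranks candidate answers using only lexical information. A minimal sketch of that idea - reordering candidates by content-word overlap with the question - is shown below; this is a crude stand-in for the paper's actual lexical features, and the data is illustrative:

```python
import re

def lexical_rerank(question, candidates):
    """Reorder candidate answer sentences by word overlap with the
    question (a toy proxy for richer lexical reranking features)."""
    q = set(re.findall(r"\w+", question.lower()))
    return sorted(
        candidates,
        key=lambda s: len(q & set(re.findall(r"\w+", s.lower()))),
        reverse=True,
    )

ranked = lexical_rerank(
    "What is the capital of France?",
    ["It rained all week.", "Paris is the capital of France."],
)
# ranked[0] == "Paris is the capital of France."
```

The paper's second approach would replace this bag-of-words overlap with a similarity score computed over dependency-parse structures, which is more robust to word-order variation in speech transcripts.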
On improving FOIL Algorithm
FOIL is an Inductive Logic Programming algorithm
that discovers first-order rules to explain the patterns involved
in a domain of knowledge. Domains such as Information Retrieval
or Information Extraction are challenging for FOIL because of the
huge amount of information it needs to manage to devise the rules.
Current solutions to problems in these domains are restricted to
devising ad hoc, domain-dependent inductive algorithms that use
a less expressive formalism to encode rules.
We work on optimising the FOIL learning process to deal with
such complex domain problems while retaining expressiveness.
Our hypothesis is that changing the information-gain scoring
function, used by FOIL to decide how rules are learnt, can reduce
the number of steps the algorithm performs. We have analysed 15
scoring functions, normalised them into a common notation and
evaluated them in a common test in which they are computed. The learning process
is evaluated according to its efficiency, and the quality of
the rules according to their precision, recall, complexity and
specificity. The results reinforce our hypothesis, demonstrating
that replacing the information gain can optimise both the FOIL
algorithm's execution and the learnt rules.
Ministerio de Educación y Ciencia TIN2007-64119; Junta de Andalucía P07-TIC-2602; Junta de Andalucía P08-TIC-4100; Ministerio de Ciencia e Innovación TIN2008-04718-
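The baseline scoring function the abstract proposes to replace is FOIL's information gain. A textbook formulation is sketched below (the counting of variable bindings is simplified; p/n denote positive/negative bindings before and after adding a candidate literal to the clause):

```python
import math

def foil_gain(p0, n0, p1, n1):
    """FOIL information gain for specialising a clause with a new
    literal: (p0, n0) are positive/negative bindings covered before
    the literal, (p1, n1) after. t, the number of positive bindings
    that survive, is approximated here by p1 (a common simplification)."""
    if p1 == 0:
        return 0.0
    t = p1
    return t * (math.log2(p1 / (p1 + n1)) - math.log2(p0 / (p0 + n0)))

# A literal that keeps 6 of 8 positives while cutting negatives
# from 10 down to 2 scores positively:
gain = foil_gain(8, 10, 6, 2)  # ≈ 4.53
```

FOIL greedily picks the literal maximising this gain at each step; replacing the function changes which literal wins, and hence how many search steps the algorithm performs, which is exactly the trade-off the abstract investigates.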
The 2022 n2c2/UW Shared Task on Extracting Social Determinants of Health
Objective: The n2c2/UW SDOH Challenge explores the extraction of social
determinant of health (SDOH) information from clinical notes. The objectives
include the advancement of natural language processing (NLP) information
extraction techniques for SDOH and clinical information more broadly. This
paper presents the shared task, data, participating teams, performance results,
and considerations for future work.
Materials and Methods: The task used the Social History Annotated Corpus
(SHAC), which consists of clinical text with detailed event-based annotations
for SDOH events such as alcohol, drug, tobacco, employment, and living
situation. Each SDOH event is characterized through attributes related to
status, extent, and temporality. The task includes three subtasks related to
information extraction (Subtask A), generalizability (Subtask B), and learning
transfer (Subtask C). In addressing this task, participants utilized a range of
techniques, including rules, knowledge bases, n-grams, word embeddings, and
pretrained language models (LM).
Results: A total of 15 teams participated, and the top teams utilized
pretrained deep learning LMs. The top team across all subtasks used a
sequence-to-sequence approach, achieving 0.901 F1 for Subtask A, 0.774 F1 for
Subtask B, and 0.889 F1 for Subtask C.
Conclusions: As in many NLP tasks and domains, pretrained LMs yielded the
best performance, including for generalizability and learning transfer. An error
analysis indicates extraction performance varies by SDOH, with lower
performance achieved for conditions, like substance use and homelessness, that
increase health risks (risk factors) and higher performance achieved for
conditions, like substance abstinence and living with family, that reduce
health risks (protective factors).
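The F1 scores used to rank systems above are the harmonic mean of precision and recall; for reference:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall, the metric used to
    rank systems in the shared task."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. a system with precision 0.92 and recall 0.88:
score = f1(0.92, 0.88)  # ≈ 0.900
```

Because the harmonic mean punishes imbalance, a system cannot reach a top-team F1 such as 0.901 by maximising precision or recall alone.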
Rules and fuzzy rules in text: concept, extraction and usage
Several concepts and techniques have been imported from disciplines such as
Machine Learning and Artificial Intelligence into the field of textual data. In this paper,
we focus on the concept of the rule and the management of uncertainty in text applications.
The different structures considered for the construction of rules, the extraction of the
knowledge base, and the applications and usage of these rules are detailed. We include a
review of the most relevant works on the different types of rules, based on their representation
and their application to the most common tasks of Information Retrieval,
such as categorization, indexing and classification.
Visualising Arabic sentiments and association rules in financial text
Text mining involves various techniques, such as text categorization, summarisation, information retrieval, document clustering, topic detection, and concept extraction. In addition, because of the difficulties involved in text mining, visualisation techniques can play a paramount role in the analysis and pre-processing of textual data. This paper presents two novel frameworks for the classification and extraction of association rules and the visualisation of financial Arabic text, in order to reveal both the general structure and the sentiment within an accumulated corpus. Mining unstructured data with natural language processing (NLP) and machine learning techniques can be arduous, especially where the Arabic language is concerned, because of the limited research in this area. The results show that our frameworks can readily classify Arabic tweets, and that they can handle many antecedent text association rules for both the positive and the negative class.
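The association rules mentioned above are conventionally scored by support and confidence. A minimal sketch, assuming tweets reduced to sets of terms (the data and rule below are illustrative, not from the paper):

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Conditional frequency of the consequent given the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

# Toy corpus of term sets standing in for preprocessed tweets.
tweets = [
    {"profit", "rise", "positive"},
    {"profit", "positive"},
    {"loss", "negative"},
    {"profit", "rise"},
]
# Rule {profit, rise} -> {positive}: antecedent appears in 2 of 4
# tweets, and the full itemset in 1, so confidence = 0.25 / 0.5 = 0.5.
c = confidence({"profit", "rise"}, {"positive"}, tweets)
```

A rule with high confidence for the positive class (e.g. terms that reliably co-occur with positive sentiment) is exactly the kind of antecedent-consequent pattern the frameworks visualise.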