A Joint Model for Definition Extraction with Syntactic Connection and Semantic Consistency
Definition Extraction (DE) is one of the well-known topics in Information
Extraction that aims to identify terms and their corresponding definitions in
unstructured texts. This task can be formalized either as a sentence
classification task (i.e., containing term-definition pairs or not) or a
sequential labeling task (i.e., identifying the boundaries of the terms and
definitions). Previous work on DE has focused on only one of the two
approaches, failing to model the inter-dependencies between the two tasks. In
this work, we propose a novel model for DE that simultaneously performs the two
tasks in a single framework to benefit from their inter-dependencies. Our model
features deep learning architectures to exploit the global structures of the
input sentences as well as the semantic consistencies between the terms and the
definitions, thereby improving the quality of the representation vectors for
DE. Besides the joint inference between sentence classification and sequential
labeling, the proposed model differs fundamentally from prior work on DE, which
has employed only the local structures of the input sentences (i.e.,
word-to-word relations) and has not yet considered the semantic consistencies
between terms and definitions. To implement these novel
ideas, our model presents a multi-task learning framework that employs graph
convolutional neural networks and predicts the dependency paths between the
terms and the definitions. We also seek to enforce the consistency between the
representations of the terms and definitions both globally (i.e., increasing
semantic consistency between the representations of the entire sentences and
the terms/definitions) and locally (i.e., promoting the similarity between the
representations of the terms and the definitions).
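
The two task formulations mentioned in this abstract can be related with a small sketch. It is not the authors' model (which uses graph convolutional networks and multi-task training); it only shows how token-level labels determine the sentence-level label, which is the inter-dependency the abstract refers to. The BIO tag names `B-Term`, `I-Term`, `B-Def`, and `I-Def`, and both function names, are assumptions made for illustration.

```python
from typing import List, Tuple


def extract_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (label, text) spans from BIO tags (sequence labeling view)."""
    spans = []
    current_label, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new span starts; close any span that is still open.
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continue the open span of the same label.
            current_tokens.append(token)
        else:
            # "O" or a stray I- tag: close the open span, treat token as outside.
            if current_label is not None:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_label is not None:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


def contains_definition(tags: List[str]) -> bool:
    """Sentence classification view: is there both a term and a definition?"""
    labels = {tag[2:] for tag in tags if tag.startswith("B-")}
    return {"Term", "Def"} <= labels


tokens = ["A", "thesaurus", "is", "a", "controlled", "vocabulary", "."]
tags = ["O", "B-Term", "O", "B-Def", "I-Def", "I-Def", "O"]
print(extract_spans(tokens, tags))
print(contains_definition(tags))
```

Because the sentence label here is a function of the token tags, a model that predicts both jointly can, in principle, use each prediction to constrain the other.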
Thesaurus-based index term extraction for agricultural documents
This paper describes a new algorithm for automatically extracting index terms from documents relating to the domain of agriculture. The domain-specific Agrovoc thesaurus developed by the FAO is used both as a controlled vocabulary and as a knowledge base for semantic matching. The automatically assigned terms are evaluated against a manually indexed 200-item sample of the FAO’s document repository, and the performance of the new algorithm is compared with a state-of-the-art system for keyphrase extraction.
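
The use of a thesaurus as a controlled vocabulary can be illustrated with a minimal sketch. This is not the algorithm described in the paper: it only matches word n-grams against a vocabulary and ranks the resulting descriptors by frequency, which is a stand-in for the paper's semantic matching. The entries in `TOY_THESAURUS` are hypothetical examples, not actual Agrovoc records, and the function name is an assumption.

```python
import re
from collections import Counter
from typing import Dict, List, Tuple

# Hypothetical toy thesaurus: surface form -> preferred descriptor.
# Non-preferred synonyms ("corn") map to the preferred term ("maize").
TOY_THESAURUS: Dict[str, str] = {
    "maize": "maize",
    "corn": "maize",
    "soil fertility": "soil fertility",
    "crop rotation": "crop rotation",
}


def candidate_index_terms(
    text: str, thesaurus: Dict[str, str], max_len: int = 3
) -> List[Tuple[str, int]]:
    """Return (descriptor, count) pairs for n-grams found in the thesaurus."""
    words = re.findall(r"[a-z]+", text.lower())
    counts: Counter = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i : i + n])
            if phrase in thesaurus:
                # Count the preferred descriptor, not the surface form.
                counts[thesaurus[phrase]] += 1
    return counts.most_common()


text = (
    "Corn yields improve with crop rotation. "
    "Maize also benefits from soil fertility management."
)
ranked = candidate_index_terms(text, TOY_THESAURUS)
print(ranked)
```

Restricting candidates to thesaurus entries guarantees that every assigned term belongs to the controlled vocabulary, at the cost of missing concepts the thesaurus does not list.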
Annotating patient clinical records with syntactic chunks and named entities: the Harvey corpus
The free text notes typed by physicians during patient consultations contain valuable information for the study of disease and treatment. These notes are difficult for existing natural language analysis tools to process since they are highly telegraphic (omitting many words), and contain many spelling mistakes, inconsistencies in punctuation, and non-standard word order. To support information extraction and classification tasks over such text, we describe a de-identified corpus of free text notes, a shallow syntactic and named entity annotation scheme for this kind of text, and an approach to training domain specialists with no linguistic background to annotate the text. Finally, we present a statistical chunking system for such clinical text with a stable learning rate and good accuracy, indicating that the manual annotation is consistent and that the annotation scheme is tractable for machine learning.